Note: This is the first in a series of documents and notebooks that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, Digital Humanities and the Search for a Method. I have published a critique of some existing methods (Dobson 2015) that takes up some of these concerns and provides theoretical background for my account of computational methods as used within the humanities. Each notebook displays code, data, results, interpretation, and critique. For each step, I attempt to explain the concepts involved, document the process (with citations of related papers), and justify the choices made.
import sys, os

import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import wordpunct_tokenize
# load local library
sys.path.append("lib")
import docsouth_utils
# Each dictionary entry in the list returned by load_narratives()
# contains the following keys:
# 'author' = author of the text (first name, last name)
# 'title'  = title of the text
# 'year'   = year published as an integer, or False if not a simple four-digit year
# 'file'   = filename of the text
# 'text'   = NLTK Text object
neh_slave_archive = docsouth_utils.load_narratives()
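For illustration, here is how an individual entry might be handled, assuming the key layout documented in the comments above. The entry below is a hypothetical stand-in, since `docsouth_utils` is a local library; a real entry's 'text' value would be an `nltk.Text` object.

```python
# Hypothetical example entry mirroring the documented keys; the token
# list stands in for the NLTK Text object a real entry would contain.
example_entry = {
    'author': "Abraham Johnstone",
    'title': "The Address of Abraham Johnstone",
    'year': 1797,
    'file': "johnstone.txt",
    'text': ["sample", "tokens"],
}

def describe(entry):
    """Return a short citation-style label for a narrative entry."""
    # 'year' may be False when no simple four-digit year was found
    year = entry['year'] if entry['year'] else "n.d."
    return f"{entry['author']}, \"{entry['title']}\" ({year})"

print(describe(example_entry))
```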
This function takes an NLTK Text object and returns a new one. It removes a number of common stopwords and drops tokens that are not purely alphabetical (numbers, stray punctuation, and so on).
# This function defines a set of preprocessing steps applied to each input text.
# It does the following:
# - removes all non-alphabetic tokens (numbers, stray punctuation, etc.)
# - converts all words to lowercase
# - removes the 127 NLTK-defined English stopwords
# - removes an additional set of custom stopwords
def preprocess(text):
    from nltk.corpus import stopwords
    pp_text = [word for word in text if word.isalpha()]
    pp_text = [word.lower() for word in pp_text]
    drop_words = stopwords.words('english')
    custom_stopwords = """like go going gone one would got still really get"""
    drop_words += custom_stopwords.split()
    pp_text = [word for word in pp_text if word not in drop_words]
    return nltk.Text(pp_text)
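The same pipeline can be sketched without NLTK, which makes the individual steps easier to inspect. This is a minimal illustration only: the tiny hard-coded stopword list below stands in for NLTK's English stopwords plus the custom additions used in preprocess() above.

```python
# Toy stopword list standing in for nltk.corpus.stopwords plus the
# custom additions; purely for illustration of the filtering steps.
TOY_STOPWORDS = {"the", "of", "and", "a", "in", "one", "would", "got"}

def preprocess_sketch(tokens):
    # keep only alphabetic tokens, lowercase them, then drop stopwords
    kept = [t.lower() for t in tokens if t.isalpha()]
    return [t for t in kept if t not in TOY_STOPWORDS]

tokens = ["The", "Address", "of", "Abraham", "Johnstone", ",", "1797"]
print(preprocess_sketch(tokens))  # -> ['address', 'abraham', 'johnstone']
```

Note that the punctuation token and the year are removed by the isalpha() test, while "The" is lowercased and then caught by the stopword filter.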
dear Sally; dear friends; Mrs. Lockwood; solemnly declare; dear brethern; earnestly exhort; heavenly father; Henry Ivens; dear brethren; earthly considerations; fellow creature; Creator face; human beings; Enoch Sharp; July 1797; United States; first place; presence God; smoak house; freely forgive; beyond doubt; John Williams; natural rights; well know; jail July
dear sally; dear friends; solemnly declare; earnestly exhort; dear brethern; henry ivens; heavenly father; dear brethren; freely forgive; earthly considerations; fellow creature; awe reverence; creator face; guinea negro; enoch sharp; presence god; abraham johnstone; human beings; may bless; beyond doubt; actuated motives; smoak house; john williams; first place; natural rights
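The two lists above read as bigram collocations, the second evidently produced after lowercasing and preprocessing. NLTK's Text.collocations() ranks candidate bigrams by an association measure (likelihood ratio by default); a rough, dependency-free sketch of the underlying idea, simplified here to ranking adjacent word pairs by raw frequency, looks like this:

```python
from collections import Counter

def top_bigrams(tokens, n=5):
    # count adjacent word pairs and return the n most frequent,
    # formatted in the semicolon-list style NLTK prints
    pairs = Counter(zip(tokens, tokens[1:]))
    return [" ".join(p) for p, _ in pairs.most_common(n)]

tokens = ("dear brethren i solemnly declare before god "
          "dear brethren i freely forgive").split()
print("; ".join(top_bigrams(tokens, 3)))
```

A real collocation measure would also discount pairs that are frequent only because each word is individually common, which is why NLTK uses likelihood-ratio or PMI-style scoring rather than raw counts.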