Text Mining the DocSouth Slave Narrative Archive


Note: This is the first in a series of documents and notebooks that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, Digital Humanities and the Search for a Method. I have published a critique of some existing methods (Dobson 2015) that takes up some of these concerns and provides theoretical background for my account of computational methods as used within the humanities. Each notebook displays code, data, results, interpretation, and critique. Throughout, I attempt to explain the individual steps, document the underlying concepts (with citations of related papers), and justify the choices made.

In [1]:
import sys, os
import nltk

# part-of-speech tagging, named-entity chunking, and tokenization helpers
from nltk import pos_tag, ne_chunk
from nltk.tokenize import wordpunct_tokenize
In [2]:
# add the local "lib" directory to the module search path
# and import the project's helper functions
sys.path.append("lib")
import docsouth_utils
In [3]:
# each entry in the list returned by load_narratives is a dictionary
# with the following keys:
#  'author' = author of the text (first name, last name)
#  'title'  = title of the text
#  'year'   = year published as an integer, or False if not a simple four-digit year
#  'file'   = filename of the text
#  'text'   = NLTK Text object

neh_slave_archive = docsouth_utils.load_narratives()
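
A quick check on this structure: the following cell (a minimal sketch; the particular narrative returned first depends on the archive's ordering) prints the metadata fields documented above, along with a token count, for the first entry.

In [ ]:
# inspect the first entry: metadata fields plus token count
first = neh_slave_archive[0]
print(first['author'], "-", first['title'], "-", first['year'])
print("tokens:", len(first['text']))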

Preprocessing Step

The following function takes an NLTK Text object and returns a new one. It lowercases every word, removes a set of common stopwords, and drops tokens that are not purely alphabetical (numbers, stray punctuation, and the like).

In [4]:
# This function defines a set of preprocessing steps to be applied to each
# input text. It does the following:
#  - removes all non-alphabetic tokens (numbers, stray punctuation, etc.)
#  - converts all words to lowercase
#  - removes NLTK's built-in English stopwords (127 words in the NLTK version used here)
#  - removes an additional set of custom stopwords
def preprocess(text):
    from nltk.corpus import stopwords
    drop_words = set(stopwords.words('english'))
    custom_stopwords = "like go going gone one would got still really get"
    drop_words.update(custom_stopwords.split())
    pp_text = [word.lower() for word in text if word.isalpha()]
    pp_text = [word for word in pp_text if word not in drop_words]
    return nltk.Text(pp_text)
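
To gauge how much this filtering removes, here is a minimal sketch (counts will vary by narrative) comparing token and vocabulary sizes for the first text before and after preprocessing.

In [ ]:
# compare token and vocabulary counts before and after preprocessing
raw = neh_slave_archive[0]['text']
cleaned = preprocess(raw)
print("tokens: %d -> %d" % (len(raw), len(cleaned)))
print("vocabulary: %d -> %d" % (len(set(raw)), len(set(cleaned))))

NLTK's collocations method prints pairs of words that occur together more often than chance would suggest; window_size=4 counts words up to four tokens apart as co-occurring, and num=25 requests the top twenty-five pairs. First, the collocations of the raw, unprocessed first narrative: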
In [5]:
neh_slave_archive[0]['text'].collocations(num=25, window_size=4)
dear Sally; dear friends; Mrs. Lockwood; solemnly declare; dear
brethern; earnestly exhort; heavenly father; Henry Ivens; dear
brethren; earthly considerations; fellow creature; Creator face; human
beings; Enoch Sharp; July 1797; United States; first place; presence
God; smoak house; freely forgive; beyond doubt; John Williams; natural
rights; well know; jail July
In [6]:
preprocess(neh_slave_archive[0]['text']).collocations(num=25, window_size=4)
dear sally; dear friends; solemnly declare; earnestly exhort; dear
brethern; henry ivens; heavenly father; dear brethren; freely forgive;
earthly considerations; fellow creature; awe reverence; creator face;
guinea negro; enoch sharp; presence god; abraham johnstone; human
beings; may bless; beyond doubt; actuated motives; smoak house; john
williams; first place; natural rights
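
Comparing the two lists shows the effects of preprocessing: case is collapsed ("dear Sally" becomes "dear sally"), "Mrs. Lockwood" and the date collocations "July 1797" and "jail July" drop out of the top twenty-five, and pairs such as "abraham johnstone" and "guinea negro" take their place.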