Text Mining DocSouth Slave Narrative Archive

by James E. Dobson (James.E.Dobson@Dartmouth.edu)


Note: This is one in a series of notebooks that document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, Digital Humanities and the Search for a Method. I have published a critique of some existing methods (Dobson 2016) that takes up some of these concerns and provides theoretical background for my account of computational methods as used within the humanities.

Revision Date and Notes:

07/31/2017: Initial version (james.e.dobson@dartmouth.edu)
01/18/2018: Updated with explanation of isalpha() test

Notes on Text Pre-Processing for the Humanities

The pre-processing of text is a language- and potentially genre-dependent activity.

After tokenizing with NLTK, testing whether a string is alphabetic is a common way of removing textual artifacts, such as punctuation, that might add “noise” to our data.

Here, for instance, we see a common way of testing for the presence of quotation marks:

In [136]: '”'.isalpha()
Out[136]: False

The following list comprehension iterates through an input list of tokens, preserving only words that consist entirely of alphabetical characters, and returns a new list for further steps in a preprocessing workflow:

alpha_words = [word for word in input_text if word.isalpha()]

This also removes numbers, for example, those used in ordinals:

In [139]: '2nd'.isalpha()
Out[139]: False
In [140]: 'second'.isalpha()
Out[140]: True

And any word that contains non-alphabetical characters will be removed with this method:

In [141]: "self-emancipated".isalpha()
Out[141]: False

We can, instead, apply the isalpha() test only to single-character tokens. This might, however, require some additional steps to strip stray punctuation from the beginnings and ends of words. The following segment of a token list demonstrates the advantages and problems of this method. We have now preserved the date tokens, the title ‘mr.’, and the last name; the simple isalpha() test would have dropped five of the following nine words:

'association', 'march', '31', '1845', '“no', 'matter', 'said', 'mr.', "o'connell"

We now want to remove the leading ‘“’ from ‘no’ without removing the period that follows ‘mr.’:

'association', 'march', '31', '1845', 'no', 'matter', 'said', 'mr.', "o'connell"
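
The following sketch implements such a filter, assuming we drop single-character tokens that are not alphabetic and strip a hand-picked set of stray punctuation marks from the ends of longer tokens. The function name and the strip set are illustrative assumptions, not part of the original workflow; the period is deliberately omitted from the strip set so that abbreviations like ‘mr.’ survive:

def selective_filter(tokens):
    filtered = []
    for word in tokens:
        # drop single-character tokens that are not alphabetic
        if len(word) == 1 and not word.isalpha():
            continue
        # strip stray quotation marks and punctuation (but not periods)
        # from both ends; internal hyphens and apostrophes are untouched
        word = word.strip('“”‘’"\',;:!?()[]')
        if word:
            filtered.append(word)
    return filtered

Applied to the nine-word segment above, this returns the desired list.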

Pre-processing is nothing if not an iterative process.

Pre-Processing of DocSouth Texts

In [1]:
# Display the count and list of stopwords in the NLTK English language stopword list
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
print("total stopwords:",len(stopwords))
print(stopwords)
total stopwords: 127
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
In [2]:
# This function defines a set of preprocessing steps applied to each input text.
# It does the following:
#  - removes all non-alphabetic tokens (numbers, stray punctuation, etc.)
#  - converts all words to lowercase
#  - removes the above 127 NLTK-defined stopwords
#  - removes an additional set of custom stopwords
def preprocess(text):
    # keep only tokens composed entirely of alphabetical characters
    pp_text = [word for word in text if word.isalpha()]
    pp_text = [word.lower() for word in pp_text]
    from nltk.corpus import stopwords
    stopwords = stopwords.words('english')
    custom_stopwords = """like go going gone one would got still really get"""
    stopwords += custom_stopwords.split()
    pp_text = [word for word in pp_text if word not in stopwords]
    return pp_text
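As a quick sanity check, here is the function applied to a short hand-made token list (the sample tokens are my own illustration and are not drawn from the archive):

sample = ['Mr.', 'Douglass', 'was', 'self-emancipated', 'in', '1838', ',',
          'escaping', 'to', 'New', 'York']
preprocess(sample)
# returns: ['douglass', 'escaping', 'new', 'york']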
In [3]:
import nltk
import os

data_dir = "na-slave-narratives/data/texts"
slave_narrative_archive = {}

# Tokenize each narrative in the archive, recording the token count
# before and after preprocessing alongside an NLTK Text object.
for file in os.listdir(data_dir):
    slave_narrative_archive[file] = {}
    path = os.path.join(data_dir, file)
    with open(path, encoding='utf-8') as fstream:
        text = fstream.read()
    tokens = nltk.word_tokenize(text)
    slave_narrative_archive[file]['raw_token_count'] = len(tokens)
    tokens = preprocess(tokens)
    slave_narrative_archive[file]['preprocessed_token_count'] = len(tokens)
    slave_narrative_archive[file]['tokens'] = nltk.Text(tokens)
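
To confirm that the archive loaded as expected, we can inspect the stored counts for a single narrative (which filename comes first depends on the contents of the data directory):

sample = next(iter(slave_narrative_archive))
print(sample,
      slave_narrative_archive[sample]['raw_token_count'],
      slave_narrative_archive[sample]['preprocessed_token_count'])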

The following Jupyter cell plots, for each text, the percentage of tokens (relative to the raw token count) removed by preprocessing.

In [4]:
%matplotlib inline  
import matplotlib.pyplot as plt
values = []
# percentage of each text's tokens removed by preprocessing
for text in slave_narrative_archive:
    values.append(int(100 - (slave_narrative_archive[text]['preprocessed_token_count'] /
                             slave_narrative_archive[text]['raw_token_count'] * 100)))
plt.plot(values)
plt.ylabel('% of tokens removed by preprocessing')
plt.show()
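
A single summary figure can complement the plot; for instance, the mean percentage removed across the archive (the exact value depends on the corpus as downloaded):

print("mean % removed:", sum(values) / len(values))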