by James E. Dobson (James.E.Dobson@Dartmouth.edu)
Note: This is one in a series of documents and notebooks that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, Digital Humanities and the Search for a Method. I have published a critique of some existing methods (Dobson 2016) that takes up some of these concerns and provides some theoretical background for my account of computational methods as used within the humanities.
07/31/2017: Initial version (james.e.dobson@dartmouth.edu)
01/18/2018: Updated with explanation of isalpha() test
After tokenizing with NLTK, testing whether a string is alphabetic is a common way of removing textual artifacts, such as punctuation, that might add “noise” to our data.
Here, for instance, we see how the test treats a token that consists of a single quotation mark:
In [136]: '”'.isalpha()
Out[136]: False
The following iterates through a source input list, keeping only the words made up entirely of alphabetical characters, and returns a new list for further steps in a preprocessing workflow:
alpha_words = [word for word in input_text if word.isalpha() ]
This also removes any token that contains digits, for example, ordinals written with numerals:
In [139]: '2nd'.isalpha()
Out[139]: False
In [140]: 'second'.isalpha()
Out[140]: True
And any word that contains non-alphabetical characters will be removed with this method:
In [141]: "self-emancipated".isalpha()
Out[141]: False
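As a quick check of the cases above, the following sketch applies the same list comprehension to a small hand-built token list. The list `sample_tokens` is made up for illustration and is not drawn from the corpus; the second comprehension simply collects what the test discards so that the losses can be inspected.

```python
# Hypothetical sample tokens, chosen to mirror the cases discussed above
sample_tokens = ['”', '2nd', 'second', 'self-emancipated', 'narrative']

# Keep only tokens made up entirely of alphabetic characters
alpha_words = [word for word in sample_tokens if word.isalpha()]

# Collect what the test discards so the losses can be inspected
dropped = [word for word in sample_tokens if not word.isalpha()]

print(alpha_words)
print(dropped)
```

Looking at the dropped tokens alongside the kept ones is one way to judge whether the filter is too aggressive for a given corpus.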
We can, instead, apply the above test only to single-character tokens. This might, however, require some additional steps to strip stray punctuation from the beginnings and endings of words. The following ordered list segment demonstrates the advantages and problems of this method. We have now preserved the dates, the title ‘Mr.’, and the last name, but a stray opening quotation mark remains attached to ‘no’. The simple isalpha() test would have dropped five of the following nine tokens:
'association', 'march', '31', '1845', '“no', 'matter', 'said', 'mr.', "o'connell"
We now want to remove the leading ‘“’ from ‘no’ without removing the period that follows ‘mr.’:
'association', 'march', '31', '1845', 'no', 'matter', 'said', 'mr.', "o'connell"
Pre-processing is nothing if not an iterative process.
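One possible way to carry out the two steps just described is sketched below: strip stray punctuation from the edges of each token, then apply isalpha() only to tokens that are a single character long. The function name `clean_tokens` and the two character sets `LEADING` and `TRAILING` are assumptions made for this illustration, not part of the original workflow. The period is deliberately left out of `TRAILING` so that abbreviations such as ‘mr.’ survive.

```python
# Characters to strip from token edges; both sets are illustrative assumptions.
# The period is deliberately left out of TRAILING so that 'mr.' survives.
LEADING = '“”‘’"\'([{'
TRAILING = '“”‘’"\')]},;:!?'

def clean_tokens(tokens):
    """Strip edge punctuation, then apply isalpha() only to one-character tokens."""
    cleaned = []
    for token in tokens:
        token = token.lstrip(LEADING).rstrip(TRAILING)
        if not token:
            continue  # the token was nothing but punctuation
        if len(token) == 1 and not token.isalpha():
            continue  # drop single stray characters such as '-' or '3'
        cleaned.append(token)
    return cleaned

words = ['association', 'march', '31', '1845', '“no', 'matter',
         'said', 'mr.', "o'connell"]
print(clean_tokens(words))
```

The trade-offs remain visible in this sketch: because periods are never stripped, a sentence-final period attached to an ordinary word would also be kept, and stripping a trailing apostrophe would alter plural possessives. Each corpus may call for a different choice of edge characters.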
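# Display the count and list of stopwords in the NLTK English-language stopword list
from nltk.corpus import stopwords
# use a separate name for the list so the imported corpus reader is not overwritten
english_stopwords = stopwords.words('english')
print("total stopwords:", len(english_stopwords))
print(english_stopwords)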
# This function defines a set of steps to be applied to each input text.
# It does the following:
# - removes all tokens that are not entirely alphabetic (numbers, stray punctuation, etc.)
# - converts all words to lowercase
# - removes the 127 NLTK-defined stopwords listed above
# - removes an additional set of custom stopwords
from nltk.corpus import stopwords as nltk_stopwords

def preprocess(text):
    # keep only tokens made up entirely of alphabetic characters
    pp_text = [word for word in text if word.isalpha()]
    pp_text = [word.lower() for word in pp_text]
    # combine the NLTK list with the custom additions; a set makes lookups fast
    custom_stopwords = """like go going gone one would got still really get"""
    stop_set = set(nltk_stopwords.words('english')) | set(custom_stopwords.split())
    pp_text = [word for word in pp_text if word not in stop_set]
    return pp_text
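To see the effect of these steps on a small input without loading the NLTK stopword corpus, the following sketch repeats the same three steps with the stopwords supplied by the caller. The function `preprocess_with`, the tiny set `demo_stopwords`, and the sample tokens are all hypothetical stand-ins made up for illustration; an actual run would use the NLTK English list together with the custom additions.

```python
def preprocess_with(tokens, stopword_set):
    """Apply the same steps as preprocess(), with the stopwords supplied by the caller."""
    alpha = [word for word in tokens if word.isalpha()]
    lowered = [word.lower() for word in alpha]
    return [word for word in lowered if word not in stopword_set]

# A tiny stand-in stopword set (an assumption for illustration, not the NLTK list)
demo_stopwords = {'i', 'to', 'he'}
# The custom additions used in the function above
demo_stopwords |= set("like go going gone one would got still really get".split())

# Made-up tokens, as a tokenizer might return them
tokens = ['I', 'would', 'Still', 'like', 'to', 'go', 'North', ',', 'he', 'said', '.']
print(preprocess_with(tokens, demo_stopwords))
```

Note that lowercasing happens before the stopword comparison, which is why ‘Still’ and ‘I’ are caught by lowercase stopword entries.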
import nltk
import os

data_dir = "na-slave-narratives/data/texts"
slave_narrative_archive = {}

# Tokenize and preprocess each text, recording the token counts before and after
for file in os.listdir(data_dir):
    slave_narrative_archive[file] = {}
    path = os.path.join(data_dir, file)
    # the with statement closes the file once it has been read
    with open(path, encoding='utf-8') as fstream:
        text = fstream.read()
    tokens = nltk.word_tokenize(text)
    slave_narrative_archive[file]['raw_token_count'] = len(tokens)
    tokens = preprocess(tokens)
    slave_narrative_archive[file]['preprocessed_token_count'] = len(tokens)
    slave_narrative_archive[file]['tokens'] = nltk.Text(tokens)
The following Jupyter cell shows the percentage of each text (in terms of total word count) available after preprocessing.
%matplotlib inline
import matplotlib.pyplot as plt

# Percentage of each text's raw tokens that remain after preprocessing
values = []
for text in slave_narrative_archive:
    counts = slave_narrative_archive[text]
    values.append(int(counts['preprocessed_token_count'] /
                      counts['raw_token_count'] * 100))
plt.plot(values)
plt.ylabel('% Text Post-Preprocessing')
plt.show()
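The arithmetic behind such a plot can be stated in two complementary ways: the share of tokens that remain after preprocessing, and the share that was filtered out. The following minimal sketch shows both, using hypothetical counts for a single text; the function names `percent_retained` and `percent_removed` are introduced here only for illustration.

```python
def percent_retained(raw_count, preprocessed_count):
    """Share of the raw tokens that survive preprocessing, as a percentage."""
    return preprocessed_count / raw_count * 100

def percent_removed(raw_count, preprocessed_count):
    """Complement of percent_retained: the share of tokens filtered out."""
    return 100 - percent_retained(raw_count, preprocessed_count)

# Hypothetical counts for a single text
raw, kept = 2000, 500
print(percent_retained(raw, kept))  # 25.0
print(percent_removed(raw, kept))   # 75.0
```

Being explicit about which of the two quantities is plotted matters when reading the y-axis, since the two always sum to 100.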