Note: This is the first in a series of documents and notebooks that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, Digital Humanities and the Search for a Method. I have published a critique of some existing methods (Dobson 2015) that takes up some of these concerns and provides some theoretical background for my account of computational methods as used within the humanities. Each notebook displays code, data, results, interpretation, and critique. I attempt to explain each individual step as fully as possible, to document the concepts involved (with citations of related papers), and to justify the choices made.
A simple check to see if the dates in the table of contents ("toc.csv") for the DocSouth "North American Slave Narratives" can be converted to an integer (date as year) is used to assign one of these two classes:
These period categories are rough and by no means perfect. Publication year may have little relation to the content of the text, the source for the vectorizing process and eventual categorization. These dates are what Matthew Jockers calls, within the digital humanities context, catalog metadata (Jockers 2013, 35-62). Recently, critics have challenged such divisions (Marrs 2015), which are central to the understanding of the field of nineteenth-century American literary studies, with concepts like "transbellum" that might help us to better understand works that address the Civil War and its attendant anxieties through the "long nineteenth century." The majority of the texts included in the DocSouth archive are first-person autobiographical narratives of lives lived during the antebellum and Civil War years and published in the years leading up to, including, and after the war.
|unknown or ambiguous||40|
There are 252 texts with four-digit years and forty texts with ambiguous or unknown publication dates. This script will attempt to classify these forty texts into one of the two periods following the "fitting" of the labeled training texts. I split the 252 texts with known and certain publication dates into two groups: a training set and a testing set. After "fitting" the training set and establishing the neighbors, the code attempts to categorize the testing set. Many questions can and should be asked about the creation of the training set and the labeling of the data. This labeling practice introduces many subjective decisions into what is perceived as an objective (machine and algorithmically generated) process (Dobson 2015; Gillespie 2016).
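The integer check and the two-period labeling described above can be sketched as follows. This is a minimal illustration and not the code in docsouth_utils: the function names parse_year and period_for are hypothetical, and the rule that only a bare four-digit string counts as a usable year is an assumption based on the description of the 'year' field later in this notebook.

```python
import re


def parse_year(raw):
    """Return the year as an int if `raw` is a bare four-digit year, else False.

    Hypothetical helper: the real check lives in docsouth_utils and may differ.
    """
    raw = raw.strip()
    if re.fullmatch(r"\d{4}", raw):
        return int(raw)
    return False


def period_for(year):
    """Map a parsed year to a period label, or None if no label applies.

    The cutoffs mirror the comparisons used in this notebook: strictly
    before 1865 is "antebellum", strictly after 1865 is "postbellum".
    A year of exactly 1865 matches neither comparison.
    """
    if year is False or year == 1865:
        return None
    return "antebellum" if year < 1865 else "postbellum"


# A bracketed range such as the catalog entry for the Carolina Twin text
# cannot be reduced to one integer, so it falls into the unknown group.
for raw in ["1845", "1901", "[between 1902 and 1912]"]:
    year = parse_year(raw)
    print(raw, "->", year, "->", period_for(year))
```

Any text for which the sketch returns None would be left for the classifier to place, which is what happens to the ambiguously dated texts at the end of this notebook.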
The training set (the first 177 texts, preserving the order in "toc.csv") over-represents the antebellum period, which may account for the ability of the classifier to make good predictions for this class.
The "testing" dataset is used to validate the classifier. This dataset contains seventy-five texts with known year of publication. This dataset, like the training dataset, overrepresents the antebellum period.
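The class imbalance mentioned above can be quantified from the counts this notebook prints for each dataset. The sketch below is only arithmetic on those reported counts; the function name class_shares and the toy label list used to illustrate the order-preserving split are assumptions made for the example.

```python
from collections import Counter


def class_shares(counts):
    """Convert a mapping of class -> count into class -> share of the total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Counts as printed by this notebook for the two datasets.
reported = {
    "training": {"antebellum": 96, "postbellum": 81},
    "test": {"antebellum": 47, "postbellum": 28},
}

for dataset, counts in reported.items():
    shares = class_shares(counts)
    print(dataset, {name: round(share, 3) for name, share in shares.items()})

# Toy illustration (hypothetical labels) of an order-preserving split:
# the last `test_size` items become the test set, so the class mix of
# each set depends entirely on the order of the catalog.
toy_labels = ["antebellum"] * 6 + ["postbellum"] * 2 + ["antebellum"] * 3 + ["postbellum"]
test_size = 4
toy_train, toy_test = toy_labels[:-test_size], toy_labels[-test_size:]
print(Counter(toy_train), Counter(toy_test))
```

Because the split follows catalog order rather than random sampling, the proportions in each set are a product of how "toc.csv" happens to be arranged.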
The texts are all used/imported as found in the zip file provided by the DocSouth "North American Slave Narratives" collection. The texts have been encoded in a combination of UTF-8 Unicode and ASCII. Scikit-learn's HashingVectorizer performs some additional pre-processing and that will be examined in the sections below.
The kNN algorithm is a non-parametric algorithm, meaning that it does not require detailed knowledge of the input data and its distribution (Cover and Hart 1967). This algorithm is known to be reliable and it is quite simple to implement and understand, especially when compared to some of the more complex machine learning algorithms used at present. It was originally conceived of as a response to what is called a "discrimination problem": the categorization of a large number of input points into discrete "boxes." Data are eventually organized into categories; in the case of this script, the two categories of antebellum and postbellum.
The algorithm functions in space: it positions each input text as a "neighbor" and has each text "vote" for membership into parcellated neighborhoods. Cover and Hart explain: "If the number of samples is large it makes good sense to use, instead of the single nearest neighbor, the majority vote of the nearest k neighbors" (22). The following code uses the value of "13" for the number of neighbors or the 'k' of kNN.
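The majority vote that Cover and Hart describe can be sketched in a few lines of plain Python. This is an illustration of the idea and not scikit-learn's implementation; the function name knn_predict and the two-dimensional toy points are assumptions made for the example. Euclidean distance and unweighted votes are used because they match the classifier settings reported in this notebook (metric='minkowski' with p=2, weights='uniform').

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Keep the k closest points and count one vote per point.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional "texts"; real feature vectors are far larger.
train_vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0)]
train_labels = ["antebellum", "antebellum", "antebellum", "postbellum", "postbellum"]

query = (0.8, 0.9)
print(knn_predict(train_vectors, train_labels, query, k=3))
print(knn_predict(train_vectors, train_labels, query, k=5))
```

With k=3 the two nearby postbellum points outvote the single antebellum point; with k=5 every training point votes and the larger class wins. The choice of k and the balance of the classes therefore interact.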
The kNN algorithm may give better results for smaller numbers of classes. The performance of this particular implementation of kNN and the feature extraction algorithm (HashingVectorizer) was better with just the antebellum and postbellum classes. Alternative boundaries for the classes (year markers) might also improve results.
While it is non-parametric, the kNN algorithm does require a set of features in order to categorize the input data, the texts. This script operates according to the "bag of words" method in which each text is treated not as a narrative but as a collection of unordered and otherwise undifferentiated words. This means that multiple-word phrases (aka ngrams) are ignored and much meaning will be removed from the comparative method because of a loss of context.
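The loss of context described above is easy to demonstrate. In the following sketch, two invented sentences with opposite meanings produce identical bags of words, while their two-word phrases (bigrams) differ. The function names and the sentences are assumptions made for the example; splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercased, whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())


def bigrams(text):
    """Return the ordered two-word phrases of a text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


a = "freedom came after the war"
b = "the war came after freedom"

# The bags are equal even though the sentences say opposite things.
print(bag_of_words(a) == bag_of_words(b))
# The bigrams preserve enough order to tell the sentences apart.
print(bigrams(a) == bigrams(b))
```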
In order to select the features by which a text can be compared to another, we need some sort of method that can produce numerical data. I have selected the HashingVectorizer, which is a fast method that tokenizes each file and maps every word/token to a column index by way of a hash function. This returns a SciPy compressed sparse row (CSR) matrix that scikit-learn will use in the creation of the neighborhood "map."
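As configured below (stop_words='english'), the HashingVectorizer removes a standard set of 318 English-language stop words. By default it does not alter or remove any accents or accented characters in the encoded (UTF-8) format. It also converts all words to lowercase, potentially introducing false positives: a capitalized proper noun and a common word with the same spelling become a single feature.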
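The effect of lowercasing and stop-word removal can be seen in a small sketch. The stop list here is a hypothetical handful of words rather than scikit-learn's full list, the sentences are invented, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
# Hypothetical, deliberately tiny stop list (the real list is much longer).
STOP_WORDS = {"the", "a", "of", "and", "was", "to"}


def preprocess(text):
    """Lowercase a text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


surname = preprocess("Brown was a friend of the family")
adjective = preprocess("The brown coat was new")

print(surname)
print(adjective)
# After lowercasing, the surname and the color are the same token.
print(set(surname) & set(adjective))
```

Two texts can thus appear to share vocabulary when one names a person and the other describes a color, which is the kind of false positive noted above.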
Issues with HashingVectorizer

This vectorizer works well, but it limits the questions we can ask after it has been run. We cannot, for example, interrogate why a certain text might have been misclassified by examining the words/tokens returned by the vectorizer. This is because the HashingVectorizer returns only indices to features and does not keep the string representation of specific words.
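Why the words cannot be recovered becomes clearer with a toy version of the hashing trick. The sketch below is not scikit-learn's code: it uses hashlib's MD5 for convenience and a deliberately tiny feature space, whereas scikit-learn uses a different hash function and a far larger number of features. The names feature_index and hashed_counts are assumptions made for the example.

```python
import hashlib

N_FEATURES = 16  # tiny on purpose, so that collisions are easy to imagine


def feature_index(token, n_features=N_FEATURES):
    """Map a token to a column index; the token itself is not stored anywhere."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hashed_counts(tokens, n_features=N_FEATURES):
    """Build a fixed-length count vector from tokens using only their indices."""
    vector = [0] * n_features
    for token in tokens:
        vector[feature_index(token, n_features)] += 1
    return vector


tokens = ["freedom", "war", "freedom", "narrative"]
vector = hashed_counts(tokens)
print(vector)
# The vector records how many tokens landed in each column, but nothing in
# it says which words those were, and distinct words may share a column.
```

A vectorizer that stores its vocabulary, such as scikit-learn's CountVectorizer, would permit that kind of inspection, at the cost of holding the whole vocabulary in memory.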
# load required packages
import sys, os
import re
import operator
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import wordpunct_tokenize
# load local library
sys.path.append("lib")
import docsouth_utils
# each dictionary entry in the 'list' object returned by load_narratives
# contains the following keys:
#  'author' = Author of the text (first name, last name)
#  'title' = Title of the text
#  'year' = Year published as integer or False if not simple four-digit year
#  'file' = Filename of text
#  'text' = NLTK Text object
neh_slave_archive = docsouth_utils.load_narratives()
#
# establish two simple classes for kNN classification
# the "year" field has already been converted to an integer
# all texts published before 1865, we'll call "antebellum"
# "postbellum" for those after.
# (a text dated exactly 1865 matches neither test and is left out)
#
period_classes = list()
for entry in neh_slave_archive:
    file = entry['file']
    if entry['year'] != False and entry['year'] < 1865:
        period_classes.append([file, "antebellum"])
    if entry['year'] != False and entry['year'] > 1865:
        period_classes.append([file, "postbellum"])

# create labels as a list (second item of each [file, period] pair)
labels = [i[1] for i in period_classes]

# create list of filenames (first item of each [file, period] pair)
files = [i[0] for i in period_classes]

#
# create training and test datasets
# leave out seventy-five files, the last seventy-five with integer dates
# from the toc, for testing
#
test_size = 75
train_labels = labels[:-test_size]
train_files = files[:-test_size]

# the last set of texts (test_size) are the "test" dataset (for validation)
test_labels = labels[-test_size:]
test_files = files[-test_size:]
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# initialize the vectorizer
# (non_negative=True was removed in scikit-learn 0.21; newer releases
# expect alternate_sign=False in its place)
vectorizer = HashingVectorizer(stop_words='english', input='filename', non_negative=True)

training_data = vectorizer.transform(train_files)
test_data = vectorizer.transform(test_files)
# display sizes
print("training data:")
for period in ['postbellum', 'antebellum']:
    print(" ",period,":",train_labels.count(period))

print("test data:")
for period in ['postbellum', 'antebellum']:
    print(" ",period,":",test_labels.count(period))
training data:
  postbellum : 81
  antebellum : 96
test data:
  postbellum : 28
  antebellum : 47
# run kNN and fit training data
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(training_data,train_labels)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=13, p=2, weights='uniform')
# Predict results from the test data and check accuracy
pred = knn.predict(test_data)
score = metrics.accuracy_score(test_labels, pred)
print("accuracy: %0.3f" % score)
print(metrics.classification_report(test_labels, pred))
print("confusion matrix:")
print(metrics.confusion_matrix(test_labels, pred))
accuracy: 0.813
             precision    recall  f1-score   support

 antebellum       0.81      0.91      0.86        47
 postbellum       0.82      0.64      0.72        28

avg / total       0.81      0.81      0.81        75

confusion matrix:
[[43  4]
 [10 18]]
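The figures in the classification report can be recomputed by hand from the confusion matrix, which makes the asymmetry between the two classes easier to see. The sketch below uses only the four numbers printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with the classes in alphabetical order; the variable names are my own.

```python
labels = ["antebellum", "postbellum"]
matrix = [[43, 4],    # true antebellum: 43 predicted antebellum, 4 postbellum
          [10, 18]]   # true postbellum: 10 predicted antebellum, 18 postbellum

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    true_positives = matrix[i][i]
    predicted_as_class = sum(row[i] for row in matrix)  # column total
    actually_in_class = sum(matrix[i])                  # row total (support)
    precision = true_positives / predicted_as_class
    recall = true_positives / actually_in_class
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = {"precision": precision, "recall": recall, "f1": f1,
                    "support": actually_in_class}

print("accuracy: %0.3f" % accuracy)
for name, scores in report.items():
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

Ten of the twenty-eight postbellum test texts were assigned to the antebellum class, which is why postbellum recall (18/28) is so much lower than antebellum recall (43/47).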
The following cell takes every input file that was left uncategorized because its date in toc.csv is ambiguous or missing, vectorizes it with the same HashingVectorizer (and the exact same parameters) used for the training set, and asks the trained classifier to predict its period.
|neh-carolinatwin-carolinatwin.txt||[between 1902 and 1912]|
# predict class or period membership for all texts without
# four digit years
for entry in neh_slave_archive:
    file = entry['file']
    if entry['year'] == False:
        print(entry['author'],", ",entry['title'])
        print("   ",knn.predict(vectorizer.transform([entry['file']])))
William S. White , The African Preacher. An Authentic Narrative
    ['antebellum']
Henry Parker , Autobiography of Henry Parker
    ['antebellum']
Thomas W. Henry , Autobiography of Rev. Thomas W. Henry, of the A. M. E. Church
    ['postbellum']
Booker T. Washington , An Autobiography: The Story of My Life and Work
    ['postbellum']
No Author , Biographical Sketch of Millie Christine, the Carolina Twin, Surnamed the Two-Headed Nightingale and the Eighth Wonder of the World
    ['antebellum']
Josephine Brown , Biography of an American Bondman, by His Daughter
    ['antebellum']
Pomp , Dying Confession of Pomp, A Negro Man, Who Was Executed at Ipswich, on the 6th August, 1795, for Murdering Capt. Charles Furbush, of Andover, Taken from the Mouth of the Prisoner, and Penned by Jonathan Plummer, Jun.
    ['antebellum']
Thomas H. Jones , Experience and Personal Narrative of Uncle Tom Jones; Who Was for Forty Years a Slave. Also the Surprising Adventures of Wild Tom, of the Island Retreat, a Fugitive Negro from South Carolina
    ['antebellum']
William Parker , The Freedman's Story: In Two Parts
    ['antebellum']
Lucy A. Delaney , From the Darkness Cometh the Light or Struggles for Freedom
    ['antebellum']
M. L. Latta , The History of My Life and Work. Autobiography by Rev. M. L. Latta, A.M., D.D.
    ['postbellum']
Millie-Christine , The History of the Carolina Twins: Told in "Their Own Peculiar Way" By "One of Them"
    ['antebellum']
William Mack Lee , History of the Life of Rev. Wm. Mack Lee: Body Servant of General Robert E. Lee Through the Civil War: Cook from 1861 to 1865
    ['postbellum']
Selim Aga , Incidents Connected with the Life of Selim Aga, a Native of Central Africa
    ['antebellum']
Harriet A. Jacobs , Incidents in the Life of a Slave Girl. Written by Herself
    ['postbellum']
Thomas Anderson , Interesting Account of Thomas Anderson, a Slave, Taken from His Own Lips. Ed. J. P. Clark
    ['antebellum']
Olaudah Equiano , The Interesting Narrative of the Life of Olaudah Equiano, or Gustavus Vassa, the African. Written by Himself. Vol. I.
    ['antebellum']
Olaudah Equiano , The Interesting Narrative of the Life of Olaudah Equiano, or Gustavus Vassa, the African. Written by Himself. Vol. II.
    ['antebellum']
William E. Hatcher , John Jasper: The Unmatched Negro Philosopher and Preacher
    ['postbellum']
Arthur , The Life, and Dying Speech of Arthur, a Negro Man; Who Was Executed at Worcester, October 10, 1768. For a Rape Committed on the Body of One Deborah Metcalfe
    ['antebellum']
John Jea , The Life, History, and Unparalleled Sufferings of John Jea, the African Preacher. Compiled and Written by Himself
    ['antebellum']
Stephen Smith , Life, Last Words and Dying Speech of Stephen Smith, a Black Man, Who Was Executed at Boston This Day Being Thursday, October 12, 1797 for Burglary
    ['antebellum']
Alexander Walters , My Life and Work
    ['postbellum']
Andrew Jackson , Narrative and Writings of Andrew Jackson, of Kentucky; Containing an Account of His Birth, and Twenty-Six Years of His Life While a Slave; His Escape; Five Years of Freedom, Together with Anecdotes Relating to Slavery; Journal of One Year's Travels; Sketches, etc. Narrated by Himself; Written by a Friend
    ['antebellum']
James Williams , A Narrative of Events Since the First of August, 1834, By James Williams, an Apprenticed Labourer in Jamaica
    ['postbellum']
James Curry , Narrative of James Curry, A Fugitive Slave
    ['antebellum']
T. C. Upham , Narrative of Phebe Ann Jacobs
    ['postbellum']
W. Mallory , Old Plantation Days
    ['antebellum']
Walter L. Fleming , "Pap" Singleton, The Moses of the Colored Exodus
    ['postbellum']
No Author , Recollections of Slavery by a Runaway Slave
    ['antebellum']
Charles Stuart , Reuben Maddison: A True Story
    ['antebellum']
George F. Bragg , Richard Allen and Absalom Jones, by the Rev. George F. Bragg, in Honor of the Centennial of the African Methodist Episcopal Church, Which Occurs in the Year 1916
    ['postbellum']
No Author , The Royal African: or, Memoirs of the Young Prince of Annamaboe. Comprehending a Distinct Account of His Country and Family; His Elder Brother's Voyage to France, and Reception there; the Manner in Which Himself Was Confided by His Father to the Captain Who Sold Him; His Condition While a Slave in Barbadoes; the True Cause of His Bring Redeemed; His Voyage from Thence; and Reception Here in England. Interspers'd Throughout with Several Historical Remarks on the Commerce of the European Nations, Whose Subjects Frequent the Coast of Guinea. To which is Prefixed a Letter from the Author to a Person of Distinction, in Reference to Some Natural Curiosities in Africa; as Well as Explaining the Motives which Induced Him to Compose These Memoirs.
    ['antebellum']
No Author , A Sketch of Henry Franklin and Family.
    ['postbellum']
Lewis Charlton , Sketch of the Life of Mr. Lewis Charlton, and Reminiscences of Slavery
    ['antebellum']
Mark Twain , A True Story, Repeated Word for Word As I Heard It. From The Atlantic Monthly. Nov. 1874: 591-594
    ['postbellum']
Emma J. Ray , Twice Sold, Twice Ransomed: Autobiography of Mr. and Mrs. L. P. Ray
    ['postbellum']
Gustavus L. Foster , Uncle Johnson, the Pilgrim of Six Score Years
    ['postbellum']
Booker T. Washington , Up from Slavery: An Autobiography
    ['postbellum']
Thomas William Burton , What Experience Has Taught Me: An Autobiography of Thomas William Burton
    ['postbellum']
Cover, T. M., and P. E. Hart. 1967. "Nearest Neighbor Pattern Classification." IEEE Transactions on Information Theory 13, no. 1: 21-27.
Dobson, James E. 2015. "Can an Algorithm be Disturbed? Machine Learning, Intrinsic Criticism, and the Digital Humanities." College Literature 42, no. 4: 543-564.
Gillespie, Tarleton. 2016. “Algorithm.” In Digital Keywords: A Vocabulary of Information Society and Culture. Edited by Benjamin Peters. Princeton: Princeton University Press.
Jockers, Matthew. 2013. Macroanalysis: Digital Methods & Literary History. Urbana: University of Illinois Press.
Marrs, Cody. 2015. Nineteenth-Century American Literature and the Long Civil War. New York: Cambridge University Press.