This collection of Jupyter notebooks archives my code and embedded critiques, which evaluate the possibilities of various contemporary text/data mining and machine learning algorithms for humanities research. I am sharing these in order to provoke some conversation and to enable others to perform a variety of analyses on their own text archives. These notebooks and my commentary also form the "archive" of my book-in-progress Digital Humanities and the Search for a Method. Please feel free to contact me if you find these useful or have any questions!
James E. "Jed" Dobson
Department of English
Dartmouth College
Email: James.E.Dobson@Dartmouth.EDU
Twitter: @jeddobson
Web: http://www.dartmouth.edu/~jed
The UNC DocSouth North American Slave Narratives collection is a model digital archive for humanities research in that it is both typical and rich. The archive is available as a single zip file from the UNC Libraries website. It contains 294 slave narratives, each included as plain (UTF-8 / ASCII) text and as XML with additional metadata, along with rich metadata (“toc.csv”) in an easy-to-read format. Because the archive is organized around a specific topic rather than a single historical moment, it is an ideal test object for the evaluation of computational methods for text mining in the humanities. Since nineteenth- and early twentieth-century American literature is my subject expertise, I approach this database with enough knowledge of the themes and concerns of the included texts to judge whether the output of the following tools is potentially useful.
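As a starting point for working with the archive described above, here is a minimal sketch of reading the “toc.csv” metadata straight out of the zip file. The function name `read_toc` is hypothetical, and the sketch assumes only that the zip contains a member whose name ends in `toc.csv` and that the file has a header row; the folder layout inside the zip and the column names are not documented here, so the code does not rely on them.

```python
import csv
import io
import zipfile


def read_toc(archive_path, toc_suffix="toc.csv"):
    """Return the archive's metadata rows as a list of dicts.

    archive_path may be a filename or a binary file-like object.
    The metadata file is located by suffix because its exact path
    inside the zip is an assumption, not a documented fact.
    """
    with zipfile.ZipFile(archive_path) as archive:
        matches = [name for name in archive.namelist()
                   if name.endswith(toc_suffix)]
        if not matches:
            raise FileNotFoundError(f"no member ending in {toc_suffix!r}")
        with archive.open(matches[0]) as raw:
            # Decode the bytes as UTF-8 and let the header row name the fields.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Calling `read_toc` with the path to the downloaded zip returns one dictionary per row of the metadata file; inspecting the keys of the first row is a quick way to see which fields the metadata actually provides before building anything on top of them.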
Humanities research is a self-critical enterprise and as such must be opposed to the notion that there are areas in which interpretive questions cannot be asked. At the project level, we can see the problem by thinking through the implications of what could be called the Fordist model of academic research. Collaborative research projects with well-defined and instrumentalized roles for each participant (such projects are rather rare in the humanities at large, outside of the digital humanities) partition knowledge in such a way that scrutiny cannot be equally applied to all components of the project. Within the scope of an individual “investigation” of a digital text with computational tools, we find an example of this same problem in the use of an application, or an algorithm within an application, that functions as what is frequently called a “black box”: data goes in and comes out without any understanding of what happened during the transformation. These components or atomized steps in a research method are black boxes because of the complexity of the algorithm or, increasingly with the class of algorithms known as machine learning, because we cannot fully account for why the algorithm made certain decisions. Despite this uncertainty, questions about the assumptions and operation of these so-called black boxes remain pressing and answerable. All of this is to say that no computational method used in the humanities can contain components that are not subject to humanistic modes of questioning. This means that, on one level, there are no algorithms or codes that are free from subjectivity and, at a higher level, in the case of machine learning algorithms, that any decisions or distinctions produced by an algorithm from prior decisions cannot be isolated from critique and left unquestioned.
Humanities methods must be opposed to the bracketing of any questions that could be applied to the operations and procedures used within these methods.
I use only free and open-source software so that everything shown here can easily be replicated, reproduced, modified, and improved. Python and Jupyter also remove the majority of the so-called "black boxes" invoked above. This does not mean that every operation will be transparent. Abstraction, modularization, and high-level tools make it difficult to peer down as far as one would like (although every attempt will be made to explain default values, hidden input, and lower-level operations), and some tools, such as scikit-learn's HashingVectorizer, which does not store a vocabulary mapping features back to words, make it impossible to explore and tinker with the results. That said, these notebooks contain executable code, the output of that code, and commentary. Commentary, critique, and citations are found in these "markdown" text boxes as well as in comments within the code itself. I call these Jupyter files "critical notebooks" to gesture to the way in which I hope to both frame and loosen every digital transformation, input text object, and derived output object with some critical resources.
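The remark above about scikit-learn's HashingVectorizer can be made concrete with a small pure-Python sketch of the general "hashing trick." This is not scikit-learn's implementation: the function names, the use of SHA-256, and the tiny bucket count are assumptions chosen for illustration. The point is only the contrast between a representation that keeps its words as labels and one that discards them.

```python
import hashlib
from collections import Counter


def count_features(tokens):
    """Vocabulary-based counts: every feature keeps its word as a label."""
    return dict(Counter(tokens))


def hash_features(tokens, n_buckets=8):
    """Hashed counts: each word becomes a bucket index and is then discarded.

    Different words can land in the same bucket, and nothing in the
    returned vector records which words produced which counts.
    """
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector
```

Given the same list of tokens, `count_features` returns a dictionary that can be read and questioned word by word, while `hash_features` returns only a list of numbers. The second form is compact and needs no stored vocabulary, but a reader cannot work backward from a bucket to the words that filled it, which is the sense in which such a tool resists exploration.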