This notebook demonstrates how to use HTRC extracted features to build a word2vec model using Doc2Vec.
Jed Dobson
James.E.Dobson@Dartmouth.EDU
https://jeddobson.github.io
October 2020
This file is part of the htrc-vector-project, begun in June 2020 with Catherine Parnell.
Repository:
https://github.com/jeddobson/htrc-vector-project
The HTRC Extracted Features dataset includes over 17 million volumes, including works still protected by copyright. These features can be distributed because they are intended for non-consumptive use: they are designed to be computer rather than human readable. This means that you can model large numbers of texts, including texts from the twentieth and twenty-first centuries. Document- and page-based features are available, and the HTRC feature-reader package for Python enables easy access to them. The features include tokens and their counts on a page-by-page basis; word order is lost, so a page is no longer human readable. This format limits their use, however, because many popular applications of text mining and machine learning expect preserved word order. The popular skip-gram model used by word2vec, for example, learns by predicting words within a window surrounding a target word.
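As a quick illustration of what these features look like, here is a minimal sketch using the feature-reader package and the same calls this notebook relies on below (the volume ID is one of the Morrison novels listed later; the features are downloaded on demand):

```python
from htrc_features import FeatureReader

# one of the HathiTrust volume IDs used later in this notebook
fr = FeatureReader(["uc1.32106018657251"])
vol = next(fr.volumes())

# page-level token counts; word order is not preserved
tl = vol.tokenlist(pos=False, case=False)
print(tl.head())
```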
This notebook instead uses Doc2Vec, which can train on words appearing within a much larger window. Doc2Vec has been used to produce vectors from paragraph-length text sources; these vectors are then used for classification and other applications.
This example notebook uses a set of Toni Morrison novels found in the HathiTrust archive. This is a much smaller dataset than we've used in other experiments; vector models like word2vec generally require larger numbers of text sources (we've trained on 30GB archives).
Text Sources
Execution
In a real run, you'll want to separate the several phases of this workflow. You can pre-download the features and load them locally (this script will download them as needed). To manage run time, you may also want to process each text individually. We produced a CSV file with one line for each page of the text; these files were then concatenated (and compressed with bzip2) and converted into a TaggedDocument before running Doc2Vec.
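A rough sketch of that staged workflow, assuming the get_pages() function defined below; the helper names and file names here are placeholders (one CSV per volume, one row per page, read back from a concatenated bzip2-compressed file):

```python
import bz2
import csv

def write_pages_csv(htid, filename):
    """Write one row per page for a single volume, using get_pages() defined below."""
    with open(filename, "w", newline="") as fp:
        writer = csv.writer(fp)
        for page in get_pages(htid):
            writer.writerow([" ".join(page)])

def read_pages_bz2(filename):
    """Read pages back from a concatenated, bzip2-compressed CSV."""
    pages = []
    with bz2.open(filename, "rt", newline="") as fp:
        for row in csv.reader(fp):
            pages.append(row[0].split())
    return pages
```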
doc2vec
from htrc_features import FeatureReader, utils
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv
# data to download
documents = [
["The Bluest Eye","uc1.32106018657251"],
["Song of Solomon","mdp.39015032749130"],
["Sula","uc1.32106019072633"],
["Tar Baby","uc1.32106005767956"],
["Jazz","ien.35556029664190"],
["Beloved","mdp.49015003142743"],
["Paradise","mdp.39015066087613"],
["A Mercy","mdp.39076002787351"]
]
Browse the HathiTrust Digital Library and search for full text by author and title (advanced full-text search works best). An example search for author Sarton, May and title Small Room returns links for "Catalog Record" and "Limited (search-only)." Click "Limited." Examine the URL of the resulting page and copy the text following 'id='; in this instance, "uc1.b4088774." This is the ID you will use to obtain the extracted features with the Feature Reader. The cell above groups documents together in a list of lists, keeping titles with the IDs for later reference (although the titles are not used in this notebook).
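If you prefer to extract the ID programmatically, a small sketch (the helper name and exact URL form are assumptions; the point is simply reading the value of the id= parameter):

```python
from urllib.parse import urlparse, parse_qs

def htid_from_url(url):
    """Extract the HathiTrust volume ID from a page URL (hypothetical helper)."""
    return parse_qs(urlparse(url).query)["id"][0]

htid_from_url("https://babel.hathitrust.org/cgi/pt?id=uc1.b4088774")
# returns 'uc1.b4088774'
```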
# This function extracts individual pages and creates a string of words from tokens.
# Word order is lost from HTRC features, so this creates page-length strings by
# repeating each token once per counted appearance. Thus the token "the" with count 2
# will appear as "the the" in the returned string.
def get_pages(document):
    fr = FeatureReader([document])
    vol = next(fr.volumes())
    ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
    page_list = set(ptc['page'])
    rows = list()
    for page in page_list:
        page_data = str()
        # operate on each token: the second column is the token, the third its count
        for _, row in ptc.loc[ptc['page'] == page].iterrows():
            token, count = row.iloc[1], row.iloc[2]
            if token.isalpha():
                page_data += ' '.join([token] * count) + " "
        # Doc2Vec expects each document as a list of word tokens
        rows.append(page_data.split())
    return rows
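A quick sanity check of the function on a single volume (this downloads the features on demand, so it may take a moment; the indexing simply shows the start of the first page returned):

```python
# sample run of get_pages() on "The Bluest Eye"
sample_pages = get_pages("uc1.32106018657251")
print(len(sample_pages), "pages")
print(sample_pages[0][:20])   # first twenty tokens of the first page
```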
The Feature Reader returns a Pandas DataFrame with token information. This can be viewed in a number of ways, but we simply want the content of each page so that we can train a model on page-level data. Why page level? This is as granular as we can get with non-consumptive data. HathiTrust is allowed to distribute these data because they are not plain text: the form of the page (i.e., the original word ordering) has been lost, so they are data rather than text. The following is a sample of the data for page 127 of the previously mentioned May Sarton novel:
Token | Count |
---|---|
! | 1 |
'' | 7 |
's | 1 |
, | 15 |
-lsb- | 1 |
. | 17 |
123 | 1 |
? | 2 |
`` | 8 |
a | 3 |
about | 1 |
absolute | 1 |
accusations | 1 |
all | 1 |
and | 7 |
anguish | 1 |
another | 1 |
any | 1 |
as | 3 |
at | 1 |
atmosphere | 1 |
back | 1 |
baffled | 1 |
because | 1 |
been | 2 |
began | 2 |
beyond | 1 |
borrowed | 1 |
breakdown | 1 |
but | 2 |
by | 1 |
carryl | 5 |
changed | 1 |
check | 1 |
christ | 1 |
comment | 1 |
cope | 5 |
could | 1 |
course | 2 |
cry | 1 |
cup | 1 |
deep | 1 |
defensiveness | 1 |
did | 1 |
difficult | 1 |
distaste | 1 |
does | 1 |
down | 2 |
dreadful | 1 |
ended | 1 |
enough | 1 |
ever | 1 |
every | 1 |
except | 1 |
exhausting | 1 |
expression | 1 |
face | 1 |
fact | 1 |
fatigue | 1 |
feel | 1 |
feeling | 1 |
felt | 1 |
followed | 1 |
forced | 1 |
forgotten | 1 |
from | 1 |
gesture | 1 |
girl | 1 |
good | 1 |
got | 1 |
had | 4 |
has | 1 |
have | 1 |
having | 1 |
her | 5 |
hers | 1 |
herself | 1 |
hesitation | 1 |
how | 4 |
i | 1 |
if | 1 |
immense | 1 |
impact | 1 |
impossible | 1 |
in | 2 |
is | 1 |
it | 4 |
jane | 3 |
like | 1 |
listened | 1 |
literally | 1 |
little | 1 |
long | 1 |
longer | 1 |
loses | 1 |
lost | 1 |
lucy | 4 |
makes | 1 |
making | 1 |
many | 1 |
me | 2 |
mere | 1 |
must | 3 |
no | 3 |
not | 3 |
nothing | 1 |
occasionally | 1 |
of | 9 |
offering | 1 |
one | 3 |
other | 1 |
out | 1 |
pace | 1 |
performances | 1 |
point | 1 |
pouring | 1 |
published | 1 |
punished | 1 |
punishment | 1 |
reality | 1 |
realized | 1 |
really | 1 |
recheck | 1 |
remote | 1 |
room | 1 |
s | 1 |
scene | 1 |
seem | 1 |
seemed | 1 |
severe | 1 |
she | 12 |
sick | 1 |
sigh | 1 |
sighed | 1 |
silence | 2 |
single | 1 |
slowly | 1 |
solve | 1 |
sort | 1 |
spare | 1 |
spoke | 1 |
still | 1 |
strength | 1 |
suppose | 1 |
swearword | 1 |
talk | 2 |
tea | 1 |
tell | 1 |
that | 3 |
the | 8 |
there | 2 |
thing | 1 |
this | 2 |
thought | 1 |
to | 7 |
track | 1 |
trust | 1 |
truth | 1 |
try | 1 |
turn | 1 |
unconscious | 1 |
understood | 1 |
up | 2 |
very | 2 |
violent | 1 |
was | 4 |
we | 2 |
were | 1 |
what | 1 |
when | 1 |
with | 1 |
without | 1 |
would | 1 |
you | 1 |
yourself | 1 |
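A table like the one above can be reproduced directly from the Feature Reader; a sketch using the Sarton volume mentioned earlier (note that the page number refers to the scan sequence, not the printed page number):

```python
fr = FeatureReader(["uc1.b4088774"])
vol = next(fr.volumes())

# same transformation used in get_pages() above
ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
print(ptc.loc[ptc['page'] == 127])
```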
# Process downloaded features and store as TaggedDocument with a tag for the page number.
# This tag is required for Doc2Vec and would normally be based on paragraphs, but we
# can only operate on pages of data from HTRC extracted features
#
pages = list()
for d in documents:
    for page in get_pages(d[1]):
        pages.append(page)

# convert to TaggedDocument
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(pages)]
The per-page token and count data are then expanded into numbered rows for processing with Doc2Vec. The above cells do a minimal amount of preprocessing: we preserve only tokens composed of alphabetical characters (by evaluating "isalpha()" on each token). The data are in list() form and now look like the following:
['a', 'a', 'a', 'about', 'absolute', 'accusations', 'all', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'anguish', 'another', 'any', 'as', 'as', 'as', 'at', 'atmosphere', 'back', 'baffled', 'because', 'been', 'been', 'began', 'began', 'beyond', 'borrowed', 'breakdown', 'but', 'but', 'by', 'carryl', 'carryl', 'carryl', 'carryl', 'carryl', 'changed', 'check', 'christ', 'comment', 'cope', 'cope', 'cope', 'cope', 'cope', 'could', 'course', 'course', 'cry', 'cup', 'deep', 'defensiveness', 'did', 'difficult', 'distaste', 'does', 'down', 'down', 'dreadful', 'ended', 'enough', 'ever', 'every', 'except', 'exhausting', 'expression', 'face', 'fact', 'fatigue', 'feel', 'feeling', 'felt', 'followed', 'forced', 'forgotten', 'from', 'gesture', 'girl', 'good', 'got', 'had', 'had', 'had', 'had', 'has', 'have', 'having', 'her', 'her', 'her', 'her', 'her', 'hers', 'herself', 'hesitation', 'how', 'how', 'how', 'how', 'i', 'if', 'immense', 'impact', 'impossible', 'in', 'in', 'is', 'it', 'it', 'it', 'it', 'jane', 'jane', 'jane', 'like', 'listened', 'literally', 'little', 'long', 'longer', 'loses', 'lost', 'lucy', 'lucy', 'lucy', 'lucy', 'makes', 'making', 'many', 'me', 'me', 'mere', 'must', 'must', 'must', 'no', 'no', 'no', 'not', 'not', 'not', 'nothing', 'occasionally', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'offering', 'one', 'one', 'one', 'other', 'out', 'pace', 'performances', 'point', 'pouring', 'published', 'punished', 'punishment', 'reality', 'realized', 'really', 'recheck', 'remote', 'room', 's', 'scene', 'seem', 'seemed', 'severe', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'sick', 'sigh', 'sighed', 'silence', 'silence', 'single', 'slowly', 'solve', 'sort', 'spare', 'spoke', 'still', 'strength', 'suppose', 'swearword', 'talk', 'talk', 'tea', 'tell', 'that', 'that', 'that', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'there', 'there', 'thing', 'this', 'this', 'thought', 'to', 'to', 'to', 'to', 'to', 'to', 'to', 'track', 'trust', 'truth', 'try', 'turn', 'unconscious', 'understood', 'up', 'up', 'very', 'very', 'violent', 'was', 'was', 'was', 'was', 'we', 'we', 'were', 'what', 'when', 'with', 'without', 'would', 'you', 'yourself']
print("creating model")
model = Doc2Vec(tagged_data,
dm=1, # operate on "paragraphs" (pages) with distributed memory model
vector_size=300, # larger vector size might produce better results
min_count=5, # drop words with very few repetitions
window=150, # larger window size needed because of extracted features
workers=2)
print("saving word2vec model")
model.save_word2vec_format("doc2vec-morrison-novels.w2v")
# load and verify
model = kv.KeyedVectors.load_word2vec_format("doc2vec-morrison-novels.w2v")
model.most_similar(["memory"],topn=25)
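The word2vec-format file above preserves only the word vectors. If the per-page document vectors are also of interest (for example, to find training pages similar to a new passage), one option, not used in this notebook, is to save the full Doc2Vec model immediately after training with model.save() and reload it later; the file name and sample sentence below are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec

# assumes the training cell also ran: model.save("doc2vec-morrison-novels.model")
d2v = Doc2Vec.load("doc2vec-morrison-novels.model")

# infer a vector for a new "page" (a plain list of tokens) and find similar training pages
new_page = "the memory of the house would not leave her".split()
vec = d2v.infer_vector(new_page)
print(d2v.docvecs.most_similar([vec], topn=5))   # in gensim 4.x, use d2v.dv instead of d2v.docvecs
```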