htrc-vectorize

This notebook demonstrates how to use HTRC extracted features to build a word2vec model using Doc2Vec.

Jed Dobson
James.E.Dobson@Dartmouth.EDU
https://jeddobson.github.io
October 2020

This file is part of the htrc-vector-project, begun in June 2020 with Catherine Parnell.

Repository:
https://github.com/jeddobson/htrc-vector-project

Why use HTRC Features?

The HTRC Extracted Features dataset includes over 17 million volumes, including works still protected by copyright. These features can be distributed because they are intended for non-consumptive use: they are designed to be computer rather than human readable. This means that you can model large numbers of texts, including texts from the twentieth and twenty-first centuries. Document- and page-level features are available, and the HTRC feature-reader package for Python provides easy access to them. The features include tokens and their counts on a page-by-page basis; word order is lost (thus the page is no longer human readable). This format limits their use, however, because many popular applications of text mining and machine learning expect preserved word order. The popular Skipgram model used by word2vec, for example, learns by predicting words within a window surrounding a target word.
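
For a sense of what the feature-reader provides, here is a minimal sketch of pulling one volume's page-level features (the HathiTrust ID is one of the Morrison volumes used below; the download requires network access):

from htrc_features import FeatureReader

# minimal sketch: fetch the page-level features for a single volume
# (uc1.32106018657251 is The Bluest Eye, used again later in this notebook)
fr = FeatureReader(["uc1.32106018657251"])
vol = next(fr.volumes())
print(vol.title)

# token counts on a page-by-page basis; word order is not preserved
print(vol.tokenlist(pos=False, case=False).reset_index().head())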

Doc2Vec

This notebook uses Doc2Vec to train a model in which words are learned within a much larger window. Doc2Vec has typically been used to produce vectors from paragraph-length text sources; these vectors are then used for classification and other applications.

Tunables and Limitations

This example notebook uses a set of Toni Morrison novels found in the HathiTrust archive. This is a much smaller dataset than we have used in other experiments; vector models like word2vec generally require larger numbers of text sources (we have trained on 30GB archives).

Text Sources

  • input size: Depending on what you want to model, you will most likely want to work with more than the words of a single author. We have successfully used this approach with thousands of texts across multiple genres.
  • preprocessing: This notebook does not perform any preprocessing and simply imports all extracted features. You may want to remove high-frequency words (i.e., "stopwords"), as shown in the sketch after this list. Results may vary.
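
A minimal stopword-filtering sketch (the stopword set here is a tiny illustrative subset, and remove_stopwords is a hypothetical helper applied to the page token lists returned by get_pages() below):

# illustrative only: a tiny stopword set; a real list would be much longer
stopwords = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "her", "she"}

def remove_stopwords(page_tokens):
    # keep only tokens that do not appear in the stopword set
    return [token for token in page_tokens if token not in stopwords]

# usage, after the pages list is built below:
# pages = [remove_stopwords(page) for page in pages]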

Execution

In real execution, you will want to separate the several phases of this workflow. You can predownload the features and load them locally (this script downloads them as needed). To manage run time, you may also want to process each text individually. We produced a CSV file with one line for each page of a text; these files were then concatenated (and compressed with bzip2) and converted into a TaggedDocument before running Doc2Vec.
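
One way to stage that intermediate step, sketched here with the standard library (the file names and the one-row-per-page CSV layout are assumptions for illustration, not the exact files we produced):

import bz2
import csv

def write_pages_csv(pages, outfile):
    # write one CSV row per page, each row holding that page's tokens
    with open(outfile, "w", newline="") as f:
        csv.writer(f).writerows(pages)

def read_pages_csv(infile):
    # read a concatenated, bzip2-compressed CSV back into per-page token lists
    with bz2.open(infile, "rt", newline="") as f:
        return [row for row in csv.reader(f)]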

doc2vec

  • window: This is the primary tunable. The smaller this value, the more likely you are to find similar word vectors based on alphabetical ordering (the tokens are processed in alphabetical order, as received from the feature-reader). The window size should approximate the typical page length.
  • min_count: A small integer here removes most of the errors introduced during digitization or during creation of the extracted features.
In [1]:
from htrc_features import FeatureReader, utils  
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv
In [2]:
# data to download
documents = [
    ["The Bluest Eye","uc1.32106018657251"],
    ["Song of Solomon","mdp.39015032749130"],
    ["Sula","uc1.32106019072633"],
    ["Tar Baby","uc1.32106005767956"],
    ["Jazz","ien.35556029664190"],
    ["Beloved","mdp.49015003142743"],
    ["Paradise","mdp.39015066087613"],
    ["A Mercy","mdp.39076002787351"]
]

Finding Sources

Browse the HathiTrust Digital Library and search for full text by author and title (advanced full-text search works best). An example search for author "Sarton, May" and title "Small Room" returns links for "Catalog Record" and "Limited (search-only)." Click "Limited" and examine the URL of the resulting page. Copy the text following "id=", in this instance "uc1.b4088774." This is the ID you will use to obtain the extracted features with the Feature Reader. The cell above groups documents together in a list of lists, keeping titles with the IDs for later reference (although the titles are not used in this notebook).
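
If you want to script that last step, the ID can be pulled from the page URL with the standard library (the URL below is illustrative; only the id= parameter matters):

from urllib.parse import urlparse, parse_qs

# illustrative "Limited (search-only)" URL; the HathiTrust ID is the id= parameter
url = "https://babel.hathitrust.org/cgi/pt/search?q1=room&id=uc1.b4088774"
htid = parse_qs(urlparse(url).query)["id"][0]
print(htid)  # uc1.b4088774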

In [3]:
# This function extracts individual pages and creates a string of words from the tokens.
# Word order is lost from HTRC features, so this builds page-length strings by
# repeating each token once per appearance. Thus, the token "the" with count 2 will
# appear as "the the" in the returned string.

def get_pages(document):
    fr = FeatureReader([document])
    vol = next(fr.volumes())
    ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
    page_list = set(ptc['page'])
    
    rows=list()
    for page in page_list:
        page_data = str()
        
        # operate on each token
        for page_tokens in ptc.loc[ptc['page'] == page].iterrows():
            # each row is (index, Series); position 1 of the Series is the token, 2 its count
            if page_tokens[1][1].isalpha():
                page_data += (' '.join([page_tokens[1][1]] * page_tokens[1][2])) + " "

        # Doc2Vec needs a list of word tokens for each page
        rows.append(page_data.split())
    return rows

Extracted Features Format

The Feature Reader returns a Pandas DataFrame with token information. This can be viewed in a number of ways, but we simply want the content of each page so that we can train a model on page-level data. Why page level? This is as granular as we can get with non-consumptive data. HathiTrust is allowed to distribute these data because they are not plain text: since the form of the page (i.e., original word ordering) has been lost, they constitute data rather than text. The following is a sample of the data for page 127 of the previously mentioned May Sarton novel (a short sketch for reproducing this listing follows the sample):

Token Count
! 1
'' 7
's 1
, 15
-lsb- 1
. 17
123 1
? 2
`` 8
a 3
about 1
absolute 1
accusations 1
all 1
and 7
anguish 1
another 1
any 1
as 3
at 1
atmosphere 1
back 1
baffled 1
because 1
been 2
began 2
beyond 1
borrowed 1
breakdown 1
but 2
by 1
carryl 5
changed 1
check 1
christ 1
comment 1
cope 5
could 1
course 2
cry 1
cup 1
deep 1
defensiveness 1
did 1
difficult 1
distaste 1
does 1
down 2
dreadful 1
ended 1
enough 1
ever 1
every 1
except 1
exhausting 1
expression 1
face 1
fact 1
fatigue 1
feel 1
feeling 1
felt 1
followed 1
forced 1
forgotten 1
from 1
gesture 1
girl 1
good 1
got 1
had 4
has 1
have 1
having 1
her 5
hers 1
herself 1
hesitation 1
how 4
i 1
if 1
immense 1
impact 1
impossible 1
in 2
is 1
it 4
jane 3
like 1
listened 1
literally 1
little 1
long 1
longer 1
loses 1
lost 1
lucy 4
makes 1
making 1
many 1
me 2
mere 1
must 3
no 3
not 3
nothing 1
occasionally 1
of 9
offering 1
one 3
other 1
out 1
pace 1
performances 1
point 1
pouring 1
published 1
punished 1
punishment 1
reality 1
realized 1
really 1
recheck 1
remote 1
room 1
s 1
scene 1
seem 1
seemed 1
severe 1
she 12
sick 1
sigh 1
sighed 1
silence 2
single 1
slowly 1
solve 1
sort 1
spare 1
spoke 1
still 1
strength 1
suppose 1
swearword 1
talk 2
tea 1
tell 1
that 3
the 8
there 2
thing 1
this 2
thought 1
to 7
track 1
trust 1
truth 1
try 1
turn 1
unconscious 1
understood 1
up 2
very 2
violent 1
was 4
we 2
were 1
what 1
when 1
with 1
without 1
would 1
you 1
yourself 1
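
This listing can be reproduced with the Feature Reader directly. A minimal sketch, where uc1.b4088774 is the Sarton volume from "Finding Sources" and 127 is the page index within the features (which may not match the printed page number):

from htrc_features import FeatureReader

# sketch: view the token/count rows for a single page of one volume
fr = FeatureReader(["uc1.b4088774"])
vol = next(fr.volumes())
ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
print(ptc.loc[ptc['page'] == 127])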
In [4]:
# Process downloaded features and store as TaggedDocuments with a tag for each page number.
# This tag is required for Doc2Vec and would normally be based on paragraphs, but we
# can only operate on pages of data from HTRC extracted features.
#

pages = list()
for d in documents:
    for page in get_pages(d[1]):
        pages.append(page)

# convert to TaggedDocument
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(pages)]

Page Data Prepared for Doc2Vec

The per-page token and count data are then expanded into numbered rows for processing with Doc2Vec. The above cells do a minimal amount of preprocessing: we only preserve tokens composed of alphabetical characters (by evaluating isalpha() on each token). The data are in list() form and now look like the following:

['a', 'a', 'a', 'about', 'absolute', 'accusations', 'all', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'anguish', 'another', 'any', 'as', 'as', 'as', 'at', 'atmosphere', 'back', 'baffled', 'because', 'been', 'been', 'began', 'began', 'beyond', 'borrowed', 'breakdown', 'but', 'but', 'by', 'carryl', 'carryl', 'carryl', 'carryl', 'carryl', 'changed', 'check', 'christ', 'comment', 'cope', 'cope', 'cope', 'cope', 'cope', 'could', 'course', 'course', 'cry', 'cup', 'deep', 'defensiveness', 'did', 'difficult', 'distaste', 'does', 'down', 'down', 'dreadful', 'ended', 'enough', 'ever', 'every', 'except', 'exhausting', 'expression', 'face', 'fact', 'fatigue', 'feel', 'feeling', 'felt', 'followed', 'forced', 'forgotten', 'from', 'gesture', 'girl', 'good', 'got', 'had', 'had', 'had', 'had', 'has', 'have', 'having', 'her', 'her', 'her', 'her', 'her', 'hers', 'herself', 'hesitation', 'how', 'how', 'how', 'how', 'i', 'if', 'immense', 'impact', 'impossible', 'in', 'in', 'is', 'it', 'it', 'it', 'it', 'jane', 'jane', 'jane', 'like', 'listened', 'literally', 'little', 'long', 'longer', 'loses', 'lost', 'lucy', 'lucy', 'lucy', 'lucy', 'makes', 'making', 'many', 'me', 'me', 'mere', 'must', 'must', 'must', 'no', 'no', 'no', 'not', 'not', 'not', 'nothing', 'occasionally', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'of', 'offering', 'one', 'one', 'one', 'other', 'out', 'pace', 'performances', 'point', 'pouring', 'published', 'punished', 'punishment', 'reality', 'realized', 'really', 'recheck', 'remote', 'room', 's', 'scene', 'seem', 'seemed', 'severe', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'she', 'sick', 'sigh', 'sighed', 'silence', 'silence', 'single', 'slowly', 'solve', 'sort', 'spare', 'spoke', 'still', 'strength', 'suppose', 'swearword', 'talk', 'talk', 'tea', 'tell', 'that', 'that', 'that', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'there', 'there', 'thing', 'this', 'this', 'thought', 'to', 'to', 'to', 'to', 'to', 'to', 'to', 'track', 'trust', 'truth', 'try', 'turn', 'unconscious', 'understood', 'up', 'up', 'very', 'very', 'violent', 'was', 'was', 'was', 'was', 'we', 'we', 'were', 'what', 'when', 'with', 'without', 'would', 'you', 'yourself']
In [5]:
print("creating model")
model = Doc2Vec(tagged_data, 
                dm=1, # operate on "paragraphs" (pages) with distributed memory model
                vector_size=300, # larger vector size might produce better results
                min_count=5, # drop words with very few repetitions
                window=150, # larger window size needed because of extracted features
                workers=2)

print("saving word2vec model")
model.save_word2vec_format("doc2vec-morrison-novels.w2v")
creating model
saving word2vec model
In [6]:
# load and verify
model = kv.KeyedVectors.load_word2vec_format("doc2vec-morrison-novels.w2v")
In [7]:
model.most_similar(["memory"],topn=25)
Out[7]:
[('painful', 0.9396734237670898),
 ('permanent', 0.9382855296134949),
 ('pieces', 0.9304176568984985),
 ('path', 0.9261106848716736),
 ('sauce', 0.9238497018814087),
 ('quite', 0.9236317873001099),
 ('months', 0.9215139746665955),
 ('rain', 0.9191372394561768),
 ('outside', 0.9182758331298828),
 ('ruby', 0.9170806407928467),
 ('order', 0.9139819741249084),
 ('secret', 0.9124130606651306),
 ('roses', 0.9104964137077332),
 ('single', 0.9092104434967041),
 ('remembered', 0.908808708190918),
 ('schoolhouse', 0.9077692031860352),
 ('rented', 0.9066518545150757),
 ('such', 0.9056478142738342),
 ('parts', 0.9044241905212402),
 ('thoroughly', 0.9029747247695923),
 ('once', 0.9015201330184937),
 ('thrill', 0.9012625813484192),
 ('perfect', 0.8979784250259399),
 ('sheets', 0.896855354309082),
 ('pressed', 0.8961389660835266)]