Tools for working with the HTRC Extracted Features dataset, a dataset of page-level text features extracted from 17 million digitized works.
This library provides a `FeatureReader` for parsing files, which are handled as `Volume` objects with collections of `Page` objects. Volumes provide access to metadata (e.g. language), volume-wide feature information (e.g. token counts), and access to Pages. Pages allow you to easily parse page-level features, particularly token lists.
This library makes heavy use of Pandas, returning many data representations as DataFrames. This is the leading way of dealing with structured data in Python, so this library doesn't try to reinvent the wheel. Since the library was refactored around Pandas, its primary benefit is performance: reading and parsing the JSON structures is generally faster than custom code. You also get convenient access to common information, such as case-folded token counts or part-of-page-specific character counts. Details of the public methods provided by this library can be found in the HTRC Feature Reader docs.
Table of Contents: Installation | Usage | Additional Notes
Links: HTRC Feature Reader Documentation | HTRC Extracted Features Dataset
Citation: Peter Organisciak and Boris Capitanu, "Text Mining in Python through the HTRC Feature Reader," Programming Historian, (22 November 2016), http://programminghistorian.org/lessons/text-mining-with-extracted-features.
To install:

```bash
pip install htrc-feature-reader
```

That's it! This library is written for Python 3.0+. If you are new to Python, you'll need pip.
Alternately, if you are using Anaconda, you can install with:

```bash
conda install -c htrc htrc-feature-reader
```

The conda approach is recommended, because it makes sure that some of the hard-to-install dependencies are properly installed.
Given the nature of data analysis, using IPython with Jupyter notebooks to prepare your scripts interactively is a recommended convenience. Most basically, it can be installed with `pip install ipython[notebook]` and run with `ipython notebook` from the command line, which starts a session that you can access through your browser. If this doesn't work, consult the IPython documentation.
Optional: installing the development version.
Note: for new Python users, a more in-depth lesson is published by Programming Historian: Text Mining in Python through the HTRC Feature Reader. That lesson is also the official citation associated with the HTRC Feature Reader library.
The easiest way to start using this library is to use the Volume interface, which takes a path to an Extracted Features file.
```python
from htrc_features import Volume
vol = Volume('data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2')
vol
```
The Nautilus. by Delaware Museum of Natural History. (1904, 222 pages) - hvd.32044093320364
The FeatureReader can also download files at read time, by reference to a HathiTrust volume id. For example, if I want both volumes of Pride and Prejudice, I can see that the URLs are babel.hathitrust.org/cgi/pt?id=hvd.32044013656053 and babel.hathitrust.org/cgi/pt?id=hvd.32044013656061. These ids can be passed directly to `Volume`, or to the FeatureReader with the `ids=[]` argument, as follows:
```python
for htid in ["hvd.32044013656053", "hvd.32044013656061"]:
    vol = Volume(htid)
    print(vol.title, vol.enumeration_chronology)
```
Pride and prejudice. v.1
Pride and prejudice. v.2
This downloads the file temporarily, using the HTRC's web-based download link (e.g. https://data.analytics.hathitrust.org/features/get?download-id={{URL}}). One good pairing with this feature is the HTRC Python SDK's functionality for downloading collections.
For example, I have a small collection of knitting-related books at https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610. To read the feature files for those books:
```python
from htrc import workset
from htrc_features import FeatureReader

volids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610')
FeatureReader(ids=volids).first().title
```
Remember that for large jobs, it is faster to download your dataset beforehand, using the `rsync` method described below.
A Volume contains information about the current work and access to the pages of the work. All the metadata fields from the HTRC JSON file are accessible as properties of the volume object, including title, language, imprint, oclc, pubDate, and genre. The main identifier `id` and `pageCount` are also accessible, and you can find the URL for the Full View of the text in the HathiTrust Digital Library - if it exists - with `vol.handle_url`.
"Volume {} is a {} page text from {} written in {}. You can doublecheck at {}".format(vol.id, vol.page_count,
vol.year, vol.language,
vol.handle_url)
'Volume hvd.32044013656061 is a 306 page text from 1903 written in eng. You can doublecheck at http://hdl.handle.net/2027/hvd.32044013656061'
This is the Extracted Features dataset, so the features are easily accessible. The most popular are token counts, which are returned as a Pandas DataFrame:
```python
df = vol.tokenlist()
df.sample(10)
```
| page | section | token | pos | count |
|------|---------|-------|-----|-------|
| 201 | body | abode | NN | 1 |
| 117 | body | head | NN | 1 |
| 126 | body | for | IN | 1 |
| 210 | body | three | CD | 1 |
| 224 | body | would | MD | 1 |
| 89 | body | The | DT | 1 |
| 283 | body | any | DT | 1 |
| 63 | body | surprise | NN | 1 |
| 152 | body | make | VB | 1 |
| 170 | body | I | PRP | 3 |
Other extracted features are discussed below.
The full included metadata can be seen with `vol.parser.meta`:

```python
vol.parser.meta.keys()
```
dict_keys(['id', 'metadata_schema_version', 'enumeration_chronology', 'type_of_resource', 'title', 'date_created', 'pub_date', 'language', 'access_profile', 'isbn', 'issn', 'lccn', 'oclc', 'page_count', 'feature_schema_version', 'ht_bib_url', 'genre', 'handle_url', 'imprint', 'names', 'source_institution', 'classification', 'issuance', 'bibliographic_format', 'government_document', 'hathitrust_record_number', 'rights_attributes', 'pub_place', 'volume_identifier', 'source_institution_record_number', 'last_update_date'])
These fields are mapped to attributes in `Volume`, so `vol.oclc` will return the oclc field from that metadata. As a convenience, `Volume.year` returns the `pub_date` information and `Volume.author` returns the contributor information.
```python
vol.year, vol.author
```
('1903', ['Austen, Jane 1775-1817 '])
If the minimal metadata included with the extracted feature files is insufficient, you can fetch HT's metadata record from the Bib API with `vol.metadata`.

Remember that this calls the HTRC servers for each volume, so it can add considerable overhead. The result is a MARC record, returned as a pymarc Record object. For example, to get the publisher information from field 260:
```python
vol.metadata['260'].value()
```
'Boston : Little, Brown, 1903.'
At large scales, using `vol.metadata` means an impolite and inefficient amount of server pinging; there are better ways to query the API than one volume at a time. Read about the HTRC Solr Proxy.
Another source of bibliographic metadata is the HathiTrust Bib API. You can access this information through the URL returned with `vol.ht_bib_url`:
```python
vol.ht_bib_url
```
'http://catalog.hathitrust.org/api/volumes/full/htid/hvd.32044013656061.json'
Volumes also have direct access to volume-wide aggregates of the features stored in pages. For example, you can get per-page token counts through `Volume.tokens_per_page()`. We'll discuss these features below, after looking first at Pages.
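As a quick sketch of what that looks like, using the volume loaded above; this assumes, as the name suggests, that `tokens_per_page()` returns a Pandas object indexed by page, so check the HTRC Feature Reader docs for the exact return type:

```python
# Per-page token totals for the volume loaded above (a sketch, not canonical output).
per_page = vol.tokens_per_page()
per_page.head()   # first few pages and their token counts
```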
Note that, for the most part, the properties of the `Page` and `Volume` objects align with the names in the HTRC Extracted Features schema, except that they are converted to follow Python naming conventions: the `CamelCase` of the schema becomes `lowercase_with_underscores`. E.g. `beginLineChars` from the HTRC data is accessible as `Page.begin_line_chars`.
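As a rough illustration of page-level access (a sketch only; it assumes `Volume.pages()` yields `Page` objects and that `begin_line_chars`, like the volume-level methods shown further down, is called as a method):

```python
# Take the first Page of the volume and inspect two page-level features.
# (Sketch: check the Feature Reader docs for the exact signatures and return shapes.)
page = next(vol.pages())
page.tokenlist().head()     # token counts for just this page
page.begin_line_chars()     # the schema's beginLineChars, in Python naming
```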
Token counts are returned by `Volume.tokenlist()` (or `Page.tokenlist()`). By default, part-of-speech-tagged, case-sensitive counts are returned for the body.
The token count information is returned as a DataFrame with a MultiIndex (page, section, token, and part of speech) and one column (count).
```python
print(vol.tokenlist()[:3])
```
count
page section token pos
1 body Austen . 1
Pride NNP 1
and CC 1
`Volume.tokenlist()` (and `Page.tokenlist()`) can be manipulated in various ways. You can case-fold, for example:
```python
tl = vol.tokenlist(case=False)
tl.sample(5)
```
| page | section | lowercase | pos | count |
|------|---------|-----------|-----|-------|
| 218 | body | what | WP | 1 |
| 30 | body | pemberley | NNP | 1 |
| 213 | body | comes | VBZ | 2 |
| 183 | body | took | VBD | 1 |
| 51 | body | necessary | JJ | 1 |
Or, you can combine part of speech counts into a single integer.
```python
tl = vol.tokenlist(pos=False)
tl.sample(5)
```
| page | section | token | count |
|------|---------|-------|-------|
| 264 | body | family | 2 |
| 47 | body | journey | 1 |
| 98 | body | Perhaps | 1 |
| 49 | body | at | 2 |
| 227 | body | so | 1 |
The `section` argument is also supported, with the valid values 'header', 'body', 'footer', 'all', and 'group':
```python
tl = vol.tokenlist(section="header", case=False, pos=False)
tl.head(5)
```
| page | section | lowercase | count |
|------|---------|-----------|-------|
| 9 | header | 's | 1 |
|   |        | and | 1 |
|   |        | austen | 1 |
|   |        | jane | 1 |
|   |        | prejudice | 1 |
You can also drop the section index altogether if you're content with the default 'body'.
```python
vol.tokenlist(drop_section=True, case=False, pos=False).sample(2)
```
| page | lowercase | count |
|------|-----------|-------|
| 247 | suppose | 1 |
| 76 | would | 2 |
The MultiIndex makes it easy to slice the results, and it is altogether more memory-efficient. For example, to return just the nouns (`NN`):
```python
tl = vol.tokenlist()
tl.xs('NN', level='pos').head(4)
```
| page | section | token | count |
|------|---------|-------|-------|
| 1 | body | prejudiceJane | 1 |
| 9 | body | Volume | 1 |
| 10 | body | vol | 3 |
| 12 | body | ./■ | 1 |
If you are new to Pandas DataFrames, you might find it easier to learn by converting the index to columns.
```python
simpler_tl = df.reset_index()
simpler_tl[simpler_tl.pos == 'NN']
```
|   | page | section | token | pos | count |
|---|------|---------|-------|-----|-------|
| 3 | 1 | body | prejudiceJane | NN | 1 |
| 19 | 9 | body | Volume | NN | 1 |
| 40 | 10 | body | vol | NN | 3 |
| 51 | 12 | body | ./■ | NN | 1 |
| 53 | 12 | body | / | NN | 1 |
| ... | ... | ... | ... | ... | ... |
| 43178 | 297 | body | spite | NN | 1 |
| 43187 | 297 | body | uncle | NN | 1 |
| 43191 | 297 | body | warmest | NN | 1 |
| 43195 | 297 | body | wife | NN | 1 |
| 43226 | 305 | body | NON-RECEIPT | NN | 1 |
7224 rows × 5 columns
If you prefer not to use Pandas, you can always convert the object, with methods like `to_dict` and `to_csv`:
```python
tl[:3].to_csv()
```
'page,section,token,pos,count\n1,body,Austen,.,1\n1,body,Pride,NNP,1\n1,body,and,CC,1\n'
To get just the unique tokens, `Volume.tokens` provides them as a set. Here I select a specific page for brevity and set a minimum count, but you can run the method without arguments.
```python
vol.tokens(page_select=21, min_count=5)
```
{'"', ',', '.', 'You', 'been', 'have', 'his', 'in', 'of', 'the', 'you'}
In addition to token lists, you can also access other section features:
```python
vol.section_features()
```
| page | tokenCount | lineCount | emptyLineCount | capAlphaSeq | sentenceCount |
|------|------------|-----------|----------------|-------------|---------------|
| 1 | 4 | 1 | 0 | 1 | 1 |
| 2 | 15 | 10 | 4 | 2 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 302 | 0 | 0 | 0 | 0 | 0 |
| 303 | 0 | 0 | 0 | 0 | 0 |
| 304 | 0 | 0 | 0 | 0 | 0 |
| 305 | 49 | 11 | 2 | 3 | 3 |
| 306 | 2 | 3 | 1 | 1 | 1 |
306 rows × 5 columns
If you need comparably sized document units, you can use 'chunking' to roll pages into chunks that aim for a specific token length, e.g.:
```python
by_chunk = vol.tokenlist(chunk=True, chunk_target=10000)
print(by_chunk.sample(4))

# Count words per chunk
by_chunk.groupby(level='chunk').sum()
```
count
chunk section token pos
5 body husbands NNS 3
2 body frequently RB 3
domestic JJ 3
3 body : : 10
| chunk | count |
|-------|-------|
| 1 | 12453 |
| 2 | 9888 |
| 3 | 9887 |
| 4 | 10129 |
| 5 | 10054 |
| 6 | 10065 |
| 7 | 12327 |
For large jobs, you'll want to use multiprocessing or multithreading to speed up your process. This is left up to your preferred method, either within Python or by spawning multiple scripts from the command line. Here are two approaches that I like.
Dask offers easy multithreading (shared resources) and multiprocessing (separate processes) in Python, and is particularly convenient because it implements a subset of the Pandas DataFrame API.
Here is a minimal example that lazily loads token frequencies from a list of volume IDs and counts them up by part-of-speech tag.
```python
import dask.dataframe as dd
from dask import delayed
from htrc_features import FeatureReader

def get_tokenlist(volid):
    ''' Load a one-volume feature reader, get that volume, and return its tokenlist '''
    return FeatureReader(ids=[volid]).first().tokenlist()

delayed_dfs = [delayed(get_tokenlist)(volid) for volid in volids]

# Create a Dask DataFrame from the delayed per-volume tokenlists,
# then group by part-of-speech tag and sum the counts
ddf = (dd.from_delayed(delayed_dfs)
         .reset_index()
         .groupby('pos')[['count']]
         .sum()
      )

# Run the lazy processing
ddf.compute()
```
Here is an example of 78 volumes being processed in 24 seconds with 31 threads:
This example used multithreading. Due to the nature of Python, certain functions won't parallelize well. In our case, the part where the JSON is read from the file and converted to a DataFrame (the light green parts of the graphic) won't speed up, because that work holds Python's Global Interpreter Lock (GIL). However, because Pandas releases the GIL for many operations, nearly everything you do after parsing the JSON will be very quick.
To better understand what happens when `ddf.compute()` is called, here is a task graph for 4 volumes:
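If you want to draw a similar task graph for your own job, Dask can render one before computing. This is a side note rather than part of the Feature Reader itself, and it assumes the optional graphviz dependency is installed:

```python
# Render the task graph for the lazy computation defined above.
# Requires graphviz; writes an image file instead of computing results.
ddf.visualize(filename='task-graph.png')
```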
As an alternative to multiprocessing in Python, my preference is to have simpler Python scripts and to use GNU Parallel on the command line. To do this, you can set up your Python script to take a variable number of feature file paths as arguments and to print its results to stdout.
This pseudo-code shows how you'd use parallel, where the number of parallel processes is 90% of the number of cores and 50 paths are sent to the script at a time (if you send too few at a time, the initialization time of the script can add up). A sketch of such a script follows the command.
```bash
find feature-files/ -name '*json.bz2' | parallel --eta --jobs 90% -n 50 python your_script.py >output.txt
```
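Here is a minimal sketch of what such a script could look like; `your_script.py` is just the placeholder name used in the command above, and what you compute and print is up to you:

```python
# your_script.py: read feature file paths from the command line,
# load each as a Volume, and print one tab-separated result line per volume.
import sys
from htrc_features import Volume

for path in sys.argv[1:]:
    vol = Volume(path)
    # Example output: volume id and total body token count
    total = vol.tokenlist(case=False, pos=False)['count'].sum()
    print(vol.id, total, sep='\t')
```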
To install the development version:

```bash
git clone https://github.com/htrc/htrc-feature-reader.git
cd htrc-feature-reader
python setup.py install
```
If you need to do fast, highly customized processing without instantiating Volumes, FeatureReader has a convenient generator for getting the raw JSON as a Python dict: `fr.jsons()`. This simply does the file reading, optional decompression, and JSON parsing.
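A rough sketch of how that generator might be used; it assumes a FeatureReader built from local paths (the path below is the sample file from the start of this README), and the dict layout should be checked against the Extracted Features schema:

```python
from htrc_features import FeatureReader

# Iterate over raw JSON dicts without building Volume objects.
paths = ['data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2']
fr = FeatureReader(paths)
for vol_json in fr.jsons():
    print(vol_json.keys())   # top-level keys of the raw Extracted Features JSON
```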
`utils` includes an Rsyncing utility, `download_file`. This requires Rsync to be installed on your system.
Usage:
Download one file to the current directory:

```python
utils.download_file(htids='nyp.33433042068894')
```

Download multiple files to the current directory:

```python
ids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids=ids)
```

Download a file to `/tmp`:

```python
utils.download_file(htids='nyp.33433042068894', outdir='/tmp')
```

Download a file to the current directory, keeping the pairtree directory structure, i.e. `./nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/nyp.33433042068894.json.bz2`:

```python
utils.download_file(htids='nyp.33433042068894', keep_dirs=True)
```
### Getting the Rsync URL
If you have a HathiTrust Volume ID and want to be able to download the features for a specific book, `htrc_features.utils` contains an [id_to_rsync](http://htrc.github.io/htrc-feature-reader/htrc_features/utils.m.html#htrc_features.utils.id_to_rsync) function. This uses the [pairtree](http://pythonhosted.org/Pairtree/) library but has a fallback for when that library is not installed, since it isn't compatible with Python 3.
```python
from htrc_features import utils
utils.id_to_rsync('miun.adx6300.0001.001')
```

'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
See the ID to Rsync notebook for more information on this format and on Rsyncing lists of URLs.
There is also a command line utility installed with the HTRC Feature Reader:
```bash
$ htid2rsync miun.adx6300.0001.001
miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2
```
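To tie this back to the earlier advice about downloading your dataset beforehand, here is a hedged shell sketch of pairing `htid2rsync` with rsync; note that the rsync endpoint shown (`data.analytics.hathitrust.org::features/`) is an assumption, so verify the current address in the HTRC Extracted Features documentation before using it:

```bash
# Write the pairtree path for a volume to a file, then rsync just that file.
# NOTE: the rsync endpoint below is assumed; verify it against the current
# HTRC Extracted Features documentation.
htid2rsync miun.adx6300.0001.001 > paths.txt
rsync -av --files-from=paths.txt data.analytics.hathitrust.org::features/ feature-files/
```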
In the beta Extracted Features release, schema 2.0, a few features were separated out into 'advanced' files. However, this designation is no longer present starting with schema 3.0, meaning that information like `beginLineChars`, `endLineChars`, and `capAlphaSeq` is always available:
```python
# What is the longest sequence of capital letters on each page?
vol.cap_alpha_seqs()[:10]
```
[0, 1, 0, 0, 0, 0, 0, 0, 4, 1]
```python
end_line_chars = vol.end_line_chars()
print(end_line_chars.head())
```
count
page section place char
2 body end - 1
: 1
I 1
f 1
t 1
```python
import pandas as pd

# Find pages that have lines ending with "!"
idx = pd.IndexSlice
print(end_line_chars.loc[idx[:, :, :, '!'], ].head())
```
count
page section place char
45 body end ! 1
75 body end ! 1
77 body end ! 1
91 body end ! 1
92 body end ! 1
This library is meant to be compatible with Python 3.2+ and Python 2.7+. Tests are written for py.test and can be run with `setup.py test`, or directly with `python -m py.test -v`.
If you find a bug, leave an issue on the issue tracker, or contact Peter Organisciak at `[email protected]`.