Tools for working with the HTRC Extracted Features dataset, a dataset of page-level text features extracted from 17 million digitized works.
This library provides a `FeatureReader` for parsing files, which are handled as `Volume` objects with collections of `Page` objects. Volumes provide access to metadata (e.g. language), volume-wide feature information (e.g. token counts), and access to Pages. Pages allow you to easily parse page-level features, particularly token lists.
This library makes heavy use of Pandas, returning many data representations as DataFrames. This is the leading way of dealing with structured data in Python, so this library doesn't try to reinvent the wheel. Since the library was refactored around Pandas, its primary benefit is performance: reading and parsing the JSON structures is generally faster than custom code. You also get convenient access to common information, such as case-folded token counts or part-of-page-specific character counts. Details of the public methods provided by this library can be found in the HTRC Feature Reader docs.
Table of Contents: Installation | Usage | Additional Notes
Links: HTRC Feature Reader Documentation | HTRC Extracted Features Dataset
Citation: Peter Organisciak and Boris Capitanu, "Text Mining in Python through the HTRC Feature Reader," Programming Historian, (22 November 2016), http://programminghistorian.org/lessons/text-mining-with-extracted-features.
To install:

```bash
pip install htrc-feature-reader
```

That's it! This library is written for Python 3.0+. If you are new to Python, you'll need pip.
Alternately, if you are using Anaconda, you can install with:

```bash
conda install -c htrc htrc-feature-reader
```

The conda approach is recommended, because it makes sure that some of the hard-to-install dependencies are properly installed.
Given the nature of data analysis, using IPython with Jupyter notebooks to prepare your scripts interactively is a recommended convenience. Most basically, it can be installed with `pip install ipython[notebook]` and run with `ipython notebook` from the command line, which starts a session that you can access through your browser. If this doesn't work, consult the IPython documentation.
Optional: installing the development version.
Note: for new Python users, a more in-depth lesson is published by Programming Historian: Text Mining in Python through the HTRC Feature Reader. That lesson is also the official citation associated with the HTRC Feature Reader library.
The easiest way to start using this library is to use the Volume interface, which takes a path to an Extracted Features file.
```python
from htrc_features import Volume
vol = Volume('data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2')
vol
```
The Nautilus. by Delaware Museum of Natural History. (1904, 222 pages) - hvd.32044093320364
The FeatureReader can also download files at read time, by reference to a HathiTrust volume id. For example, if I want both volumes of Pride and Prejudice, I can see that the URLs are babel.hathitrust.org/cgi/pt?id=hvd.32044013656053 and babel.hathitrust.org/cgi/pt?id=hvd.32044013656061. These ids can be passed directly to `Volume`, or to the FeatureReader with the `ids=[]` argument, as follows:
```python
for htid in ["hvd.32044013656053", "hvd.32044013656061"]:
    vol = Volume(htid)
    print(vol.title, vol.enumeration_chronology)
```
Pride and prejudice. v.1
Pride and prejudice. v.2
This downloads the file temporarily, using the HTRC's web-based download link (e.g. https://data.analytics.hathitrust.org/features/get?download-id={{URL}}). One good pairing with this feature is the HTRC Python SDK's functionality for downloading collections.
For example, I have a small collection of knitting-related books at https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610. To read the feature files for those books:
```python
from htrc import workset
from htrc_features import FeatureReader

volids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610')
FeatureReader(ids=volids).first().title
```
Remember that for large jobs, it is faster to download your dataset beforehand, using the `rsync` method described below.
A Volume contains information about the current work and access to the pages of the work. All the metadata fields from the HTRC JSON file are accessible as properties of the volume object, including title, language, imprint, oclc, pubDate, and genre. The main identifier `id` and `pageCount` are also accessible, and you can find the URL for the Full View of the text in the HathiTrust Digital Library - if it exists - with `vol.handle_url`.
"Volume {} is a {} page text from {} written in {}. You can doublecheck at {}".format(vol.id, vol.page_count,
vol.year, vol.language,
vol.handle_url)
'Volume hvd.32044013656061 is a 306 page text from 1903 written in eng. You can doublecheck at http://hdl.handle.net/2027/hvd.32044013656061'
This is the Extracted Features dataset, so the features are easily accessible. The most popular are token counts, which are returned as a Pandas DataFrame:
```python
df = vol.tokenlist()
df.sample(10)
```
| page | section | token | pos | count |
|------|---------|-------|-----|-------|
| 201 | body | abode | NN | 1 |
| 117 | body | head | NN | 1 |
| 126 | body | for | IN | 1 |
| 210 | body | three | CD | 1 |
| 224 | body | would | MD | 1 |
| 89 | body | The | DT | 1 |
| 283 | body | any | DT | 1 |
| 63 | body | surprise | NN | 1 |
| 152 | body | make | VB | 1 |
| 170 | body | I | PRP | 3 |
Other extracted features are discussed below.
The full included metadata can be seen with `vol.parser.meta`:

```python
vol.parser.meta.keys()
```
dict_keys(['id', 'metadata_schema_version', 'enumeration_chronology', 'type_of_resource', 'title', 'date_created', 'pub_date', 'language', 'access_profile', 'isbn', 'issn', 'lccn', 'oclc', 'page_count', 'feature_schema_version', 'ht_bib_url', 'genre', 'handle_url', 'imprint', 'names', 'source_institution', 'classification', 'issuance', 'bibliographic_format', 'government_document', 'hathitrust_record_number', 'rights_attributes', 'pub_place', 'volume_identifier', 'source_institution_record_number', 'last_update_date'])
These fields are mapped to attributes in `Volume`, so `vol.oclc` will return the oclc field from that metadata. As a convenience, `Volume.year` returns the `pub_date` information and `Volume.author` returns the contributor information.
```python
vol.year, vol.author
```
('1903', ['Austen, Jane 1775-1817 '])
If the minimal metadata included with the extracted feature files is insufficient, you can fetch HT's metadata record from the Bib API with `vol.metadata`.

Remember that this calls the HTRC servers for each volume, so it can add considerable overhead. The result is a MARC record, returned as a pymarc Record object. For example, to get the publisher information from field 260:
```python
vol.metadata['260'].value()
```
'Boston : Little, Brown, 1903.'
At large scales, using `vol.metadata` means an impolite and inefficient amount of server pinging; there are better ways to query the API than one volume at a time. Read about the HTRC Solr Proxy.
Another source of bibliographic metadata is the HathiTrust Bib API. You can access this information through the URL returned with `vol.ht_bib_url`:
```python
vol.ht_bib_url
```
'http://catalog.hathitrust.org/api/volumes/full/htid/hvd.32044013656061.json'
Volumes also have direct access to volume-wide aggregates of the features stored in pages. For example, you can get per-page token counts through `Volume.tokens_per_page()`. We'll discuss these features below, after looking first at Pages.
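As a quick sketch of what that looks like, using the volume loaded above; this assumes, as the name suggests, that `tokens_per_page()` returns a Pandas object indexed by page, so check the HTRC Feature Reader docs for the exact return type:

```python
# Per-page token totals for the volume loaded above (a sketch, not canonical output).
per_page = vol.tokens_per_page()
per_page.head()   # first few pages and their token counts
```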
Note that, for the most part, the properties of the `Page` and `Volume` objects align with the names in the HTRC Extracted Features schema, except that they are converted to follow Python naming conventions: the `CamelCase` of the schema becomes `lowercase_with_underscores`. E.g. `beginLineChars` from the HTRC data is accessible as `Page.begin_line_chars`.
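As a rough illustration of page-level access (a sketch only; it assumes `Volume.pages()` yields `Page` objects and that `begin_line_chars`, like the volume-level methods shown further down, is called as a method):

```python
# Take the first Page of the volume and inspect two page-level features.
# (Sketch: check the Feature Reader docs for the exact signatures and return shapes.)
page = next(vol.pages())
page.tokenlist().head()     # token counts for just this page
page.begin_line_chars()     # the schema's beginLineChars, in Python naming
```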
Token counts are returned by `Volume.tokenlist()` (or `Page.tokenlist()`). By default, part-of-speech-tagged, case-sensitive counts are returned for the body.
The token count information is returned as a DataFrame with a MultiIndex (page, section, token, and part of speech) and one column (count).
```python
print(vol.tokenlist()[:3])
```
count
page section token pos
1 body Austen . 1
Pride NNP 1
and CC 1
`Volume.tokenlist()` (and `Page.tokenlist()`) can be manipulated in various ways. You can case-fold, for example:
```python
tl = vol.tokenlist(case=False)
tl.sample(5)
```
| page | section | lowercase | pos | count |
|------|---------|-----------|-----|-------|
| 218 | body | what | WP | 1 |
| 30 | body | pemberley | NNP | 1 |
| 213 | body | comes | VBZ | 2 |
| 183 | body | took | VBD | 1 |
| 51 | body | necessary | JJ | 1 |
Or, you can combine part of speech counts into a single integer.
```python
tl = vol.tokenlist(pos=False)
tl.sample(5)
```
| page | section | token | count |
|------|---------|-------|-------|
| 264 | body | family | 2 |
| 47 | body | journey | 1 |
| 98 | body | Perhaps | 1 |
| 49 | body | at | 2 |
| 227 | body | so | 1 |
The `section` argument is also supported, with the valid values 'header', 'body', 'footer', 'all', and 'group':
```python
tl = vol.tokenlist(section="header", case=False, pos=False)
tl.head(5)
```
| page | section | lowercase | count |
|------|---------|-----------|-------|
| 9 | header | 's | 1 |
|   |        | and | 1 |
|   |        | austen | 1 |
|   |        | jane | 1 |
|   |        | prejudice | 1 |
You can also drop the section index altogether if you're content with the default 'body'.
```python
vol.tokenlist(drop_section=True, case=False, pos=False).sample(2)
```
| page | lowercase | count |
|------|-----------|-------|
| 247 | suppose | 1 |
| 76 | would | 2 |
The MultiIndex makes it easy to slice the results, and it is altogether more memory-efficient. For example, to return just the nouns (`NN`):
```python
tl = vol.tokenlist()
tl.xs('NN', level='pos').head(4)
```
| page | section | token | count |
|------|---------|-------|-------|
| 1 | body | prejudiceJane | 1 |
| 9 | body | Volume | 1 |
| 10 | body | vol | 3 |
| 12 | body | ./■ | 1 |
If you are new to Pandas DataFrames, you might find it easier to learn by converting the index to columns.
```python
simpler_tl = df.reset_index()
simpler_tl[simpler_tl.pos == 'NN']
```
|   | page | section | token | pos | count |
|---|------|---------|-------|-----|-------|
| 3 | 1 | body | prejudiceJane | NN | 1 |
| 19 | 9 | body | Volume | NN | 1 |
| 40 | 10 | body | vol | NN | 3 |
| 51 | 12 | body | ./■ | NN | 1 |
| 53 | 12 | body | / | NN | 1 |
| ... | ... | ... | ... | ... | ... |
| 43178 | 297 | body | spite | NN | 1 |
| 43187 | 297 | body | uncle | NN | 1 |
| 43191 | 297 | body | warmest | NN | 1 |
| 43195 | 297 | body | wife | NN | 1 |
| 43226 | 305 | body | NON-RECEIPT | NN | 1 |
7224 rows × 5 columns
If you prefer not to use Pandas, you can always convert the object, with methods like `to_dict` and `to_csv`:
```python
tl[:3].to_csv()
```
'page,section,token,pos,count\n1,body,Austen,.,1\n1,body,Pride,NNP,1\n1,body,and,CC,1\n'
To get just the unique tokens, `Volume.tokens` provides them as a set. Here I select a specific page for brevity and set a minimum count, but you can run the method without arguments.
```python
vol.tokens(page_select=21, min_count=5)
```
{'"', ',', '.', 'You', 'been', 'have', 'his', 'in', 'of', 'the', 'you'}
In addition to token lists, you can also access other section features:
```python
vol.section_features()
```
| page | tokenCount | lineCount | emptyLineCount | capAlphaSeq | sentenceCount |
|------|------------|-----------|----------------|-------------|---------------|
| 1 | 4 | 1 | 0 | 1 | 1 |
| 2 | 15 | 10 | 4 | 2 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 302 | 0 | 0 | 0 | 0 | 0 |
| 303 | 0 | 0 | 0 | 0 | 0 |
| 304 | 0 | 0 | 0 | 0 | 0 |
| 305 | 49 | 11 | 2 | 3 | 3 |
| 306 | 2 | 3 | 1 | 1 | 1 |
306 rows × 5 columns
If you need comparably sized document units, you can use 'chunking' to roll pages into chunks that aim for a specific token length, e.g.:
```python
by_chunk = vol.tokenlist(chunk=True, chunk_target=10000)
print(by_chunk.sample(4))

# Count words per chunk
by_chunk.groupby(level='chunk').sum()
```
count
chunk section token pos
5 body husbands NNS 3
2 body frequently RB 3
domestic JJ 3
3 body : : 10
| chunk | count |
|-------|-------|
| 1 | 12453 |
| 2 | 9888 |
| 3 | 9887 |
| 4 | 10129 |
| 5 | 10054 |
| 6 | 10065 |
| 7 | 12327 |
For large jobs, you'll want to use multiprocessing or multithreading to speed up your process. This is left up to your preferred method, either within Python or by spawning multiple scripts from the command line. Here are two approaches that I like.
Dask offers easy multithreading (shared resources) and multiprocessing (separate processes) in Python, and is particularly convenient because it implements a subset of the Pandas DataFrame API.
Here is a minimal example that lazily loads token frequencies from a list of volume IDs and counts them up by part-of-speech tag.
```python
import dask.dataframe as dd
from dask import delayed
from htrc_features import FeatureReader

def get_tokenlist(volid):
    ''' Load a one-volume feature reader, get that volume, and return its tokenlist '''
    return FeatureReader(ids=[volid]).first().tokenlist()

delayed_dfs = [delayed(get_tokenlist)(volid) for volid in volids]

# Create a Dask DataFrame from the delayed per-volume tokenlists,
# then group by part-of-speech tag and sum the counts
ddf = (dd.from_delayed(delayed_dfs)
         .reset_index()
         .groupby('pos')[['count']]
         .sum()
      )

# Run the lazy processing
ddf.compute()
```
Here is an example of 78 volumes being processed in 24 seconds with 31 threads:
This example used multithreading. Due to the nature of Python, certain functions won't parallelize well. In our case, the part where the JSON is read from the file and converted to a DataFrame (the light green parts of the graphic) won't speed up, because that work holds Python's Global Interpreter Lock (GIL). However, because Pandas releases the GIL for many operations, nearly everything you do after parsing the JSON will be very quick.
To better understand what happens when `ddf.compute()` is called, here is a task graph for 4 volumes:
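If you want to draw a similar task graph for your own job, Dask can render one before computing. This is a side note rather than part of the Feature Reader itself, and it assumes the optional graphviz dependency is installed:

```python
# Render the task graph for the lazy computation defined above.
# Requires graphviz; writes an image file instead of computing results.
ddf.visualize(filename='task-graph.png')
```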
As an alternative to multiprocessing in Python, my preference is to have simpler Python scripts and to use GNU Parallel on the command line. To do this, you can set up your Python script to take a variable number of feature file paths as arguments and to print its results to stdout.
This pseudo-code shows how you'd use parallel, where the number of parallel processes is 90% of the number of cores and 50 paths are sent to the script at a time (if you send too few at a time, the initialization time of the script can add up). A sketch of such a script follows the command.
```bash
find feature-files/ -name '*json.bz2' | parallel --eta --jobs 90% -n 50 python your_script.py >output.txt
```
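Here is a minimal sketch of what such a script could look like; `your_script.py` is just the placeholder name used in the command above, and what you compute and print is up to you:

```python
# your_script.py: read feature file paths from the command line,
# load each as a Volume, and print one tab-separated result line per volume.
import sys
from htrc_features import Volume

for path in sys.argv[1:]:
    vol = Volume(path)
    # Example output: volume id and total body token count
    total = vol.tokenlist(case=False, pos=False)['count'].sum()
    print(vol.id, total, sep='\t')
```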
To install the development version:

```bash
git clone https://github.com/htrc/htrc-feature-reader.git
cd htrc-feature-reader
python setup.py install
```
If you need to do fast, highly customized processing without instantiating Volumes, FeatureReader has a convenient generator for getting the raw JSON as a Python dict: `fr.jsons()`. This simply does the file reading, optional decompression, and JSON parsing.
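A rough sketch of how that generator might be used; it assumes a FeatureReader built from local paths (the path below is the sample file from the start of this README), and the dict layout should be checked against the Extracted Features schema:

```python
from htrc_features import FeatureReader

# Iterate over raw JSON dicts without building Volume objects.
paths = ['data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2']
fr = FeatureReader(paths)
for vol_json in fr.jsons():
    print(vol_json.keys())   # top-level keys of the raw Extracted Features JSON
```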
`utils` includes an Rsyncing utility, `download_file`. This requires Rsync to be installed on your system.
Usage:
Download one file to the current directory:

```python
utils.download_file(htids='nyp.33433042068894')
```

Download multiple files to the current directory:

```python
ids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids=ids)
```

Download a file to `/tmp`:

```python
utils.download_file(htids='nyp.33433042068894', outdir='/tmp')
```

Download a file to the current directory, keeping the pairtree directory structure, i.e. `./nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/nyp.33433042068894.json.bz2`:

```python
utils.download_file(htids='nyp.33433042068894', keep_dirs=True)
```
### Getting the Rsync URL
If you have a HathiTrust Volume ID and want to be able to download the features for a specific book, `htrc_features.utils` contains an [id_to_rsync](http://htrc.github.io/htrc-feature-reader/htrc_features/utils.m.html#htrc_features.utils.id_to_rsync) function. This uses the [pairtree](http://pythonhosted.org/Pairtree/) library but has a fallback for when that library is not installed, since it isn't compatible with Python 3.
```python
from htrc_features import utils
utils.id_to_rsync('miun.adx6300.0001.001')
```

'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
See the ID to Rsync notebook for more information on this format and on Rsyncing lists of URLs.
There is also a command line utility installed with the HTRC Feature Reader:
```bash
$ htid2rsync miun.adx6300.0001.001
miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2
```
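To tie this back to the earlier advice about downloading your dataset beforehand, here is a hedged shell sketch of pairing `htid2rsync` with rsync; note that the rsync endpoint shown (`data.analytics.hathitrust.org::features/`) is an assumption, so verify the current address in the HTRC Extracted Features documentation before using it:

```bash
# Write the pairtree path for a volume to a file, then rsync just that file.
# NOTE: the rsync endpoint below is assumed; verify it against the current
# HTRC Extracted Features documentation.
htid2rsync miun.adx6300.0001.001 > paths.txt
rsync -av --files-from=paths.txt data.analytics.hathitrust.org::features/ feature-files/
```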
In the beta Extracted Features release, schema 2.0, a few features were separated out into 'advanced' files. However, this designation is no longer present starting with schema 3.0, meaning that information like `beginLineChars`, `endLineChars`, and `capAlphaSeq` is always available:
```python
# What is the longest sequence of capital letters on each page?
vol.cap_alpha_seqs()[:10]
```
[0, 1, 0, 0, 0, 0, 0, 0, 4, 1]
```python
end_line_chars = vol.end_line_chars()
print(end_line_chars.head())
```
count
page section place char
2 body end - 1
: 1
I 1
f 1
t 1
```python
import pandas as pd

# Find pages that have lines ending with "!"
idx = pd.IndexSlice
print(end_line_chars.loc[idx[:, :, :, '!'], ].head())
```
count
page section place char
45 body end ! 1
75 body end ! 1
77 body end ! 1
91 body end ! 1
92 body end ! 1
This library is meant to be compatible with Python 3.2+ and Python 2.7+. Tests are written for py.test and can be run with `setup.py test`, or directly with `python -m py.test -v`.
If you find a bug, leave an issue on the issue tracker, or contact Peter Organisciak at `[email protected]`.