Overview of repository statistics

Peter Midford edited this page Jul 30, 2015 · 20 revisions

Overview

The germinator project now includes scripts that generate JSON-format statistics for phylesystem instances and for synthetic trees.

Report generation scripts

Both scripts pull data via API calls, accumulate counts of studies and of the OTUs within those studies, and generate a JSON record. The record is added to a JSON structure holding the records from previous runs, keyed by the time at which the analysis finished. Time is recorded as ISO 8601 with hour resolution (i.e., formatted '%Y-%m-%dT%HZ').
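The keyed-record scheme described above can be sketched in Python as follows. This is a minimal illustration, not the germinator code: the function names `timestamp_key` and `append_record`, the report path, and the choice of a flat dictionary as the top-level JSON structure are all assumptions.

```python
import json
import os
from datetime import datetime, timezone


def timestamp_key(now=None):
    """Return an hour-resolution ISO 8601 key such as '2015-01-26T00Z'."""
    now = now or datetime.now(timezone.utc)
    return now.strftime('%Y-%m-%dT%HZ')


def append_record(report_path, record, now=None):
    """Add `record` to the JSON report at `report_path`, keyed by run time.

    Hypothetical helper: assumes the report is a single JSON object that
    maps timestamp keys to records.
    """
    if os.path.exists(report_path):
        with open(report_path) as f:
            report = json.load(f)
    else:
        report = {}  # first run: start an empty report
    report[timestamp_key(now)] = record
    with open(report_path, 'w') as f:
        json.dump(report, f, indent=2, sort_keys=True)
    return report
```

Because the key has hour resolution, two runs that finish within the same hour would share a key, and in this sketch the later record would overwrite the earlier one.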

Report 'push' scripts

These scripts use rrsync to push the output of the generation scripts to the appropriate location in a web2py application running on a server machine. Use of rrsync requires a web2py installation and an appropriate ssh public-key configuration. The current configuration pushes to a folder called 'statistics' within a web2py installation (e.g., HOST:/home/USER/web2py/applications/opentree/static/). The push script specifies the statistics folder; the location on the web2py server is specified in the server's .ssh/authorized_keys file.

The scripts take three command line arguments:

  • a local folder holding the working copy of the report (the file the Python script appends to). There should be a separate folder for each server being monitored (e.g. dev, production).
  • the API server used to retrieve studies and OTUs. This may not be the same as the target host.
  • the target host. The scripts will authenticate with the user 'opentree' by default.
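As a rough illustration of the three arguments listed above, here is a Python sketch of an equivalent command-line interface. The actual scripts are shell scripts, so this is only an analogy: the argument names `workdir`, `api_server`, and `target_host` and the `--user` option are hypothetical and do not appear in the originals.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments described in the list above."""
    parser = argparse.ArgumentParser(
        description='Illustrative interface for a statistics push script.')
    parser.add_argument(
        'workdir',
        help='local folder holding the working copy of the report')
    parser.add_argument(
        'api_server',
        help='API server used to retrieve studies and OTUs')
    parser.add_argument(
        'target_host',
        help='host that receives the pushed report')
    # Hypothetical option: the source only says 'opentree' is the default user.
    parser.add_argument('--user', default='opentree',
                        help='user to authenticate as on the target host')
    return parser.parse_args(argv)
```

For example, `parse_args(['devstats', 'devapi.opentreeoflife.org', 'devtree.opentreeoflife.org'])` yields the same three values that the crontab entries pass on the command line.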

At present, these scripts are run nightly from crontab entries as follows:

 0 0 * * * /home/opentree/statistics/phylesystem_stats.sh devstats devapi.opentreeoflife.org devtree.opentreeoflife.org
 0 0 * * * /home/opentree/statistics/synthesis_stats.sh devstats devapi.opentreeoflife.org devtree.opentreeoflife.org

Installation

Report Formats

Fields for phylesystem reports

  • reported_study_count - integer length of list of studies returned
  • study_count - integer number of studies that returned OTUs when queried
  • OTU_count - integer count of OTUs in studies
  • unique_OTU_count - count of OTUs without duplicates
  • unmapped_OTU_count - count of OTU objects that have not been mapped to OTT
  • nominated_study_count - count of 'nominated' studies (= not marked ot:notIntendedForSynthesis)
  • nominated_study_OTU_count - count of OTUs in nominated studies
  • nominated_study_unique_OTU_count - count of OTUs in nominated studies w/o duplicates
  • nominated_study_unmapped_OTU_count - count of OTU objects in nominated studies that have not been mapped to OTT
  • run_time - elapsed time for processing, including queries, in seconds
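To make the field definitions above concrete, here is a minimal Python sketch that derives a phylesystem record from study data already held in memory. It is an illustration under stated assumptions, not the germinator implementation: the input layout (each study is a dict with an 'otus' list, each OTU a dict with an optional 'ott_id', and a 'not_intended_for_synthesis' flag standing in for ot:notIntendedForSynthesis) is hypothetical, duplicates are assumed to be detected by OTT id, and nominated studies are assumed to be counted only among studies that returned OTUs.

```python
import time


def count_otus(studies):
    """Return (total, unique, unmapped) OTU counts for a list of studies."""
    total = 0
    unmapped = 0
    seen = set()  # OTT ids seen so far (assumed basis for de-duplication)
    for study in studies:
        for otu in study['otus']:
            total += 1
            ott_id = otu.get('ott_id')
            if ott_id is None:
                unmapped += 1  # OTU not mapped to OTT
            else:
                seen.add(ott_id)
    return total, len(seen), unmapped


def phylesystem_record(reported):
    """Build a record with the fields listed above from `reported` studies."""
    start = time.time()
    # Studies that returned at least one OTU when queried.
    answered = [s for s in reported if s.get('otus')]
    # 'Nominated' studies: not flagged as excluded from synthesis.
    nominated = [s for s in answered
                 if not s.get('not_intended_for_synthesis')]
    total, unique, unmapped = count_otus(answered)
    n_total, n_unique, n_unmapped = count_otus(nominated)
    return {
        'reported_study_count': len(reported),
        'study_count': len(answered),
        'OTU_count': total,
        'unique_OTU_count': unique,
        'unmapped_OTU_count': unmapped,
        'nominated_study_count': len(nominated),
        'nominated_study_OTU_count': n_total,
        'nominated_study_unique_OTU_count': n_unique,
        'nominated_study_unmapped_OTU_count': n_unmapped,
        'run_time': time.time() - start,  # seconds; excludes API queries here
    }
```

In the real scripts run_time also covers the API queries; this sketch starts from data that has already been fetched, so its run_time only measures the counting.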

Fields for synthesis reports

As of 2015-01-26 there have only been two synthetic trees, one from April 2014 and one from September 2014.

  • date (TO BE DONE) - date of synthesis
  • reported_study_count - integer length of list of studies in synthesis returned
  • study_count - integer number of studies that returned OTUs when queried (subset of previous)
  • total_OTU_count - sum of number of OTUs across all studies
  • unique_OTU_count - integer length of OTU list without duplicates
  • run_time - elapsed time for processing, including queries, in seconds

Fields for taxonomy reports (WORK IN PROGRESS)

As of 2015-01-26 we have taxonomy versions 2.0 through 2.8.

  • date (TO BE DONE) - date of taxonomy release
  • version - a string e.g. "2.8"
  • taxon_count - total number of taxa including 'hidden' taxa
  • visible_taxon_count - number of taxa exclusive of 'hidden' taxa
  • release page - URL (does this belong with statistics?)
  • download link - URL for compressed tarball, see http://files.opentreeoflife.org/ott/ - credits are on release page (does this belong with statistics?)
  • [other fields TBD]