Interim taxonomy file format
This page describes the format used to represent the taxonomies that are the inputs and outputs of the Open Tree of Life taxonomy build system.
The format is intentionally rudimentary because our needs are minimal; I think the format derives from NCBI. A better format to use might be Darwin Core Archive, which is what's used by GBIF. TBD.
Each source taxonomy (NCBI, GBIF, Index Fungorum, ...) has its own script that converts its native format into this format.
There is one directory per taxonomy. The contents of the directory are files with fixed names. Example: mycobank/taxonomy.tsv, mycobank/synonyms.tsv, mycobank/about.md.
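As a rough illustration of the fixed-name layout, the sketch below reports which of the three file names from the example are absent from a taxonomy directory. The function name missing_files is hypothetical, and treating all three files as expected is an assumption; the caller can decide whether a missing synonyms.tsv actually matters.

```python
from pathlib import Path
from typing import List

# File names taken from the mycobank example; expecting all three
# in every taxonomy directory is an assumption of this sketch.
EXPECTED_FILES = ("taxonomy.tsv", "synonyms.tsv", "about.md")


def missing_files(taxonomy_dir: Path) -> List[str]:
    """Return the expected file names that are not present in taxonomy_dir."""
    return [name for name in EXPECTED_FILES
            if not (taxonomy_dir / name).is_file()]
```

For example, missing_files(Path("mycobank")) returns an empty list when all three files exist.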
All files use the UTF-8 character encoding. Native taxonomy files often use some other encoding, so conversion might be necessary. Some aggregated taxonomies on the web have gotten this wrong and are a mess of mixed encodings and spurious re-encodings.
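A conversion step might look something like the following minimal sketch, which assumes the native encoding of the source file is already known. The function names to_utf8 and is_valid_utf8 are hypothetical and not part of any existing conversion script.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from source_encoding to UTF-8.

    Decoding is strict, so bytes that are invalid in source_encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode(source_encoding, errors="strict").encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that is_valid_utf8 only catches bytes that are not UTF-8 at all. Text that has been spuriously re-encoded is usually still valid UTF-8, so this check cannot detect that kind of damage.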
File taxonomy.tsv has the following format:
Four columns, each followed by the separator tab - vertical bar - tab, so every row (including the header) also ends with this separator.
There should be a header row, which looks like:
uid | parent_uid | name | rank |
The header row is followed by one row per taxon.
Column 1: identifier - an integer identifier for the taxon, unique within this file. Should be native accession number whenever possible, sine
Column 2: parent taxon identifier, or the empty string if there is no parent.
Column 3: name - arbitrary text for the taxon name; not necessarily unique within the file.
Column 4: rank, e.g. species, family, class. Should be all lower case. If no rank is assigned, or the rank is unknown, put "no rank".
Example (from NCBI):
5157 | 1028423 | Ceratocystis | genus |
5156 | 91171 | Gondwanamyces proteae | species |
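The row format above can be read with a few lines of code. The following is a minimal sketch, not the parser used by the taxonomy build system; the names Taxon, split_row, and read_taxonomy are hypothetical. It keeps identifiers as strings, represents an empty parent column as None, and tolerates rows whose final tab has been stripped, which are all choices of this sketch.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional

SEP = "\t|\t"  # tab, vertical bar, tab


class Taxon(NamedTuple):
    uid: str
    parent_uid: Optional[str]  # None when column 2 is the empty string
    name: str
    rank: str


def split_row(line: str) -> List[str]:
    """Split one row into its column values."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):      # tolerate a stripped final tab
        line += "\t"
    fields = line.split(SEP)
    if fields and fields[-1] == "":
        fields.pop()              # drop the empty piece after the last separator
    return fields


def read_taxonomy(lines: Iterable[str]) -> Dict[str, Taxon]:
    """Map each uid to its Taxon, checking that uids are unique."""
    taxa: Dict[str, Taxon] = {}
    for line in lines:
        if not line.strip():
            continue              # skip blank lines
        fields = split_row(line)
        if fields[0] == "uid":
            continue              # header row
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 columns: {line!r}")
        uid, parent, name, rank = fields[:4]
        if uid in taxa:
            raise ValueError(f"duplicate uid {uid}")
        taxa[uid] = Taxon(uid, parent or None, name, rank)
    return taxa
```

A file would typically be opened with open(path, encoding="utf-8") and passed to read_taxonomy directly, since a file object iterates over its lines. Any columns after the fourth are ignored here.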
Optional column:
sourceinfo: a comma-separated list of specifiers, each of which is either a URL or a CURIE. A URL should be either a DOI in URL form or a link to some other source, such as a database. URLs begin with 'http://' or 'https://', and DOI URLs begin with 'http://dx.doi.org/10.'. A CURIE is an abbreviated URI whose prefix is drawn from a known set; e.g. ncbi:1234 is taxon 1234 in the NCBI taxonomy. Other prefixes include gbif:, if: (Index Fungorum), and mb: (Mycobank). New prefixes can be added, but this is a manual process, so please request them explicitly.
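To make the distinction between the specifier kinds concrete, here is a small sketch that classifies each entry of a sourceinfo value. The names KNOWN_PREFIXES, classify_specifier, and parse_sourceinfo are hypothetical, and the prefix set contains only the four prefixes mentioned above, so it is likely incomplete.

```python
from typing import List, Tuple

# Only the prefixes named in the text; the real set may be larger.
KNOWN_PREFIXES = {"ncbi", "gbif", "if", "mb"}


def classify_specifier(spec: str) -> str:
    """Return 'doi', 'url', or 'curie' for one specifier."""
    spec = spec.strip()
    if spec.startswith("http://dx.doi.org/10."):
        return "doi"
    if spec.startswith(("http://", "https://")):
        return "url"
    prefix, colon, local_id = spec.partition(":")
    if colon and local_id and prefix in KNOWN_PREFIXES:
        return "curie"
    raise ValueError(f"unrecognized specifier: {spec!r}")


def parse_sourceinfo(field: str) -> List[Tuple[str, str]]:
    """Split a sourceinfo value on commas and classify each specifier.

    Caveat: a URL that itself contains a comma would be split incorrectly.
    """
    return [(classify_specifier(s), s.strip())
            for s in field.split(",") if s.strip()]
```

For instance, parse_sourceinfo("ncbi:1234,gbif:5678") yields two entries, both classified as 'curie'; the gbif identifier here is only a placeholder.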
Usually there are synonyms. These go into a second file, synonyms.tsv. This file should have a header row:
uid | name | type | |
Thereafter there is one row per synonym, with four columns:
Column 1: uid - the id for the taxon (from the taxonomy file) that this synonym resolves to
Column 2: name - the synonymic taxon name
Column 3: type - typically will be 'synonym' but could be any of the NCBI synonym types (authority, common name, etc.)
Column 4: I don't know what this is for. Seems to always be empty, and is ignored by taxonomy synthesis.
Example from NCBI:
89373 | Flexibacteraceae | synonym | |
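One check that follows naturally from this layout is that every synonym's uid resolves to a taxon in taxonomy.tsv. The sketch below reads the first three columns by position and ignores the fourth; the names split_row and read_synonyms are hypothetical, and collecting unresolved uids rather than failing is a choice of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

SEP = "\t|\t"  # tab, vertical bar, tab


def split_row(line: str) -> List[str]:
    """Split one row into its column values."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):      # tolerate a stripped final tab
        line += "\t"
    fields = line.split(SEP)
    if fields and fields[-1] == "":
        fields.pop()              # drop the empty piece after the last separator
    return fields


def read_synonyms(
    lines: Iterable[str], known_uids: Set[str]
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[str]]:
    """Return (synonyms by uid, uids not found in known_uids)."""
    synonyms: Dict[str, List[Tuple[str, str]]] = {}
    unresolved: List[str] = []
    for line in lines:
        if not line.strip():
            continue              # skip blank lines
        fields = split_row(line)
        if fields[0] == "uid":
            continue              # header row
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns: {line!r}")
        uid, name, syn_type = fields[:3]
        if uid not in known_uids:
            unresolved.append(uid)
            continue
        synonyms.setdefault(uid, []).append((name, syn_type))
    return synonyms, unresolved
```

Here known_uids would be the set of identifiers from column 1 of taxonomy.tsv. A non-empty unresolved list suggests that the two files are out of step.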
Overall metadata for the taxonomy should be placed in a file as well. The metadata format is still under development, so for now you should create a Markdown or plain text file called 'about.md' in the same directory as the taxonomy.tsv and synonyms.tsv files. The file should give the source of the taxonomy (article or database) and any other descriptive information that's available. The metadata serves not only to describe the taxonomy but also to explain how to check its correctness against its source and how to make corrections and other improvements.
When using information from changing sources (databases) the date or dates of retrieval should be recorded.
This page was copied from https://github.com/OpenTreeOfLife/opentree/wiki/Interim-taxonomy-file-format on 2014-02-06.