Relating PRO to UniProt #165

nataled · 2019-10-02T12:29:27Z

This issue is a continuation of the discussion here:
geneontology/neo#34

This thread will focus on:

What do the UniProt PURLs denote: database entry, protein class, or sequence?
How does PRO relate to UniProt?

Interested parties (so far):
@JervenBolleman
@cmungall
@goodb
@alanruttenberg

cmungall · 2019-10-02T21:03:26Z

I would very much like there to be a single URI for a concept like "human Shh protein" (or at least two equivalent interchangeable URIs).

nataled · 2019-10-02T21:35:01Z

This will be possible once we find out just what the UniProt PURLs are intended to mean. I recall @JervenBolleman saying he considers them to mean the same as PRO when he gives talks, but I'm not sure there's agreement on that (several people on the previous thread--myself included--indicated that they consider them as referring to database entries). In PRO we consider them exactly that--database entries that are about some protein class (for example, http://purl.uniprot.org/uniprot/P05067 is_about http://purl.obofoundry.org/obo/PR_P05067).

My main concern is that the UniProt PURLs might be overloaded in meaning. That is, some people consider them to refer to classes of proteins, some say they refer to database entries, and others might consider them as referring to sequences. If they are database entries, fine, but for PRO purposes we'll need a way to refer to the sequence. If they are protein classes, fine, we'll provide the appropriate equivalency statements, but we'll still need a way to refer to the sequence. If they are sequences, fine, we'll make the appropriate connection. I recall @cmungall suggesting that for the sequences we use a URL such as https://www.uniprot.org/uniprot/P05067.fasta?version=1. That would be fine, but there are also these things: http://purl.uniprot.org/isoforms/P05067-1. I asked if that PURL is intended to represent the (current) sequence, or intended to represent the class of proteins derived from that isoform. I did not get an answer.
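To make the candidate referents concrete, here is a minimal Python sketch that derives each identifier mentioned in this comment from one accession. The URL patterns are copied from the thread; the names `UniProtRefs`, `refs_for`, and `is_about_triple` are hypothetical, and nothing here settles which referent UniProt actually intends.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UniProtRefs:
    """Identifiers quoted in this thread for one accession (illustrative only)."""
    entry_purl: str    # the debated PURL: database entry, protein class, or sequence?
    isoform_purl: str  # isoform PURL pattern quoted in the thread
    fasta_url: str     # versioned URL suggested as a way to refer to the sequence
    pro_class: str     # PRO class the entry is said to be "about"


def refs_for(accession: str, isoform: int = 1, version: int = 1) -> UniProtRefs:
    # All four patterns are taken verbatim from the comments in this thread.
    return UniProtRefs(
        entry_purl=f"http://purl.uniprot.org/uniprot/{accession}",
        isoform_purl=f"http://purl.uniprot.org/isoforms/{accession}-{isoform}",
        fasta_url=f"https://www.uniprot.org/uniprot/{accession}.fasta?version={version}",
        pro_class=f"http://purl.obofoundry.org/obo/PR_{accession}",
    )


def is_about_triple(accession: str) -> Tuple[str, str, str]:
    # PRO's current reading: the database entry is_about the protein class.
    refs = refs_for(accession)
    return (refs.entry_purl, "is_about", refs.pro_class)
```

Keeping the four strings in separate fields mirrors the concern above: each could denote a different kind of thing, so none is silently substituted for another.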

cmungall · 2019-10-02T23:02:47Z

[broken record]
I think the whole referring to database entries is a red herring. http://purl.obolibrary.org/obo/GO_0097194 refers to a database entry, for a term in GO. It has databasey properties like identifiers, and xrefs, and information about which curator created it. But it's also a representation of a repeatable thing in nature. Ultimately we're all in the business of representing things in nature here, and at the same time doing database/ontology curation.

Our IDs can do dual duties as representing database entities and things in nature. There is no need to get meta and introduce an extra layer of indirection. Or at least I am not aware of such a use case, where someone really needs to track both these things and keep them distinct.
[/broken record]

I think the sequence vs. protein molecule aspect is a bit more nuanced.

nataled · 2019-10-02T23:54:50Z

I believe you missed my point. It isn't that I am introducing a layer. The question is "What kind of entity does UniProt consider its entries to be?" And one possible answer is..."Database entries."

nataled · 2019-10-03T11:38:01Z

@cmungall asked "What are the semantics of a non-GCRP trembl ID according to PRO?"

TrEMBL entries fall into the following types:

A) If there already exists a Swiss-Prot entry describing the products of some gene G (SP_of_G), then the TrEMBL entry describing a product of the same gene (Tr_of_G) can be:

A sequence variant (allele) of G. These would be Tr_of_G is_a SP_of_G
An isoform of G. These would be Tr_of_G is_a SP_of_G

B) If no Swiss-Prot entry describes the products of the TrEMBL gene, then the TrEMBL entry describing a product of that gene (Tr_of_G) can be:

The 'proto-canonical' sequence (either because there is no other entry describing a product of that gene, or because it has the longest sequence among all TrEMBL entries with that gene). We'll call these TrC_of_G. In this case TrC_of_G is_a protein (or whatever level is appropriate). I describe this only for completeness; these are (or should be) part of the GCRP set.
A sequence variant (allele) of that gene (TrV_of_G). Then, TrV_of_G is_a TrC_of_G.
An isoform of G (TrI_of_G). Then, TrI_of_G is_a TrC_of_G.

C) If no gene is indicated in the TrEMBL entry (call it TrX), then...

TrX is_a protein (if no species non-specific parent can be found).
TrX is_a =species non-specific parent=

Technically speaking, TrEMBL entries (like some Swiss-Prot) can also describe fragments.
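The cases A, B, and C above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the function name, its parameters, and the returned placeholder strings (`SP_of_G`, `TrC_of_G`, `protein`) are assumptions that reuse the labels from this comment, not PRO's actual implementation. Fragments are not handled.

```python
from typing import Optional


def trembl_parent(
    has_gene: bool,
    swissprot_exists: bool = False,
    is_proto_canonical: bool = False,
    species_nonspecific_parent: Optional[str] = None,
) -> str:
    """Return the is_a parent of a TrEMBL entry under cases A, B, and C."""
    if not has_gene:
        # Case C: no gene indicated. Use the species non-specific parent
        # when one can be found, otherwise fall back to 'protein'.
        return species_nonspecific_parent or "protein"
    if swissprot_exists:
        # Case A: sequence variants and isoforms both go under the
        # Swiss-Prot entry for the same gene.
        return "SP_of_G"
    if is_proto_canonical:
        # Case B, proto-canonical entry: parent is 'protein'
        # (or whatever level is appropriate).
        return "protein"
    # Case B, variant or isoform: parent is the proto-canonical entry.
    return "TrC_of_G"
```

For example, `trembl_parent(has_gene=True, swissprot_exists=True)` returns `"SP_of_G"` for both a variant and an isoform, since case A assigns them the same parent.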

cmungall · 2019-10-04T21:55:15Z

I'm going to post a strawman proposal:

PRO gene-level protein classes and UniProt canonical/GCRP entries are to be considered equivalent in the strict OWL sense. (ergo the URIs could be collapsed with no loss of logical entailment and no introduction of inconsistency. This would be a win as the community would not have to make an arbitrary selection between two distinct PURLs/CURIEs)

Ontologically these are protein classes, which are material entity classes (as is currently the case in PRO)

(The uniprot docs talk about these as sequences, which is perfectly valid as the main use case for these involves treating them as sequences, but in the ontological treatment, the sequence would be a property of the material entity)

They are the superclasses of isoform classes (as they are now, in PRO)

The isoform level classes in PRO would be equivalent to the uniprot isoform entries (e.g. P12345-1)

There could be some kind of has-canonical-form relationship between the main class and isoform-1 (see http://purl.obolibrary.org/obo/RO_0002214)

Note that at the database level, the canonical entry will have annotations for things such as protein domains, functions, etc. At the ontological level this will not be taken to mean that all instances of that protein have those properties. Otherwise we end up with logical inconsistencies. Instead it will be a some-some.
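One reading of "some-some" is that an annotation on the canonical entry claims that some instance of the protein class has the property, not that every instance does. A minimal Python sketch of the difference, using made-up instances and feature names that are purely hypothetical:

```python
# Hypothetical instances of a single protein class, each with its own features.
instances = [
    {"id": "molecule_1", "features": {"signal peptide", "kinase domain"}},
    {"id": "molecule_2", "features": {"kinase domain"}},  # e.g. a processed form
]


def all_have(feature, members):
    """Universal reading: every instance carries the feature."""
    return all(feature in m["features"] for m in members)


def some_have(feature, members):
    """Existential ('some-some') reading: at least one instance carries it."""
    return any(feature in m["features"] for m in members)
```

With this data, `all_have("signal peptide", instances)` is false while `some_have("signal peptide", instances)` is true, which is why the universal reading would contradict the data and the existential reading would not.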

Note that neither resource needs to make any changes to implement this. It would be a semantic MOU about ontological commitment of PURLs. And both would agree not to publish logical axioms that introduce logical inconsistencies.

However, if both parties agree, then there is a strong case for PRO switching from PRO purls for gene-level to instead use uniprot PURLs.
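As a rough illustration of what "collapsing" equivalent URIs would mean in practice, the Python sketch below rewrites a set of triples so that each PRO URI is replaced by its UniProt equivalent. The `collapse` helper, the `subClassOf` label, and the pairing of these particular URIs are assumptions made for the example; the URI patterns are the ones quoted earlier in this thread.

```python
PRO = "http://purl.obofoundry.org/obo/PR_P05067"
UNIPROT = "http://purl.uniprot.org/uniprot/P05067"
ISOFORM = "http://purl.uniprot.org/isoforms/P05067-1"


def collapse(triples, equivalent):
    """Rewrite every term to the preferred URI given in `equivalent`."""
    return {
        tuple(equivalent.get(term, term) for term in triple)
        for triple in triples
    }


# The same statement made once against each of the two URIs.
triples = {
    (ISOFORM, "subClassOf", PRO),
    (ISOFORM, "subClassOf", UNIPROT),
}

# Treat the PRO class and the UniProt entry as strictly equivalent.
collapsed = collapse(triples, {PRO: UNIPROT})
```

After collapsing, the two statements become a single triple against the UniProt PURL, so users no longer have to choose between two identifiers for the same class.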

cmungall · 2019-10-21T17:01:47Z

I don't know if this will be discussed at the PRO meeting this week. I may not have time after today for any questions, but @goodb, @balhoff, @ukemi, and @deustp01 may be able to help.

nataled · 2019-10-21T17:09:47Z

Unfortunately, the PRO meeting is heavily focused on preparing for work proposed as part of an upcoming grant, and will be rather high level. It is possible (and likely) that this can be discussed with a few people outside the meeting, but there just isn't time to do so during the meeting itself (plus, we won't have the required stakeholders present). Given my schedule, I myself will not be able to address your proposal for another few weeks.

nataled added the "Policy" label (Discussions about PRO policies) Oct 2, 2019

nataled self-assigned this Oct 2, 2019

nataled mentioned this issue Oct 2, 2019

Please use http://purl.uniprot.org/uniprot or http://purl.uniprot.org/isoform/ IRIs for UniProt concepts geneontology/neo#34

Open
