qc-duplicate-exact-synonym-no-abbrev related updates #767

Open: wants to merge 5 commits into base: develop
38 changes: 20 additions & 18 deletions src/ontology/mondo-ingest.Makefile
@@ -585,15 +585,17 @@ $(SYN_SYNC_DIR):
.PHONY: sync-synonyms
sync-synonyms: $(SYN_SYNC_DIR)/review-qc-duplicate-exact-synonym-no-abbrev.tsv

# side effects: Mutates .robot.tsv files, filtering out certain cases, which will instead get populated into the review-*.tsv.
$(SYN_SYNC_DIR)/review-qc-duplicate-exact-synonym-no-abbrev.tsv: $(SYN_SYNC_DIR)/synonym_sync_combined_cases.robot.tsv $(SYN_SYNC_DIR)/sync-synonyms.added.robot.tsv $(SYN_SYNC_DIR)/sync-synonyms.confirmed.robot.tsv $(SYN_SYNC_DIR)/sync-synonyms.updated.robot.tsv tmp/mondo-synonyms-scope-type-xref.tsv $(TMPDIR)/mondo.db
$(SYN_SYNC_DIR)/review-qc-duplicate-exact-synonym-no-abbrev.tsv $(SYN_SYNC_DIR)/sync-synonyms.added.robot.tsv $(SYN_SYNC_DIR)/sync-synonyms.confirmed.robot.tsv $(SYN_SYNC_DIR)/sync-synonyms.updated.robot.tsv: $(TMPDIR)/sync-synonyms.added.robot.tsv $(TMPDIR)/sync-synonyms.confirmed.robot.tsv $(TMPDIR)/sync-synonyms.updated.robot.tsv tmp/mondo-synonyms-scope-type-xref.tsv $(TMPDIR)/mondo.db
python3 $(SCRIPTSDIR)/sync_synonym_curation_filtering.py \
--added-path reports/sync-synonym/sync-synonyms.added.robot.tsv \
--confirmed-path reports/sync-synonym/sync-synonyms.confirmed.robot.tsv \
--updated-path reports/sync-synonym/sync-synonyms.updated.robot.tsv \
--mondo-synonyms-path tmp/mondo-synonyms-scope-type-xref.tsv \
--mondo-db-path $(TMPDIR)/mondo.db \
--outpath reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv
--added-inpath $(TMPDIR)/sync-synonyms.added.robot.tsv \
--confirmed-inpath $(TMPDIR)/sync-synonyms.confirmed.robot.tsv \
--updated-inpath $(TMPDIR)/sync-synonyms.updated.robot.tsv \
--added-outpath $(SYN_SYNC_DIR)/sync-synonyms.added.robot.tsv \
--confirmed-outpath $(SYN_SYNC_DIR)/sync-synonyms.confirmed.robot.tsv \
--updated-outpath $(SYN_SYNC_DIR)/sync-synonyms.updated.robot.tsv \
--mondo-synonyms-inpath $(TMPDIR)/mondo-synonyms-scope-type-xref.tsv \
--mondo-db-inpath $(TMPDIR)/mondo.db \
--review-outpath reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv

tmp/mondo-synonyms-scope-type-xref.tsv: $(TMPDIR)/mondo.owl
$(ROBOT) query -i tmp/mondo.owl --query ../sparql/synonyms-scope-type-xref.sparql $@
@@ -610,26 +612,26 @@ $(SYN_SYNC_DIR)/synonym_sync_combined_cases.robot.tsv: $(foreach n,$(ALL_COMPONE
tail -n +3 $$file >> $@; \
done

$(SYN_SYNC_DIR)/sync-synonyms.added.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(SYN_SYNC_DIR)/$(n)-synonyms.added.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(SYN_SYNC_DIR)/*.synonyms.added.robot.tsv > $@
$(TMPDIR)/sync-synonyms.added.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(TMPDIR)/$(n)-synonyms.added.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(TMPDIR)/*.synonyms.added.robot.tsv > $@

$(SYN_SYNC_DIR)/sync-synonyms.confirmed.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(SYN_SYNC_DIR)/$(n)-synonyms.confirmed.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(SYN_SYNC_DIR)/*.synonyms.confirmed.robot.tsv > $@
$(TMPDIR)/sync-synonyms.confirmed.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(TMPDIR)/$(n)-synonyms.confirmed.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(TMPDIR)/*.synonyms.confirmed.robot.tsv > $@

$(SYN_SYNC_DIR)/sync-synonyms.updated.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(SYN_SYNC_DIR)/$(n)-synonyms.updated.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(SYN_SYNC_DIR)/*.synonyms.updated.robot.tsv > $@
$(TMPDIR)/sync-synonyms.updated.robot.tsv: $(foreach n,$(ALL_COMPONENT_IDS), $(TMPDIR)/$(n)-synonyms.updated.robot.tsv)
awk '(NR == 1) || (NR == 2) || (FNR > 2)' $(TMPDIR)/*.synonyms.updated.robot.tsv > $@

$(SYN_SYNC_DIR)/%-synonyms.added.robot.tsv $(SYN_SYNC_DIR)/%-synonyms.confirmed.robot.tsv $(SYN_SYNC_DIR)/%-synonyms.updated.robot.tsv $(TMPDIR)/synonym_sync_combined_cases_%.tsv: $(TMPDIR)/mondo.sssom.tsv $(COMPONENTSDIR)/%.db metadata/%.yml tmp/mondo-synonyms-scope-type-xref.tsv tmp/%-synonyms-scope-type-xref.tsv | $(SYN_SYNC_DIR)
$(TMPDIR)/%-synonyms.added.robot.tsv $(TMPDIR)/%-synonyms.confirmed.robot.tsv $(TMPDIR)/%-synonyms.updated.robot.tsv $(TMPDIR)/synonym_sync_combined_cases_%.tsv: $(TMPDIR)/mondo.sssom.tsv $(COMPONENTSDIR)/%.db metadata/%.yml tmp/mondo-synonyms-scope-type-xref.tsv tmp/%-synonyms-scope-type-xref.tsv | $(TMPDIR)
python3 $(SCRIPTSDIR)/sync_synonym.py \
--mondo-mappings-path $(TMPDIR)/mondo.sssom.tsv \
--ontology-db-path $(COMPONENTSDIR)/$*.db \
--mondo-synonyms-path tmp/mondo-synonyms-scope-type-xref.tsv \
--mondo-exclusion-configs config/mondo-exclusion-configs.yml \
--onto-synonym-types-path tmp/$*-synonyms-scope-type-xref.tsv \
--onto-config-path metadata/$*.yml \
--outpath-added $(SYN_SYNC_DIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(SYN_SYNC_DIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(SYN_SYNC_DIR)/$*.synonyms.updated.robot.tsv \
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(TMPDIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(TMPDIR)/$*.synonyms.updated.robot.tsv \
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv

##################################
49 changes: 49 additions & 0 deletions src/ontology/reports/README.md
@@ -65,3 +65,52 @@ Summary statistics for excluded terms that still have cross-references in Mondo.
- `pct_in1_notIn2_in3__over_in1` (`float`): Percentage of terms that still have cross-references in Mondo.

Created by running `cd src/ontology; sh run.sh make reports/<ONTOLOGY_NAME>_excluded_terms_in_mondo_xrefs.tsv`.

### 7. `reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv`
**What this file represents**
This file lists the cases that were filtered out of the synonym sync because they would cause conflicts of the kind checked for by `qc-duplicate-exact-synonym-no-abbrev.sparql`.

**Columns**
- `synonym`: The synonym text involved in the conflict.
- `mondo_id`: The Mondo term affected by an `added` or `updated` synonym change. If the `case` for the row is
`confirmed` or `unconfirmed`, the synonym already exists on that Mondo term.
- `source_id`: The ID of the source term that the synonym comes from. If a synonym appears in multiple sources, there
is one row per source.
- `case`: One of `added`, `updated`, `confirmed`, or `unconfirmed`. `added` means a new synonym and `updated` means a
changed synonym scope; both are changes coming in through the synonym sync. `confirmed` and `unconfirmed` mean the
synonym already exists in Mondo and the sync brings no change for it. A `confirmed` synonym is also corroborated by a
mapped source term; an `unconfirmed` synonym exists in Mondo but was not found as a synonym on any of the mapped
source terms.
- `synonym_type`: Kept mainly as a sanity check that no cases of MONDO:ABBREVIATION slipped in. Duplicate synonyms are
allowed for abbreviations.
- `filtered_because_this_mondo_id_already_has_this_synonym_as_its_label`: Empty for exactSynonym-exactSynonym
collisions (cases where an exact synonym on one Mondo term is identical to an exact synonym on another Mondo term).
The other possible case is an exactSynonym-label collision: a new or updated exactSynonym coming in through the
synonym sync already exists as the label of a different Mondo term. For those cases, there is only 1 row for the
synonym, and this column holds 1 or more Mondo IDs.
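
The `synonym_type` sanity check could be automated along these lines. This is only a sketch: the helper name `find_abbreviation_rows` is hypothetical, the sample rows are made up for illustration, and it assumes the report is tab-separated with a header row and that abbreviation rows contain the literal string `MONDO:ABBREVIATION` in `synonym_type`.

```python
import csv
import io

# Made-up rows for illustration only; the exact values that appear in
# `synonym_type` in the real report are an assumption.
SAMPLE_TSV = (
    "synonym\tmondo_id\tsynonym_type\n"
    "example synonym A\tMONDO:0000001\t\n"
    "EX\tMONDO:0000002\tMONDO:ABBREVIATION\n"
)


def find_abbreviation_rows(handle):
    """Hypothetical helper: return rows whose synonym_type mentions MONDO:ABBREVIATION."""
    return [
        row
        for row in csv.DictReader(handle, delimiter="\t")
        if "MONDO:ABBREVIATION" in (row.get("synonym_type") or "")
    ]


unexpected = find_abbreviation_rows(io.StringIO(SAMPLE_TSV))
for row in unexpected:
    print(f"unexpected abbreviation: {row['synonym']} on {row['mondo_id']}")
```

To run this against the real report, open the TSV file and pass the file handle instead of the in-memory sample.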

**Different kinds of conflicts**
_Example 1: exactSynonym-exactSynonym conflict_
Do not review a single row in isolation. Instead, review all of the rows for a given synonym together, one synonym at a time.

For example:
| synonym | mondo_id | source_id | case |
| --- | --- | --- | --- |
| 3C syndrome | MONDO:0009073 | OMIM:220210 | updated |
| 3C syndrome | MONDO:0019078 | GARD:0005666 | confirmed |
| 3C syndrome | MONDO:0019078 | Orphanet:7 | confirmed |

The conflict here comes from the first of these rows. The synonym sync wants to update the synonym scope on MONDO:0009073, as evidenced by OMIM:220210. However, changing the scope to exactSynonym would create a conflict, because that synonym already exists on MONDO:0019078, where it is evidenced by GARD:0005666 and Orphanet:7.
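
Reviewing all rows for a synonym together can be made easier by grouping the report by the `synonym` column. The following is a minimal sketch, not part of the pipeline: `group_rows_by_synonym` is a hypothetical helper, the sample rows are copied from the table above, and the file is assumed to be tab-separated with a header row.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example table; the real file is assumed to be
# tab-separated with a header row containing these column names.
SAMPLE_TSV = (
    "synonym\tmondo_id\tsource_id\tcase\n"
    "3C syndrome\tMONDO:0009073\tOMIM:220210\tupdated\n"
    "3C syndrome\tMONDO:0019078\tGARD:0005666\tconfirmed\n"
    "3C syndrome\tMONDO:0019078\tOrphanet:7\tconfirmed\n"
)


def group_rows_by_synonym(handle):
    """Hypothetical helper: map each synonym to the list of all its review rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["synonym"]].append(row)
    return dict(groups)


groups = group_rows_by_synonym(io.StringIO(SAMPLE_TSV))
for synonym, rows in groups.items():
    mondo_ids = sorted({r["mondo_id"] for r in rows})
    print(f"{synonym}: {len(rows)} rows across {mondo_ids}")
    # Rows with case 'added' or 'updated' are the incoming changes to judge.
    for r in rows:
        if r["case"] in ("added", "updated"):
            print(f"  incoming change: {r['case']} on {r['mondo_id']} from {r['source_id']}")
```

Grouping this way puts the incoming change (here the `updated` row) next to the existing `confirmed` rows it collides with.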

_Example 2: exactSynonym-label conflict_
| synonym | mondo_id | source_id | case | filtered_because_this_mondo_id_already_has_this_synonym_as_its_label |
| --- | --- | --- | --- | --- |
| A20 haploinsufficiency | MONDO:0800045 | DOID:0080944 | added | MONDO:0100222 |

In this case, the synonym sync wants to add a new exactSynonym to MONDO:0800045. However, that synonym exists as a label on a different Mondo term: MONDO:0100222.
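
The two kinds of conflict can be told apart by whether the label column is populated. A small sketch of that rule, assuming rows have already been read into dictionaries keyed by column name (`conflict_kind` is a hypothetical helper; the rows reuse values from the two examples above):

```python
LABEL_COL = "filtered_because_this_mondo_id_already_has_this_synonym_as_its_label"


def conflict_kind(row):
    """Hypothetical helper: classify a review row by whether the label column is filled."""
    value = (row.get(LABEL_COL) or "").strip()
    return "exactSynonym-label" if value else "exactSynonym-exactSynonym"


# One row from each example; an empty label column marks a synonym-synonym collision.
rows = [
    {"synonym": "3C syndrome", "mondo_id": "MONDO:0009073",
     "source_id": "OMIM:220210", "case": "updated", LABEL_COL: ""},
    {"synonym": "A20 haploinsufficiency", "mondo_id": "MONDO:0800045",
     "source_id": "DOID:0080944", "case": "added", LABEL_COL: "MONDO:0100222"},
]

kinds = {row["synonym"]: conflict_kind(row) for row in rows}
for synonym, kind in kinds.items():
    print(f"{synonym}: {kind}")
```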

**How to review**
Review these conflict cases and recommend an action for each. You can add a note to the 'action' column.
Possible actions (this may not be a complete list):
a. Disallow the suggested change from the synonym sync.
b. Allow the suggested change from the synonym sync, but remove the conflicting exactSynonym or label within Mondo. Then, the next time the synonym sync runs, it will detect no conflict, and these synonym updates will go through.
c. Allow the suggested change from the synonym sync AND allow a conflict?