Add support for journal papers #15

Closed
wkiri opened this issue Mar 31, 2021 · 13 comments

@wkiri
Contributor

wkiri commented Mar 31, 2021

Currently, the JournalParser class supports LPSC content. We would like to broaden it to accommodate other publication sources such as the Journal of Geophysical Research, Icarus, etc., each of which will have some journal-specific parsing needed.

stevenlujpl added a commit that referenced this issue Apr 23, 2021
stevenlujpl added a commit that referenced this issue Apr 23, 2021
stevenlujpl added a commit that referenced this issue Apr 23, 2021
stevenlujpl added a commit that referenced this issue Apr 29, 2021
stevenlujpl added a commit that referenced this issue Apr 29, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 20, 2021
stevenlujpl added a commit that referenced this issue May 26, 2021
stevenlujpl added a commit that referenced this issue May 26, 2021
stevenlujpl added a commit that referenced this issue May 26, 2021
stevenlujpl added a commit that referenced this issue Jun 11, 2021
stevenlujpl added a commit that referenced this issue Jun 11, 2021
fix ADS parser when docs not found in ADS database #15
@stevenlujpl
Contributor

The HTTP 400 errors should be resolved by the commit above. The fix was to escape special characters in paper titles before using them in Solr queries.
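
For readers unfamiliar with this kind of escaping, here is a minimal sketch of the idea. It is not the code from the commit: the function name `escape_solr_term` is hypothetical, and the character set is the usual Lucene/Solr list of special characters, which is assumed (not verified in this thread) to be what the ADS parser expects.

```python
import re

# Characters that the Lucene/Solr query syntax treats as special.
# Assumption: ADS accepts the standard backslash escape for each of them.
_SPECIAL_CHARS = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_solr_term(text: str) -> str:
    """Put a backslash in front of every Lucene/Solr special character."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)


if __name__ == "__main__":
    print(escape_solr_term("Mars (MER): a+b"))
```

Escaping only handles punctuation. It does not protect words that the query parser reserves as operators, so it may not be enough on its own.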

@stevenlujpl
Contributor

stevenlujpl commented Jun 23, 2021

The MPF doc (/proj/mte/data/corpus-lpsc/mpf-pdf/2014_1566.pdf) entitled "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS" is still giving HTTP error code 400.

When I queried the ADS database using the full title, I got the following HTTP 400 error.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20NEAR%20AND%20AT%20CAPE%20YORK%2C%20WESTERN%20RIM%20OF%20ENDEAVOUR%20CRATER%2C%20MARS&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":400,
    "QTime":0,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d291f1-5521bd5b1a7e97df1bb7f55e",
      "rows":"10",
      "wt":"json"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","java.lang.Exception"],
    "msg":"org.apache.solr.search.SyntaxError: INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS: The parser reported a syntax error, antlrqueryparser hates errors! ",
    "code":400}}

When I queried the ADS database with the partial title "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY", it worked.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":0,
    "QTime":71,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY ",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d29236-0eb121686e834cfd1f33c7e8",
      "rows":"10",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "pubdate":"2014-03-00",
        "first_author":"Fairen, A. G.",
        "year":"2014",
        "author":["Fairen, A. G.",
          "Squyres, S. W.",
          "Grotzinger, J. P.",
          "Calvin, W. M.",
          "Ruff, S. W."],
        "aff":["-",
          "-",
          "-",
          "-",
          "-"],
        "pub":"Lunar and Planetary Science Conference",
        "title":["Hollowed Spherules Identified with the MER Opportunity Near and at Cape York, Western Rim of Endeavour Crater, Mars"]}]
  }}

When I queried the database with a longer partial title, "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR" (the previous partial title with the word NEAR appended), I got the same HTTP 400 error again.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20NEAR&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":400,
    "QTime":0,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d292a5-78cf1d6c0f0a4bae3f21198d",
      "rows":"10",
      "wt":"json"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","java.lang.Exception"],
    "msg":"org.apache.solr.search.SyntaxError: INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR: The parser reported a syntax error, antlrqueryparser hates errors! ",
    "code":400}}

When I searched on the ADS website using the same full and partial titles, I saw the same behavior. This is very confusing because the title doesn't contain any special characters at all. I will ignore this document for now.
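
One thing that could be tried for titles like this is sending the title as a quoted phrase, so that no word in it can be read as query syntax. The sketch below is untested against ADS and is not what the parser in this repository does. One possible explanation for the failure, not confirmed anywhere in this thread, is that the uppercase word NEAR is read by the ADS query parser as a proximity operator rather than as a title word. The helper names are hypothetical; the endpoint and field list are copied from the curl commands above.

```python
from urllib.parse import urlencode

# Endpoint and field list as used in the curl commands in this comment.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"
FIELDS = "first_author,author,aff,pubdate,title,year,pub"


def title_phrase_query(title: str) -> str:
    """Wrap a title in double quotes so it is searched as a phrase."""
    # Escape backslashes first, then embedded double quotes.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return f'title:"{escaped}"'


def build_search_url(query: str, fields: str = FIELDS) -> str:
    """Return the full request URL with q and fl percent-encoded."""
    return f"{ADS_SEARCH_URL}?{urlencode({'q': query, 'fl': fields})}"


if __name__ == "__main__":
    q = title_phrase_query("HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR")
    print(build_search_url(q))
```

The request still needs the `Authorization: Bearer` header shown in the curl commands; this sketch only builds the URL.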

@stevenlujpl
Contributor

The new LPSC parser still cannot parse the following 8 MPF docs due to HTTP 400 (bad request) and 500 (internal server error) responses:

/proj/mte/data/corpus-lpsc/mpf-pdf/1998_1338.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/1998_1817.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/1999_1166.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2000_1846.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2001_1379.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2002_1242.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2011_2407.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2014_1566.pdf

The full error log file is at /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/errors.log.

@stevenlujpl
Contributor

All of the errors related to HTTP 400 and 500 (mentioned above) should have been fixed.

@stevenlujpl
Contributor

@wkiri The ADS search with year, pub, and page that we tried during the meeting is now working. I had to put double quotes around Lunar and Planetary Science Conference. I don't understand why the quotes are needed, because the word "and" in the venue name is lowercase and should not be treated as a Lucene reserved keyword.

I tested querying the ADS database by year, pub, and page for the following 5 docs, which previously returned multiple results when queried by title.

2000_1898.pdf
2000_1853.pdf
2000_1938.pdf
2003_1325.pdf
2004_2177.pdf

The 5 docs above worked fine with year, pub, and page search. I am going to implement this approach in the ADS parser script.
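
A minimal sketch of how such a query string could be built from an LPSC file name. The exact query used in the ADS parser is not shown in this thread, so the field syntax (`year:`, `pub:`, `page:`) and the function name are assumptions. It also assumes the file names follow a `<year>_<abstract number>.pdf` pattern, as the names listed above suggest, and that the abstract number is what ADS stores as the page.

```python
import re
from pathlib import PurePosixPath

LPSC_PUB = "Lunar and Planetary Science Conference"

# Assumed file name pattern: four-digit year, underscore, abstract number.
_LPSC_NAME = re.compile(r"^(\d{4})_(\d+)$")


def lpsc_query_from_path(pdf_path: str) -> str:
    """Build a year + pub + page query string from an LPSC PDF path."""
    stem = PurePosixPath(pdf_path).stem
    match = _LPSC_NAME.match(stem)
    if match is None:
        raise ValueError(f"unexpected LPSC file name: {pdf_path!r}")
    year, abstract_no = match.groups()
    # The venue name is double-quoted, as described in the comment above.
    return f'year:{year} pub:"{LPSC_PUB}" page:{abstract_no}'


if __name__ == "__main__":
    print(lpsc_query_from_path("2000_1898.pdf"))
```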

@wkiri
Contributor Author

wkiri commented Jul 7, 2021

@stevenlujpl This is great news! Thanks for figuring it out. I agree that the handling of keywords in this API seems very strange.

When you say it worked for those 5 docs, do you mean each query only returns a single match now?

@stevenlujpl
Contributor

Yes, each query returned only 1 matching document and the returned fields such as title and author of the matching document are correct.

@stevenlujpl
Contributor

stevenlujpl commented Jul 7, 2021

@wkiri I want to apologize for using the meeting time to implement the year + abstract number + venue ADS search. I didn't want to wait until next week to work on this, but I ran out of MTE time this week.

As I mentioned in the meeting, I tested this search strategy on all 591 MPF LPSC docs. Now only 2 docs (2006_2424.pdf and 2003_1088.pdf) are not found in the ADS database. I confirmed this by querying the ADS database with curl, and these two docs indeed do not appear to be indexed.

The new code structure that I just checked in will also support DOI search for journal papers, provided we can extract the DOI; we just need to add a function that constructs the DOI query string.
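
A rough sketch of what such a function might look like. It has not been written or checked in as part of this issue. The function name, the `doi:` field syntax, and the DOI shape check are assumptions, and the DOI in the usage line is a made-up placeholder.

```python
import re

# Loose shape check for a DOI: "10.", a registrant code, "/", a suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that often appear in front of a DOI in extracted text.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_query(raw_doi: str) -> str:
    """Normalize an extracted DOI and return a quoted doi: query string."""
    doi = raw_doi.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not _DOI_SHAPE.match(doi):
        raise ValueError(f"does not look like a DOI: {raw_doi!r}")
    # Quote the value so characters such as "/" are not parsed as syntax.
    escaped = doi.replace("\\", "\\\\").replace('"', '\\"')
    return f'doi:"{escaped}"'


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(doi_query("https://doi.org/10.1000/example"))
```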

I won't have time this week to carefully verify the content of the jsonl file, but the following command confirms that there are indeed 589 ads:primary_author fields in the jsonl file.

cat lpsc_mpf_lpsc_parser6.jsonl | grep 'ads:primary_author' | wc -l
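
Note that the grep above counts lines that contain the string anywhere, including inside text values. A slightly stricter check is sketched below: it parses each line as JSON and counts only records that have the key. Where `ads:primary_author` sits inside each record is not stated in this thread, so the sketch searches nested dictionaries and lists; the function names are hypothetical.

```python
import json


def contains_key(obj, key: str) -> bool:
    """Return True if key appears as a dictionary key anywhere in obj."""
    if isinstance(obj, dict):
        return key in obj or any(contains_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, key) for v in obj)
    return False


def count_records_with_key(lines, key: str = "ads:primary_author") -> int:
    """Count JSON Lines records that contain key; blank lines are skipped."""
    count = 0
    for line in lines:
        line = line.strip()
        if line and contains_key(json.loads(line), key):
            count += 1
    return count
```

Usage would be along the lines of `with open("lpsc_mpf_lpsc_parser6.jsonl") as f: print(count_records_with_key(f))`.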

The jsonl file and the corresponding processing log can be found at the following locations on analysis if you are interested in taking a look.

/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_lpsc_parser6.log
/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_lpsc_parser6.jsonl

@wkiri
Contributor Author

wkiri commented Jul 7, 2021

@stevenlujpl No need to apologize - I am glad you got this working! Thank you for including the details here.

stevenlujpl added a commit that referenced this issue Jul 15, 2021
stevenlujpl added a commit that referenced this issue Jul 15, 2021
@wkiri
Contributor Author

wkiri commented Jul 15, 2021

Want to test on more journal .pdf files (from our existing downloads) as well. Then can close this issue.

stevenlujpl added a commit that referenced this issue Aug 10, 2021
@stevenlujpl
Contributor

stevenlujpl commented Aug 10, 2021

I tested the paper_parser.py script on all of the journal papers (n=37) we have collected so far. All 37 journal papers were successfully processed (after I fixed a bug), and we found the contains relations for 14 papers.

However, 12 papers, whose titles were extracted by grobid, were not found in the ADS database. The titles as extracted by grobid are listed below:

1. High concentrations of manganese and sulfur in deposits on Murray Ridge, Endeavour Crater, Marsk
2. Stability of perchlorate hydrates and their liquid solutions at the Phoenix landing site, Mars
3. Click Here for Soluble sulfate in the martian soil at the Phoenix landing site
4. Identification of the perchlorate parent salts at the Phoenix Mars landing site and possible implications
5. Atmospheric origins of perchlorate on Mars and in the Atacama
6. Microscopy analysis of soils at the Phoenix landing site, Mars: Classification of soil particles and description of their optical and magnetic properties
7. Wet Chemistry experiments on the 2007 Phoenix Mars Scout Lander mission: Data analysis and results
8. Click Here for Habitability of the Phoenix landing site
9. Click Here for Initial results from the thermal and electrical conductivity probe (TECP) on Phoenix
10. Mars Exploration Rover Tenth Extended Mission Appendices K-1 Section 2 References
11. tice sites of calcium carbonate and affect Mars' soil geochemistry, and calcium carbonate can cement small soil grains and change the phys-ical properties of the surface of Mars
12. Supporting Online Material REPORTS H 2 O at the Phoenix Landing Site

I have created an issue (#38) to capture ideas for improving the ADS parser.
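
As an illustration of the kind of cleanup that could help with some of the titles above, here is a small sketch that strips the "Click Here for" prefix seen in titles 3, 8, and 9 before querying ADS. This is only one possible heuristic, not something proposed in #38 or implemented in the parser, and the function name is hypothetical. It would not help with titles 10 to 12, where grobid picked up the wrong text entirely.

```python
import re

# Prefix observed in some grobid-extracted titles in the list above.
_CLICK_HERE = re.compile(r"^\s*click here for\s+", re.IGNORECASE)


def clean_extracted_title(title: str) -> str:
    """Drop a leading 'Click Here for' and collapse runs of whitespace."""
    without_prefix = _CLICK_HERE.sub("", title)
    return re.sub(r"\s+", " ", without_prefix).strip()


if __name__ == "__main__":
    print(clean_extracted_title("Click Here for Habitability of the Phoenix landing site"))
```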

The jsonl and log files can be found at the following locations:

/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.jsonl
/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.log
