Add support for journal papers #15

wkiri · 2021-03-31T18:47:04Z

Currently, the JournalParser class supports LPSC content. We would like to broaden it to accommodate other publication sources such as the Journal of Geophysical Research, Icarus, etc., each of which will have some journal-specific parsing needed.

stevenlujpl · 2021-06-23T00:42:58Z

The HTTP 400 errors should be resolved by the commit above. The fix is to escape special characters in paper titles before building Solr queries.
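For illustration, here is a minimal sketch of that kind of escaping. It is not the code from the commit; the function name `escape_solr` and the single-regex approach are assumptions made for this example. It backslash-escapes the characters that Lucene/Solr query syntax treats as special.

```python
import re

# Characters with special meaning in Lucene/Solr query syntax.
# "&" and "|" are escaped individually, which also covers "&&" and "||".
_SOLR_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr(text):
    """Return text with each Solr special character prefixed by a backslash.

    Hypothetical helper for illustration; not the project's implementation.
    """
    return _SOLR_SPECIAL.sub(r'\\\1', text)


if __name__ == "__main__":
    print(escape_solr("H2O (ice): a/b"))
```

Note that this only handles punctuation. It does nothing for words that the query parser may treat as operators.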

stevenlujpl · 2021-06-23T01:46:29Z

The MPF doc (/proj/mte/data/corpus-lpsc/mpf-pdf/2014_1566.pdf) entitled "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS" is still giving HTTP error code 400.

When I queried the ADS database using the full title, I got the following HTTP 400 error.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20NEAR%20AND%20AT%20CAPE%20YORK%2C%20WESTERN%20RIM%20OF%20ENDEAVOUR%20CRATER%2C%20MARS&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":400,
    "QTime":0,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d291f1-5521bd5b1a7e97df1bb7f55e",
      "rows":"10",
      "wt":"json"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","java.lang.Exception"],
    "msg":"org.apache.solr.search.SyntaxError: INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS: The parser reported a syntax error, antlrqueryparser hates errors! ",
    "code":400}}

When I queried the ADS database with the partial title "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY", it worked.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":0,
    "QTime":71,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY ",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d29236-0eb121686e834cfd1f33c7e8",
      "rows":"10",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "pubdate":"2014-03-00",
        "first_author":"Fairen, A. G.",
        "year":"2014",
        "author":["Fairen, A. G.",
          "Squyres, S. W.",
          "Grotzinger, J. P.",
          "Calvin, W. M.",
          "Ruff, S. W."],
        "aff":["-",
          "-",
          "-",
          "-",
          "-"],
        "pub":"Lunar and Planetary Science Conference",
        "title":["Hollowed Spherules Identified with the MER Opportunity Near and at Cape York, Western Rim of Endeavour Crater, Mars"]}]
  }}

When I queried the database with a longer partial title, "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR" (the previous partial title with the word NEAR appended), I got the same HTTP 400 error again.

[youlu@mlia-compute1 verification_test]$ curl -H 'Authorization: Bearer jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U' 'https://api.adsabs.harvard.edu/v1/search/query?q=title:HOLLOWED%20SPHERULES%20IDENTIFIED%20WITH%20THE%20MER%20OPPORTUNITY%20NEAR&fl=first_author,author,aff,pubdate,title,year,pub'
{
  "responseHeader":{
    "status":400,
    "QTime":0,
    "params":{
      "q":"title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR",
      "fl":"first_author,author,aff,pubdate,title,year,pub",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-60d292a5-78cf1d6c0f0a4bae3f21198d",
      "rows":"10",
      "wt":"json"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","java.lang.Exception"],
    "msg":"org.apache.solr.search.SyntaxError: INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse title:HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR: The parser reported a syntax error, antlrqueryparser hates errors! ",
    "code":400}}

When I searched on the ADS website using the same full and partial titles, I saw the same behavior. This is very confusing because the title doesn't contain any special characters at all. I will ignore this document for now.
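One untested hypothesis is that the uppercase word NEAR is being read by the query parser as an operator rather than as a title word. If so, sending the title as a quoted phrase might avoid the syntax error. The sketch below only builds the request URL and does not contact ADS; `build_title_query_url` and its default field list are hypothetical names chosen for this example, and whether quoting actually fixes this document has not been verified here.

```python
from urllib.parse import urlencode

ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_title_query_url(title, fields=("first_author", "author", "title", "year", "pub")):
    """Build an ADS search URL that queries the title as a quoted phrase.

    Hypothetical helper: backslashes and double quotes inside the title are
    escaped so the phrase stays well formed.
    """
    phrase = title.replace("\\", "\\\\").replace('"', '\\"')
    query = 'title:"{}"'.format(phrase)
    # urlencode percent-encodes the quotes and colon and turns spaces into "+".
    return ADS_SEARCH_URL + "?" + urlencode({"q": query, "fl": ",".join(fields)})


if __name__ == "__main__":
    print(build_title_query_url("HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR"))
```

The resulting URL can be passed to curl with the same Authorization header used in the commands above.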

stevenlujpl · 2021-06-23T02:29:08Z

The new LPSC parser still cannot parse the following 8 MPF docs due to HTTP code 400 (bad request) and 500 (internal server error):

/proj/mte/data/corpus-lpsc/mpf-pdf/1998_1338.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/1998_1817.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/1999_1166.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2000_1846.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2001_1379.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2002_1242.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2011_2407.pdf
/proj/mte/data/corpus-lpsc/mpf-pdf/2014_1566.pdf

The full error log file is at /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/errors.log.

stevenlujpl · 2021-07-07T01:34:35Z

All of the HTTP 400 and 500 errors mentioned above should now be fixed.

stevenlujpl · 2021-07-07T18:34:40Z

@wkiri The ADS search with year, pub, and page that we tried during the meeting is now working. I had to put double quotes around Lunar and Planetary Science Conference. I don't understand why the quotes are needed to make the query work, because the word "and" in Lunar and Planetary Science Conference is lowercase and should not be treated as a Lucene reserved keyword.

I tested querying the ADS database using year, pub, and page for the following 5 docs, which previously returned multiple responses when queried by title.

2000_1898.pdf
2000_1853.pdf
2000_1938.pdf
2003_1325.pdf
2004_2177.pdf

The 5 docs above worked fine with year, pub, and page search. I am going to implement this approach in the ADS parser script.
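As a rough sketch of how such a query string could be built, assuming (as the file names above suggest) that each LPSC PDF is named `<year>_<abstract number>.pdf` and that the abstract number goes in the `page` field. The function name `build_lpsc_query` is hypothetical and this is not the code that was checked in.

```python
import os
import re

LPSC_PUB = "Lunar and Planetary Science Conference"


def build_lpsc_query(pdf_path):
    """Build a year + pub + page ADS query string from an LPSC PDF file name.

    Assumes file names of the form YYYY_NNNN.pdf (an assumption for this sketch).
    """
    name = os.path.basename(pdf_path)
    match = re.fullmatch(r"(\d{4})_(\d+)\.pdf", name)
    if match is None:
        raise ValueError("Unexpected LPSC file name: {}".format(name))
    year, abstract_number = match.groups()
    # The venue is double-quoted, as described in the comment above.
    return 'year:{} pub:"{}" page:{}'.format(year, LPSC_PUB, abstract_number)


if __name__ == "__main__":
    print(build_lpsc_query("2000_1898.pdf"))
```

The returned string would be used as the `q` parameter of the ADS search request, URL-encoded in the usual way.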

wkiri · 2021-07-07T18:55:33Z

@stevenlujpl This is great news! Thanks for figuring it out. I agree that the handling of keywords in this API seems very strange.

When you say it worked for those 5 docs, do you mean each query only returns a single match now?

stevenlujpl · 2021-07-07T19:58:41Z

Yes, each query returned only 1 matching document and the returned fields such as title and author of the matching document are correct.

stevenlujpl · 2021-07-07T21:16:19Z

@wkiri I want to apologize for using the meeting time to implement the year + abstract number + venue ADS search. I didn't want to wait until next week to work on this, but I ran out of MTE time this week.

As I mentioned in the meeting, I tested this search strategy on all 591 MPF LPSC docs. Now, only 2 docs (2006_2424.pdf and 2003_1088.pdf) are not found in the ADS database. I then confirmed by querying the ADS database with curl that these two docs do indeed seem not to be indexed there.

This new code structure that I just checked in will support the DOI search (we just need to add a function to construct the DOI query string) for journal papers if we can extract DOI.
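A possible shape for that DOI query function is sketched below. The name `build_doi_query` and the prefix handling are assumptions for illustration, the DOI in the usage line is a made-up placeholder, and how the DOI would be extracted from a paper is left open.

```python
def build_doi_query(doi):
    """Build an ADS query string that searches by DOI.

    Hypothetical helper: strips a leading resolver URL or "doi:" label,
    then quotes the DOI so characters such as "/" are taken literally.
    """
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not doi.startswith("10."):
        raise ValueError("Not a DOI: {!r}".format(doi))
    return 'doi:"{}"'.format(doi.replace('"', '\\"'))


if __name__ == "__main__":
    # Placeholder DOI, not a real paper.
    print(build_doi_query("https://doi.org/10.1234/example"))
```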

I won't have time this week to carefully verify the content of the jsonl file, but the following command confirms that there are indeed 589 ads:primary_author fields in the jsonl file.

cat lpsc_mpf_lpsc_parser6.jsonl | grep 'ads:primary_author' | wc -l

The jsonl file and the corresponding processing log can be found at the following locations on analysis if you would like to take a look.

/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_lpsc_parser6.log
/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_lpsc_parser6.jsonl

wkiri · 2021-07-07T21:24:12Z

@stevenlujpl No need to apologize - I am glad you got this working! Thank you for including the details here.

add README for parser scripts #15

wkiri · 2021-07-15T17:30:03Z

Want to test on more journal .pdf files (from our existing downloads) as well. Then can close this issue.

stevenlujpl · 2021-08-10T01:13:44Z

I tested the paper_parser.py script on all of the journal papers (n=37) we have collected so far. All 37 journal papers were successfully processed (after I fixed a bug), and we found the contains relations for 14 papers.

However, 12 papers were not found in the ADS database when searched by the titles that grobid extracted. These extracted titles are listed below:

1. High concentrations of manganese and sulfur in deposits on Murray Ridge, Endeavour Crater, Marsk
2. Stability of perchlorate hydrates and their liquid solutions at the Phoenix landing site, Mars
3. Click Here for Soluble sulfate in the martian soil at the Phoenix landing site
4. Identification of the perchlorate parent salts at the Phoenix Mars landing site and possible implications
5. Atmospheric origins of perchlorate on Mars and in the Atacama
6. Microscopy analysis of soils at the Phoenix landing site, Mars: Classification of soil particles and description of their optical and magnetic properties
7. Wet Chemistry experiments on the 2007 Phoenix Mars Scout Lander mission: Data analysis and results
8. Click Here for Habitability of the Phoenix landing site
9. Click Here for Initial results from the thermal and electrical conductivity probe (TECP) on Phoenix
10. Mars Exploration Rover Tenth Extended Mission Appendices K-1 Section 2 References
11. tice sites of calcium carbonate and affect Mars' soil geochemistry, and calcium carbonate can cement small soil grains and change the phys-ical properties of the surface of Mars
12. Supporting Online Material REPORTS H 2 O at the Phoenix Landing Site

I have created an issue (#38) to capture ideas for improving the ADS parser.

The jsonl and log files can be found at the following locations:

/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.jsonl
/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.log

wkiri added a commit that referenced this issue Apr 14, 2021

Handle papers with multiple content types. (#15)

19d3be1

wkiri assigned stevenlujpl Apr 14, 2021

stevenlujpl added a commit that referenced this issue Apr 23, 2021

(1) rename journalparser.py to lpsc_parser.py; (2) PEP-8 formatting #15

f6152cf

stevenlujpl added a commit that referenced this issue Apr 23, 2021

(1) rename parser.py to tika_parser.py; (2) PEP 8 formatting #15

54788d0

stevenlujpl added a commit that referenced this issue Apr 23, 2021

re-organize existing parsers (tika, lpsc, corenlp, jsre) #15

a1c4f62

stevenlujpl added a commit that referenced this issue Apr 23, 2021

update docstring #15

bea046b

stevenlujpl added a commit that referenced this issue Apr 23, 2021

add general journal parser #15

a698dd4

stevenlujpl added a commit that referenced this issue Apr 23, 2021

remove extra lines #15

101d013

stevenlujpl added a commit that referenced this issue Apr 29, 2021

(1) add paper_parser; (2) rename journal_parser to jgr_parser; (3) update lpsc and jgr parsers; #15

8ff4d31

stevenlujpl added a commit that referenced this issue Apr 29, 2021

add the general paper parser #15

f8675e9

stevenlujpl added a commit that referenced this issue May 20, 2021

fix corenlp server default url #15

a382a84

stevenlujpl added a commit that referenced this issue May 20, 2021

update corenlp properties #15

d507d57

stevenlujpl added a commit that referenced this issue May 20, 2021

update corenlp parser #15

2cfe639

stevenlujpl added a commit that referenced this issue May 20, 2021

fix corenlp parser #15

836072a

stevenlujpl added a commit that referenced this issue May 20, 2021

fix corenlp default url #15

9cd692e

stevenlujpl added a commit that referenced this issue May 20, 2021

temporary solution for arguments mismatch for canonical_target_name() #15

3aeab39

stevenlujpl added a commit that referenced this issue May 20, 2021

fix typo #15

b509d5d

stevenlujpl added a commit that referenced this issue May 20, 2021

fix corenlp default url #15

1c91660

stevenlujpl added a commit that referenced this issue May 20, 2021

fix corenlp default url for lpsc parser #15

147ca9d

stevenlujpl added a commit that referenced this issue May 20, 2021

fix typo for lpsc parser #15

e4e99be

stevenlujpl added a commit that referenced this issue May 20, 2021

move reference extraction to venue-specific parser #15

3c16bd6

stevenlujpl added a commit that referenced this issue May 25, 2021

fix typo and default corenlp url #15

97c1fcb

stevenlujpl added a commit that referenced this issue May 26, 2021

add ADS parser #15

6613e20

stevenlujpl added a commit that referenced this issue May 26, 2021

add a main function #15

397a3a1

stevenlujpl added a commit that referenced this issue May 26, 2021

fix ads parser #15

76a094f

stevenlujpl added a commit that referenced this issue May 26, 2021

update other parsers to use ads_parser #15

d08c41a

stevenlujpl added a commit that referenced this issue Jun 11, 2021

add progress bar to parser scripts #15

0ffc572

stevenlujpl added a commit that referenced this issue Jun 11, 2021

Merge pull request #21 from USCDataScience/issue15-journal

b05a863

add progress bar to parser scripts #15

stevenlujpl added a commit that referenced this issue Jun 11, 2021

fix ADS parser when docs not found in ADS database #15

c363b1d

stevenlujpl added a commit that referenced this issue Jun 11, 2021

Merge pull request #22 from USCDataScience/issue15-journal

78eed6e

fix ADS parser when docs not found in ADS database #15

stevenlujpl added a commit that referenced this issue Jun 23, 2021

escape solr special chars in paper titles #15

2ba3d21

stevenlujpl added a commit that referenced this issue Jun 23, 2021

separate warnings and exceptions #15

c13f57a

stevenlujpl added a commit that referenced this issue Jun 23, 2021

throw warnings when grobid failed extracting paper titles #15

3f8abf5

stevenlujpl added a commit that referenced this issue Jun 23, 2021

escape 5 more solr chars #15

c120e78

stevenlujpl added a commit that referenced this issue Jun 23, 2021

include paper title in the error messages for ads parser #15

60e1b6f

stevenlujpl added a commit that referenced this issue Jun 30, 2021

use ordered dict to define rules for escaping solr special chars #15

64893da

stevenlujpl added a commit that referenced this issue Jun 30, 2021

modify warning implementations #15

60f853b

stevenlujpl added a commit that referenced this issue Jun 30, 2021

add line break to lpsc jsonl file #15

14c1958

stevenlujpl added a commit that referenced this issue Jun 30, 2021

configure warnings to be always #15

119c036

stevenlujpl added a commit that referenced this issue Jun 30, 2021

add doc title to log file #15

d89f135

stevenlujpl added a commit that referenced this issue Jul 1, 2021

(1) fixing typo for escaping /; (2) enable special rules #15 #27

3b3fc3e

stevenlujpl added a commit that referenced this issue Jul 7, 2021

fix logging grobid title #15

e7d8a79

stevenlujpl added a commit that referenced this issue Jul 7, 2021

enrich logging for paper parser #15

cedb39a

stevenlujpl added a commit that referenced this issue Jul 7, 2021

escape double quotes #15

05987d6

stevenlujpl added a commit that referenced this issue Jul 7, 2021

enable year + abstract number + venue ADS search for LPSC docs #15

665ae72

stevenlujpl added a commit that referenced this issue Jul 15, 2021

add README for parser scripts #15

e3102cc

stevenlujpl added a commit that referenced this issue Jul 15, 2021

Merge pull request #31 from USCDataScience/issue15-journal

fb5579d

add README for parser scripts #15

stevenlujpl added a commit that referenced this issue Aug 10, 2021

fix bugs #15

24dfc1b

stevenlujpl closed this as completed Aug 10, 2021
