Add support for journal papers #15
Comments
add progress bar to parser scripts #15
fix ADS parser when docs not found in ADS database #15
The HTTP 400 errors should be resolved by the commit above. The fix is to escape special characters in paper titles before using them in Solr queries.
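As a rough illustration of the escaping idea (not the code from the commit, which is not shown in this thread), the sketch below backslash-escapes the characters that the standard Lucene/Solr query syntax treats as operators. The function name `escape_solr` and the exact character list are assumptions; the ADS API may treat a different set of characters specially.

```python
import re

# Characters with special meaning in the standard Lucene/Solr query syntax.
# Assumption: this list matches what the ADS search backend treats as syntax.
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_solr(text: str) -> str:
    """Put a backslash in front of every Solr query syntax character."""
    return _SOLR_SPECIAL.sub(r"\\\1", text)


if __name__ == "__main__":
    # Parentheses and the colon are escaped; plain words are left alone.
    print(escape_solr("Mars (MER): rocks"))
```

Escaping happens before URL encoding, so an HTTP client would still need to percent-encode the escaped string when placing it in the query URL.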
The MPF doc (/proj/mte/data/corpus-lpsc/mpf-pdf/2014_1566.pdf) entitled "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK, WESTERN RIM OF ENDEAVOUR CRATER, MARS" is still giving HTTP error code 400. When I queried the ADS database using the full title, I got the following HTTP 400 error.
When I queried the ADS database with the partial title "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY", it worked.
When I queried with a longer partial title, "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR" (the previous partial title with the word NEAR appended), I got the same HTTP 400 error again.
When I searched on the ADS website using the same full and partial titles, I saw the same behavior. This is very confusing because the title doesn't contain any special characters at all. I will ignore this document for now.
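The manual prefix testing described above can be automated. The sketch below grows the title one word at a time and reports the shortest prefix that fails. `shortest_failing_prefix` and `fake_query_ok` are hypothetical names; `fake_query_ok` is only a stand-in for a real ADS request and simply assumes, for illustration, that a bare `NEAR` token triggers the failure. The actual cause is not established at this point in the thread.

```python
from typing import Callable, Optional


def shortest_failing_prefix(
    title: str, query_ok: Callable[[str], bool]
) -> Optional[str]:
    """Return the shortest word-prefix of `title` for which `query_ok` is False."""
    words = title.split()
    for n in range(1, len(words) + 1):
        prefix = " ".join(words[:n])
        if not query_ok(prefix):
            return prefix
    return None  # every prefix, including the full title, succeeded


def fake_query_ok(query: str) -> bool:
    # Stand-in for a real request (assumption): fail on a bare NEAR token.
    return "NEAR" not in query.split()


if __name__ == "__main__":
    title = "HOLLOWED SPHERULES IDENTIFIED WITH THE MER OPPORTUNITY NEAR AND AT CAPE YORK"
    print(shortest_failing_prefix(title, fake_query_ok))
```

In real use, `query_ok` would issue the request and return whether the HTTP status was 200, which costs one request per word in the title.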
The new LPSC parser still cannot parse the following 8 MPF docs due to HTTP code 400 (bad request) and 500 (internal server error):
The full error log file is at
All of the HTTP 400 and 500 errors mentioned above should now be fixed.
@wkiri The ADS search with I tested querying ADS DB using
The 5 docs above worked fine with
@stevenlujpl This is great news! Thanks for figuring it out. I agree that the handling of keywords in this API seems very strange. When you say it worked for those 5 docs, do you mean each query only returns a single match now?
Yes, each query returned exactly 1 matching document, and the returned fields of that document, such as title and author, are correct.
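A check like "exactly one match" can be made explicit in code. The sketch below assumes the response body follows the usual Solr layout, `{"response": {"numFound": ..., "docs": [...]}}`; that layout and the helper name `single_match` are assumptions, not taken from the parser scripts.

```python
from typing import Any, Dict, Optional


def single_match(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the only matching doc, or None if there are zero or several."""
    docs = payload.get("response", {}).get("docs", [])
    return docs[0] if len(docs) == 1 else None


if __name__ == "__main__":
    # Hand-written payload for illustration, not real ADS output.
    payload = {"response": {"numFound": 1, "docs": [{"title": ["Example title"]}]}}
    print(single_match(payload))
```

Returning `None` for ambiguous results keeps a wrong record from being attached to a document when a query matches more than one paper.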
@wkiri I want to apologize for using the meeting time to implement the As I mentioned in the meeting, I tested this search strategy on all 591 MPF LPSC docs. Now only 2 docs (2006_2424.pdf and 2003_1088.pdf) are not found in the ADS database. I then confirmed by querying the ADS database with curl that these two docs do not appear to be indexed there. The new code structure that I just checked in will also support DOI search for journal papers, provided we can extract the DOI (we just need to add a function to construct the DOI query string). I won't have time this week to carefully verify the content of the jsonl file, but the following command confirms that there are indeed 589
The jsonl and the corresponding processing log can be found at the following locations on analysis if you would like to take a look.
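The DOI query function mentioned above could look roughly like the following. This is a sketch under assumptions: `build_doi_query` and `build_params` are hypothetical names, the default field list is arbitrary, and the DOI shown is a placeholder. It relies only on the ADS search API accepting a `doi:` field in the `q` parameter and a comma-separated `fl` field list.

```python
from typing import Dict, Sequence


def build_doi_query(doi: str) -> str:
    """Build a quoted doi: query string, escaping backslashes and quotes."""
    escaped = doi.strip().replace("\\", "\\\\").replace('"', '\\"')
    return f'doi:"{escaped}"'


def build_params(
    query: str, fields: Sequence[str] = ("title", "author", "bibcode")
) -> Dict[str, str]:
    """Assemble request parameters: q is the query, fl the returned fields."""
    return {"q": query, "fl": ",".join(fields)}


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(build_params(build_doi_query("10.1000/xyz123")))
```

Keeping query construction separate from the HTTP call means a title-based and a DOI-based search can share the same request and response handling.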
@stevenlujpl No need to apologize - I am glad you got this working! Thank you for including the details here. |
add README for parser scripts #15
Want to test on more journal .pdf files (from our existing downloads) as well. Then can close this issue.
I tested the However, 12 papers whose titles were extracted by grobid are not found in the ADS database. The titles extracted by grobid are listed below:
I have created an issue (#38) to capture ideas for improving the ADS parser. The jsonl and log files can be found at the following locations:
Currently, the JournalParser class supports LPSC content. We would like to broaden it to accommodate other publication sources such as the Journal of Geophysical Research, Icarus, etc., each of which will have some journal-specific parsing needed.
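One possible shape for journal-specific parsing is a shared base class with an overridable hook, plus a lookup from journal name to parser class. The sketch below is illustrative only: the real `JournalParser` interface is not shown in this issue, so the class names, the `clean` hook, and the Icarus header rule are all hypothetical.

```python
from typing import Dict, Type


class BaseParserSketch:
    """Shared parsing flow; subclasses override `clean` for journal quirks."""

    journal = "generic"

    def clean(self, text: str) -> str:
        # Journal-specific cleanup hook; the default leaves the text as-is.
        return text

    def parse(self, text: str) -> Dict[str, str]:
        return {"journal": self.journal, "text": self.clean(text).strip()}


class IcarusParserSketch(BaseParserSketch):
    journal = "Icarus"

    def clean(self, text: str) -> str:
        # Hypothetical rule: drop lines that look like a running header.
        lines = [ln for ln in text.splitlines() if not ln.startswith("Icarus ")]
        return "\n".join(lines)


PARSERS: Dict[str, Type[BaseParserSketch]] = {
    cls.journal: cls for cls in (BaseParserSketch, IcarusParserSketch)
}


def get_parser(journal: str) -> BaseParserSketch:
    """Pick the parser for a journal, falling back to the generic one."""
    return PARSERS.get(journal, BaseParserSketch)()


if __name__ == "__main__":
    print(get_parser("Icarus").parse("Icarus 123 (2020)\nBody text"))
```

With this layout, supporting another source such as the Journal of Geophysical Research would mean adding one subclass and one registry entry, without touching the shared flow.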