Support fetching from kaggle dataset #164

Open: Randl wants to merge 1 commit into master

Randl commented May 18, 2021

This pull request adds support for fetching metadata and PDFs from the Kaggle arXiv dataset, which avoids overloading the arXiv API and allows a faster bootstrap. The PDFs are stored under gs://arxiv-dataset/arxiv/arxiv/pdf: gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf lists them, and gsutil can then be used to download them all (>2 TB of data).

Additional changes handle subfolders in the PDF folder and speed up processing for a large number of db entries.
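
As an illustration of the subfolder handling, here is a minimal sketch, not the code from this pull request. It assumes the downloaded PDFs may sit either directly in the PDF folder or in nested subfolders (for example one folder per month); the function name `find_pdfs` is hypothetical.

```python
from pathlib import Path


def find_pdfs(pdf_root):
    """Return {file stem: path} for every PDF under pdf_root, at any depth.

    Assumption: the stem (file name without ".pdf") is the arXiv id with
    version, e.g. "2101.00001v1". If two files share a stem, the one that
    sorts last wins.
    """
    root = Path(pdf_root)
    # rglob matches files in pdf_root itself as well as in all subfolders.
    return {path.stem: path for path in sorted(root.rglob("*.pdf"))}
```

A lookup such as `find_pdfs("data/pdf").get("2101.00001v1")` then works regardless of which subfolder the file was downloaded into; the path `data/pdf` is only an example.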

Devetec commented Jul 5, 2021

@Randl

Question: Does this support newer arXiv papers that are not in the Kaggle dataset?

Randl (author) commented Jul 5, 2021

The standard fetching is still available; you'll just need to download only the couple of hundred newest papers from the API.
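
A minimal sketch of that combined workflow, under stated assumptions: the database is a plain dict keyed by version-less arXiv id, and each entry carries `_rawid` and `_version` keys. Both the key names and the function `merge_entries` are illustrative assumptions, not taken from this pull request.

```python
def merge_entries(db, new_entries):
    """Merge freshly fetched entries into a database bootstrapped from a snapshot.

    db maps a version-less arXiv id to an entry dict. An incoming entry is
    stored only if its id is unknown or its version is newer than the stored
    one. Returns the number of entries added or updated.
    """
    changed = 0
    for entry in new_entries:
        rawid, version = entry["_rawid"], entry["_version"]
        if rawid not in db or db[rawid]["_version"] < version:
            db[rawid] = entry
            changed += 1
    return changed
```

With the bulk of the database loaded from the snapshot, only the entries returned by the API for the most recent papers go through this merge, so the number of API requests stays small.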

lance10t commented Jul 5, 2021

This is interesting, I didn't see this earlier and developed something on my own too. But what you have done is definitely a lot neater.

My only immediate concern while developing it was that the fields were slightly different and I needed some transformations. As much as possible, I fitted it into the db.p dictionary structure to minimise any downstream bugs.
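
To make the kind of transformation concrete, here is a rough sketch rather than either implementation discussed in this thread. It assumes the Kaggle snapshot is a file with one JSON record per line, using fields such as `id`, `title`, `abstract`, `categories`, `authors_parsed` and `versions`. The target keys (`_rawid`, `_version`, `summary`, `authors`, `tags`) are an assumed approximation of the db.p entry layout, and both function names are hypothetical.

```python
import json


def kaggle_to_entry(record):
    """Convert one Kaggle metadata record into a db-style entry dict."""
    versions = record.get("versions") or [{"version": "v1"}]
    latest = int(versions[-1]["version"].lstrip("v"))  # "v2" -> 2
    authors = [
        {"name": " ".join(part for part in (first, last) if part)}
        for last, first, *_ in record.get("authors_parsed", [])
    ]
    return {
        "_rawid": record["id"],
        "_version": latest,
        "title": " ".join(record["title"].split()),  # collapse line breaks
        "summary": record["abstract"].strip(),
        "authors": authors,
        "tags": [{"term": cat} for cat in record["categories"].split()],
    }


def load_snapshot(path):
    """Stream a JSON-lines snapshot into a dict keyed by arXiv id."""
    db = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                entry = kaggle_to_entry(json.loads(line))
                db[entry["_rawid"]] = entry
    return db
```

Reading the file line by line keeps memory use bounded by the resulting dict rather than by the raw snapshot, which matters for a metadata file of this size.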

But I'm looking forward to the review and to seeing whether this can become part of the main branch. The rate limits and blocks on arXiv are quite painful.

Randl (author) commented Jul 5, 2021

It seems that @karpathy is rewriting it (from scratch?): https://www.reddit.com/r/MachineLearning/comments/obne9p/d_is_arxivsanity_down_what_people_use_these_days/h3q422o
Hopefully this will be one of the features of the new version.

lance10t commented Jul 5, 2021

This is a great initiative. One of the best projects I've come across that solves a practical need.
