Uncrawlable sites list #36

Open
koriaf opened this issue Feb 8, 2018 · 3 comments

Comments

@koriaf
Contributor

koriaf commented Feb 8, 2018

  • https://guidelines.canceraustralia.gov.au/ - relies on JavaScript; links are not in href="" attributes, so the crawler doesn't see them
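
To illustrate the failure mode, here is a minimal sketch of href-based link extraction, assuming the crawler only looks at `href` attributes on `<a>` tags. The `HrefCollector` class, the `extract_links` helper, and both sample HTML strings are hypothetical and are not taken from this project or from the site above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags, as a simple static crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors with a non-empty href count as followable links.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href targets found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.links


# Hypothetical markup: ordinary links vs. navigation driven by JavaScript.
STATIC_PAGE = '<a href="/about">About</a> <a href="/contact">Contact</a>'
JS_PAGE = '<a onclick="loadPage(1)">About</a> <span data-target="contact">Contact</span>'

print(extract_links(STATIC_PAGE))
print(extract_links(JS_PAGE))
```

With markup like the second sample, the extractor finds nothing to follow, so the crawl stops at the landing page.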
@monkeypants
Collaborator

We did talk about handing pages like this over to something like PhantomJS for indexing, but it looked like it would become a never-ending maintenance job, because it would be difficult to generalise between sites.

Maybe we should maintain a reference list of uncrawlable sites in the repo?

@koriaf
Contributor Author

koriaf commented Feb 10, 2018

I think they could easily be retrieved from the resulting index by finding domains that have only 1-2 pages with a 200 response. Anyway, I'm keeping it here because I ran into it, and as a reference for future testing.
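
A rough sketch of that heuristic, assuming the index can be exported as `(url, status)` pairs. The `suspect_domains` function, the `max_ok_pages` parameter, and the record format are made up for illustration and do not reflect the project's actual index schema.

```python
from collections import Counter
from urllib.parse import urlparse


def suspect_domains(records, max_ok_pages=2):
    """Return domains that have only 1..max_ok_pages pages with status 200.

    records: iterable of (url, status) pairs (assumed export format).
    Domains with zero OK pages are skipped, since those are more likely
    to be down than to be JavaScript-only.
    """
    ok_counts = Counter()
    for url, status in records:
        if status == 200:
            ok_counts[urlparse(url).netloc] += 1
    return sorted(
        domain for domain, count in ok_counts.items()
        if 1 <= count <= max_ok_pages
    )


# Hypothetical index contents.
records = [
    ("https://a.example/", 200),
    ("https://b.example/", 200),
    ("https://b.example/1", 200),
    ("https://b.example/2", 200),
    ("https://c.example/", 500),
]
print(suspect_domains(records))
```

Small sites that really do have only one or two pages would also show up, so the output is a list of candidates to check by hand rather than a definitive answer.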

@monkeypants
Collaborator

OK, that's a TODO item then :)
