Does not respect robots.txt or rel="nofollow" directives #29

Open
rtrvrtg opened this issue Dec 16, 2017 · 9 comments

Comments

@rtrvrtg

rtrvrtg commented Dec 16, 2017

This behaviour is causing performance issues on sites that use dynamic URLs to serve up filtered content.
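For illustration, a minimal sketch of the two checks named in the title, using only the Python standard library. This is not the project's actual code: the `USER_AGENT` string and the helper names `allowed_by_robots` and `followable_links` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user-agent string


def allowed_by_robots(robots_txt, url, agent=USER_AGENT):
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class FollowableLinks(HTMLParser):
    """Collect hrefs from <a> tags, skipping any marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, in any letter case.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def followable_links(html):
    """Return the hrefs a polite crawler may follow from this page."""
    collector = FollowableLinks()
    collector.feed(html)
    return collector.links
```

A crawler would call `allowed_by_robots` before each fetch and queue only the URLs returned by `followable_links`, which keeps it away from filtered or dynamic URLs that a site has excluded.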

@monkeypants
Collaborator

Hi, thanks for raising this. We are running some new code and this is obviously a bug!

@nathan-w

nathan-w commented Dec 16, 2017

Hi Chris

Please call Sharyn Clarkson, Assistant Secretary, Online Services Branch, Department of Finance. Her number is xxxxxxxxx. She is expecting your call.

Nathan Wall
Head of govCMS

@monkeypants
Collaborator

I have spoken to Sharyn, and redacted her number from your post @nathan-w.

@nathan-w

Thanks Chris!

@monkeypants
Collaborator

Well that was exciting. I'll circulate a post-incident report (through Sharyn) on Monday afternoon or Tuesday, after gathering facts then consulting our DTA masters.

As well as fixing the robots.txt bug, we might need to create some new throttle features. The current throttle says "don't hit the same domain name more than once per {DNS_THROTTLE} seconds". We assumed that would suffice to stop us DOSsing any servers, but we didn't think hard enough about very large multi-tenancy virtual hosts.

I think we might need two more throttle rules:

  • don't hit the same IP address more than once per {IP_THROTTLE} seconds (catch-all sanity check, unlikely to prevent traffic caused by CDN cache-misses).
  • for every known {VHOST_CLUSTER_DOMAIN_LIST}, don't hit the same domain in a known vhost_cluster more than once per {VHOST_THROTTLE} seconds
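A rough sketch of how the existing rule and the two proposed rules could be combined. It assumes the vhost rule means one shared limit for all domains in a cluster, which is an interpretation rather than something stated above. The class names, the `vhost_clusters` mapping and the injectable `clock` are hypothetical.

```python
import time


class Throttle:
    """Allow at most one hit per `interval` seconds for each key."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_hit = {}

    def ready(self, key):
        last = self.last_hit.get(key)
        return last is None or self.clock() - last >= self.interval

    def record(self, key):
        self.last_hit[key] = self.clock()


class CrawlThrottle:
    """Combine the per-domain, per-IP and per-vhost-cluster rules."""

    def __init__(self, dns_throttle, ip_throttle, vhost_throttle,
                 vhost_clusters, clock=time.monotonic):
        self.by_domain = Throttle(dns_throttle, clock)
        self.by_ip = Throttle(ip_throttle, clock)
        self.by_cluster = Throttle(vhost_throttle, clock)
        # Map each known domain to the name of its cluster.
        self.cluster_of = {
            domain: name
            for name, domains in vhost_clusters.items()
            for domain in domains
        }

    def try_acquire(self, domain, ip):
        """Return True and record the hit only if every rule permits it."""
        checks = [(self.by_domain, domain), (self.by_ip, ip)]
        cluster = self.cluster_of.get(domain)
        if cluster is not None:
            checks.append((self.by_cluster, cluster))
        if not all(throttle.ready(key) for throttle, key in checks):
            return False
        for throttle, key in checks:
            throttle.record(key)
        return True
```

Keying the third rule on the cluster name rather than the domain is what protects a large multi-tenant host: a hit on any one tenant delays hits on all the others.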

Is there a publicly available list of domains hosted by GovCMS?

@monkeypants
Collaborator

monkeypants commented Dec 17, 2017

@nathan-w, is it the case that all GovCMS sites have something like this in their HTML Head?

<meta
    name="generator"
    content="Drupal 7 (http://drupal.org) + govCMS (http://govcms.gov.au)" 
/>

If so, I think we could make a "govCMS detector" and self-maintain a list of govCMS sites.
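A possible shape for such a detector, as a sketch only. It assumes the generator tag shown above really is present on every govCMS site, which is still an open question at this point in the thread. The names `GeneratorMeta` and `looks_like_govcms` are invented for the example.

```python
from html.parser import HTMLParser


class GeneratorMeta(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def looks_like_govcms(html):
    """Return True if any generator meta tag mentions govCMS."""
    finder = GeneratorMeta()
    finder.feed(html)
    return any("govcms" in content.lower() for content in finder.generators)
```

Domains for which `looks_like_govcms` returns True could then be appended to the self-maintained list.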

@nathan-w

nathan-w commented Dec 17, 2017

Chris - please email your contact details to [email protected] - I'd like to take this conversation private while options are explored.

@monkeypants
Collaborator

note: apparently all govCMS sites share a google analytics key, so perhaps that could also be used for "govCMS detection"
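A sketch of how that might work, assuming the shared key is a classic `UA-XXXXX-Y` Google Analytics tracking ID that appears in the page source. The actual key is not given in this thread, so the code only groups domains by whatever IDs it finds. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Classic Google Analytics tracking IDs look like UA-12345678-1.
GA_ID = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def analytics_ids(html):
    """Return the set of GA tracking IDs found in a page's source."""
    return set(GA_ID.findall(html))


def sites_by_analytics_id(pages):
    """Group domains by shared tracking ID; `pages` maps domain -> HTML."""
    groups = defaultdict(set)
    for domain, html in pages.items():
        for tracking_id in analytics_ids(html):
            groups[tracking_id].add(domain)
    return dict(groups)
```

Any tracking ID shared by many government domains would be a candidate vhost cluster, to be confirmed by hand before it is used for throttling.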

@koriaf
Contributor

koriaf commented Feb 21, 2018

I guess this issue can be closed now, since the fix has been deployed and working for some time already.
