Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to remove trailing amp URL markup #23

Open
evanberkowitz opened this issue Dec 18, 2020 · 7 comments
Open

Fails to remove trailing amp URL markup #23

evanberkowitz opened this issue Dec 18, 2020 · 7 comments

Comments

@evanberkowitz
Copy link

Here are two examples which redirect to the wrong page:

https://www.google.com/amp/s/www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates%3famp

  • I just want to say for posterity's sake that I have no interest in the bachelorette, but some of my friends do 😂

https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/%3foutputType=amp

The issue is the trailing parts of the URL. After redirect, the first example's URL has an %3famp at the end which, when removed, gives the correct page,

https://www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates

The second example's URL has an extra ?outputType=amp/ which, when removed, gives the correct page,

https://www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/?outputType=amp/

Some other similar examples such as

https://www.google.com/amp/s/nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/

https://www.google.com/amp/s/m.jpost.com/omg/former-israeli-space-security-chief-says-aliens-exist-humanity-not-ready-651405/amp

enjoy server-side redirects to the content of 'interest' (again: no real interest, just an example).

This may be a server-side problem---but if half of servers have a wrong implementation, it's something this extension could help with.

@da2x
Copy link
Owner

da2x commented Dec 18, 2020

https://www.google.com/amp/s/www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates%3famp
https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/%3foutputType=amp

Okay, so this is a bug in the extension. It should URL decode the URL to get “?amp” instead of “%3famp”. After fixing that, it should get the right pages and redirect to them to the right location.

https://www.google.com/amp/s/nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/
https://www.google.com/amp/s/m.jpost.com/omg/former-israeli-space-security-chief-says-aliens-exist-humanity-not-ready-651405/amp

The extension redirects these URLs to the canonical HTML location when the page loads. What problem are you seeing with these ones? It’s not the same issue as with the others.

@evanberkowitz
Copy link
Author

The latter ones I didn't have a problem with---I just noticed that they had amp-related suffixes.

When I load the target of the redirect (which succeeds), the server rewrites the URL again, removing the /amp or /amp/ suffix (among other rewrites). So I wasn't sure if, according to the amp protocol, handling the suffix is something the extension should be doing or if it's the server's responsibility. So those latter examples aren't problems, they're just contrasts for the problematic ones---URLs where an amp-related suffix is handled correctly (even though it is ignored by the extension).

@da2x
Copy link
Owner

da2x commented Dec 18, 2020

Hm. So the AMP Cache URLs shouldn’t have been urlencoded, and decoding them when they’re not encoded can lead to unexpected results. They’re not urlencoded on the publishers AMP page nor when served by Google or Bing. Where did these URLs come from? Did someone send them to you? Using what app? or did you get them from a search result page? If I can find out where they came from, I can figure out where things went wrong and what the underlying problem is.

@da2x
Copy link
Owner

da2x commented Dec 18, 2020

When I load the target of the redirect (which succeeds), the server rewrites the URL again, removing the /amp or /amp/ suffix (among other rewrites). So I wasn't sure if, according to the amp protocol, handling the suffix is something the extension should be doing or if it's the server's responsibility.

AMP redirection happens in two steps: the first step is to redirect off from an AMP Viewer or AMP Cache (e.g. Google, Bing, Baidu). This is done my rewriting URLs that match a pseudo-standard pattern. This redirects you to the publishers AMP page. As you’ve noticed, every publisher applies their own patterns to determine the URL for their AMP versions of pages. The extension can’t blindly chop off suffixes like “/amp/” because it can’t know whether that refers to the AMP tech or e.g. the page is about an electronic amplifier The second steps is parsing the AMP page to find the non-AMP version of that page. Lastly, you’re redirected to the non-AMP version.

It’s a convoluted journey, but it gets there in the end. Complications like this is why many web developers feel strongly against AMP in the first place. There should be only one web page and one URL for any given document.

@evanberkowitz
Copy link
Author

These URLs came to me via slack. I do not know how the people on the other end got them to be the way they are android vs ios, copy/paste vs an OS share button, etc. But I can inquire if it'd help.

I'm with you on anti-AMP sentiments, which is why I was delighted to find this extension in the first place!

@da2x
Copy link
Owner

da2x commented Jan 24, 2022

// Google AMP Viewer sometimes URI encode query parameters, handle optimistically
if (!redirection.includes('?') && redirection.includes('%3f'))
{
var uriComponents = redirection.split('%3f');
var uri = uriComponents[0];
// this can make a mess, but it’s not of our making
var query = decodeURIComponent(uriComponents.slice(1).join('%3f'));
redirection = uri.concat('?', query);
}
// Unhandled case: <viewer-url>%f3<page-query>?<google-query>
// It’s impossible to distinguish page-query and google-query, as we never know beforehand if
// the URL has been encoded or not (the escaped question mark can also be in any part of the URL).
//
// Incorrectly handled cases:
// * question mark in path, e.g.
// example.com/is-it-good%f3/sure-why-not

@da2x
Copy link
Owner

da2x commented Jan 24, 2022

I’ve attempted a fix. It’s not great, see code comments above.

I’ll release it in a few days after some more testing.

You can help with testing.

Disable the current extension. Download and unzip the test version, and load it temporarily in Firefox or Chrome/Edge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants