Page MenuHomePhabricator

JavaScript redirect shows irrelevant internal information in search page
Closed, ResolvedPublic

Description

When I search for something that doesn't exist, e.g. https://en.wikipedia.org/wiki/Special:Search?search=zybjykht, some JavaScript code changes the displayed URL to add &searchToken=.... This is pretty inconvenient because it means I can't easily edit the URL to fix my search query.

In most cases this kind of fake redirect is useful, but here it's kind of irritating and just adds irrelevant information to the URL -- a passed in searchToken just gets ignored, so as far as I can tell this is just internal information.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Bawolff subscribed.

Its needed if the "remember selected namespaces" checkbox is selected. We should use js to remove it if that box is not set.

I misunderstood this bug. Appearently we have js code that uses html5 history api to inject the search csrf token into the addressbar. Which seems very silly. Sorry, I made a lot of assumptions about this bug that was wrong. My bad. I was confusing with nsRemember field.

In particular it looks like this is implemented by src/mediawiki.action/mediawiki.action.view.redirect.js redirecting to the "canonical" URL.

Afaik: This token is not a csrf token and not added by MediaWiki core redirect handling.

Rather, it’s added by WikimediaEvents for use during EventLogging campaigns. I agree that it should be proactively hidden from users. It is generated on pageload (if missing) and all added to clicked links as well. But there should be no need to add it to the current url when missing. And when found, after clicking a link, we should remember to hide it on page load as well (after storing it in memory).

We’ve done the same in the past with similar campaigns so this isn’t a new issue. We’ve solved it before already.

This searchToken is primarily used to associate search requests with clicks on those search result pages. The token itself is a random string that is inserted into a log message created when a user performs a search. We then join the search logs with the web request logs to generate click logs. These click logs are fed into the machine learning system that ranks search results. It is certainly not the best way, and we can discuss what might be a better way.

Some notes:

  • Additional query parameters can not be added to search result urls, as that would bypass caching. There is a special wprov query parameter that varnish ignores but it doesn't seem to fit well here.
  • Instead the token is added to the URL of the search result page. Any link clicked then sends this token in the referrer header. This gives the direct connection between individual search events and what links were clicked
  • Other sites implement this as a redirect bounce. We could certainly do that but it seemed more performant to not make the extra round trip to record clicks.
  • We've evaluated using sendBeacon on clicks to record explicit events instead of parsing the web request logs for referrers. At that time we saw a large enough % of missing events from send beacon (caniuse still reports no sendbeacon for IE or safari) that we decided to collect via referrers.

@EBernhardson See my earlier comment. The existing methodology is not a problem from a technical perspective. The method used is fine. But it does not require it to be visible in the address bar while viewing the search page, nor after the user goes back to the SERP.

It should be solvable by delaying the change to the address bar until a link is clicked. (Verify that browsers includes changes from onclick in its referral).

I will see about getting access to browserstack so i can try out the late-URL edit and see how it works across browsers.

EBjune triaged this task as Medium priority.Mar 29 2018, 5:25 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board.

Pulled a test set of browsers reporting to search eventlogging for sept 1-12 in P7539. This is not the same data as is collected in the code for this ticket, but is a close enough representative sample that is very easy to query. That data additionally isn't perfect since there is a target sampling rate (x events per wiki per day) rather than some percentage of all requests, but it's probably close enough. It will essentially down-weight some of the largest wikis in the stats. Below is everything with >1% of distinct search sessions for sept 1-14, as classified by our eventlogging user agent classifier.

osbrowser%results
win 10chrome 6824.8%good
win 7chrome 6816.1%good
win 10edge 174.8%good
win 10firefox 613.8%good
os xsafari 113.5%good (only tested high sierra / safari 11.1)
win 8.1chrome 683.4%good
os xchrome 683.0%good (tested sierra, os x 10.12)
win 10ie 112.9%Empty referrer. https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/10474810/
win 7firefox 612.9%good
win 7ie 112.4%empty referrer. see above
win 10firefox 62 beta2.1%good
win 10chrome 692.0%good
win 7firefox 62 beta1.6%good
win 7chrome 691.2%good
win xpchrome 491.2%good
win 10opera 551.1%good

This is missing mobile, because the satisfaction schema doesn't run on mobile. The click tracking here does so i pulled a second set of top browsers from https://pivot.wikimedia.org

Ua Os FamilyUa Os MajorUa Browser FamilyUa Browser MajorView CountView %
iOS11Mobile Safari1162748449316.3%good (via iPhone 8)
Windows10Chrome683171420048.2%good
Other-Other-2189964405.7%???
Android8Chrome Mobile681910933735.0%good (via pixel 2)
Android7Chrome Mobile681863832664.8%good (via s8)
Windows7Chrome681846384734.8%good
Other-bingbot21472804583.8%irrelevant
Other-Googlebot21405230773.6%irrelevant
Windows7IE111150564393.0%empty referrer (no change)
Android6Chrome Mobile68883775822.3%good (via LG G5)
Windows10Firefox61791075212.1%good
Other-YandexBot3676309401.8%irrelevant
Mac OS X10Chrome68615264631.6%good
iOS10Mobile Safari10576102891.5%good (via iPhone 7)
Windows10Edge17527271991.4%good
Mac OS X10Safari11517045551.3%good
Windows7Firefox61496388041.3%good
Mac OS X10Applebot0494326091.3%irrelevant
Android5Chrome Mobile68454091881.2%good (via s8)
Android4Chrome Mobile38452517421.2%couldn't find anything old than chrome 63 for android 4 in browserstack
Windows10IE11412685261.1%empty referrer (no change)
Android8Samsung Internet7402027111.0%good (via s8, android 7. Not available under browserstack android 8)
Windows8.1Chrome68397225061.0%good
iOS11Chrome Mobile iOS68393697671.0%good (via iphone 8)

Overall everything except IE 11 works, and the proposed change did not cause that. The pre-existing use of history.replaceState triggers that bug.

Change 460110 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Only add searchToken to url when something is clicked

https://gerrit.wikimedia.org/r/460110

Change 460110 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Only add searchToken to url when something is clicked

https://gerrit.wikimedia.org/r/460110

Krinkle changed Risk Rating from N/A to default.

A recent update seems to have added another inconvenient redirect: https://en.wikipedia.org/wiki/Special:Search?search=zybjykht now redirects to https://en.wikipedia.org/wiki/Special:Search?search=zybjykht&ns0=1. Is this intentional?

This is an intentional feature added by the people behind AdvancedSearch. The high level goal there is for the URL to represent what is being searched. In particular if a user has a set of namespaces saved as their default search namespaces their search URL's will not be shareable. The exact implementation details are debatable, but the overall goal is reasonable. See T217445 for more details, discussion of the feature should likely also happen there.

OK, that makes sense. I was going to suggest adding the extra parameters to the beginning rather than the end of URL, so the search query is still at the end, but it looks like I can just add ns0=1 to my initial search query to avoid the redirect, which works well enough. Thanks for clarifying!