
Section fragment information stripped from webrequests
Closed, Resolved (Public)

Description

Based on my exploration, our current webrequest data does not retain the fragment portion of the URL, which indicates which section of the article is being linked to (i.e., anything following the # sign in a URL). There is no immediate use case for this information, but it could be valuable for understanding the relative importance of different sections in articles, and its absence feels more like a bug than an intentional decision. Fragment examples:

It is possible that not storing this information is purposeful, but I wanted to have a task to at least document that this is missing from our webrequest logs.

A few possible places where a fragment such as #Etymology_and_nicknames (from the above example) might be stored:

  • Most cleanly, as a new uri_fragment column in webrequest, though this would require a schema change to webrequest and I understand that might not be desirable.
  • In the pageview_info map, though including it as part of the page_title parameter would likely break some downstream uses of this data that do not expect section information, e.g., when aggregating page views or joining with other tables to resolve redirects.
  • As part of X-Analytics, though that might be viewed as hacky and not what X-Analytics was intended for.

Event Timeline

Fragments aren't even sent in requests, they are handled entirely client side.

I was just thinking recently it would be interesting to see how commonly fragment links are used, and how often they are broken (pointing to a non-existent section). Seems like it would need EventLogging or something similar though.

Isaac claimed this task.

Fragments aren't even sent in requests, they are handled entirely client side.

@Pcoombe oh yikes, good point, thanks! I had thought I had verified that it was sent as part of the URL but you're right that it's purely client-side. Well I suppose that resolves why we do not currently track it :)
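
To illustrate the point, here is a minimal Python sketch of how a URL is split before a request is made. It is only an approximation of what a browser does: the helper `request_target` and the article name in the URL are hypothetical, and only the fragment `#Etymology_and_nicknames` comes from this task. The fragment is kept by the client and is not part of the path and query sent to the server, which is why it cannot appear in webrequest.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path-and-query part of a URL, i.e. what a client would
    put on the HTTP request line. The fragment is deliberately left out,
    mirroring how browsers keep it on the client side."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Hypothetical article URL; only the fragment is taken from the task text.
url = "https://en.wikipedia.org/wiki/Example_article#Etymology_and_nicknames"

print(urlsplit(url).fragment)  # fragment stays with the client
print(request_target(url))     # path the server would see, without the fragment
```

Anything keyed on the fragment would therefore have to be collected client-side, for example through an EventLogging-style instrument, rather than from server request logs.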

For reference, after some further digging, we do have an EventLogging schema for this sort of thing: https://meta.wikimedia.org/wiki/Schema:MobileWebSectionUsage
And there's a task (T200810) focused on reimplementing the schema (for A/B tests, but this would also provide some general statistics).
This ad-hoc data collection actually feels more appropriate to me than trying to collect it wholesale through webrequests, at least while we don't have a strong, ongoing use case for the data.