Section fragment information stripped from webrequests
Closed, ResolvedPublic

Assigned To

Authored By

	Isaac
	Dec 10 2019, 4:36 PM

Description

Based on my exploration, our current webrequest data does not retain the URL fragment, i.e. anything following the # sign in a URL, which indicates the section of the article being linked to. There is no immediate use-case for this information, but it could be valuable for understanding the relative importance of different sections in articles, and its absence feels more like a bug than an intentional decision. Example of a fragment:

https://en.wikipedia.org/wiki/Chicago#Etymology_and_nicknames
There is also some usage of this format with a parameter known as targetText, where search results link directly to a segment of text within a page. It is experimental, though, and only available in Google Chrome.

It is possible that not storing this information is purposeful, but I wanted to have a task to at least document that this is missing from our webrequest logs.

A few possible places where the fragment from the example above (#Etymology_and_nicknames) might be stored:

Most cleanly as a new webrequest column, uri_fragment, though this would require a schema change to webrequest and I understand that that might not be desirable.
In the pageview_info map, though including it as part of the page_title parameter would likely break some of the downstream usages of this data that are not expecting section information, e.g., when aggregating page views or joining with other tables to resolve redirects.
As part of X-Analytics, though that might be viewed as hacky and not what X-Analytics was intended for.
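
To illustrate the difference between the first two options, here is a minimal Python sketch. It assumes the full URL, fragment included, were available at logging time. The function name split_title_and_fragment and the sample rows are hypothetical and not part of the webrequest pipeline; uri_fragment is the column name proposed above. Keeping the fragment in its own field leaves page_title unchanged, so aggregating page views by title behaves as before.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit

WIKI_PREFIX = "/wiki/"


def split_title_and_fragment(url):
    """Return (page_title, uri_fragment) for an article URL.

    Hypothetical helper: uri_fragment is None when the URL has no fragment,
    and page_title never contains section information.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.startswith(WIKI_PREFIX):
        path = path[len(WIKI_PREFIX):]
    page_title = unquote(path)
    uri_fragment = parts.fragment or None
    return page_title, uri_fragment


# Illustrative input only; not real webrequest data.
urls = [
    "https://en.wikipedia.org/wiki/Chicago#Etymology_and_nicknames",
    "https://en.wikipedia.org/wiki/Chicago",
]
rows = [split_title_and_fragment(u) for u in urls]

# Aggregating by page_title is unaffected by the extra column.
views_per_title = Counter(title for title, _ in rows)
```

If the fragment were instead appended to page_title, the two rows above would be counted as two different pages, which is the downstream breakage the second option risks.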

Related Objects

Mentioned In: T235784: Identify data / questions that we can(not) answer regarding external reuse
Mentioned Here: T200810: Make it possible to A/B test different section headings on mobile web

Event Timeline

Isaac created this task. Dec 10 2019, 4:36 PM

Fragments aren't even sent in requests, they are handled entirely client side.
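
A small Python sketch of why that is: when a client builds the HTTP request for a URL, only the path and query go into the request target, and the fragment stays on the client. The helper name request_target is made up for illustration, and this only mimics what a browser does rather than sending a real request.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Build the origin-form request target (path plus optional query).

    Hypothetical helper: the fragment is deliberately left out, as it is
    never transmitted to the server.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


url = "https://en.wikipedia.org/wiki/Chicago#Etymology_and_nicknames"
target = request_target(url)        # what the server would see in the request line
fragment = urlsplit(url).fragment   # what stays in the browser
```

Here target is "/wiki/Chicago" and fragment is "Etymology_and_nicknames", so nothing on the server side, webrequest included, has a chance to log the section.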

I was just thinking recently it would be interesting to see how commonly fragment links are used, and how often they are broken (pointing to a non-existent section). Seems like it would need EventLogging or something similar though.

Fragments aren't even sent in requests, they are handled entirely client side.

@Pcoombe oh yikes, good point, thanks! I had thought I had verified that it was sent as part of the URL but you're right that it's purely client-side. Well I suppose that resolves why we do not currently track it :)

For reference, after some further digging, we do have an EventLogging schema for this sort of thing: https://meta.wikimedia.org/wiki/Schema:MobileWebSectionUsage
There is also a task (T200810) focused on reimplementing the schema (for A/B tests, but this would also provide some general statistics).
This ad-hoc data collection actually feels more appropriate to me than trying to collect it wholesale through webrequests, at least while we don't have a strong, on-going use-case for the data.

Isaac mentioned this in T235784: Identify data / questions that we can(not) answer regarding external reuse. Dec 23 2019, 5:05 PM
