Based on my exploration, our current webrequest data does not retain the URL fragment, i.e. anything following the # sign, which indicates which section of an article is being linked to. There is no immediate use case for this information, but it could be valuable for understanding the relative importance of different sections in articles, and its absence feels more like a bug than an intentional decision. Fragment examples:
- https://en.wikipedia.org/wiki/Chicago#Etymology_and_nicknames
- There is also some usage of the fragment with a parameter known as targetText, where search results link directly to a segment of text within a page. This is experimental and currently only supported in Google Chrome.
It is possible that not storing this information is purposeful, but I wanted to have a task to at least document that this is missing from our webrequest logs.
A few possible places where the fragment (#Etymology_and_nicknames in the example above) could be stored:
- Most cleanly as a new column in webrequest, e.g. uri_fragment, though this would require a schema change to webrequest and I understand that might not be desirable.
- In the pageview_info map, though including it as part of the page_title parameter would likely break some of the downstream usages of this data that are not expecting section information -- e.g., when aggregating page views or joining with other tables to resolve redirects.
- As part of X-Analytics, though that might be viewed as hacky and not what X-Analytics was intended for.
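To make the first option concrete, here is a minimal sketch of how a separate uri_fragment value could be split off from a URL while leaving the rest untouched. It assumes the full URL, including the fragment, is available to whatever process populates the field. The function name `split_uri` and the returned dictionary are hypothetical and not part of any existing webrequest code.

```python
from urllib.parse import urlsplit


def split_uri(url):
    """Split a URL into host, path, and fragment (hypothetical helper).

    The fragment is returned on its own so that the path, and any page
    title derived from it, contains no section information.
    """
    parts = urlsplit(url)
    return {
        "uri_host": parts.netloc,
        "uri_path": parts.path,
        # Proposed new field: everything after the # sign, without the #.
        "uri_fragment": parts.fragment,
    }


example = split_uri(
    "https://en.wikipedia.org/wiki/Chicago#Etymology_and_nicknames"
)
print(example)
```

Keeping the fragment in its own field means that aggregations and joins keyed on the path or page title would see the same values as today, which is the main concern with the pageview_info option.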