Add revision ID to X-Analytics header
Closed, Resolved, Public

Description

On the moderator tools team we want to analyse how many readers might have seen vandalism content before it is reverted. To do this we want to count pageviews while certain revisions are visible on a page.

This will help us understand the impact that Automoderator will have. Our hypothesis is that if a community uses Automoderator, fewer readers will see bad content on their project, because it will be reverted more quickly. Although we are already planning to measure the time between content being added and being reverted, being able to track pageviews would give us a much clearer sense of the impact this has on readers.

@mpopov suggested that we could record the revision ID in the X-Analytics header, and then use a list of reverted revisions to find how many pageviews loaded that now-reverted content.
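
A minimal sketch of the counting step suggested above, assuming pageview records have already been reduced to dicts holding a parsed X-Analytics map. The record shape, the function name `count_pageviews_on_reverted`, and the sample IDs are hypothetical and only illustrate the idea:

```python
from collections import Counter


def count_pageviews_on_reverted(pageviews, reverted_rev_ids):
    """Count pageviews per revision, keeping only revisions that were later reverted.

    pageviews: iterable of dicts with an "x_analytics" mapping (hypothetical shape).
    reverted_rev_ids: set of integer revision IDs known to have been reverted.
    """
    counts = Counter()
    for view in pageviews:
        raw_rev_id = view.get("x_analytics", {}).get("rev_id")
        if raw_rev_id is None:
            continue  # the header did not carry a revision ID for this request
        try:
            rev_id = int(raw_rev_id)
        except (TypeError, ValueError):
            continue  # skip malformed values rather than guessing
        if rev_id in reverted_rev_ids:
            counts[rev_id] += 1
    return counts


# Made-up sample data: two views of a reverted revision, one of a good one.
sample_pageviews = [
    {"x_analytics": {"page_id": "10", "rev_id": "1001"}},
    {"x_analytics": {"page_id": "10", "rev_id": "1001"}},
    {"x_analytics": {"page_id": "11", "rev_id": "2002"}},
    {"x_analytics": {"page_id": "12"}},
]
reverted = {1001}
print(count_pageviews_on_reverted(sample_pageviews, reverted))
```

In practice this would more likely be a join between wmf.webrequest and a table of reverted revisions; the loop only shows the logic.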

Notes

  1. Only the app servers know the revision ID of the page that's being requested. The app servers have to propagate the information in the header
  2. The code that sets the X-Analytics header on the appservers lives in the WikimediaEvents extension: https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/05d1de36160d047e4a19a14cd250f1082d3f3c02/includes/WikimediaEventsHooks.php#74
  3. The raw header is converted to a map by the refine_webrequest_hourly job with no further processing so any key-value-pair added to the header anywhere in the pipeline will appear in wmf.webrequest automagically
  4. The code linked in note 2 has access to a Title object, which represents the title of the current page
  5. Title::getLatestRevisionID() returns the ID of the latest revision associated with the current page
  6. The keys and possible values of the X-Analytics header are documented here: https://wikitech.wikimedia.org/wiki/X-Analytics
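
To illustrate notes 2 and 3, here is a rough sketch of how a `rev_id` key could be appended to the header and later turned into a map. It assumes the documented `key=value;key=value` layout of X-Analytics; the helper names `parse_x_analytics` and `set_x_analytics_key` are hypothetical and are not the actual WikimediaEvents or refinery code:

```python
def parse_x_analytics(header):
    """Turn a raw 'k1=v1;k2=v2' X-Analytics string into a dict of strings."""
    pairs = {}
    for item in header.split(";"):
        item = item.strip()
        if not item or "=" not in item:
            continue  # ignore empty or malformed segments
        key, value = item.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def set_x_analytics_key(header, key, value):
    """Return a new header string with key set to value (added or overwritten)."""
    pairs = parse_x_analytics(header)
    pairs[key] = str(value)
    return ";".join(f"{k}={v}" for k, v in pairs.items())


# Hypothetical header as the app server might emit it before the change.
before = "ns=0;page_id=123"
after = set_x_analytics_key(before, "rev_id", 456)
print(after)
print(parse_x_analytics(after))
```

Because the map is built from whatever pairs are present, a new key needs no change on the parsing side, which is the point of note 3.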

Event Timeline

Samwalton9-WMF renamed this task from Add revision ID to x-analytics header to Add revision ID to X-Analytics header. Sep 14 2023, 3:13 PM
Samwalton9-WMF created this task.
phuedx added subscribers: VirginiaPoundstone, phuedx.

@VirginiaPoundstone: Being bold, I've added some detail about how this could be implemented to the description. Since this task is clearly scoped/low risk, I've added the good first task tag. I'd estimate this as a 1 or 2 (for those unfamiliar with the MediaWiki ecosystem).

@VirginiaPoundstone: I've checked in with @Samwalton9-WMF & @KCVelaga_WMF and it is not urgent as the first deployment of Automoderator is planned for Q3/Q4. So anytime in Q2 would work :)

For record keeping: when I checked in with Data Platform Engineering it sounded like this would need to be done in coordination with Traffic team. They would modify X-Analytics and then Data Products would follow up on the data pipeline side to make sure that information propagated across webrequests-derived datasets.

@KOfori: Can you please confirm if that understanding is correct? Or is this something Data Products can do by themselves (seeing as Sam has figured out & documented how to do it) and you'd just like to have visibility on it?

I could have been clearer in my original comment/notes – AIUI:

  1. Only the app servers know the revision ID of the page that's being requested; therefore, the app servers have to propagate the information in the header
  2. The code that sets the X-Analytics header lives in the WikimediaEvents extension
  3. The raw header is converted to a map by the refine_webrequest_hourly job with no further processing so any key-value-pair added to the header anywhere in the pipeline will appear in wmf.webrequest automagically

If I'm wrong about #1, then please LMK and I apologise in advance for the noise!

phuedx updated the task description.

As we're coming to the end of Q2 it would be helpful to understand what the next steps are on this / who would be responsible - should we as a team be able to make this code change, or should we be coordinating with another team?

I think we're still waiting for confirmation that the approach that I've detailed is sufficient (see T346350#9175071). If it is, then we should probably announce the data change to Data Platform Engineering and Research and Data Science in order to test whether there'll be any issues up front. The change that I've described is fairly simple, so your team should be able to make it in isolation.

  1. Only the app servers know the revision ID of the page that's being requested therefore the app servers have to propagate the information in the header

Tagged SRE Traffic per https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Traffic/Request

@KOfori: The request for your team is to add the revision ID of wiki pages to the X-Analytics header, ideally early in January.

We (Traffic) can take care of the CDN side of things, mainly Varnish at the moment, but we need to get that data from the appservers/MediaWiki. Somebody needs to add the revision ID at https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/05d1de36160d047e4a19a14cd250f1082d3f3c02/includes/WikimediaEventsHooks.php#74

We can do this :)

Change 992653 had a related patch set uploaded (by Abaris; author: Abaris):

[mediawiki/extensions/WikimediaEvents@master] Add revision ID to X-Analytics header

https://gerrit.wikimedia.org/r/992653

Change 992653 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Add revision ID to X-Analytics header

https://gerrit.wikimedia.org/r/992653

Re-opening this while we wait for the patch to be deployed and for the key-value pair to propagate to the wmf.webrequest Hive table.

Just checking in: has this happened yet? I don't seem to be authorized to query this in Superset.

-- Break down pageviews by whether namespace, page ID and revision ID are present,
-- both in the refined columns and in the raw X-Analytics map.
SELECT
  normalized_host.project,
  namespace_id IS NULL AS ns_id_is_null,
  element_at(x_analytics_map, 'ns') IS NULL AS x_ns_is_null,
  page_id IS NULL AS page_id_is_null,
  element_at(x_analytics_map, 'page_id') IS NULL AS x_page_id_is_null,
  element_at(x_analytics_map, 'rev_id') IS NULL AS x_rev_id_is_null,
  COUNT(1) AS n_pageviews
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  -- A single hour of traffic
  AND year = 2024 AND month = 2 AND day = 12 AND hour = 1
  AND is_pageview
  AND uri_host IN ('en.wikipedia.org', 'en.m.wikipedia.org', 'commons.wikimedia.org', 'commons.m.wikimedia.org')
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY project, ns_id_is_null, x_ns_is_null, page_id_is_null, x_page_id_is_null, x_rev_id_is_null
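
| project | ns_id_is_null | x_ns_is_null | page_id_is_null | x_page_id_is_null | x_rev_id_is_null | n_pageviews |
| --- | --- | --- | --- | --- | --- | --- |
| commons | False | False | False | False | False | 697988 |
| commons | False | False | False | False | True | 14203 |
| commons | False | False | True | True | True | 4056 |
| commons | True | True | True | True | True | 577042 |
| en | False | False | False | False | False | 16892865 |
| en | False | False | False | False | True | 1909898 |
| en | False | False | True | True | True | 411263 |
| en | True | True | True | True | True | 462287 |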

Pageviews mostly have the rev ID in X-Analytics. Looks like it's missing in some cases – probably worth investigating?

What's also troubling (but outside the scope of this ticket) is the missingness of other data (page ID and namespace ID) in X-Analytics. (Also worth investigating as a potential data quality issue.)
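
As a rough check on "mostly", the share of pageviews carrying a rev ID can be worked out from the result rows above. This is only arithmetic on the numbers already posted; the tuple layout and the function name `rev_id_coverage` are illustrative:

```python
from collections import defaultdict

# (project, x_rev_id_is_null, n_pageviews), copied from the query result above.
rows = [
    ("commons", False, 697988),
    ("commons", True, 14203),
    ("commons", True, 4056),
    ("commons", True, 577042),
    ("en", False, 16892865),
    ("en", True, 1909898),
    ("en", True, 411263),
    ("en", True, 462287),
]


def rev_id_coverage(result_rows):
    """Return, per project, the fraction of pageviews whose X-Analytics has rev_id."""
    total = defaultdict(int)
    with_rev_id = defaultdict(int)
    for project, rev_id_is_null, n in result_rows:
        total[project] += n
        if not rev_id_is_null:
            with_rev_id[project] += n
    return {project: with_rev_id[project] / total[project] for project in total}


for project, share in rev_id_coverage(rows).items():
    print(f"{project}: {share:.1%} of pageviews have rev_id")
```

For this one hour, that works out to roughly 86% for en and roughly 54% for commons, with most of the commons gap coming from rows where page ID and namespace are missing as well.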

Thank you for checking! I also would have expected it to be populated for all page views, but I'm also assuming that we always have a positive integer on those. Are there situations where the revision id can be 0 on a page view?

Yes. The patch doesn't add the revision ID of the page when the user is viewing its history (action=history) or is editing it (action=edit). Our definition of a pageview in refinery-core excludes action=edit but not action=history. I suspect that this could explain some (hopefully all!) of the discrepancy that @mpopov is highlighting.

Here's an example demonstrating the behaviour of the patch:

Notes:

  1. The list of actions is non-exhaustive. I can't find where the complete list of actions is either on-wiki or in MediaWiki Core
  2. action=render doesn't cause code in the XAnalytics MediaWiki extension to execute. It probably should!
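
The action-dependent behaviour described above could be approximated during analysis with something like the following. It is a sketch only: it assumes the action is visible in the URL query string, that a missing `action` parameter means the default view action, and that history and edit are the only actions without a rev ID, which the notes above say is a non-exhaustive list. The function name `expects_rev_id` is hypothetical:

```python
from urllib.parse import parse_qs, urlsplit

# Actions for which the patch is described as not adding rev_id.
# Assumption: this set is incomplete, per note 1 above.
ACTIONS_WITHOUT_REV_ID = {"history", "edit"}


def expects_rev_id(url):
    """Guess whether a request URL should carry rev_id in X-Analytics."""
    query = parse_qs(urlsplit(url).query)
    # With no explicit action parameter, MediaWiki treats the request as a view.
    action = query.get("action", ["view"])[0]
    return action not in ACTIONS_WITHOUT_REV_ID


print(expects_rev_id("https://en.wikipedia.org/wiki/Plum_(color)"))
print(expects_rev_id("https://en.wikipedia.org/w/index.php?title=Plum_(color)&action=history"))
```

A filter like this would let the remaining pageviews without a rev ID be separated from the ones that are expected to lack it.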

Oh that's brilliant! Thanks so much for looking into it and shining light on this @phuedx!

Great! I think we can mark this as resolved then?

jsn.sherman moved this task from QA to Done on the Moderator-Tools-Team (Kanban) board.

Looks good to me. Thanks for the breakdown @phuedx!

@phuedx I wondered if you (or any other subscribers here) had any insight on how Flagged Revisions would impact the data being stored here. In T348861 we've started analysing this data and it occurred to me that the 'latest' revision may not actually be the revision being seen by a user if a wiki uses Flagged Revisions, which would hide the edit from view until reviewed. Do you know whether that is already accounted for here (I don't know what FlaggedRevs does technically to hide the 'latest' revision), or whether we would need to account for it manually in the analysis?

Nice catch!

At first glance, it looks like FlaggedRevs will have an effect but I'm not sure if it's something we just have to be aware of during analysis or we have to make changes for.

If a page has pending revisions, then a logged-out user will see the last stable revision of the page and a logged-in user will see the most-recent revision of the page. This is reflected in the UI: a logged-out user will see the "Read" tab selected and a logged-in user will see the "Pending changes" tab selected. However, this is not reflected in the URL.

I noted that the above is reflected in client-side config variables:

URL: https://en.wikipedia.org/wiki/Plum_(color)

| Logged in? | wgRevisionId | wgCurRevisionId | wgStableRevisionId |
| --- | --- | --- | --- |
| Yes | 1215000469 | 1215000469 | 1209689412 |
| No | 1209689412 | 1215000469 | 1209689412 |

Thank you so much for looking into it, @phuedx!!!

So if I'm interpreting that table correctly, we can trust rev_id in X-Analytics to be the appropriate revision ID (i.e. the stable rev ID for logged-out visitors and the current rev ID for logged-in visitors) – is that right?

Asking because KC has started using this to analyze "potential exposure to vandalized content" (T348861) and we just want to ensure accuracy.

That's correct.
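
For anyone reusing this in analysis, the rule confirmed above can be written down as a small sketch. It is a simplification that assumes the page has pending revisions and that logged-in status is the only factor, as in the Plum (color) example; the function name `displayed_revision` is hypothetical and this is not how FlaggedRevs itself is implemented:

```python
def displayed_revision(logged_in, cur_rev_id, stable_rev_id):
    """Return the revision a reader sees on a page with pending revisions.

    Simplified rule from the discussion above: logged-in users see the
    most recent revision, logged-out users see the last stable revision.
    """
    return cur_rev_id if logged_in else stable_rev_id


# Values taken from the client-side config table for Plum (color).
CUR_REV_ID = 1215000469
STABLE_REV_ID = 1209689412

print(displayed_revision(True, CUR_REV_ID, STABLE_REV_ID))
print(displayed_revision(False, CUR_REV_ID, STABLE_REV_ID))
```

Both results match the wgRevisionId column in the table, which is the value rev_id in X-Analytics is expected to reflect.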