[Event Platform] Disable default collection of user agent for analytics streams
Closed, Resolved (Public)

Description

Once T382173 (Enable Event Platform streams to opt out of collecting User-Agent data) is done, we should modify the wgEventStreamsDefault setting so that the user agent is not collected by default.

To do this, we can explicitly enable user-agent collection on all existing (and relevant) streams. New streams that do not explicitly opt in will not collect the user agent.

Along the way, we should disable user-agent collection for any existing streams that don't actually need it.

Done is

  • wgEventStreamsDefault set to disable user-agent collection
  • Streams that need to continue to collect user-agent opt in by overriding the setting
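
The opt-in model described above can be sketched roughly as follows. This is only an illustration of "default off, per-stream override on": the key name `collect_user_agent`, the stream names, and the merge helper are hypothetical, not the actual wgEventStreamsDefault structure or the way the stream config code merges settings.

```python
from copy import deepcopy

# Hypothetical setting name; this task does not spell out the real key.
UA_SETTING = "collect_user_agent"


def merge_settings(defaults, overrides):
    """Recursively merge per-stream overrides into a copy of the defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Default applied to every stream: user-agent collection is off.
stream_defaults = {UA_SETTING: False}

# Illustrative streams: one opts in explicitly, one relies on the default.
streams = {
    "example.existing_stream": {UA_SETTING: True},
    "example.new_stream": {},
}

effective = {
    name: merge_settings(stream_defaults, config)
    for name, config in streams.items()
}
```

With these inputs, only `example.existing_stream` ends up with collection enabled; a stream that says nothing inherits the disabled default.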

Event Timeline

Ottomata updated the task description.

In Slack, @JMonton-WMF wrote:

I could disable the collection by default and enable it on the current streams. That won't change the current behavior, and it will disable collection for future events. But I'm assuming we want to keep it disabled on events that don't need it.

The issue is that there are around 200 events, and most of them have the http object in their schema because they include some common schemas with http included.

I've found around 40 that don't have http in their schemas, mainly mediawiki events, so I could keep it disabled on those and enable it manually on the rest.

Long story short:

  • Should I enable User Agent in all events that currently have it in their schema? No changes in current events, new events will come disabled by default.
  • Should I announce in a Slack channel or by email that we are disabling User Agent collection unless someone says something?
  • Should I try to follow each event to find if we end up using the User Agent or not? This might take a while.

Good qs! Hm. I think:

Should I enable User Agent in all events that currently have it in their schema? No changes in current events, new events will come disabled by default.

Yes.

Should I announce in a Slack channel or by email that we are disabling User Agent collection unless someone says something?

Yes, for this I think a Slack notification will suffice. Maybe #working-with-data and #talk-to-data-engineering are enough.

Should I try to follow each event to find if we end up using the User Agent or not? This might take a while.

Naw, I wouldn't bother. That would be nice but would really expand the scope of this task. The main intention here is to disable the default collection, and the event streams that have included http have in some way opted in (perhaps unintentionally, but still...) for user-agent collection.

Should I try to follow each event to find if we end up using the User Agent or not?

I suppose, if you find any that are obvious, you are welcome to disable it, but I wouldn't try too hard ;)

Change #1199246 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/mediawiki-config@master] Disable default user-agent collection.

https://gerrit.wikimedia.org/r/1199246

The Gerrit patch https://gerrit.wikimedia.org/r/1199246 disables the default user-agent collection and enables it in every stream that has an http.request_headers field in its schema. The patch is a bit big and repetitive, as it adds the same config to around 175 streams.

There are a total of 215 schemas. Some of them already had http.request_headers.user-agent enabled or disabled manually; those are not changed.

Around 175 streams have the field http.request_headers in their schemas, so user-agent collection is now enabled for them in this patch. I'm not sure if all of them are actually using the user-agent. Most of them include a common schema that includes the http object, but reviewing the lineage of 175 pipelines might be time consuming. Maybe we could review it in another task about cleaning streams.

Around 40 streams don't have the http object in their schemas. I haven't changed them in the patch, so they'll lose the user-agent. As their schemas had no field to hold it, the change shouldn't affect them.
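
For illustration, the split between streams that get an explicit opt-in and streams left on the new default could be sketched like this. It is a minimal sketch, not the script used for the patch: it assumes each stream's schema is available as a fully dereferenced JSON Schema dict (no unresolved `$ref` or `allOf`), and the stream names and schemas are made up.

```python
def schema_has_field(schema, dotted_path):
    """Return True if a JSON Schema declares the nested property at dotted_path."""
    node = schema
    for part in dotted_path.split("."):
        properties = node.get("properties", {})
        if part not in properties:
            return False
        node = properties[part]
    return True


def partition_streams(schemas_by_stream, dotted_path="http.request_headers"):
    """Split stream names into (has the field, does not have the field)."""
    with_field, without_field = [], []
    for stream, schema in sorted(schemas_by_stream.items()):
        if schema_has_field(schema, dotted_path):
            with_field.append(stream)
        else:
            without_field.append(stream)
    return with_field, without_field


# Made-up schemas: one includes the http object, one does not.
schemas_by_stream = {
    "example.analytics_stream": {
        "properties": {
            "http": {"properties": {"request_headers": {"type": "object"}}}
        }
    },
    "example.state_change_stream": {
        "properties": {"page_id": {"type": "integer"}}
    },
}

needs_opt_in, left_on_default = partition_streams(schemas_by_stream)
```

Here `needs_opt_in` would hold the stream whose schema has http.request_headers, and `left_on_default` the one that stops receiving the user-agent.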

Practically speaking, these events are the ones that are now losing the user-agent. They are mainly mediawiki state change events (page changes, revisions, etc.):

/^mediawiki\\.job\\..+/
mediawiki.centralnotice.campaign-change
mediawiki.centralnotice.campaign-create
mediawiki.centralnotice.campaign-delete
mediawiki.cirrussearch.page_rerender.v1
mediawiki.cirrussearch.page_rerender.private.v1
mediawiki.page-create
mediawiki.page-delete
mediawiki.page-links-change
mediawiki.page-move
mediawiki.page-properties-change
mediawiki.page-restrictions-change
mediawiki.page-suppress
mediawiki.page-undelete
mediawiki.recentchange
mediawiki.revision-create
mediawiki.revision-score
mediawiki.revision-score-test
mediawiki.revision_score_goodfaith
mediawiki.revision_score_damaging
mediawiki.revision_score_reverted
mediawiki.revision_score_articlequality
mediawiki.revision_score_draftquality
mediawiki.revision_score_articletopic
mediawiki.revision_score_drafttopic
mediawiki.page_prediction_change.rc0
mediawiki.article_country_prediction_change.v1
mediawiki.page_outlink_topic_prediction_change.v1
mediawiki.page_revert_risk_prediction_change.v1
mediawiki.revision-tags-change
mediawiki.revision-visibility-change
mediawiki.user-blocks-change
resource_change
resource-purge
change-prop.transcludes.resource-change
mediawiki.revision-recommendation-create
mediawiki.image_suggestions_feedback
maps.tiles_change
maps.tiles_change_bookworm.v1
mediawiki.page_change.v1
mediawiki.page_change.private.v1
mediawiki.page_change.staging.v1
mediawiki.page_content_change.v1
mw_page_content_change_enrich.error
mediawiki.dump.revision_content_history.reconcile.rc0
mediawiki.dump.revision_content_history.reconcile.enriched.rc0
mw_dump_rev_content_reconcile_enrich.error
mediawiki.content_history_reconcile.v1
mediawiki.content_history_reconcile_enriched.v1
mw_content_history_reconcile_enrich.error
rdf-streaming-updater.mutation.v2
rdf-streaming-updater.mutation-main.v2
rdf-streaming-updater.mutation-scholarly.v2
rdf-streaming-updater.mutation-staging.v2
rdf-streaming-updater.mutation-main-staging.v2
rdf-streaming-updater.mutation-scholarly-staging.v2
mediainfo-streaming-updater.mutation.v2
mediainfo-streaming-updater.mutation-staging.v2
rdf-streaming-updater.lapsed-action
rdf-streaming-updater.state-inconsistency
rdf-streaming-updater.fetch-failure
rdf-streaming-updater.reconcile
mediawiki.cirrussearch.page_weighted_tags_change.v1
cirrussearch.update_pipeline.update.v1
cirrussearch.update_pipeline.update.private.v1
cirrussearch.update_pipeline.fetch_error.v1
eventgate-logging-external.test.event
eventgate-analytics-external.test.event
eventgate-analytics.test.event
eventgate-main.test.event
eventgate-logging-external.error.validation
eventgate-analytics-external.error.validation
eventgate-analytics.error.validation
eventgate-main.error.validation
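
Note that the first entry in the list is a pattern rather than a literal stream name. As a rough sketch of how such an entry could be matched against concrete stream names, assuming that a name wrapped in slashes is treated as a regular expression and that the doubled backslashes above are an escaping artifact of the config source (the job stream name used here is only illustrative):

```python
import re


def stream_matches(configured_name, stream_name):
    """Match a stream name against a configured entry.

    Entries wrapped in slashes are treated as regular expressions
    (an assumption of this sketch); anything else must match exactly.
    """
    is_regex = (
        len(configured_name) >= 2
        and configured_name.startswith("/")
        and configured_name.endswith("/")
    )
    if is_regex:
        return re.search(configured_name[1:-1], stream_name) is not None
    return configured_name == stream_name


job_pattern = r"/^mediawiki\.job\..+/"

matches_job = stream_matches(job_pattern, "mediawiki.job.example")
matches_page_create = stream_matches(job_pattern, "mediawiki.page-create")
matches_literal = stream_matches("resource_change", "resource_change")
```

Under these assumptions the pattern covers every stream whose name starts with `mediawiki.job.`, so one entry in the list stands for a whole family of streams.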

As this change adds the same block of code to 175 streams, I'm wondering if it would be simpler to keep the user-agent collection enabled by default, and create new events with user-agent collection disabled when needed. But I'm assuming the experimentation team somehow creates streams without changing this repository, and they don't want to get the user-agent on new streams.

Change #1199246 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable default user-agent collection.

https://gerrit.wikimedia.org/r/1199246

Mentioned in SAL (#wikimedia-operations) [2025-10-30T13:47:43Z] <mfossati@deploy2002> Started scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R:

Mentioned in SAL (#wikimedia-operations) [2025-10-30T13:50:20Z] <mfossati@deploy2002> superpes, bunnypranav, javiermonton, mfossati, seanleong-wmde: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[g

Mentioned in SAL (#wikimedia-operations) [2025-10-30T14:11:22Z] <mfossati@deploy2002> Finished scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: