The User-Agent string should be a contextual attribute that Experimentation Lab (xLab) client libraries can populate when requested.
Once contextual attributes are managed via xLab rather than stream configs, User-Agent (UA) collection can be turned on or off without being tightly coupled to the stream config.
Context
Some thoughts by @Ottomata:
In T382173: Enable Event Platform streams to opt out of collecting User-Agent data, we added a producers.eventgate.enrich_fields_from_http_headers stream config setting. This setting instructs eventgate to enrich event data with HTTP request headers before producing the event.
This is especially useful when the header values are not available to the client (e.g. they are set by the server-side CDN).
However, when the data is available to the client, there are several advantages to making the client, rather than the server (eventgate), configurable to set the data in the event.
- eventgate is agnostic to the semantics of the events it produces. It is not 'wiki aware'. It requests global stream config from https://meta.wikimedia.org/w/api.php?action=streamconfigs. If there are per-wiki settings (via per-wiki overrides in mediawiki-config), those settings will only be available from that wiki's API endpoint, e.g. https://en.wikipedia.org/w/api.php?action=streamconfigs. MediaWiki clients have this per-wiki configuration automatically available to them.
- The desired data, e.g. the client's user-agent, might not always be in the headers of the POST request to eventgate. When MediaWiki PHP produces an event, it makes its own HTTP POST request to eventgate, distinct from the HTTP request that the original user client made to MediaWiki. For example, the Growth team's HomepageVisit instrumentation is sent from MW PHP after a user visits the MW homepage. To work around this, EventLogging manually sets the event's http.request_headers['user-agent'] field to the current MW HTTP request's 'User-Agent' header. This is a bit awkward, because MW is acting as a proxy for the real client (the user's browser that made the original HTTP request). Which request is http.request_headers meant to represent? As is, it might contain headers from multiple requests, with no way to tell which header came from which request. Does this matter?
We should add client-specific configuration (to EventStreamConfig or elsewhere, e.g. MPIC contextual attributes?) that allows clients to be configured to set specific event fields.
Ideally this would be user-agent agnostic and would instead control which headers are set in which fields, like the EventGate configuration. If this were done in EventStreamConfig, perhaps a producers.mediawiki_client.enrich_fields_from_http_headers setting?
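To make the proposal concrete, here is a minimal sketch of how a client could apply such a setting. It is illustrative only: the producers.mediawiki_client.enrich_fields_from_http_headers key is the hypothetical setting floated above, the mapping shape (dotted event field to header name) is an assumption modelled on the EventGate setting, and enrich_event is a made-up helper, not part of any existing library. Leaving already-set fields untouched is also a design choice of this sketch.

```python
import copy


def enrich_event(event, request_headers, stream_config):
    """Return a copy of `event` with fields filled in from HTTP headers.

    Assumption: the stream config holds a mapping of dotted event field
    paths to header names under the hypothetical key
    producers.mediawiki_client.enrich_fields_from_http_headers.
    """
    mapping = (
        stream_config.get("producers", {})
        .get("mediawiki_client", {})
        .get("enrich_fields_from_http_headers", {})
    )
    # HTTP header names are case-insensitive, so compare in lower case.
    headers = {name.lower(): value for name, value in request_headers.items()}

    enriched = copy.deepcopy(event)
    for field_path, header_name in mapping.items():
        value = headers.get(header_name.lower())
        if value is None:
            continue  # header not present on this request
        *parents, leaf = field_path.split(".")
        node = enriched
        for key in parents:
            node = node.setdefault(key, {})
        # Keep any value the instrumentation already set explicitly.
        node.setdefault(leaf, value)
    return enriched


# Usage: MW acting as a proxy passes the headers of the ORIGINAL user request.
config = {
    "producers": {
        "mediawiki_client": {
            "enrich_fields_from_http_headers": {
                "http.request_headers.user-agent": "user-agent"
            }
        }
    }
}
event = enrich_event({"action": "visit"}, {"User-Agent": "Mozilla/5.0 (test)"}, config)
```

Because the mapping only names headers and fields, the same mechanism would work for headers other than User-Agent without any client code changes.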
This was also discussed in Slack.
Notes
This would be a great first task for an engineer onboarding to the Experiment Platform team and learning the client library codebase and the greater system.
Acceptance criteria
- fragment/analytics/product_metrics/common includes agent.ua_string as a property (deployed)
- base schemas updated (deployed)
- client libraries populate agent.ua_string with the User-Agent string when requested (deployed)
- Refine (TransformFunctions.scala) updated to parse these UA strings when they are present in data (merged, pending deployment), e.g.
    // Only one possible source column currently, but we could add more.
    val possibleSourceColumnNames = Seq("http.request_headers.`user-agent`", "agent.ua_string")
- https://wikitech.wikimedia.org/wiki/Metrics_Platform/Contextual_attributes updated (done)
- agent_ua_string is a value that can be included in the provide_values array when configuring a stream (no specific action needed here)
- xLab has been updated to consider agent.ua_string as a contextual attribute (deployed)
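
The criteria above can be pictured end to end with a rough sketch: a client sets agent.ua_string only when agent_ua_string is requested, and a consumer then reads the UA from whichever source field is present. Everything here is an assumption made for illustration: the function names are made up, the producers.metrics_platform_client.provide_values path is assumed to be where the requested values live, and the first-match lookup order simply mirrors the Seq in the Refine snippet rather than Refine's actual logic.

```python
def populate_requested_attributes(event, stream_config, user_agent):
    """Set agent.ua_string only if the stream asked for agent_ua_string.

    Assumption: requested values are listed under
    producers.metrics_platform_client.provide_values.
    """
    requested = (
        stream_config.get("producers", {})
        .get("metrics_platform_client", {})
        .get("provide_values", [])
    )
    if "agent_ua_string" in requested and user_agent:
        event.setdefault("agent", {})["ua_string"] = user_agent
    return event


# Same order as possibleSourceColumnNames in the Refine snippet.
POSSIBLE_SOURCE_FIELDS = (
    ("http", "request_headers", "user-agent"),
    ("agent", "ua_string"),
)


def first_user_agent(event):
    """Return the first non-empty UA string among the possible source fields."""
    for path in POSSIBLE_SOURCE_FIELDS:
        node = event
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node:
            return node
    return None


# Usage: UA collection is on for this stream, so the field is populated.
config = {"producers": {"metrics_platform_client": {"provide_values": ["agent_ua_string"]}}}
event = populate_requested_attributes({"action": "click"}, config, "Mozilla/5.0 (test)")
ua = first_user_agent(event)
```

Removing agent_ua_string from provide_values leaves the event without agent.ua_string, which is the on/off switch the task describes.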