
Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate
Open, Needs Triage, Public

Description

As of 2023-12, the legacy MediaWikiPingback EventLogging schema is the only remaining schema that needs to be migrated to Event Platform: T323828: Update Pingback to use the Event Platform

This migration is blocking T238230: Decommission EventLogging backend components by migrating to MEP.

As described in this comment, MediaWikiPingback is very useful, and we don't want to disable collection of this data from the legacy EventLogging backend until older versions of MediaWiki are basically unused. This could take as long as 5 years.

Instead of waiting 5+ more years to decommission the EventLogging backend, we should build an anti-corruption-layer proxy to intake these events, translate them into Event Platform compatible events, and send them to eventgate.

It would be nice to make this endpoint generic enough to do this for any legacy event. However, since MediaWikiPingback is the ONLY remaining legacy event, it would be sufficient to hardcode the logic for MediaWikiPingback alone.

Suggested implementation
  • Endpoint created specifically for mediawiki.org pingback intake. This can use a mediawiki-config/docroot/mediawiki.org/beacon/event PHP file to avoid having to do any custom frontend routing from /beacon/event to a different backend. This endpoint would (see the sketch after this list):
    • Parse the JSON event from the URL-encoded query parameters
    • Modify the parsed event to the Event Platform standard (add fields like meta.stream, etc.)
    • POST the event to eventgate-analytics-external with hasty=true
    • Return the response to the client
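
For illustration, here is a minimal sketch of what such a docroot script could look like. The stream name, $schema URI, eventgate URL, and timeout values below are assumptions for the sketch, not values from the actual patch.

<?php
// Hypothetical sketch of mediawiki-config/docroot/mediawiki.org/beacon/event.php.

// 1. The legacy EventLogging client sends the whole JSON event as the
//    URL-encoded query string, so decode that directly.
$raw = rawurldecode( $_SERVER['QUERY_STRING'] ?? '' );
$legacyEvent = json_decode( $raw, true );
if ( !is_array( $legacyEvent ) || !isset( $legacyEvent['schema'] ) ) {
    http_response_code( 400 );
    exit;
}

// 2. Augment it into an Event Platform compatible event. The naming below follows
//    the usual legacy-migration conventions but is an assumption in this sketch.
$schemaName = $legacyEvent['schema'];
$event = $legacyEvent + [
    '$schema' => '/analytics/legacy/' . strtolower( $schemaName ) . '/1.0.0',
    'meta' => [
        'stream' => 'eventlogging_' . $schemaName,
        'domain' => $_SERVER['HTTP_HOST'] ?? 'www.mediawiki.org',
    ],
    'client_dt' => gmdate( 'Y-m-d\TH:i:s\Z' ),
];

// 3. POST to eventgate with hasty=true so it responds 202 without waiting for Kafka.
//    The URL here is a placeholder.
$ch = curl_init( 'https://eventgate-analytics-external.example/v1/events?hasty=true' );
curl_setopt_array( $ch, [
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode( [ $event ] ),
    CURLOPT_HTTPHEADER => [ 'Content-Type: application/json' ],
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 1,
    CURLOPT_TIMEOUT => 2,
] );
curl_exec( $ch );
$code = curl_getinfo( $ch, CURLINFO_RESPONSE_CODE ) ?: 500;
curl_close( $ch );

// 4. Relay eventgate's status back to the client.
http_response_code( $code );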

We still need to do the migration steps for T323828. This proxy endpoint would just allow the old installs of MediaWiki to keep sending events.

Event Timeline

After a discussion in Slack, I have changed the suggested implementation to use mediawiki-config/docroot/mediawiki.org. This would make the solution work only for mediawiki.org/beacon/event, but would avoid any need for custom routing or custom deployment. MediaWikiPingback sends events to mediawiki.org, so this would suffice to unblock the EventLogging backend decommission.

Ottomata updated the task description.

Change 985023 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] WIP - create eventlogging-processor legacy proxy to eventgate for mediawiki.org

https://gerrit.wikimedia.org/r/985023

suggested implementation to use mediawiki-config/docroot/mediawiki.org

I think the only features I can no longer support are recvFrom and seqId. recvFrom is set to the varnish cache host that receives the request, and seqId is the varnishkafka sequence number. I might be able to get recvFrom from an HTTP header, but seqId is generated by varnishkafka and set in the raw client-side event message it sends to Kafka.

I'm not aware of anything using seqId, and either uuid or meta.id should suffice for Refine deduplication.
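
For illustration, a sketch of how the proxy could populate those id fields; whether the actual patch sets both uuid and meta.id (or relies on eventgate defaults instead) is an assumption here.

// Sketch only: give each event a fresh v4 UUID as both the legacy capsule 'uuid'
// field and Event Platform meta.id, so Refine deduplication keeps working.
$event = $event ?? [ 'meta' => [] ]; // event as built up elsewhere in the script

$bytes = random_bytes( 16 );
$bytes[6] = chr( ( ord( $bytes[6] ) & 0x0f ) | 0x40 ); // UUID version 4
$bytes[8] = chr( ( ord( $bytes[8] ) & 0x3f ) | 0x80 ); // RFC 4122 variant
$uuid = vsprintf( '%s%s-%s-%s-%s-%s%s%s', str_split( bin2hex( $bytes ), 4 ) );

$event['uuid'] = $uuid;
$event['meta']['id'] = $uuid;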

I might be able to set the ip field to the client IP, usually parsed and provided by varnish in the X-Client-IP header, but I don't think we need it for MediaWikiPingback, and we shouldn't collect it if we don't. I'll not support it.

A few infra-level questions:

  1. PHP execution.

Afaik PHP execution is limited for security reasons to only specific directories. This will thus likely need a puppet change first to Apache config to allow this directory to execute PHP.

Given the transition to Docker/Helm/Kubernetes etc., this will also need a corresponding change there, since that setup has its own copy of the Apache config. To my knowledge, the latter is not something that is currently documented to be possible to iterate/test locally or in the beta cluster, but perhaps SRE can help you with porting it once you have it working in Puppet for the Beta Cluster. It should be a fairly trivial change.

  2. Volume.

What's the traffic like? In particular, do we know of spikes or peaks in the past 90 days? What's the highest peak in the last 90 days, and what's the typical peak per day?

  3. Error log / visibility.

I've commented on the Gerrit change with a suggestion to use trigger_error. This way it would naturally go to the php-fpm syslog, which is already ingested and monitored, has Logstash dashboards that people know to query and look at, and works on both K8s and baremetal.

I suggest setting up a Logstash query that searches for php-fpm logs and then filters on a prefix that identifies your errors, e.g. format your error messages like EventLoggingConverter: Got HTTP $code response so that you can find them easily with a query for "EventLoggingConverter".
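
Roughly, something like this in the script would do it (sketch only; the message wording just follows the example above):

// Sketch: log a non-2xx eventgate response to the php-fpm syslog with a greppable prefix.
$code = $code ?? 0; // HTTP status from the eventgate request elsewhere in the script
if ( $code < 200 || $code >= 300 ) {
    trigger_error( "EventLoggingConverter: Got HTTP $code response from eventgate", E_USER_WARNING );
}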

Assuming your team has some kind of habit, rotation, or shared bookmark with Logstash and Grafana dashboard to look at, you could include this one to periodically monitor them. Alternatively, you could include a statsd increment in the script and easily set an alert on that via the operations/alert repository.

Example: wmf-config/src/Profiler.php increments statsd. The address for statsd can be read from ProductionSettings.php via ServiceConfig::getLocalService().
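
For illustration, a rough sketch of such a statsd increment over plain UDP (the metric name, and sending raw UDP directly rather than reusing Profiler.php's helper, are assumptions; the address would come from ServiceConfig::getLocalService() as noted above):

// Sketch: increment a statsd counter, e.g. on conversion/forwarding errors.
function incrementStatsd( string $statsdAddress, string $metric ): void {
    [ $host, $port ] = explode( ':', $statsdAddress ) + [ 1 => '8125' ];
    $socket = @fsockopen( 'udp://' . $host, (int)$port, $errno, $errstr, 1 );
    if ( $socket !== false ) {
        // statsd counter wire format: <metric>:<value>|c
        fwrite( $socket, "$metric:1|c" );
        fclose( $socket );
    }
}

incrementStatsd( 'statsd.example:8125', 'eventlogging_converter.errors' );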

  4. Timeouts.

There don't seem to be any timeouts specified in the patch currently. Setting a timeout there might make sense. In addition to limiting how long the Apache worker can stay active (through a timeout on the curl request), we'd likely also want to think about how long we're keeping the client connection open.

Generally for this kind of beacon or intake we respond with HTTP 204 or HTTP 202 immediately, and then do the rest post-send. I've commented on the Gerrit change with how to use fastcgi_finish_request, which is what we use in excimer-ui-server and mediawiki-core as well (Codesearch).
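
Roughly, that respond-early pattern could look like the following in the intake script (a sketch only: it uses a PHP stream for the forwarding just for brevity, whereas the patch uses curl, where CURLOPT_CONNECTTIMEOUT / CURLOPT_TIMEOUT are the equivalent knobs; the eventgate URL and timeout values are placeholders):

// Sketch: acknowledge the beacon immediately, then forward to eventgate post-send.
$event = json_decode( rawurldecode( $_SERVER['QUERY_STRING'] ?? '' ), true ) ?: [];

ignore_user_abort( true );
http_response_code( 202 );
if ( function_exists( 'fastcgi_finish_request' ) ) {
    // Flush the 202 and close the client connection (php-fpm only);
    // everything below then happens post-send.
    fastcgi_finish_request();
}

// Explicit timeout so a slow eventgate cannot keep the worker busy indefinitely.
$ctx = stream_context_create( [ 'http' => [
    'method'  => 'POST',
    'header'  => "Content-Type: application/json\r\n",
    'content' => json_encode( [ $event ] ),
    'timeout' => 3,
] ] );
@file_get_contents( 'https://eventgate-analytics-external.example/v1/events?hasty=true', false, $ctx );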

Volume.

What's the traffic like? In particular, do we know of spikes or peaks in the past 90 days? What's the highest peak in the last 90 days, and what's the typical peak per day?

If 30 days is sufficient, then we can get a rough answer to your questions from Turnilo:

https://turnilo.wikimedia.org/#webrequest_sampled_128/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADTjTxKoY4DKxaG2A5lHyh+NTDAAO3GhGJC8AM0RRiAXyYYAtsWQ5i+EAFE05APQBVACoBhRiAUQEaYjXkBtUGgCeE/QS36TDTECgYA+mEBdgEACi5YACbuoAkwNOhYuAQxAMwADAAidlDOEvikAIwaPn6E6HL0qgC66p41BiUyAnbBoQRpEGES6JR2cOQcWSAQuExgiDBy+G4gpgBGxOM4psRgxNj0TUzYmPSKympHIFASSGjJIN6+BlFMCRDa2FCZ4ZEfgSBRHBsDAELQIN4DDFLBUABLFTB0YQgXq/V6PdoEd6fb44OzvYITH4EOBQcj7d7ddSED4Q/AghAIZpMFQyJZ4DwokIufbkgyUCFCJgKRGadDIp61BIhOCg+hzBa1EDUyUGbSktL6ZkgCQzEgJAr/L6ZdxXXXYfWcRFnEAC+7KoA===

There was a spike between 25-27th Dec 2023, of which the peak request rate was ~1900 requests/s.

If we need to go back further, then we'll have to query wmf.webrequest.

Volume
peak request rate was ~1900 requests/s.

I expect most of these are from the various MobileApp* schemas that have been marked for decommissioning. Not much we can do about old app versions, but do the latest MobileApps releases still have this instrumentation? I'll ask them; if they do, they should remove it.

As noted by @phuedx in the review, we should indeed have an includelist of schemas we allow: just Test and MediaWikiPingback. That volume will be very low.
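
For illustration, a minimal sketch of such an includelist check (assuming the schema name is read from the legacy capsule's schema field, as in the earlier sketch):

// Sketch: only accept legacy schemas on an explicit includelist.
$legacyEvent = json_decode( rawurldecode( $_SERVER['QUERY_STRING'] ?? '' ), true );
$allowedSchemas = [ 'Test', 'MediaWikiPingback' ];
if ( !is_array( $legacyEvent ) || !in_array( $legacyEvent['schema'] ?? '', $allowedSchemas, true ) ) {
    http_response_code( 400 );
    exit;
}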

respond with HTTP 204 or HTTP 202 immediately

Ah yes, this reminds me that I need to use the hasty=true eventgate endpoint (which, using the MediaWikiServices config, it will do). This will cause eventgate to respond immediately with a 202, which will be immediately proxied back to the original client.

peak request rate was ~1900 requests/s.

Oh, that turnilo chart is per hour (I think), and is also sampled 1/128. 1900/s seemed like a lot! So it's more like a peak of 900*128/60/60 == 32 requests/s. (I think you misread the chart; the peak I see shows '900', not 1900.)

The eventlogging-client-side topic, which is produced to by the current /beacon/event varnishkafka producer, averages about 15 events / second. During that December spike, the peak was 33 / second, so that lines up.

Oh, that turnilo chart is per hour (I think), and is also sampled 1/128. 1900/s seemed like a lot! So it's more like a peak of 900*128/60/60 == 32 requests/s. (I think you misread the chart; the peak I see shows '900', not 1900.)

Thanks for double-checking this! 1900/s didn't seem high in the context of all of our event streams combined, but it obviously would be for these legacy streams alone.

Oh, and actually, we only need to count requests to mediawiki.org/beacon/event, so:

-- In the first week of 2024, count requests to mediawiki.org/beacon/event
spark-sql (default)> select count(*) from wmf.webrequest where webrequest_source='text' and year=2024 and month=1 and day between 1 and 7 and uri_host LIKE '%mediawiki.org' and uri_path = '/beacon/event';
count(1)
5136989
Time taken: 109.947 seconds, Fetched 1 row(s)

So 5136989 in a week is ~8.5 requests per second to mediawiki.org/beacon/event.

PHP execution.
Afaik PHP execution is limited for security reasons to only specific directories. This will thus likely need a puppet change first to Apache config to allow this directory to execute PHP.

Given the transition to Docker/Helm/Kubernetes etc this will also need a corresponding change there, which has its own copy of the Apache config.

@Joe, doing some git blames leads me to ask you what we need to do here. I didn't find anything from my uneducated search in Puppet and deployment-charts for this. From what I can tell, the docroot should be allowed to execute PHP? But I'm probably wrong; please point me to the correct places.