User Details
- User Since
- Jun 21 2021, 2:34 PM (260 w, 3 h)
- Availability
- Available
- LDAP User
- TChin
- MediaWiki User
- TChin (WMF) [ Global Accounts ]
Fri, Jun 12
(Added the page-analytics port to Wikitech)
Mon, Jun 8
Since this requires bumping the page change schema, patches are blocked from merging until T421237 is resolved
Wed, Jun 3
It's okay, I changed a few small things in the sql so I can just do it manually on my end
Tue, Jun 2
The mediawiki_database field is missing for page_visit events, which means we cannot calculate retention rates at the wiki level.
Suggestion: Snapshot only the global retention baseline and skip the per-wiki retention baseline for Rounds 8–10.
@amastilovic is there an easy way to selectively run modified dbt jobs in production for backfilling like what we might need above?
Mon, Jun 1
I can take a look at this; what are the ports for the pageviews and unique-devices services?
Mon, May 18
Just re-ran analytics-refinery-maven-release and it succeeded, so I guess my specific problem was transient
May 15 2026
Just tried running analytics-refinery-maven-release, it failed with:
14:12:36 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:3.0.1:prepare (default-cli) on project refinery: Unable to commit files
14:12:36 [ERROR] Provider message:
14:12:36 [ERROR] The git-push command failed.
14:12:36 [ERROR] Command output:
14:12:36 [ERROR] To https://gerrit.wikimedia.org/r/analytics/refinery/source
14:12:36 [ERROR] ! [rejected] master -> master (fetch first)
14:12:36 [ERROR] error: failed to push some refs to 'https://gerrit.wikimedia.org/r/analytics/refinery/source'
14:12:36 [ERROR] hint: Updates were rejected because the remote contains work that you do
14:12:36 [ERROR] hint: not have locally. This is usually caused by another repository pushing
14:12:36 [ERROR] hint: to the same ref. You may want to first integrate the remote changes
14:12:36 [ERROR] hint: (e.g., 'git pull ...') before pushing again.
14:12:36 [ERROR] hint: See the 'Note about fast-forwards' in 'git push --help' for details.
14:12:36 [ERROR] -> [Help 1]
May 11 2026
Took a look at this, here's what I found:
- Main menu sidebar: Has the class n-sitesupport on the <li>. This is the default on wikis with the WikimediaMessages extension.
- Top links when logged out: This is specific to the Vector 2022 skin. Has the class pt-sitesupport-2 on the <li>, but when overflowed and put into a hamburger menu it has the class pt-sitesupport. If it exists, it removes the n-sitesupport button.
- Mobile Web "Hamburger" Menu: This is from the Minerva Neue skin. The donate button is generated by the skin and bypasses the normal sidebar behavior, so it doesn't have any sitesupport id. But it does already have a data-event-name="menu.donate" attribute implemented to track clicks.
- Contact Us Portal: I have no idea where this comes from. Is it just a regular wiki page that only admins can edit?
May 8 2026
We could do this by adding page_type enum field, or a boolean is_content_namespace field
I don't think an enum would work. A content namespace is more of an abstract concept that exists outside of the normal MW-defined namespaces and could technically be any namespace, so I think using a boolean is probably simplest. An enum would only be useful as an array for CONTENT and then the actual namespace prefix itself
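A minimal sketch of how the boolean could be derived, assuming a per-wiki set of content namespace IDs is available (in MediaWiki that would come from the $wgContentNamespaces setting). The function name and the example IDs are hypothetical:

```python
def is_content_namespace(namespace_id: int, content_namespaces: frozenset) -> bool:
    """Return True when the page's namespace is configured as content on this wiki.

    Any namespace can be a content namespace, so this is a membership
    check against wiki configuration rather than a fixed enum.
    """
    return namespace_id in content_namespaces


# Hypothetical per-wiki configuration: main namespace plus one custom namespace.
EXAMPLE_CONTENT_NAMESPACES = frozenset({0, 100})

print(is_content_namespace(0, EXAMPLE_CONTENT_NAMESPACES))
print(is_content_namespace(1, EXAMPLE_CONTENT_NAMESPACES))
```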
May 7 2026
A simple version bump to node 24 in the blubber file for eventgate-wikimedia failed due to some C++ compiling error from node-rdkafka. Probably we'd have to update that in node-rdkafka-factory and eventgate first before being able to update eventgate-wikimedia.
Apr 27 2026
Forgot to add an update from the Dublin offsite: there's now a client_platform_family column, so the dataset can be split by desktop/mobile. Because we can't backfill from the beginning of time, the first week or so of the dataset doesn't have it, but afterwards it should be there
Apr 7 2026
Realized that it would be useful to have the domain where the pageview happened, so recreated the table and backfilled.
Apr 3 2026
Data is now available in the data lake under wmf_readership.active_reader_baseline.
spark-sql (default)> select count(1) from wmf_readership.active_reader_baseline;
count(1)
894215
Time taken: 13.804 seconds, Fetched 1 row(s)
Mar 24 2026
I got the same error running an airflow devenv while developing a Spark 3.3.2 DAG.
Mar 23 2026
we should cover as many wikis as feasible
That itself could be its own task, but I'm assuming that 100% sampling on every wiki is *probably* fine since this instrument will only apply to logged in users.
Mar 19 2026
Mar 16 2026
Mar 9 2026
cc: @Ottomata as a very interesting read
Mar 6 2026
Eventgate v1.28.0 is now deployed
Feb 6 2026
Jan 29 2026
I chatted with @Ottomata about this a little bit, here's what I'm going to attempt:
Jan 26 2026
Jan 6 2026
Dec 8 2025
Dec 5 2025
Dec 3 2025
@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink and this would be a good opportunity to spike on a Java pipeline instead of a quick implementation now and then the complexities of dealing with any migration pains later
Dec 2 2025
Nov 14 2025
Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?
Nov 12 2025
what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?
Nov 11 2025
Nov 3 2025
Oct 31 2025
Oct 28 2025
Talked with Andrew about this more. The main problem is that MediaWiki is active/passive, but eventgate is basically active/active. The external eventgate instances will expect traffic on both DCs, but the internal ones would see activity on the active DC (but may still get events on the passive DC).
Oct 27 2025
Wow, this was harder than I thought. So what we need to happen is to detect the active DC from mediawiki_wmf_master_datacenter, which indicates it through the datacenter label, and only alert on the active datacenter. All metrics have a site label which is (from what I can tell) the datacenter the metric is exported from.
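Roughly, the logic looks like the sketch below. It is not the actual alert rule. It assumes each sample is a (labels, value) pair and that the mediawiki_wmf_master_datacenter series with value 1 marks the active DC (that value convention is an assumption); the function names are hypothetical:

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[Dict[str, str], float]


def active_datacenter(master_samples: Iterable[Sample]) -> str:
    """Find the active DC from mediawiki_wmf_master_datacenter samples.

    Assumption: the series whose value is 1 carries the active DC
    in its `datacenter` label.
    """
    active = {labels["datacenter"] for labels, value in master_samples if value == 1}
    if len(active) != 1:
        raise ValueError(f"expected exactly one active datacenter, got {sorted(active)}")
    return next(iter(active))


def samples_to_alert_on(metric_samples: Iterable[Sample], active_dc: str) -> List[Sample]:
    """Keep only samples whose `site` label matches the active DC."""
    return [(labels, value) for labels, value in metric_samples
            if labels.get("site") == active_dc]
```

The key step is joining two differently named labels: datacenter on the master-DC metric and site on everything else.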
Oct 9 2025
Oct 8 2025
Sep 29 2025
Sep 26 2025
Hmmm ok everything is deployed now and it works fine, but I can't tell if the p99 performance got worse, or the express metrics are broken somehow (either broken beforehand and now fixed or vice-versa). What makes me suspicious is that when you look at the latency quantiles by HTTP method from before the deployment, every deployment and every instance had a GET and POST p99 of almost exactly 9.90ms. After the deployment, it's actually correlated with the amount of events it's received. I'm assuming this means that something was actually fixed somewhere, but because of this, alerts are being fired on the passive DC because of the bursty nature of events there and the latency increase that's correlated with it.
In the logs I spotted another offender
{"@timestamp":"2025-09-26T16:47:14.571Z","ecs.version":"8.10.0","log.level":"info","message":"Overriding meta.dt in event b63f71b4-d6ff-4a1f-8544-e01d11df60c3 of schema at /sparql/query/1.3.0 destined to stream wdqs-external.sparql-query from 2025-09-26T16:47:14.499Z to 2025-09-26T16:47:14.571Z.","service":{"name":"eventgate-analytics"}}
Deployed to eventgate-analytics-external and it looks stable. Proceeding to deploy to the remaining instances.
Fully deployed to eventgate-logging-external. Logs seem fine. No log spam, the duplicate dropped fields are fixed, and it's fully ingested into logstash in ECS format now. Metrics also look good. It's a bit fuzzy because this happened during the DC switchover, but looking at the dashboard it seems like all metrics still match except the only one lost is the one I stated before that's under Memory usage (sum over all pods). That doesn't concern me that much since the service is also now being picked up by the new service-utils metrics dashboard so we still have memory reporting.
Sep 23 2025
Also these three metrics stopped being reported, and I don't really know why, since from what I can tell they're Kubernetes metrics
nodejs_process_heap_used_bytes
nodejs_process_heap_total_bytes
nodejs_process_heap_rss_bytes
Deployed to eventgate-logging-external for codfw, it works in the sense that it didn't blow up, but will have to fix some stuff before I deploy the rest of it. Logs export fine in ECS, but for some reason a lot of fields are being dropped. Metrics show up in the dashboards but some need renaming, and I also forgot to add metrics for the express routes.
Sep 16 2025
I'm going to try to upgrade mw-content-history-reconcile-enrich-next to Flink 1.20 to see if it magically fixes the issue, but I won't do any work migrating from deprecated config and stuff in this ticket. If the issue doesn't get fixed, at least the update includes a feature that lets us profile the JobManager using the Flink Web UI, which could be useful.
Aug 29 2025
oo very nice!! I wonder how it'd compare to a pure java version of the deequ code. Maybe if we switch to SQL we can take the opportunity to revamp the metrics table schema?
Aug 28 2025
Aug 27 2025
Aug 26 2025
Aug 18 2025
@dcausse fyi I just deployed eventstreams with your patch
Aug 5 2025
Jul 30 2025
Jul 28 2025
assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite
Jul 25 2025
Yeah it can be closed out
Jul 21 2025
If you want to drive it, be my guest and I can help you out if needed. Or we can pair program together. Whichever you prefer
Jul 11 2025
Jul 7 2025
I should also mention that metrics eventually recovered, probably due to T383977 unintentionally doing rolling restarts of all the pods, resetting metrics. Taking a look at that ticket again since I'm already in the code.
Was digging through the logs and found a bunch of these InvalidAssignmentError requests, which are caused by setting a malformed last-event-id header. When set, this header takes precedence over the url stream parameter (which can lead to another bug if the stream names are different from the topics in the header), and it passes all of eventstreams' checks until it gets handed to KafkaSSE, where it blows up. This somehow fires both the close and finish events.
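One way to catch this earlier would be to validate the header before it reaches KafkaSSE. A rough sketch, assuming the header is a JSON array of assignment objects with a topic, a partition, and either an offset or a timestamp (that shape is an assumption here); the function name and the allowed_topics parameter are hypothetical:

```python
import json


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def parse_last_event_id(header_value, allowed_topics):
    """Parse a last-event-id header, raising ValueError when it is malformed
    or names topics outside the requested streams."""
    try:
        assignments = json.loads(header_value)
    except (TypeError, ValueError) as exc:
        raise ValueError("last-event-id is not valid JSON") from exc
    if not isinstance(assignments, list) or not assignments:
        raise ValueError("last-event-id must be a non-empty JSON array")
    for assignment in assignments:
        if not isinstance(assignment, dict):
            raise ValueError("each assignment must be a JSON object")
        if assignment.get("topic") not in allowed_topics:
            raise ValueError(f"unexpected topic: {assignment.get('topic')!r}")
        partition = assignment.get("partition")
        if not _is_int(partition) or partition < 0:
            raise ValueError("partition must be a non-negative integer")
        if not (_is_int(assignment.get("offset")) or _is_int(assignment.get("timestamp"))):
            raise ValueError("assignment needs an integer offset or timestamp")
    return assignments
```

Checking topics against the requested streams also covers the second bug, where the header and the url stream parameter disagree.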
Jul 2 2025
Jul 1 2025
Jun 30 2025
Adjusted airflow variables to use the new conda artifact. Should be good to go now. Now the only question is how long the metrics computation will take...
Jun 26 2025
Just noticed that in the metrics computation script, it deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00. We need the partition_ts column to be the max_timestamp for this case.
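A sketch of the fix, assuming the same partition_ts is used for both the cleanup DELETE and the subsequent insert. The helper names and the sentinel constant are hypothetical, not the script's actual code:

```python
from datetime import datetime

# Assumption: all-of-wiki-time runs are identified by this fixed min_timestamp.
ALL_OF_WIKI_TIME_START = datetime(2000, 1, 1)


def partition_ts_for_run(min_timestamp: datetime, max_timestamp: datetime) -> datetime:
    """Pick the partition_ts for a run.

    All-of-wiki-time runs always share the same min_timestamp, so key
    them on max_timestamp; otherwise every rerun would delete the
    metrics of every earlier all-of-wiki-time run.
    """
    if min_timestamp == ALL_OF_WIKI_TIME_START:
        return max_timestamp
    return min_timestamp


def build_dedup_delete(table: str, partition_ts: datetime) -> str:
    """Build the DELETE that clears metrics left over from a previous run."""
    return (
        f"DELETE FROM {table} "
        f"WHERE partition_ts = CAST('{partition_ts.isoformat()}' AS TIMESTAMP)"
    )
```

For example, partition_ts_for_run(datetime(2000, 1, 1), datetime(2025, 6, 1)) returns the max_timestamp, while a regular windowed run keeps its min_timestamp.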
The metrics table has no unique index I can match on to update rows so I had to match on almost every column but it worked I guess
Jun 24 2025
Since implementing the metrics segregation, we should now update the legacy metrics with the computation class before implementing the monthly metrics
spark-sql (default)> SELECT COUNT(*) AS count
> FROM wmf_data_ops.data_quality_metrics
> WHERE tags['project'] = 'mediawiki_content_history'
> AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
> ;
count
577920
May 16 2025
I think the solution is to make the code aware of both endpoints, and then pick the correct one inside the SparkSubmitOperator based off of the launcher param before it sets the rest of the config. Right now the endpoint is set by a jinja template, but by the time airflow templates the string it's probably too late?
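Something like the sketch below is what I have in mind: resolve the endpoint in plain Python from the launcher value before the rest of the config is built, instead of relying on template rendering. The launcher names, URLs, and the config key are all placeholders, and this is not the real SparkSubmitOperator code:

```python
from typing import Dict


def pick_endpoint(launcher: str, endpoints: Dict[str, str]) -> str:
    """Return the endpoint for the given launcher, failing loudly when unknown."""
    try:
        return endpoints[launcher]
    except KeyError:
        raise ValueError(f"no endpoint configured for launcher {launcher!r}") from None


def build_conf(launcher: str, endpoints: Dict[str, str], base_conf: Dict[str, str]) -> Dict[str, str]:
    """Copy the base config and set the endpoint first, so later settings can rely on it."""
    conf = dict(base_conf)
    conf["hypothetical.endpoint"] = pick_endpoint(launcher, endpoints)  # placeholder key
    return conf


# Placeholder launcher names and URLs, for illustration only.
EXAMPLE_ENDPOINTS = {
    "launcher_a": "https://endpoint-a.example",
    "launcher_b": "https://endpoint-b.example",
}
```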
May 11 2025
Deployed and updated airflow variables to use artifact v0.6.0
May 5 2025
May 2 2025
On the service-utils side, that property should've been filled in by express: