Mon, Mar 18
Wed, Mar 13
@chelsyx thanks for the clarification!
We could do it on one of these EU evenings?
Sure! I'll be on vacation, though, starting this Fri 15th (included), and will be back on Mon 25th.
Tue, Mar 12
Great, I already accepted the invite. Thank you!
Mon, Mar 11
@ezachte Issue 1 has changed since your initial description.
We no longer have intervals; instead we round to the thousands. However, there are still some non-human-friendly values, like 304.33M.
What do you think would be a good format?
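In case it helps the discussion, here's one possible rule sketched in Python (just a suggestion, not what the code does today): keep at most one decimal, and drop it once the value reaches the hundreds of a unit, so 304.33M would render as 304M while 1.23M stays 1.2M.
```python
def human_format(n):
    """Format a count with a metric suffix, keeping at most one decimal."""
    for suffix, factor in (("B", 1e9), ("M", 1e6), ("K", 1e3)):
        if abs(n) >= factor:
            value = n / factor
            # Fewer decimals for larger mantissas keeps the label short.
            decimals = 0 if value >= 100 else 1
            return f"{value:.{decimals}f}{suffix}"
    return f"{n:.0f}"

print(human_format(304_330_000))  # -> "304M"
print(human_format(1_230_000))    # -> "1.2M"
```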
@elukey we were here in grosking and agreed to look into this, to be sure there are no action items left before closing.
@mpopov Hi! Do you need google-auth here, or did you manage with the others?
Much of this data may be coming from bots as well, see: T210006
We have not been able to reproduce this issue. But maybe @elukey has an idea of how to solve it?
Hi @JAllemandou :]
Can you outline what remains to be done here please?
Hi @MNeisler! We'd like to have this done by the end of this quarter. Is there anything we can do? I can help you build a job that loads that data. Maybe we can have a meeting and you can pass me the requirements for the data set.
Fri, Mar 8
Thu, Mar 7
Re-opening; we need to check whether it is deployed and restarted.
This task is done. Resolving.
Wed, Mar 6
Tue, Mar 5
Thu, Feb 28
@GoranSMilovanovic I found 2 schemas that I wasn't aware of, that I believe belong to WMDE:
WMDEBannerEvents and WMDEBannerSizeIssue.
I didn't delete them yet, in case their owner wasn't aware of the 90-day deletion policy.
Are you the owner of those schemas?
The deletion is finished now. The event database now contains only the last 90 days of data.
We're still deleting the corresponding Hive partitions (metadata); this may cause some warnings when querying the data.
The partition metadata sync is estimated to finish in about 24 hours.
Ok, starting the deletion now.
@chelsyx thanks for the check!
@Tbayer thanks for the check!
Wed, Feb 27
Here's the plan for tomorrow (a rough sketch of the first steps follows the list):
- Collect the names of all tables in the event database that belong to the EL pipeline.
- For each table T, delete /wmf/data/event/T/year=2017 and /wmf/data/event/T/year=2018/month=M with M in (1,2,...10).
- For each table T, execute an msck repair table T command in Hive.
- Execute the deletion script (refinery-drop-older-than) once with --older-than=90 to delete the remaining days due for deletion.
- Productionize a systemd timer that calls the deletion script periodically (every day) in puppet.
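To make the first three steps concrete, here's roughly what they could look like if scripted (illustrative only: the table filtering and the hive/hdfs invocations are simplifications, not the exact commands we'll run).
```python
import subprocess

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Tables in the event database (in practice filtered down to only the
#    tables that belong to the EL pipeline).
tables = run(["hive", "-e", "USE event; SHOW TABLES;"]).split()

for t in tables:
    # 2. Drop the HDFS data for 2017 and Jan-Oct 2018.
    run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", f"/wmf/data/event/{t}/year=2017"])
    for m in range(1, 11):
        run(["hdfs", "dfs", "-rm", "-r", "-skipTrash",
             f"/wmf/data/event/{t}/year=2018/month={m}"])
    # 3. Re-sync Hive's partition metadata with what is left on HDFS.
    run(["hive", "-e", f"USE event; MSCK REPAIR TABLE {t};"])
```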
Mon, Feb 25
@leila, +1 to Nuria, but I think you're just confused by my latest message.
We already discussed this from the 14th of November onwards.
As I understood it from @Miriam's comments, you guys copied the CitationUsage schemas over to another location, right?
So, the script won't affect that copy.
I was just giving a last ping after final deletion.
Please, confirm :]
Fri, Feb 22
Sorry for the mass ping.
@chelsyx @mpopov @Neil_P._Quinn_WMF (cc @Nuria)
I backfilled both schemas since the discussed dates: event_sanitized.VisualEditorFeatureUse since 2018-10-24 and event_sanitized.MobileWikiAppShareAFact since 2018-06-21. I vetted the resulting data and it all looks good to me, but please give a quick check to confirm.
Thanks for the changes you guys made! Now I will proceed to productionize the purging script that will delete events older than 90 days from the raw events database (event). This will happen on Feb 28th.
@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
Note that on the 28th of this month (in one week) we'll run the purging script to delete data older than 90 days from the event database, so backfills like this will no longer be possible (beyond 90 days).
Thu, Feb 21
Wed, Feb 20
Sure, we can discuss here or have a meeting, whatever works better for you. I also just talked to @Neil_P._Quinn_WMF about whether we should extract the data from mediawiki_history to an intermediate Hive table and then load from that one, or just use Druid transforms to ingest directly from mediawiki_history. I lean towards the second option, because it doesn't need the extra step (a table that would have to be maintained). But let's discuss!
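To make option 2 a bit more concrete, here's a very rough sketch (field names and the expression are made up, and this is not our actual ingestion config) of the transformSpec fragment a Druid ingestion spec could use to derive dimensions and filter rows while reading mediawiki_history directly, instead of materializing an intermediate Hive table first.
```python
# Hypothetical fragment of a Druid ingestion spec's transformSpec.
transform_spec = {
    "transforms": [
        {
            "type": "expression",
            "name": "user_type",  # hypothetical derived dimension
            "expression": "if(event_user_is_anonymous == 'true', 'anonymous', 'registered')",
        },
    ],
    # Keep only the rows we actually want in the datasource.
    "filter": {"type": "selector", "dimension": "event_entity", "value": "revision"},
}
```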
@hashar Thanks a lot for all the dedicated work!!
Tue, Feb 19
@Ottomata I tried the simple ALTER TABLE, and it works, provided the field whose type you want to change is a top-level field. In our case, we want to change the type of a subfield of the event struct. When you do that, the whole event field becomes unreadable for all the data that was already in the table. I think the problem lies in the serde (the Parquet serde?). That is why we needed to backfill the entire table since 2017-11. However, we only had raw events for the last 3 months, so part of the backfilling was done from a temp copy of the old data.
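For reference, here's roughly what I mean (the table and field names below are made up; only the event struct is real, and this assumes a Parquet-backed Hive table).
```python
import subprocess

def hive(query):
    """Run a HiveQL statement via the hive CLI (assumes hive is on the PATH)."""
    subprocess.run(["hive", "-e", query], check=True)

# Changing the type of a top-level column works in place:
hive("ALTER TABLE event.SomeSchema CHANGE COLUMN revision revision BIGINT;")

# Changing a subfield of the event struct means redefining the whole struct;
# with the Parquet serde, data written under the old struct definition then
# becomes unreadable, which is why we had to backfill instead:
hive("""
ALTER TABLE event.SomeSchema
CHANGE COLUMN event event STRUCT<page_id: BIGINT, action: STRING>;
""")
```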
I think this could work, but I believe it has a couple of drawbacks that we can avoid:
- It adds some extra cognitive load, especially because different fields with different orders of magnitude will need different decimal shifts. We can store the length of the shift in the schema field description, but it's still potentially confusing, I think.
- Also, it would need an explicit conversion on the client (1.2345 => 12345), and the shift length per field would have to be stored somewhere or hardcoded in the instrumentation, no? (Rough sketch after this list.)
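To illustrate the shift I mean (the SHIFTS table, field name and helpers below are hypothetical; in practice the shift would live in the schema field description or the instrumentation):
```python
# The client multiplies before sending; every consumer has to know the
# per-field shift to undo it.
SHIFTS = {"load_time": 4}  # 4 decimal places, i.e. multiply by 10**4

def encode(field, value):
    return round(value * 10 ** SHIFTS[field])  # 1.2345 -> 12345

def decode(field, value):
    return value / 10 ** SHIFTS[field]         # 12345 -> 1.2345

assert encode("load_time", 1.2345) == 12345
assert decode("load_time", 12345) == 1.2345
```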
@JAllemandou Yes. The checksum does not change when the dates change, because the --older-than=90 parameter remains the same.
Mon, Feb 18
Hey all :]
@Amire80, sure feel free to schedule one!
You need to run the script once without the --execute flag (dry run). The checksum will be printed by the script at the end.
To *really* execute the script, run it again with the same parameters and with --execute <checksum>.
The puppet systemd-timer should thus include the checksum.
This checksum mechanism is intended to force us to do a dry run first and, hopefully, check the printed output before executing for real.
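A minimal sketch of that dry-run / checksum pattern, assuming a made-up checksum function (the real refinery-drop-older-than computes it differently):
```python
import hashlib
import sys

def checksum(args):
    # The checksum covers only the parameters, which is why it stays stable as
    # the dates roll forward, as long as --older-than=90 etc. don't change.
    return hashlib.sha1(" ".join(sorted(args)).encode()).hexdigest()[:10]

def main(args, execute=None):
    expected = checksum(args)
    if execute is None:
        # Dry run: print what would be deleted and the checksum to pass next time.
        print(f"DRY RUN; re-run with --execute {expected} to really delete")
    elif execute == expected:
        print("Checksum matches the parameters: executing the deletion for real")
    else:
        sys.exit("Checksum does not match the given parameters; refusing to run")
```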