Wed, Feb 4
On this ticket, we have consulted both SREs and our team. We have agreed on the following details:
- extract a single repo, refinery-python, from analytics/refinery to GitLab
- multiple CI outputs for this repo, with a pipeline per need:
- docker image (seems like the best solution for Airflow triggering)
- conda package
- and the repo could be pip-compatible so that it can eventually be required from conda-analytics or airflow-dags
Wed, Jan 28
Closing. A next optimization could be splitting the file_exporters job out of the main instance; that should be done in another ticket.
Following a discussion with @Ahoelzl, we can postpone that to Q4.
I've marked all failed DAG runs as success to clear the UI.
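Marking runs as success can also be done against Airflow's stable REST API (PATCH on the DAG-run resource). A sketch, assuming Airflow 2's v1 API; the instance URL and run id below are hypothetical:

```python
import json
import urllib.request

AIRFLOW_BASE = "https://airflow.example.org/api/v1"  # hypothetical instance URL

def mark_success_request(dag_id: str, run_id: str) -> urllib.request.Request:
    """Build a PATCH request setting a DAG run's state to success."""
    url = f"{AIRFLOW_BASE}/dags/{dag_id}/dagRuns/{run_id}"
    body = json.dumps({"state": "success"}).encode()
    req = urllib.request.Request(url, data=body, method="PATCH")
    req.add_header("Content-Type", "application/json")
    return req

# Hypothetical DAG id and run id; authentication headers are omitted.
req = mark_success_request("db_cleaner", "scheduled__2026-01-28T00:00:00+00:00")
```

Sending the request (with credentials) in a loop over the failed run ids clears them from the UI without clicking through each one.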
K8s execution is deployed, but we are not observing the overall performance gain we expected. We will later tweak these 2 Airflow configs:
- worker_pods_creation_batch_size
- worker_pods_queued_check_interval
At least each task is consuming less k8s resources.
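Both settings are regular airflow.cfg options, so they can be overridden per deployment through Airflow's `AIRFLOW__{SECTION}__{KEY}` environment-variable convention. A small helper illustrating the mapping (the section name and the values below are assumptions; the right numbers depend on cluster capacity):

```python
import os

def airflow_env_override(section: str, key: str, value: str) -> tuple[str, str]:
    """Map an airflow.cfg (section, key) pair to Airflow's env-var override name."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}", value

# Illustrative values only, under the assumption these keys live in
# the [kubernetes_executor] section of our Airflow version.
for section, key, value in [
    ("kubernetes_executor", "worker_pods_creation_batch_size", "16"),
    ("kubernetes_executor", "worker_pods_queued_check_interval", "30"),
]:
    name, val = airflow_env_override(section, key, value)
    os.environ[name] = val
```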
Tue, Jan 27
Build creation has been moved here: https://gitlab.wikimedia.org/repos/data-engineering/datahub-cli
- git history is preserved (the repo is actually a fork of analytics/refinery)
- CI is from workflow utils
- To set up the repo & CI, the important steps are:
- remove lfs support: https://gitlab.wikimedia.org/repos/data-engineering/datahub-cli/edit#js-shared-permissions
- create a token which will be used within the CI: https://gitlab.wikimedia.org/repos/data-engineering/datahub-cli/-/settings/access_tokens
- add variables https://gitlab.wikimedia.org/repos/data-engineering/datahub-cli/-/settings/ci_cd#js-cicd-variables-settings including one containing the token CI_PROJECT_PASSWORD
With T415357 I’ve already started extracting the Python conda environment build for analytics/refinery into GitLab CI.
- Option 1:
- trimming the XComs passed downstream (keeping only the necessary fields)
- computing each task's parameters in a pre_execution function (one per task)
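The XCom-trimming part of Option 1 could be a plain filter applied before pushing, so downstream tasks only ever see the fields they read. A minimal sketch; the payload and field names are hypothetical:

```python
def slim_xcom(payload: dict, keep: tuple[str, ...]) -> dict:
    """Keep only the fields downstream tasks need before pushing to XCom."""
    return {k: payload[k] for k in keep if k in payload}

# Hypothetical upstream result: push only what downstream tasks consume.
full = {"table": "wmf.events", "partition": "2026-01-28", "row_count": 123456,
        "spark_conf": {"executor.memory": "8g"}, "tmp_paths": ["/tmp/a", "/tmp/b"]}
slimmed = slim_xcom(full, keep=("table", "partition"))
# → {'table': 'wmf.events', 'partition': '2026-01-28'}
```

Keeping XComs small matters because they are serialized into the Airflow metadata database on every push.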
The first deploy crashed and was reverted.
I was blocked by a missing connection from k8s to the eventgates.
It was fixed by SREs: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1229524
Now testing on dev-env before retrying a deploy.
Tue, Jan 20
Duplicate of T411999
Mon, Jan 19
Deployment of db_cleaner dag on Airflow instances went mostly well.
Tue, Jan 13
Last notebook is here:
https://gitlab.wikimedia.org/hghani/movement-insights-requests/-/blob/main/SDS%201.3/client-side/simple-client_analysis_summary.ipynb?ref_type=heads
I reviewed it.
Mon, Jan 12
As we are discussing the limits of the current system, which put strain on Airflow, this now-old refactoring idea no longer seems a priority.
Dec 10 2025
Patch adding pg_stat to analytics_test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1217138
Dec 8 2025
We have already merged 2 features to improve on that:
- all the preparation tasks are gone
- all evolve+refine are now a single task
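Merging evolve+refine into a single task amounts to composing the two steps inside one callable, so the scheduler only tracks one task instance instead of two. A sketch with hypothetical stand-in functions (the real steps operate on Spark data, not dicts):

```python
def evolve(batch: list[dict]) -> list[dict]:
    # Hypothetical stand-in for the schema-evolution step.
    return [{**row, "schema_version": 2} for row in batch]

def refine(batch: list[dict]) -> list[dict]:
    # Hypothetical stand-in for the refine step: drop invalid rows.
    return [row for row in batch if row.get("valid", True)]

def evolve_and_refine(batch: list[dict]) -> list[dict]:
    """Single task body replacing the former evolve -> refine task pair."""
    return refine(evolve(batch))
```

Besides halving the task count, this removes one XCom handoff and one scheduling round-trip per partition.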
Dec 4 2025
Nov 24 2025
Nov 17 2025
Let's create DAGs to import datasets from spur.us:
- anonymous+residential
- dc
- geoips
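The three imports could share one DAG factory parametrized by dataset. A pure-Python sketch of the parametrization; the dag_id and table naming conventions below are assumptions (modeled on wmf_traffic.spur_feed), not the real ones:

```python
from dataclasses import dataclass

SPUR_DATASETS = ("anonymous-residential", "dc", "geoips")

@dataclass(frozen=True)
class SpurImport:
    dataset: str

    @property
    def dag_id(self) -> str:
        return f"spur_{self.dataset.replace('-', '_')}_import"

    @property
    def hive_table(self) -> str:
        # Assumed naming convention, mirroring wmf_traffic.spur_feed.
        return f"wmf_traffic.spur_{self.dataset.replace('-', '_')}"

imports = [SpurImport(d) for d in SPUR_DATASETS]
# e.g. imports[0].dag_id → 'spur_anonymous_residential_import'
```

Each SpurImport would then feed a shared DAG-building function, so adding a fourth feed is a one-line change.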
Nov 14 2025
Waiting for client side signal for more Spur.us dataset evaluations.
Oct 24 2025
All September is imported:
desc wmf_traffic.spur_feed;
col_name        data_type        comment
ip              string           NULL
organization    string           NULL
as              struct<number:bigint,organization:string>        NULL
client          struct<behaviors:array<string>,concentration:struct<city:string,country:string,density:double,geohash:string,skew:bigint,state:string>,count:bigint,countries:bigint,proxies:array<string>,spread:bigint,types:array<string>>        NULL
tunnels         array<struct<anonymous:boolean,entries:array<string>,exits:array<string>,operator:string,type:string>>        NULL
services        array<string>    NULL
location        struct<city:string,state:string,country:string>    NULL
risks           array<string>    NULL
snapshot        string           NULL
# Partition Information
# col_name      data_type        comment
snapshot        string           NULL
Time taken: 0.239 seconds, Fetched 12 row(s)
All September is imported.
desc aqu.20251023_bot_ips_study;
col_name                                 data_type        comment
ip                                       string           NULL
pageviews_count                          bigint           NULL
legacy_reasons                           array<string>    NULL
legacy_automated_pageviews_proportion    double           NULL
hap_flagged_request_proportion           double           NULL
spur_risks                               array<string>    NULL
spur_proxies                             array<string>    NULL
year                                     int              NULL
month                                    int              NULL
day                                      int              NULL
# Partition Information
# col_name      data_type        comment
year            int              NULL
month           int              NULL
day             int              NULL
Time taken: 0.147 seconds, Fetched 15 row(s)
Oct 14 2025
Awesome study. Thanks for adding the split by countries.
Here is the new table: wmf_traffic.spur_feed, partitioned by snapshot; each partition is split into 8 Parquet files. The rows are views of Spur's anonymous-residential feed.
Per https://docs.spur.us/feeds/types?id=custom-feeds&utm_source=chatgpt.com#anonymous-residential-feed, these are all bad actors.
snapshot    count(1)
20250430    56142741
20250530    57991550
20250918    60518449
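The per-snapshot counts come from a query of this shape (a sketch to be run via Spark SQL on the cluster; only the table name is taken from above):

```python
# SQL producing one row count per snapshot partition of the feed table.
SNAPSHOT_COUNTS_SQL = """\
SELECT snapshot, COUNT(1)
FROM wmf_traffic.spur_feed
GROUP BY snapshot
ORDER BY snapshot
"""
```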
Oct 13 2025
Done. That took ~3.5h and ~1h respectively.