User Details
- User Since
- Jan 4 2022, 1:16 PM (120 w, 2 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- AQuhen (WMF)
Thu, Apr 4
Wed, Mar 27
Tue, Mar 26
Five datasets are being refined as a POC on the prod cluster, and two on the test cluster.
The Airflow PR has been merged and should be released in Airflow 2.9 in April.
Feb 22 2024
Done. The last version was built with Blubber:
Feb 21 2024
Some research:
- Each XML dumps snapshot may represent ~5.5TB (including ~1.8TB for wikidata and 1.4TB for enwiki)
- The Airflow sensor may take ~19 days to turn green. It waits until the last dump has been processed (_IMPORTED flag). Most dumps are generated in a matter of days (~4 on average, maybe). Enwiki may take 7 days. And they all wait for the wikidata dump (~19 days).
- When the sensor turns green, a heavy Spark job is launched to convert all the compressed XML to Parquet. The ~5.5TB (compressed) takes ~4.5 days to process.
- The perceived gaps are due to the non-parallelism of the DAG plus very long jobs (one heavy job prevents the other ones from running), and due to the retries (thanks for the pointer @JAllemandou). Other symptoms, same problem here I think: https://phabricator.wikimedia.org/T342911
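The sensor's wait condition above can be sketched as a simple flag check. This is a hypothetical helper, not the code in airflow-dags; the base path and wiki list are placeholders, and only the `_IMPORTED` flag name comes from the notes above:

```python
from pathlib import Path

# Per-dump success flag the sensor waits for (see the notes above).
FLAG = "_IMPORTED"

def all_dumps_imported(base_dir: str, wikis: list[str]) -> bool:
    """Return True only when every wiki's dump directory carries the flag.

    A single slow dump (e.g. wikidata, ~19 days) keeps this False, which is
    why the sensor stays non-green for so long.
    """
    return all((Path(base_dir) / wiki / FLAG).exists() for wiki in wikis)
```

Because `all()` short-circuits on the slowest dump, the sensor's duration is dominated by wikidata regardless of how fast the other wikis finish.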
Feb 13 2024
Feb 8 2024
Feb 6 2024
Feb 2 2024
Linked with: https://phabricator.wikimedia.org/T351792
Jan 31 2024
wmf.wikidata_item_page_link/snapshot=2024-01-15 looked fine from a row count perspective but 2024-01-22 snapshot was unexpectedly available and had zero rows:
select count(*) from wmf.wikidata_item_page_link where snapshot='2024-01-22';
+------+
| _c0  |
+------+
| 0    |
+------+
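A row-count guard along these lines (a hypothetical helper, not the production check) would have caught the empty snapshot before it was considered available:

```python
def assert_snapshot_populated(snapshot: str, row_count: int, minimum: int = 1) -> None:
    """Fail loudly when a snapshot's row count is below expectations.

    The 2024-01-22 snapshot above would have tripped this with row_count=0.
    """
    if row_count < minimum:
        raise ValueError(
            f"snapshot={snapshot} has {row_count} rows (< {minimum}); refusing to publish"
        )
```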
Jan 30 2024
I've added the missing partitions in the source of wikidata_item_page_link and its missing snapshot is now generated.
Jan 29 2024
I would like to add rewrite_manifests to the list of maintenance actions:
- rewrite_data_files
- expire_snapshots
- rewrite_manifests https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests
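For reference, the three maintenance actions map to Spark stored-procedure calls along these lines. The catalog and table names are placeholders, and this helper is only a sketch of how the statements could be built before being passed to `spark.sql(...)`:

```python
def iceberg_maintenance_calls(catalog: str, table: str) -> list[str]:
    """Build the Spark SQL CALL statements for the three Iceberg maintenance actions.

    Intended to be executed one by one via spark.sql(...) on an
    Iceberg-enabled Spark session.
    """
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.expire_snapshots(table => '{table}')",
        f"CALL {catalog}.system.rewrite_manifests('{table}')",
    ]
```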
Jan 25 2024
Jan 22 2024
@BTullis , @brouberol , @Stevemunene I would like your feedback on this subject:
Jan 19 2024
Jan 17 2024
Jan 16 2024
Jan 15 2024
Following the Grafana dashboard review, I've performed some changes to it:
- I distributed the graphs into 4 sections: Failures, Durations, Counts, Scheduling
- I added missing parameterization of variables (e.g. the list of operators changes when the instance is selected)
- I updated the TTLs in statsd-exporter so that Prometheus reflects the rate of the metrics sent by Airflow (e.g. when Airflow emits a metric to report one task failure, it generates one call, and we want that single data point in Prometheus)
- I isolated task failures into its own graph
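The TTL change above would look roughly like this in the statsd_exporter mapping configuration. This is a sketch, not the deployed config: the metric name, TTL values, and comments are illustrative assumptions.

```yaml
# statsd_exporter mapping config (sketch): a ttl makes a metric expire from
# the exporter after the given idle time, so Prometheus sees the rate at
# which Airflow actually emits it rather than a stale last value.
defaults:
  ttl: 2m          # assumed default: drop any metric not refreshed within 2 minutes
mappings:
  - match: "airflow.ti_failures"
    name: "airflow_ti_failures"
    ttl: 1m        # assumed: failure events should only survive ~one scrape interval
```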
Jan 9 2024
1/ Splitting the CI
Dec 21 2023
In this ticket, I needed to find a way to detect when an Iceberg table has some data in it. This would replace the Hive partition sensor when migrating a table to Iceberg.
We chose to launch a Spark application running an SQL count. It's now implemented here:
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/564
- https://gerrit.wikimedia.org/r/c/analytics/refinery/+/983673
A drawback of this solution is that it generates a FAILED Spark application each time the sensor does not find any data in the interval. When monitoring our Spark applications, we want to avoid artificially growing the FAILED counts.
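The check itself boils down to a bounded count. A sketch of the query construction (the table and column names are illustrative, not the merged airflow-dags/refinery code):

```python
def iceberg_has_data_query(table: str, ts_column: str, start_iso: str, end_iso: str) -> str:
    """SQL the sensor's Spark job could run: any row in [start, end) means
    the interval (the Iceberg equivalent of a Hive partition) is ready."""
    return (
        f"SELECT COUNT(1) AS cnt FROM {table} "
        f"WHERE {ts_column} >= TIMESTAMP '{start_iso}' "
        f"AND {ts_column} < TIMESTAMP '{end_iso}'"
    )
```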
Dec 20 2023
I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPATH=/path/to/root/of/airflow-dags/ as an env variable.
Dec 18 2023
I have the first version of the code in review.
Dec 5 2023
Dec 4 2023
Nov 30 2023
Nov 22 2023
The workaround to our GitLab-CI problem is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537
The Puppet configuration is now merged, and statsd_exporter is running on an-test-client1002. Analytics Prometheus is scraping from it, as it should.
Nov 16 2023
Nov 13 2023
We have multiple needs regarding the scheduling of Airflow DAGs & tasks.
Nov 8 2023
Nov 7 2023
Oct 27 2023
Running locally with Docker, I managed to fetch a sample:
airflow_schedulerjob_start{instance="airflow-statsd-exporter:9102", job="statsd-exporter"} 3
Where:
- The airflow_ prefix was added by Airflow.
- airflow-statsd-exporter:9102 is an independent container.
After all the discussion on this subject, I also think that publishing Spark metrics to Kafka (then exported to HDFS) seems like the most obvious first step.
Oct 26 2023
Wikitech page updated: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
Oct 24 2023
Hello @fgiunchedi ! This ticket, when done, will add some Airflow metrics to Prometheus:
- a very reasonable amount (less than a thousand)
- prefixed with the name of the Airflow instance
- passing through a statsd_exporter
Just to let you know 😄
Oct 23 2023
Yes, Druid is an option. There is a Grafana collector, and Druid is designed for time-series.
Oct 20 2023
@Ottomata, yes, collecting to HDFS is completely possible, e.g. with a Spark metrics KafkaSink (I've seen one) + Gobblin.
Oct 18 2023
As a closing word, two serious options remain:
- Prometheus with a low resolution (one point per minute for any metric). This is OK for monitoring evolution between runs, but usually not for debugging a Spark job.
- Sending metric data with a high time resolution to a new ad-hoc backend accessible from Grafana (InfluxDB or PostgreSQL are good candidates).
Oct 17 2023
It's good to know that the estimated metrics cardinality looks OK for the current Prometheus setup.
Oct 16 2023
+1 for the simplicity induced in the conda-analytics package.
Thanks @BTullis ! I've tested it. It's now resolved!
Oct 13 2023
Oct 12 2023
Here is the list of events concerned:
- Edit
- Diacritics*
- Some prefixed with MobileWeb:
- MobileWebBrowse
- MobileWebClickTracking
- MobileWebCta
- MobileWebDiffClickTracking
- MobileWebEditing
- MobileWebInfobox
- MobileWebLanguageSwitcher
- MobileWebMainMenuClickTracking
- MobileWebSectionUsage
- MobileWebShareButton
- MobileWebUIClickTracking
- MobileWebUploads
- MobileWebWatching*
- MobileWebWikiGrok*
- TaskRecommendation*
- TestSearchSatisfaction
- VET135171
- WikipediaZeroUsage
Let's summarize our current problem.
Oct 11 2023
Oct 10 2023
@Jelto done. wikitech username & email address checked. Thanks!
Sep 29 2023
Sep 28 2023
Here is the first version of it. Please review in the gdoc: https://docs.google.com/document/d/1clSe6bnIxJUdd2LaFtQ-_MXGIRFBOzKNVZ2130qfGqI/edit