
Add data quality metrics to mediawiki_content_current_v1
Closed, Resolved · Public

Description

Let's add data quality metrics to wmf_content.mediawiki_content_current_v1.

Some suggestions:

  • Check for duplicate (wiki_id, page_id) pairs, as found in T388715.
  • Compute a total row count and per-wiki row counts, to build historic trendlines.
  • Run non-null checks on wiki_id, page_id, and revision_id.
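
For reference, these checks map naturally onto pydeequ, which is what the pipeline MRs below end up using. A minimal sketch, assuming a SparkSession named spark with the deequ jar on its classpath (the check name and wiring here are illustrative, not the production DAG code):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

df = spark.table("wmf_content.mediawiki_content_current_v1")

check = (
    Check(spark, CheckLevel.Error, "mediawiki_content_current_v1 DQ")
    .isComplete("wiki_id")        # non-null checks
    .isComplete("page_id")
    .isComplete("revision_id")
    # no duplicate (wiki_id, page_id) pairs: uniqueness ratio must be exactly 1.0
    .hasUniqueness(["wiki_id", "page_id"], lambda ratio: ratio == 1.0)
)

result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

# Total row count as a trendable metric; per-wiki counts can be a plain groupBy.
metrics = AnalysisRunner(spark).onData(df).addAnalyzer(Size()).run()
AnalyzerContext.successMetricsAsDataFrame(spark, metrics).show()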

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Disable pydeequ 'Compliance' checks until we can migrate production Spark to be >= 3.3. | repos/data-engineering/mediawiki-content-pipelines!70 | xcollazo | disable-compliance-checks | main
analytics: hotfix for task order when doing metrics on mw-content-current | repos/data-engineering/airflow-dags!1339 | xcollazo | hotfix-fix-task-order-on-compute-metrics | main
analytics: Add task to compute metrics and alerts for wmf_content.mediawiki_content_current_v1 | repos/data-engineering/airflow-dags!1328 | xcollazo | add-dq-test-for-mw-content-current | main
Add DQ for mediawiki_content_current_v1. | repos/data-engineering/mediawiki-content-pipelines!63 | xcollazo | add-dq-for-mw-current | main

Event Timeline

Example runs via Presto:

Counts:

presto:wmf_content> select count(1) as count  from mediawiki_content_current_v1;
   count   
-----------
 663350101 
(1 row)

Query 20250424_135625_00522_53yev, FINISHED, 15 nodes
Splits: 8,790 total, 8,790 done (100.00%)
[Latency: client-side: 0:07, server-side: 0:07] [663M rows, 137MB] [99.8M rows/s, 20.6MB/s]



presto:wmf_content> select count(1) as count, wiki_id from mediawiki_content_current_v1 group by wiki_id order by count DESC;
   count   |       wiki_id        
-----------+----------------------
 153563110 | commonswiki          
 122225112 | wikidatawiki         
  62850374 | enwiki               
  19505632 | viwiki               
  13985201 | ruwikinews           
  13445925 | frwiki               
  12645582 | metawiki             
  11229772 | cebwiki              
  10089741 | enwiktionary         
   8725384 | arwiki               
   8392567 | eswiki      
....

Query 20250424_135816_00523_53yev, FINISHED, 15 nodes
Splits: 9,509 total, 9,509 done (100.00%)
[Latency: client-side: 0:10, server-side: 0:10] [663M rows, 137MB] [65.3M rows/s, 13.5MB/s]

Duplicates:

presto:wmf_content> SELECT count(1) as count,
                 ->        wiki_id,
                 ->        page_id
                 -> FROM wmf_content.mediawiki_content_current_v1
                 -> GROUP BY wiki_id, page_id
                 -> HAVING count(1) > 1;
 count | wiki_id | page_id 
-------+---------+---------
(0 rows)

Query 20250424_140108_00525_53yev, FINISHED, 15 nodes
Splits: 9,269 total, 9,269 done (100.00%)
[Latency: client-side: 0:13, server-side: 0:13] [663M rows, 1.14GB] [50.1M rows/s, 87.9MB/s]

Null checks:

presto:wmf_content> SELECT wiki_id,
                 ->        page_id,
                 ->        revision_id
                 -> FROM wmf_content.mediawiki_content_current_v1
                 -> WHERE wiki_id IS NULL
                 ->    OR wiki_id = ''
                 ->    OR page_id IS NULL
                 ->    OR page_id < 0
                 ->    OR revision_id IS NULL
                 ->    OR revision_id < 0;
 wiki_id | page_id | revision_id 
---------+---------+-------------
(0 rows)

Query 20250424_140447_00527_53yev, FINISHED, 15 nodes
Splits: 8,789 total, 8,789 done (100.00%)
[Latency: client-side: 0:05, server-side: 0:05] [663M rows, 3.01GB] [143M rows/s, 664MB/s]

All these queries are blazing fast, which suggests they are answered mostly from the Parquet footers (which hold row counts and per-column min/max/null statistics) rather than by scanning the data.
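
The footer hypothesis is easy to spot-check on a single data file with pyarrow (the file path below is hypothetical):

import pyarrow.parquet as pq

# Hypothetical path to one Parquet data file of the table.
md = pq.ParquetFile("part-00000.parquet").metadata
print(md.num_rows)  # row count comes straight from the footer, no data scan
# Per-column-chunk min/max/null_count statistics also live in the footer.
print(md.row_group(0).column(0).statistics)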

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1328

Draft: analytics: Add task to compute metrics and alerts for wmf_content.mediawiki_content_current_v1

Ok, tests are looking good as per https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1328#note_141478.

There were some late minor changes during the review process, and it is late on a Friday, so I will abstain from deploying today. We should be good to deploy on Monday.

xcollazo changed the task status from Open to In Progress. May 16 2025, 8:16 PM

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1328

analytics: Add task to compute metrics and alerts for wmf_content.mediawiki_content_current_v1

Mentioned in SAL (#wikimedia-operations) [2025-05-19T16:53:43Z] <xcollazo@deploy1003> Started deploy [airflow-dags/analytics@d07b52d]: Deploy latest Airflow DAGs for the main instance. T392494.

Mentioned in SAL (#wikimedia-operations) [2025-05-19T16:54:19Z] <xcollazo@deploy1003> Finished deploy [airflow-dags/analytics@d07b52d]: Deploy latest Airflow DAGs for the main instance. T392494. (duration: 00m 36s)

Mentioned in SAL (#wikimedia-analytics) [2025-05-19T16:55:17Z] <xcollazo> Deployed latest Airflow DAGs for the main instance. T392494.

Ok we now have DQ tests in production.

Will wait until a successful run before closing this task.

xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1339

analytics: hotfix for task order when doing metrics on mw-content-current

While the DQ tests on mediawiki_content_current were successful, we also added two tests to mediawiki_content_history, and there seems to be a class mismatch issue in prod:

...
  File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics/appcache/application_1741864027385_1386679/container_e133_1741864027385_1386679_01_000001/venv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 330, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.com.amazon.deequ.analyzers.Compliance. Trace:
py4j.Py4JException: Constructor com.amazon.deequ.analyzers.Compliance([class java.lang.String, class java.lang.String, class scala.None$]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:237)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)

Will investigate.

Ok, the issue from T392494#10844228 is a class mismatch whose proper fix will likely have to wait.

We currently use the Java deequ artifact, compiled against Spark 3.1:

<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>2.0.4-spark-3.1</version>

But we run pydeequ from a conda environment that is built with pyspark 3.3.2:

dependencies:
  - pip=23.1.2
  - python=3.10.11
  - pyspark=3.3.2
  - pip:
      - dnspython==2.5.0
...

In fact, we have been lucky not to have hit a class incompatibility until now. We cannot bump refinery-source, because production Spark is indeed 3.1.2. And we cannot downgrade mediawiki-content-dump's Spark, because we depend on Spark 3.3 for reasonable MERGE behavior.

I think that for now we will have to disable the two new Compliance checks until we do the work to bump the cluster's production Spark.
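
For context on what is being disabled: a Compliance metric evaluates a SQL predicate per row and reports the fraction of rows satisfying it, and pydeequ instantiates the matching com.amazon.deequ.analyzers.Compliance JVM class over py4j; that is the constructor call failing in the trace above when the jar and pydeequ disagree. A hypothetical sketch of the shape of such a check (the predicate and table reference are illustrative; the actual two checks on mediawiki_content_history are not reproduced here):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Compliance

# Compliance(instance_name, sql_predicate) constructs the JVM analyzer directly.
metrics = (
    AnalysisRunner(spark)
    .onData(spark.table("wmf_content.mediawiki_content_history"))
    .addAnalyzer(Compliance("non-negative revision_id", "revision_id >= 0"))
    .run()
)
AnalyzerContext.successMetricsAsDataFrame(spark, metrics).show(truncate=False)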

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/70

Disable pydeequ 'Compliance' checks until we can migrate production Spark to be >= 3.3.

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1343

analytics: bump mw-content conda env to pickup fixes for reconcile and metrics.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1343

analytics: bump mw-content conda env to pickup fixes for reconcile and metrics.

Mentioned in SAL (#wikimedia-operations) [2025-05-21T21:00:04Z] <xcollazo@deploy1003> Started deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310.

Mentioned in SAL (#wikimedia-operations) [2025-05-21T21:01:00Z] <xcollazo@deploy1003> Finished deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310. (duration: 00m 55s)

Mentioned in SAL (#wikimedia-analytics) [2025-05-21T21:02:35Z] <xcollazo> Deploy Airflow artifact for T392494 and T394310.