User Details
- User Since
- Oct 3 2025, 9:25 AM (35 w, 5 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- APizzata-WMF [ Global Accounts ]
Tue, Jun 9
Tue, Jun 2
Yesterday's sqoop process terminated ingesting all the necessary tables for MWH at 2026-06-01 15:22 UTC. We are not that much far off the 14 hrs result obtained during testing!
Further optimization could be applied by increasing the parallelism of the smaller wiki, as long as it does not impact the performance of an-launcher1003.
Mon, Jun 1
Fri, May 29
Also today's run of mw_content_reconcile_mw_content_history_daily for eswiki failed for the same reason.
mw-content-history-reconcile-enrich
Needs mediawiki/page/change version 1.3.0
On the 2026-05-27 the mw_content_reconcile_mw_content_history_daily DAG failed on the step emit_reconcile_events_to_kafka for zhwikibooks due to:
error: numeric instance is lower than the required minimum (minimum: 0, found: -1)
level: "error"
schema: {"loadingURI":"#","pointer":"/properties/page/properties/redirect_page_link/properties/namespace_id"}
instance: {"pointer":"/page/redirect_page_link/namespace_id"}
domain: "validation"
keyword: "minimum"
minimum: 0
found: -1Recent changes have introduced the redirect_page_link under the page struct.
Thu, May 28
executed the following commands:
screen -S T424355-copy-content sudo -u analytics kerberos-run-command analytics hdfs dfs -mkdir -p /wmf/data/raw/mediawiki_private/tables/content sudo -u analytics kerberos-run-command analytics hdfs dfs -cp /wmf/data/raw/mediawiki/tables/content/* /wmf/data/raw/mediawiki_private/tables/content
Fri, May 22
I have tested with the following spark session:
spark35-sql \ --master yarn \ --conf spark.executor.cores=2 \ --conf spark.executor.memory=8g \ --conf spark.driver.memory=8g \ --conf spark.sql.shuffle.partitions=768 \ --conf spark.dynamicAllocation.maxExecutors=128 \ --conf spark.executor.memoryOverhead=6g \ --conf spark.sql.legacy.timeParserPolicy=LEGACY \ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ --conf spark.sql.catalog.iceberg_check=org.apache.iceberg.spark.SparkCatalog \ --conf spark.sql.catalog.iceberg_check.type=hadoop \ --conf spark.sql.catalog.iceberg_check.warehouse=hdfs:///tmp/iceberg-version-check \ <--- amenities to check the iceberg version --conf spark.driver.userClassPathFirst=true \ <--- amenities to check the iceberg version --conf spark.executor.userClassPathFirst=true \ <--- amenities to check the iceberg version --conf spark.executorEnv.SPARK_HOME=$SPARK_HOME \ --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \ --conf spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME \ --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \ --conf spark.yarn.archive=hdfs:///user/a-pizzata/artifacts/spark-3.5.8-jars.zip \ --conf spark.jars=hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.6.1.jar <--- providing my own iceberg runtime
a-pizzata merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2239
after this merge we cam see reconciled records with the correct page.redirect_page_link.
Thu, May 21
Yes, tomorrow I will validate everything on an-launcher and merge the other MRs.
Tested and got the same result.
It'd be interesting to check without those fields to validate.
I can give it a go and come back
If there are no objections I'll reduce the rate-limit back to 20 evt/sec from the 60 we set when the suggestions confidence scores were added.
No objection from my side!
Tue, May 19
Mon, May 18
Was able to run the following commands:
Fri, May 15
Wed, May 13
After some checks here are my results.
May 11 2026
However, I find 1,308,751 pages in English Wikipedia main namespace that are redirects (content_body contains #REDIRECT) but where page_redirect_target is not set (null). From what I understand from the latest MR, the MariaDB redirect table should add this information?
May 8 2026
Created couple MR:
Additionally this is the status of the Cloud DB during the operations.
Will now do a complete test of all the scripts in parallel with this change.
Here are my results:
May 7 2026
I can test this and come back with the results!
this took roughly 2 hours (started at 2026-05-06T15:55:02 ended at 2026-05-06T17:48:14).
May 6 2026
I recommend we sqoop this table from the analytics-replicas.
I can test this and come back with the results!
Given this data, one idea is to split the tables into two groups by size and measure how much time we can save by running the smaller subset separately. I’m currently testing how long it takes to sqoop the following (smaller) set of tables: slot_roles, change_tag_def, user_groups, user, archive.
This took roughly 5 hours:
start at 2026-05-06T09:40:45
finished at 2026-05-06T14:35:52
The run of /usr/local/bin/refinery-sqoop-mediawiki-production-history took about 2 hours, from 2026-05-05T09:04:38 to 2026-05-05T11:25:48.
The /usr/local/bin/refinery-sqoop-centralauth-production job completed in roughly 20 minutes, from 2026-05-05T09:10:13 to 2026-05-05T09:33:02.
May 5 2026
After a discussion with @xcollazo and @JAllemandou we decided to create 3 parallel processes:
/usr/local/bin/refinery-sqoop-mediawiki-history \ && /usr/local/bin/refinery-sqoop-mediawiki-not-history
May 4 2026
Created a local script to ingest google sheet. It is not yet ready to be deployed but is a start to ingest and test.
May 1 2026
All the sqoop commands are executed by the service refinery-sqoop-whole-mediawiki.service that runs the bin /usr/local/bin/refinery-sqoop-whole-mediawiki.
The bin launches the following commands sequentially (notice the &&):
/usr/local/bin/refinery-sqoop-mediawiki-history \ && /usr/local/bin/refinery-sqoop-mediawiki-production-history \ && /usr/local/bin/refinery-sqoop-centralauth-production \ && /usr/local/bin/refinery-sqoop-mediawiki-not-history \ && /usr/local/bin/refinery-sqoop-mediawiki-production-not-history
All the public tables necessary for the DAG are ingested via refinery-sqoop-mediawiki-history (the first step of the pipeline), actor and comment through refinery-sqoop-mediawiki-production-history (the second step), and centralauth_localuser through refinery-sqoop-centralauth-production (third step).
Apr 30 2026
Here is the same check on with today's run:
spark.sql(f"""select count(*),snapshot from analytics_platform_eng.image_suggestions_suggestions group by snapshot order by snapshot asc""").show(truncate=False)
Apr 28 2026
tested the results with the changes in the airflow devenv and they look good:
spark.sql("""select count(*),snapshot
from apizzata.image_suggestions_suggestions
group by snapshot
order by snapshot asc""").show(truncate=False)Apr 24 2026
Do we need to do anything else here @APizzata-WMF?
Not really, the mechanism for reconciliation is in place and has been running both daily and monthly. I'd say is validation time.
@MGerlach, please notify us if you find something weird in the data so that we can take action!
Apr 23 2026
The current way the pipeline queries the imagelink table is the following:
Apr 13 2026
The query:
spark.sql(f"""
select
page_redirect_target
from wmf_content.mediawiki_content_history_v1
where
page_id= 71571
and wiki_id = 'enwiktionary'
and revision_id=87466915"""
).show(truncate=False)
- Emit 0 instead of null on page_change events for rev_parent_id when there is no parent.
- Run a query on our datalake tables to correct data so that any null values on revison_parent_id are changed to 0.
Apr 2 2026
The table has been sqooped and I have ran all the queries to check the status:
Apr 1 2026
Here are my findings:
-- I want to see the status of the source tables `event.mediawiki_page_content_change_v1` and `event.mediawiki_content_history_reconcile_enriched_v1` therefore I executed the following query
@xcollazo could you remind me the semantics of missing_from_source?
I think I can help you with that!
The missing_from_source shows pages that we have in the wmf_content.mediawiki_content_history_v1 that are not present anymore in the MariaDB tables.
perfect! Will keep the ticket open for tomorrow run, if everything looks good I will close it.
Mar 31 2026
spark.sql("""
(
select *
from analytics_platform_eng.image_suggestions_search_index_delta where snapshot='2026-03-16'
and tag = 'recommendation.image'
limit 5)
union all
(
select *
from analytics_platform_eng.image_suggestions_search_index_delta where snapshot='2026-03-16'
and tag = 'recommendation.image_section'
limit 5)
""").show(truncate=False)Mar 30 2026
Tested in the dev env and the results are good:
spark.sql("""
select tag,count(*) from analytics_platform_eng.image_suggestions_search_index_full
where snapshot = '2026-03-16' and tag in ('recommendation.image','recommendation.image_section')
group by tag""").show()Unfortunately, due to permission on the delta and full tables, I was not able to update the faulty records. Therefore here is the MR that should fix it, I will run it locally and test the results (will post here) and merge ASAP.
I have created and validated all my update commands, will shortly run them all and past here all the results with the validations. As a next step we can just rerun the image_suggestions_weekly dag correct?
Hey @dcausse I must have misunderstood and thought the exists| part could be removed. I can update the output of the tables to show the correct form and fix the code to show in the correct form from next run. Does this sound good to you?
Also SLIS results look like intended:
Mar 27 2026
Mar 26 2026
ALIS run correctly and here the results:
spark.sql(
"""
SELECT
tag,
COUNT(*) AS cnt
FROM analytics_platform_eng.image_suggestions_search_index_delta
WHERE snapshot = '2026-03-16'
GROUP BY tag
"""
).show()Mar 25 2026
Pipeline has been merged, will wait for monthly reconcile to check the results and see if there are any other issues. @JeffDoozan please let us know if you find any other inconsistencies.
forgot to run the same command on snapshot= 2025-09:
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-09/wiki_db=centralauth/part-m-00017.avro
Merged the MR, waiting for tomorrow's run to validate.
Mar 24 2026
+1 on
to remember the state of the filtering in the past. If you still don't want it, please do as you see fit :)
and
- Initially, we set valid_until to NULL.
- Then, the day we decide to turn off some of those JA3N/JA4H pairs, we manually set their valid_until to that date.
Mar 23 2026
Just ran the following with the watchful eye of @JAllemandou :
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-10/wiki_db=centralauth/part-m-00017.avro sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-11/wiki_db=centralauth/part-m-00017.avro sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-12/wiki_db=centralauth/part-m-00017.avro sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2026-01/wiki_db=centralauth/part-m-00017.avro sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2026-02/wiki_db=centralauth/part-m-00017.avro
but I don't think we should automatically disable the filtering when we reach the valid_until date.
So we keep it null for now and when audited that it should be stopped to be used we add the value?
Mar 20 2026
The file I am talking about is only made up of duplicates, therefore deleting it would just remove duplication
Everything went smooth, check_bad_parsing finished successfully in 19 hrs and so did SLIS, Cassandra, and image_suggestions_weekly.
Mar 19 2026
If we go for more "future-proof", I'd use Iceberg, with timestamp fields for insertion, update, and valid-until. And in the filtering, we would only consider rows for which the valid-until date is in the future.
+1 on this
Hm, this would mean partially incomplete data. I'd rather have duplicate in my data than incomplete one.
The file I am talking about is only made up of duplicates, therefore deleting it would just remove duplication
So, I guess we can skip table maintenance altogether?
agreed :)
I think it's easier to make it happen this way
I agree with this!
The pipeline ran successfully.
But we can always look at Iceberg snapshots.
You mean we can look at the insert times and some sort of git log? In case we want to use the snapshots we need to disable the maintenance job that deletes the old snapshots.
I can test this easily and come back with the results
check_bad_parsing has been re-ran successfully, with this MR we should avoid these problems in the future.
Cassandra has also been re-run to upload the data.
