APizzata-WMF (a-pizzata)
User

Projects (1)

User Details

User Since
Oct 3 2025, 9:25 AM (35 w, 5 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
APizzata-WMF [ Global Accounts ]

Recent Activity

Tue, Jun 9

APizzata-WMF created T428644: analytics-test gobblin tasks failure.
Tue, Jun 9, 4:10 PM · Data-Platform-SRE (2026-06-05 - 2026-06-26)

Tue, Jun 2

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Yesterday's sqoop process finished ingesting all the tables needed for MWH at 2026-06-01 15:22 UTC. We are not far off the 14-hour result obtained during testing!
Further optimization could come from increasing the parallelism for the smaller wikis, as long as it does not impact the performance of an-launcher1003.

Tue, Jun 2, 8:20 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mon, Jun 1

APizzata-WMF moved T418190: Refactor pingback reports pipelines using dbt from In progress to Blocked/Paused on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:44 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T415283: Refactor pingback analytics pipeline from In progress to Blocked/Paused on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:44 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T423573: Blunderbuss doesn't replace the whole destination folder in HDFS from In progress to Ready to Deploy on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:43 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T411771: Migrate PageViewInfo calls away from rest-gateway from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:20 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), ServiceOps-SharedInfra, ServiceOps new, PageViewInfo
APizzata-WMF moved T425029: mediawiki.page_change.v1 - add revision.editor.first_edit_dt field from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:20 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Event-Platform, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T425730: Airflow DAGs for mediawiki_history_incremental_v1 writers from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Jun 1, 3:15 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Fri, May 29

APizzata-WMF added a comment to T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

Also today's run of mw_content_reconcile_mw_content_history_daily for eswiki failed for the same reason.

Fri, May 29, 1:57 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
APizzata-WMF added a comment to T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

mw-content-history-reconcile-enrich

Needs mediawiki/page/change version 1.3.0

Fri, May 29, 12:26 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
APizzata-WMF added a comment to T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

On 2026-05-27 the mw_content_reconcile_mw_content_history_daily DAG failed at the emit_reconcile_events_to_kafka step for zhwikibooks due to:

error: numeric instance is lower than the required minimum (minimum: 0, found: -1)
    level: "error"
    schema: {"loadingURI":"#","pointer":"/properties/page/properties/redirect_page_link/properties/namespace_id"}
    instance: {"pointer":"/page/redirect_page_link/namespace_id"}
    domain: "validation"
    keyword: "minimum"
    minimum: 0
    found: -1

Recent changes have introduced the redirect_page_link under the page struct.
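
The failure above comes down to a `minimum: 0` constraint rejecting a `namespace_id` of -1. A minimal sketch of that check in plain Python, assuming a hypothetical helper `check_minimum` and a hand-built event; it mimics the constraint only and is not the validator the pipeline uses:

```python
def check_minimum(event, path, minimum):
    """Return an error string if the value at `path` is below `minimum`, else None.

    `path` is a list of keys, e.g. ["page", "redirect_page_link", "namespace_id"].
    A missing key is treated as "nothing to validate" (an assumption of this sketch).
    """
    value = event
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    if value < minimum:
        return f"/{'/'.join(path)}: minimum: {minimum}, found: {value}"
    return None


# Hand-built example event (hypothetical); only the field from the error is present.
event = {"page": {"redirect_page_link": {"namespace_id": -1}}}
path = ["page", "redirect_page_link", "namespace_id"]
print(check_minimum(event, path, 0))
```

A check like this could run before events are emitted, so that offending rows are reported rather than failing the whole step; whether that is desirable here is a design decision for the task.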

Fri, May 29, 8:58 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

Thu, May 28

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Executed the following commands:

screen -S T424355-copy-content
sudo -u analytics kerberos-run-command analytics hdfs dfs -mkdir -p /wmf/data/raw/mediawiki_private/tables/content
sudo -u analytics kerberos-run-command analytics hdfs dfs -cp /wmf/data/raw/mediawiki/tables/content/* /wmf/data/raw/mediawiki_private/tables/content
Thu, May 28, 12:27 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF claimed T427441: Delete `wmf_raw.mediawiki_content` table after a few months.
Thu, May 28, 9:35 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Fri, May 22

APizzata-WMF added a comment to T426801: Iceberg 1.6.1 bug makes SELECTs fail due to vectorized read path being the default.

I have tested with the following spark session:

spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.iceberg_check=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_check.type=hadoop \
  --conf spark.sql.catalog.iceberg_check.warehouse=hdfs:///tmp/iceberg-version-check \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.executorEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.archive=hdfs:///user/a-pizzata/artifacts/spark-3.5.8-jars.zip \
  --conf spark.jars=hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.6.1.jar

Notes on the flags above:

  • spark.sql.catalog.iceberg_check.warehouse, spark.driver.userClassPathFirst and spark.executor.userClassPathFirst are extras used to check the Iceberg version.
  • spark.jars provides my own Iceberg runtime.
Fri, May 22, 1:24 PM · DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.

a-pizzata merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2239

After this merge we can see reconciled records with the correct page.redirect_page_link.

Fri, May 22, 9:48 AM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content

Thu, May 21

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Yes, tomorrow I will validate everything on an-launcher and merge the other MRs.

Thu, May 21, 8:07 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T426801: Iceberg 1.6.1 bug makes SELECTs fail due to vectorized read path being the default.

Tested and got the same result.

Thu, May 21, 3:28 PM · DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T426801: Iceberg 1.6.1 bug makes SELECTs fail due to vectorized read path being the default.

It'd be interesting to check without those fields to validate.

I can give it a go and come back

Thu, May 21, 1:18 PM · DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T423238: ALIS data pipeline produced too many suggestions.

If there are no objections I'll reduce the rate-limit back to 20 evt/sec from the 60 we set when the suggestions confidence scores were added.

No objection from my side!

Thu, May 21, 8:45 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions

Tue, May 19

APizzata-WMF moved T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps from In progress to In Review on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Tue, May 19, 2:17 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content

Mon, May 18

APizzata-WMF added a comment to T426349: Spark 3.5.8 Shuffler not working.

Was able to run the following commands:

Mon, May 18, 12:25 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15)
APizzata-WMF added a comment to T426349: Spark 3.5.8 Shuffler not working.

@BTullis Thanks for all the work!! I will check that it works for me too and let you know :) I am replacing @xcollazo on this so he can focus on other things.

Mon, May 18, 10:34 AM · Data-Platform-SRE (2026-04-24 - 2026-05-15)

Fri, May 15

APizzata-WMF placed T426349: Spark 3.5.8 Shuffler not working up for grabs.
Fri, May 15, 7:22 AM · Data-Platform-SRE (2026-04-24 - 2026-05-15)
APizzata-WMF claimed T426349: Spark 3.5.8 Shuffler not working.
Fri, May 15, 7:22 AM · Data-Platform-SRE (2026-04-24 - 2026-05-15)

Wed, May 13

APizzata-WMF added a comment to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.

After some checks here are my results.

Wed, May 13, 8:48 AM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content

May 11 2026

APizzata-WMF moved T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
May 11 2026, 1:16 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content
APizzata-WMF added a comment to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.

However, I find 1,308,751 pages in English Wikipedia main namespace that are redirects (content_body contains #REDIRECT) but where page_redirect_target is not set (null). From what I understand from the latest MR, the MariaDB redirect table should add this information?

May 11 2026, 1:14 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content
APizzata-WMF moved T424355: Accelerate sqoop landing for MediaWiki History private tables from In progress to In Review on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
May 11 2026, 9:27 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 8 2026

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Created a couple of MRs:

May 8 2026, 12:28 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Additionally this is the status of the Cloud DB during the operations.

May 8 2026, 8:09 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Will now do a complete test of all the scripts in parallel with this change.

Here are my results:

May 8 2026, 7:50 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 7 2026

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

I can test this and come back with the results!

This took roughly 2 hours (started at 2026-05-06T15:55:02, ended at 2026-05-06T17:48:14).

May 7 2026, 7:11 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 6 2026

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

I recommend we sqoop this table from the analytics-replicas.

I can test this and come back with the results!

May 6 2026, 3:52 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

Given this data, one idea is to split the tables into two groups by size and measure how much time we can save by running the smaller subset separately. I’m currently testing how long it takes to sqoop the following (smaller) set of tables: slot_roles, change_tag_def, user_groups, user, archive.

This took roughly 5 hours:
started at 2026-05-06T09:40:45
finished at 2026-05-06T14:35:52

May 6 2026, 3:05 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

The run of /usr/local/bin/refinery-sqoop-mediawiki-production-history took about 2 hours 20 minutes, from 2026-05-05T09:04:38 to 2026-05-05T11:25:48.
The /usr/local/bin/refinery-sqoop-centralauth-production job completed in roughly 20 minutes, from 2026-05-05T09:10:13 to 2026-05-05T09:33:02.
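
The elapsed times can be derived directly from the timestamps quoted above. A minimal sketch using only the standard library; the helper name `elapsed` is an assumption made for illustration:

```python
from datetime import datetime


def elapsed(start, end):
    """Return the timedelta between two ISO-8601 timestamps."""
    return datetime.fromisoformat(end) - datetime.fromisoformat(start)


# Start and end timestamps as reported in the comment above.
runs = {
    "refinery-sqoop-mediawiki-production-history": (
        "2026-05-05T09:04:38",
        "2026-05-05T11:25:48",
    ),
    "refinery-sqoop-centralauth-production": (
        "2026-05-05T09:10:13",
        "2026-05-05T09:33:02",
    ),
}

for name, (start, end) in runs.items():
    print(name, elapsed(start, end))
```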

May 6 2026, 10:04 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 5 2026

APizzata-WMF updated subscribers of T424355: Accelerate sqoop landing for MediaWiki History private tables.

After a discussion with @xcollazo and @JAllemandou we decided to create 3 parallel processes:

/usr/local/bin/refinery-sqoop-mediawiki-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-not-history
May 5 2026, 9:08 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 4 2026

APizzata-WMF added a comment to T425347: brand health tracker data ingestion.

Created a local script to ingest the Google Sheet. It is not yet ready to be deployed, but it is a start for ingesting and testing.

May 4 2026, 1:05 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF created T425347: brand health tracker data ingestion.
May 4 2026, 12:21 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

May 1 2026

APizzata-WMF added a comment to T424355: Accelerate sqoop landing for MediaWiki History private tables.

All the sqoop commands are executed by the service refinery-sqoop-whole-mediawiki.service, which runs the executable /usr/local/bin/refinery-sqoop-whole-mediawiki.
That executable launches the following commands sequentially (note the &&):

/usr/local/bin/refinery-sqoop-mediawiki-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-production-history \
  && /usr/local/bin/refinery-sqoop-centralauth-production \
  && /usr/local/bin/refinery-sqoop-mediawiki-not-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-production-not-history

All the public tables necessary for the DAG are ingested via refinery-sqoop-mediawiki-history (the first step of the pipeline), actor and comment through refinery-sqoop-mediawiki-production-history (the second step), and centralauth_localuser through refinery-sqoop-centralauth-production (third step).
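
The `&&` chain above runs each step only if the previous one succeeded. A minimal sketch of that short-circuit behavior, plus one way to run several such chains side by side. The helpers `run_chain` and `run_parallel` are hypothetical, and placeholder commands stand in for the real `refinery-sqoop-*` scripts; how the steps would actually be grouped is not specified here:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_chain(commands):
    """Run commands in order, stopping at the first failure (like `a && b`).

    Returns 0 if every command succeeded, otherwise the first non-zero exit code.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


def run_parallel(chains):
    """Run each chain in its own thread; return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=len(chains)) as pool:
        return list(pool.map(run_chain, chains))


# Placeholder commands (assumptions): one that succeeds, one that exits with 3.
ok = [sys.executable, "-c", "pass"]
fail = [sys.executable, "-c", "import sys; sys.exit(3)"]

if __name__ == "__main__":
    # Three independent chains; the second stops at its failing first step.
    print(run_parallel([[ok], [fail, ok], [ok, ok]]))
```

Threads are enough in this sketch because each worker just waits on a child process; the load that parallel runs put on the launcher host and the source databases still has to be measured separately.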

May 1 2026, 10:12 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T424355: Accelerate sqoop landing for MediaWiki History private tables from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
May 1 2026, 8:35 AM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Apr 30 2026

APizzata-WMF moved T423238: ALIS data pipeline produced too many suggestions from In Review to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 30 2026, 11:28 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions
APizzata-WMF moved T421677: Add overwrite to check_bad_parsing from Ready to Deploy to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 30 2026, 11:27 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T423238: ALIS data pipeline produced too many suggestions.

Here is the same check with today's run:

spark.sql("""select count(*), snapshot
from analytics_platform_eng.image_suggestions_suggestions
group by snapshot
order by snapshot asc""").show(truncate=False)
Apr 30 2026, 7:18 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions

Apr 28 2026

APizzata-WMF moved T423238: ALIS data pipeline produced too many suggestions from In progress to In Review on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 28 2026, 1:04 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions
APizzata-WMF added a comment to T423238: ALIS data pipeline produced too many suggestions.

Tested the results with the changes in the Airflow dev env and they look good:

spark.sql("""select count(*), snapshot
from apizzata.image_suggestions_suggestions
group by snapshot
order by snapshot asc""").show(truncate=False)
Apr 28 2026, 10:53 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions
APizzata-WMF moved T423238: ALIS data pipeline produced too many suggestions from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 28 2026, 9:49 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions
APizzata-WMF claimed T423238: ALIS data pipeline produced too many suggestions.
Apr 28 2026, 9:49 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions

Apr 24 2026

APizzata-WMF added a comment to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.

Do we need to do anything else here @APizzata-WMF?

Not really, the mechanism for reconciliation is in place and has been running both daily and monthly. I'd say it is validation time.
@MGerlach, please notify us if you find something weird in the data so that we can take action!

Apr 24 2026, 2:36 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content

Apr 23 2026

APizzata-WMF added a comment to T423238: ALIS data pipeline produced too many suggestions.

The current way the pipeline queries the imagelink table is the following:

Apr 23 2026, 2:12 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Image-Suggestions

Apr 13 2026

APizzata-WMF moved T417596: Data missing from en.wiktionary.org February 2026 "MediaWiki Content File Exports" compared to "XML Database dump" from Ready to Deploy to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 13 2026, 8:43 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF added a comment to T417596: Data missing from en.wiktionary.org February 2026 "MediaWiki Content File Exports" compared to "XML Database dump".

The query:

spark.sql("""
    select
        page_redirect_target
    from wmf_content.mediawiki_content_history_v1
    where
        page_id = 71571
        and wiki_id = 'enwiktionary'
        and revision_id = 87466915
""").show(truncate=False)
Apr 13 2026, 8:42 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF moved T421677: Add overwrite to check_bad_parsing from In Review to Ready to Deploy on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Apr 13 2026, 7:29 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T420974: when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None.
  1. Emit 0 instead of null on page_change events for rev_parent_id when there is no parent.
  2. Run a query on our datalake tables to correct data so that any null values on revision_parent_id are changed to 0.
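
Both steps above apply the same rule: a revision with no parent should carry 0 rather than null. A minimal sketch of that rule in plain Python, with a hypothetical helper `normalize_parent_id` and made-up sample rows; the real correction would be a query against the datalake tables:

```python
def normalize_parent_id(rev_parent_id):
    """Map a missing parent (None) to 0; leave real parent ids untouched."""
    return 0 if rev_parent_id is None else rev_parent_id


# Hypothetical sample rows: (revision_id, revision_parent_id)
rows = [(101, None), (102, 101), (103, 0)]
fixed = [(rev_id, normalize_parent_id(parent)) for rev_id, parent in rows]
print(fixed)
```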
Apr 13 2026, 7:28 AM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation

Apr 2 2026

APizzata-WMF moved T421677: Add overwrite to check_bad_parsing from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Apr 2 2026, 1:18 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF moved T411116: CentralAuth's localuser table contains many nulls and duplicate mappings from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Apr 2 2026, 1:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

The table has been sqooped and I have run all the queries to check the status:

Apr 2 2026, 1:17 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth

Apr 1 2026

APizzata-WMF updated subscribers of T420974: when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None.

Here are my findings:

-- I want to see the status of the source tables `event.mediawiki_page_content_change_v1` and `event.mediawiki_content_history_reconcile_enriched_v1` therefore I executed the following query
Apr 1 2026, 2:13 PM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF moved T420974: when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Apr 1 2026, 11:31 AM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF claimed T420974: when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None.
Apr 1 2026, 11:31 AM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF added a comment to T420787: Visualizing inconsistencies and reconciles via Superset.

@xcollazo could you remind me the semantics of missing_from_source?

I think I can help you with that!
missing_from_source flags pages that exist in wmf_content.mediawiki_content_history_v1 but are no longer present in the MariaDB tables.
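
Read that way, missing_from_source is a set difference: ids known to the datalake table minus ids still present in the source. A minimal sketch with made-up page ids; the function name and the choice to key on page id alone are assumptions, and the actual reconcile logic may compare more fields:

```python
def missing_from_source(datalake_page_ids, source_page_ids):
    """Return ids present in the datalake table but absent from the source tables."""
    return sorted(set(datalake_page_ids) - set(source_page_ids))


# Hypothetical ids: page 3 was removed upstream but still sits in the datalake.
datalake_ids = [1, 2, 3, 4]
mariadb_ids = [1, 2, 4, 5]
print(missing_from_source(datalake_ids, mariadb_ids))
```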

Apr 1 2026, 10:17 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Perfect! I will keep the ticket open for tomorrow's run; if everything looks good I will close it.

Apr 1 2026, 9:09 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

The task finished running during the night, @dcausse do the numbers look good now?

Apr 1 2026, 8:02 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Mar 31 2026

APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

The task is running.

Mar 31 2026, 1:28 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".
spark.sql("""
(
    select *
    from analytics_platform_eng.image_suggestions_search_index_delta
    where snapshot = '2026-03-16'
      and tag = 'recommendation.image'
    limit 5
)
union all
(
    select *
    from analytics_platform_eng.image_suggestions_search_index_delta
    where snapshot = '2026-03-16'
      and tag = 'recommendation.image_section'
    limit 5
)
""").show(truncate=False)
Mar 31 2026, 10:51 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Mar 30 2026

APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Tested in the dev env and the results are good:

spark.sql("""
select tag, count(*)
from analytics_platform_eng.image_suggestions_search_index_full
where snapshot = '2026-03-16'
  and tag in ('recommendation.image', 'recommendation.image_section')
group by tag""").show()
Mar 30 2026, 4:00 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Unfortunately, due to permissions on the delta and full tables, I was not able to update the faulty records. Therefore, here is the MR that should fix it; I will run it locally, test the results (and post them here), and merge ASAP.

Mar 30 2026, 1:35 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

I have created and validated all my update commands; I will shortly run them all and paste all the results here with the validations. As a next step we can just rerun the image_suggestions_weekly DAG, correct?

Mar 30 2026, 11:51 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Hey @dcausse I must have misunderstood and thought the exists| part could be removed. I can update the output of the tables to show the correct form and fix the code to show in the correct form from next run. Does this sound good to you?

Mar 30 2026, 9:51 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF moved T421677: Add overwrite to check_bad_parsing from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 30 2026, 8:47 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF created T421677: Add overwrite to check_bad_parsing.
Mar 30 2026, 8:47 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

SLIS results also look as intended:

Mar 30 2026, 8:30 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Mar 27 2026

APizzata-WMF moved T421128: update image suggestion readme from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 27 2026, 10:30 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
APizzata-WMF moved T421128: update image suggestion readme from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 27 2026, 9:42 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 26 2026

APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

ALIS ran correctly and here are the results:

spark.sql(
    """
    SELECT
        tag,
        COUNT(*) AS cnt
    FROM analytics_platform_eng.image_suggestions_search_index_delta
    WHERE snapshot = '2026-03-16'
    GROUP BY tag
    """
).show()
Mar 26 2026, 10:42 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Mar 25 2026

APizzata-WMF added a comment to T417596: Data missing from en.wiktionary.org February 2026 "MediaWiki Content File Exports" compared to "XML Database dump".

Pipeline has been merged, will wait for monthly reconcile to check the results and see if there are any other issues. @JeffDoozan please let us know if you find any other inconsistencies.

Mar 25 2026, 1:19 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

Forgot to run the same command on snapshot=2025-09:

sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-09/wiki_db=centralauth/part-m-00017.avro
Mar 25 2026, 11:17 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Merged the MR, waiting for tomorrow's run to validate.

Mar 25 2026, 10:05 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF moved T421128: update image suggestion readme from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 25 2026, 10:04 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 24 2026

APizzata-WMF created T421128: update image suggestion readme.
Mar 24 2026, 3:14 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
APizzata-WMF added a comment to T419204: Fix Image suggestion DagProperty values.

Merged: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2057

Mar 24 2026, 2:05 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Growth-Team, Image-Suggestions
APizzata-WMF moved T419204: Fix Image suggestion DagProperty values from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 24 2026, 2:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Growth-Team, Image-Suggestions
APizzata-WMF moved T419204: Fix Image suggestion DagProperty values from In Review to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 24 2026, 9:20 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T420412: Implement list of JA3N-JA4H pairs to be tagged as automated into the bot detection pipeline.

+1 on

to remember the state of the filtering in the past. If you still don't want it, please do as you see fit :)

and

  • Initially, we set valid_until to NULL.
  • Then, the day we decide to turn off some of those JA3N/JA4H pairs, we manually set their valid_until to that date.
Mar 24 2026, 8:06 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 23 2026

APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

Just ran the following under the watchful eye of @JAllemandou:

sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-10/wiki_db=centralauth/part-m-00017.avro
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-11/wiki_db=centralauth/part-m-00017.avro
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2025-12/wiki_db=centralauth/part-m-00017.avro
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2026-01/wiki_db=centralauth/part-m-00017.avro
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/data/raw/mediawiki_private/tables/centralauth_localuser/snapshot=2026-02/wiki_db=centralauth/part-m-00017.avro
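
The five commands above differ only in the snapshot partition. As a rough sketch (the helper name and constants are made up for illustration, and the script only prints the commands, it does not run them), they could be generated like this:

```python
# Path and file name are copied from the commands above.
BASE = "/wmf/data/raw/mediawiki_private/tables/centralauth_localuser"
SNAPSHOTS = ["2025-10", "2025-11", "2025-12", "2026-01", "2026-02"]
PREFIX = "sudo -u analytics kerberos-run-command analytics hdfs dfs -rm"


def build_rm_commands(snapshots, part_file="part-m-00017.avro"):
    """Build one `hdfs dfs -rm` command line per snapshot partition."""
    return [
        f"{PREFIX} {BASE}/snapshot={snapshot}/wiki_db=centralauth/{part_file}"
        for snapshot in snapshots
    ]


if __name__ == "__main__":
    # Print for review; nothing is deleted by this script.
    for command in build_rm_commands(SNAPSHOTS):
        print(command)
```

Printing first leaves room to double-check each path before anything is removed.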
Mar 23 2026, 2:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF moved T368987: Add an Image: filtering by suggestion "kind" or "confidence" from In progress to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 23 2026, 10:34 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
APizzata-WMF added a comment to T420412: Implement list of JA3N-JA4H pairs to be tagged as automated into the bot detection pipeline.

but I don't think we should automatically disable the filtering when we reach the valid_until date.

So we keep it NULL for now and, once an audit concludes that a pair should no longer be used, we set the value?

Mar 23 2026, 8:24 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 20 2026

APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

The file I am talking about consists only of duplicates, so deleting it would just remove the duplication.

Mar 20 2026, 2:17 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF moved T411116: CentralAuth's localuser table contains many nulls and duplicate mappings from In Review to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 20 2026, 8:24 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF moved T417596: Data missing from en.wiktionary.org February 2026 "MediaWiki Content File Exports" compared to "XML Database dump" from In Review to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 20 2026, 8:24 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
APizzata-WMF moved T420244: Section Image Suggestions no longer available? from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 20 2026, 8:24 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Image-Suggestions, Growth-Team
APizzata-WMF added a comment to T420244: Section Image Suggestions no longer available?.

Everything went smoothly: check_bad_parsing finished successfully in 19 hrs, and so did SLIS, Cassandra, and image_suggestions_weekly.

Mar 20 2026, 8:24 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Image-Suggestions, Growth-Team

Mar 19 2026

APizzata-WMF added a comment to T420412: Implement list of JA3N-JA4H pairs to be tagged as automated into the bot detection pipeline.

If we go for more "future-proof", I'd use Iceberg, with timestamp fields for insertion, update, and valid-until. And in the filtering, we would only consider rows for which the valid-until date is in the future.

+1 on this

Mar 19 2026, 5:07 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

Hm, this would mean partially incomplete data. I'd rather have duplicates in my data than incomplete data.

The file I am talking about consists only of duplicates, so deleting it would just remove the duplication.

Mar 19 2026, 2:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF added a comment to T420412: Implement list of JA3N-JA4H pairs to be tagged as automated into the bot detection pipeline.

So, I guess we can skip table maintenance altogether?

agreed :)

Mar 19 2026, 11:56 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I think it's easier to make it happen this way

I agree with this!

Mar 19 2026, 10:25 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF added a comment to T416312: Use wmf.mediawiki_history as baseline for slo completeness.

The pipeline ran successfully.

Mar 19 2026, 9:47 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
APizzata-WMF added a comment to T420412: Implement list of JA3N-JA4H pairs to be tagged as automated into the bot detection pipeline.

But we can always look at Iceberg snapshots.

You mean we can look at the insert times, like some sort of git log? If we want to use the snapshots, we need to disable the maintenance job that deletes old snapshots.

Mar 19 2026, 8:41 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
APizzata-WMF added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I can test this easily and come back with the results

Mar 19 2026, 8:28 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
APizzata-WMF moved T420244: Section Image Suggestions no longer available? from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 19 2026, 8:26 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Image-Suggestions, Growth-Team
APizzata-WMF added a comment to T420244: Section Image Suggestions no longer available?.

check_bad_parsing has been re-run successfully; with this MR we should avoid these problems in the future.
Cassandra has also been re-run to upload the data.

Mar 19 2026, 8:22 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Image-Suggestions, Growth-Team