As notified in a wikitech mailing list thread, T299947: Normalize pagelinks table will drop the pl_title column, which is consumed by the image suggestions data pipeline.
This is the query to change.
Title | Reference | Author | Source Branch | Dest Branch
---|---|---|---|---
Update pagelinks query | repos/structured-data/image-suggestions!40 | mfossati | T350007 | main
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T340437 [EPIC] Data pipelines maintenance
In Progress | | mfossati | T350007 [M] Adapt image suggestions to comply with breaking database schema changes
@Cparle, this wikitech mailing list thread came in a few minutes ago and suggests a call to action.
However, we rely on the wmf_raw.mediawiki_pagelinks Data Lake Hive table, so I think we should wait until its schema gets updated to unblock this ticket. As of now, it hasn't been:
0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+--------------------------+------------+
| col_name                 | data_type  | comment
+--------------------------+------------+
| pl_from                  | bigint     | Key to the page_id of the page containing the link
| pl_namespace             | int        | Key to page_namespace of the target page. The target page may or may not exist, and due to renames and deletions may refer to differe
| pl_title                 | string     | Key to page_title of the target page. The target page may or may not exist, and due to renames and deletions may refer to different p
| pl_from_namespace        | int        | MediaWiki version: ? 1.24 - page_namespace of the page containing the link
| snapshot                 | string     | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
| wiki_db                  | string     | The wiki_db project
|                          | NULL       | NULL
| # Partition Information  | NULL       | NULL
| # col_name               | data_type  | comment
|                          | NULL       | NULL
| snapshot                 | string     | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
| wiki_db                  | string     | The wiki_db project
+--------------------------+------------+
Looks like this will directly follow the corresponding MediaWiki table update, see T355139#9467153.
As a result, T299947 still seems to be the ticket to watch.
T299947#9609372 made me suspect something’s going on with the Hive table we depend on. Then I spotted T345771#9526320 and T345771#9584251, which mean our pipeline is silently consuming null values from affected wikis.
Compare 2024-02 snapshot:
pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-02"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[Row(wiki_db='commonswiki'), Row(wiki_db='testwiki')]
with 2024-01:
pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-01"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[]
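The same check can also be written as one Spark SQL statement parameterized by snapshot, which makes it easy to rerun whenever a new monthly snapshot lands. This is only a minimal sketch: the helper name `null_title_wikis_query` is hypothetical and not part of the image suggestions pipeline.

```python
import re


def null_title_wikis_query(snapshot: str) -> str:
    """Build a Spark SQL query listing the wikis with null pl_title in a snapshot."""
    # Monthly snapshots of this table are named YYYY-MM; reject anything else
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"unexpected snapshot format: {snapshot!r}")
    return (
        "SELECT DISTINCT wiki_db "
        "FROM wmf_raw.mediawiki_pagelinks "
        f"WHERE snapshot = '{snapshot}' AND pl_title IS NULL"
    )


# Usage in a PySpark session (not executed here):
# spark.sql(null_title_wikis_query("2024-02")).collect()
```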
Luckily enough:
Conclusion:
li = spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where(f'snapshot="2024-02-26"').count()
8043945
li = spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where(f'snapshot="2024-02-19"').count()
8011180
According to T345771#9526320:
- The old columns have been dropped in testwiki and will be dropped soon (this and next week) on commonswiki and testcommonswiki.
- The rest of wikis will keep the old schema until all wikis have been migrated (or at least almost all of them if we realize wikidata is taking way too long).
TLDR: Use the new schema on testwiki, testcommonswiki and commonswiki. For the rest, use the old one and follow the announcements in wikitech-l.
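The transition rule quoted above could be captured as a small lookup, should the pipeline ever need to handle both schemas at once. This is a rough sketch under that assumption: the names `NEW_SCHEMA_WIKIS` and `link_target_column` are hypothetical, and the set only reflects the three wikis named in the quoted comment, so it would have to grow as more wikis migrate.

```python
# Wikis named in T345771#9526320 as already on, or about to move to, the new schema
NEW_SCHEMA_WIKIS = frozenset({"testwiki", "testcommonswiki", "commonswiki"})


def link_target_column(wiki_db: str) -> str:
    """Return the pagelinks column that identifies the link target for a wiki."""
    # New schema: pl_target_id points to the linktarget table.
    # Old schema: pl_title holds the target title directly.
    return "pl_target_id" if wiki_db in NEW_SCHEMA_WIKIS else "pl_title"
```

For example, `link_target_column("commonswiki")` returns `"pl_target_id"`, while any wiki outside the set still resolves to `"pl_title"`.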
I haven't seen any announcement as of today.
wmf_raw.mediawiki_pagelinks's 2024-03 snapshot is now available and still has null pl_title values for commonswiki and testwiki:
pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-03"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[Row(wiki_db='commonswiki'), Row(wiki_db='testwiki')]
As a result, no action from our side is needed until:
Migration will complete in roughly one week and old columns will be dropped in two weeks: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/Y4C7W4TEC7DXXTY3HKDBG7HB56QBRXPY/
If this all lands in April, wmf_raw.mediawiki_pagelinks/snapshot=2024-04 will contain the breaking changes.
Change deployed:
0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+--------------------------+------------+
| col_name                 | data_type  | comment
+--------------------------+------------+
| pl_from                  | bigint     | Key to the page_id of the page containing the link
| pl_from_namespace        | int        | MediaWiki version: ? 1.24 - page_namespace of the page containing the link
| pl_target_id             | bigint     | Foreign key to linktarget.
| snapshot                 | string     | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
| wiki_db                  | string     | The wiki_db project
|                          | NULL       | NULL
| # Partition Information  | NULL       | NULL
| # col_name               | data_type  | comment
|                          | NULL       | NULL
| snapshot                 | string     | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
| wiki_db                  | string     | The wiki_db project
+--------------------------+------------+
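With pl_title and pl_namespace gone, the target title and namespace have to be looked up in the linktarget table through pl_target_id. The sketch below shows the general shape of such a join and is not the query from the merge request. It assumes that a linktarget Hive table is available under the name `wmf_raw.mediawiki_private_linktarget`, that it carries MediaWiki's `lt_id`, `lt_namespace` and `lt_title` columns, and that it is partitioned by `snapshot` and `wiki_db` like pagelinks. The function name is hypothetical.

```python
def pagelinks_with_titles_query(
    snapshot: str,
    # Assumed table name; pass the actual linktarget table if it differs
    linktarget_table: str = "wmf_raw.mediawiki_private_linktarget",
) -> str:
    """Build a Spark SQL query that resolves pl_target_id to a namespace and title."""
    return (
        "SELECT pl.wiki_db, pl.pl_from, lt.lt_namespace, lt.lt_title "
        "FROM wmf_raw.mediawiki_pagelinks pl "
        f"JOIN {linktarget_table} lt "
        # lt_id is only unique within a wiki, so match on wiki and snapshot too
        "ON pl.pl_target_id = lt.lt_id "
        "AND pl.wiki_db = lt.wiki_db "
        "AND pl.snapshot = lt.snapshot "
        f"WHERE pl.snapshot = '{snapshot}'"
    )


# Usage in a PySpark session (not executed here):
# links = spark.sql(pagelinks_with_titles_query("2024-04"))
```

Joining on `wiki_db` and `snapshot` in addition to the ID keeps rows from different wikis or dumps from being matched by accident.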
Note that the 2024-03 snapshot was also affected, so I'm moving this to Doing straight away.
mfossati opened https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/40
Update pagelinks query
mfossati merged https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/40
Update pagelinks query