
[M] Adapt image suggestions to comply with breaking database schema changes
Open, In Progress, Needs Triage, Public

Description

As notified in a wikitech mailing list thread, T299947: Normalize pagelinks table will drop the pl_title column, which is consumed by the image suggestions data pipeline.

This is the query to change.

Details

Title: Update pagelinks query
Reference: repos/structured-data/image-suggestions!40
Author: mfossati
Source Branch: T350007
Dest Branch: main

Event Timeline

MarkTraceur renamed this task from Adapt image suggestions to comply with breaking database schema changes to [M] Adapt image suggestions to comply with breaking database schema changes. Nov 29 2023, 5:52 PM

@mfossati is this still blocked?

Seems so. T299947 must be resolved first.

@Cparle , this wikitech mailing list thread came a few minutes ago and suggests a call to action.
However, we rely on the wmf_raw.mediawiki_pagelinks Data Lake Hive table, so I think we should wait until its schema gets updated to unblock this ticket. As of now it hasn't yet:

0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|         col_name         |       data_type       |                                                                                           comment                                     |
+--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| pl_from                  | bigint                | Key to the page_id of the page containing the link                                                                                    |
| pl_namespace             | int                   | Key to page_namespace of the target page. The target page may or may not exist, and due to renames and deletions may refer to differe |
| pl_title                 | string                | Key to page_title of the target page. The target page may or may not exist, and due to renames and deletions may refer to different p |
| pl_from_namespace        | int                   | MediaWiki version:  ? 1.24 - page_namespace of the page containing the link                                                           |
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)                                                   |
| wiki_db                  | string                | The wiki_db project                                                                                                                   |
|                          | NULL                  | NULL                                                                                                                                  |
| # Partition Information  | NULL                  | NULL                                                                                                                                  |
| # col_name               | data_type             | comment                                                                                                                               |
|                          | NULL                  | NULL                                                                                                                                  |
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)                                                   |
| wiki_db                  | string                | The wiki_db project                                                                                                                   |
+--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------+

we rely on the wmf_raw.mediawiki_pagelinks Data Lake Hive table, so I think we should wait until its schema gets updated to unblock this ticket.

Looks like this will directly follow the corresponding MediaWiki table update, see T355139#9467153.
As a result, T299947 still seems the ticket to watch.

T299947#9609372 made me suspect something’s going on with the Hive table we depend on. Then I spotted T345771#9526320 and T345771#9584251, which mean our pipeline is silently consuming null values from affected wikis.
Compare 2024-02 snapshot:

pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-02"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[Row(wiki_db='commonswiki'), Row(wiki_db='testwiki')]

with 2024-01:

pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-01"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[]

Luckily enough:

Conclusion:

  • 2024-02-26's lead image dataset seems consistent with the previous one:
spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where('snapshot="2024-02-26"').count()
8043945
spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where('snapshot="2024-02-19"').count()
8011180
  • although T299947 is still the official ticket to watch, it seems we're not safe from side effects of other untracked tickets
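To make "seems consistent" a bit more concrete, here is a minimal sketch of the comparison between the two row counts above. The function names and the 5% tolerance are assumptions picked for illustration, not something the pipeline defines. With the counts reported above, the relative change is about 0.41%.

```python
def relative_change(current: int, previous: int) -> float:
    """Relative change of the current row count with respect to the previous one."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return (current - previous) / previous


def is_consistent(current: int, previous: int, tolerance: float = 0.05) -> bool:
    """True if the two counts differ by at most `tolerance` (hypothetical threshold)."""
    return abs(relative_change(current, previous)) <= tolerance


# Row counts taken from the 2024-02-26 and 2024-02-19 snapshots above
change = relative_change(8043945, 8011180)
print(f"{change:.2%}", is_consistent(8043945, 8011180))
```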

According to T345771#9526320:

  • The old columns have been dropped in testwiki and will be dropped soon (this and next week) on commonswiki and testcommonswiki.
    • The rest of wikis will keep the old schema until all wikis have been migrated (or at least almost all of them if we realize wikidata is taking way too long).

TLDR: Use the new schema on testwiki, testcommonswiki and commonswiki. For the rest, use the old one and follow the announcements in wikitech-l.

I haven't seen any announcement as of today.
wmf_raw.mediawiki_pagelinks's 2024-03 snapshot is now available and still has null pl_title values for commonswiki and testwiki:

pl = spark.read.table('wmf_raw.mediawiki_pagelinks').where('snapshot="2024-03"')
pl.where("pl_title is null").select('wiki_db').distinct().collect()
[Row(wiki_db='commonswiki'), Row(wiki_db='testwiki')]

As a result, no action from our side is needed until:

  • completed migration is announced in wikitech-l
  • the subsequent wmf_raw.mediawiki_pagelinks snapshot is generated
NOTE: For extra safety, I suggest running the check above at the beginning of every month, i.e., one week before the image suggestions DAG looks for a new wmf_raw.mediawiki_pagelinks monthly snapshot.
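The monthly check could be wrapped in a small helper, so it is easy to rerun per snapshot. This is only a sketch: the function name is made up, and it assumes `spark` is an active SparkSession such as the one used in the snippets above.

```python
def wikis_with_null_titles(spark, snapshot, table="wmf_raw.mediawiki_pagelinks"):
    """Return the sorted wiki_db values that have null pl_title in the given snapshot.

    `spark` is assumed to be an active SparkSession; nothing is imported here
    on purpose, so the helper can be pasted into an existing notebook session.
    """
    rows = (
        spark.read.table(table)
        .where(f'snapshot="{snapshot}"')
        .where("pl_title is null")
        .select("wiki_db")
        .distinct()
        .collect()
    )
    # Row objects support lookup by column name
    return sorted(row["wiki_db"] for row in rows)
```

An empty list means no wiki is affected in that snapshot, e.g. `wikis_with_null_titles(spark, "2024-01")` should match the empty result shown earlier.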

Migration will complete in roughly one week and old columns will be dropped in two weeks: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/Y4C7W4TEC7DXXTY3HKDBG7HB56QBRXPY/
If this all lands in April, wmf_raw.mediawiki_pagelinks/snapshot=2024-04 will contain the breaking changes.

NOTE: check at the beginning of May.
mfossati changed the task status from Open to In Progress. Thu, May 2, 9:11 AM
mfossati claimed this task.

Change deployed:

0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+--------------------------+-----------------------+--------------------------------------------------------------------------------------+
|         col_name         |       data_type       |                                       comment                                        |
+--------------------------+-----------------------+--------------------------------------------------------------------------------------+
| pl_from                  | bigint                | Key to the page_id of the page containing the link                                   |
| pl_from_namespace        | int                   | MediaWiki version:  ? 1.24 - page_namespace of the page containing the link          |
| pl_target_id             | bigint                | Foreign key to linktarget.                                                           |
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)  |
| wiki_db                  | string                | The wiki_db project                                                                  |
|                          | NULL                  | NULL                                                                                 |
| # Partition Information  | NULL                  | NULL                                                                                 |
| # col_name               | data_type             | comment                                                                              |
|                          | NULL                  | NULL                                                                                 |
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)  |
| wiki_db                  | string                | The wiki_db project                                                                  |
+--------------------------+-----------------------+--------------------------------------------------------------------------------------+
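A quick way to tell the two layouts apart programmatically is to look at the column names, e.g. those returned by `spark.read.table('wmf_raw.mediawiki_pagelinks').columns`. The sketch below is an illustration only: the function name and the returned labels are made up, while the column names come from the two `describe` outputs in this task.

```python
def pagelinks_schema(columns):
    """Classify a pagelinks column list as 'old', 'new', or 'unknown'.

    'old' has pl_namespace/pl_title, 'new' has pl_target_id instead.
    The labels are hypothetical, chosen for this example.
    """
    cols = set(columns)
    has_old = {"pl_namespace", "pl_title"} <= cols
    has_new = "pl_target_id" in cols
    if has_new and not has_old:
        return "new"
    if has_old and not has_new:
        return "old"
    return "unknown"


old_columns = ["pl_from", "pl_namespace", "pl_title", "pl_from_namespace", "snapshot", "wiki_db"]
new_columns = ["pl_from", "pl_from_namespace", "pl_target_id", "snapshot", "wiki_db"]
print(pagelinks_schema(old_columns), pagelinks_schema(new_columns))
```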

Note that the 2024-03 snapshot was also affected, so I'm moving this to Doing straight away.

Fix deployed & pipeline resumed. Needs some monitoring.
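For reference, with the new schema the target title and namespace have to be resolved through linktarget, joining pl_target_id against lt_id. Below is a minimal sketch of such a query, not the actual change merged in image-suggestions!40. The function name is made up, and both the linktarget Hive table name and its snapshot/wiki_db partitioning are assumptions to verify against the Data Lake.

```python
def build_pagelinks_query(
    snapshot,
    pagelinks_table="wmf_raw.mediawiki_pagelinks",
    # ASSUMPTION: Hive table name for linktarget, please double check
    linktarget_table="wmf_raw.mediawiki_private_linktarget",
):
    """Build a SQL query that recovers link target namespace and title via linktarget."""
    return f"""
SELECT pl.wiki_db,
       pl.pl_from,
       lt.lt_namespace AS pl_namespace,
       lt.lt_title AS pl_title
FROM {pagelinks_table} pl
JOIN {linktarget_table} lt
  ON pl.pl_target_id = lt.lt_id
 AND pl.wiki_db = lt.wiki_db    -- ASSUMPTION: linktarget is partitioned like pagelinks
 AND pl.snapshot = lt.snapshot
WHERE pl.snapshot = '{snapshot}'
""".strip()


print(build_pagelinks_query("2024-04"))
```

Aliasing lt_namespace and lt_title back to the old column names keeps downstream code that expects pl_namespace and pl_title unchanged, which is a design choice of this sketch.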