This will need to change: https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L695
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T300222 Implement normalizing MediaWiki link tables | |||
Resolved | simon04 | T365497 bldrwnsch update is broken – Unknown column 'pagelinks.pl_title' | |||
Resolved | Ladsgroup | T299947 Normalize pagelinks table | |||
Resolved | JAllemandou | T345771 Adapt Sqoop to pagelinks schema change |
Event Timeline
This has started, testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considerate private now.
Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking this for us to adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).
Hi Joal. Xabriel asked about this before and I explained. I thought it was handled. I give you a recap:
- The data is still being migrated in 17-ish wikis. You can see the list in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/997420
- The data is migrated everywhere else.
- The old columns have been dropped in testwiki and will be dropped soon (this and next week) on commonswiki and testcommonswiki.
- The rest of wikis will keep the old schema until all wikis have been migrated (or at least almost all of them if we realize wikidata is taking way too long).
TLDR: Use the new schema on testwiki, testcommonswiki and commonswiki. For the rest, use the old one and follow the announcements in wikitech-l.
We're gonna build a quickfix for next month sqoop to be successful (null values in dropped fields for some projects).
Change 1007413 had a related patch set uploaded (by Joal; author: Joal):
[analytics/refinery@master] [WIP] Fix sqoop for pagelinks normalization migration
Change 1007413 merged by Joal:
[analytics/refinery@master] Fix sqoop for pagelinks normalization migration
Change #1023399 had a related patch set uploaded (by Joal; author: Joal):
[analytics/refinery@master] Update sqoop and schema of pagelink table
Change #1023399 merged by Xcollazo:
[analytics/refinery@master] Update sqoop and schema of pagelink table
Ran the following as the analytics user in an-launcher1002.eqiad.wmnet:
USE wmf_raw; DROP TABLE wmf_raw.mediawiki_pagelinks; CREATE EXTERNAL TABLE `mediawiki_pagelinks`( `pl_from` bigint COMMENT 'Key to the page_id of the page containing the link', `pl_from_namespace` int COMMENT 'MediaWiki version: ≥ 1.24 - page_namespace of the page containing the link', `pl_target_id` bigint COMMENT 'Foreign key to linktarget.' ) COMMENT 'See most up to date documentation at https://www.mediawiki.org/wiki/Manual:Pagelinks_table' PARTITIONED BY ( `snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)', `wiki_db` string COMMENT 'The wiki_db project') ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/tables/pagelinks' ; MSCK REPAIR TABLE wmf_raw.mediawiki_pagelinks;
All commands were successful.
Now let's do a sanity check for partition availability:
spark-sql (default)> show partitions mediawiki_pagelinks partition(wiki_db='enwiki'); partition snapshot=2017-07_private/wiki_db=enwiki snapshot=2023-10/wiki_db=enwiki snapshot=2023-11/wiki_db=enwiki snapshot=2023-12/wiki_db=enwiki snapshot=2024-01/wiki_db=enwiki snapshot=2024-02/wiki_db=enwiki snapshot=2024-03/wiki_db=enwiki Time taken: 0.205 seconds, Fetched 7 row(s)
And for data:
spark-sql (default)> select * from mediawiki_pagelinks where snapshot='2024-03' and wiki_db='eswiki' limit 5; pl_from pl_from_namespace pl_target_id snapshot wiki_db 5 4 6145 2024-03 eswiki 5 4 6150 2024-03 eswiki 5 4 6152 2024-03 eswiki 5 4 6161 2024-03 eswiki 5 4 6162 2024-03 eswiki Time taken: 0.615 seconds, Fetched 5 row(s)
We good.