Page MenuHomePhabricator

Adapt Sqoop to pagelinks schema change
Closed, ResolvedPublic3 Estimated Story Points

Event Timeline

@Milimetric please add context for this so I can understand what needs to be done.

This has started, testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considerate private now.

Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking this for us to adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).

Hi Joal. Xabriel asked about this before and I explained. I thought it was handled. I give you a recap:

  • The data is still being migrated in 17-ish wikis. You can see the list in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/997420
  • The data is migrated everywhere else.
  • The old columns have been dropped in testwiki and will be dropped soon (this and next week) on commonswiki and testcommonswiki.
    • The rest of wikis will keep the old schema until all wikis have been migrated (or at least almost all of them if we realize wikidata is taking way too long).

TLDR: Use the new schema on testwiki, testcommonswiki and commonswiki. For the rest, use the old one and follow the announcements in wikitech-l.

lbowmaker set the point value for this task to 8.Feb 16 2024, 7:46 PM

We're gonna build a quickfix for next month sqoop to be successful (null values in dropped fields for some projects).

Change 1007413 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] [WIP] Fix sqoop for pagelinks normalization migration

https://gerrit.wikimedia.org/r/1007413

JAllemandou moved this task from Next Up to In Review on the Data-Engineering (Sprint 9) board.
JAllemandou changed the point value for this task from 8 to 3.

Change 1007413 merged by Joal:

[analytics/refinery@master] Fix sqoop for pagelinks normalization migration

https://gerrit.wikimedia.org/r/1007413

Change #1023399 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update sqoop and schema of pagelink table

https://gerrit.wikimedia.org/r/1023399

Change #1023399 merged by Xcollazo:

[analytics/refinery@master] Update sqoop and schema of pagelink table

https://gerrit.wikimedia.org/r/1023399

Ran the following as the analytics user in an-launcher1002.eqiad.wmnet:

USE wmf_raw;

DROP TABLE wmf_raw.mediawiki_pagelinks;

CREATE EXTERNAL TABLE `mediawiki_pagelinks`(
  `pl_from`             bigint      COMMENT 'Key to the page_id of the page containing the link',
  `pl_from_namespace`   int         COMMENT 'MediaWiki version:  ≥ 1.24 - page_namespace of the page containing the link',
  `pl_target_id`        bigint      COMMENT 'Foreign key to linktarget.'
)
COMMENT
  'See most up to date documentation at https://www.mediawiki.org/wiki/Manual:Pagelinks_table'
PARTITIONED BY (
  `snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)',
  `wiki_db` string COMMENT 'The wiki_db project')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/tables/pagelinks'
;

MSCK REPAIR TABLE wmf_raw.mediawiki_pagelinks;

All commands were successful.

Now let's do a sanity check for partition availability:

spark-sql (default)> show partitions mediawiki_pagelinks partition(wiki_db='enwiki');
partition
snapshot=2017-07_private/wiki_db=enwiki
snapshot=2023-10/wiki_db=enwiki
snapshot=2023-11/wiki_db=enwiki
snapshot=2023-12/wiki_db=enwiki
snapshot=2024-01/wiki_db=enwiki
snapshot=2024-02/wiki_db=enwiki
snapshot=2024-03/wiki_db=enwiki
Time taken: 0.205 seconds, Fetched 7 row(s)

And for data:

spark-sql (default)> select * from mediawiki_pagelinks where snapshot='2024-03' and wiki_db='eswiki' limit 5;
pl_from	pl_from_namespace	pl_target_id	snapshot	wiki_db
5	4	6145	2024-03	eswiki
5	4	6150	2024-03	eswiki
5	4	6152	2024-03	eswiki
5	4	6161	2024-03	eswiki
5	4	6162	2024-03	eswiki
Time taken: 0.615 seconds, Fetched 5 row(s)

We good.