
[MW History, Dumps] Review impact of recent changes to MW DBs that remove certain columns
Closed, Resolved | Public | 5 Estimated Story Points

Description

The Data Persistence team has done the work needed to start dropping the old columns on s4 and s5 this week and next, and will start dropping them on s3 right after.

The work is described in this RfC, and the specific changes are tracked in this task.

Acceptance Criteria

  • Impacts on MW History (and other pipelines owned by Data Products) have been established
  • Impacts on Dumps generation have been established
  • Short write up detailing the impacts

Event Timeline

WDoranWMF created this task.
xcollazo added a subscriber: JAllemandou.

Impacts for MW History

The sqoop job that feeds the MW History pipeline has its systemd definition here, and its list of required tables is defined here; the list is copied below for convenience:

archive,change_tag,change_tag_def,logging,page,revision,user,user_groups

Since this list does not include pagelinks, MW History will not be affected by these DB changes.
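The check above is simple enough to script if the table list changes later. A minimal sketch, assuming the only table affected by this migration is pagelinks (the variable names are made up for illustration):

```python
# Table list copied from the sqoop job configuration quoted above.
SQOOP_TABLES = "archive,change_tag,change_tag_def,logging,page,revision,user,user_groups".split(",")

# Assumption: pagelinks is the only table touched by this column drop.
AFFECTED_TABLES = {"pagelinks"}

# Any table present in both sets would need attention in MW History.
overlap = AFFECTED_TABLES.intersection(SQOOP_TABLES)
print("MW History affected:", bool(overlap))
```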

Other Pipelines

clickstream_monthly_dag.py - The underlying job, ClickstreamBuilder.scala, will break as it reads pagelinks. You'd think the fix would be easy enough, as per Amir's email thread:

pl_namespace and pl_title columns of pagelinks table will be dropped and you will need to use pl_target_id joining with the linktarget table instead. This is basically identical to the templatelinks normalization that happened a year ago.
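The join described in the quote can be sketched as follows. This is only an illustration using an in-memory SQLite database with invented sample rows, not the actual Spark job. The linktarget column names (lt_id, lt_namespace, lt_title) come from the MediaWiki schema and are not stated in this task.

```python
import sqlite3

# In-memory stand-in for a wiki DB that has already dropped the old columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE linktarget (lt_id INTEGER PRIMARY KEY, lt_namespace INTEGER, lt_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_from_namespace INTEGER, pl_target_id INTEGER);
-- Sample rows are invented for illustration.
INSERT INTO linktarget VALUES (1, 0, 'Main_Page'), (2, 14, 'Physics');
INSERT INTO pagelinks VALUES (100, 0, 1), (101, 0, 2), (102, 0, 1);
""")

# Recover the old (pl_namespace, pl_title) shape by joining on pl_target_id.
NEW_PAGELINKS_QUERY = """
SELECT pl.pl_from, lt.lt_namespace AS pl_namespace, lt.lt_title AS pl_title
FROM pagelinks AS pl
JOIN linktarget AS lt ON lt.lt_id = pl.pl_target_id
ORDER BY pl.pl_from
"""

rows = conn.execute(NEW_PAGELINKS_QUERY).fetchall()
for row in rows:
    print(row)
```

Aliasing the joined columns back to pl_namespace and pl_title keeps downstream code that expects the old column names unchanged.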

However, the caveat is that, as of now, the plan is to drop the columns on some sections while other sections have not yet been completely migrated. This means we will need code that understands which wiki is being fetched, and we will then have to come back and migrate the code a second time once all sections are done. The rationale for doing the changes this way is that some section migrations take a very long time. We need to monitor the outcome of this email thread.
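One possible shape for the wiki-aware code, as a minimal sketch: the set of migrated wikis below is a placeholder, the function name is hypothetical, and the linktarget column names (lt_id, lt_namespace, lt_title) are taken from the MediaWiki schema rather than from this task.

```python
# Placeholder: wikis whose old pagelinks columns are assumed to be already dropped.
MIGRATED_WIKIS = frozenset({"commonswiki"})

# Query for wikis that still have pl_namespace and pl_title.
OLD_QUERY = "SELECT pl_from, pl_namespace, pl_title FROM pagelinks"

# Query for migrated wikis: same output columns, resolved through linktarget.
NEW_QUERY = (
    "SELECT pl_from, lt_namespace AS pl_namespace, lt_title AS pl_title "
    "FROM pagelinks JOIN linktarget ON lt_id = pl_target_id"
)


def pagelinks_query(wiki_db: str, migrated: frozenset = MIGRATED_WIKIS) -> str:
    """Return the pagelinks query matching the schema state of wiki_db."""
    return NEW_QUERY if wiki_db in migrated else OLD_QUERY


print(pagelinks_query("commonswiki"))
print(pagelinks_query("enwiki"))
```

Once every section is migrated, the branch and OLD_QUERY can be deleted, which is the second code migration mentioned above.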

image_suggestions_dag.py - This may break. It is owned by the Structured Data team; they are aware of the incoming changes and will fix it via T350007.

Sqoop job itself

Given we are dropping two columns, I am not sure how the sqoop job will behave. Perhaps @JAllemandou knows whether sqoop will do the right thing or whether we need to intervene manually?

Additionally, if sqoop is OK with the target having the old columns, then perhaps we should drop the columns manually? Here is how the schema of the target table looks as of today:

hive (wmf_raw)> describe mediawiki_pagelinks;
OK
col_name	data_type	comment
pl_from             	bigint              	Key to the page_id of the page containing the link
pl_namespace        	int                 	Key to page_namespace of the target page. The target page may or may not exist, and due to renames and deletions may refer to different page records as time goes by
pl_title            	string              	Key to page_title of the target page. The target page may or may not exist, and due to renames and deletions may refer to different page records as time goes by. Spaces are converted to underscores, and the first letter is automatically capitalized. So for
pl_from_namespace   	int                 	MediaWiki version:  ? 1.24 - page_namespace of the page containing the link
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
wiki_db             	string              	The wiki_db project 
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
wiki_db             	string              	The wiki_db project 
Time taken: 3.172 seconds, Fetched: 12 row(s)

Dumps Generation

XMLDumpWriter.php does not use pagelinks directly, so we do not expect this migration to require any work there.

We do dump the pagelinks table verbatim via a script that controls mysqldump. However, this job doesn't care about the schema of the table, so it will continue working.

Conclusion: Low risk for dumps 1.0.

Update from @Ladsgroup :

On Fri, Jan 19, 2024 at 4:08 PM Xabriel Collazo Mojica <xcollazo@wikimedia.org> wrote:
Amir,

To summarize: the only wiki that will soon get the old columns dropped is commonswiki and the rest of the wikis will keep the old columns until the migration to the new columns is complete on all wikis, at which time there will be a communication.

Is this correct?

Yes, until further communication, only s4 (commonswiki and testcommonswiki) and testwiki (s3) will have their old columns removed.