
Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes
Open, Needs Triage | Public | 8 Estimated Story Points

Description

clickstream_monthly_dag.py: the underlying job, ClickstreamBuilder.scala, will break because it reads pagelinks. You'd think the fix would be easy enough, as per Amir's email thread:

pl_namespace and pl_title columns of pagelinks table will be dropped and you will need to use pl_target_id joining with the linktarget table instead. This is basically identical to the templatelinks normalization that happened a year ago.
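To illustrate the join described in the email, here is a minimal sketch using an in-memory SQLite database. It is not the ClickstreamBuilder code: the real job runs in Scala against the analytics copies of these tables, and the sample rows below are made up. The column names on linktarget (lt_id, lt_namespace, lt_title) follow the MediaWiki schema; everything else is an assumption for demonstration only.

```python
import sqlite3

# Toy stand-ins for the MediaWiki tables; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE linktarget (
        lt_id INTEGER PRIMARY KEY,
        lt_namespace INTEGER,
        lt_title TEXT
    );
    CREATE TABLE pagelinks (
        pl_from INTEGER,
        pl_from_namespace INTEGER,
        pl_target_id INTEGER
    );
    INSERT INTO linktarget VALUES (1, 0, 'Main_Page'), (2, 14, 'Physics');
    INSERT INTO pagelinks VALUES (100, 0, 1), (101, 0, 2), (102, 0, 1);
""")

# After normalization, the link target's namespace and title are recovered
# by joining pagelinks.pl_target_id to linktarget.lt_id.
NEW_SCHEMA_QUERY = """
    SELECT pl.pl_from,
           lt.lt_namespace AS pl_namespace,
           lt.lt_title     AS pl_title
    FROM pagelinks pl
    JOIN linktarget lt ON lt.lt_id = pl.pl_target_id
    ORDER BY pl.pl_from
"""

rows = conn.execute(NEW_SCHEMA_QUERY).fetchall()
for row in rows:
    print(row)
```

Aliasing the joined columns back to pl_namespace and pl_title keeps the output shape the same as the old schema, so downstream code that expects those names would not need to change.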

However, the caveat is that, as of now, the plan is to drop the columns on some sections while other sections have not yet been completely migrated. This means we will need code that understands which wiki is being fetched, and we will later have to come back and migrate the code again. The rationale for doing the changes this way is that some section migrations take a very long time. We need to monitor the outcome of this email thread.

Further:

Update from @Ladsgroup :

On Fri, Jan 19, 2024 at 4:08 PM Xabriel Collazo Mojica <xcollazo@wikimedia.org> wrote:
Amir,

To summarize: the only wiki that will soon get the old columns dropped is commonswiki and the rest of the wikis will keep the old columns until the migration to the new columns is complete on all wikis, at which time there will be a communication.

Is this correct?

Yes, until further communication, only s4 (commonswiki and testcommonswiki) and testwiki (s3) will have their old columns removed.
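If per-wiki handling does turn out to be needed, one possible shape is sketched below. The set of wikis comes from the reply above; the function name, the constant names, and the unqualified table names are hypothetical and are not taken from ClickstreamBuilder or the airflow-dags repository.

```python
# Wikis whose old pagelinks columns are removed first, per the reply above.
# This set would have to be revisited once further communication arrives.
NORMALIZED_WIKIS = frozenset({"commonswiki", "testcommonswiki", "testwiki"})

# Old schema: namespace and title are read directly from pagelinks.
OLD_SCHEMA_QUERY = """
    SELECT pl_from, pl_namespace, pl_title
    FROM pagelinks
"""

# New schema: namespace and title come from linktarget via pl_target_id.
NEW_SCHEMA_QUERY = """
    SELECT pl.pl_from,
           lt.lt_namespace AS pl_namespace,
           lt.lt_title     AS pl_title
    FROM pagelinks pl
    JOIN linktarget lt ON lt.lt_id = pl.pl_target_id
"""


def pagelinks_query(wiki_db: str) -> str:
    """Return the query matching the pagelinks schema of the given wiki.

    Hypothetical helper: it only picks between the two query shapes and
    does not add any filtering on the wiki itself.
    """
    if wiki_db in NORMALIZED_WIKIS:
        return NEW_SCHEMA_QUERY
    return OLD_SCHEMA_QUERY


for wiki in ("commonswiki", "enwiki"):
    uses_join = "linktarget" in pagelinks_query(wiki)
    print(wiki, "uses linktarget join:", uses_join)
```

Keeping the wiki list in one constant means the second migration step mentioned in the description would amount to extending or removing that set, rather than touching the query logic.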

Details

Title | Reference | Author | Source Branch | Dest Branch
Update analytics clickstream job | repos/data-engineering/airflow-dags!667 | joal | update_analytics_clickstream | main

Event Timeline

lbowmaker set the point value for this task to 8. (Feb 16 2024, 7:46 PM)

@lbowmaker clickstream_monthly_dag.py sensors typically take till the 3rd of the month to succeed, so we have about 4 days till this breaks.

@xcollazo - we moved this out of scope for our current sprint so we could focus on the sqoop job: https://phabricator.wikimedia.org/T345771

It was thought that we had more time for this change, as the wikis in scope for ClickStream don't have the schema changes rolled out yet.

@JAllemandou - let me know if I got this wrong.

Change #1023828 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] [WIP] Update ClickstreamBuilder

https://gerrit.wikimedia.org/r/1023828

Change #1023828 merged by jenkins-bot:

[analytics/refinery/source@master] Update ClickstreamBuilder

https://gerrit.wikimedia.org/r/1023828

( Could not get to merging the airflow-dags MR today, so paused the clickstream_monthly DAG till tomorrow when I have more time. )

Deployed to prod.

As per recent runs, we'll know if the changes are good on the 3rd of the month, which is when we typically run the clickstream_builder operator in the clickstream_monthly DAG.