
Ingest user similarity data for March 2021
Closed, Resolved | Public | 1 Estimated Story Points

Description

The Similarusers database should be refreshed with March 2021 data.
This is a maintenance ticket to coordinate all parties involved and to set an ETA.

This action requires:

A new run of the algorithm that generates user similarity data.
MySQL ingestion.

During ingestion the service will enter a maintenance window of approximately 4 to 6 hours. During maintenance, recommendations won't be served.


Event Timeline

gmodena set the point value for this task to 1.

Hey @Marostegui; we'd need to run the Similarusers pipeline for March data. I'm doing a dry run today, and if possible would like to schedule the actual ingestion after the weekend (CEST work hours). Would you have a preference for a date/time?

Any time between 07:00 and 14:00 CEST on Tuesday, for instance, should work for me.
Would that work for you too?

Will this run use the same batch size and throttling as the previous one, where we didn't notice any significant impact?
If this next run goes well, I think future runs can go ahead without being blocked on us. Does that sound good?

Thanks!

Hey @Marostegui. Tuesday works! I can kick off the job between 8 and 9 CEST, and monitor it as it chugs along. If it goes overtime (past 14:00), would that be a problem?

The job will run with the same batch/throttling values as the previous one. Let's touch base once it completes. Happy to take this off your plate if the run goes well!

> Hey @Marostegui. Tuesday works! I can kick off the job between 8 and 9 CEST, and monitor it as it chugs along. If it goes overtime (past 14:00), would that be a problem?

No, that should be fine!
Adding @Kormat and @LSobanski for visibility here.

Thanks

The ingestion part of the data pipeline kicked off at 2021-04-13 09:05:37,296.
It is configured with:

SIMILARUSERS_BATCH_SIZE=7000
SIMILARUSERS_THROTTLE_MS=1000

For now I'm seeing the same throughput as last time. I'll keep monitoring it throughout the day.
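
For readers unfamiliar with this kind of setup, here is a minimal sketch of what a batch size plus a throttle could mean in practice. It assumes the throttle is a fixed pause after each batch, which this thread does not confirm. The function names (`batched`, `ingest`, `settings_from_env`) and the `insert_batch` callback are hypothetical and are not taken from the Similarusers code; only the two environment variable names come from the comment above.

```python
import os
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(
    rows: Iterable,
    insert_batch: Callable[[List], None],
    batch_size: int,
    throttle_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Insert rows batch by batch, pausing throttle_ms after each batch.

    ASSUMPTION: the throttle is a fixed pause between batches. The real
    pipeline may implement throttling differently.
    """
    inserted = 0
    for batch in batched(rows, batch_size):
        insert_batch(batch)  # hypothetical stand-in for the MySQL insert
        inserted += len(batch)
        sleep(throttle_ms / 1000.0)
    return inserted


def settings_from_env() -> Tuple[int, int]:
    """Read the two settings quoted in the ticket, with those values as defaults."""
    batch_size = int(os.environ.get("SIMILARUSERS_BATCH_SIZE", "7000"))
    throttle_ms = int(os.environ.get("SIMILARUSERS_THROTTLE_MS", "1000"))
    return batch_size, throttle_ms


if __name__ == "__main__":
    # Tiny demo with small values so it finishes quickly.
    seen: List[int] = []
    total = ingest(range(10), lambda b: seen.append(len(b)), batch_size=3, throttle_ms=10)
    print(total, seen)
```

Under that assumption, a batch size of 7000 with a 1000 ms pause caps throughput at 7000 rows per second, since each batch costs at least one second. The time spent on the insert itself pushes the real rate below that cap.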

The job completed successfully at 2021-04-13 15:37:22,710.
Some stats for the ingested datasets:

Loading /home/gmodena/similar-users-private/data/2021-03/temporal.tsv: 18073183rows [53:45, 5603.32rows/s]
Loading /home/gmodena/similar-users-private/data/2021-03/metadata.tsv: 8079707rows [27:36, 4876.22rows/s]
Loading /home/gmodena/similar-users-private/data/2021-03/coedit_counts.tsv: 107514724rows [5:09:38, 5786.94rows/s]
Model=Temporal  Read=18073183   Skipped=0       Inserted=18073183
Model=UserMetadata      Read=8079707    Skipped=0       Inserted=8079707
Model=Coedit    Read=107514724  Skipped=0       Inserted=107514724
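
The numbers above can be cross-checked mechanically. The sketch below parses the three `Model=...` summary lines and the start and end timestamps quoted in this thread. It then checks that every row read was either skipped or inserted, and computes the total wall-clock time. The helper names (`parse_summary`, `elapsed_seconds`) are illustrative and not part of the pipeline.

```python
import re
from datetime import datetime
from typing import Dict

# Summary lines and timestamps copied from the comments in this thread.
SUMMARY = """\
Model=Temporal  Read=18073183   Skipped=0       Inserted=18073183
Model=UserMetadata      Read=8079707    Skipped=0       Inserted=8079707
Model=Coedit    Read=107514724  Skipped=0       Inserted=107514724
"""
STARTED = "2021-04-13 09:05:37,296"
FINISHED = "2021-04-13 15:37:22,710"

LINE = re.compile(r"Model=(\w+)\s+Read=(\d+)\s+Skipped=(\d+)\s+Inserted=(\d+)")


def parse_summary(text: str) -> Dict[str, Dict[str, int]]:
    """Return {model: {"read": ..., "skipped": ..., "inserted": ...}}."""
    stats: Dict[str, Dict[str, int]] = {}
    for match in LINE.finditer(text):
        model, read, skipped, inserted = match.groups()
        stats[model] = {
            "read": int(read),
            "skipped": int(skipped),
            "inserted": int(inserted),
        }
    return stats


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


if __name__ == "__main__":
    stats = parse_summary(SUMMARY)
    for model, s in stats.items():
        # Every row read should be accounted for as skipped or inserted.
        assert s["read"] == s["skipped"] + s["inserted"], model
    total_rows = sum(s["inserted"] for s in stats.values())
    seconds = elapsed_seconds(STARTED, FINISHED)
    print(f"rows inserted: {total_rows}")
    print(f"elapsed: {seconds / 3600:.2f} h")
    print(f"overall rate: {total_rows / seconds:.0f} rows/s")
```

For the timestamps quoted here, the elapsed time comes to about 6 hours 32 minutes. That is slightly longer than the 4 to 6 hour maintenance window estimated in the description.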

All good from my side too.
Let's run the next ingestion without giving us (DBAs) a heads-up, to see whether it is as fully transparent as this one.