Report progress of Wikibase entity dumps in logs
Closed, Resolved, Public, 5 Estimated Story Points

Description

We recently found that it would be useful for the dump scripts to have some sort of progress indicator, e.g. to make it easier to see in the logs of a running dump job whether the dumps are being generated at a reasonable speed, or how close the job is to completion.

They currently only report how many entities were processed once a batch has been completed, which means logging messages like "Processed 30490 entities." several thousand times per job. That is not very useful.

Note: one approach considered for tracking completed batches against the total is to compute the total as the number of shards times the number of batches per shard, and to determine the number of completed batches from the entries in the shared log file.
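The calculation in the note could look roughly like the sketch below. It is not the code from the actual patch: the function names, the assumption that the shared log file holds one non-empty line per completed batch, and the shard and batch counts in the usage example are all hypothetical and chosen only for illustration.

```python
from typing import Iterable


def count_completed_batches(log_lines: Iterable[str]) -> int:
    """Count completed batches from the shared log file.

    Assumption (hypothetical): each non-empty line marks one completed batch.
    """
    return sum(1 for line in log_lines if line.strip())


def format_progress(completed: int, shards: int, batches_per_shard: int) -> str:
    """Build a progress message from the completed and total batch counts."""
    # Total number of batches = number of shards * batches per shard.
    total = shards * batches_per_shard
    if total <= 0:
        raise ValueError("total number of batches must be positive")
    # Never report more completed batches than exist in total.
    completed = max(0, min(completed, total))
    # Integer division rounds down, so 100% only appears once all batches are done.
    percent = completed * 100 // total
    return f"Progress: {completed}/{total} batches done ({percent}%)"


if __name__ == "__main__":
    # Illustrative numbers only: 8 shards with 252 batches each.
    fake_log = ["batch done\n"] * 2014
    print(format_progress(count_completed_batches(fake_log), 8, 252))
```

Rounding down rather than to the nearest integer is a deliberate choice in this sketch: a job with one batch still outstanding does not read as 100% complete.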

A/C:

  • Airflow logs show the percentage of batches done after each batch

Event Timeline

WMDE-leszek set the point value for this task to 5.
WMDE-leszek moved this task from Polished to Ready for planning on the Wikibase Reuse Team board.

Change #1219837 had a related patch set uploaded (by Silvan Heintze; author: Silvan Heintze):

[operations/dumps@master] Report progress of Wikibase entity dumps

https://gerrit.wikimedia.org/r/1219837

Change #1219837 merged by Btullis:

[operations/dumps@master] Report progress of Wikibase entity dumps

https://gerrit.wikimedia.org/r/1219837

This is working really nicely! The logs can be found e.g. here (latest lexeme RDF dump). The output at the bottom of the logs now reads:

...
[2026-01-16, 23:35:47 UTC] {pod_manager.py:412} INFO - [base] Starting batch 249
[2026-01-16, 23:35:47 UTC] {pod_manager.py:412} INFO - [base] Progress: 2014/2016 batches done (99%)
[2026-01-16, 23:36:22 UTC] {pod_manager.py:412} INFO - [base] Starting batch 250
[2026-01-16, 23:36:22 UTC] {pod_manager.py:412} INFO - [base] Progress: 2015/2016 batches done (99%)
[2026-01-16, 23:36:42 UTC] {pod_manager.py:412} INFO - [base] Starting batch 251
[2026-01-16, 23:40:46 UTC] {pod_manager.py:412} INFO - [base] Progress: 2016/2016 batches done (100%)
[2026-01-16, 23:45:46 UTC] {pod_manager.py:412} INFO - [base] Number of skipped entities: 0
...

If anything, we could probably remove the additional "Starting batch XYZ" log messages now. They weren't very useful to begin with, since there was no indication of which shard each batch belongs to or how many batches there are in total.

lovely! I'd agree with removing the batch numbers

Change #1229127 had a related patch set uploaded (by Jakob; author: Jakob):

[operations/dumps@master] Stop logging batch start

https://gerrit.wikimedia.org/r/1229127

Change #1229127 merged by Brouberol:

[operations/dumps@master] Stop logging batch start

https://gerrit.wikimedia.org/r/1229127