Scan all images on Wikimedia Commons
Closed, Declined (Public)

Description

There are ~62M images on Wikimedia Commons that should be scanned.

We have a limit of 10M API calls per month, so we'll need to space out these scans over 7 months.

This is done when:

  • Initial scan of 10000 images
  • Scan group 0 of 1M images
  • Scan group 1 of 5M images
  • Scan group 2 of 10M images
  • Scan group 3 of 10M images
  • Scan group 4 of 10M images
  • Scan group 5 of 10M images
  • Scan group 6 of 10M images
  • Scan group 7 of 2M+ images

It's possible that we'll have more than 70M images in Commons by the end of 7 months, so we might need to run another set.
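
As a rough check of the schedule arithmetic, here is a minimal sketch. It assumes one scanned image costs one API call; the function name `months_needed` is illustrative and not part of any existing script.

```python
def months_needed(total_images: int, monthly_limit: int) -> int:
    """Number of monthly batches needed to scan total_images under the cap.

    Assumption: one image scan consumes exactly one API call.
    Integer ceiling division avoids floating-point rounding.
    """
    return (total_images + monthly_limit - 1) // monthly_limit


MONTHLY_LIMIT = 10_000_000  # API calls per month, from the description

print(months_needed(62_000_000, MONTHLY_LIMIT))  # current estimate of images
print(months_needed(70_000_001, MONTHLY_LIMIT))  # just past 70M images
```

With these numbers, ~62M images fit in 7 monthly batches, and anything beyond 70M images spills into an 8th batch, which is why another set of runs might be needed.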

The scans should run in chronological order from oldest to newest.

It may make sense to script this work (see T254389) to lower the overhead of running the scans manually (remembering to do it, nursing the script).

Event Timeline

eprodromou updated the task description.

I have started a scan of the first 10000 images. To continue, use --start=20041210132540

https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?panelId=15&fullscreen&orgId=1 can be used to track how the job queue is progressing over the backlog

First million jobs submitted. To continue: --start=20070331000313

5 million more jobs started. To continue the script from this point, run ModerateExistingFiles.php with the argument --start=20100511124657

I'm going to move this to our "Productionizing" epic, since this is kind of a long-term usage issue. It would still be great to have this automated, but I've added a schedule item to my calendar to check on this ticket once a month for the next ~6 months.

For next time: --start=20130424111339

Checking in on this item. From above, it looks like we have scanned 16M images so far. Thanks! Have we started the next scan, "Scan group 3 of 10M images"?

Due to forgotten configuration, half of the last batch got discarded. I've started the process of resubmitting them.

--start=20150622112753 for next one.

Next batch: --start=20161119200938

This has been started.

@Pchelolo @drochford Are the checkboxes in the task description up-to-date?

@Pchelolo When the script finishes running, how would I get the start timestamp for the next run?

Yes. It should print it.

If I ssh into the server after a run has completed, and do screen -r, it tells me there's no screen to resume, presumably because the screen session finished when the run finished. Does it log the last timestamp somewhere?

You can go on a prod box where kafkacat is installed (mwmaint has it) and run

kafkacat -b kafka-main1001.eqiad.wmnet -t "eqiad.mediawiki.job.processMediaModeration" -c 1 -o -1

This gives you the last job submitted. It has the title of the last processed page and the timestamp you can use. In this case, it's 20180528185513

@Tchanders Could you kick off the next 10 million images for February please? Also, can we determine what date the last image scanned from January or the first image from this scan is? Thank you!

Done - starting from 20191022034544
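
To turn a --start value into a readable date, a small sketch like the following may help. It assumes the values are 14-digit MediaWiki-style timestamps (YYYYMMDDHHMMSS) in UTC; `parse_start` is an illustrative helper name, not something the maintenance script provides.

```python
from datetime import datetime, timezone


def parse_start(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDHHMMSS timestamp, assumed to be UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# The value this run started from
print(parse_start("20191022034544").isoformat())
```

Because the format is fixed-width with the most significant field first, comparing two of these values as plain strings gives the same order as comparing the dates.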

@Tchanders Could you kick off the next 10 million images for March please? Could you also let me know the date of the last image scanned in February, or of the first image in this scan? Thanks

Done - starting from 20201127020228

@Tchanders Could you kick off the next 10 million images for May on the 1st please? Please let me know the date of the last image scanned from March, or the first image from this scan. Thanks

You can go on a prod box where kafkacat is installed (mwmaint has it) and run

kafkacat -b kafka-main1001.eqiad.wmnet -t "eqiad.mediawiki.job.processMediaModeration" -c 1 -o -1

This gives you the last job submitted. It has the title of the last processed page and the timestamp you can use. [...]

@Pchelolo I've just tried this to check the final timestamp of the last run, and I'm getting no data - instead just:

% Reached end of topic eqiad.mediawiki.job.processMediaModeration [0] at offset 52511923

I'm wondering if the log has been cleared at some point since the job ended? The main difference between this and previous runs is a longer time between runs. Do you know if the final timestamp would have been logged anywhere else?

Yeah, kafka jobs are only retained for a month...

Hm... I guess if there were any images found during this run, you can start from the last one found... It also wrote the timestamp to stdout in the screen session, but that probably didn't survive either.

Logs from failed jobs persist for a long time; the latest was on Mar 28, 2021 @ 04:35:17

Failed executing job: processMediaModeration File:Zentralblatt_der_Bauverwaltung_1885_Seite_347_Fig_3_Längsschnitt.png timestamp=20210306204315 namespace=6 title=Zentralblatt_der_Bauverwaltung_1885_Seite_347_Fig_3_Längsschnitt.png requestId=53bd022e23a988967a9bfb8c

So we got at least up to here. That's this year, so you'd probably be safe starting from there, and it should finish rather quickly.
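
If several failed-job log lines are available, the latest timestamp can be pulled out mechanically. This is only a sketch: it assumes each line carries a `timestamp=<14 digits>` field like the example above, and `latest_timestamp` is an illustrative name.

```python
import re
from typing import Iterable, Optional

# Assumption: log lines contain "timestamp=" followed by 14 digits,
# as in the failed-job log line quoted above.
TIMESTAMP_RE = re.compile(r"\btimestamp=(\d{14})\b")


def latest_timestamp(log_lines: Iterable[str]) -> Optional[str]:
    """Return the largest 14-digit timestamp found in the lines, or None."""
    found = [m.group(1) for line in log_lines for m in TIMESTAMP_RE.finditer(line)]
    # Fixed-width YYYYMMDDHHMMSS strings sort chronologically as text.
    return max(found) if found else None


line = (
    "Failed executing job: processMediaModeration "
    "File:Zentralblatt_der_Bauverwaltung_1885_Seite_347_Fig_3_Längsschnitt.png "
    "timestamp=20210306204315 namespace=6 "
    "title=Zentralblatt_der_Bauverwaltung_1885_Seite_347_Fig_3_Längsschnitt.png "
    "requestId=53bd022e23a988967a9bfb8c"
)
print(latest_timestamp([line]))
```

The returned value is a lower bound on how far the run got, since jobs that succeeded after the last failure leave no trace in the failure logs.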

Thanks @Pchelolo.

@drochford The next run has been kicked off, starting from 20210306204315

The next image to start from is 20210430135017.

We paused running the script today because of the volume of failures. We want to confirm it's still working as it should because there wasn't enough log data. I've opened https://phabricator.wikimedia.org/T287511 for next steps.

Also, the last job run was on timestamp 20210501132444 according to kafkacat.

I just started the job at 20210430135017.

I paused the script just after 20220104105243 because Grafana shows millions of jobs in the queue and I wanted to let it work through those first. When I restart it, I will halve the batch count.

@mepps Can we limit the batch to less than 10,000,000 per month?

10 million per month is the limit which we shouldn't ever need to exceed. We have substantially less than 10 million new images per month added to our platforms.

Hmm, I ran the script starting at 20220104105243 with the batch-count halved and now got a message saying "Script processed all files. Nothing left!".

Does Grafana show any files submitted to photodna at all during the run?

@drochford Yup, it looks like a few hundred thousand have already run.

[Attached screenshot: Screen Shot 2022-03-03 at 11.53.20 AM.png]

Most recent timestamp: 20210430135351

Most recent final timestamp: 20220426131656.

--start=20220428135002

Note: each file run had ' DEBUG: Checking on upload is disabled.', which, looking at the code, seems to imply that it will go no further, i.e. it does not check