
After new wikis are created/imported from Incubator, statistics should be updated
Closed, Resolved (Public)

Description

For at least two recent Wikipedias (kcgwiki and blkwiki), and probably others, the statistics in Special:Statistics and the API remain at zero pages, articles, users, files... (everything except active users and bots) right after the creation/import from Incubator. Apparently some script or process should be run to initialize/update the data; it would be useful if https://wikistats.wmcloud.org/ could show accurate figures right away or from the next midnight update, considering the potential visitors looking for "new stuff". Thanks.
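One way to spot the symptom from outside is the siteinfo statistics API. The following is only a minimal sketch: the helper names (fetch_stats, stat_field, looks_uninitialized) and the sample response are hypothetical, and it assumes curl is available and that the API returns compact JSON.

```shell
#!/bin/sh
# Sketch: detect the "all zeros" symptom via the siteinfo statistics API.
# Assumptions: curl is available; format=json returns compact JSON.

fetch_stats() {
    # $1 = wiki host, e.g. blk.wikipedia.org
    curl -s "https://$1/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json"
}

stat_field() {
    # $1 = JSON text, $2 = field name (pages, articles, edits, images, users, ...)
    printf '%s\n' "$1" | sed -n 's/.*"'"$2"'":\([0-9][0-9]*\).*/\1/p'
}

looks_uninitialized() {
    # Succeeds only when pages, articles and users are all reported as zero.
    for f in pages articles users; do
        [ "$(stat_field "$1" "$f")" = "0" ] || return 1
    done
}

# Hypothetical sample response, shaped like the API output, for illustration only.
sample='{"batchcomplete":"","query":{"statistics":{"pages":0,"articles":0,"edits":0,"images":0,"users":0,"activeusers":3,"admins":0,"jobs":0}}}'
if looks_uninitialized "$sample"; then
    echo "statistics look uninitialized"
fi
```

Against a live wiki the check would be called as `looks_uninitialized "$(fetch_stats blk.wikipedia.org)"`.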

Event Timeline

Ah, yes.. so this should be:

[mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.service

which I have run manually before, or on request. But sometimes we just wait for the next automatic timer run.

The code is:

profile::mediawiki::periodic_job { 'initsitestats':
    command  => '/usr/local/bin/foreachwiki initSiteStats.php --update',
    interval => '*-1,15 05:39',
}
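The interval is a systemd calendar expression: '*-1,15 05:39' means the 1st and 15th of every month at 05:39. As a rough sketch, `systemd-analyze calendar` can expand such expressions and show the next trigger time. The show_schedule helper is hypothetical, and the daily expression is only an assumed example, not taken from any patch.

```shell
#!/bin/sh
# Sketch: expand systemd calendar expressions to see when a timer would fire.
# Assumption: systemd-analyze is installed; otherwise a notice is printed instead.

show_schedule() {
    # $1 = calendar expression as used in the Puppet "interval" parameter
    if command -v systemd-analyze >/dev/null 2>&1; then
        systemd-analyze calendar "$1"
    else
        echo "systemd-analyze not found; cannot expand: $1"
    fi
}

show_schedule '*-1,15 05:39'     # current interval: 1st and 15th of each month
show_schedule '*-*-* 05:39:00'   # hypothetical daily equivalent (assumed)
```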

It would be more efficient to run it only on new wikis and not on foreachwiki ..

Mentioned in SAL (#wikimedia-operations) [2022-08-12T23:38:41Z] <mutante> [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer T315121

Mentioned in SAL (#wikimedia-operations) [2022-08-12T23:41:22Z] <mutante> wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121

The real fix should be that the create wiki script also runs the update command.

/usr/local/bin/mw-cli-wrapper /usr/local/bin/foreachwiki initSiteStats.php --update

or the version of that _just for the new wiki_. Running it for all also was very fast though.
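A per-wiki version could look roughly like the sketch below. It assumes the `mwscript` wrapper is available on the maintenance host and passes the standard `--wiki` option. The function names are hypothetical, and by default the script only prints the commands (dry run).

```shell
#!/bin/sh
# Sketch: refresh site statistics for specific wikis only, instead of foreachwiki.
# Assumptions: `mwscript` is on PATH; DRY_RUN=1 (the default here) only prints commands.

init_stats_cmd() {
    # Print the maintenance command for one wiki database name (e.g. kcgwiki).
    printf 'mwscript initSiteStats.php --wiki=%s --update\n' "$1"
}

update_new_wikis() {
    for wiki in "$@"; do
        cmd=$(init_stats_cmd "$wiki")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            $cmd   # intentionally unquoted so the command is split into words
        fi
    done
}

update_new_wikis kcgwiki blkwiki
```

Setting DRY_RUN=0 would actually execute the commands. This only helps if it runs after the content import, not at wiki creation time.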

It would be more efficient to run it only on new wikis and not on foreachwiki ..

AFAIK that script is confusingly named, as it is also used for updating the site stats (not only initing). See its docs for more details.

The real fix should be that the create wiki script also runs the update command.

/usr/local/bin/mw-cli-wrapper /usr/local/bin/foreachwiki initSiteStats.php --update

or the version of that _just for the new wiki_. Running it for all also was very fast though.

I'm not sure whether that would actually do the trick. The wiki creation script runs while the wiki is still empty. In other words, once addWiki.php completes, all zeros in Special:Statistics are the expected behavior. It breaks a bit later, when the new-wiki importers populate the wiki with some content (which happens a few days after addWiki.php completes).

The importers use regular importing endpoints (action=import API and Special:Import) to populate the wiki with content. WikiImporter class appears to include some code to update site stats, but apparently it doesn't work. Considering addWiki.php runs INSERT INTO site_stats(ss_row_id) VALUES (1) (which results in most columns of the table being NULL), it might be that WikiImporter only updates site stats when the value fields in site_stats are not NULL? Might be worth testing with the next batch of wikis.

Is there a reason not to just run the initsitestats periodic job on a daily basis instead of a biweekly one?

@Zabe I think that's the right question. Probably we should simply run this more often and be done with it. It seemed fast to me when I ran it manually.

This issue happened again with the most recent Wikipedia, pcmwiki (I haven't checked other recent sister projects). @Dzahn has run the script manually again (thanks!).

Change 825424 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/puppet@production] Run the initsitestats period job on a daily basis

https://gerrit.wikimedia.org/r/825424

Change 825424 merged by RLazarus:

[operations/puppet@production] Run the initsitestats period job on a daily basis

https://gerrit.wikimedia.org/r/825424

Thanks! Just another suggestion: I understand from the code that the job runs at 05:39 UTC; could it be moved to around 23:45 UTC (or earlier, if it takes more than 15 minutes to complete) so that the Wikistats midnight update gets the data as fresh as possible, especially for new projects created that day? (Well, another option could be to move the Wikistats update, if @Dzahn agrees...).

@-jem- Yes, I can move either of them or both. I'll take a look later.

Mentioned in SAL (#wikimedia-operations) [2022-08-24T16:26:11Z] <mutante> mwmaint1002 systemctl start mediawiki_job_initsitestats T315121

doing a manual run and checking how long it actually takes

It took ~ 2h25m to finish for all wikis.
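With that duration, the latest start time that still finishes before a given deadline is simple arithmetic. A small sketch (the latest_start helper is hypothetical, and it assumes the run time stays around 2h25m):

```shell
#!/bin/sh
# Sketch: latest start time (UTC) for a job that must finish before a deadline.

latest_start() {
    # $1 = deadline in minutes after 00:00 UTC (1440 = the next midnight)
    # $2 = expected run duration in minutes
    start=$(( ($1 - $2 + 1440) % 1440 ))
    printf '%02d:%02d\n' $(( start / 60 )) $(( start % 60 ))
}

latest_start 1440 145   # 2h25m run before the midnight update: prints 21:35
```

So a full run would have to start by about 21:35 UTC to be done before a midnight Wikistats update.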

Change 826347 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mediwiki/initsitestats: change time of day to run initsitestats

https://gerrit.wikimedia.org/r/826347

@-jem- I uploaded the change above but then realized that I had already spread out the other timers that you want to sync with, by project. So it's currently like this on the other side:

        'wp' : ensure => $ensure, hour => 0;  # Wikipedias
..
        'wt' : ensure => $ensure, hour => 2;  # Wiktionaries
        'ws' : ensure => $ensure, hour => 3;  # Wikisources
        'wn' : ensure => $ensure, hour => 4;  # Wikinews
        'wb' : ensure => $ensure, hour => 5;  # Wikibooks
        'wq' : ensure => $ensure, hour => 6;  # Wikiquotes
..
        'wy' : ensure => $ensure, hour => 10; # Wikivoyage
..
        'wx' : ensure => $ensure, hour => 18; # Wikimedia Special
        'mh' : ensure => $ensure, hour => 18; #

Change 826394 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] wikistats: run updates of WMF-operated wikis earlier in the day

https://gerrit.wikimedia.org/r/826394

I'm assuming the previous time (05:39 UTC) was chosen to be during a relatively low-traffic/low-load time of day. Should that be considered in rescheduling this?

EDIT: Wait a minute... 05:39 UTC is in the late/middle evening in the U.S., so maybe not!

Personally I don't know why 05:39 was chosen. Assumed it was just about spreading out all jobs randomly across the day. Digging for the original commit _might_ reveal something but would require quite some digging I expect.

@Dzahn, thanks. I assume your changes are the best solution that doesn't involve deeper changes of greater magnitude than the problem to be solved. Bot operators like me can just check the update times in Wikistats for each project family and adapt their running times, and Wikistats operators (currently that would be you) can try to add new projects near but before 21 h UTC (but it wouldn't be a big deal if not). Just one detail: you moved all Wikimedia families closer to 0 h UTC, except for Wikiversities...

the original commit _might_ reveal something but would require quite some digging I expect.

05:39 was specified here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/415066/4/modules/mediawiki/manifests/maintenance/initsitestats.pp

@Dcljr Thank you. And since Chad is back as well, I left the question on that Gerrit change from years ago. It might just work :)

Yea, sounds like it was just a random number to avoid running it at the same time with other things: https://gerrit.wikimedia.org/r/c/operations/puppet/+/415066

Change 826394 merged by Dzahn:

[operations/puppet@production] wikistats: run updates of WMF-operated wikis earlier in the day

https://gerrit.wikimedia.org/r/826394

Change 826347 merged by Dzahn:

[operations/puppet@production] mediwiki/initsitestats: change time of day to run initsitestats

https://gerrit.wikimedia.org/r/826347

We are running this daily now and at a different time of day. I think that's as close as we'll get for now. Please reopen if it's still too slow this way.