Page MenuHomePhabricator

Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages
Closed, ResolvedPublic

Description

Special pages that took > 1 hour run script was running on codfw (Note: script is currently at frwiki, so this only includes things alphabetically before frwiki):

bawolff@mwmaint2001:~$ grep 'in.*[0-9]h' /var/log/mediawiki/updateSpecialPages.log | head -50
commonswiki:  ListDuplicatedFiles            [QueryPage] got 3554 rows in 2h 44m 37.65s
commonswiki:  MediaStatistics                [QueryPage] got 20 rows in 1h 22m 7.48s
commonswiki:  Mostcategories                 [QueryPage] got 5000 rows in 7h 34m 9.54s
commonswiki:  Mostlinkedtemplates            [QueryPage] got 5000 rows in 4h 48m 47.34s
commonswiki:  Uncategorizedcategories        [QueryPage] got 154 rows in 3h 51m 43.43s
commonswiki:  Uncategorizedpages             [QueryPage] got 142 rows in 6h 37m 22.52s
commonswiki:  Unusedimages                   [QueryPage] got 5 rows in 2h 15m 47.26s
commonswiki:  Wantedfiles                    [QueryPage] got 5000 rows in 2h 14m 2.39s
commonswiki:  Wantedtemplates                [QueryPage] got 5000 rows in 3h 5m 47.84s
commonswiki:  GadgetUsage                    [QueryPage] got 126 rows in 1h 57m 42.65s
commonswiki:  MostGloballyLinkedFiles        [QueryPage] got 5000 rows in 2h 43m 38.13s
commonswiki:  GloballyWantedFiles            [QueryPage] got 5000 rows in 4h 3m 38.47s
commonswiki:  GloballyUnusedFiles            [QueryPage] got 10 rows in 9h 45.37s
enwiki:  Uncategorizedpages             [QueryPage] got 129 rows in 2h 47.81s

For comparison, from sept 10 log file on mwmaint1001 there are no special pages that took more than an hour. The one's that took more than 10 minutes are:

bawolff@mwmaint1001:~$ grep 'in.*[0-9][0-9]m' /var/log/mediawiki/updateSpecialPages.log | head -50
cebwiki:  Mostcategories                 [QueryPage] got 5000 rows in 12m 55.93s
cebwiki:  Mostlinkedtemplates            [QueryPage] got 4688 rows in 14m 17.10s
cebwiki:  Wantedtemplates                [QueryPage] got 697 rows in 12m 44.98s
commonswiki:  ListDuplicatedFiles            [QueryPage] got 5000 rows in 30m 27.59s
commonswiki:  Mostlinkedtemplates            [QueryPage] got 5000 rows in 37m 12.61s
commonswiki:  Wantedfiles                    [QueryPage] got 5000 rows in 11m 0.70s
commonswiki:  GadgetUsage                    [QueryPage] got 125 rows in 26m 12.58s
commonswiki:  MostGloballyLinkedFiles        [QueryPage] got 5000 rows in 23m 18.99s
commonswiki:  GloballyWantedFiles            [QueryPage] got 5000 rows in 15m 15.49s
dewiki:  GadgetUsage                    [QueryPage] got 56 rows in 14m 0.42s
eswiki:  GadgetUsage                    [QueryPage] got 96 rows in 15m 26.02s
frwiki:  GadgetUsage                    [QueryPage] got 120 rows in 11m 39.93s
ruwiki:  GadgetUsage                    [QueryPage] got 103 rows in 10m 21.76s

As a result, the script seems to be taking > 72 hours, where users expect it to take less than 72 hours (As we run it as a cron job every 3 days).

The obvious difference would be that during switchover the script is hitting db2058 instead of db1121.

WP:VPT thread at https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=863329531#Special:WantedCategories_updates

Event Timeline

That could be the HW difference we have between codfw and eqiad (which we have plans to address) .
Not only db1121 is a lot more powerful than db2058 (db1121 has 512GB of RAM + SSDs disks whereas db2058 has 160GB RAM and HDDs), but db1121 is a dedicated slave for vslow, whereas db2058 has also some main traffic arriving to it, which means it has a bit more load.
We are switching back to eqiad today (10th Oct), so we can check if the script goes back to "normal" runtime values.

So we are back in equiad. Can you try the script and check if it is back to the previous run-time?

Im away this next 1.5 weeks for a conference. I will follow up when i get back

Im away this next 1.5 weeks for a conference. I will follow up when i get back

Great - let us know what you see.
Thanks

Please give us the times on eqiad again so we can verify it was the switch indeed.

A 60x increase in performance shouldn't be explained by hardware differences, but it could be due to cold caches or bad indexes/query plan.

@Bawolff did you have a moment to look into the values?
Thanks!

looks like this is back to normal values:

root@mwmaint1002:~# grep 'in.*[0-9][0-9]m' /var/log/mediawiki/updateSpecialPages.log | head -50
cebwiki:  Mostlinkedtemplates            [QueryPage] got 4703 rows in 17m 57.60s
cebwiki:  Wantedtemplates                [QueryPage] got 706 rows in 12m 6.75s
commonswiki:  ListDuplicatedFiles            [QueryPage] got 1619 rows in 34m 11.39s
commonswiki:  Mostlinkedtemplates            [QueryPage] got 5000 rows in 35m 5.66s
commonswiki:  Wantedfiles                    [QueryPage] got 5000 rows in 11m 43.77s
commonswiki:  GadgetUsage                    [QueryPage] got 126 rows in 25m 29.31s
commonswiki:  MostGloballyLinkedFiles        [QueryPage] got 5000 rows in 23m 48.76s
commonswiki:  GloballyWantedFiles            [QueryPage] got 5000 rows in 17m 8.58s
dewiki:  GadgetUsage                    [QueryPage] got 59 rows in 16m 17.52s
eswiki:  GadgetUsage                    [QueryPage] got 96 rows in 15m 39.45s
frwiki:  GadgetUsage                    [QueryPage] got 120 rows in 13m 24.53s
ruwiki:  GadgetUsage                    [QueryPage] got 103 rows in 12m 11.50s

I think the most probably cause is HW+cold caches and all that jazz as pointed out at T206592#4653743 and T206592#4703593