WMF ParserCache disk space exhaustion
Closed, ResolvedPublic

Description

On Saturday 10 June, ops noticed that disk space usage on the parser cache servers was rapidly increasing. An alert was triggered.

Removing binlogs and running the purge script earlier than normal (it is a weekly cron job) was the temporary solution.

The current theory on why this is happening is that it is due to https://gerrit.wikimedia.org/r/#/c/354504/ . Before that change, on page view, if the parser cache entry was found to be expired, a parse would be done and the result would be saved back to the parser cache with the same key as the old entry, so there was no significant change in space usage. But after that change, an expired entry with the old key would not be deleted or replaced, rather it would be left in place, and a parser cache entry with a new key would be written. So MW is trying to slowly duplicate the parser cache.
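The difference between the two behaviours can be illustrated with a toy model. This is only a sketch, not MediaWiki code: a dictionary stands in for the parser cache table, and the key strings (`page0!oldformat`, `page0!newformat`) are hypothetical and do not match the real key layout.

```python
def reparse_overwrite(cache, old_key, html):
    """Behaviour before the change: the fresh parse replaces the expired entry."""
    cache[old_key] = html


def reparse_new_key(cache, old_key, new_key, html):
    """Behaviour after the change: old_key is left untouched, new_key is written."""
    cache[new_key] = html


# Hypothetical keys; real parser cache keys have a different layout.
pages = range(1000)
before = {f"page{p}!oldformat": "expired html" for p in pages}
after = dict(before)

for p in pages:
    reparse_overwrite(before, f"page{p}!oldformat", "fresh html")
    reparse_new_key(after, f"page{p}!oldformat", f"page{p}!newformat", "fresh html")

print(len(before))  # 1000 rows: no growth
print(len(after))   # 2000 rows: old rows linger until a purge removes them
```

In this model the row count doubles once every page has been viewed again, which matches the "slowly duplicate" description above.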

On July 25, a similar incident happened, this time due to Wikidata demand, apparently: T167784#3473685

Event Timeline

Change 359905 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool pc2004,5,6 after maintenance

https://gerrit.wikimedia.org/r/359905

Change 359905 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool pc2004,5,6 after maintenance

https://gerrit.wikimedia.org/r/359905

Parsercaches have been rebuilt from zero, and in the process of defragmenting, old key-format rows have been deleted (a patch may not be needed anymore).

We have to continue monitoring disk usage, so I have not repooled the old servers (pc1004,5,6) on eqiad. I am keeping the temporary 4TB hosts for now, and we may tune the purge speed and frequency: the purge currently runs weekly, and maybe we should run it daily.

tstarling claimed this task.

That's close enough to resolved for me.

jcrespo claimed this task.

I do not think this is resolved, but ongoing (based on disk space trends).

Change 361656 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Parsercache: Purge rows every day, and reduce TTL to 22 days

https://gerrit.wikimedia.org/r/361656

Change 361659 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] Parsercache: Reduce expiration time to 22 days

https://gerrit.wikimedia.org/r/361659

I do not think this is resolved, but ongoing (based on disk space trends).

The disk usage trend looks fine to me. You'd expect it to refill rapidly at first, and eventually plateau at the old value of ~78%. The first derivative is indeed decreasing, see my graph of it from pc1005:

pc1005-disk-usage-trend.png (776×996 px, 46 KB)

Current disk usage is around 67%.

Note the current active parsercache on eqiad is db1096 (none of the pc* hosts are pooled there) which started throwing disk space warnings on my Sunday: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&from=now-9d&to=now&var-server=db1096&var-network=bond0

The graph you showed me would be normal if those were row insertions. On InnoDB, however, after a row purge, the expectation is that disk size growth would be exactly 0: the tablespace does not shrink on delete/purge, and new rows take the freed space inside the tablespace, at least for some time (eventually, for 7 days). That doesn't seem to be the case, as I mentioned in my previous comment.

A linear interpolation would support your case; the quadratic one I have in mind would say growth stabilizes at 1-1.5%. I am not saying you are wrong, but I think we shouldn't close this ticket for now with the limited amount of data available (I am ok with not changing configuration, either).

I would like to know, however, your thoughts on the cache policy. From what I can see, if there were a larger number of cache misses or any other change in caching pattern (like the one that triggered this issue), there is little we could do to limit disk usage, because the purging algorithm is purely expiration-based. That means the cache is not really size-bound, and could grow indefinitely. I would like to address that so I do not have to stay up looking at the graphs and deleting random rows.
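To make the "not size-bound" point concrete, here is a rough back-of-the-envelope model. It assumes a constant write rate, that every write creates a new row, and that rows are removed only by a purge that runs at a fixed interval; the write rate below is a hypothetical number, not a measurement.

```python
def peak_rows(writes_per_day, ttl_days, purge_interval_days):
    """Upper bound on rows present just before a purge run.

    Rows written during the last ttl_days are still valid, and rows that
    expired since the previous purge run have not been deleted yet.
    """
    return writes_per_day * (ttl_days + purge_interval_days)


WRITES_PER_DAY = 1_000_000  # hypothetical rate, not a measured value

weekly = peak_rows(WRITES_PER_DAY, ttl_days=22, purge_interval_days=7)
daily = peak_rows(WRITES_PER_DAY, ttl_days=22, purge_interval_days=1)
doubled = peak_rows(2 * WRITES_PER_DAY, ttl_days=22, purge_interval_days=1)

print(weekly)   # 29000000
print(daily)    # 23000000
print(doubled)  # 46000000
```

Under these assumptions, purging more often trims the overshoot, but the total still scales linearly with the write rate, so a burst of cache misses grows the tables no matter how the purge is scheduled.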

Change 363375 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Revert parsercaches to pc100[456]

https://gerrit.wikimedia.org/r/363375

Change 361656 abandoned by Jcrespo:
Parsercache: Purge rows every day, and reduce TTL to 22 days

https://gerrit.wikimedia.org/r/361656

Change 361659 abandoned by Jcrespo:
Parsercache: Reduce expiration time to 22 days

https://gerrit.wikimedia.org/r/361659

Change 363375 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Revert parsercaches to pc100[456]

https://gerrit.wikimedia.org/r/363375

Change 363546 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Retire db1096, db1099 and db1101 from the parsercache role

https://gerrit.wikimedia.org/r/363546

Change 363546 merged by Jcrespo:
[operations/puppet@production] mariadb: Retire db1096, db1099 and db1101 from the parsercache role

https://gerrit.wikimedia.org/r/363546

Change 361656 restored by Jcrespo:
Parsercache: Purge rows every day, and reduce TTL to 22 days

https://gerrit.wikimedia.org/r/361656

Change 361659 restored by Jcrespo:
Parsercache: Reduce expiration time to 22 days

https://gerrit.wikimedia.org/r/361659

I would say these^ are back on the table, unless someone can point to another explicit reason why this is happening again.

Hi,

There has been a spike in pc disk usage again, starting yesterday at around 22:30 UTC. Disk space consumption has been increasing for the last month without reaching any point where it has stabilized:
https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=pc1005&var-network=bond0&from=1498384298968&to=1500976298969

There is also a big spike in activity starting around 21:30: https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=18&fullscreen&orgId=1&var-server=pc1005&var-network=bond0&from=1500859448288&to=1500976384499

The servers are again down to only 10% available disk space. This merits some more investigation; any ideas, MediaWiki-Platform-Team or MediaWiki-Parser?

Marostegui raised the priority of this task from High to Unbreak Now!. Jul 25 2017, 10:12 AM

Change 361659 restored by Jcrespo:
Parsercache: Reduce expiration time to 22 days

https://gerrit.wikimedia.org/r/361659

We are going to merge this

Change 361659 merged by jenkins-bot:
[operations/mediawiki-config@master] Parsercache: Reduce expiration time to 22 days

https://gerrit.wikimedia.org/r/361659

Mentioned in SAL (#wikimedia-operations) [2017-07-25T10:28:06Z] <marostegui@tin> Synchronized wmf-config/InitialiseSettings.php: Parsercache: Reduce expiration time to 22 days - T167784 (duration: 00m 44s)

Change 361656 merged by Marostegui:
[operations/puppet@production] Parsercache: Purge rows every day, and reduce TTL to 22 days

https://gerrit.wikimedia.org/r/361656

Mentioned in SAL (#wikimedia-operations) [2017-07-25T10:41:45Z] <marostegui> Run mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 500 from terbium - T167784

So, we have merged both patches and we are running:

/usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 500

From a root screen in terbium called purge_old_rows
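As a quick sanity check, the `--age` value is given in seconds and corresponds to the 22-day expiry set by the patches above:

```python
SECONDS_PER_DAY = 24 * 60 * 60

age_seconds = 1900800  # value passed to --age in the command above

print(age_seconds / SECONDS_PER_DAY)  # 22.0
print(22 * SECONDS_PER_DAY)           # 1900800
```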

Change 367665 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] Parsercache: Purge only certain days

https://gerrit.wikimedia.org/r/367665

Mentioned in SAL (#wikimedia-operations) [2017-07-25T11:13:24Z] <marostegui> Killing old running instances of purgeParserCache.php in terbium - https://phabricator.wikimedia.org/T167784

Change 367666 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] parserCachePurge: Run it every day, not only on Sunday

https://gerrit.wikimedia.org/r/367666

Mentioned in SAL (#wikimedia-operations) [2017-07-25T11:19:59Z] <marostegui> Start a run of "timeout 10h purgeParserCache.php" on terbium, which will be killed at around 21:00 UTC so it doesn't overlap with the normal cron run - T167784

Change 367666 merged by Jcrespo:
[operations/puppet@production] parserCachePurge: Run it every day, not only on Sunday

https://gerrit.wikimedia.org/r/367666

Change 367665 abandoned by Marostegui:
Parsercache: Purge only certain days

https://gerrit.wikimedia.org/r/367665

rows
root@pc1005[parsercache]> pager grep Rows
PAGER set to 'grep Rows'
root@pc1005[parsercache]> SHOW TABLE STATUS\G
           Rows: 1126032
           Rows: 815085
           Rows: 352350
           Rows: 411625
           Rows: 1493095
           Rows: 938984
           Rows: 840167
           Rows: 774002
           Rows: 832863
           Rows: 1223545
           Rows: 345948
           Rows: 912299
           Rows: 1236808
           Rows: 1781421
           Rows: 677777
           Rows: 1063325
           Rows: 965612
           Rows: 570982
           Rows: 670001
           Rows: 4230081
           Rows: 2331150
           Rows: 1556366
           Rows: 374040
           Rows: 667920
           Rows: 392875
           Rows: 1000903
           Rows: 441057
           Rows: 346060
           Rows: 438993
           Rows: 1444701
           Rows: 692031
           Rows: 721226
           Rows: 1009055
           Rows: 453690
           Rows: 801631
           Rows: 949343
           Rows: 857893
           Rows: 377692
           Rows: 1037742
           Rows: 676935
           Rows: 2072163
           Rows: 799386
           Rows: 818681
           Rows: 1001520
           Rows: 2086120
           Rows: 332074
           Rows: 1101020
           Rows: 407665
           Rows: 1772483
           Rows: 321560
           Rows: 659055
           Rows: 332985
           Rows: 1500641
           Rows: 381888
           Rows: 537063
           Rows: 407632
           Rows: 1110617
           Rows: 1073473
           Rows: 670629
           Rows: 593679
           Rows: 463900
           Rows: 1084768
           Rows: 234276
           Rows: 1172763
           Rows: 436552
           Rows: 2101901
           Rows: 1106847
           Rows: 430121
           Rows: 783522
           Rows: 1010388
           Rows: 1370232
           Rows: 1131267
           Rows: 365209
           Rows: 573792
           Rows: 2636082
           Rows: 2028633
           Rows: 1045218
           Rows: 801176
           Rows: 1178066
           Rows: 2358465
           Rows: 329579
           Rows: 1197308
           Rows: 430476
           Rows: 2483185
           Rows: 1038728
           Rows: 1242621
           Rows: 1026945
           Rows: 338962
           Rows: 812463
           Rows: 1806944
           Rows: 734415
           Rows: 1536309
           Rows: 1211917
           Rows: 1630779
           Rows: 1128120
           Rows: 434369
           Rows: 1303530
           Rows: 1215084
           Rows: 360187
           Rows: 2039618
           Rows: 357749
           Rows: 1166865
           Rows: 1537714
           Rows: 970177
           Rows: 337512
           Rows: 803197
           Rows: 1561450
           Rows: 1168858
           Rows: 1301827
           Rows: 398400
           Rows: 555724
           Rows: 495278
           Rows: 1249160
           Rows: 1434203
           Rows: 399983
           Rows: 1446318
           Rows: 1404093
           Rows: 2507350
           Rows: 1190391
           Rows: 1928976
           Rows: 456791
           Rows: 910273
           Rows: 1247370
           Rows: 766381
           Rows: 410579
           Rows: 1262416
           Rows: 1745008
           Rows: 1023674
           Rows: 1129573
           Rows: 1133219
           Rows: 1398691
           Rows: 535843
           Rows: 1039514
           Rows: 1090511
           Rows: 1255897
           Rows: 928832
           Rows: 2173079
           Rows: 491190
           Rows: 463401
           Rows: 352212
           Rows: 1121662
           Rows: 490006
           Rows: 1478361
           Rows: 1238158
           Rows: 1867389
           Rows: 363678
           Rows: 783622
           Rows: 732467
           Rows: 1603246
           Rows: 1299190
           Rows: 831984
           Rows: 847151
           Rows: 634318
           Rows: 989145
           Rows: 1096040
           Rows: 1347750
           Rows: 537056
           Rows: 509348
           Rows: 2224129
           Rows: 1414873
           Rows: 442460
           Rows: 1905923
           Rows: 924366
           Rows: 502743
           Rows: 931154
           Rows: 586016
           Rows: 778271
           Rows: 932656
           Rows: 381173
           Rows: 1009949
           Rows: 885525
           Rows: 897531
           Rows: 963737
           Rows: 1296133
           Rows: 1433933
           Rows: 1717443
           Rows: 492963
           Rows: 481446
           Rows: 949000
           Rows: 430655
           Rows: 1207208
           Rows: 1906627
           Rows: 2110357
           Rows: 1697623
           Rows: 1435945
           Rows: 436058
           Rows: 1505971
           Rows: 1110930
           Rows: 526639
           Rows: 490939
           Rows: 1132939
           Rows: 1280816
           Rows: 1471026
           Rows: 866458
           Rows: 1180057
           Rows: 1154207
           Rows: 1668095
           Rows: 425734
           Rows: 314688
           Rows: 739937
           Rows: 1702937
           Rows: 1434695
           Rows: 1878476
           Rows: 1222519
           Rows: 1866772
           Rows: 2048090
           Rows: 1513810
           Rows: 943930
           Rows: 2212149
           Rows: 377402
           Rows: 1790172
           Rows: 1352803
           Rows: 929897
           Rows: 441228
           Rows: 1804506
           Rows: 803150
           Rows: 1431669
           Rows: 894790
           Rows: 959409
           Rows: 334298
           Rows: 754473
           Rows: 833284
           Rows: 1623071
           Rows: 1246972
           Rows: 852932
           Rows: 1293583
           Rows: 1809673
           Rows: 425520
           Rows: 370726
           Rows: 1310079
           Rows: 363907
           Rows: 448973
           Rows: 1101812
           Rows: 923805
           Rows: 663054
           Rows: 1297840
           Rows: 650935
           Rows: 768965
           Rows: 578667
           Rows: 1006432
           Rows: 531024
           Rows: 1380862
           Rows: 1145024
           Rows: 1402936
           Rows: 1400910
           Rows: 1601636
           Rows: 765128
           Rows: 1429734
           Rows: 1301498
           Rows: 464943
           Rows: 1401316
           Rows: 997156
           Rows: 407395
           Rows: 524855
           Rows: 1176743
           Rows: 1424804
256 rows in set (0.01 sec)
disk size
root@pc1005:/srv/sqldata-cache/parsercache$ ls -lha *.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc000.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc001.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc002.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc003.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc004.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc005.ibd
-rw-rw---- 1 mysql mysql 6.9G Jul 25 11:39 pc006.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc007.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc008.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc009.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc010.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc011.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc012.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc013.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc014.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc015.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc016.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc017.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc018.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc019.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc020.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc021.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc022.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc023.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc024.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc025.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc026.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc027.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc028.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc029.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc030.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc031.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc032.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc033.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc034.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc035.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc036.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc037.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc038.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc039.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc040.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc041.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc042.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc043.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc044.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc045.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc046.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc047.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc048.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc049.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc050.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc051.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc052.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc053.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc054.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc055.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc056.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc057.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc058.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc059.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc060.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc061.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc062.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc063.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc064.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc065.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc066.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc067.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc068.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc069.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc070.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc071.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc072.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc073.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc074.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc075.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc076.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc077.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc078.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc079.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc080.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc081.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc082.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc083.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc084.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc085.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc086.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc087.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc088.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc089.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc090.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc091.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc092.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc093.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc094.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc095.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc096.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc097.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc098.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc099.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc100.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc101.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc102.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc103.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc104.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc105.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc106.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc107.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc108.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc109.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc110.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc111.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc112.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc113.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc114.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc115.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc116.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc117.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc118.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc119.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc120.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc121.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc122.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc123.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc124.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc125.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc126.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc127.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc128.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc129.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc130.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc131.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc132.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc133.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc134.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc135.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc136.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc137.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc138.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc139.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc140.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc141.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc142.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc143.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc144.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc145.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc146.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc147.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc148.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc149.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc150.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc151.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc152.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc153.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc154.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc155.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc156.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc157.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc158.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc159.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc160.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc161.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc162.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc163.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc164.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc165.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc166.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc167.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc168.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc169.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc170.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc171.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc172.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc173.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc174.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc175.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc176.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc177.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc178.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc179.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc180.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc181.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc182.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc183.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc184.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc185.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc186.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc187.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc188.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc189.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc190.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc191.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc192.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc193.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc194.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc195.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc196.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc197.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc198.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc199.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc200.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc201.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc202.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc203.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc204.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc205.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc206.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc207.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc208.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc209.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc210.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc211.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc212.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc213.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc214.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc215.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc216.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc217.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc218.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc219.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc220.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc221.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc222.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc223.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc224.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc225.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc226.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc227.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc228.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc229.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc230.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc231.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc232.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc233.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc234.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc235.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc236.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc237.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc238.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc239.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc240.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc241.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc242.ibd
-rw-rw---- 1 mysql mysql 7.0G Jul 25 11:39 pc243.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc244.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc245.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc246.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc247.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc248.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc249.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc250.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc251.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc252.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc253.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc254.ibd
-rw-rw---- 1 mysql mysql 7.1G Jul 25 11:39 pc255.ibd
jcrespo lowered the priority of this task from Unbreak Now! to High. Jul 25 2017, 12:05 PM

Running ALTER TABLE pc000 ENGINE=InnoDB, LOCK=NONE; allowed us to defragment on the fly, and reduced the file size from 7230980096 to 5561647104 bytes.
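For reference, the saving on pc000 works out to roughly 23%. The sketch below does that arithmetic and also generates the same statement for all 256 tables (pc000 to pc255). Running the statements one at a time is an assumption about how one might script it, not a description of the process actually used.

```python
before_bytes = 7230980096  # pc000.ibd before the ALTER TABLE
after_bytes = 5561647104   # pc000.ibd after the ALTER TABLE

saved = before_bytes - after_bytes
print(saved)                                 # 1669332992
print(round(100 * saved / before_bytes, 1))  # 23.1

# The parsercache database has 256 tables, pc000 to pc255.
statements = [
    f"ALTER TABLE pc{i:03d} ENGINE=InnoDB, LOCK=NONE;" for i in range(256)
]
print(statements[0])   # ALTER TABLE pc000 ENGINE=InnoDB, LOCK=NONE;
print(statements[-1])  # ALTER TABLE pc255 ENGINE=InnoDB, LOCK=NONE;
```

Whether the other tables shrink by a similar fraction depends on how fragmented each one is, so the 23% figure should not be extrapolated blindly.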

I will leave a process defragmenting pc2004 and see the impact/how much time it takes.

This is in no way fixed, but the immediate issues are taken care of. However, a more long-term fix should be done unless we want paging to happen frequently.

I added disk free space, and the derivative of disk free space, to the parser cache dashboard in Grafana: https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1&from=1500783303871&to=1501027037629

You can see that the original event was demand-driven, with the anomalous traffic disappearing by around 10:20 on July 25, an hour before space was freed up by purging. But after purging, the disk usage rate went from ~3GB per hour to ~20 GB per hour.

I had a look at cache-fragmenting parser options on pc1004 parsercache.pc001.

| Parser options | Share of rows |
| --- | --- |
| Only canonical options | 40% |
| Rows with wrapclass | 21% |
| Rows with responsiveimages | 18% |
| Rows with userlang | 8% |
| Rows with zh or sr variant | 6% |

Removing wrapclass from the cache key seems like low-hanging fruit. But it's obviously not a solution to the main problem, which is the lack of either automatic disk space management or a good eviction policy. We really need either an LRU overlay on top of MySQL, controlling eviction, or we need to use some non-MySQL solution.
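To illustrate what a size-bound eviction policy means here, below is a minimal in-memory sketch of an LRU cache limited by total bytes. It is not a proposal for how to layer this on top of MySQL; the class name and its methods are hypothetical.

```python
from collections import OrderedDict


class SizeBoundLRU:
    """Toy LRU cache bounded by the total size of its values in bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Evict least recently used entries until the cache fits again.
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = SizeBoundLRU(max_bytes=10)
cache.set("a", b"aaaa")
cache.set("b", b"bbbb")
cache.get("a")           # "a" is now more recently used than "b"
cache.set("c", b"cccc")  # 12 bytes > 10, so "b" is evicted
print(cache.get("b"))    # None
print(cache.used_bytes)  # 8
```

Unlike expiry-based purging, total size can never exceed `max_bytes` here, whatever the write rate is. The cost is that every read has to update recency information, which is the hard part to do cheaply in a database-backed cache.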

The cache miss spike I mentioned earlier was apparently due to wikidata: 75% of the cache entries written with the relevant expiry time were for wikidatawiki, whereas that wiki is normally a very small percentage. Of those cache entries, 99% had options "!canonical!wb3", whereas normally wikidata cache options are highly fragmented. $wgCacheEpoch was updated for wikidatawiki, about 8 hours before the start of the spike (https://gerrit.wikimedia.org/r/#/c/367391/ for T170668). It's possible that someone ran a bot or crawler to fetch a lot of wikidatawiki pages, either coincidentally after the cache epoch bump, or in an attempt to fix a related problem.

I'm looking at this because disk free space is still trending downwards quite severely, so another action is apparently required. It might help to delete all rows from the parsercache databases with keyname LIKE 'wikidatawiki:%'.
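A deletion like that would presumably be done in small batches to keep transactions short. The sketch below shows the idea against an in-memory SQLite table so that it is self-contained. Only the `keyname` column and the `wikidatawiki:` prefix come from this task; the other columns, the helper name and the batch size are assumptions. On MariaDB the batch could use `DELETE ... LIMIT` directly instead of the `rowid` subquery.

```python
import sqlite3


def delete_by_prefix(conn, table, prefix, batch_size=1000):
    """Delete rows whose keyname starts with prefix, in small batches."""
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE keyname LIKE ? LIMIT ?)",
            (prefix + "%", batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Stand-in table; column names other than keyname are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pc000 (keyname TEXT PRIMARY KEY, value BLOB, exptime TEXT)"
)
rows = [(f"wikidatawiki:pcache:{i}", b"x", "") for i in range(2500)]
rows += [(f"enwiki:pcache:{i}", b"x", "") for i in range(500)]
conn.executemany("INSERT INTO pc000 VALUES (?, ?, ?)", rows)

deleted = delete_by_prefix(conn, "pc000", "wikidatawiki:")
remaining = conn.execute("SELECT COUNT(*) FROM pc000").fetchone()[0]
print(deleted)    # 2500
print(remaining)  # 500
```

Note that on InnoDB deleting rows does not shrink the .ibd files by itself; the freed space only becomes reusable inside the tablespace until the table is rebuilt.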

Change 367853 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Revert "Bump cache epoch for Wikidata"

https://gerrit.wikimedia.org/r/367853

Note that the whole wikidata request rate spike only reduced disk free space from 11% to 9%, so by deleting the relevant rows, we might expect a similar 2% increase in free space. It's not the main culprit for increased disk space usage in the long term; that award apparently goes to wrapclass and responsiveimages, which are cumulatively responsible for ~38% of rows.

@tstarling Do not worry about the current state; we are taking care of that with purging and defragmenting. My worry is for the long term: we need a way, even if hacky, for this to self-regulate so it does not demand our attention every few weeks. Maybe we can set up a local cron job to delete rows based on table size until a better long-term solution is in place.

I think the main issue is when there is something that technically invalidates all or a large part of the rows (format changes), or the spiders asking for every single page on enwiki, forcing a full reparse. Maybe we could find a way to share resources between the parsercache and restbase?

Mentioned in SAL (#wikimedia-operations) [2017-07-26T07:53:44Z] <jynus> start defragmenging on pc1* hosts T167784

Removing wrapclass from the cache key seems like low-hanging fruit.

Recall that it was added to fix T165115 and T165161; any removal should be careful not to rebreak those.

Simply wrapping could easily enough be done post-parse by ParserOutput (in the same place it handles section edit tokens), if that's all it were. But TemplateStyles uses the class name in the embedded stylesheets to scope the styles, so it'd have to be a more involved search-and-replace to handle it correctly.

Or we could change the parser option to a boolean, meaning to either wrap with mw-parser-output or not wrap at all. Then TemplateStyles would just always scope to mw-parser-output and we could easily add the wrapper in ParserOutput. It's currently a string because there was talk about being able to use it for side-by-side display of multiple pages' content for comparisons like diffs or Translate: each "side" of the comparison would use a different wrapping class so TemplateStyles in one wouldn't affect the other.

Hm.. the varying class name is a good use case for keeping it as a vary. However in the current state we're not (yet) concerned about fragmentation resulting from different class names. The default and main page views happen with a class name, and they always use the same (default) class name. The fragmentation is coming from ParsoidBatchAPI/ApiParsoidBatch, TextExtracts/ApiQueryExtracts, and MobileFrontend/ApiMobileView where the wrapper can be disabled.

  • TextExtracts (ApiQueryExtracts): Can be unwrapped at run-time. Styling not a concern.
  • MobileFrontend (ApiMobileView): Should be unwrapped at run-time. The content is rewrapped client-side, which means it's probably not working well right now, given it presumably doesn't prefix TemplateStyles and thus has its styles needlessly leak out of scope.
  • ParsoidBatchAPI (ApiParsoidBatch): Should be unwrapped at run-time ("Parsoid doesn't want the output wrapper"), although presumably it does add one later. Once again, like MobileFrontend, it sounds like it actually needs internal uses of the wrapclass (like TemplateStyles) to have been invoked as if with a wrapper.
  • mediawiki-core (ApiParse): Allowing a different class name here seems sensible for the future use case presented, although I'm not sure we need that right now. Perhaps better to only expose boolean run-time wrap/unwrap at this time.

EDIT: See T171797: Provide post-cache ParserOutput transformations

Change 367853 abandoned by Krinkle:
Revert "Bump cache epoch for Wikidata"

https://gerrit.wikimedia.org/r/367853

After defragmenting (not fully finished, but most of it done), things seem a bit more stable. Purging, however, has not yet finished deleting the 7 weeks of rows. Even if we later decide to increase the expiration time, I think the frequency should be daily, not weekly.

Change 368624 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] parsercache: Retire temporary parsercaches from monitoring

https://gerrit.wikimedia.org/r/368624

Change 368624 merged by Jcrespo:
[operations/puppet@production] parsercache: Retire temporary parsercaches from monitoring

https://gerrit.wikimedia.org/r/368624

This is the one-liner I am thinking of turning into a cron job on the parsercaches so we can forget about this task, even if there are issues again in the future:

for host in pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet; do echo $host...; mysql -BN -h $host parsercache -e "SHOW TABLES" | while read table; do echo $table...; mysql -h $host parsercache -e "set sql_log_bin=0; ALTER TABLE $table ENGINE=InnoDB, LOCK=NONE;"; done; done

To keep defragmenting on a regular basis?

jcrespo claimed this task.

Yes, that is a horrible thing to do, but given the recurring problems, it was the only thing I could think of. There seem to have been no further issues in the last month: https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&panelId=3&fullscreen&orgId=1&from=1500982345670&to=1503574345670

Let's resolve this and propose improvements to parsercache handling and sharding in T133523.

@tstarling, @Anomie after 4 months it is unlikely that the remains of a bug are still here. However, even with a reduced TTL of 21 days instead of 30 and purges every day, the hosts are at an uncomfortable 74% usage. Additionally, the leases for pc1004-6 and pc2004-6 end in December 2018. Should we buy servers with larger disks, or do you think the architecture may have changed significantly by that time? You do not need to answer now, but we need a definitive answer by July 2018.

We could still do T181846: Use post-cache transforms to remove `wrapclass` from the parser cache key to reduce the cache fragmentation. It shouldn't be a lot of work left to do that.

That change is now live.

Looking at the graphs now, I see:

  • Hit rate reached 75% for the first time in a long time.
  • Disk usage didn't reach its usual daily peak.

If the trend keeps going, then over the next few days (up to 2-3 weeks), we'll hopefully start to see a more significant and consistent reduction in disk usage as well.

reduction in disk usage

That is a logical disk reduction; at the filesystem level it requires defragmentation. Ping me (but better on a separate ticket) if you want me to do that at some point.

On a related note, I have asked to buy larger disks in the next purchase (in a few months' time); I do not think fixing things at the code level is worth some extra HDs. I have also proposed buying 4 pc hosts instead of 3 for service redundancy reasons.

Marostegui mentioned this in Unknown Object (Task).Aug 3 2018, 8:02 AM