Page MenuHomePhabricator

Gerrit GC thrashing during branch cut
Closed, ResolvedPublic

Description

I first noticed this happening during the June 30th train.

I've tried string deduplication to lower overall memory usage, and branch cleanup on the gerrit server itself; however, neither have alleviated the problem.

gerrit-gc-2019-09-03.png (509×886 px, 37 KB)

Throwing more hardware at the issue would fix it most likely, but there should be some way to mitigate it in the interim.

Event Timeline

That is definitely started with the branch cut. Can't we raise the HEAP and or throw some more RAM at the machine?

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:26:09Z] <hashar> Gerrit should be fine again, apparently was due to the wmf branch cut taking too much resources (sic) - T231872 filled to investigate

Happened again today apparently, gerrit's been down/unresponsive and its monitoring data has been unavailable since ~12.43

Mentioned in SAL (#wikimedia-operations) [2019-09-10T13:11:20Z] <hashar> Gerrit experimenting difficulty due to ongoing wmf branch cut - T231872

Restarted Gerrit.

gerrit-gc-pause-2019-09-10.png (505×867 px, 23 KB)

42 second(!!!) GC pause at 12:47:28 means the service likely wouldn't have recovered.

On cobalt from lsof -p $(pidof java) (F30290036), I see a LOT of pack files opened:

COMMAND   PID    USER   FD   TYPE             DEVICE   SIZE/OFF      NODE NAME
...
java    10368 gerrit2  505r   REG              253,0    1174218 216008382 /srv/gerrit/git/mediawiki/core.git/objects/pack/pack-21a4e5cb6a9533b0565cf59cf1282181ac4db897.pack
java    10368 gerrit2  506r   REG              253,0       2677  23789870 /srv/gerrit/git/mediawiki/skins/MinervaNeue.git/objects/pack/pack-3a03431737fe31619ca3e417c1b348c739be2807.pack
java    10368 gerrit2  507r   REG              253,0    3959314  22544407 /srv/gerrit/git/mediawiki/extensions/cldr.git/objects/pack/pack-c83ea98d8e971cb4b941fd42d105c853a6d22308.pack
java    10368 gerrit2  508r   REG              253,0      54243 209518637 /srv/gerrit/git/VisualEditor/VisualEditor.git/objects/pack/pack-1a81f2a690fd1ee9ecd78bf940c22488118f1683.pack
java    10368 gerrit2  509r   REG              253,0       6084  22478850 /srv/gerrit/git/mediawiki/extensions/cldr.git/objects/pack/pack-f060e07631f88aaf5f95278ac44517d92cbd66c8.pack
java    10368 gerrit2  510r   REG              253,0      21598  22478865 /srv/gerrit/git/mediawiki/extensions/cldr.git/objects/pack/pack-fb3d19ca6c000577695974fac3a8057f54f12c70.pack
java    10368 gerrit2  511r   REG              253,0      63734 228720668 /srv/gerrit/git/mediawiki/extensions/Wikibase.git/objects/pack/pack-7010654bf9d6f6aed9a4165bfd954e474738a04a.pack
java    10368 gerrit2  512r   REG              253,0      45714  23789758 /srv/gerrit/git/mediawiki/skins/MinervaNeue.git/objects/pack/pack-c53391d4f0dc0d66d796513ea640d23595566450.pack
java    10368 gerrit2  513r   REG              253,0      44481  23789699 /srv/gerrit/git/mediawiki/skins/MinervaNeue.git/objects/pack/pack-47adf4c23b92cc568f2614e0df467f1544e6ba13.pack
java    10368 gerrit2  514r   REG              253,0      21984 222626182 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-23bdb3e781c25d410daf8443305a2eb376a5c33b.pack
java    10368 gerrit2  515r   REG              253,0   12009872  23789619 /srv/gerrit/git/mediawiki/skins/MinervaNeue.git/objects/pack/pack-2d079071d270527642a3963bd95f32f7fdaeba6a.pack
java    10368 gerrit2  516r   REG              253,0      21978 222626139 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-107e4c3f1faa0f8e8f3c4885f90206bdb0347520.pack
java    10368 gerrit2  517r   REG              253,0      39891 222626136 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-5542ab6534bd8b14a126fd0232e174985a732093.pack
java    10368 gerrit2  518r   REG              253,0      39803 222626131 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-11dd62eec2ae54961aa353ca529512ae9413b2b5.pack
java    10368 gerrit2  519r   REG              253,0      43802 222626099 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-4a258daaaf07b3575502320c63968e5a514c78cc.pack
java    10368 gerrit2  520r   REG              253,0     248337 228196645 /srv/gerrit/git/mediawiki/extensions/CirrusSearch.git/objects/pack/pack-157ee708fee44a6d69b8576d532cbb8c364e22ae.pack
java    10368 gerrit2  521r   DIR              253,0       4096 216072251 /srv/gerrit/git/mediawiki/extensions/WebToolsManager.git/refs
java    10368 gerrit2  522r   REG              253,0      27540 213582495 /srv/gerrit/git/operations/puppet.git/objects/pack/pack-d58545b73758f5cbdfc0dcfca6bbdd207defd24d.pack
java    10368 gerrit2  523r   REG              253,0       7896 222625825 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-e277a2d6ca44afe2fbb4f4a8458f4dc4292f0999.pack
java    10368 gerrit2  524r   REG              253,0      11129 222626059 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-174112c2f74dbcf4214d972d4392ab06cfd61385.pack
java    10368 gerrit2  525r   REG              253,0      25595 222626052 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-992ef531d86650c607d70e89344521a8592a98cf.pack
java    10368 gerrit2  526r   REG              253,0       7896 222625817 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-378bf123ed4d1bab4c35c2d298287a86714253c8.pack
java    10368 gerrit2  527r   REG              253,0      67215 222626089 /srv/gerrit/git/mediawiki/extensions/AbuseFilter.git/objects/pack/pack-d80a460c54f45246b28eabb17341cfb4f81f4016.pack
java    10368 gerrit2  528r   REG              253,0    3286278 224722950 /srv/gerrit/git/mediawiki/extensions/ArticlePlaceholder.git/objects/pack/pack-190be94825c1fb546ef6a27ff12004296ae9ce86.pack
java    10368 gerrit2  529r   REG              253,0    2613984 225116187 /srv/gerrit/git/mediawiki/extensions/Babel.git/objects/pack/pack-9b1671f6b4908d29153bf6a177184f6e368f8734.pack
java    10368 gerrit2  530r   REG              253,0      51057 231079972 /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/pack-c1dd456c2c984fcdd9ce7c2616fa79cc23017ef6.pack
java    10368 gerrit2  531r   REG              253,0    2758152 225378313 /srv/gerrit/git/mediawiki/extensions/BetaFeatures.git/objects/pack/pack-c6458cc00a9df63efd7dbbb8cdafb490b511d6a1.pack
java    10368 gerrit2  532r   DIR              253,0       4096 216072302 /srv/gerrit/git/mediawiki/extensions/WebToolsManager.git/refs/meta
java    10368 gerrit2  533r   REG              253,0      35967 231080098 /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/pack-eb8be589724cba1f3c953e6ee6080d48512cddf7.pack
java    10368 gerrit2  534r   REG              253,0      35963 231080088 /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/pack-bbef63114963a008e075ac1d48ec582429d913b9.pack
java    10368 gerrit2  535r   DIR              253,0       4096 216072302 /srv/gerrit/git/mediawiki/extensions/WebToolsManager.git/refs/meta
java    10368 gerrit2  536r   REG              253,0      22046 231080017 /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/pack-b69c1b3382654c38e91d553cbaf4acdae24c5ae5.pack
java    10368 gerrit2  537r   REG              253,0       4253 231079947 /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/pack-6487c0d9fa3fb23f5349d6725f7dc2aa9534f511.pack
...

though they are rather small, they pillup.

Top 20 at the time of the issue, with space added for human consumption.

$ cut -f 7,9 gerritpidof|sort -gr|head -n20
2 068 117 446	/srv/gerrit/git/mediawiki/services/ores/editquality.git/objects/pack/pack-040f862c4b3994aa4ee9f64ffc1a03bbd0b911f1.pack
672 617 588	/srv/gerrit/git/mediawiki/services/ores/editquality.git/objects/pack/pack-e005323c520f90e9d5f7a61f1a93a378b5d2ec3d.pack
489 693 968	/srv/gerrit/git/mediawiki/core.git/objects/pack/pack-6b1e63a3e0e27a0e434fad72f6fea0bd913ad669.pack
313 292 315	/srv/gerrit/git/operations/puppet.git/objects/pack/pack-4006e2dd35757fb7e9d81d7e14bb460911ea4cc6.pack
245 008 370	/srv/gerrit/git/mediawiki/core.git/objects/pack/pack-c147f5cffdd5dd9876ccba295d06e9b2bb15680b.pack
212 897 685	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f.fdt
170 787 746	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.tim
129 765 702	/srv/gerrit/git/mediawiki/extensions/Wikibase.git/objects/pack/pack-0b6c9a176510cd8df8b3f8c7a8d80315e4fd9b6b.pack
128 484 991	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.pos
118 139 986	/srv/gerrit/git/operations/puppet.git/objects/pack/pack-6b1ae1a9d65a732cd12e9edae493255b96eb96e2.pack
112 303 380	/srv/gerrit/git/operations/mediawiki-config.git/objects/pack/pack-479b4177a6c07911124f6a0f6dba1719cf8ac772.pack
111 743 297	/srv/gerrit/git/mediawiki/extensions.git/objects/pack/pack-92594be30e306d406a694103fe275a4a3b9e956f.pack
110 320 267	/srv/gerrit/git/mediawiki/extensions/MobileFrontend.git/objects/pack/pack-97e360360f69ea1ae30543db2cce7523bc37c951.pack
84 871 965	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.doc
81 301 349	/srv/gerrit/git/operations/mediawiki-config.git/objects/pack/pack-fba3fc256e452708dc2f1a79100888c066a5d0eb.pack
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
69 436 603	/srv/gerrit/git/mediawiki/extensions/UniversalLanguageSelector.git/objects/pack/pack-f8754cb289f7b8747fff5e9f8f9476b9a5dfa1ff.pack
68 417 669	/srv/gerrit/git/mediawiki/extensions/VisualEditor.git/objects/pack/pack-c0da11482aab712c03640bca569be1c33e6f738c.pack

mediawiki/services/ores/editquality.git is rather large which might not help.

lsof: F30290036

On cobalt from lsof -p $(pidof java) (F30290036), I see a LOT of pack files opened:

It's pretty typical to have a lot of pack files open. In fact, the number of packfiles opened before restart was lower than after (which is to be expected since there's nothing in memory at startup time):

https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&from=1568079721750&to=1568122921750&var-Application=&var-Window=5m

Top 20 at the time of the issue, with space added for human consumption.

$ cut -f 7,9 gerritpidof|sort -gr|head -n20
2 068 117 446	/srv/gerrit/git/mediawiki/services/ores/editquality.git/objects/pack/pack-040f862c4b3994aa4ee9f64ffc1a03bbd0b911f1.pack
672 617 588	/srv/gerrit/git/mediawiki/services/ores/editquality.git/objects/pack/pack-e005323c520f90e9d5f7a61f1a93a378b5d2ec3d.pack
489 693 968	/srv/gerrit/git/mediawiki/core.git/objects/pack/pack-6b1e63a3e0e27a0e434fad72f6fea0bd913ad669.pack
313 292 315	/srv/gerrit/git/operations/puppet.git/objects/pack/pack-4006e2dd35757fb7e9d81d7e14bb460911ea4cc6.pack
245 008 370	/srv/gerrit/git/mediawiki/core.git/objects/pack/pack-c147f5cffdd5dd9876ccba295d06e9b2bb15680b.pack
212 897 685	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f.fdt
170 787 746	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.tim
129 765 702	/srv/gerrit/git/mediawiki/extensions/Wikibase.git/objects/pack/pack-0b6c9a176510cd8df8b3f8c7a8d80315e4fd9b6b.pack
128 484 991	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.pos
118 139 986	/srv/gerrit/git/operations/puppet.git/objects/pack/pack-6b1ae1a9d65a732cd12e9edae493255b96eb96e2.pack
112 303 380	/srv/gerrit/git/operations/mediawiki-config.git/objects/pack/pack-479b4177a6c07911124f6a0f6dba1719cf8ac772.pack
111 743 297	/srv/gerrit/git/mediawiki/extensions.git/objects/pack/pack-92594be30e306d406a694103fe275a4a3b9e956f.pack
110 320 267	/srv/gerrit/git/mediawiki/extensions/MobileFrontend.git/objects/pack/pack-97e360360f69ea1ae30543db2cce7523bc37c951.pack
84 871 965	/var/lib/gerrit2/review_site/index/changes_0048/closed/_fo2f_Lucene50_0.doc
81 301 349	/srv/gerrit/git/operations/mediawiki-config.git/objects/pack/pack-fba3fc256e452708dc2f1a79100888c066a5d0eb.pack
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
76 177 134	/srv/deployment/gerrit/gerrit-cache/revs/40d88dc46992e015c404295643cbfef15e025496/gerrit.war
69 436 603	/srv/gerrit/git/mediawiki/extensions/UniversalLanguageSelector.git/objects/pack/pack-f8754cb289f7b8747fff5e9f8f9476b9a5dfa1ff.pack
68 417 669	/srv/gerrit/git/mediawiki/extensions/VisualEditor.git/objects/pack/pack-c0da11482aab712c03640bca569be1c33e6f738c.pack

mediawiki/services/ores/editquality.git is rather large which might not help.

That is definitely a large repo. git gc for that repo in particular (on Fridays) takes as much time as all other repos combined.

There is a setting in the gerrit.config: packedGitLimit which is the size of pack files to keep in memory at a given time. It might make sense to increase this so that all the train repos fit in memory at the same time (if that is not currently the case): https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#core.packedGitLimit

It is also the case that this machine needs more memory T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster)

thcipriani triaged this task as Medium priority.Sep 10 2019, 2:52 PM

@thcipriani mentioned high CPU occurring Sep 28 which solved after a reboot:

gerrit-cpu-2019-09-27.png (497×1 px, 40 KB)

This morning (Oct 1st 07:00) I forced a reindex of all changes and after that went to look for CPU usage. The same similar plateau was showing.

Looking at htop:

gerrit_high_cpu_nameless_threads.png (1×960 px, 311 KB)

Note all the thread simply named java. The PID is the thread number and can be mapped in jstack output via nid=0xYYYY (which is hexadecimal, how convenient is that?). Anyway that partly match with:

"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680021000 nid=0x6989 runnable 
"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680023000 nid=0x698a runnable 
"Gang worker#2 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680024800 nid=0x698b runnable 
"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680026800 nid=0x698c runnable 
"Gang worker#4 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680028000 nid=0x698d runnable 
"Gang worker#5 (Parallel GC Threads)" os_prio=0 tid=0x00007fc68002a000 nid=0x698e runnable 
"Gang worker#6 (Parallel GC Threads)" os_prio=0 tid=0x00007fc68002b800 nid=0x698f runnable 
"Gang worker#7 (Parallel GC Threads)" os_prio=0 tid=0x00007fc68002d800 nid=0x6990 runnable 
"Gang worker#8 (Parallel GC Threads)" os_prio=0 tid=0x00007fc68002f000 nid=0x6991 runnable 
"Gang worker#9 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680031000 nid=0x6992 runnable 
"Gang worker#10 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680032800 nid=0x6993 runnable 
"Gang worker#11 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680034800 nid=0x6994 runnable 

"Gang worker#12 (Parallel GC Threads)" os_prio=0 tid=0x00007fc680036000 nid=0x6995 runnable 
"G1 Main Concurrent Mark GC Thread" os_prio=0 tid=0x00007fc680061800 nid=0x69a4 runnable 
"Gang worker#0 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007fc680063800 nid=0x69a5 runnable 
"Gang worker#1 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007fc680065000 nid=0x69a6 runnable 
"Gang worker#2 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007fc680066800 nid=0x69a7 runnable 

"G1 Concurrent Refinement Thread#0" os_prio=0 tid=0x00007fc680050000 nid=0x69a3 runnable 
"G1 Concurrent Refinement Thread#1" os_prio=0 tid=0x00007fc68004e000 nid=0x69a2 runnable 
"G1 Concurrent Refinement Thread#2" os_prio=0 tid=0x00007fc68004c800 nid=0x69a1 runnable 
"G1 Concurrent Refinement Thread#3" os_prio=0 tid=0x00007fc68004a800 nid=0x69a0 runnable 
"G1 Concurrent Refinement Thread#4" os_prio=0 tid=0x00007fc680049000 nid=0x699f runnable 
"G1 Concurrent Refinement Thread#5" os_prio=0 tid=0x00007fc680047000 nid=0x699e runnable 
"G1 Concurrent Refinement Thread#6" os_prio=0 tid=0x00007fc680045800 nid=0x699d runnable 
"G1 Concurrent Refinement Thread#7" os_prio=0 tid=0x00007fc680043800 nid=0x699c runnable 
"G1 Concurrent Refinement Thread#8" os_prio=0 tid=0x00007fc680042000 nid=0x699b runnable 
"G1 Concurrent Refinement Thread#9" os_prio=0 tid=0x00007fc680040000 nid=0x699a runnable 
"G1 Concurrent Refinement Thread#10" os_prio=0 tid=0x00007fc68003e000 nid=0x6999 runnable 
"G1 Concurrent Refinement Thread#11" os_prio=0 tid=0x00007fc68003c800 nid=0x6998 runnable 
"G1 Concurrent Refinement Thread#12" os_prio=0 tid=0x00007fc68003a800 nid=0x6997 runnable 
"G1 Concurrent Refinement Thread#13" os_prio=0 tid=0x00007fc680039000 nid=0x6996 runnable 
"String Deduplication Thread" os_prio=0 tid=0x00007fc681aca800 nid=0x69a8 runnable

There are also a bunch of "C1 CompilerThread" and "C2 CompilerThread", a
"Surrogate Locker Thread (Concurrent GC)".

All those gc related threads started at the same time: 02:40 UTC.

I also note prometheus-node-exporter has threads that started at 02:40 UTC, at least according to htop. But that seems to be an artifact of some sort.

Change 540094 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] gerrit: install openjdk dbg package

https://gerrit.wikimedia.org/r/540094

Change 540094 merged by Dzahn:
[operations/puppet@production] gerrit: install openjdk dbg package

https://gerrit.wikimedia.org/r/540094

Paladox claimed this task.

Going to boldly close this as resolved. It was mentioned on IRC that things look better since we've moved to new hardware with more resources.