Page MenuHomePhabricator

proton experienced a period of high CPU usage, busy queue, lockups
Open, HighPublic

Description

Proton at eqiad just experienced a period of high CPU usage, busy queue, lockups and out of memory.

Period at eqiad started at 16:30 UTC Jan 30 2019 and ended 17:20 UTC Jan 30 2019, so a period of around 50 mins. During the investigation the proton instances in eqiad were restarted a couple of times by @akosiaris to see if the problem would repeat itself or was a transient event. It did repeat itself.

Proton graphs at https://grafana.wikimedia.org/d/000000563/proton?orgId=1&from=1548864354759&to=1548869144641 show the increase in request rates for a4 desktop. At the same time successfully rendered jobs dropped by 66%.

During that period there was a 100% CPU usage for all cores on 1 of the hosts although after restarts the load shifted to the second host. See https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=proton&var-instance=All&from=1548864746695&to=1548869780745 for the patterns.

An increase in memory usage is also evident, with the linux kernel out of memory killer showing up and killing chromium instances.

kern.log:Jan 30 16:16:22 proton1002 kernel: [5424559.684917] Out of memory: Kill process 31932 (chromium) score 393 or sacrifice child
kern.log:Jan 30 16:17:43 proton1002 kernel: [5424641.063894] Out of memory: Kill process 6612 (chromium) score 353 or sacrifice child
kern.log:Jan 30 16:18:49 proton1002 kernel: [5424707.210946] Out of memory: Kill process 8295 (chromium) score 377 or sacrifice child
kern.log:Jan 30 16:24:11 proton1002 kernel: [5425028.894293] Out of memory: Kill process 24208 (chromium) score 316 or sacrifice child
kern.log:Jan 30 16:24:39 proton1002 kernel: [5425056.465943] Out of memory: Kill process 23744 (chromium) score 316 or sacrifice child

The proton logs indicate busy queues

{"name":"proton","hostname":"proton1001","pid":4,"level":30,"levelPath":"info/service-runner","msg":"master(4) initializing 10 workers","time":"2019-01-30T17:15:14.351Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":30,"levelPath":"info","msg":"Worker 14 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:16.013Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":30,"levelPath":"info","msg":"Worker 25 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:17.372Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":30,"levelPath":"info","msg":"Worker 190 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:18.930Z","v":0}

{"name":"proton","hostname":"proton1001","pid":357,"level":30,"levelPath":"info","msg":"Worker 357 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:20.502Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":30,"levelPath":"info","msg":"Worker 576 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:21.897Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":30,"levelPath":"info","msg":"Worker 851 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:23.374Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1089,"level":30,"levelPath":"info","msg":"Worker 1089 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:25.379Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":30,"levelPath":"info","msg":"Worker 1442 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:27.688Z","v":0}
{"name":"proton","hostname":"proton1001","pid":2206,"level":30,"levelPath":"info","msg":"Worker 2206 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:35.179Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":30,"msg":"Render failed. Page not found.","id":"af79e25f-24b2-11e9-8508-abc2bf60aecd|en.wikipedia.org|Nonexistent_title","levelPath":"info/render","time":"2019-01-30T17:15:47.896Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:02.656Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:03.058Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":30,"msg":"Render failed. Page not found.","id":"c098f8c2-24b2-11e9-8d97-1a693972c568|en.wikipedia.org|Nonexistent_title","levelPath":"info/render","time":"2019-01-30T17:16:16.621Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:18.121Z","v":0}
{"name":"proton","hostname":"proton1001","pid":3174,"level":30,"levelPath":"info","msg":"Worker 3174 listening on 0.0.0.0:24766","time":"2019-01-30T17:16:26.289Z","v":0}
{"name":"proton","hostname":"proton1001","pid":4,"level":40,"levelPath":"warn/service-runner","msg":"startup finished","time":"2019-01-30T17:16:26.311Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"a9f5fd0b-24b2-11e9-990f-520db093b7e4|fr.wikipedia.org|Îlotage","waitTime":60189,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.156Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"ab438bac-24b2-11e9-8c66-1947b2cf4250|fr.wikipedia.org|Brexit","waitTime":60223,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.199Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"ab2d1d88-24b2-11e9-afe9-70402a3aa690|fr.wikipedia.org|Valence_(chimie)","waitTime":60400,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.244Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"a9e645a7-24b2-11e9-94c6-fcf58c342633|de.wikisource.org|Die_Gartenlaube_(1877)/Heft_2","waitTime":60912,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.750Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"ab4168d9-24b2-11e9-86b4-e834627a5e62|it.wikipedia.org|Protocollo_di_Kyoto","waitTime":60971,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.806Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"aba051ba-24b2-11e9-bdc6-63528b1085aa|ar.wikipedia.org|صوفي_بسيس","waitTime":60195,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.815Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"abb62397-24b2-11e9-9bed-c02c54ca7f2f|fr.wikipedia.org|Fianarantsoa","waitTime":60329,"levelPath":"warn/queue","time":"2019-01-30T17:16:44.637Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"adcc3258-24b2-11e9-8eaf-40e55e1b0086|de.wikipedia.org|Operatives_Leasing","waitTime":60001,"levelPath":"warn/queue","time":"2019-01-30T17:16:45.095Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"ae07187f-24b2-11e9-8055-130bd3268fba|es.wikipedia.org|Producto_Nacional_Bruto","waitTime":60000,"levelPath":"warn/queue","time":"2019-01-30T17:16:45.554Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"aefcea31-24b2-11e9-8952-d829a75953d4|es.wikipedia.org|Ford_Motor_Company","waitTime":60213,"levelPath":"warn/queue","time":"2019-01-30T17:16:47.692Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"aef60c66-24b2-11e9-a73e-ba611e6a0f5c|en.wikipedia.org|Philopatry","waitTime":60705,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.098Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"af580279-24b2-11e9-9ac4-15b835dfe43f|sa.wikisource.org|आपस्तम्ब-धर्मसूत्रम्","waitTime":60592,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.329Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"afc5b874-24b2-11e9-bb13-e782c2b8377d|fr.wikipedia.org|Activité_physique","waitTime":60027,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.433Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"b15c8518-24b2-11e9-8bbd-383afbe7a167|en.wikibooks.org|Practical_Electronics/PCB_Layout","waitTime":60866,"levelPath":"warn/queue","time":"2019-01-30T17:16:51.949Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b2b5ac75-24b2-11e9-9fd7-6db4b4788f51|ckb.wikipedia.org|حەپسەخانی_نەقیب","waitTime":60027,"levelPath":"warn/queue","time":"2019-01-30T17:16:57.126Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"b4e18d2f-24b2-11e9-b6a5-b4cc022c9859|en.wikipedia.org|Vasconic_substratum_theory","waitTime":60237,"levelPath":"warn/queue","time":"2019-01-30T17:16:57.675Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b4d07626-24b2-11e9-82d3-cb5d8aa24dc7|de.wikipedia.org|Oxford_3000","waitTime":62052,"levelPath":"warn/queue","time":"2019-01-30T17:16:59.048Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"b48bf312-24b2-11e9-8589-3d59965fd234|de.wikipedia.org|Blasenkirschen","waitTime":62790,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.121Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"b46fb880-24b2-11e9-84c8-b9245888acb7|de.wikipedia.org|Sonett","waitTime":62061,"levelPath":"warn/queue","time":"2019-01-30T17:16:59.596Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"b618fbb3-24b2-11e9-9ac7-95febb7c5e83|fr.wikipedia.org|Prieuré_de_Sion","waitTime":60551,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.139Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"b2335d0f-24b2-11e9-a7e3-05a806b1a6e3|es.wikipedia.org|Liniers_(historietista)","waitTime":62598,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.139Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"b6aab46a-24b2-11e9-b3bb-259d5e98ef49|fr.wikipedia.org|Référendum_sur_l'appartenance_du_Royaume-Uni_à_l'Union_européenne","waitTime":60286,"levelPath":"warn/queue","time":"2019-01-30T17:17:02.088Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b7c2b417-24b2-11e9-8e37-4c5731fca605|de.wikipedia.org|Härte","waitTime":60562,"levelPath":"warn/queue","time":"2019-01-30T17:17:02.693Z","v":0}

The behavior finally stopped and the problem was resolved on its own, without any corrective action being taken, meaning that this will probably show up again

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 30 2019, 5:43 PM
akosiaris triaged this task as High priority.Jan 30 2019, 5:43 PM
akosiaris updated the task description. (Show Details)Jan 30 2019, 5:46 PM
ovasileva moved this task from Triage to Backlog on the Proton board.Feb 21 2019, 10:14 AM

Mentioned in SAL (#wikimedia-operations) [2019-04-24T17:23:46Z] <mutante> proton1001 - restarting proton service - low RAM caused facter/puppet fails (https://tickets.puppetlabs.com/browse/PUP-8048) freed memory and fixed puppet run (cc: T219456 T214975)

jcrespo added a subscriber: jcrespo.May 8 2019, 4:08 PM

I think it overloaded again today (times are CEST):

[15:58:46] <icinga-wm>	 PROBLEM - puppet last run on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[15:59:02] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[15:59:12] <icinga-wm>	 PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[15:59:34] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[15:59:38] <icinga-wm>	 PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[15:59:44] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:00:48] <icinga-wm>	 PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[16:00:54] <icinga-wm>	 PROBLEM - dhclient process on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[16:01:00] <icinga-wm>	 PROBLEM - Disk space on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[16:01:16] <icinga-wm>	 PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack

[16:01:18] <jynus>	 what is the issue with proton1001?
44) (owner: 10Jbond)

[16:05:54] <jynus>	 I think it overloaded
root@proton1001:~$ uptime
 16:07:52 up 41 days, 18:07,  1 user,  load average: 1.60, 11.25, 18.19
Joe added a subscriber: Joe.May 9 2019, 7:07 AM

So what happened is that the server ran out of available memory and OOM'd.

See

https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=4&fullscreen&orgId=1&var-server=proton1001&var-datasource=eqiad%20prometheus%2Fops&from=1557301813966&to=1557343447171

the memory usage went down when I restarted the service and peaked again.

One stopgap could be to reduce the memory available to proton and restart it always via systemd.

Now it says proton1001: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues

Mentioned in SAL (#wikimedia-operations) [2019-05-09T10:05:48Z] <akosiaris> restart proton on proton1001. Host Out of memory T214975

Tgr added a subscriber: Tgr.May 10 2019, 2:43 AM

Maybe related to T213362#5171787? That was deployed on April 11, and this bug is somehow related to not properly killing threads that hang....
Then again, Puppeteer might just produce such hung threads with some constant low rate for all we know.
Anyway will deploy the assumed fix (se T217724) next week.

Tgr added a comment.May 10 2019, 2:53 AM

OTOH looking at the graphs I don't see any obvious correlation with the deploy. CPU load / network traffic / number of requests correlate well with each other, but not so much with the memory use, which has huge sudden spikes during high load, except the last such spikes happened under normal load...