Page MenuHomePhabricator

proton experienced a period of high CPU usage, busy queue, lockups
Open, HighPublic

Description

Proton at eqiad just experienced a period of high CPU usage, busy queue, lockups and out of memory.

Period at eqiad started at 16:30 UTC Jan 30 2019 and ended 17:20 UTC Jan 30 2019, so a period of around 50 mins. During the investigation the proton instances in eqiad were restarted a couple of times by @akosiaris to see if the problem would repeat itself or was a transient event. It did repeat itself.

Proton graphs at https://grafana.wikimedia.org/d/000000563/proton?orgId=1&from=1548864354759&to=1548869144641 show the increase in request rates for a4 desktop. At the same time successfully rendered jobs dropped by 66%.

During that period there was a 100% CPU usage for all cores on 1 of the hosts although after restarts the load shifted to the second host. See https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=proton&var-instance=All&from=1548864746695&to=1548869780745 for the patterns.

An increase in memory usage is also evident, with the linux kernel out of memory killer showing up and killing chromium instances.

kern.log:Jan 30 16:16:22 proton1002 kernel: [5424559.684917] Out of memory: Kill process 31932 (chromium) score 393 or sacrifice child
kern.log:Jan 30 16:17:43 proton1002 kernel: [5424641.063894] Out of memory: Kill process 6612 (chromium) score 353 or sacrifice child
kern.log:Jan 30 16:18:49 proton1002 kernel: [5424707.210946] Out of memory: Kill process 8295 (chromium) score 377 or sacrifice child
kern.log:Jan 30 16:24:11 proton1002 kernel: [5425028.894293] Out of memory: Kill process 24208 (chromium) score 316 or sacrifice child
kern.log:Jan 30 16:24:39 proton1002 kernel: [5425056.465943] Out of memory: Kill process 23744 (chromium) score 316 or sacrifice child

The proton logs indicate busy queues

{"name":"proton","hostname":"proton1001","pid":4,"level":30,"levelPath":"info/service-runner","msg":"master(4) initializing 10 workers","time":"2019-01-30T17:15:14.351Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":30,"levelPath":"info","msg":"Worker 14 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:16.013Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":30,"levelPath":"info","msg":"Worker 25 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:17.372Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":30,"levelPath":"info","msg":"Worker 190 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:18.930Z","v":0}

{"name":"proton","hostname":"proton1001","pid":357,"level":30,"levelPath":"info","msg":"Worker 357 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:20.502Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":30,"levelPath":"info","msg":"Worker 576 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:21.897Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":30,"levelPath":"info","msg":"Worker 851 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:23.374Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1089,"level":30,"levelPath":"info","msg":"Worker 1089 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:25.379Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":30,"levelPath":"info","msg":"Worker 1442 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:27.688Z","v":0}
{"name":"proton","hostname":"proton1001","pid":2206,"level":30,"levelPath":"info","msg":"Worker 2206 listening on 0.0.0.0:24766","time":"2019-01-30T17:15:35.179Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":30,"msg":"Render failed. Page not found.","id":"af79e25f-24b2-11e9-8508-abc2bf60aecd|en.wikipedia.org|Nonexistent_title","levelPath":"info/render","time":"2019-01-30T17:15:47.896Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:02.656Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:03.058Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":30,"msg":"Render failed. Page not found.","id":"c098f8c2-24b2-11e9-8d97-1a693972c568|en.wikipedia.org|Nonexistent_title","levelPath":"info/render","time":"2019-01-30T17:16:16.621Z","v":0}
{"name":"proton","hostname":"proton1001","pid":851,"level":50,"msg":"Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded","trace":"TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded\n    at Promise.then (/srv/deployment/proton/deploy-cache/revs/ff7c8a2f8a09ae0893e27f2b25df65edff7a9b32/node_modules/puppeteer-core/node6/lib/FrameManager.js:1984:21)","levelPath":"error/request","time":"2019-01-30T17:16:18.121Z","v":0}
{"name":"proton","hostname":"proton1001","pid":3174,"level":30,"levelPath":"info","msg":"Worker 3174 listening on 0.0.0.0:24766","time":"2019-01-30T17:16:26.289Z","v":0}
{"name":"proton","hostname":"proton1001","pid":4,"level":40,"levelPath":"warn/service-runner","msg":"startup finished","time":"2019-01-30T17:16:26.311Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"a9f5fd0b-24b2-11e9-990f-520db093b7e4|fr.wikipedia.org|Îlotage","waitTime":60189,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.156Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"ab438bac-24b2-11e9-8c66-1947b2cf4250|fr.wikipedia.org|Brexit","waitTime":60223,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.199Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"ab2d1d88-24b2-11e9-afe9-70402a3aa690|fr.wikipedia.org|Valence_(chimie)","waitTime":60400,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.244Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"a9e645a7-24b2-11e9-94c6-fcf58c342633|de.wikisource.org|Die_Gartenlaube_(1877)/Heft_2","waitTime":60912,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.750Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"ab4168d9-24b2-11e9-86b4-e834627a5e62|it.wikipedia.org|Protocollo_di_Kyoto","waitTime":60971,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.806Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"aba051ba-24b2-11e9-bdc6-63528b1085aa|ar.wikipedia.org|صوفي_بسيس","waitTime":60195,"levelPath":"warn/queue","time":"2019-01-30T17:16:41.815Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"abb62397-24b2-11e9-9bed-c02c54ca7f2f|fr.wikipedia.org|Fianarantsoa","waitTime":60329,"levelPath":"warn/queue","time":"2019-01-30T17:16:44.637Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"adcc3258-24b2-11e9-8eaf-40e55e1b0086|de.wikipedia.org|Operatives_Leasing","waitTime":60001,"levelPath":"warn/queue","time":"2019-01-30T17:16:45.095Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"ae07187f-24b2-11e9-8055-130bd3268fba|es.wikipedia.org|Producto_Nacional_Bruto","waitTime":60000,"levelPath":"warn/queue","time":"2019-01-30T17:16:45.554Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"aefcea31-24b2-11e9-8952-d829a75953d4|es.wikipedia.org|Ford_Motor_Company","waitTime":60213,"levelPath":"warn/queue","time":"2019-01-30T17:16:47.692Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"aef60c66-24b2-11e9-a73e-ba611e6a0f5c|en.wikipedia.org|Philopatry","waitTime":60705,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.098Z","v":0}
{"name":"proton","hostname":"proton1001","pid":1442,"level":40,"msg":"Queue is still busy.","id":"af580279-24b2-11e9-9ac4-15b835dfe43f|sa.wikisource.org|आपस्तम्ब-धर्मसूत्रम्","waitTime":60592,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.329Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"afc5b874-24b2-11e9-bb13-e782c2b8377d|fr.wikipedia.org|Activité_physique","waitTime":60027,"levelPath":"warn/queue","time":"2019-01-30T17:16:48.433Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"b15c8518-24b2-11e9-8bbd-383afbe7a167|en.wikibooks.org|Practical_Electronics/PCB_Layout","waitTime":60866,"levelPath":"warn/queue","time":"2019-01-30T17:16:51.949Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b2b5ac75-24b2-11e9-9fd7-6db4b4788f51|ckb.wikipedia.org|حەپسەخانی_نەقیب","waitTime":60027,"levelPath":"warn/queue","time":"2019-01-30T17:16:57.126Z","v":0}
{"name":"proton","hostname":"proton1001","pid":14,"level":40,"msg":"Queue is still busy.","id":"b4e18d2f-24b2-11e9-b6a5-b4cc022c9859|en.wikipedia.org|Vasconic_substratum_theory","waitTime":60237,"levelPath":"warn/queue","time":"2019-01-30T17:16:57.675Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b4d07626-24b2-11e9-82d3-cb5d8aa24dc7|de.wikipedia.org|Oxford_3000","waitTime":62052,"levelPath":"warn/queue","time":"2019-01-30T17:16:59.048Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"b48bf312-24b2-11e9-8589-3d59965fd234|de.wikipedia.org|Blasenkirschen","waitTime":62790,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.121Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"b46fb880-24b2-11e9-84c8-b9245888acb7|de.wikipedia.org|Sonett","waitTime":62061,"levelPath":"warn/queue","time":"2019-01-30T17:16:59.596Z","v":0}
{"name":"proton","hostname":"proton1001","pid":357,"level":40,"msg":"Queue is still busy.","id":"b618fbb3-24b2-11e9-9ac7-95febb7c5e83|fr.wikipedia.org|Prieuré_de_Sion","waitTime":60551,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.139Z","v":0}
{"name":"proton","hostname":"proton1001","pid":576,"level":40,"msg":"Queue is still busy.","id":"b2335d0f-24b2-11e9-a7e3-05a806b1a6e3|es.wikipedia.org|Liniers_(historietista)","waitTime":62598,"levelPath":"warn/queue","time":"2019-01-30T17:17:00.139Z","v":0}
{"name":"proton","hostname":"proton1001","pid":25,"level":40,"msg":"Queue is still busy.","id":"b6aab46a-24b2-11e9-b3bb-259d5e98ef49|fr.wikipedia.org|Référendum_sur_l'appartenance_du_Royaume-Uni_à_l'Union_européenne","waitTime":60286,"levelPath":"warn/queue","time":"2019-01-30T17:17:02.088Z","v":0}
{"name":"proton","hostname":"proton1001","pid":190,"level":40,"msg":"Queue is still busy.","id":"b7c2b417-24b2-11e9-8e37-4c5731fca605|de.wikipedia.org|Härte","waitTime":60562,"levelPath":"warn/queue","time":"2019-01-30T17:17:02.693Z","v":0}

The behavior finally stopped and the problem was resolved on its own, without any corrective action being taken, meaning that this will probably show up again