ORES overload incident, 2017-11-28
Closed, Resolved, Public

Assigned To

None

Authored By

	awight
	Nov 28 2017, 6:01 PM

Description

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20171128-ORES

scb1001 and scb1002 are unable to keep up with a long barrage of extra scoring requests.

https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1511881217180&to=1511891957181

Strange log message seen on scb1001:

OSError: write error

Also concerning: that error was written at INFO severity. See P6389.

Details

Subject	Repo	Branch	Lines +/-
Increase ORES queue_maxsize by 20%	operations/puppet	production	+1 -0
Increase celery verbosity; use message format including timestamp	mediawiki/services/ores/deploy	master	+2 -1
ORES: lower celery concurrency for scb100{1,2}	operations/puppet	production	+2 -0
ORES: Fix $wgOresBaseUrl	operations/mediawiki-config	master	+1 -1
ORES: Use the internal discovery URL	operations/mediawiki-config	master	+1 -1
Failover ORES to codfw	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Resolved	None	T181538 ORES overload incident, 2017-11-28
Resolved	akosiaris	T181544 Investigate scb1001 and scb1002 available memory graphs in Grafana
Resolved	Ladsgroup	T181546 Let the ORES application set log severity, not uWSGI
Resolved	Halfak	T181563 Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*
Resolved	awight	T181567 Rate limit thresholds requests when the service is down
Declined	None	T182256 Clean up ORES thresholds cache: pre-emptively check before expiry
Resolved	Ladsgroup	T181621 What is causing ORES celery workers to suddenly require more CPU?
Resolved	Ladsgroup	T181630 Send celery and wsgi service logs to logstash
Resolved	Ladsgroup	T181632 Celery manager implodes horribly if Redis goes down
Resolved	Ladsgroup	T181559 Investigate redis-cluster or other techniques for making Redis not a single point of failure.
Declined	None	T122676 Implement sentinel for ORES production Redis
Resolved	Halfak	T167149 Test if ORES celery can use the unix socket
Resolved	Ladsgroup	T196889 Investigate what is creating Redis transactions and whether it can be fixed
Declined	None	T210577 Build a test setup for redis sentinel in cloud VPS
Resolved	Ladsgroup	T210579 Add support for redis-sentinel in score cache
Invalid	None	T210580 Write puppet for redis-sentinel
Declined	None	T210582 New node request: oresrdb[12]003
Declined	None	T210605 Run a test failover in labs before migrating prod to sentinel
Resolved	awight	T181634 Investigate overload condition, seems that we lose nodes
Resolved	awight	T181795 Create an incident report for ORES overload incident 2017

Event Timeline

awight created this task. Nov 28 2017, 6:01 PM

Restricted Application added a subscriber: Aklapper. Nov 28 2017, 6:01 PM

awight updated the task description. Nov 28 2017, 6:03 PM

We saw very different available memory levels when running top directly on scb100[1-2] compared with the ORES Grafana dashboard, which never showed a dip below c. 20GB. This needs to be fixed.
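
One way to cross-check a dashboard against the host is to read the kernel's own figures. A minimal Python sketch, assuming a Linux host exposing `/proc/meminfo`; the function names and the sample values are hypothetical, and this does not claim to reproduce whatever metric the Grafana panel uses:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_gib(meminfo):
    """Return MemAvailable in GiB, or None if the kernel does not report it."""
    kb = meminfo.get("MemAvailable")
    return None if kb is None else kb / (1024 * 1024)


# Made-up sample values for illustration, NOT taken from scb1001/scb1002.
SAMPLE = """MemTotal:       32943180 kB
MemFree:          524288 kB
MemAvailable:    2097152 kB
Buffers:          102400 kB
Cached:          1048576 kB
"""
```

On a real host, something like `parse_meminfo(open("/proc/meminfo").read())` would feed the same functions, and the result could be compared with the dashboard value for the same minute.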

Change 393820 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Failover ORES to codfw

https://gerrit.wikimedia.org/r/393820

Change 393820 merged by Alexandros Kosiaris:
[operations/puppet@production] Failover ORES to codfw

https://gerrit.wikimedia.org/r/393820

Mentioned in SAL (#wikimedia-operations) [2017-11-28T18:32:50Z] <akosiaris> force puppet run on cache::misc boxes T181538

Strange memory behavior on scb1001 and scb1002 for the last week: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511289213149&to=1511894013149

scb1001 and scb1002 had OOMs show up.

[12:04:04] <akosiaris> now on scb1001 OOM showed up
...
[12:05:32] <akosiaris> and scb1002
[12:05:41] <akosiaris> both had OOM show up

We had a sudden increase in requests/min for ORES around 1600 UTC. However, we saw bigger spikes around 1300 UTC that did not cause timeouts or memory issues.

External requests
Overload errors

We've failed over to CODFW.

awight created subtask T181544: Investigate scb1001 and scb1002 available memory graphs in Grafana. Nov 28 2017, 6:42 PM

greg added a project: Wikimedia-Incident. Nov 28 2017, 6:45 PM

greg moved this task from Active investigation to Active Situation on the Wikimedia-Incident board.

awight created subtask T181546: Let the ORES application set log severity, not uWSGI. Nov 28 2017, 6:49 PM

From https://grafana.wikimedia.org/dashboard/db/ores?panelId=14&fullscreen&orgId=1&from=1511872559429&to=1511894099429, we can see that the memory consumption of web workers on scb1001/scb1002 begins to fall shortly after 1600 UTC. Could this be due to OOM killing? There's no spike in ORES memory usage before this process begins. @akosiaris has offered to note the OOM event timestamps.

Looks like we've been hitting memory limits for quite a while, at least since Oct 26th:
https://logstash.wikimedia.org/goto/4e642cc6677ef824ec3397a507249637

All nodes in CODFW just went down at the same time, for a short period. See:

[13:42:38] <mutante> https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2001.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw
[13:42:49] <mutante> there you can see it, network just stops
[13:44:01] <mutante> and on the other server https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2002.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw

My hypothesis is that our redis nodes had a network blip and that caused all scoring requests to back up for a period.

awight created subtask T181559: Investigate redis-cluster or other techniques for making Redis not a single point of failure. Nov 28 2017, 7:52 PM

@Dzahn found a clue to the latest *codfw* outage, in which oresrdb Redis network traffic spikes and then crashes to zero:

[3:13pm] mutante: "1514 [510] 28 Nov 19:17:31.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[3:13pm] mutante: ^ this happened shortly before the outage
[3:14pm] mutante: only shows up on 2001, not on 2002

Halfak mentioned this in T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*. Nov 28 2017, 8:30 PM

Halfak created subtask T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*.

Change 393924 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] ORES: Use the internal discovery URL

https://gerrit.wikimedia.org/r/393924

Change 393924 merged by Alexandros Kosiaris:
[operations/mediawiki-config@master] ORES: Use the internal discovery URL

https://gerrit.wikimedia.org/r/393924

Mentioned in SAL (#wikimedia-operations) [2017-11-28T22:31:34Z] <akosiaris> deploy wmf-config/CommonSettings.php for ORES internal discovery URL, https://gerrit.wikimedia.org/r/#/c/393924/ T181538

Change 393930 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] ORES: Fix $wgOresBaseUrl

https://gerrit.wikimedia.org/r/393930

Change 393930 merged by Alexandros Kosiaris:
[operations/mediawiki-config@master] ORES: Fix $wgOresBaseUrl

https://gerrit.wikimedia.org/r/393930

OK. The dominant hypothesis is that we are DoS-ing ourselves via the ORES extension. When we accidentally broke $wgOresBaseUrl, the service returned to normal for a period.

Mentioned in SAL (#wikimedia-operations) [2017-11-28T23:07:44Z] <akosiaris@tin> Synchronized wmf-config/CommonSettings.php: T181538 (duration: 00m 49s)

BTW, see T181567, where we initially describe the correlation between failed "test_stats" requests from MediaWiki and the downtime events.

Halfak added a subtask: T181567: Rate limit thresholds requests when the service is down. Nov 28 2017, 11:08 PM

Halfak closed subtask T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* as Resolved.

It looks like we have causal evidence that MW is hammering ORES into the ground. Now that $wgOresBaseUrl is fixed, ORES is again overloaded and suffering.

This graph shows the whole period during which ORES was getting hammered with "test_stats" requests, including the 35-minute period when $wgOresBaseUrl was broken: https://logstash.wikimedia.org/goto/91568910c65b23afcbfab5f15120c7e1

This graph shows that, during that time period, ORES was not overloaded: https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-6h&to=now-1m
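
T181567 proposes rate limiting the thresholds requests while the service is down. The general idea can be sketched as a client-side circuit breaker. This is a minimal illustration in Python, not the fix that was actually deployed and not the extension's PHP code; `CircuitBreaker`, `max_failures` and `cooldown` are hypothetical names and values:

```python
import time


class CircuitBreaker:
    """Stop calling a failing service for `cooldown` seconds after
    `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def allow(self):
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker again, but keep the
            # failure count high so a single new failure re-opens it.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record_success(self):
        """Reset the breaker after a successful request."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failed request and open the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each request and skip the request (serving a cached or empty result) while it returns False, so a struggling backend is not hit again on every page view.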

Adding OOM kernel logs per host for posterity's sake.

Feel free to ignore electron in the logs. It's already being memory limited due to MemoryLimit=2G in its configuration. See https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/pdfrender/templates/initscripts/pdfrender.systemd.erb;3b73a1d95bbfbe8eddc095b9c06713aee87a4042$10

===== NODE GROUP =====                                                                                                                      
(1) scb2005.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:16:06 scb2005 kernel: [1818194.621901] Out of memory: Kill process 11579 (celery) score 53 or sacrifice child                     
Nov 28 19:16:09 scb2005 kernel: [1818195.042352] Out of memory: Kill process 9721 (celery) score 51 or sacrifice child                      
Nov 28 19:16:56 scb2005 kernel: [1818242.388117] Out of memory: Kill process 10231 (celery) score 51 or sacrifice child
Nov 28 19:17:46 scb2005 kernel: [1818291.854272] Out of memory: Kill process 10557 (celery) score 54 or sacrifice child
Nov 28 19:18:25 scb2005 kernel: [1818329.970783] Out of memory: Kill process 11004 (celery) score 52 or sacrifice child
Nov 28 19:19:16 scb2005 kernel: [1818384.372686] Out of memory: Kill process 11189 (celery) score 53 or sacrifice child
Nov 28 19:19:43 scb2005 kernel: [1818410.753576] Out of memory: Kill process 11306 (celery) score 53 or sacrifice child
Nov 28 19:19:48 scb2005 kernel: [1818416.323430] Out of memory: Kill process 11461 (celery) score 53 or sacrifice child
Nov 28 19:20:18 scb2005 kernel: [1818444.178684] Out of memory: Kill process 10920 (celery) score 53 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2004.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:14:44 scb2004 kernel: [1818432.259586] Out of memory: Kill process 14164 (celery) score 51 or sacrifice child                     
Nov 28 19:15:08 scb2004 kernel: [1818458.900757] Out of memory: Kill process 14661 (celery) score 56 or sacrifice child                     
Nov 28 19:15:33 scb2004 kernel: [1818484.900006] Out of memory: Kill process 15195 (celery) score 51 or sacrifice child
Nov 28 19:15:53 scb2004 kernel: [1818503.443654] Out of memory: Kill process 14372 (celery) score 56 or sacrifice child
Nov 28 19:16:19 scb2004 kernel: [1818529.093477] Out of memory: Kill process 15020 (celery) score 51 or sacrifice child
Nov 28 19:17:09 scb2004 kernel: [1818579.232931] Out of memory: Kill process 14186 (celery) score 54 or sacrifice child
Nov 28 20:16:23 scb2004 kernel: [1822131.567736] Out of memory: Kill process 8118 (celery) score 53 or sacrifice child
Nov 28 20:16:50 scb2004 kernel: [1822160.431118] Out of memory: Kill process 8443 (celery) score 53 or sacrifice child
Nov 28 20:17:18 scb2004 kernel: [1822188.640353] Out of memory: Kill process 8227 (celery) score 52 or sacrifice child
Nov 28 20:18:15 scb2004 kernel: [1822242.896041] Out of memory: Kill process 8082 (celery) score 51 or sacrifice child
Nov 28 20:18:45 scb2004 kernel: [1822274.719252] Out of memory: Kill process 8878 (celery) score 52 or sacrifice child
Nov 28 20:19:15 scb2004 kernel: [1822306.235702] Out of memory: Kill process 8330 (celery) score 50 or sacrifice child
Nov 28 20:19:50 scb2004 kernel: [1822341.461230] Out of memory: Kill process 8291 (celery) score 49 or sacrifice child
Nov 28 20:20:33 scb2004 kernel: [1822383.418588] Out of memory: Kill process 8345 (celery) score 50 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2006.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:18:16 scb2006 kernel: [1817966.004881] Out of memory: Kill process 27002 (celery) score 52 or sacrifice child                     
Nov 28 20:15:39 scb2006 kernel: [1821411.106727] Out of memory: Kill process 32070 (celery) score 62 or sacrifice child                     
Nov 28 20:16:09 scb2006 kernel: [1821440.560616] Out of memory: Kill process 31978 (celery) score 56 or sacrifice child
Nov 28 20:19:03 scb2006 kernel: [1821613.764689] Out of memory: Kill process 464 (celery) score 59 or sacrifice child
Nov 28 20:19:22 scb2006 kernel: [1821633.840057] Out of memory: Kill process 31833 (celery) score 56 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1002.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 07:56:30 scb1002 kernel: [1546177.910238] Out of memory: Kill process 32183 (celery) score 43 or sacrifice child                     
Nov 27 07:56:53 scb1002 kernel: [1546201.216850] Out of memory: Kill process 32075 (celery) score 43 or sacrifice child                     
Nov 27 08:18:25 scb1002 kernel: [1547492.777898] Out of memory: Kill process 31576 (celery) score 50 or sacrifice child
Nov 27 14:06:52 scb1002 kernel: [1568399.586493] Out of memory: Kill process 28843 (celery) score 49 or sacrifice child
Nov 27 17:38:57 scb1002 kernel: [1581125.736152] Out of memory: Kill process 10831 (celery) score 40 or sacrifice child
Nov 27 18:07:48 scb1002 kernel: [1582856.104279] Out of memory: Kill process 8376 (celery) score 45 or sacrifice child
Nov 28 05:06:55 scb1002 kernel: [1622402.561065] Out of memory: Kill process 32369 (celery) score 44 or sacrifice child
Nov 28 13:37:59 scb1002 kernel: [1653066.841579] Out of memory: Kill process 12422 (celery) score 40 or sacrifice child
Nov 28 16:05:32 scb1002 kernel: [1661920.501555] Out of memory: Kill process 1186 (celery) score 41 or sacrifice child
Nov 28 16:38:34 scb1002 kernel: [1663719.526667] Out of memory: Kill process 674 (electron) score 40 or sacrifice child
Nov 28 16:38:34 scb1002 kernel: [1663719.579351] Out of memory: Kill process 674 (electron) score 40 or sacrifice child
Nov 28 16:50:11 scb1002 kernel: [1664595.040206] Out of memory: Kill process 2317 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664596.022986] Out of memory: Kill process 2277 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664597.421139] Out of memory: Kill process 2376 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664598.075725] Out of memory: Kill process 1570 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664600.149305] Out of memory: Kill process 2390 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664600.457883] Out of memory: Kill process 6403 (celery) score 40 or sacrifice child
Nov 28 17:38:49 scb1002 kernel: [1667513.784895] Out of memory: Kill process 28241 (electron) score 301 or sacrifice child
Nov 28 20:36:50 scb1002 kernel: [1678190.284445] Out of memory: Kill process 28298 (electron) score 300 or sacrifice child
Nov 28 20:36:50 scb1002 kernel: [1678196.021986] Out of memory: Kill process 28414 (electron) score 300 or sacrifice child
Nov 28 20:40:30 scb1002 kernel: [1678417.488314] Out of memory: Kill process 28246 (electron) score 300 or sacrifice child
Nov 28 20:40:52 scb1002 kernel: [1678440.015374] Out of memory: Kill process 1049 (celery) score 52 or sacrifice child
Nov 28 20:41:42 scb1002 kernel: [1678489.810097] Out of memory: Kill process 809 (celery) score 47 or sacrifice child
Nov 28 21:31:34 scb1002 kernel: [1681482.312840] Out of memory: Kill process 32234 (celery) score 47 or sacrifice child
Nov 28 22:12:34 scb1002 kernel: [1683941.896235] Out of memory: Kill process 23184 (celery) score 48 or sacrifice child
Nov 28 23:12:33 scb1002 kernel: [1687540.863042] Out of memory: Kill process 32054 (celery) score 48 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2002.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:15:12 scb2002 kernel: [1819508.530270] Out of memory: Kill process 29606 (celery) score 64 or sacrifice child                     
Nov 28 19:16:14 scb2002 kernel: [1819572.064765] Out of memory: Kill process 28399 (celery) score 56 or sacrifice child                     
Nov 28 20:17:46 scb2002 kernel: [1823264.261442] Out of memory: Kill process 13362 (celery) score 57 or sacrifice child
Nov 28 20:18:24 scb2002 kernel: [1823300.812261] Out of memory: Kill process 13532 (celery) score 56 or sacrifice child
Nov 28 20:18:55 scb2002 kernel: [1823333.191774] Out of memory: Kill process 13505 (celery) score 56 or sacrifice child
Nov 28 20:19:27 scb2002 kernel: [1823364.559618] Out of memory: Kill process 12629 (celery) score 58 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1001.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 18:17:53 scb1001 kernel: [1583846.626530] Out of memory: Kill process 15244 (electron) score 300 or sacrifice child                  
Nov 27 18:17:53 scb1001 kernel: [1583846.667526] Out of memory: Kill process 15738 (electron) score 300 or sacrifice child                  
Nov 27 18:47:14 scb1001 kernel: [1585613.251738] Out of memory: Kill process 21875 (electron) score 301 or sacrifice child
Nov 27 18:49:28 scb1001 kernel: [1585753.832883] Out of memory: Kill process 19760 (electron) score 300 or sacrifice child
Nov 27 18:51:15 scb1001 kernel: [1585854.631345] Out of memory: Kill process 17838 (electron) score 300 or sacrifice child
Nov 27 18:51:16 scb1001 kernel: [1585859.243415] Out of memory: Kill process 18953 (electron) score 300 or sacrifice child
Nov 27 18:51:25 scb1001 kernel: [1585863.360719] Out of memory: Kill process 20557 (electron) score 300 or sacrifice child
Nov 27 18:51:25 scb1001 kernel: [1585866.487870] Out of memory: Kill process 15998 (electron) score 300 or sacrifice child
Nov 27 18:51:26 scb1001 kernel: [1585869.979075] Out of memory: Kill process 14219 (celery) score 44 or sacrifice child
Nov 28 12:47:20 scb1001 kernel: [1650404.019531] Out of memory: Kill process 12738 (electron) score 300 or sacrifice child
Nov 28 12:47:20 scb1001 kernel: [1650426.186987] Out of memory: Kill process 30398 (celery) score 42 or sacrifice child
Nov 28 16:15:25 scb1001 kernel: [1662905.270102] Out of memory: Kill process 30931 (electron) score 301 or sacrifice child
Nov 28 16:16:58 scb1001 kernel: [1663003.554926] Out of memory: Kill process 9409 (electron) score 44 or sacrifice child
Nov 28 16:16:58 scb1001 kernel: [1663004.249501] Out of memory: Kill process 9409 (electron) score 44 or sacrifice child
Nov 28 17:46:52 scb1001 kernel: [1668370.625754] Out of memory: Kill process 28494 (electron) score 300 or sacrifice child
Nov 28 17:46:52 scb1001 kernel: [1668396.187860] Out of memory: Kill process 25720 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668414.806314] Out of memory: Kill process 27927 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668424.622775] Out of memory: Kill process 26882 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668431.823590] Out of memory: Kill process 14328 (celery) score 41 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668676.498039] Out of memory: Kill process 30886 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668708.461017] Out of memory: Kill process 1476 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668721.010736] Out of memory: Kill process 1450 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668734.680660] Out of memory: Kill process 29856 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668742.246089] Out of memory: Kill process 14407 (celery) score 44 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2003.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 20:58:00 scb2003 kernel: [1738717.699534] Out of memory: Kill process 12113 (nodejs) score 66 or sacrifice child                     
Nov 28 20:15:03 scb2003 kernel: [1822542.503895] Out of memory: Kill process 605 (celery) score 57 or sacrifice child                       
Nov 28 20:15:56 scb2003 kernel: [1822595.553718] Out of memory: Kill process 736 (celery) score 56 or sacrifice child
Nov 28 20:16:27 scb2003 kernel: [1822627.273686] Out of memory: Kill process 636 (celery) score 55 or sacrifice child
Nov 28 20:17:03 scb2003 kernel: [1822662.909382] Out of memory: Kill process 834 (celery) score 51 or sacrifice child
Nov 28 20:18:35 scb2003 kernel: [1822755.305824] Out of memory: Kill process 18077 (celery) score 51 or sacrifice child
Nov 28 20:19:13 scb2003 kernel: [1822793.103879] Out of memory: Kill process 318 (celery) score 51 or sacrifice child
Nov 28 20:19:54 scb2003 kernel: [1822834.439551] Out of memory: Kill process 452 (celery) score 52 or sacrifice child
Nov 28 20:20:41 scb2003 kernel: [1822880.687558] Out of memory: Kill process 600 (celery) score 53 or sacrifice child
Nov 28 20:23:25 scb2003 kernel: [1823046.354950] Out of memory: Kill process 20020 (celery) score 55 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2001.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:15:10 scb2001 kernel: [1819954.568507] Out of memory: Kill process 13811 (celery) score 52 or sacrifice child                     
Nov 28 19:16:52 scb2001 kernel: [1820056.177766] Out of memory: Kill process 14406 (celery) score 51 or sacrifice child                     
Nov 28 19:17:33 scb2001 kernel: [1820097.252114] Out of memory: Kill process 14393 (celery) score 55 or sacrifice child
Nov 28 19:18:14 scb2001 kernel: [1820138.430891] Out of memory: Kill process 13702 (celery) score 50 or sacrifice child
Nov 28 19:19:24 scb2001 kernel: [1820208.961718] Out of memory: Kill process 14263 (celery) score 51 or sacrifice child
Nov 28 19:19:24 scb2001 kernel: [1820209.589660] Out of memory: Kill process 14159 (celery) score 50 or sacrifice child
Nov 28 19:19:54 scb2001 kernel: [1820238.675306] Out of memory: Kill process 13688 (celery) score 52 or sacrifice child
Nov 28 19:22:12 scb2001 kernel: [1820377.530969] Out of memory: Kill process 14603 (celery) score 53 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1003.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 21:55:20 scb1003 kernel: [1682428.632717] Out of memory: Kill process 41989 (electron) score 300 or sacrifice child                  
Nov 28 21:55:25 scb1003 kernel: [1682429.417041] Out of memory: Kill process 32909 (electron) score 300 or sacrifice child                  
Nov 28 21:55:26 scb1003 kernel: [1682429.777013] Out of memory: Kill process 42310 (electron) score 300 or sacrifice child
Nov 28 21:55:26 scb1003 kernel: [1682430.582462] Out of memory: Kill process 42475 (electron) score 300 or sacrifice child
Nov 28 21:55:27 scb1003 kernel: [1682435.295307] Out of memory: Kill process 18197 (celery) score 33 or sacrifice child
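
The kernel log excerpts above can be tallied mechanically, for example to separate celery kills from electron kills per host. A minimal Python sketch, assuming the `kern.log` line format shown above; `OOM_RE` and `count_oom_kills` are hypothetical names, and the sample lines are copied from the scb2006 and scb1003 output:

```python
import re
from collections import Counter

# Matches lines such as:
# "Nov 28 19:18:16 scb2006 kernel: [1817966.004881] Out of memory: Kill process 27002 (celery) score 52 ..."
OOM_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*[\d.]+\] Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<comm>[^)]+)\) score (?P<score>\d+)"
)


def count_oom_kills(lines):
    """Count OOM-killer victims per (host, process name)."""
    counts = Counter()
    for line in lines:
        m = OOM_RE.match(line.strip())
        if m:  # separator and header lines simply do not match
            counts[(m.group("host"), m.group("comm"))] += 1
    return counts


SAMPLE = [
    "Nov 28 19:18:16 scb2006 kernel: [1817966.004881] Out of memory: Kill process 27002 (celery) score 52 or sacrifice child",
    "Nov 28 21:55:20 scb1003 kernel: [1682428.632717] Out of memory: Kill process 41989 (electron) score 300 or sacrifice child",
    "Nov 28 21:55:27 scb1003 kernel: [1682435.295307] Out of memory: Kill process 18197 (celery) score 33 or sacrifice child",
    "===== NODE GROUP =====",
]

for (host, comm), n in sorted(count_oom_kills(SAMPLE).items()):
    print(host, comm, n)
```

Feeding it the full grep output would give per-host totals, which makes it easier to see that the codfw hosts only killed celery while the eqiad hosts also killed electron.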

akosiaris reopened subtask T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* as Open. Nov 29 2017, 8:02 AM

Change 394037 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ORES: lower celery concurrency for scb100{1,2}

https://gerrit.wikimedia.org/r/394037

Change 394037 merged by Alexandros Kosiaris:
[operations/puppet@production] ORES: lower celery concurrency for scb100{1,2}

https://gerrit.wikimedia.org/r/394037

akosiaris closed subtask T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* as Resolved. Nov 29 2017, 11:19 AM

Change 394047 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Increase ORES queue_maxsize by 20%

https://gerrit.wikimedia.org/r/394047

Looks like we'll be doing the same thing today. There have been intermittent overload incidents for the last two hours, during which eqiad performance has dropped to nearly zero:

https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511948780946&to=1511958356189

awight created subtask T181621: What is causing ORES celery workers to suddenly require more CPU? Nov 29 2017, 1:03 PM

Mholloway subscribed. Nov 29 2017, 1:54 PM

Most of our worker boxes are down. Here's the app.log from one that's down, last written to 2 hours ago:

Connection to Redis lost: Retry (0/20) now.
Connection to Redis lost: Retry (1/20) in 1.00 second.
Connection to Redis lost: Retry (2/20) in 1.00 second.
Connection to Redis lost: Retry (3/20) in 1.00 second.
Connection to Redis lost: Retry (4/20) in 1.00 second.
Connection to Redis lost: Retry (5/20) in 1.00 second.
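
The log above shows a fixed one-second wait between attempts, up to 20 retries. For comparison, a bounded retry with capped exponential backoff could look like the sketch below. This is only an illustration in Python and is not how Celery or its transport implement reconnection; `connect`, `backoff_delays` and `call_with_retries` are hypothetical names:

```python
import time


def backoff_delays(max_retries=20, base=1.0, cap=30.0):
    """Yield the wait in seconds before each retry: base, 2*base, ... capped."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))


def call_with_retries(connect, max_retries=20, base=1.0, cap=30.0,
                      sleep=time.sleep):
    """Call `connect()` until it succeeds or the retries are exhausted.

    Re-raises the last ConnectionError once `max_retries` retries have failed.
    """
    try:
        return connect()
    except ConnectionError as exc:
        last_error = exc
    for delay in backoff_delays(max_retries, base, cap):
        sleep(delay)  # wait longer after each consecutive failure
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Raising once the retries run out lets a supervisor restart the worker or page someone, rather than leaving a process that is up but no longer doing anything.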

awight added a comment. Nov 29 2017, 2:25 PM

This comment was removed by awight.

Change 394060 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Increase celery verbosity; use message format including timestamp

https://gerrit.wikimedia.org/r/394060

awight mentioned this in rORESDEPLOYe231dde3e3aa: Increase celery verbosity; use message format including timestamp. Nov 29 2017, 2:32 PM

awight created subtask T181630: Send celery and wsgi service logs to logstash. Nov 29 2017, 2:34 PM

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:35:42Z] <awight@tin> Started deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:35:58Z] <awight@tin> Finished deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538 (duration: 00m 16s)

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:37:27Z] <awight@tin> Started restart [ores/deploy@e58bfbf]: Restart ORES services (take 2), T181538

awight created subtask T181632: Celery manager implodes horribly if Redis goes down. Nov 29 2017, 2:58 PM

Mentioned in SAL (#wikimedia-operations) [2017-11-29T15:00:30Z] <awight> Restarting ORES celery workers manually, T181538

awight created subtask T181634: Investigate overload condition, seems that we lose nodes. Nov 29 2017, 3:11 PM

Change 394060 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Increase celery verbosity; use message format including timestamp

https://gerrit.wikimedia.org/r/394060

akosiaris closed subtask T181544: Investigate scb1001 and scb1002 available memory graphs in Grafana as Resolved. Nov 30 2017, 6:14 PM

akosiaris created subtask T181795: Create an incident report for ORES overload incident 2017. Dec 1 2017, 11:49 AM

awight reopened subtask T181544: Investigate scb1001 and scb1002 available memory graphs in Grafana as Open. Dec 1 2017, 2:38 PM

akosiaris closed subtask T181544: Investigate scb1001 and scb1002 available memory graphs in Grafana as Resolved. Dec 1 2017, 3:01 PM

awight closed subtask T181567: Rate limit thresholds requests when the service is down as Resolved. Dec 6 2017, 9:47 PM

awight closed this task as Resolved. Dec 6 2017, 9:53 PM

awight removed a project: Patch-For-Review.