ORES overload incident, 2017-11-28
Closed, Resolved · Public

Description

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20171128-ORES

scb1001 and scb1002 are unable to keep up with a long barrage of extra scoring requests.

https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1511881217180&to=1511891957181

Strange log message seen on scb1001:

OSError: write error

Also concerning: that error was logged at INFO severity. See P6389.
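
To illustrate the severity concern, here is a minimal Python sketch of reporting a failed write at ERROR level instead of INFO. The logger name and the `write_or_log` helper are hypothetical and are not taken from the ORES code; the sketch only shows the standard `logging` pattern.

```python
import logging

logger = logging.getLogger("ores.example")  # hypothetical logger name


def write_or_log(write, payload):
    """Call write(payload); on OSError, log at ERROR severity and return False."""
    try:
        write(payload)
        return True
    except OSError:
        # logger.exception() logs at ERROR level and attaches the traceback.
        logger.exception("write failed")
        return False
```

With this pattern the failure would show up in any alerting or search that filters on ERROR and above.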

awight created this task. Nov 28 2017, 6:01 PM
Restricted Application added a subscriber: Aklapper. Nov 28 2017, 6:01 PM
awight updated the task description. Nov 28 2017, 6:03 PM

Running top directly on scb100[1-2] showed very different available-memory levels from the ORES Grafana dashboard, which never showed a dip below c. 20GB. This discrepancy needs to be fixed.
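
One way to cross-check a dashboard figure against the host itself is to read the kernel's own estimate from /proc/meminfo. The sketch below is an assumption about how such a check could look, not how the ORES dashboard is fed. It parses meminfo-style text and returns the `MemAvailable` field, which exists on reasonably recent Linux kernels.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text):
    """Return MemAvailable (reported in kB) as GiB, or None if the field is missing."""
    kb = parse_meminfo(text).get("MemAvailable")
    return None if kb is None else kb / (1024 * 1024)
```

On a host this could be called as `available_gib(open("/proc/meminfo").read())` and compared with the value the dashboard shows for the same moment.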

Change 393820 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Failover ORES to codfw

https://gerrit.wikimedia.org/r/393820

Change 393820 merged by Alexandros Kosiaris:
[operations/puppet@production] Failover ORES to codfw

https://gerrit.wikimedia.org/r/393820

Mentioned in SAL (#wikimedia-operations) [2017-11-28T18:32:50Z] <akosiaris> force puppet run on cache::misc boxes T181538

scb1001 and scb1002 had OOMs show up.

[12:04:04] <akosiaris> now on scb1001 OOM showed up
...
[12:05:32] <akosiaris> and scb1002
[12:05:41] <akosiaris> both had OOM show up
Halfak added a comment. Edited Nov 28 2017, 6:38 PM

We had a sudden increase in requests/min for ORES around 1600 UTC, but we saw bigger spikes around 1300 UTC that did not cause timeouts or memory issues.

[Graph: External requests]
[Graph: Overload errors]

We've failed over to CODFW.

greg moved this task from To Triage to Active Situation on the Wikimedia-Incident board.

From https://grafana.wikimedia.org/dashboard/db/ores?panelId=14&fullscreen&orgId=1&from=1511872559429&to=1511894099429, we can see that the memory consumption of web workers on scb1001/scb1002 begins to fall shortly after 1600 UTC. Could this be due to OOM killing? There's no spike in ORES memory usage before the decline begins. @akosiaris has offered to note the OOM event timestamps.

Looks like we've been hitting memory limits for quite a while, at least since Oct 26th:
https://logstash.wikimedia.org/goto/4e642cc6677ef824ec3397a507249637

All nodes in CODFW just went down at the same time, for a short period. See:

[13:42:38] <mutante> https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2001.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw
[13:42:49] <mutante> there you can see it, network just stops
[13:44:01] <mutante> and on the other server https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2002.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw

My hypothesis is that our redis nodes had a network blip and that caused all scoring requests to back up for a period.

awight added a subscriber: Dzahn.Nov 28 2017, 8:21 PM

@Dzahn found a clue to the latest *codfw* outage, in which oresrdb Redis network traffic spikes and then crashes to zero:

[3:13pm] mutante: "1514 [510] 28 Nov 19:17:31.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[3:13pm] mutante: ^ this happened shortly before the outage
[3:14pm] mutante: only shows up on 2001, not on 2002

Change 393924 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] ORES: Use the internal discovery URL

https://gerrit.wikimedia.org/r/393924

Change 393924 merged by Alexandros Kosiaris:
[operations/mediawiki-config@master] ORES: Use the internal discovery URL

https://gerrit.wikimedia.org/r/393924

Mentioned in SAL (#wikimedia-operations) [2017-11-28T22:31:34Z] <akosiaris> deploy wmf-config/CommonSettings.php for ORES internal discovery URL, https://gerrit.wikimedia.org/r/#/c/393924/ T181538

Change 393930 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] ORES: Fix $wgOresBaseUrl

https://gerrit.wikimedia.org/r/393930

Change 393930 merged by Alexandros Kosiaris:
[operations/mediawiki-config@master] ORES: Fix $wgOresBaseUrl

https://gerrit.wikimedia.org/r/393930

OK. The dominant hypothesis is that we are DoS-ing ourselves via the ORES extension. When we accidentally broke $wgOresBaseUrl, the service returned to normal for a period.

Mentioned in SAL (#wikimedia-operations) [2017-11-28T23:07:44Z] <akosiaris@tin> Synchronized wmf-config/CommonSettings.php: T181538 (duration: 00m 49s)

BTW, see T181567, where we initially describe the correlation between failed "test_stats" requests from MediaWiki and the downtime events.

It looks like we have causal evidence that MW is hammering ORES into the ground. Now that $wgOresBaseUrl is fixed, ORES is again overloaded and suffering.

This graph shows the whole period during which ORES was getting hammered with "test_stats" requests, including the 35-minute period when $wgOresBaseUrl was broken: https://logstash.wikimedia.org/goto/91568910c65b23afcbfab5f15120c7e1

This graph shows that, during that time period, ORES was not overloaded: https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-6h&to=now-1m

Adding OOM kernel logs per host for posterity's sake.

Feel free to ignore electron in the logs. It is already memory-limited by MemoryLimit=2G in its configuration. See https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/pdfrender/templates/initscripts/pdfrender.systemd.erb;3b73a1d95bbfbe8eddc095b9c06713aee87a4042$10

===== NODE GROUP =====                                                                                                                      
(1) scb2005.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:16:06 scb2005 kernel: [1818194.621901] Out of memory: Kill process 11579 (celery) score 53 or sacrifice child                     
Nov 28 19:16:09 scb2005 kernel: [1818195.042352] Out of memory: Kill process 9721 (celery) score 51 or sacrifice child                      
Nov 28 19:16:56 scb2005 kernel: [1818242.388117] Out of memory: Kill process 10231 (celery) score 51 or sacrifice child
Nov 28 19:17:46 scb2005 kernel: [1818291.854272] Out of memory: Kill process 10557 (celery) score 54 or sacrifice child
Nov 28 19:18:25 scb2005 kernel: [1818329.970783] Out of memory: Kill process 11004 (celery) score 52 or sacrifice child
Nov 28 19:19:16 scb2005 kernel: [1818384.372686] Out of memory: Kill process 11189 (celery) score 53 or sacrifice child
Nov 28 19:19:43 scb2005 kernel: [1818410.753576] Out of memory: Kill process 11306 (celery) score 53 or sacrifice child
Nov 28 19:19:48 scb2005 kernel: [1818416.323430] Out of memory: Kill process 11461 (celery) score 53 or sacrifice child
Nov 28 19:20:18 scb2005 kernel: [1818444.178684] Out of memory: Kill process 10920 (celery) score 53 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2004.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:14:44 scb2004 kernel: [1818432.259586] Out of memory: Kill process 14164 (celery) score 51 or sacrifice child                     
Nov 28 19:15:08 scb2004 kernel: [1818458.900757] Out of memory: Kill process 14661 (celery) score 56 or sacrifice child                     
Nov 28 19:15:33 scb2004 kernel: [1818484.900006] Out of memory: Kill process 15195 (celery) score 51 or sacrifice child
Nov 28 19:15:53 scb2004 kernel: [1818503.443654] Out of memory: Kill process 14372 (celery) score 56 or sacrifice child
Nov 28 19:16:19 scb2004 kernel: [1818529.093477] Out of memory: Kill process 15020 (celery) score 51 or sacrifice child
Nov 28 19:17:09 scb2004 kernel: [1818579.232931] Out of memory: Kill process 14186 (celery) score 54 or sacrifice child
Nov 28 20:16:23 scb2004 kernel: [1822131.567736] Out of memory: Kill process 8118 (celery) score 53 or sacrifice child
Nov 28 20:16:50 scb2004 kernel: [1822160.431118] Out of memory: Kill process 8443 (celery) score 53 or sacrifice child
Nov 28 20:17:18 scb2004 kernel: [1822188.640353] Out of memory: Kill process 8227 (celery) score 52 or sacrifice child
Nov 28 20:18:15 scb2004 kernel: [1822242.896041] Out of memory: Kill process 8082 (celery) score 51 or sacrifice child
Nov 28 20:18:45 scb2004 kernel: [1822274.719252] Out of memory: Kill process 8878 (celery) score 52 or sacrifice child
Nov 28 20:19:15 scb2004 kernel: [1822306.235702] Out of memory: Kill process 8330 (celery) score 50 or sacrifice child
Nov 28 20:19:50 scb2004 kernel: [1822341.461230] Out of memory: Kill process 8291 (celery) score 49 or sacrifice child
Nov 28 20:20:33 scb2004 kernel: [1822383.418588] Out of memory: Kill process 8345 (celery) score 50 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2006.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:18:16 scb2006 kernel: [1817966.004881] Out of memory: Kill process 27002 (celery) score 52 or sacrifice child                     
Nov 28 20:15:39 scb2006 kernel: [1821411.106727] Out of memory: Kill process 32070 (celery) score 62 or sacrifice child                     
Nov 28 20:16:09 scb2006 kernel: [1821440.560616] Out of memory: Kill process 31978 (celery) score 56 or sacrifice child
Nov 28 20:19:03 scb2006 kernel: [1821613.764689] Out of memory: Kill process 464 (celery) score 59 or sacrifice child
Nov 28 20:19:22 scb2006 kernel: [1821633.840057] Out of memory: Kill process 31833 (celery) score 56 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1002.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 07:56:30 scb1002 kernel: [1546177.910238] Out of memory: Kill process 32183 (celery) score 43 or sacrifice child                     
Nov 27 07:56:53 scb1002 kernel: [1546201.216850] Out of memory: Kill process 32075 (celery) score 43 or sacrifice child                     
Nov 27 08:18:25 scb1002 kernel: [1547492.777898] Out of memory: Kill process 31576 (celery) score 50 or sacrifice child
Nov 27 14:06:52 scb1002 kernel: [1568399.586493] Out of memory: Kill process 28843 (celery) score 49 or sacrifice child
Nov 27 17:38:57 scb1002 kernel: [1581125.736152] Out of memory: Kill process 10831 (celery) score 40 or sacrifice child
Nov 27 18:07:48 scb1002 kernel: [1582856.104279] Out of memory: Kill process 8376 (celery) score 45 or sacrifice child
Nov 28 05:06:55 scb1002 kernel: [1622402.561065] Out of memory: Kill process 32369 (celery) score 44 or sacrifice child
Nov 28 13:37:59 scb1002 kernel: [1653066.841579] Out of memory: Kill process 12422 (celery) score 40 or sacrifice child
Nov 28 16:05:32 scb1002 kernel: [1661920.501555] Out of memory: Kill process 1186 (celery) score 41 or sacrifice child
Nov 28 16:38:34 scb1002 kernel: [1663719.526667] Out of memory: Kill process 674 (electron) score 40 or sacrifice child
Nov 28 16:38:34 scb1002 kernel: [1663719.579351] Out of memory: Kill process 674 (electron) score 40 or sacrifice child
Nov 28 16:50:11 scb1002 kernel: [1664595.040206] Out of memory: Kill process 2317 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664596.022986] Out of memory: Kill process 2277 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664597.421139] Out of memory: Kill process 2376 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664598.075725] Out of memory: Kill process 1570 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664600.149305] Out of memory: Kill process 2390 (electron) score 300 or sacrifice child
Nov 28 16:50:13 scb1002 kernel: [1664600.457883] Out of memory: Kill process 6403 (celery) score 40 or sacrifice child
Nov 28 17:38:49 scb1002 kernel: [1667513.784895] Out of memory: Kill process 28241 (electron) score 301 or sacrifice child
Nov 28 20:36:50 scb1002 kernel: [1678190.284445] Out of memory: Kill process 28298 (electron) score 300 or sacrifice child
Nov 28 20:36:50 scb1002 kernel: [1678196.021986] Out of memory: Kill process 28414 (electron) score 300 or sacrifice child
Nov 28 20:40:30 scb1002 kernel: [1678417.488314] Out of memory: Kill process 28246 (electron) score 300 or sacrifice child
Nov 28 20:40:52 scb1002 kernel: [1678440.015374] Out of memory: Kill process 1049 (celery) score 52 or sacrifice child
Nov 28 20:41:42 scb1002 kernel: [1678489.810097] Out of memory: Kill process 809 (celery) score 47 or sacrifice child
Nov 28 21:31:34 scb1002 kernel: [1681482.312840] Out of memory: Kill process 32234 (celery) score 47 or sacrifice child
Nov 28 22:12:34 scb1002 kernel: [1683941.896235] Out of memory: Kill process 23184 (celery) score 48 or sacrifice child
Nov 28 23:12:33 scb1002 kernel: [1687540.863042] Out of memory: Kill process 32054 (celery) score 48 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2002.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:15:12 scb2002 kernel: [1819508.530270] Out of memory: Kill process 29606 (celery) score 64 or sacrifice child                     
Nov 28 19:16:14 scb2002 kernel: [1819572.064765] Out of memory: Kill process 28399 (celery) score 56 or sacrifice child                     
Nov 28 20:17:46 scb2002 kernel: [1823264.261442] Out of memory: Kill process 13362 (celery) score 57 or sacrifice child
Nov 28 20:18:24 scb2002 kernel: [1823300.812261] Out of memory: Kill process 13532 (celery) score 56 or sacrifice child
Nov 28 20:18:55 scb2002 kernel: [1823333.191774] Out of memory: Kill process 13505 (celery) score 56 or sacrifice child
Nov 28 20:19:27 scb2002 kernel: [1823364.559618] Out of memory: Kill process 12629 (celery) score 58 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1001.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 18:17:53 scb1001 kernel: [1583846.626530] Out of memory: Kill process 15244 (electron) score 300 or sacrifice child                  
Nov 27 18:17:53 scb1001 kernel: [1583846.667526] Out of memory: Kill process 15738 (electron) score 300 or sacrifice child                  
Nov 27 18:47:14 scb1001 kernel: [1585613.251738] Out of memory: Kill process 21875 (electron) score 301 or sacrifice child
Nov 27 18:49:28 scb1001 kernel: [1585753.832883] Out of memory: Kill process 19760 (electron) score 300 or sacrifice child
Nov 27 18:51:15 scb1001 kernel: [1585854.631345] Out of memory: Kill process 17838 (electron) score 300 or sacrifice child
Nov 27 18:51:16 scb1001 kernel: [1585859.243415] Out of memory: Kill process 18953 (electron) score 300 or sacrifice child
Nov 27 18:51:25 scb1001 kernel: [1585863.360719] Out of memory: Kill process 20557 (electron) score 300 or sacrifice child
Nov 27 18:51:25 scb1001 kernel: [1585866.487870] Out of memory: Kill process 15998 (electron) score 300 or sacrifice child
Nov 27 18:51:26 scb1001 kernel: [1585869.979075] Out of memory: Kill process 14219 (celery) score 44 or sacrifice child
Nov 28 12:47:20 scb1001 kernel: [1650404.019531] Out of memory: Kill process 12738 (electron) score 300 or sacrifice child
Nov 28 12:47:20 scb1001 kernel: [1650426.186987] Out of memory: Kill process 30398 (celery) score 42 or sacrifice child
Nov 28 16:15:25 scb1001 kernel: [1662905.270102] Out of memory: Kill process 30931 (electron) score 301 or sacrifice child
Nov 28 16:16:58 scb1001 kernel: [1663003.554926] Out of memory: Kill process 9409 (electron) score 44 or sacrifice child
Nov 28 16:16:58 scb1001 kernel: [1663004.249501] Out of memory: Kill process 9409 (electron) score 44 or sacrifice child
Nov 28 17:46:52 scb1001 kernel: [1668370.625754] Out of memory: Kill process 28494 (electron) score 300 or sacrifice child
Nov 28 17:46:52 scb1001 kernel: [1668396.187860] Out of memory: Kill process 25720 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668414.806314] Out of memory: Kill process 27927 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668424.622775] Out of memory: Kill process 26882 (electron) score 300 or sacrifice child
Nov 28 17:47:26 scb1001 kernel: [1668431.823590] Out of memory: Kill process 14328 (celery) score 41 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668676.498039] Out of memory: Kill process 30886 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668708.461017] Out of memory: Kill process 1476 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668721.010736] Out of memory: Kill process 1450 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668734.680660] Out of memory: Kill process 29856 (electron) score 300 or sacrifice child
Nov 28 17:52:37 scb1001 kernel: [1668742.246089] Out of memory: Kill process 14407 (celery) score 44 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2003.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 27 20:58:00 scb2003 kernel: [1738717.699534] Out of memory: Kill process 12113 (nodejs) score 66 or sacrifice child                     
Nov 28 20:15:03 scb2003 kernel: [1822542.503895] Out of memory: Kill process 605 (celery) score 57 or sacrifice child                       
Nov 28 20:15:56 scb2003 kernel: [1822595.553718] Out of memory: Kill process 736 (celery) score 56 or sacrifice child
Nov 28 20:16:27 scb2003 kernel: [1822627.273686] Out of memory: Kill process 636 (celery) score 55 or sacrifice child
Nov 28 20:17:03 scb2003 kernel: [1822662.909382] Out of memory: Kill process 834 (celery) score 51 or sacrifice child
Nov 28 20:18:35 scb2003 kernel: [1822755.305824] Out of memory: Kill process 18077 (celery) score 51 or sacrifice child
Nov 28 20:19:13 scb2003 kernel: [1822793.103879] Out of memory: Kill process 318 (celery) score 51 or sacrifice child
Nov 28 20:19:54 scb2003 kernel: [1822834.439551] Out of memory: Kill process 452 (celery) score 52 or sacrifice child
Nov 28 20:20:41 scb2003 kernel: [1822880.687558] Out of memory: Kill process 600 (celery) score 53 or sacrifice child
Nov 28 20:23:25 scb2003 kernel: [1823046.354950] Out of memory: Kill process 20020 (celery) score 55 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb2001.codfw.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 19:15:10 scb2001 kernel: [1819954.568507] Out of memory: Kill process 13811 (celery) score 52 or sacrifice child                     
Nov 28 19:16:52 scb2001 kernel: [1820056.177766] Out of memory: Kill process 14406 (celery) score 51 or sacrifice child                     
Nov 28 19:17:33 scb2001 kernel: [1820097.252114] Out of memory: Kill process 14393 (celery) score 55 or sacrifice child
Nov 28 19:18:14 scb2001 kernel: [1820138.430891] Out of memory: Kill process 13702 (celery) score 50 or sacrifice child
Nov 28 19:19:24 scb2001 kernel: [1820208.961718] Out of memory: Kill process 14263 (celery) score 51 or sacrifice child
Nov 28 19:19:24 scb2001 kernel: [1820209.589660] Out of memory: Kill process 14159 (celery) score 50 or sacrifice child
Nov 28 19:19:54 scb2001 kernel: [1820238.675306] Out of memory: Kill process 13688 (celery) score 52 or sacrifice child
Nov 28 19:22:12 scb2001 kernel: [1820377.530969] Out of memory: Kill process 14603 (celery) score 53 or sacrifice child
===== NODE GROUP =====                                                                                                                      
(1) scb1003.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'grep "Out of" /var/log/kern.log' -----                                                                                     
Nov 28 21:55:20 scb1003 kernel: [1682428.632717] Out of memory: Kill process 41989 (electron) score 300 or sacrifice child                  
Nov 28 21:55:25 scb1003 kernel: [1682429.417041] Out of memory: Kill process 32909 (electron) score 300 or sacrifice child                  
Nov 28 21:55:26 scb1003 kernel: [1682429.777013] Out of memory: Kill process 42310 (electron) score 300 or sacrifice child
Nov 28 21:55:26 scb1003 kernel: [1682430.582462] Out of memory: Kill process 42475 (electron) score 300 or sacrifice child
Nov 28 21:55:27 scb1003 kernel: [1682435.295307] Out of memory: Kill process 18197 (celery) score 33 or sacrifice child
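
The per-host output above is easier to compare once the kills are tallied by host and process name. The following is a minimal Python sketch that does this for lines in the format shown. The regular expression is written against this paste only and is an assumption about the kern.log format; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# Nov 28 19:16:06 scb2005 kernel: [1818194.621901] Out of memory: Kill process 11579 (celery) score 53 or sacrifice child
OOM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*[\d.]+\]\s+Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def count_oom_kills(lines):
    """Count OOM kill lines per (host, process name); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.match(line.strip())
        if match:
            counts[(match["host"], match["name"])] += 1
    return counts
```

Feeding it the pasted output, or `/var/log/kern.log` on each host, gives a quick view of how many celery kills each node took compared with the electron kills that can be ignored.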

Change 394037 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ORES: lower celery concurrency for scb100{1,2}

https://gerrit.wikimedia.org/r/394037

Change 394037 merged by Alexandros Kosiaris:
[operations/puppet@production] ORES: lower celery concurrency for scb100{1,2}

https://gerrit.wikimedia.org/r/394037

Change 394047 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Increase ORES queue_maxsize by 20%

https://gerrit.wikimedia.org/r/394047

Looks like we'll be doing the same thing today. There have been intermittent overload incidents for the last two hours, during which eqiad performance has dropped to nearly zero:

https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511948780946&to=1511958356189

Most of our worker boxes are down. Here's the app.log from one that's down, last written to 2 hours ago:

Connection to Redis lost: Retry (0/20) now.
Connection to Redis lost: Retry (1/20) in 1.00 second.
Connection to Redis lost: Retry (2/20) in 1.00 second.
Connection to Redis lost: Retry (3/20) in 1.00 second.
Connection to Redis lost: Retry (4/20) in 1.00 second.
Connection to Redis lost: Retry (5/20) in 1.00 second.
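
The log above suggests a bounded reconnect: a fixed number of retries with a one-second pause between them. As a rough illustration of that pattern only, and not of Celery's actual implementation, a retry loop of this kind could be sketched as follows. The function name and parameters are hypothetical.

```python
import time


def retry_connect(connect, max_retries=20, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_retries retries are used up.

    Returns whatever connect() returns. Re-raises the last ConnectionError
    once the retry budget is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # budget exhausted: give up and surface the error
            sleep(delay)
```

In a loop like this, a Redis outage that outlasts the retry budget ends with the caller giving up, so such a worker would need an external restart to recover.
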

Change 394060 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Increase celery verbosity; use message format including timestamp

https://gerrit.wikimedia.org/r/394060

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:35:42Z] <awight@tin> Started deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:35:58Z] <awight@tin> Finished deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538 (duration: 00m 16s)

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:37:27Z] <awight@tin> Started restart [ores/deploy@e58bfbf]: Restart ORES services (take 2), T181538

Mentioned in SAL (#wikimedia-operations) [2017-11-29T15:00:30Z] <awight> Restarting ORES celery workers manually, T181538

Change 394060 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Increase celery verbosity; use message format including timestamp

https://gerrit.wikimedia.org/r/394060

awight closed this task as Resolved. Dec 6 2017, 9:53 PM
awight removed a project: Patch-For-Review.
awight updated the task description.

Change 394047 abandoned by Alexandros Kosiaris:
Increase ORES queue_maxsize by 20%

https://gerrit.wikimedia.org/r/394047