
Stress/capacity test new ores* cluster
Closed, ResolvedPublic

Description

We should run a stress test on the new hardware as we prepare to bring it online. I want to get a clear measure of the throughput we can expect and where we start to overload. I feel like we've been relying on assumptions for a bit too long, and this is a good opportunity to re-calibrate. If we move ORES fully to the ores* nodes, we can safely run such a test before the switchover. If we want to do the hybrid setup, I'd like to switch from active-active to a single datacenter so we can run the test against the other datacenter.

This task is done when we do a series of test runs at different score request frequencies to determine the capacity of the new cluster.

  • Write a script to produce scoring requests at variable frequencies (use a param to skip the cache -- like "?features"); a rough sketch of such a script appears below
  • Run the script against a test installation of ORES on ores* nodes
  • Document the results.

Our highest traffic load over the past year was 4k requests/minute, https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1486192159051&to=1486517049462 . Our mean load over the year was 522 requests/minute.
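
For illustration, here is a minimal sketch of the kind of request generator described in the checklist above. It is not the ./utility stress_test script used in the tests below; the endpoint path, model list, and the "features" cache-busting parameter are assumptions based on this description.

import sys
import time

import requests  # assumed to be available on the test client


def stress(host, rev_ids, delay, models="damaging|goodfaith|reverted|wp10|draftquality"):
    """Fire one scoring request per revision ID, sleeping `delay` seconds between them.

    Appending `features` is assumed (per the checklist above) to bypass the
    score cache so that every request does real scoring work.
    """
    for rev_id in rev_ids:
        url = "{0}/v2/scores/enwiki/?models={1}&revids={2}&features".format(host, models, rev_id)
        started = time.time()
        try:
            response = requests.get(url, timeout=30)
            print(rev_id, response.status_code, round(time.time() - started, 3))
        except requests.RequestException as error:
            print(rev_id, "connection error:", error, file=sys.stderr)
        time.sleep(delay)  # delay=0.03 is roughly 2000 requests/minute from one client


if __name__ == "__main__":
    # Usage sketch: revision IDs on stdin, target host and delay as arguments.
    target, pause = sys.argv[1], float(sys.argv[2])
    stress(target, (int(line) for line in sys.stdin), pause)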

Details

Related Gerrit Patches:
[operations/puppet@production] Tune ORES Celery workers up a bit
[operations/puppet@production] Reduce the number of Celery workers for ORES
[operations/puppet@production] Reduce the number of Celery workers for ORES stress testing
[operations/puppet@production] ores: More configs for stress testing
[operations/puppet@production] ores: Add hieradata for number of celery workers
[operations/puppet@production] Add temporary role::ores::stresstest

Related Objects

Event Timeline


@akosiaris, I could use a hand making sure I'm hitting the new cluster in a reasonable way. I just invited you to sit down with me at 1400 UTC on Friday. Let me know if that won't work.

Joe added a subscriber: Joe.

I've completed a few tests. TL;DR: we need to up our celery worker count before we'll get an accurate reflection of the capacity of this cluster. I think we can safely multiply by 4.

We're only testing ores1002-9 because 1001 is running the redis node, so we're testing at lower than expected capacity. However, 1001 will get pooled in via celery so I don't think it will matter that we're not hitting its uwsgi.

halfak@ores1001:~/ores$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.03 --verbose
Fri Jul 28 14:59:50 UTC 2017
Stats:
  Duration: 545.949s
  Requests: 17993 (1977.437 per minute)
  Responses: 17869
   - 200: 17630
   - 503: 239
  Score requests:
   - errored: 239
   - returned: 17630
  Model scores:
   - successes:
     - wp10: 17255
     - goodfaith: 17255
     - reverted: 17255
     - damaging: 17255
     - draftquality: 17255
   - errored:
     - wp10:
       - TimeoutError: 169
       - TextDeleted: 25
       - UserDeleted: 1
       - RevisionNotFound: 180
     - goodfaith:
       - TimeoutError: 169
       - TextDeleted: 25
       - UserDeleted: 1
       - RevisionNotFound: 180
     - reverted:
       - TimeoutError: 169
       - TextDeleted: 25
       - UserDeleted: 1
       - RevisionNotFound: 180
     - damaging:
       - TimeoutError: 169
       - TextDeleted: 25
       - UserDeleted: 1
       - RevisionNotFound: 180
     - draftquality:
       - TimeoutError: 169
       - TextDeleted: 25
       - UserDeleted: 1
       - RevisionNotFound: 180

Note that the 503's happened for a very short time period. Was likely some sort of hiccup. See https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&panelId=9&fullscreen&from=1501251117427&to=1501254657428

halfak@ores1001:~/ores$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.01 --verbose
Fri Jul 28 15:11:23 UTC 2017
Stats:
  Duration: 64.842s
  Requests: 6092 (5637.086 per minute)
  Responses: 5677
   - 200: 2772
   - 503: 2905
  Score requests:
   - errored: 2905
   - returned: 2772
  Model scores:
   - successes:
     - damaging: 2725
     - goodfaith: 2725
     - reverted: 2725
     - wp10: 2725
     - draftquality: 2725
   - errored:
     - damaging:
       - TimeoutError: 8
       - TextDeleted: 7
       - RevisionNotFound: 32
     - goodfaith:
       - TimeoutError: 8
       - TextDeleted: 7
       - RevisionNotFound: 32
     - reverted:
       - TimeoutError: 8
       - TextDeleted: 7
       - RevisionNotFound: 32
     - wp10:
       - TimeoutError: 8
       - TextDeleted: 7
       - RevisionNotFound: 32
     - draftquality:
       - TimeoutError: 8
       - TextDeleted: 7
       - RevisionNotFound: 32

Definitely not happy at this rate. I figured that our ceiling was somewhere around here. See https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&panelId=9&fullscreen&from=1501254379591&to=1501254802607 It looks like at the peak, we were able to process about 2196 scores per minute.

Next question is, what does this look like if I do it in batches of 50?

halfak@ores1001:~/ores$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.5 --batch-size=50 --verbose
Fri Jul 28 15:16:57 UTC 2017
Stats:
  Duration: 180.051s
  Requests: 357 (118.966 per minute)
  Responses: 344
   - 200: 133
   - 503: 211
  Score requests:
   - errored: 211
   - returned: 133
  Model scores:
   - successes:
     - wp10: 6234
     - reverted: 6234
     - goodfaith: 6234
     - draftquality: 6234
     - damaging: 6234
   - errored:
     - wp10:
       - RevisionNotFound: 77
       - TimeoutError: 83
       - UserDeleted: 1
       - TextDeleted: 7
       - TaskRevokedError: 248
     - reverted:
       - RevisionNotFound: 77
       - TimeoutError: 83
       - UserDeleted: 1
       - TextDeleted: 7
       - TaskRevokedError: 248
     - goodfaith:
       - RevisionNotFound: 77
       - TimeoutError: 83
       - UserDeleted: 1
       - TextDeleted: 7
       - TaskRevokedError: 248
     - draftquality:
       - RevisionNotFound: 77
       - TimeoutError: 83
       - UserDeleted: 1
       - TextDeleted: 7
       - TaskRevokedError: 248
     - damaging:
       - RevisionNotFound: 77
       - TimeoutError: 83
       - UserDeleted: 1
       - TextDeleted: 7
       - TaskRevokedError: 248

So that didn't go any better. One interesting bit is that we're getting a bunch of TaskRevokedErrors. I'm not sure what that is all about. I'm guessing that some of the TimeoutErrors have turned into TaskRevokedErrors. We should look into ORES to see why it is doing this. I think we should probably just let those tasks time out.

[10:03:27] <_joe_> so we need to have more celery workers
[10:03:33] <halfak> right. 
[10:03:47] <halfak> How's memory usage?
[10:03:51] <halfak> That's our ceiling for workers.
[10:04:04] <_joe_> well, if you have a task open for this, add date/times (in UTC)
[10:04:06] <_joe_> 65881080 total, 14889904 used, 50991176 free

I scheduled some time with @Joe for running another test with more workers on Monday at 1300 UTC.

Halfak claimed this task. Jul 31 2017, 4:54 PM

Change 369915 had a related patch set uploaded (by Halfak; owner: halfak):
[operations/puppet@production] Adds hieradata for ores::celery::workers with default.

https://gerrit.wikimedia.org/r/369915

Halfak added a comment. Aug 7 2017, 2:04 PM

Looks like we have missed the scheduled time. I'm just waiting on review of the above patchset so that I can continue testing.

I've commented on the above change. @Halfak, I am around again, so let's schedule some time to look into this again. I have availability on Thu/Fri this week and the whole of next week.

Change 369915 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Add hieradata for number of celery workers

https://gerrit.wikimedia.org/r/369915

Change above merged. I did a puppet test run on both production (a noop, as expected) and a stresstest node (a 120 CELERYD_CONCURRENCY setting change). I guess we are ready for one more test?

Ran another test. Looks like we can handle 2000 requests per minute without much trouble. But we barf at 3000 requests per minute.

See https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1503329065175&to=1503330745949

$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.1 --verbose
Mon Aug 21 15:30:57 UTC 2017
Stats:
  Duration: 389.46s
  Requests: 3865 (595.439 per minute)
  Responses: 3851
   - 200: 3851
  Score requests:
   - errored: 0
   - returned: 3851
  Model scores:
   - successes:
     - damaging: 3803
     - reverted: 3803
     - draftquality: 3803
     - wp10: 3803
     - goodfaith: 3803
   - errored:
     - damaging:
       - TimeoutError: 1
       - RevisionNotFound: 40
       - TextDeleted: 7
     - reverted:
       - TimeoutError: 1
       - RevisionNotFound: 40
       - TextDeleted: 7
     - draftquality:
       - TimeoutError: 1
       - RevisionNotFound: 40
       - TextDeleted: 7
     - wp10:
       - TimeoutError: 1
       - RevisionNotFound: 40
       - TextDeleted: 7
     - goodfaith:
       - TimeoutError: 1
       - RevisionNotFound: 40
       - TextDeleted: 7

$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.01 --verbose
Mon Aug 21 15:38:07 UTC 2017
Stats:
  Duration: 152.583s
  Requests: 14695 (5778.506 per minute)
  Responses: 14303
   - 200: 5544
   - 503: 8759
  Score requests:
   - errored: 8759
   - returned: 5544
  Model scores:
   - successes:
     - draftquality: 5185
     - reverted: 5185
     - wp10: 5185
     - damaging: 5185
     - goodfaith: 5185
   - errored:
     - draftquality:
       - TextDeleted: 7
       - TaskRevokedError: 25
       - RevisionNotFound: 71
       - TimeoutError: 256
     - reverted:
       - TextDeleted: 7
       - TaskRevokedError: 25
       - RevisionNotFound: 71
       - TimeoutError: 256
     - wp10:
       - TextDeleted: 7
       - TaskRevokedError: 25
       - RevisionNotFound: 71
       - TimeoutError: 256
     - damaging:
       - TextDeleted: 7
       - TaskRevokedError: 25
       - RevisionNotFound: 71
       - TimeoutError: 256
     - goodfaith:
       - TextDeleted: 7
       - TaskRevokedError: 25
       - RevisionNotFound: 71
       - TimeoutError: 256

$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.03 --verbose
Mon Aug 21 15:43:09 UTC 2017
Stats:
  Duration: 388.702s
  Requests: 12791 (1974.416 per minute)
  Responses: 12626
   - 200: 12465
   - 503: 161
  Score requests:
   - errored: 161
   - returned: 12465
  Model scores:
   - successes:
     - wp10: 12155
     - damaging: 12155
     - goodfaith: 12155
     - draftquality: 12155
     - reverted: 12155
   - errored:
     - wp10:
       - RevisionNotFound: 137
       - TimeoutError: 97
       - TaskRevokedError: 58
       - TextDeleted: 18
     - damaging:
       - RevisionNotFound: 137
       - TimeoutError: 97
       - TaskRevokedError: 58
       - TextDeleted: 18
     - goodfaith:
       - RevisionNotFound: 137
       - TimeoutError: 97
       - TaskRevokedError: 58
       - TextDeleted: 18
     - draftquality:
       - RevisionNotFound: 137
       - TimeoutError: 97
       - TaskRevokedError: 58
       - TextDeleted: 18
     - reverted:
       - RevisionNotFound: 137
       - TimeoutError: 97
       - TaskRevokedError: 58
       - TextDeleted: 18

$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1001.eqiad.wmnet:8081 http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.02 --verbose
Mon Aug 21 15:51:10 UTC 2017
Stats:
  Duration: 122.359s
  Requests: 5939 (2912.25 per minute)
  Responses: 5582
   - 200: 4775
   - 503: 807
  Score requests:
   - errored: 807
   - returned: 4775
  Model scores:
   - successes:
     - damaging: 4471
     - wp10: 4471
     - draftquality: 4471
     - goodfaith: 4471
     - reverted: 4471
   - errored:
     - damaging:
       - TextDeleted: 12
       - RevisionNotFound: 56
       - TaskRevokedError: 1
       - TimeoutError: 235
     - wp10:
       - TextDeleted: 12
       - RevisionNotFound: 56
       - TaskRevokedError: 1
       - TimeoutError: 235
     - draftquality:
       - TextDeleted: 12
       - RevisionNotFound: 56
       - TaskRevokedError: 1
       - TimeoutError: 235
     - goodfaith:
       - TextDeleted: 12
       - RevisionNotFound: 56
       - TaskRevokedError: 1
       - TimeoutError: 235
     - reverted:
       - TextDeleted: 12
       - RevisionNotFound: 56
       - TaskRevokedError: 1
       - TimeoutError: 235

So my thought is that we can certainly bump up the number of workers. I also think we should increase the max size for the celery queue.

See https://phabricator.wikimedia.org/source/ores-deploy/browse/master/config/00-main.yaml;835d848131c1f9c9628aa85e36630b8f2f02df67$89 for where we get it from the config.

Halfak reassigned this task from Halfak to Ladsgroup. Aug 21 2017, 4:42 PM
Halfak moved this task from Review to Active on the Scoring-platform-team (Current) board.
Restricted Application added a project: User-Ladsgroup. Aug 21 2017, 4:42 PM

Change 372866 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] ores: More configs for stress testing

https://gerrit.wikimedia.org/r/372866

Change 372866 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: More configs for stress testing

https://gerrit.wikimedia.org/r/372866

Halfak added a comment (edited). Aug 28 2017, 8:01 PM

New test today. Moral of the story is TOO MANY FILE HANDLES.

https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1503928941970&to=1503935541790&panelId=3&fullscreen

After increasing file handle max to 16k and redis connection limit to 16k.

halfak@ores1001:~/ores$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1001.eqiad.wmnet:8081 http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.03 --verbose
Mon Aug 28 15:14:17 UTC 2017
Stats:
  Duration: 234.408s
  Requests: 7692 (1968.875 per minute)
  Responses: 7641
   - 200: 7641
  Score requests:
   - errored: 0
   - returned: 7641
  Model scores:
   - successes:
     - wp10: 7474
     - goodfaith: 7474
     - damaging: 7474
     - draftquality: 7474
     - reverted: 7474
   - errored:
     - wp10:
       - RevisionNotFound: 76
       - TextDeleted: 15
       - TimeoutError: 1
       - TaskRevokedError: 75
     - goodfaith:
       - RevisionNotFound: 76
       - TextDeleted: 15
       - TimeoutError: 1
       - TaskRevokedError: 75
     - damaging:
       - RevisionNotFound: 76
       - TextDeleted: 15
       - TimeoutError: 1
       - TaskRevokedError: 75
     - draftquality:
       - RevisionNotFound: 76
       - TextDeleted: 15
       - TimeoutError: 1
       - TaskRevokedError: 75
     - reverted:
       - RevisionNotFound: 76
       - TextDeleted: 15
       - TimeoutError: 1
       - TaskRevokedError: 75

halfak@ores1001:~/ores$ date; cat ~/enwiki.200k.sample.json | /srv/deployment/ores/venv/bin/python ./utility stress_test enwiki http://ores1001.eqiad.wmnet:8081 http://ores1002.eqiad.wmnet:8081 http://ores1003.eqiad.wmnet:8081 http://ores1004.eqiad.wmnet:8081 http://ores1005.eqiad.wmnet:8081 http://ores1006.eqiad.wmnet:8081 http://ores1007.eqiad.wmnet:8081 http://ores1008.eqiad.wmnet:8081 http://ores1009.eqiad.wmnet:8081 --delay 0.02 --verbose
Mon Aug 28 15:19:59 UTC 2017
Stats:
  Duration: 2254.692s
  Requests: 111509 (2967.386 per minute)
  Responses: 111508
   - 200: 59923
   - 503: 51585
  Score requests:
   - errored: 51585
   - returned: 59923
  Model scores:
   - successes:
     - wp10: 58916
     - damaging: 58916
     - goodfaith: 58916
     - reverted: 58916
     - draftquality: 58916
   - errored:
     - wp10:
       - TimeoutError: 119
       - CommentDeleted: 6
       - TextDeleted: 98
       - TaskRevokedError: 76
       - RevisionNotFound: 708
     - damaging:
       - TimeoutError: 119
       - CommentDeleted: 6
       - TextDeleted: 98
       - TaskRevokedError: 76
       - RevisionNotFound: 708
     - goodfaith:
       - TimeoutError: 119
       - CommentDeleted: 6
       - TextDeleted: 98
       - TaskRevokedError: 76
       - RevisionNotFound: 708
     - reverted:
       - TimeoutError: 119
       - CommentDeleted: 6
       - TextDeleted: 98
       - TaskRevokedError: 76
       - RevisionNotFound: 708
     - draftquality:
       - TimeoutError: 119
       - CommentDeleted: 6
       - TextDeleted: 98
       - TaskRevokedError: 76
       - RevisionNotFound: 708

This ran fine for 5 minutes and then errored for a few minutes. @akosiaris reported that the following error took out the celery workers. See https://phabricator.wikimedia.org/P5930#32584

So, it looks like there are still some errors happening that are related to file handles. I dug a bit, and it seems there's discussion of using select.poll() vs. select.select(): https://github.com/celery/celery/issues/3397

BTW, the last error is here: P5930#32584

awight added a subscriber: awight. Sep 13 2017, 5:07 PM

Random tool which seems to give a good estimate of RSS not including copy-on-write pages from the parent process: https://gist.github.com/Eugeny/04ccfe8accf4bc74b0ca
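
For the record, here is a rough sketch of the same idea using /proc directly (my own illustration, not the gist above): sum only the private pages from /proc/<pid>/smaps, which excludes memory still shared copy-on-write with the parent process.

def private_rss_kb(pid):
    """Return the private (non-shared) resident memory of `pid` in kB.

    Sums Private_Clean + Private_Dirty from /proc/<pid>/smaps, which leaves out
    pages that are still shared copy-on-write with the parent process.
    """
    total_kb = 0
    with open("/proc/{0}/smaps".format(pid)) as smaps:
        for line in smaps:
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                total_kb += int(line.split()[1])  # field is reported in kB
    return total_kb

# e.g. private_rss_kb(6628) for one of the celery-ores-worker processes mentioned below.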

Just checking in here because I've been OoO. Did any tests happen after my last test?

Ran a few tests today, and found that the file handle issue is not solved. The celery service died on several nodes with the file descriptor error. The good news is that even with a degraded cluster, we can easily serve 2k requests/minute at 6% CPU usage per machine.

Here are the results from one of the tests:

Stats:
  Duration: 1057.565s
  Requests: 34891 (1979.509 per minute)
  Responses: 34841
   - 200: 34841
  Score requests:
   - errored: 0
   - returned: 34841
  Model scores:
   - successes:
     - damaging: 34338
     - goodfaith: 34338
     - draftquality: 34338
     - wp10: 34338
     - reverted: 34338
   - errored:
     - damaging:
       - UserDeleted: 1
       - TimeoutError: 29
       - CommentDeleted: 1
       - TextDeleted: 54
       - RevisionNotFound: 418
     - goodfaith:
       - UserDeleted: 1
       - TimeoutError: 29
       - CommentDeleted: 1
       - TextDeleted: 54
       - RevisionNotFound: 418
     - draftquality:
       - UserDeleted: 1
       - TimeoutError: 29
       - CommentDeleted: 1
       - TextDeleted: 54
       - RevisionNotFound: 418
     - wp10:
       - UserDeleted: 1
       - TimeoutError: 29
       - CommentDeleted: 1
       - TextDeleted: 54
       - RevisionNotFound: 418
     - reverted:
       - UserDeleted: 1
       - TimeoutError: 29
       - CommentDeleted: 1
       - TextDeleted: 54
       - RevisionNotFound: 418

Halfak claimed this task. Oct 16 2017, 6:58 PM
Halfak added a subscriber: Ladsgroup.
Halfak reassigned this task from Halfak to awight. Oct 30 2017, 4:55 PM

Change 389493 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Reduce the number of Celery workers for ORES stress testing

https://gerrit.wikimedia.org/r/389493

awight added a comment. Nov 6 2017, 2:57 PM

@akosiaris FYI, I'm backing off from our attempt to use 480 workers. The change is ready for review in https://gerrit.wikimedia.org/r/#/c/389493/.

I tried working around the problem with Celery 4, which has replaced some of its own select() calls with poll(), but from what I'm seeing, even if Celery itself uses poll(), other libraries still use select() and run into the limit once more than 1024 file descriptors have been opened.
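
To illustrate the limit being described (a standalone sketch, not ORES code): select() refuses any file descriptor numbered at or above FD_SETSIZE (1024), while poll() has no such ceiling. This assumes the process's hard open-files limit allows more than 1024 descriptors.

import resource
import select
import socket

# Raise the soft open-files limit so descriptors past 1024 can actually be allocated.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))

# Open enough sockets that the newest file descriptor is numbered >= 1024.
sockets = [socket.socket() for _ in range(1100)]
high_fd = sockets[-1].fileno()

# poll() registers the high-numbered descriptor without complaint...
poller = select.poll()
poller.register(high_fd, select.POLLIN)

# ...but select() rejects it outright, which is the failure mode behind the
# "too many file handles" symptoms seen in these tests.
try:
    select.select([high_fd], [], [], 0)
except ValueError as error:
    print("select() refused fd", high_fd, "->", error)

for sock in sockets:
    sock.close()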

Change 389493 merged by Alexandros Kosiaris:
[operations/puppet@production] Reduce the number of Celery workers for ORES stress testing

https://gerrit.wikimedia.org/r/389493

awight added a comment. Nov 7 2017, 2:26 PM

Turns out we can't use the other concurrency pools for Celery. The other options, gevent and eventlet, are both single-threaded and are only meant for IO-bound tasks.

awight updated the task description. Dec 5 2017, 4:23 PM
awight added a comment. Dec 5 2017, 4:48 PM

Running a low-ish test at 1,200 req/min, https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1512490129920&to=1512491206473&panelId=3&fullscreen

I'm seeing the same behavior as before, where celery workers die off a node at a time. For example, at 2017-12-05 16:22, ores1002 stopped processing scores. So far I haven't been able to find any explanation in logstash.

awight added a comment. Dec 5 2017, 4:57 PM

I got this shred of stack trace from service celery-ores-worker status -l, but can't get at anything more with my lack of permissions.

Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: stop_if_not_current(self._result_handler)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/pool.py", line 151, in stop_if_not_current
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: thread.stop(timeout)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/pool.py", line 500, in stop
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: self.on_stop_not_started()
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/asynpool.py", line 317, in on_stop_not_started
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: join_exited_workers(shutdown=True)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/pool.py", line 1119, in _join_exited_workers
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: del self._poolctrl[worker.pid]
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: KeyError: None

@akosiaris Would you help me get some diagnostics about this?

awight added a comment. Dec 5 2017, 5:42 PM

I'm pretty sure it's just an OOM; still, it would be nice to be able to read more logs.

The available memory graphs show the memory bottoming-out then rebounding as the celery workers collapse: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1512493702185&to=1512494542186&panelId=24&fullscreen

We should tune the number of celery workers down further. Taking the reported resident memory literally (we believe it's not accounting for some shared memory and is therefore reporting too high), each worker takes 1GB under no load and up to 1.4GB when fully loaded. We should look into whether this is a memory leak or not, since the models are supposed to be initialized in the parent process before forking.

So, with 240 workers the resident memory is clearly incorrect. This is a known problem on Linux. It's possible to calculate the true resident memory by looking at each memory range and checking for overlaps, but for our purposes we can estimate by looking at free memory after starting the workers: 57.5GB free with no celery workers running, 41GB free after starting and waiting for all 240 processes to fork. This is 72MB of resident memory for each worker, including the shared copy-on-write chunk. If all 240 workers were to grow by 400MB of resident memory, it would require an additional 94GB of memory.

To estimate the number of "grown" workers we can support: 57.5GB free / (72MB idle + 400MB working) = 125 workers. Slightly conservatively, let's try 100 workers.
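
The same back-of-the-envelope estimate as a sketch, with the figures from the two comments above plugged in; the safety factor is my own stand-in for "slightly conservatively", not something measured.

def max_workers(free_mb, idle_mb, growth_mb, safety=1.0):
    """Estimate how many celery workers fit in `free_mb` if each one
    eventually grows from its idle size by `growth_mb`."""
    return int(free_mb * safety // (idle_mb + growth_mb))

# 57.5 GB free, ~72 MB idle per worker, up to ~400 MB of additional working memory.
print(max_workers(57.5 * 1024, 72, 400))              # 124 -- roughly the ~125 estimated above
print(max_workers(57.5 * 1024, 72, 400, safety=0.8))  # 99 -- about the 100 workers chosen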

Change 395579 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Reduce the number of Celery workers for ORES

https://gerrit.wikimedia.org/r/395579

Halfak added a comment. Dec 5 2017, 5:47 PM

This is probably not a memory leak. Workers increase in memory because they hold variable amounts of data while performing feature extractions. Some revisions are very big; most are very small.

Change 395579 merged by Dzahn:
[operations/puppet@production] Reduce the number of Celery workers for ORES

https://gerrit.wikimedia.org/r/395579

awight added a comment. Dec 5 2017, 8:18 PM

Downsizing my estimate. With 100 workers/box, we're hovering around 28GB free, so (57.5GB - 28GB) / 100 workers = 310MB per worker. We should be able to support 92 more workers. Let's bump up to 175 workers...

Change 395608 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Tune ORES Celery workers up a bit

https://gerrit.wikimedia.org/r/395608

Change 395608 merged by Alexandros Kosiaris:
[operations/puppet@production] Tune ORES Celery workers up a bit

https://gerrit.wikimedia.org/r/395608

> I got this shred of stack trace from service celery-ores-worker status -l, but can't get at anything more with my lack of permissions.
>
> @akosiaris Would you help me get some diagnostics about this?

Yeah, so here are the kernel OOMs where the kernel had to kill celery:

[Tue Dec  5 16:20:16 2017] Out of memory: Kill process 11626 (celery) score 27 or sacrifice child
[Tue Dec  5 16:49:56 2017] Out of memory: Kill process 7522 (celery) score 21 or sacrifice child
[Tue Dec  5 17:19:55 2017] Out of memory: Kill process 14755 (celery) score 20 or sacrifice child
[Tue Dec  5 19:57:56 2017] Out of memory: Kill process 23471 (celery) score 21 or sacrifice child

and

akosiaris@ores1004:~$ sudo journalctl -u celery-ores-worker --since='2017-12-05' |grep -i error
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: [2017-12-05 16:21:50,115: ERROR/MainProcess] Unrecoverable error: MemoryError('[Errno 12] Cannot allocate memory',)
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: OSError: [Errno 12] Cannot allocate memory
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: [2017-12-05 16:21:50,266: ERROR/MainProcess] Error on stopping Pool: KeyError(None,)
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: OSError: [Errno 12] Cannot allocate memory
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 16:21:50 ores1004 celery-ores-worker[10997]: KeyError: None
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: [2017-12-05 16:51:29,123: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:draftquality:0.1.0:685385372] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: [2017-12-05 16:51:29,431: ERROR/MainProcess] Unrecoverable error: MemoryError('[Errno 12] Cannot allocate memory',)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: OSError: [Errno 12] Cannot allocate memory
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: [2017-12-05 16:51:29,779: ERROR/MainProcess] Error on stopping Pool: KeyError(None,)
Dec 05 16:51:29 ores1004 celery-ores-worker[6628]: KeyError: None
Dec 05 17:21:28 ores1004 celery-ores-worker[13910]: [2017-12-05 17:21:28,557: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:goodfaith:0.4.0:697175130] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Dec 05 17:21:28 ores1004 celery-ores-worker[13910]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: [2017-12-05 17:21:29,187: ERROR/MainProcess] Unrecoverable error: MemoryError('[Errno 12] Cannot allocate memory',)
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: OSError: [Errno 12] Cannot allocate memory
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: [2017-12-05 17:21:29,275: ERROR/MainProcess] Error on stopping Pool: KeyError(None,)
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: OSError: [Errno 12] Cannot allocate memory
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 17:21:29 ores1004 celery-ores-worker[13910]: KeyError: None
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,139: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:goodfaith:0.4.0:701877605] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,141: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:draftquality:0.1.0:708734419] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,143: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:damaging:0.4.0:708734419] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,313: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:wp10:0.6.1:698136198] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,323: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._process_score_map[c9357415-634b-4197-be63-9e9857b9c6fa] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: [2017-12-05 17:29:04,326: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._process_score_map[75cd366e-ee13-4dab-90e0-e64a1c8800ad] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Dec 05 17:29:04 ores1004 celery-ores-worker[20154]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Dec 05 19:58:51 ores1004 celery-ores-worker[23203]: [2017-12-05 19:58:51,329: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:wp10:0.6.1:740349991] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Dec 05 19:58:51 ores1004 celery-ores-worker[23203]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Dec 05 19:59:29 ores1004 celery-ores-worker[23203]: [2017-12-05 19:59:29,606: ERROR/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:draftquality:0.1.0:715997772] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Dec 05 19:59:29 ores1004 celery-ores-worker[23203]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Dec 05 19:59:30 ores1004 celery-ores-worker[23203]: [2017-12-05 19:59:29,950: ERROR/MainProcess] Unrecoverable error: MemoryError('[Errno 12] Cannot allocate memory',)
Dec 05 19:59:30 ores1004 celery-ores-worker[23203]: OSError: [Errno 12] Cannot allocate memory
Dec 05 19:59:30 ores1004 celery-ores-worker[23203]: MemoryError: [Errno 12] Cannot allocate memory
Dec 05 19:59:39 ores1004 celery-ores-worker[23203]: [2017-12-05 19:59:39,041: ERROR/MainProcess] Error on stopping Pool: KeyError(None,)
Dec 05 19:59:39 ores1004 celery-ores-worker[23203]: KeyError: None

Quite clearly OOM.

Halfak added a comment. Dec 5 2017, 9:30 PM

Looks like that was before worker count was changed, right? The last stress test started at 20:08.

Halfak added a comment. Dec 5 2017, 9:55 PM

It seems that the fastest we can send requests from one machine is about 4.5k/min or 75/sec. I tried running two of our stress testing scripts at the same time and achieved the same rate. I'll try running the stress tests from ores1001 and ores1002 next.
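
As a rough sanity check on that per-client ceiling (my own arithmetic, treating the sender as a single loop with a fixed per-request cost on top of the --delay sleep):

delay = 0.01          # seconds slept between requests (--delay 0.01)
observed_rate = 75.0  # requests per second actually achieved from one machine

# If each iteration costs `delay` plus a fixed client-side overhead,
# then rate = 1 / (delay + overhead); solve for the implied overhead.
implied_overhead = 1.0 / observed_rate - delay
print(round(implied_overhead * 1000, 1), "ms of client work per request")  # ~3.3 ms

# At ~3.3 ms of overhead, even --delay 0 tops out near 300 requests/second per
# sender, so probing higher aggregate rates needs multiple client machines.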

OK nevermind. It seems like we have another limit. I ran a stress test on ores1001 and ores1002. Here's what I got:

ores1001.eqiad.wmnet

Stats:
  Duration: 552.711s
  Requests: 20150 (2187.399 per minute)
  Responses: 18658
   - 200: 18658
   - connection error: 492
  Score requests:
   - errored: 0
   - returned: 18658
  Model scores:
   - successes:
     - goodfaith: 17222
     - damaging: 17222
     - draftquality: 17222
     - wp10: 17222
   - errored:
     - goodfaith:
       - TimeoutError: 854
       - TaskRevokedError: 307
       - TextDeleted: 26
       - RevisionNotFound: 247
       - CommentDeleted: 2
     - damaging:
       - TimeoutError: 854
       - TaskRevokedError: 307
       - TextDeleted: 26
       - RevisionNotFound: 247
       - CommentDeleted: 2
     - draftquality:
       - TimeoutError: 854
       - TaskRevokedError: 307
       - TextDeleted: 26
       - RevisionNotFound: 247
       - CommentDeleted: 2
     - wp10:
       - TimeoutError: 854
       - TaskRevokedError: 307
       - TextDeleted: 26
       - RevisionNotFound: 247
       - CommentDeleted: 2

ores1002.eqiad.wmnet

Stats:
  Duration: 570.751s
  Requests: 22679 (2384.122 per minute)
  Responses: 21371
   - 200: 21371
   - connection error: 308
  Score requests:
   - errored: 0
   - returned: 21371
  Model scores:
   - successes:
     - damaging: 19850
     - wp10: 19850
     - draftquality: 19850
     - goodfaith: 19850
   - errored:
     - damaging:
       - CommentDeleted: 2
       - TextDeleted: 31
       - TimeoutError: 905
       - TaskRevokedError: 308
       - RevisionNotFound: 275
     - wp10:
       - CommentDeleted: 2
       - TextDeleted: 31
       - TimeoutError: 905
       - TaskRevokedError: 308
       - RevisionNotFound: 275
     - draftquality:
       - CommentDeleted: 2
       - TextDeleted: 31
       - TimeoutError: 905
       - TaskRevokedError: 308
       - RevisionNotFound: 275
     - goodfaith:
       - CommentDeleted: 2
       - TextDeleted: 31
       - TimeoutError: 905
       - TaskRevokedError: 308
       - RevisionNotFound: 275

My hypothesis is that we have a bottleneck with our web workers.

awight closed this task as Resolved. Dec 6 2017, 8:04 PM

@Halfak Let's declare this a win. We showed that the new cluster is capable of keeping up with our previous maximum load, which is about 8x the mean. We have a few avenues for further improvements, but nothing that blocks provisioning these machines.

Halfak added a comment. Dec 6 2017, 8:39 PM

One reason we might not want to provision these machines is that we won't be able to safely load test again. If we can push the ceiling up a bit higher, that might be worthwhile.

One other note: we need to complete T171851: Reimage ores* hosts with Debian Stretch before provisioning.

awight added a comment. Dec 6 2017, 8:52 PM

Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine can handle our average traffic, so we'd be fine depooling 4 or 5 servers.

What I'm unsure of are the best practices around this. The risks are that we mix production and test traffic (accidentally sending stress traffic to production workers), and that it might be annoying to switch Redis queues for only some machines. Maybe @akosiaris can weigh in with his preferences.

Halfak added a comment. Dec 6 2017, 8:57 PM

We could de-pool a whole datacenter. That would allow us to not mix traffic and run tests. That would also allow us to safely perform an OS upgrade later.

awight added a comment. Dec 6 2017, 8:58 PM

Let's do it.

> Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine can handle our average traffic, so we'd be fine depooling 4 or 5 servers.

Production is not really the place to run tests, and that will show in a variety of ways if we try, e.g. having to deal with the puppetization of the stress test machines, which unnecessarily limited your flexibility and added delays. Stress-testing hardware for some time is fine; maintaining an infrastructure just for that in production is neither the best use of that hardware nor of the time spent around it.

> What I'm unsure of are the best practices around this. The risks are that we mix production and test traffic (accidentally sending stress traffic to production workers), and that it might be annoying to switch Redis queues for only some machines. Maybe @akosiaris can weigh in with his preferences.

Both are true. And alongside those, there's even more, like test traffic triggering false alarms that others will try to address (causing confusion and frustration), metrics dashboards reporting false/useless data, and so on. Arguably most of that is solvable, but then again it might very well not be worth the time. Given all that, overall I am against maintaining a long-term stress test infrastructure in production.

Halfak added a comment. Dec 7 2017, 4:12 PM

In this case, it's advanced smoke testing for the cluster. I'm hesitant to deploy in production until we've thoroughly run the cluster through a set of tests. There's no proposal here to maintain a long term stress test infrastructure in production. The proposal is to move forward with production deployment before we have finished thorough stress/smoke testing.

Dzahn added a subscriber: Dzahn. Dec 7 2017, 7:41 PM

Is the stress test over? Then T165170 is probably unstalled now. Is it not over yet? Then maybe this ticket should be reopened?

awight added a comment (edited). Dec 7 2017, 8:02 PM

@Dzahn sorry--we decided to test some more, to overcome a suspiciously low performance ceiling. I'll make the follow-up task a blocker.

awight closed this task as Resolved. Dec 14 2017, 9:22 PM

Ok, done for real now. @Halfak and I decided that the remaining bottlenecks are something non-trivial and we don't want to block on further optimization.

The current tuning parameters in stresstest.yaml are correct for the hardware, provide plenty of capacity, and behave well under sustained overload conditions. From our perspective, we're ready to migrate to this cluster.