
Migrate CirrusSearch jobs to Kafka queue
Closed, Resolved · Public

Description

CirrusSearch jobs are the biggest chunk of jobs still on the Redis queue. Here's the list:

I think by now we're fairly confident that switching the simpler jobs will be pretty straightforward, but we need to coordinate with @EBernhardson so that we choose a fairly quiet period for Elasticsearch, and also so he can help with verifying correctness.

I think we should use a similar approach to the one we've used for other jobs: as step 0, switch all the jobs for test wikis and mediawiki, and then ask the Discovery team to verify correctness on the Elasticsearch side.

Details

Related Gerrit Patches:
operations/mediawiki-config : masterDisable redis queue for cirrus search for all wikis.
mediawiki/services/change-propagation/jobqueue-deploy : masterSwitch cirrusSearch jobs for everything.
operations/mediawiki-config : masterDisable redis queue for cirrus except wikipedia, commons and wikidata.
mediawiki/services/change-propagation/jobqueue-deploy : masterEnable cirrus jobs for all but wikipedia, commons and wikidata.
mediawiki/extensions/EventBus : master[JobExecutor] Change the method used to construct the page title.
mediawiki/services/change-propagation/jobqueue-deploy : masterEnable cirrus search jobs for test wikis.
operations/mediawiki-config : masterDisable redis queue for cirrusSearch jobs for test wikis.

Event Timeline

Pchelolo triaged this task as High priority. Mar 7 2018, 5:36 PM
Pchelolo created this task.
Restricted Application added a project: Analytics. · View Herald Transcript · Mar 7 2018, 5:36 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 416992 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrusSearch jobs for test wikis.

https://gerrit.wikimedia.org/r/416992

Change 417000 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus search jobs for test wikis.

https://gerrit.wikimedia.org/r/417000

fdans moved this task from Incoming to Radar on the Analytics board. Mar 8 2018, 6:09 PM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Mar 23 2018, 3:54 PM

Change 416992 merged by Mobrovac:
[operations/mediawiki-config@master] Disable redis queue for cirrusSearch jobs for test wikis.

https://gerrit.wikimedia.org/r/416992

Change 417000 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus search jobs for test wikis.

https://gerrit.wikimedia.org/r/417000

Mentioned in SAL (#wikimedia-operations) [2018-03-28T14:56:15Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: Disable redis queue for cirrusSearch jobs for test wikis, file 1/2 - T189137 (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-03-28T14:58:05Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Disable redis queue for cirrusSearch jobs for test wikis, file 2/2 - T189137 (duration: 01m 17s)

Change 422436 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] [JobExecutor] Change the method used to construct the page title.

https://gerrit.wikimedia.org/r/422436

Change 422436 abandoned by Ppchelko:
[JobExecutor] Change the method used to construct the page title.

Reason:
In favor of I8f305aca34a20e3c339f6bc96815c3def53e48ad

https://gerrit.wikimedia.org/r/422436

dcausse added a comment (edited). Apr 25 2018, 2:02 PM

I don't have strong opinions on which wikis we should migrate next.
My sole concern right now is about write freezes when we restart the elastic clusters.
When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

I don't have strong opinions on which wikis we should migrate next.

group1 could be a good next-step candidate.

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

We might need to do the same, as there is an Apache instance powering the new queue's endpoint that receives the jobs.

The subtasks created to fix issues discovered during the first iteration of the switch have been resolved, and I don't see any logs indicating problems, so it seems nothing is blocking us from moving some more projects to the Kafka queue.

Based on the cirrusSearchIncomingLinkCount job, here is the distribution of projects by traffic, from a sample of 1,000,000 events:

1 912832 "en.wikipedia.org"
2 21949 "commons.wikimedia.org"
3 10475 "www.wikidata.org"
4 6861 "it.wikipedia.org"
5 4494 "zh.wikipedia.org"
6 4410 "fr.wikipedia.org"
7 4122 "cy.wikipedia.org"
8 3722 "bn.wikipedia.org"
9 3669 "de.wikipedia.org"
10 3180 "pl.wikipedia.org"
11 2690 "vi.wikipedia.org"
12 2661 "ar.wikipedia.org"
13 1796 "es.wikipedia.org"
14 1346 "pt.wikipedia.org"
15 1301 "ru.wikipedia.org"
16 1054 "uk.wikipedia.org"
17 944 "ja.wikipedia.org"
18 713 "fr.wiktionary.org"
19 710 "ko.wikipedia.org"
20 633 "ceb.wikipedia.org"
21 611 "he.wikipedia.org"
22 580 "hr.wikipedia.org"
23 579 "te.wikipedia.org"
24 517 "en.wiktionary.org"
25 489 "ast.wikipedia.org"
26 416 "nl.wikipedia.org"
27 414 "cs.wikipedia.org"
28 402 "fa.wikipedia.org"
29 323 "sv.wikipedia.org"
30 284 "el.wikipedia.org"
31 266 "da.wikipedia.org"
32 253 "de.wiktionary.org"
33 238 "id.wikipedia.org"
34 237 "ca.wikipedia.org"
35 212 "en.wikisource.org"
36 185 "pl.wiktionary.org"
37 178 "ur.wikipedia.org"
38 173 "meta.wikimedia.org"
39 170 "bg.wikipedia.org"
40 148 "ro.wikipedia.org"
41 143 "hy.wikipedia.org"
42 142 "lv.wikipedia.org"
43 141 "th.wiktionary.org"
44 141 "species.wikimedia.org"
45 139 "kk.wikipedia.org"
46 132 "hu.wikipedia.org"
47 131 "az.wikipedia.org"
48 125 "fi.wikipedia.org"
49 119 "th.wikipedia.org"
50 109 "simple.wikipedia.org"
51 109 "no.wikipedia.org"
52 101 "tr.wikipedia.org"
53 91 "lt.wikipedia.org"
54 87 "bh.wikipedia.org"
55 83 "gl.wikipedia.org"
56 82 "sl.wikipedia.org"
57 77 "ms.wikipedia.org"
58 77 "hi.wikipedia.org"
59 70 "he.wikisource.org"
60 66 "pt.wiktionary.org"
61 62 "my.wikipedia.org"
62 61 "sr.wikipedia.org"
63 61 "ps.wikipedia.org"
64 58 "sw.wikipedia.org"
65 52 "et.wikipedia.org"
66 52 "bs.wikipedia.org"
67 51 "cs.wiktionary.org"
68 50 "fi.wiktionary.org"
69 50 "eu.wikipedia.org"
70 50 "bjn.wikipedia.org"
71 50 "ba.wikipedia.org"
72 43 "en.wikinews.org"
73 40 "uz.wikipedia.org"
74 40 "fo.wikipedia.org"
75 38 "ru.wiktionary.org"
76 38 "eo.wikipedia.org"
77 35 "la.wiktionary.org"
78 33 "it.wiktionary.org"
79 29 "it.wikiquote.org"
80 28 "it.wikisource.org"
81 28 "ca.wiktionary.org"
82 27 "ja.wikisource.org"
83 27 "an.wikipedia.org"
84 25 "outreach.wikimedia.org"
85 25 "ko.wiktionary.org"
86 24 "zh.wikisource.org"
87 24 "lb.wikipedia.org"
88 22 "sk.wikipedia.org"
89 22 "el.wiktionary.org"
90 21 "fa.wikinews.org"
91 20 "sh.wikipedia.org"
92 19 "ja.wiktionary.org"
93 19 "en.wikibooks.org"
94 18 "mg.wiktionary.org"
95 17 "ta.wikipedia.org"
96 17 "bn.wikibooks.org"
97 16 "it.wikiversity.org"
98 15 "de.wikisource.org"
99 15 "bn.wiktionary.org"
100 14 "mt.wikipedia.org"
101 13 "ku.wiktionary.org"
102 11 "fa.wiktionary.org"
103 11 "fa.wikibooks.org"
104 11 "be.wikipedia.org"
105 10 "kn.wikipedia.org"
106 10 "id.wikibooks.org"
107 10 "de.wikiversity.org"
108 8 "incubator.wikimedia.org"
109 8 "ga.wiktionary.org"
110 8 "et.wiktionary.org"
111 8 "es.wikinews.org"
112 7 "zh-yue.wikipedia.org"
113 7 "mai.wikipedia.org"
114 7 "inh.wikipedia.org"
115 6 "sa.wikipedia.org"
116 6 "nds.wikipedia.org"
117 6 "mr.wikipedia.org"
118 6 "mk.wikipedia.org"
119 6 "min.wikipedia.org"
120 6 "ia.wikipedia.org"
121 6 "ht.wikipedia.org"
122 5 "sq.wikipedia.org"
123 5 "ru.wikisource.org"
124 5 "nn.wikipedia.org"
125 4 "tt.wikipedia.org"
126 4 "sd.wikipedia.org"
127 4 "bcl.wikipedia.org"
128 3 "www.mediawiki.org"
129 3 "ru.wikimedia.org"
130 3 "myv.wikipedia.org"
131 3 "ml.wikipedia.org"
132 3 "ja.wikibooks.org"
133 3 "it.wikinews.org"
134 3 "fy.wikipedia.org"
135 3 "fr.wikisource.org"
136 3 "ce.wikipedia.org"
137 3 "azb.wikipedia.org"
138 2 "zh-classical.wikipedia.org"
139 2 "vls.wikipedia.org"
140 2 "lrc.wikipedia.org"
141 2 "ka.wikipedia.org"
142 2 "is.wikipedia.org"
143 2 "fa.wikivoyage.org"
144 2 "en.wikivoyage.org"
145 2 "en.wikiquote.org"
146 2 "cs.wikiversity.org"
147 2 "ckb.wikipedia.org"
148 2 "ar.wikiquote.org"
149 1 "zh.wikivoyage.org"
150 1 "tyv.wikipedia.org"
151 1 "sv.wiktionary.org"
152 1 "sr.wikinews.org"
153 1 "si.wikipedia.org"
154 1 "oc.wiktionary.org"
155 1 "lmo.wikipedia.org"
156 1 "la.wikipedia.org"
157 1 "kw.wikipedia.org"
158 1 "fi.wikivoyage.org"
159 1 "es.wiktionary.org"
160 1 "de.wikivoyage.org"

So only roughly 3% of the jobs belong to non-wikipedia, non-wikidata and non-commons projects. As a next test, I propose switching everything except those three. @dcausse what's your opinion?

I don't have strong opinions on which wikis we should migrate next.

group1 could be a good next-step candidate.

Sounds good (and I'm fine as well with @Pchelolo's suggestion to switch everything except the three big ones).

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

It depends, but one is happening right now. We could try to sync up with @Gehel to trigger the next cluster restart just after we migrate these wikis?

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

We might need to do the same, as there is an Apache instance powering the new queue's endpoint that receives the jobs.

For reference it was https://phabricator.wikimedia.org/T132740#2209349
And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

Given the numbers above, going with everything but enwiki, wikidata and commons should be a good next round.

Pchelolo added a comment (edited). Apr 25 2018, 2:14 PM

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

Currently, the maximum message size we accept in Kafka is 4 MB. In the Event-Platform we see occasional logs about messages being too large, but they are pretty rare.

And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

100 MB seems like A LOT. Are we sure we need to support that much?

Also, if we indeed want to support 100 MB requests, we'd need to increase max_buffer_size in the EventBus proxy service to 100 MB, but that could be pretty risky: Tornado reads the whole request body into a single string in memory, so setting the limit too high is dangerous.

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?
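To make the size discussion concrete, here is a minimal sketch of the kind of pre-flight check a producer (or the EventBus proxy) would need to make against the 4 MB cap mentioned above. The function and job names are hypothetical illustrations, not the actual EventBus code:

```python
import json

# The 4 MB Kafka message cap discussed in this thread, in bytes.
MAX_MESSAGE_BYTES = 4 * 1024 * 1024

def job_fits_in_queue(job_params: dict) -> bool:
    """Return True if the serialized job stays under the Kafka message limit."""
    payload = json.dumps(job_params).encode("utf-8")
    return len(payload) <= MAX_MESSAGE_BYTES

# A small job passes; a job carrying a multi-megabyte page doc does not.
small_job = {"type": "cirrusSearchLinksUpdate", "title": "Example"}
huge_job = {"type": "cirrusSearchElasticaWrite", "doc": "x" * (5 * 1024 * 1024)}

print(job_fits_in_queue(small_job))  # True
print(job_fits_in_queue(huge_job))   # False
```

A real producer would additionally have to account for the event envelope and any compression, so the effective payload budget is somewhat smaller than the raw limit.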

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

Currently, the maximum message size we accept in Kafka is 4 MB. In the Event-Platform we see occasional logs about messages being too large, but they are pretty rare.

And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

100 MB seems like A LOT. Are we sure we need to support that much?

I agree :)
I hope we don't need this, but at the time we didn't check very closely and we aligned the limit with the elastic defaults.
If there is a way to monitor such errors, I guess we can pick up known large pages and modify them while writes are frozen?

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?

I don't think we can really split them, since it's a single job. Another strategy would be to not store the content in the job queue, but that would involve a larger refactoring on our side.

Gehel added a comment. Apr 25 2018, 2:22 PM

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?

I don't think we can really split them, since it's a single job. Another strategy would be to not store the content in the job queue, but that would involve a larger refactoring on our side.

There was a discussion a while back with @Joe and @EBernhardson. The possibility of externalizing the pages to an object store and passing only references in the jobs was discussed, but discarded: it would add a non-trivial amount of complexity and additional failure modes. We can always revisit that idea now...
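For illustration, a minimal sketch of the discarded "references instead of payloads" idea, using an in-memory dict as a stand-in for a real object store; the store, key scheme and function names are all hypothetical. The extra failure mode is visible immediately: if the store loses the doc before the job runs, the job can no longer be executed.

```python
import hashlib

object_store = {}  # stand-in for a real object store

def enqueue_write(page_doc: str) -> dict:
    """Store the doc out of band and return a small job holding a reference."""
    key = hashlib.sha256(page_doc.encode("utf-8")).hexdigest()
    object_store[key] = page_doc
    return {"type": "cirrusSearchElasticaWrite", "doc_ref": key}

def execute_write(job: dict) -> str:
    """Consumer side: dereference the payload when the job actually runs."""
    return object_store[job["doc_ref"]]

job = enqueue_write("full page document, possibly megabytes of text")
print("doc" in job)            # False: the job itself stays tiny
print(execute_write(job)[:9])  # "full page"
```

The job stays well under any queue message limit regardless of page size, at the cost of a second storage system that must stay consistent with the queue.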

Gehel added a comment. Apr 25 2018, 2:26 PM

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

It depends, but one is happening right now. We could try to sync up with @Gehel to trigger the next cluster restart just after we migrate these wikis?

We try not to do those full cluster restarts too often, as they are somewhat painful. But we probably do them every other month.

If the goal is to validate that everything runs smoothly during a cluster restart with frozen writes, we can schedule a "fake" cluster restart, where we freeze writes, restart just a few nodes and observe the behaviour. I would prefer that to bundling this with a "real" cluster restart, so that if we see issues, we're not blocking a restart that is actually needed.

If there is a way to monitor such errors, I guess we can pick up known large pages and modify them while writes are frozen?

There are MESSAGE_SIZE_TOO_LARGE logs in the EventBus proxy, but they're not reported to logstash for some reason (mental note: report them). We can analyze those over time and see whether we indeed need 100 MB. At the very beginning of the project we found some insane jobs, hundreds of megabytes serialized, and fixed all of them, but since cirrusSearchElasticaWrite wasn't serializing correctly, maybe I need to revisit this now. I'll report if I find anything interesting on this matter.

I've run some analysis on the logs, and indeed the cirrusSearchElasticaWrite job is sometimes too large. Here are the sizes in bytes for all the log entries I could find so far:

Occurrences  Byte size
 26          6115141
 21          4901916
 14          4906302
 10          6116869
  4          5167133
  2          6433459
  2          5485125
  1          6887331
  1          5485125
  1          5326067
  1          5140537

Note that the repeated occurrences of the same byte size are most likely just retries of the same original message, so we don't actually have many messages that are too big.
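The retry-deduplication reasoning above can be sketched directly from the table (the sizes and occurrence counts below are the log data quoted above):

```python
from collections import Counter

# One list entry per MESSAGE_SIZE_TOO_LARGE log line, reconstructed from
# the occurrence counts in the table above.
sizes = ([6115141] * 26 + [4901916] * 21 + [4906302] * 14 + [6116869] * 10
         + [5167133] * 4 + [6433459] * 2 + [5485125] * 2
         + [6887331, 5485125, 5326067, 5140537])

counts = Counter(sizes)

# If identical byte sizes are retries of one original message, the number
# of distinct oversized messages is just the number of distinct sizes.
print(len(counts))   # 10 distinct sizes across 83 log entries
print(max(counts))   # 6887331 bytes, the largest observed (~6.9 MB)
```

So under the retry assumption the 83 log lines collapse to roughly ten oversized messages, all under 7 MB.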

Also worth noting that we didn't see anything crazy on the order of 100 MB; the biggest one is about 6.9 MB, somewhat above our limit of 4 MB. Perhaps we can increase the limit in Kafka to 8 MB and get away with it? What do you think about an 8 MB limit, @Ottomata?

I don't love it! I feel like 4 MB is already huge. Consider troubleshooting some problem with kafkacat -C | jq .. Gotta consume individual 4 MB messages.

That said, I'm not opposed, as I don't know of any practical reason we couldn't do it. It does feel like we're just kicking the can down the road, though. I don't have context on how hard it would be to fix the job, but perhaps we should try that first?

Consider troubleshooting some problem with kafkacat -C | jq .

Haha :)

That said, I'm not opposed, as I don't know of any practical reason we couldn't do it. It does feel like we're just kicking the can down the road, though. I don't have context on how hard it would be to fix the job, but perhaps we should try that first?

One thing I've noticed (using kafkacat | jq :) is that in cirrusSearchElasticaWrite all the page titles in the parameters are \u encoded, which obviously greatly increases the size of the messages. Other jobs handle Unicode just fine, so is there a specific reason cirrusSearchElasticaWrite must \u-escape all the Unicode in page titles? Using plain UTF-8 would significantly decrease message sizes and improve performance overall (less networking, less time decoding back and forth, etc.). @dcausse, why does the cirrus job use \u encoding? Wouldn't Elasticsearch support Unicode?
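The size impact of \u-escaping is easy to demonstrate with Python's json module; the page title below is just a hypothetical example:

```python
import json

# A hypothetical non-ASCII page title, as it might appear in job parameters.
title = "Советская Гавань"

escaped = json.dumps(title)                  # ASCII-only output, \uXXXX escapes
raw = json.dumps(title, ensure_ascii=False)  # plain UTF-8 output

# Each Cyrillic character costs 6 bytes as a \uXXXX escape but only
# 2 bytes in UTF-8, so the escaped form is roughly 3x larger here.
print(len(escaped.encode("utf-8")))  # 93
print(len(raw.encode("utf-8")))      # 33

# Both forms decode to the identical string, so the escaping buys nothing.
print(json.loads(escaped) == json.loads(raw))  # True
```

For payloads dominated by non-Latin text, dropping the escaping shrinks the serialized job by a similar factor.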

Change 436248 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus jobs for all but wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436248

Change 436249 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrus except wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436249

Change 436248 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus jobs for all but wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436248

Change 436249 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable redis queue for cirrus except wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436249

Mentioned in SAL (#wikimedia-operations) [2018-06-05T11:21:57Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Switch CirrusSearch jobs for all wikis except wp, wd, commons - T189137 (duration: 00m 51s)

Change 437446 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch cirrusSearch jobs for everything.

https://gerrit.wikimedia.org/r/437446

Change 437448 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrus search for all wikis.

https://gerrit.wikimedia.org/r/437448

Change 437446 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch cirrusSearch jobs for everything.

https://gerrit.wikimedia.org/r/437446

Change 437448 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable redis queue for cirrus search for all wikis.

https://gerrit.wikimedia.org/r/437448

Mentioned in SAL (#wikimedia-operations) [2018-06-06T12:23:00Z] <mobrovac@deploy1001> Synchronized wmf-config/jobqueue.php: Switch CirrusSearch jobs to EventBus for all wikis - T189137 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-06-06T12:24:38Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Switch CirrusSearch jobs to EventBus for all wikis, file 2/2 - T189137 (duration: 00m 56s)

Pchelolo closed this task as Resolved. Jun 26 2018, 8:47 AM
Pchelolo edited projects, added Services (done); removed Patch-For-Review, Services (doing).