
Migrate CirrusSearch jobs to Kafka queue
Closed, Resolved · Public

Description

CirrusSearch jobs are the biggest chunk of jobs still on the Redis queue. Here's the list:

I think by now we're fairly confident that switching the simpler jobs will be pretty straightforward, but we need to coordinate with @EBernhardson so that we choose a fairly quiet period for Elasticsearch, and also so he can help with verifying correctness.

I think we should use a similar approach to the one we've used for other jobs: as step 0, switch all the jobs for test wikis and mediawiki, and then ask the Discovery team to verify correctness on the Elasticsearch side.

Details

Related Gerrit Patches:
operations/mediawiki-config : masterDisable redis queue for cirrus search for all wikis.
mediawiki/services/change-propagation/jobqueue-deploy : masterSwitch cirrusSearch jobs for everything.
operations/mediawiki-config : masterDisable redis queue for cirrus except wikipedia, commons and wikidata.
mediawiki/services/change-propagation/jobqueue-deploy : masterEnable cirrus jobs for all but wikipedia, commons and wikidata.
mediawiki/extensions/EventBus : master[JobExecutor] Change the method used to construct the page title.
mediawiki/services/change-propagation/jobqueue-deploy : masterEnable cirrus search jobs for test wikis.
operations/mediawiki-config : masterDisable redis queue for cirrusSearch jobs for test wikis.

Event Timeline

Pchelolo triaged this task as High priority. Mar 7 2018, 5:36 PM
Pchelolo created this task.
Restricted Application added a project: Analytics. · View Herald Transcript · Mar 7 2018, 5:36 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 416992 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrusSearch jobs for test wikis.

https://gerrit.wikimedia.org/r/416992

Change 417000 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus search jobs for test wikis.

https://gerrit.wikimedia.org/r/417000

fdans moved this task from Incoming to Radar on the Analytics board. Mar 8 2018, 6:09 PM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Mar 23 2018, 3:54 PM

Change 416992 merged by Mobrovac:
[operations/mediawiki-config@master] Disable redis queue for cirrusSearch jobs for test wikis.

https://gerrit.wikimedia.org/r/416992

Change 417000 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus search jobs for test wikis.

https://gerrit.wikimedia.org/r/417000

Mentioned in SAL (#wikimedia-operations) [2018-03-28T14:56:15Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: Disable redis queue for cirrusSearch jobs for test wikis, file 1/2 - T189137 (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-03-28T14:58:05Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Disable redis queue for cirrusSearch jobs for test wikis, file 2/2 - T189137 (duration: 01m 17s)

Change 422436 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] [JobExecutor] Change the method used to construct the page title.

https://gerrit.wikimedia.org/r/422436

Change 422436 abandoned by Ppchelko:
[JobExecutor] Change the method used to construct the page title.

Reason:
In favor of I8f305aca34a20e3c339f6bc96815c3def53e48ad

https://gerrit.wikimedia.org/r/422436

dcausse added a comment (edited). Apr 25 2018, 2:02 PM

I don't have strong opinions on which wikis we should migrate next.
My sole concern right now is about write freezes when we restart the elastic clusters.
When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

I don't have strong opinions on which wikis we should migrate next.

group1 could be a good next-step candidate.

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

We might need to do the same, as there is an Apache instance powering the new queue's endpoint that receives the jobs.

The subtasks created to fix issues discovered during the first iteration of the switch have been resolved, and I don't see any logs indicating problems, so it seems nothing is blocking us from moving some more projects to the Kafka queue.

Based on the cirrusSearchIncomingLinkCount job, here is the distribution of projects by traffic, from a sample of 1,000,000 events:

1 912832 "en.wikipedia.org"
2 21949 "commons.wikimedia.org"
3 10475 "www.wikidata.org"
4 6861 "it.wikipedia.org"
5 4494 "zh.wikipedia.org"
6 4410 "fr.wikipedia.org"
7 4122 "cy.wikipedia.org"
8 3722 "bn.wikipedia.org"
9 3669 "de.wikipedia.org"
10 3180 "pl.wikipedia.org"
11 2690 "vi.wikipedia.org"
12 2661 "ar.wikipedia.org"
13 1796 "es.wikipedia.org"
14 1346 "pt.wikipedia.org"
15 1301 "ru.wikipedia.org"
16 1054 "uk.wikipedia.org"
17 944 "ja.wikipedia.org"
18 713 "fr.wiktionary.org"
19 710 "ko.wikipedia.org"
20 633 "ceb.wikipedia.org"
21 611 "he.wikipedia.org"
22 580 "hr.wikipedia.org"
23 579 "te.wikipedia.org"
24 517 "en.wiktionary.org"
25 489 "ast.wikipedia.org"
26 416 "nl.wikipedia.org"
27 414 "cs.wikipedia.org"
28 402 "fa.wikipedia.org"
29 323 "sv.wikipedia.org"
30 284 "el.wikipedia.org"
31 266 "da.wikipedia.org"
32 253 "de.wiktionary.org"
33 238 "id.wikipedia.org"
34 237 "ca.wikipedia.org"
35 212 "en.wikisource.org"
36 185 "pl.wiktionary.org"
37 178 "ur.wikipedia.org"
38 173 "meta.wikimedia.org"
39 170 "bg.wikipedia.org"
40 148 "ro.wikipedia.org"
41 143 "hy.wikipedia.org"
42 142 "lv.wikipedia.org"
43 141 "th.wiktionary.org"
44 141 "species.wikimedia.org"
45 139 "kk.wikipedia.org"
46 132 "hu.wikipedia.org"
47 131 "az.wikipedia.org"
48 125 "fi.wikipedia.org"
49 119 "th.wikipedia.org"
50 109 "simple.wikipedia.org"
51 109 "no.wikipedia.org"
52 101 "tr.wikipedia.org"
53 91 "lt.wikipedia.org"
54 87 "bh.wikipedia.org"
55 83 "gl.wikipedia.org"
56 82 "sl.wikipedia.org"
57 77 "ms.wikipedia.org"
58 77 "hi.wikipedia.org"
59 70 "he.wikisource.org"
60 66 "pt.wiktionary.org"
61 62 "my.wikipedia.org"
62 61 "sr.wikipedia.org"
63 61 "ps.wikipedia.org"
64 58 "sw.wikipedia.org"
65 52 "et.wikipedia.org"
66 52 "bs.wikipedia.org"
67 51 "cs.wiktionary.org"
68 50 "fi.wiktionary.org"
69 50 "eu.wikipedia.org"
70 50 "bjn.wikipedia.org"
71 50 "ba.wikipedia.org"
72 43 "en.wikinews.org"
73 40 "uz.wikipedia.org"
74 40 "fo.wikipedia.org"
75 38 "ru.wiktionary.org"
76 38 "eo.wikipedia.org"
77 35 "la.wiktionary.org"
78 33 "it.wiktionary.org"
79 29 "it.wikiquote.org"
80 28 "it.wikisource.org"
81 28 "ca.wiktionary.org"
82 27 "ja.wikisource.org"
83 27 "an.wikipedia.org"
84 25 "outreach.wikimedia.org"
85 25 "ko.wiktionary.org"
86 24 "zh.wikisource.org"
87 24 "lb.wikipedia.org"
88 22 "sk.wikipedia.org"
89 22 "el.wiktionary.org"
90 21 "fa.wikinews.org"
91 20 "sh.wikipedia.org"
92 19 "ja.wiktionary.org"
93 19 "en.wikibooks.org"
94 18 "mg.wiktionary.org"
95 17 "ta.wikipedia.org"
96 17 "bn.wikibooks.org"
97 16 "it.wikiversity.org"
98 15 "de.wikisource.org"
99 15 "bn.wiktionary.org"
100 14 "mt.wikipedia.org"
101 13 "ku.wiktionary.org"
102 11 "fa.wiktionary.org"
103 11 "fa.wikibooks.org"
104 11 "be.wikipedia.org"
105 10 "kn.wikipedia.org"
106 10 "id.wikibooks.org"
107 10 "de.wikiversity.org"
108 8 "incubator.wikimedia.org"
109 8 "ga.wiktionary.org"
110 8 "et.wiktionary.org"
111 8 "es.wikinews.org"
112 7 "zh-yue.wikipedia.org"
113 7 "mai.wikipedia.org"
114 7 "inh.wikipedia.org"
115 6 "sa.wikipedia.org"
116 6 "nds.wikipedia.org"
117 6 "mr.wikipedia.org"
118 6 "mk.wikipedia.org"
119 6 "min.wikipedia.org"
120 6 "ia.wikipedia.org"
121 6 "ht.wikipedia.org"
122 5 "sq.wikipedia.org"
123 5 "ru.wikisource.org"
124 5 "nn.wikipedia.org"
125 4 "tt.wikipedia.org"
126 4 "sd.wikipedia.org"
127 4 "bcl.wikipedia.org"
128 3 "www.mediawiki.org"
129 3 "ru.wikimedia.org"
130 3 "myv.wikipedia.org"
131 3 "ml.wikipedia.org"
132 3 "ja.wikibooks.org"
133 3 "it.wikinews.org"
134 3 "fy.wikipedia.org"
135 3 "fr.wikisource.org"
136 3 "ce.wikipedia.org"
137 3 "azb.wikipedia.org"
138 2 "zh-classical.wikipedia.org"
139 2 "vls.wikipedia.org"
140 2 "lrc.wikipedia.org"
141 2 "ka.wikipedia.org"
142 2 "is.wikipedia.org"
143 2 "fa.wikivoyage.org"
144 2 "en.wikivoyage.org"
145 2 "en.wikiquote.org"
146 2 "cs.wikiversity.org"
147 2 "ckb.wikipedia.org"
148 2 "ar.wikiquote.org"
149 1 "zh.wikivoyage.org"
150 1 "tyv.wikipedia.org"
151 1 "sv.wiktionary.org"
152 1 "sr.wikinews.org"
153 1 "si.wikipedia.org"
154 1 "oc.wiktionary.org"
155 1 "lmo.wikipedia.org"
156 1 "la.wikipedia.org"
157 1 "kw.wikipedia.org"
158 1 "fi.wikivoyage.org"
159 1 "es.wiktionary.org"
160 1 "de.wikivoyage.org"

So only roughly 3% of the jobs belong to non-wikipedia, non-wikidata and non-commons projects. As a next test, I propose switching everything except those three. @dcausse what's your opinion?

I don't have strong opinions on which wikis we should migrate next.

group1 could be a good next-step candidate.

Sounds good (and I'm fine as well with @Pchelolo's suggestion to switch everything except the three big ones).

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

It depends, but one is happening right now. We could try to sync up with @Gehel to trigger the next cluster restart just after we migrate these wikis?

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

We might need to do the same, as there is an Apache instance powering the new queue's endpoint that receives the jobs.

For reference it was https://phabricator.wikimedia.org/T132740#2209349
And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

Given the numbers above, going with everything but enwiki, wikidata and commons should be a good next round.

Pchelolo added a comment (edited). Apr 25 2018, 2:14 PM

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

Currently, the maximum message size we accept in Kafka is 4 MB. In the Event-Platform we see occasional logs about messages being too large, but they are pretty rare.

And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

100 MB seems like A LOT. Are we sure we need to support that much?

Also, if we indeed want to support 100 MB requests, we'd need to increase max_buffer_size in the EventBus proxy service to 100 MB, but that could be pretty risky: Tornado reads the whole request body into a single string in memory, so setting the limit too high is dangerous.

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?
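To make the size discussion concrete, here is a minimal sketch of the kind of pre-flight check a producer (or the EventBus proxy) would need to make against the 4 MB cap mentioned above. The function and job names are hypothetical illustrations, not the actual EventBus code:

```python
import json

# The 4 MB Kafka message cap discussed in this thread, in bytes.
MAX_MESSAGE_BYTES = 4 * 1024 * 1024

def job_fits_in_queue(job_params: dict) -> bool:
    """Return True if the serialized job stays under the Kafka message limit."""
    payload = json.dumps(job_params).encode("utf-8")
    return len(payload) <= MAX_MESSAGE_BYTES

# A small job passes; a job carrying a multi-megabyte page doc does not.
small_job = {"type": "cirrusSearchLinksUpdate", "title": "Example"}
huge_job = {"type": "cirrusSearchElasticaWrite", "doc": "x" * (5 * 1024 * 1024)}

print(job_fits_in_queue(small_job))  # True
print(job_fits_in_queue(huge_job))   # False
```

A real producer would additionally have to account for the event envelope and any compression, so the effective payload budget is somewhat smaller than the raw limit.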

When we freeze writes, we start to push ElasticaWrite jobs that contain the full page doc, which can be relatively large. We had to raise some limits in the past because of that (the nginx request size, when we added nginx in front of elastic).

Currently, the maximum message size we accept in Kafka is 4 MB. In the Event-Platform we see occasional logs about messages being too large, but they are pretty rare.

And the fix was https://gerrit.wikimedia.org/r/#/c/283619/

100 MB seems like A LOT. Are we sure we need to support that much?

I agree :)
I hope we don't need this, but at the time we didn't check very closely and we aligned the limit with the elastic defaults.
If there is a way to monitor such errors, I guess we can pick up known large pages and modify them while writes are frozen?

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?

I don't think we can really split them, since it's a single job. Another strategy would be to not store the content in the job queue, but that would involve a larger refactoring on our side.

Gehel added a comment. Apr 25 2018, 2:22 PM

I already feel like 4 MB messages are a lot, and I would much prefer not to increase the max message size further. Can these jobs be split up?

I don't think we can really split them, since it's a single job. Another strategy would be to not store the content in the job queue, but that would involve a larger refactoring on our side.

There was a discussion a while back with @Joe and @EBernhardson. The possibility of externalizing the pages to an object store and passing only references in the jobs was discussed, but discarded: it would add a non-trivial amount of complexity and additional failure modes. We can always revisit that idea now...
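For illustration, a minimal sketch of the discarded "references instead of payloads" idea, using an in-memory dict as a stand-in for a real object store; the store, key scheme and function names are all hypothetical. The extra failure mode is visible immediately: if the store loses the doc before the job runs, the job can no longer be executed.

```python
import hashlib

object_store = {}  # stand-in for a real object store

def enqueue_write(page_doc: str) -> dict:
    """Store the doc out of band and return a small job holding a reference."""
    key = hashlib.sha256(page_doc.encode("utf-8")).hexdigest()
    object_store[key] = page_doc
    return {"type": "cirrusSearchElasticaWrite", "doc_ref": key}

def execute_write(job: dict) -> str:
    """Consumer side: dereference the payload when the job actually runs."""
    return object_store[job["doc_ref"]]

job = enqueue_write("full page document, possibly megabytes of text")
print("doc" in job)            # False: the job itself stays tiny
print(execute_write(job)[:9])  # "full page"
```

The job stays well under any queue message limit regardless of page size, at the cost of a second storage system that must stay consistent with the queue.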

Gehel added a comment. Apr 25 2018, 2:26 PM

My sole concern right now is about write freezes when we restart the elastic clusters.

How often do these happen? I agree we should verify how the jobs behave in the new queue in such circumstances.

It depends, but one is happening right now. We could try to sync up with @Gehel to trigger the next cluster restart just after we migrate these wikis?

We try not to do those full cluster restarts too often, as they are somewhat painful. But we probably do them every other month.

If the goal is to validate that everything runs smoothly during a cluster restart with frozen writes, we can schedule a "fake" cluster restart, where we freeze writes, restart just a few nodes and observe the behaviour. I would prefer that to bundling this with a "real" cluster restart, so that if we see issues, we're not blocking a restart that is actually needed.

If there is a way to monitor such errors, I guess we can pick up known large pages and modify them while writes are frozen?

There are MESSAGE_SIZE_TOO_LARGE logs in the EventBus proxy, but they're not reported to logstash for some reason (mental note: report them). We can analyze those over time and see whether we indeed need 100 MB. At the very beginning of the project we found some insane jobs, hundreds of megabytes serialized, and fixed all of them, but since cirrusSearchElasticaWrite wasn't serializing correctly, maybe I need to revisit this now. I'll report if I find anything interesting on this matter.

I've run some analysis on the logs, and indeed the cirrusSearchElasticaWrite job is sometimes too large. Here are the sizes in bytes for all the log entries I could find so far:

Occurrences  Byte size
 26          6115141
 21          4901916
 14          4906302
 10          6116869
  4          5167133
  2          6433459
  2          5485125
  1          6887331
  1          5485125
  1          5326067
  1          5140537

Note that the repeated occurrences of the same byte size are most likely just retries of the same original message, so we don't actually have many messages that are too big.
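The retry-deduplication reasoning above can be sketched directly from the table (the sizes and occurrence counts below are the log data quoted above):

```python
from collections import Counter

# One list entry per MESSAGE_SIZE_TOO_LARGE log line, reconstructed from
# the occurrence counts in the table above.
sizes = ([6115141] * 26 + [4901916] * 21 + [4906302] * 14 + [6116869] * 10
         + [5167133] * 4 + [6433459] * 2 + [5485125] * 2
         + [6887331, 5485125, 5326067, 5140537])

counts = Counter(sizes)

# If identical byte sizes are retries of one original message, the number
# of distinct oversized messages is just the number of distinct sizes.
print(len(counts))   # 10 distinct sizes across 83 log entries
print(max(counts))   # 6887331 bytes, the largest observed (~6.9 MB)
```

So under the retry assumption the 83 log lines collapse to roughly ten oversized messages, all under 7 MB.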

Also worth noting that we didn't see anything crazy on the order of 100 MB; the biggest one is about 6.9 MB, somewhat above our limit of 4 MB. Perhaps we can increase the limit in Kafka to 8 MB and get away with it? What do you think about an 8 MB limit, @Ottomata?

I don't love it! I feel like 4 MB is already huge. Consider troubleshooting some problem with kafkacat -C | jq .. Gotta consume individual 4 MB messages.

That said, I'm not opposed, as I don't know of any practical reason we couldn't do it. It does feel like we're just kicking the can down the road, though. I don't have context on how hard it would be to fix the job, but perhaps we should try that first?

Consider troubleshooting some problem with kafkacat -C | jq .

Haha :)

That said, I'm not opposed, as I don't know of any practical reason we couldn't do it. It does feel like we're just kicking the can down the road, though. I don't have context on how hard it would be to fix the job, but perhaps we should try that first?

One thing I've noticed (using kafkacat | jq :) is that in cirrusSearchElasticaWrite all the page titles in the parameters are \u encoded, which obviously greatly increases the size of the messages. Other jobs handle Unicode just fine, so is there a specific reason cirrusSearchElasticaWrite must \u-escape all the Unicode in page titles? Using plain UTF-8 would significantly decrease message sizes and improve performance overall (less networking, less time decoding back and forth, etc.). @dcausse, why does the cirrus job use \u encoding? Wouldn't Elasticsearch support Unicode?
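The size impact of \u-escaping is easy to demonstrate with Python's json module; the page title below is just a hypothetical example:

```python
import json

# A hypothetical non-ASCII page title, as it might appear in job parameters.
title = "Советская Гавань"

escaped = json.dumps(title)                  # ASCII-only output, \uXXXX escapes
raw = json.dumps(title, ensure_ascii=False)  # plain UTF-8 output

# Each Cyrillic character costs 6 bytes as a \uXXXX escape but only
# 2 bytes in UTF-8, so the escaped form is roughly 3x larger here.
print(len(escaped.encode("utf-8")))  # 93
print(len(raw.encode("utf-8")))      # 33

# Both forms decode to the identical string, so the escaping buys nothing.
print(json.loads(escaped) == json.loads(raw))  # True
```

For payloads dominated by non-Latin text, dropping the escaping shrinks the serialized job by a similar factor.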

Change 436248 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus jobs for all but wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436248

Change 436249 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrus except wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436249

Change 436248 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Enable cirrus jobs for all but wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436248

Change 436249 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable redis queue for cirrus except wikipedia, commons and wikidata.

https://gerrit.wikimedia.org/r/436249

Mentioned in SAL (#wikimedia-operations) [2018-06-05T11:21:57Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Switch CirrusSearch jobs for all wikis except wp, wd, commons - T189137 (duration: 00m 51s)

Change 437446 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch cirrusSearch jobs for everything.

https://gerrit.wikimedia.org/r/437446

Change 437448 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable redis queue for cirrus search for all wikis.

https://gerrit.wikimedia.org/r/437448

Change 437446 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch cirrusSearch jobs for everything.

https://gerrit.wikimedia.org/r/437446

Change 437448 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable redis queue for cirrus search for all wikis.

https://gerrit.wikimedia.org/r/437448

Mentioned in SAL (#wikimedia-operations) [2018-06-06T12:23:00Z] <mobrovac@deploy1001> Synchronized wmf-config/jobqueue.php: Switch CirrusSearch jobs to EventBus for all wikis - T189137 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-06-06T12:24:38Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Switch CirrusSearch jobs to EventBus for all wikis, file 2/2 - T189137 (duration: 00m 56s)

Pchelolo closed this task as Resolved. Jun 26 2018, 8:47 AM
Pchelolo edited projects, added Services (done); removed Patch-For-Review, Services (doing).