
Job runner request timeouts in Elastica\Task
Closed, Resolved · Public · Production Error

Description

We are getting production error messages from job runners:

Fatal error: entire web request took longer than 1200 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Task.php on line 92
[XJkH0QpAMFwAAH461NEAAAAC] /rpc/RunSingleJob.php   PHP Fatal Error from line 92 of /srv/mediawiki/php-1.33.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Task.php: entire web request took longer than 1200 seconds and timed out

These seem to have started March 7th with the rollout of wmf.20. The most likely related patch deployed that week is: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Elastica/+/493160/

Working Theory:

MWElasticUtils::deleteByQuery didn't work correctly until the wmf.20 deploy; once it was deployed, these new error messages started to be emitted. Monitoring /_cat/tasks doesn't show any long-running deleteByQuery tasks (task action = indices:data/write/delete/byquery). The best guess is that the code is not correctly detecting task completion and instead waits around until the job gets killed for exceeding the 20-minute job time limit.
Investigate and repair.

Event Timeline

EBernhardson renamed this task from Request timeouts in Elastica\Task to Job runner request timeouts in Elastica\Task.Mar 25 2019, 8:52 PM

I've tried a few different approaches, but I can't get this to stall for a long period of time from the REPL. The best guess right now is to use a much lower timeout on the PHP side: detect something running over, say, 300s, and then at least log appropriate information instead of having only the fatal error to work from.
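
A minimal sketch of that idea, assuming Elastica's Task exposes isCompleted()/refresh()/getId() as in recent releases; the 300s budget and the $task/$logger wiring are illustrative only, not the actual MWElasticUtils code:

```php
// Hypothetical sketch: bound the task-polling loop well below the 1200s
// request limit so we can log useful context before the job runner kills
// the request. $task is an \Elastica\Task, $logger a PSR-3 logger.
$deadline = microtime( true ) + 300; // seconds, far under the 20-minute job limit
while ( !$task->isCompleted() ) {
	if ( microtime( true ) >= $deadline ) {
		$logger->warning( 'deleteByQuery task still running after 300s', [
			'taskId' => $task->getId(),
		] );
		break;
	}
	sleep( 1 );
	$task->refresh();
}
```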

Change 499002 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Turn on Elastica logging channel

https://gerrit.wikimedia.org/r/499002

Change 499012 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Make MWElasticUtils::deleteByQuery harder to misuse

https://gerrit.wikimedia.org/r/499012

Change 499002 merged by jenkins-bot:
[operations/mediawiki-config@master] Turn on Elastica logging channel

https://gerrit.wikimedia.org/r/499002

Mentioned in SAL (#wikimedia-operations) [2019-03-25T23:20:37Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT T219234 Turn on Elastica logging channel (duration: 00m 51s)

The attached patch does not fix the immediate problem; it fixes another related problem. Will continue with preparing a logging patch.

Change 499578 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Log start/progress/stop messages for deleteByQuery

https://gerrit.wikimedia.org/r/499578

Change 499012 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Make MWElasticUtils::deleteByQuery harder to misuse

https://gerrit.wikimedia.org/r/499012

Change 499578 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Log start/progress/stop messages for deleteByQuery

https://gerrit.wikimedia.org/r/499578

Not timeouts, but seeing a few delete-by-query failures in the logs now. These appear to be due to version conflicts; we can set conflicts=proceed to at least let the delete-by-query complete rather than abort mid-delete. This might require some input from @Nikerabbit and @abi_ regarding what is appropriate here. Basically, what is happening is that during the delete-by-query operation some document that was supposed to be deleted was updated. What should happen in that case?

Example failure: https://logstash.wikimedia.org/goto/a29c885f2dc87905409083561df0b73d

It looks like retry_on_conflict is not supported here, per https://discuss.elastic.co/t/delete-by-query-refresh-wait-for-support/154118/6:

retry_on_conflicts isn't supported by delete-by-query and update-by-query because we don't have a mechanism to re-check that the document matches the query after the conflict. It may not. Your only option is to ignore_conflicts and redo the entire request if there are any conflicts. Delete-by-query at least will be much smaller the second time around.

So ttmserver could look at the result and re-issue the delete-by-query if that makes sense. The more I've looked at this, though, the more it seems like there is some sort of race condition inside ttmserver, with multiple users updating the same translation.

Change 501073 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Log the query used in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/501073

Still not seeing any job runner timeouts that are obviously related to this since the last one at 2019-03-25T22:52:37. We should probably leave this task open for a little while and check back a few more times.

Logstash query used: https://logstash.wikimedia.org/goto/8089820c01f5484d75b1c8db0a417cf4

Well it must be this query: https://gerrit.wikimedia.org/g/mediawiki/extensions/Translate/+/51089073db26ffd69096c9d7f805c37703e55b92/ttmserver/ElasticSearchTTMServer.php#266

Editing the same translation multiple times in succession could result in this, as it essentially does a delete+create. Assuming the last one wins, it's safe to ignore failures.

The last write certainly wins, but not in the way that SQL would. In SQL you have ACID semantics that ensure the first transaction to start deleting will also write first, and the second will delete and write second. In Elasticsearch the deletes are interleaved with each other, and an essentially random write ends up being the "last write". Is that acceptable here?

Something that would also need to be investigated: queries only return documents that have been refreshed (refreshes happen on 5s intervals). I suspect that documents written to Elasticsearch but not yet refreshed would not be deleted by a delete-by-query running in that window. At a high level, read-your-writes is not guaranteed, or even expected, in Elasticsearch, as it is eventually consistent.
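
A hedged sketch of that refresh window; $index is an \Elastica\Index, and the explicit refresh() call is just one way to narrow the gap, not something the current code does:

```php
// A document indexed just before a delete-by-query may not yet be visible
// to the query until the index refreshes (5s interval in this setup).
$index->addDocument( $doc );        // write succeeds immediately
$index->refresh();                  // force recent writes to become searchable
$index->deleteByQuery( $query );    // only now is the new document a candidate for deletion
```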

It won't be a big deal if ttmserver is slightly out of sync with reality, but of course it would be nice if we could avoid that. It's been a while since I looked at the schema, but maybe it is possible to key the writes and reads by revision id. Then deletes would just be cleaning up stale data from the index.

Simplified example: let's say we have a translation unit A with English as the source language. Initially we store A-1-en: "English". When translations come in we add them as A-1-fi: "Finnish". Now, if A gets updated, we add A-2-en: "Updated English". If the Finnish translation is updated, we first delete A-*-fi and then add A-2-fi: "Updated Finnish".

At this point we are expected to have:

A-1-en: "English"
A-2-en: "Updated English"
A-2-fi: "Updated Finnish"

For translation memory, we just ask: for every *-*-en that fuzzy-matches content "foo", give the corresponding *-*-fi if it exists. For translation search, I am not sure offhand whether it checks that what is returned matches what is in the wiki (e.g. that A-1-en is ignored). If I remember correctly, there is no automatic clean-up of orphan entries like A-1-en once all translations have been updated. This can degrade performance over time, and the only way to get rid of them is to reindex from scratch.
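
A hypothetical sketch of the revision-keyed idea above, using the A-revision-language example; the field names, document ids, and query shape are assumptions for illustration, not actual ttmserver code:

```php
use Elastica\Document;
use Elastica\Query\BoolQuery;
use Elastica\Query\Term;

// Store each translation under a deterministic id: unit, revision, language.
$doc = new Document( 'A-2-fi', [
	'unit'     => 'A',
	'revision' => 2,
	'language' => 'fi',
	'content'  => 'Updated Finnish',
] );
$index->addDocument( $doc );

// Stale-data clean-up then becomes a query for old revisions of the same
// unit, independent of whatever writes are happening to the current revision.
$stale = new BoolQuery();
$stale->addMust( new Term( [ 'unit' => 'A' ] ) );
$stale->addMust( new Term( [ 'language' => 'fi' ] ) );
$stale->addMustNot( new Term( [ 'revision' => 2 ] ) );
$index->deleteByQuery( $stale );
```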

If being slightly out of sync isn't a big deal, the most direct and simplest way to resolve this is to set conflicts=proceed and let the delete-by-query keep deleting instead of failing.
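
A minimal sketch of the conflicts=proceed approach, plus the re-issue idea from earlier, assuming Elastica's Index::deleteByQuery() passes extra options through to the _delete_by_query endpoint; this is illustrative, not the merged patch itself:

```php
// With conflicts=proceed the request skips conflicted documents and completes
// instead of aborting mid-delete; the response reports how many were skipped.
$response = $index->deleteByQuery( $query, [ 'conflicts' => 'proceed' ] );
$data = $response->getData();

if ( !empty( $data['version_conflicts'] ) ) {
	// Optional: re-issue the request once, as suggested upstream. The second
	// pass is much smaller because most matching documents are already gone.
	$index->deleteByQuery( $query, [ 'conflicts' => 'proceed' ] );
}
```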

Change 502259 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Specify conflict handling in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502259

Change 502262 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Translate@master] Allow conflicts during elasticsearch deleteByQuery

https://gerrit.wikimedia.org/r/502262

Some new 1200s timeouts from jobs in Task.php came up today: https://logstash.wikimedia.org/goto/314bfac86ad15374a5cd8223f8867cbd

It looks like the problem is not a single deleteByQuery that runs for a long time, but rather that many deleteByQuery calls are performed in a single execution. There are around 500 start/stop deleteByQuery log messages per job runner, suggesting this is executed more than 100 times in a single job. The problem then doesn't really have anything to do with Elasticsearch; the job is simply trying to do too much in a single execution and needs to be refactored so it doesn't run for 20 minutes.

Actually, there might be a minimum delay hardcoded into the deleteByQuery code. Will write up something to grow the delay from perhaps 100ms up to 5s over time.
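
A sketch of what that could look like inside the polling loop (illustrative names only, assuming the same Task methods as in the earlier sketch):

```php
// Start polling quickly and back off towards 5s, so jobs that run hundreds
// of small deleteByQuery calls don't each pay a fixed multi-second wait.
$delayMs = 100;
while ( !$task->isCompleted() ) {
	usleep( $delayMs * 1000 );
	$delayMs = min( $delayMs * 2, 5000 );
	$task->refresh();
}
```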

Change 502268 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Use an increasing delay in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502268

Change 501073 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Log the query used in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/501073

Change 502259 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Specify conflict handling in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502259

Change 502268 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Use an increasing delay in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502268

Change 502262 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Allow conflicts during elasticsearch deleteByQuery

https://gerrit.wikimedia.org/r/502262

Only seeing timeouts against php-1.33.0-wmf.25, nothing against 1.34-wmf.1 yet. Should let this run for a week before declaring victory on the timeouts though.

The last error message emitted was still against 1.33.0-wmf.25. Willing to call this complete.

debt subscribed.

Nice!

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM