
Job runner request timeouts in Elastica\Task
Closed, Resolved · Public · Production Error

Description

We are getting production error messages from job runners:

Fatal error: entire web request took longer than 1200 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Task.php on line 92
[XJkH0QpAMFwAAH461NEAAAAC] /rpc/RunSingleJob.php   PHP Fatal Error from line 92 of /srv/mediawiki/php-1.33.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Task.php: entire web request took longer than 1200 seconds and timed out

These seem to have started March 7th with the rollout of wmf.20. The most likely related patch deployed that week is: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Elastica/+/493160/

Working Theory:

MWElasticUtils::deleteByQuery didn't work correctly until the wmf.20 deploy; once it was deployed, these new error messages started to be emitted. Monitoring /_cat/tasks doesn't show any long-running deleteByQuery tasks (task action = indices:data/write/delete/byquery). The best guess is that the code is not correctly detecting task completion and instead waits around until the job gets killed for exceeding the 20-minute job time limit.
Investigate and repair.

Event Timeline

EBernhardson renamed this task from Request timeouts in Elastica\Task to Job runner request timeouts in Elastica\Task.Mar 25 2019, 8:52 PM

I've tried a few different approaches, but I can't get this to stall for a long period of time from the REPL. The best guess right now is to use a much lower timeout on the PHP side: detect something running over, say, 300s, and then at least log appropriate information instead of having only the fatal error to work from.
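
A minimal sketch of that idea, assuming Elastica's Task exposes isCompleted()/refresh()/getId() as in recent releases; the 300s budget and the $task/$logger wiring are illustrative only, not the actual MWElasticUtils code:

```php
// Hypothetical sketch: bound the task-polling loop well below the 1200s
// request limit so we can log useful context before the job runner kills
// the request. $task is an \Elastica\Task, $logger a PSR-3 logger.
$deadline = microtime( true ) + 300; // seconds, far under the 20-minute job limit
while ( !$task->isCompleted() ) {
	if ( microtime( true ) >= $deadline ) {
		$logger->warning( 'deleteByQuery task still running after 300s', [
			'taskId' => $task->getId(),
		] );
		break;
	}
	sleep( 1 );
	$task->refresh();
}
```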

Change 499002 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Turn on Elastica logging channel

https://gerrit.wikimedia.org/r/499002

Change 499012 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Make MWElasticUtils::deleteByQuery harder to misuse

https://gerrit.wikimedia.org/r/499012

Change 499002 merged by jenkins-bot:
[operations/mediawiki-config@master] Turn on Elastica logging channel

https://gerrit.wikimedia.org/r/499002

Mentioned in SAL (#wikimedia-operations) [2019-03-25T23:20:37Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT T219234 Turn on Elastica logging channel (duration: 00m 51s)

The attached patch does not fix the immediate problem; it fixes another related problem. Will continue with preparing a logging patch.

Change 499578 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Log start/progress/stop messages for deleteByQuery

https://gerrit.wikimedia.org/r/499578

Change 499012 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Make MWElasticUtils::deleteByQuery harder to misuse

https://gerrit.wikimedia.org/r/499012

Change 499578 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Log start/progress/stop messages for deleteByQuery

https://gerrit.wikimedia.org/r/499578

Not timeouts, but seeing a few delete-by-query failures in the logs now. These appear to be due to version conflicts; we can set conflicts=proceed to at least let the delete-by-query complete rather than abort mid-delete. This might require some input from @Nikerabbit and @abi_ regarding what is appropriate here. Basically, what is happening is that during the delete-by-query operation some document that was supposed to be deleted was updated. What should happen in that case?

Example failure: https://logstash.wikimedia.org/goto/a29c885f2dc87905409083561df0b73d

It looks like retry_on_conflict is not supported here, per https://discuss.elastic.co/t/delete-by-query-refresh-wait-for-support/154118/6:

retry_on_conflicts isn't supported by delete-by-query and update-by-query because we don't have a mechanism to re-check that the document matches the query after the conflict. It may not. Your only option is to ignore_conflicts and redo the entire request if there are any conflicts. Delete-by-query at least will be much smaller the second time around.

So ttmserver could look at the result and re-issue the delete-by-query if that makes sense. The more I've looked at this, though, the more it seems like there is some sort of race condition inside ttmserver, with multiple users updating the same translation.

Change 501073 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Log the query used in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/501073

Still not seeing any job runner timeouts that are obviously related to this since the last one at 2019-03-25T22:52:37. We should probably leave this task open for a little while and check back a few more times.

Logstash query used: https://logstash.wikimedia.org/goto/8089820c01f5484d75b1c8db0a417cf4

Well it must be this query: https://gerrit.wikimedia.org/g/mediawiki/extensions/Translate/+/51089073db26ffd69096c9d7f805c37703e55b92/ttmserver/ElasticSearchTTMServer.php#266

Editing the same translation multiple times in succession could result in this, as it essentially does a delete+create. Assuming the last one wins, it's safe to ignore failures.

The last write certainly wins, but not in the way that SQL would. In SQL you have ACID semantics that ensure the first transaction to start deleting will also write first, and the second will delete and write second. In Elasticsearch the deletes are interleaved with each other, and an essentially random write ends up being the "last write". Is that acceptable here?

Something that would also need to be investigated: queries only return documents that have been refreshed (refreshes happen on 5s intervals). I suspect that documents written to Elasticsearch but not yet refreshed would not be deleted by a delete-by-query running in that window. At a high level, read-your-writes is not guaranteed, or even expected, in Elasticsearch, as it is eventually consistent.
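
A hedged sketch of that refresh window; $index is an \Elastica\Index, and the explicit refresh() call is just one way to narrow the gap, not something the current code does:

```php
// A document indexed just before a delete-by-query may not yet be visible
// to the query until the index refreshes (5s interval in this setup).
$index->addDocument( $doc );        // write succeeds immediately
$index->refresh();                  // force recent writes to become searchable
$index->deleteByQuery( $query );    // only now is the new document a candidate for deletion
```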

It won't be a big deal if ttmserver is slightly out of sync with reality, but of course it would be nice if we could avoid that. It's been a while since I looked at the schema, but maybe it is possible to key the writes and reads by revision id. Then deletes would just be cleaning up stale data from the index.

Simplified example: let's say we have a translation unit A with English as the source language. Initially we store A-1-en: "English". When translations come in we add them as A-1-fi: "Finnish". Now, if A gets updated, we add A-2-en: "Updated English". If the Finnish translation is updated, we first delete A-*-fi and then add A-2-fi: "Updated Finnish".

At this point we are expected to have:

A-1-en: "English"
A-2-en: "Updated English"
A-2-fi: "Updated Finnish"

For translation memory, we just ask: for every *-*-en that fuzzy-matches content "foo", give the corresponding *-*-fi if it exists. For translation search, I am not sure offhand whether it checks that what is returned matches what is in the wiki (e.g. that A-1-en is ignored). If I remember correctly, there is no automatic clean-up of orphan entries like A-1-en once all translations have been updated. This can degrade performance over time, and the only way to get rid of them is to reindex from scratch.
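
A hypothetical sketch of the revision-keyed idea above, using the A-revision-language example; the field names, document ids, and query shape are assumptions for illustration, not actual ttmserver code:

```php
use Elastica\Document;
use Elastica\Query\BoolQuery;
use Elastica\Query\Term;

// Store each translation under a deterministic id: unit, revision, language.
$doc = new Document( 'A-2-fi', [
	'unit'     => 'A',
	'revision' => 2,
	'language' => 'fi',
	'content'  => 'Updated Finnish',
] );
$index->addDocument( $doc );

// Stale-data clean-up then becomes a query for old revisions of the same
// unit, independent of whatever writes are happening to the current revision.
$stale = new BoolQuery();
$stale->addMust( new Term( [ 'unit' => 'A' ] ) );
$stale->addMust( new Term( [ 'language' => 'fi' ] ) );
$stale->addMustNot( new Term( [ 'revision' => 2 ] ) );
$index->deleteByQuery( $stale );
```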

If being slightly out of sync isn't a big deal, the most direct and simplest way to resolve this is to set conflicts=proceed and let the delete-by-query keep deleting instead of failing.
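
A minimal sketch of the conflicts=proceed approach, plus the re-issue idea from earlier, assuming Elastica's Index::deleteByQuery() passes extra options through to the _delete_by_query endpoint; this is illustrative, not the merged patch itself:

```php
// With conflicts=proceed the request skips conflicted documents and completes
// instead of aborting mid-delete; the response reports how many were skipped.
$response = $index->deleteByQuery( $query, [ 'conflicts' => 'proceed' ] );
$data = $response->getData();

if ( !empty( $data['version_conflicts'] ) ) {
	// Optional: re-issue the request once, as suggested upstream. The second
	// pass is much smaller because most matching documents are already gone.
	$index->deleteByQuery( $query, [ 'conflicts' => 'proceed' ] );
}
```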

Change 502259 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Specify conflict handling in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502259

Change 502262 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Translate@master] Allow conflicts during elasticsearch deleteByQuery

https://gerrit.wikimedia.org/r/502262

Some new 1200s timeouts from jobs in Task.php came up today: https://logstash.wikimedia.org/goto/314bfac86ad15374a5cd8223f8867cbd

It looks like the problem is not a single deleteByQuery that runs for a long time, but rather that many deleteByQuery calls are performed in a single execution. There are around 500 start/stop deleteByQuery log messages per job runner, suggesting this is executed more than 100 times in a single job. The problem then doesn't really have anything to do with Elasticsearch; the job is simply trying to do too much in a single execution and needs to be refactored so it doesn't run for 20 minutes.

Actually, there might be a minimum delay hardcoded into the deleteByQuery code. Will write up something to grow the delay from perhaps 100ms up to 5s over time.
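
A sketch of what that could look like inside the polling loop (illustrative names only, assuming the same Task methods as in the earlier sketch):

```php
// Start polling quickly and back off towards 5s, so jobs that run hundreds
// of small deleteByQuery calls don't each pay a fixed multi-second wait.
$delayMs = 100;
while ( !$task->isCompleted() ) {
	usleep( $delayMs * 1000 );
	$delayMs = min( $delayMs * 2, 5000 );
	$task->refresh();
}
```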

Change 502268 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/Elastica@master] Use an increasing delay in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502268

Change 501073 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Log the query used in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/501073

Change 502259 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Specify conflict handling in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502259

Change 502268 merged by jenkins-bot:
[mediawiki/extensions/Elastica@master] Use an increasing delay in MWElasticUtils::deleteByQuery

https://gerrit.wikimedia.org/r/502268

Change 502262 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Allow conflicts during elasticsearch deleteByQuery

https://gerrit.wikimedia.org/r/502262

Only seeing timeouts against php-1.33.0-wmf.25, nothing against 1.34-wmf.1 yet. Should let this run for a week before declaring victory on the timeouts though.

The last error message emitted was still against 1.33.0-wmf.25. Willing to call this complete.

debt subscribed.

Nice!

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM