
EventBus or CirrusSearch: DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.9/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5
Closed, ResolvedPublic

Description

Error

Request ID: 54810bb36840891c57475f5f

message
DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.9/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5
trace
#0 /srv/mediawiki/php-1.33.0-wmf.9/vendor/firebase/php-jwt/src/JWT.php(299): Firebase\JWT\JWT::handleJsonError(integer)
#1 /srv/mediawiki/php-1.33.0-wmf.9/vendor/firebase/php-jwt/src/JWT.php(164): Firebase\JWT\JWT::jsonEncode(array)
#2 /srv/mediawiki/php-1.33.0-wmf.9/extensions/EventBus/includes/JobQueueEventBus.php(66): Firebase\JWT\JWT::encode(array, string)
#3 /srv/mediawiki/php-1.33.0-wmf.9/extensions/EventBus/includes/JobQueueEventBus.php(53): JobQueueEventBus::getEventSignature(array)
#4 /srv/mediawiki/php-1.33.0-wmf.9/extensions/EventBus/includes/JobQueueEventBus.php(133): JobQueueEventBus->createJobEvent(CirrusSearch\Job\ElasticaWrite)
#5 /srv/mediawiki/php-1.33.0-wmf.9/includes/jobqueue/JobQueue.php(340): JobQueueEventBus->doBatchPush(array, integer)
#6 /srv/mediawiki/php-1.33.0-wmf.9/includes/jobqueue/JobQueue.php(310): JobQueue->batchPush(array, integer)
#7 /srv/mediawiki/php-1.33.0-wmf.9/includes/jobqueue/JobQueueGroup.php(158): JobQueue->push(array)
#8 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(227): JobQueueGroup->push(array)
#9 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(152): CirrusSearch\Job\ElasticaWrite->requeueError(CirrusSearch\Connection)
#10 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Job/Job.php(99): CirrusSearch\Job\ElasticaWrite->doJob()
#11 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Updater.php(227): CirrusSearch\Job\Job->run()
#12 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Updater.php(85): CirrusSearch\Updater->updatePages(array, integer)
#13 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Job/LinksUpdate.php(52): CirrusSearch\Updater->updateFromTitle(Title)
#14 /srv/mediawiki/php-1.33.0-wmf.9/extensions/CirrusSearch/includes/Job/Job.php(99): CirrusSearch\Job\LinksUpdate->doJob()
#15 /srv/mediawiki/php-1.33.0-wmf.9/extensions/EventBus/includes/JobExecutor.php(65): CirrusSearch\Job\Job->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)
#17 {main}

Related logs: https://logstash.wikimedia.org/goto/3a8a414c666cdca598ec0e7a4d6d182a

Impact

Notes

Details

Related Gerrit Patches:
mediawiki/extensions/EventBus : master | Reuse safe serialization method for signing the event.

Event Timeline

hashar triaged this task as Unbreak Now! priority. Dec 19 2018, 8:06 PM
hashar created this task.
Restricted Application added projects: Analytics, Discovery-Search. Dec 19 2018, 8:06 PM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes, Aklapper.
php > print JSON_ERROR_UTF8;
5

Malformed UTF-8 characters, possibly incorrectly encoded
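
For context: JSON error 5 is JSON_ERROR_UTF8, i.e. json_encode() refusing input that is not valid UTF-8, and the deployed php-jwt evidently has no message mapped for that code, hence "Unknown JSON error: 5". A minimal reproduction in plain PHP (illustrative values only, not taken from the failing job):

// json_encode() rejects byte sequences that are not valid UTF-8
$broken = "valid text \xB1 stray continuation byte";
var_dump( json_encode( [ 'title' => $broken ] ) ); // bool(false)
var_dump( json_last_error() );                     // int(5), i.e. JSON_ERROR_UTF8
var_dump( json_last_error_msg() );                 // "Malformed UTF-8 characters, possibly incorrectly encoded"

JWT::encode() json-encodes its payload and turns this failure into the DomainException seen in the trace above.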

Poked @dcausse and @Smalyshev about it in #wikimedia-discovery and they are trying to untangle the jigsaw puzzle \o/

hashar lowered the priority of this task from Unbreak Now! to Normal. Dec 19 2018, 8:30 PM

Per dcausse, the exception is not new. A logstash search over the last 30 days yields:

Message | Count
DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.6/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5 | 198
DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.8/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5 | 100
DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.4/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5 | 94
DomainException from line 353 of /srv/mediawiki/php-1.33.0-wmf.9/vendor/firebase/php-jwt/src/JWT.php: Unknown JSON error: 5 | 10

wiki: mediawikiwiki
page: Thread:Project:Support_desk/自己在公司内部搭建的media_wiki,每次编辑完后点提交时大约要5秒左右的响应时间/reply_(4)

Search backend error during sending {numBulk} documents to the {index} index(s) after 16: unknown: Malformed UTF-8 characters, possibly incorrectly encoded

Smalyshev updated the task description. Dec 19 2018, 8:53 PM
Smalyshev updated the task description.
Krinkle raised the priority of this task from Normal to High. (Edited) Dec 20 2018, 8:06 PM
Krinkle added a project: Core Platform Team.
Krinkle added a subscriber: Krinkle.

Tentatively triaging as high priority. I don't know whether the new JobQueue has a mechanism for stashing failed or invalid job descriptions for later recovery, but it appears that in this case the jobs are lost before they ever reach Kafka. The client is unable to serialise the job description, and the submission attempt fails fatally, which also cascades: the original web request or job that created it gets aborted as well. This means the impact is not limited to some pages not getting indexed.

I wonder if this can be prevented in some generic way. Alternatively, if the input is simply malformed, then I do wonder how it is able to reach this far into our infrastructure before hitting the wall. Presumably this should have been caught much earlier. Or perhaps it is caught now, and these are old pages that made it in before we fixed that, in which case this task could be turned into a TODO item to clean those pages up with a maintenance script, e.g. by inserting tofu characters for the invalid bytes, or some other way of normalising them so that UTF-8 consumers stop running into malformed characters.

then I do wonder how it is able to reach this far into our infrastructure before hitting the wall

The page I found was a four-year-old talk page converted from LiquidThreads to Flow. One guess is that something happened along the way, or that some checks were missing in LT, in the conversion script, or in both. I didn't dig into it further - I imagine it'd be a bit annoying to figure out where the broken UTF-8 comes from. It would probably require digging directly into the DB for the page text, and figuring out how it got there (if it's really there) would not be easy.

The client is unable to serialise the description and the submission attempt fails fatally, also resulting in cascading failure as the original web request or job that created it also gets aborted. This means its impact is not limited to some pages not getting indexed.

In this case, the job seems to be exclusively for indexing, but of course a failure in indexing should not fail other things, and we should ensure this is the case. We could try creating some pages with invalid UTF-8 titles/content and verify that our infrastructure fails gracefully on them.

@Smalyshev We have good enough isolation for the running of jobs, afaik. The issue is that this is coming from the submission end (which might itself be another job).

If we imagine a user request that edits or renames a page, it might also trigger secondary post-edit updates, including a re-index job. The serialisation step for that job description causes a fatal error before the job gets sent to Kafka. That fatal error affects the user's web request, which means other PHP code meant to run during the edit request will not run either, including job submissions or deferred updates from entirely unrelated MediaWiki components.

The same may also apply to a job. I might edit a template used by a bunch of pages. The web request in which I edit the page will enqueue a recursive RefreshLinks job (that part works fine). That job then starts running at some point, and the job instance that deals with the small batch of pages containing the invalid UTF-8 page could fail when it tries to submit the re-index job, causing that whole batch of RefreshLinks to not run.

Queuing a job is a fairly basic activity, one where we generally don't expect errors, let alone fatal ones.

I don't think this is an issue for CirrusSearch, though. The problem is that EventBus either needs to find a way to serialise these jobs that doesn't fail, or needs to catch the exception within its portion of the call stack, e.g. by logging it to Logstash and dropping the job.

Alternatively, if we don't want to support this, we'll have to fix it from the other end. That is: ensure these titles cannot be created today, and then fix up existing ones with a maintenance script.
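
A minimal sketch of the catch-and-drop option, assuming the getEventSignature() helper seen in the stack trace; the logger channel and event field name are illustrative, not the actual EventBus code:

try {
    $event['mediawiki_signature'] = self::getEventSignature( $event );
} catch ( \DomainException $e ) {
    // json_encode() inside JWT::encode() rejected malformed UTF-8 in the job parameters
    \MediaWiki\Logger\LoggerFactory::getInstance( 'EventBus' )->error(
        'Failed to sign job event, dropping it',
        [ 'exception' => $e ]
    );
    return false; // drop this one job instead of fataling the whole request
}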

@Krinkle we now have a process that reindexes all pages over time, which means every page will eventually be reindexed, and this error may be triggered by such a reindex even without any edits. The good news is that if it fails, we are no worse off than before (except for the logspam).

Queuing a job is a fairly basic activity, one where we generally don't expect errors, let alone fatal ones.

Agreed.

EBjune added a subscriber: EBjune. Jan 3 2019, 6:10 PM

We'll keep an eye on this; let us know if Search can help further.

mobrovac removed mobrovac as the assignee of this task. Feb 19 2019, 10:37 PM
mobrovac added a subscriber: mobrovac.

I don't think this is an issue for CirrusSearch, though. The problem is that EventBus either needs to find a way to serialise these jobs that doesn't fail, or needs to catch the exception within its portion of the call stack, e.g. by logging it to Logstash and dropping the job.

EventBus logs an error if it cannot serialise a job when sending it to Kafka in the first place; we had quite a number of such cases while we were moving the jobs to the new JobQueue. In this concrete instance, it does seem like CirrusSearch fails to encode something. On the JobQueue side, when a job fails it is retried several times, and if those retries also fail, it is put into a special error topic. CirrusSearch jobs are special in that regard, though, as they retry themselves, which is probably why the stack trace here shows another job being enqueued.

Restricted Application added a project: Analytics. Feb 19 2019, 10:37 PM

In this concrete instance, it does seem like CirrusSearch fails to encode something.

From my reading of the stack trace, encoding happens in extensions/EventBus - thus outside CirrusSearch. It might be that CirrusSearch is sending some data that couldn't be encoded - but I presume EventBus should be handling this failure?

Pchelolo claimed this task. Feb 20 2019, 1:44 AM
Pchelolo added a subscriber: Pchelolo.

This is a bug in Event-Platform. Serializing the events for sending is protected with a try-catch: it logs an error and drops the job if it's not serializable. However, in this case the failure happens in a different place - each job is signed via JWT so that it can later be verified that MediaWiki is the entity that actually sent the job. That signing procedure apparently requires serialization internally as well, and it is not protected with a try-catch.
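
For reference, a rough sketch of the direction the fix takes - simplified, with assumed helper names (serializeEvents(), getSecret()) rather than the actual patched EventBus code:

// Serialize the event once through the existing, error-tolerant serializer (which
// already logs failures so the job can be dropped), then sign the resulting string,
// so the unguarded json_encode() inside JWT::encode() never sees the raw event data.
$serialized = EventBus::serializeEvents( [ $event ] ); // assumed to return null on failure
if ( $serialized === null ) {
    return false; // serialization failed (e.g. malformed UTF-8); drop the job, don't fatal
}
$event['mediawiki_signature'] = \Firebase\JWT\JWT::encode(
    [ 'event' => $serialized ], // sign the pre-serialized string, not the raw array
    self::getSecret()           // hypothetical accessor for the signing secret
);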

Change 491880 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Reuse safe serialization method for signing the event.

https://gerrit.wikimedia.org/r/491880

Milimetric moved this task from Incoming to Radar on the Analytics board. Feb 21 2019, 5:44 PM

Change 491880 merged by Mobrovac:
[mediawiki/extensions/EventBus@master] Reuse safe serialization method for signing the event.

https://gerrit.wikimedia.org/r/491880

Pchelolo closed this task as Resolved. Mar 11 2019, 6:37 PM

The change has been deployed and EventBus doesn't fail with this anymore. Resolving.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM