Very large jobs posted by Wikidata
Closed, Resolved, Public

Description

After we started posting job events into Event-Platform and Kafka, we noticed that some of them were rejected by Kafka with a MESSAGE_SIZE_TOO_LARGE error. The limit was increased from 1 MB to 4 MB, but even after the increase some events are still occasionally rejected. Digging deeper, I found that some jobs are enormously large.

An example, originally 17 MB in size (shortened here):

{
    "meta": {
        "domain": "commons.wikimedia.org",
        "uri": "https://commons.wikimedia.org/wiki/Special:Badtitle/JobSpecification",
        "topic": "mediawiki.job.wikibase-InjectRCRecords",
        "request_id": "faa2e213-fc18-4bd0-9ef1-b716360260a6",
        "schema_uri": "mediawiki/job/1",
        "dt": "2017-09-07T21:59:38+00:00",
        "id": "d80f03e5-9417-11e7-9459-141877615224"
    },
    "page_title": "Special:Badtitle/JobSpecification",
    "database": "commonswiki",
    "params": {
        "pages": {
            "17303097": [6, "'A_Missionary_Preaching_to_the_Natives,_under_a_Skreen_of_platted_Cocoa-nut_leaves_at_Kairua'_by_William_Ellis.jpg"],
            "14883442": [6, "PL_J\\u00f3zef_Ignacy_Kraszewski-Lubonie_tom_II_077.jpeg"],
            "2653965": [6, "Meyers_b9_s0043.jpg"],

            (... a few million further page entries ...)
        },
        "change": {
            "info": "...",
            "user_id": "194202",
            "object_id": "Q36180",
            "time": "20170907215252",
            "revision_id": "553663638",
            "type": "wikibase-item~update",
            "id": 550757876
        }
    },
    "type": "wikibase-InjectRCRecords",
    "page_namespace": -1
}

There's another example that is 44 MB in size when serialized. Kafka is capable of handling that, but it's not great at dealing with very large messages, so we can't increase the cap indefinitely. Maybe there's something we could do on the Wikidata side to reduce the size of these jobs?
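
To make the numbers concrete, here is a small standalone PHP sketch (purely illustrative; the helper function and constant below are not part of EventBus) of checking a serialized event against the 4 MB broker cap:

<?php
// Illustrative only: a hypothetical pre-flight check of a serialized event
// against the 4 MB broker-side cap mentioned above. Neither the constant nor
// the function exists in EventBus; this just shows the arithmetic involved.

const KAFKA_MAX_MESSAGE_BYTES = 4 * 1024 * 1024; // current cap: 4 MB

function eventFitsKafkaLimit( array $event, int $limit = KAFKA_MAX_MESSAGE_BYTES ): bool {
    $serialized = json_encode( $event );
    // strlen() counts bytes, which is what the broker limit is measured in.
    return $serialized !== false && strlen( $serialized ) <= $limit;
}

// e.g. eventFitsKafkaLimit( $jobEvent ) would be false for the 17 MB example above.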

Event Timeline

@Pchelolo, based on our previous conversation about this I am assuming that the bulk of the task is a very large list of pages. Is this correct?

Yeah, in the actual event the params.pages array contains millions and millions of items.

GWicke raised the priority of this task from Medium to High. Sep 13 2017, 5:11 PM

Raised the priority, as this is (a) blocking the migration to the Kafka job queue backend (T157088), and (b) likely already causing performance and possibly reliability issues in the current job queue.

Can I examine the job logs in more depth? The pages parameter can't have more than 100 entries (the old setting), which we changed to 50 and now to 20.

InjectRCRecords batches inserts when running the job, but doesn't chop the batch up before scheduling the job.
I can easily fix that. The patch should be back-portable, too. Give me a minute...
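
For context, the gist of such a fix is to chunk the page set before constructing the jobs, roughly along these lines (a simplified, standalone sketch, not the actual Wikibase patch; the batch size of 20 matches the setting mentioned above, and the array-shaped job specs merely stand in for the real InjectRCRecordsJob objects):

<?php
// Simplified sketch of splitting the page set before scheduling jobs.
// Not the actual Wikibase patch: the array-shaped "specs" below only stand
// in for the real InjectRCRecordsJob construction.

$batchSize = 20; // matches the current pages-per-job limit discussed above

// $pages maps page IDs to [ namespace, title ] pairs, as in the example event.
$pages = [
    17303097 => [ 6, 'Example_file_A.jpg' ],
    14883442 => [ 6, 'PL_Józef_Ignacy_Kraszewski-Lubonie_tom_II_077.jpeg' ],
    // ... potentially millions more entries in the pathological case ...
];
$change = [ 'object_id' => 'Q36180', 'type' => 'wikibase-item~update' ];

$jobSpecs = [];
foreach ( array_chunk( $pages, $batchSize, true ) as $chunk ) {
    // One small InjectRCRecords job per chunk instead of one enormous job.
    $jobSpecs[] = [
        'type'   => 'wikibase-InjectRCRecords',
        'params' => [ 'pages' => $chunk, 'change' => $change ],
    ];
}

// Each spec now serializes well under the Kafka message size cap and would be
// pushed to the job queue individually.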

Note that T174422: Make dbBatchSize in WikiPageUpdater configurable is related, but would not change the fact that the entire set of titles would be put into a single InjectRC job, at the moment.

I did not anticipate that we'd see such *massive* usage for a single item…

Here's an example of a very large event: https://people.wikimedia.org/~ppchelko/large_event

It's not the event itself; it's a log message from Event-Platform, but the event is embedded in the log message.

Change 377811 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Change 377812 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@wmf/1.30.0-wmf.18] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377812

Fix above, and the backport applies cleanly. Note that the fix must be merged into the Wikidata build, which then has to be re-deployed (yes, we know that this sucks). Perhaps @aude or @Addshore or @hoo can help. @Legoktm should also know how it works. And I guess I could manage in a pinch. Relevant documentation can be found at https://wikitech.wikimedia.org/wiki/How_to_deploy_Wikidata_code

Note: unit tests pass, but I have not tried this out.

Change 377811 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Change 377897 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikidata@wmf/1.30.0-wmf.18] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377897

Change 377897 merged by jenkins-bot:
[mediawiki/extensions/Wikidata@wmf/1.30.0-wmf.18] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377897

Mentioned in SAL (#wikimedia-operations) [2017-09-13T23:29:57Z] <dereckson@tin> Synchronized php-1.30.0-wmf.18/extensions/Wikidata/extensions/Wikibase/client: Split page set before constructing InjectRCRecordsJob (T175316) (duration: 00m 57s)

Change 377812 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.30.0-wmf.18] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377812

After the patch was deployed, the situation improved a lot, but we got a 5 MB event today: https://people.wikimedia.org/~ppchelko/event

5 MB is not critically large; we can increase the limit in Kafka to 8 MB, I think, but maybe we should make some more improvements on the Wikidata side?

@Pchelolo disabling the escaping of non-ASCII characters would probably reduce the size to a quarter...

I'm not sure what kind of improvement you mean. We can tweak the chunk size - more jobs, or larger jobs, your pick.

Since in the new JQ system all job runners will run all jobs, a larger number of smaller jobs is preferred over a smaller number of big jobs.

@Pchelolo actually, can you confirm how many entries there were in the "pages" parameter? With the latest patches deployed, there should be no more than 20. Perhaps this is an old job getting retried, because it failed earlier?

There are definitely more than 20 entries under the pages parameter, so perhaps it is indeed a retry of an old job. Let's wait a couple of days to see if it happens again.

@mobrovac how about a very large number of very small jobs? e.g. a million jobs to purge a million pages from cdn?

Note that we introduced batching only a few weeks ago, at the explicit request of the performance folks. We had one job per purge before. It caused problems.

This is where the improvement part of the discussion comes in :) For example, in the concrete case of CDN purges, the EventBus/ChangeProp system supports that out of the box, so instead of having the flow MW -> EB -> CP -> JR -> V we could have MW -> EB -> CP -> V. This is how we are already doing it for async updates, at around 500-1000 purges per second.

I think we should sit down together (Wikidata/Services), go over the operational side of WD jobs, and see (a) how the new JQ system can best support them; and (b) what changes/improvements can be made on both sides to make the system more performant and robust. (Disclaimer: I am not claiming that something is wrong, and I confess my ignorance when it comes to WD, but as this ticket illustrates, there is definitely room for improvement.)

But this is getting a bit out of the scope of this concrete ticket. For now, let's just try to get the size of the jobs below the 4 MB mark? :)

If you fix your encoding ;)

Looks like adding the JSON_UNESCAPED_UNICODE flag should do it: http://php.net/manual/en/function.json-encode.php
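
For reference, here is the effect of that flag on a title like the ones in the example event, using plain json_encode() for illustration (EventBus's own serialization path is slightly different, as noted below):

<?php
// Effect of JSON_UNESCAPED_UNICODE on a multi-byte title, using plain
// json_encode() for illustration; EventBus serializes through its own
// formatter, as noted in the follow-up comment.

$title = 'PL_Józef_Ignacy_Kraszewski-Lubonie_tom_II_077.jpeg';

$escaped   = json_encode( $title );                          // "PL_J\u00f3zef_..."
$unescaped = json_encode( $title, JSON_UNESCAPED_UNICODE );  // "PL_Józef_..."

// Each non-ASCII character shrinks from a 6-byte \uXXXX escape to its raw
// UTF-8 encoding (2-4 bytes), which is where the size reduction comes from.
printf( "%d bytes escaped vs %d bytes unescaped\n",
    strlen( $escaped ), strlen( $unescaped ) );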

Change 378072 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Allow unicode in serialized events.

https://gerrit.wikimedia.org/r/378072

We use the JsonFormatter class to encode JSON, so the solution is a bit different, but the above patch takes care of that.

Change 378072 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Allow unicode in serialized events.

https://gerrit.wikimedia.org/r/378072

Verified that the jobs are now small enough to fit in our new infrastructure. Resolving.