
> 2% of API wall time spent generating UUIDs
Closed, Resolved (Public)

Description

More than 2% of MediaWiki API wall-clock time is spent in the call to UIDGenerator::newUUIDv1() in ApiMain.php, added in a2b85209e2 ("Emit new style API action logs into Monolog"): May 8th: 2.40%, May 7th: 2.76%.

This can probably be ameliorated once PHP 7.3's hrtime() becomes available. In the meantime, it would be good to avoid having UIDGenerator::newUUIDv1() in the path of every API request.
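
For context, the hot path has roughly this shape (an illustration of where the call sits, not the exact ApiMain.php code): every API request builds its api-request log context inline, including a freshly minted event ID.

```
$logCtx = [
	'meta' => [
		'id' => UIDGenerator::newUUIDv1(), // the call measured at ~2% of wall time
		'dt' => wfTimestamp( TS_ISO_8601 ),
		'request_id' => WebRequest::getRequestId(),
	],
	// ... request/response details ...
];
```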

Event Timeline

Pchelolo added a subscriber: mobrovac.

We have been using UUID v1 in Event-Platform events for a while now with no problems, but I guess for other events it did not really matter, since they're much lower volume and are created within code paths that are already very heavy.

As a quick fix, I propose replacing UUID v1 with UUID v4 specifically for the mediawiki.api-request event, with a big fat disclaimer that it needs to be switched back once we get access to PHP 7.3 and can speed up the generation of UUID v1. Opinions?
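
A minimal sketch of the proposed quick fix, assuming the event ID is assigned in a single place (illustrative, not the merged patch):

```
// Before: time-based UUID v1, the call that shows up in the profiling above.
$logCtx['meta']['id'] = UIDGenerator::newUUIDv1();

// Quick fix: random UUID v4, which avoids the expensive v1 path.
// TODO: revisit once PHP 7.3's hrtime() is available and newUUIDv1() can be sped up.
$logCtx['meta']['id'] = UIDGenerator::newUUIDv4();
```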

SGTM. UIDGenerator::newUUIDv4() uses a good source of randomness, and the rate of API requests is on the order of <100k/s, so the probability of ever having a collision is incredibly small.
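
For a rough sense of scale: with 122 random bits per v4 UUID and a sustained 10^5 events/s for a whole year (about n = 3.15 × 10^12 IDs), the birthday bound gives

$$
p \;\approx\; \frac{n^2}{2 \cdot 2^{122}}
  \;\approx\; \frac{(3.15 \times 10^{12})^2}{1.06 \times 10^{37}}
  \;\approx\; 10^{-12}.
$$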

fdans triaged this task as Medium priority.May 13 2019, 3:21 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

Am fine with any UUID that matches the expected format. IIRC, we chose v1 because Gabriel liked the way it encoded the event timestamp. With the meta.dt in every event anyway, I don't see any reason to require that.

Thanks Ori! o/

What is this mediawiki.api-request data ultimately for and why does it require a UUID?

Could the WebRequest::getRequestId() suffice instead? That unique ID from Apache is already based on all the same bits used by UUIDv1 (LAN host-name/IP, process id, timestamp, sub-timestamp counter), and seems authoritative in this context.

What is this mediawiki.api-request data ultimately for and why does it require a UUID?

It's imported into Hadoop for analytics purposes and is going to replace the Avro-based API request logging.

Could the WebRequest::getRequestId() suffice instead?

We already provide the request ID in the event, but we also want each event to have its own unique ID for conformity with other events. We have established a rule that each event must have a unique identifier of the event itself and we include that requirement in all the schemas. One request can generate multiple events. For mediawiki.api-request this might not matter in practice (one request, one api-request event), but other events could also be generated. If we used the request ID, the events generated within the same request context would have the same IDs. The IDs can be used for a lot of possible purposes. ChangeProp, for example, uses them for deduplication on service restarts.
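
To illustrate the distinction (hypothetical variable names; only WebRequest::getRequestId() and UIDGenerator::newUUIDv4() are real calls here): two events emitted during the same request share meta.request_id but each carries its own meta.id, so per-event deduplication still works.

```
$requestId = WebRequest::getRequestId();

// Both events come from the same HTTP request...
$apiRequestEvent['meta'] = [
	'request_id' => $requestId,
	'id' => UIDGenerator::newUUIDv4(), // ...but each has its own identity.
];
$someOtherEvent['meta'] = [
	'request_id' => $requestId,
	'id' => UIDGenerator::newUUIDv4(),
];
```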

Could the WebRequest::getRequestId() suffice instead?

We also log this as meta.request_id. Since you mentioned it... we are hoping that https://gerrit.wikimedia.org/r/c/mediawiki/core/+/505668 and T221976: Have Varnish set the `X-Request-Id` header for incoming external requests get solved soonish. It will be pretty cool to be able to group several event streams by the request_id, to see all of the events that a single HTTP request might generate. :D

We have established a rule that each event must have a unique identifier of the event itself and we include that requirement in all the schemas.
[..]
The IDs can be used for a lot of possible purposes. ChangeProp, for example, uses them for deduplication on service restarts.

I may file a separate task about this, but it seems odd to me to require by default that a UUID be generated for every single event, at scale, on the producer side, without a clear purpose or responsibility around it. If the exact same event is accidentally submitted twice by MW, does some low-level transport layer ensure the second one is rejected, e.g. as part of EventBus/Kafka/EventGate somewhere? If not, is there shared logic that all consumers are meant to enforce to somehow track events they've handled? And does that include the handlers for ChangeProp, JobQueue, and mediawiki.api-request?

Would it be appropriate for this field to be intentionally set to a previously used UUID for the purpose of de-duplication, e.g. when MW informs the event stack about something to be purged for the same page title again?

I suspect the answer to all these is no, and that it must always be a new unique value. That suggests the only appropriate use case for it is low-level transport guarantees, e.g. identifying offsets for consuming, resuming, and storing. As far as I know, that is Kafka's responsibility in this setup, and I thought Kafka already assigns offset IDs.

If MediaWiki retries the HTTP request to produce the event, or EventGate's (or change-prop's or EventLogging's or whatever's) Kafka client retries the Kafka produce request, or if a downstream consumer consumes the same event multiple times (e.g. it consumes, then dies before it commits its new offset high-water mark), it is possible to get duplicate events. Kafka (by default) provides an at-least-once guarantee.

I believe change-prop uses the ID to do deduplication. Our Hive import process also uses the id to do deduplication. EventLogging has also always generated a uuid field.
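
Not the actual change-prop or Hive logic, just a generic sketch of what id-based deduplication looks like on the consumer side of an at-least-once pipeline (consumeEvents() and handleEvent() are hypothetical):

```
// Deduplicate on meta.id. In practice the "seen" store would be bounded
// (e.g. a TTL'd cache or a windowed query), not an unbounded array.
$seen = [];
foreach ( consumeEvents() as $event ) {
	$id = $event['meta']['id'];
	if ( isset( $seen[$id] ) ) {
		// Duplicate delivery (producer retry, re-consume after restart, ...); skip it.
		continue;
	}
	$seen[$id] = true;
	handleEvent( $event );
}
```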

Would it be appropriate for this field to be intentionally set to a previously used UUID for the purpose of de-duplication, e.g. when MW informs the event stack about something to be purged for the same page title again?

No, I don't think so. Some other type of ID could possibly be used for this, but I don't think that's the purpose of the id field. That is for uniquely identifying events themselves.

Q: Is there any reason constructing the event couldn't be deferred until after the API request finishes? We already defer the send...
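
For reference, a minimal sketch of that idea using MediaWiki's DeferredUpdates (placement, channel name, and the log call are assumptions, not an actual patch):

```
use MediaWiki\Logger\LoggerFactory;

// Build the event and mint its ID only after the response has been sent,
// so neither sits on the API request's critical path.
DeferredUpdates::addCallableUpdate( static function () use ( $logCtx ) {
	$logCtx['meta']['id'] = UIDGenerator::newUUIDv1();
	LoggerFactory::getInstance( 'api-request' )->info( 'API request', $logCtx );
} );
```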

Change 510293 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/event-schemas@master] Don't mention that ID should match the DT field.

https://gerrit.wikimedia.org/r/510293

Change 510294 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Use UUIDv4 for mediawiki.api-request log event ID.

https://gerrit.wikimedia.org/r/510294

Change 510295 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Use UUIDv4 for event generation.

https://gerrit.wikimedia.org/r/510295

Change 510296 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/CirrusSearch@master] Use UUIDv4 for cirrussearch-request event ID.

https://gerrit.wikimedia.org/r/510296

Change 510440 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Defer .meta.id creation until after the user response is sent.

https://gerrit.wikimedia.org/r/510440

Change 510296 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Use UUIDv4 for cirrussearch-request event ID.

https://gerrit.wikimedia.org/r/510296

Change 510293 merged by Ottomata:
[mediawiki/event-schemas@master] Don't mention that ID should match the DT field.

https://gerrit.wikimedia.org/r/510293

Shall we merge these patches?

Change 510294 merged by jenkins-bot:
[mediawiki/core@master] Use UUIDv4 for mediawiki.api-request log event ID.

https://gerrit.wikimedia.org/r/510294

More than 2% of MediaWiki API wall-clock time is spent in the call to UIDGenerator::newUUIDv1() in ApiMain.php, added in a2b85209e2 ("Emit new style API action logs into Monolog"): May 8th: 2.40%, May 7th: 2.76%.

Q: Has there been any discussion around pre-generating UUIDs? This would reduce the wall time spent on UUID generation to a minimum. The basic scenario is a fleet of UUID producers that keep a queue full of UUIDs ready for use; then, when something like the API call here needs one, instead of calling the UUID factory, the API implementation would simply dequeue an available UUID. Redis or Kafka could host reasonable implementations of such a queue.
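
A rough sketch of that idea with Redis as the queue (hypothetical: the key name is made up, and a Redis round trip may well cost more than generating the UUID locally, so it would only pay off if the pop piggybacks on an existing connection):

```
// Producer side, run out of band (e.g. a maintenance script or cron job):
// keep a Redis list topped up with ready-made UUIDs.
$redis = new Redis();
$redis->connect( '127.0.0.1', 6379 );
while ( $redis->lLen( 'uuid-pool' ) < 10000 ) {
	$redis->rPush( 'uuid-pool', UIDGenerator::newUUIDv1() );
}

// Request side: dequeue one, falling back to local generation if the pool is empty.
$id = $redis->lPop( 'uuid-pool' ) ?: UIDGenerator::newUUIDv1();
```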

Q: Also, has anyone looked at the mean/variance of the timing of the existing call to the UUID factory? The reason I ask is that there is a scenario where 99 out of 100 calls are super fast, but the 100th call blocks (on Linux) due to depletion of the entropy pool, and the kernel blocks until enough entropy has been collected. Subsequent calls may also be slower due to less system entropy being available between calls.
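
A throwaway probe along those lines (not production code) would answer that empirically:

```
// Time N consecutive UUID v1 calls and report mean and max, to catch
// occasional blocking outliers rather than just the average cost.
$samples = [];
for ( $i = 0; $i < 1000; $i++ ) {
	$t0 = microtime( true );
	UIDGenerator::newUUIDv1();
	$samples[] = ( microtime( true ) - $t0 ) * 1e6; // microseconds
}
printf(
	"mean=%.1f us  max=%.1f us\n",
	array_sum( $samples ) / count( $samples ),
	max( $samples )
);
```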

OK, after switching to UUIDv4 this is not the case anymore.

Change 510295 merged by Ottomata:
[mediawiki/extensions/EventBus@master] Use UUIDv4 for event generation.

https://gerrit.wikimedia.org/r/510295

Change 510440 abandoned by Ppchelko:
Defer .meta.id creation until after the user response is sent.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/510440