
Log post-send DB updates and make sure their frequency is acceptable
Closed, ResolvedPublic

Description

We still don't want to tie up Apache threads on slow cross-DC database queries, so logging gives us an empirical way of making sure their frequency is reasonable.

Logs should include the number of master queries and the invoking method names (for debugging).

Event Timeline

What is "reasonable" / "acceptable"?

We could define both:
a) A 'post-send' group of settings in $wgTrxProfilerLimits (perhaps similar to 'POST'; see the sketch after this list)
b) Tracking, via logging and statsd, which requests have any such updates doing writes. As a percentage of GET/HEAD requests, these should be low. Maybe the acceptable threshold could be no more than the rate of POST requests that do writes (or a factor of that, e.g. 2X).
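
As a rough sketch, (a) could look something like the following in DefaultSettings.php. The limit keys mirror the existing 'POST' group; the exact values are illustrative placeholders, not a final proposal:

// Hypothetical 'PostSend' group for deferred updates that run after
// the HTTP response is sent; the values here are placeholders.
$wgTrxProfilerLimits['PostSend'] = [
	'readQueryTime' => 5,   // log reads slower than 5 seconds
	'writeQueryTime' => 1,  // log writes slower than 1 second
	'maxAffected' => 1000   // log writes touching more than 1000 rows
];

The trx profiler would then log any post-send update exceeding these thresholds, which gives us the empirical data the task description asks for.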

Essentially we are limited by:
a) The number of Apache threads doing these updates at any given time (relative to the overall pool).
b) The number of concurrent DB master connections. If, say, every GET/HEAD request tried to do updates, we'd run into connection errors. Jaime might have some ideas on good limits from this perspective.

In some senses, these updates are just like any other query (for which we don't have many defined standards aside from "don't break the site"), but they have the pitfall of perhaps looking "free".

Are we sure things are so bad?

Of course, we cannot have all writes go cross-datacenter (the latency would be too high), but if we assumed (and assured) that they are only a small subset, would things be so bad *IF* we set up some kind of connection pooling that saved the whole TCP and MySQL connection overhead?

I am currently exploring several proxy solutions (for unrelated reasons: HA/failover); one of those could be an "easy way" to limit concurrency and provide persistent connections plus SSL without requiring custom code. But it needs practical proof that it can work in production; so far I have only tested it on labs.

BTW, sorry for not attending today's cross-datacenter meeting; there have been two concurrent ops crises (some still ongoing).

CC'ing bblack, since he brought this up.

I don't think it's bad now, but a few actual metrics won't hurt. The rest of this seems more like a social problem of making sure people don't just do things like call User::saveSettings() all over the place in deferred update blocks.
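
For clarity, the pattern being discouraged is roughly the following (a hypothetical example; the option name and callable body are illustrative, not taken from any real extension):

// A master write tucked into a post-send deferred update on what is
// otherwise a read-only GET request. It looks "free" because it runs
// after the response is flushed, but it still holds a web server
// thread and opens a (possibly cross-DC) master DB connection.
$user = RequestContext::getMain()->getUser();
DeferredUpdates::addCallableUpdate( function () use ( $user ) {
	$user->setOption( 'lastSeenFeature', wfTimestampNow() ); // hypothetical option
	$user->saveSettings(); // master write
} );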

Change 293343 had a related patch set uploaded (by Aaron Schulz):
Add "PostSend" limits to $wgTrxProfilerLimits

https://gerrit.wikimedia.org/r/293343

But the number of post-send updates won't explode overnight. It's more likely that the number will increase by a slow, steady trickle, drawing you into the paradox of the heap: "so things were fine yesterday with 14 extensions doing post-send updates and now it's my extension that went over the line? How was it OK to go from 12 to 13, and from 13 to 14, but suddenly 14 to 15 is a deal-breaker?"

In my experience people value a clear programming model that tells you what you can and cannot do, as opposed to "we'd really prefer it if you didn't...".

aaron raised the priority of this task from Medium to High. Jun 9 2016, 8:54 PM
aaron moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Change 293343 merged by jenkins-bot:
Add "PostSend" limits to $wgTrxProfilerLimits

https://gerrit.wikimedia.org/r/293343

Change 295041 had a related patch set uploaded (by Aaron Schulz):
Add statsd logging of DeferredUpdates

https://gerrit.wikimedia.org/r/295041
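
For context, the kind of instrumentation this patch adds might look roughly like the following; the metric key and exact placement in change 295041 are guesses, not a description of the merged code:

// Count each pending deferred update by class via statsd, so dashboards
// can show which callers contribute most post-send work. Assumes
// $updates is the queue of pending DeferredUpdate objects; the
// 'deferred_updates' key prefix is illustrative.
$stats = RequestContext::getMain()->getStats();
foreach ( $updates as $update ) {
	$name = strtr( get_class( $update ), '\\', '_' );
	$stats->increment( "deferred_updates.{$name}" );
}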

Change 295041 merged by jenkins-bot:
Add statsd logging of DeferredUpdates

https://gerrit.wikimedia.org/r/295041