
Wikidata master database connection issue
Closed, ResolvedPublic


There is a huge number of connections to the master, so many that there are 5000 connection errors per minute.

While I cannot yet pin down a reason, most of the failing connections seem to have started at 12:30-12:31 UTC. Most of the failures come from wikibase-addUsagesForPage job execution. Whether that is a symptom (simply because there are a lot of these jobs) or the cause, I cannot say yet:

Screenshot from 2016-05-31 10:47:13.png (473×2 px, 95 KB)

HHVM was restarted yesterday, but much earlier (10:25-10:35), so I am discarding that as a cause.

My pointers are to a config/code deploy, or a change in traffic patterns that we should identify and either support or drop.

Event Timeline

There are a couple of long-running tasks on the master, from snapshot1003- that is strange, but not sure it is an issue.

> There are a couple of long-running tasks on the master, from snapshot1003- that is strange, but not sure it is an issue.

That's T108929. We should only see 4 concurrent master connections from the snapshot host, thus this is probably not relevant (and will also be fixed soon).

For the actual problem here: I can't see any obvious change in either editing patterns (or counts) on any project, nor in entity usages (which that job adds). There seems to be a coincidental change in job rates, but I'm not sure what that tells us.

Can you tell which shards were (primarily) affected? s4 (commons)? I kind of suspect multilingual pages/ page views on commons to be responsible, but then again, I don't know.

Sorry I wasn't specific enough. This issue is only happening on db1049: s5-master (Wikidata master).

If you ask for wikis, this is the distribution:

filter: wikibase-addUsagesForPage and ( is s5-master)

dewiki (777), enwiki (691), wikidatawiki (487), frwiki (423), zhwiki (415), commonswiki (384), svwiki (375), itwiki (287), ptwiki (177), ruwiki (171), eswiki (165), jawiki (160), shwiki (138), arwiki (124), plwiki (110), other values (1909)
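For a rough sense of the distribution, the per-wiki counts above can be totaled like this (a throwaway sketch; the numbers are copied from the filter output above, with "Other values" as the catch-all bucket):

```python
# Per-wiki counts of failed wikibase-addUsagesForPage connections,
# copied from the log filter above ("other" is the catch-all bucket).
counts = {
    "dewiki": 777, "enwiki": 691, "wikidatawiki": 487, "frwiki": 423,
    "zhwiki": 415, "commonswiki": 384, "svwiki": 375, "itwiki": 287,
    "ptwiki": 177, "ruwiki": 171, "eswiki": 165, "jawiki": 160,
    "shwiki": 138, "arwiki": 124, "plwiki": 110, "other": 1909,
}

total = sum(counts.values())
print(f"total failures sampled: {total}")
# Top buckets with their share of the total.
for wiki, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{wiki:>14}: {n:4d} ({100 * n / total:.1f}%)")
```

No single client wiki dominates, which fits the job itself (rather than one wiki's traffic) being the common factor.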

At about 2016-05-30 14:00, failed addUsagesForPage job pops start to show at about 50/min, and successful pops start to fluctuate more than before.
The database that is being connected to is wikidatawiki, most of the connection attempts are from the job running for client wikis T136598#2340744.

Maybe unrelated: Server admin log:
15:10 logmsgbot: dcausse@tin Synchronized wmf-config: Send wmf.4 search and ttmserver traffic to codfw (duration: 00m 33s)
12:59 gehel: disabling warmers elasticsearch codfw cluster (T133125)
12:35 gehel: re-enabling puppet on elasticsearch codfw cluster (T133125)

Any variation in MediaWiki.jobqueue.inserts_actual.wikibase_addUsagesForPage.rate seems normal, so this can't be a sudden increase in pages rendered.
The change in rates noticed in T136598#2340678 is only temporary; afterwards it seems similar to before again.
Of the rest of the DB errors (not from wikibase_addUsagesForPage, e.g. ), a high percentage seem to be going to wikidatawiki.

The high sustained level of errors seems to have gone away at 21:40-21:45 yesterday. The errors continue, but only in smaller spikes. We'll see if they return as user activity grows.

I do not know what happened there:

Screenshot from 2016-06-01 11:36:39.png (470×2 px, 54 KB)

Change 292115 had a related patch set uploaded (by Hoo man):
Avoid connection to repo master in UsageUpdater

I looked into this very briefly and found that we unnecessarily open these connections in most job runs. My patch should fix that.
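The general shape of that kind of fix (a hand-written sketch, not the actual Wikibase patch; all names here are hypothetical) is to hold a connection factory and only invoke it when the job actually has something to write, so no-op job runs never touch the master:

```python
class UsageUpdater:
    """Sketch of deferring the master connection until it is needed
    (hypothetical API, not the real Wikibase UsageUpdater)."""

    def __init__(self, connect_to_master):
        # Store the factory; do NOT open a connection eagerly.
        self._connect_to_master = connect_to_master
        self._conn = None

    def add_usages(self, page_id, usages):
        if not usages:
            return  # nothing to write: never touch the master
        if self._conn is None:
            self._conn = self._connect_to_master()
        self._conn.insert(page_id, usages)


# Fake master connection so the sketch is self-contained.
class FakeMaster:
    def __init__(self):
        self.rows = []

    def insert(self, page_id, usages):
        self.rows.extend((page_id, u) for u in usages)


opened = []

def factory():
    conn = FakeMaster()
    opened.append(conn)
    return conn

updater = UsageUpdater(factory)
updater.add_usages(1, [])         # no-op run: no connection opened
print("connections after no-op run:", len(opened))
updater.add_usages(2, ["Q64#L.de"])
print("connections after real write:", len(opened))
```

Since most runs of the job have nothing new to record, this alone removes the bulk of the master connection attempts.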

Next one might want to look at the graph of established connections and connection attempts to the wikidatawiki master. Perhaps compare with the connection limit (even over a changing master).

@JanZerebecki - our max_connections is very high (10000 simultaneous connections) because we use connection pooling on the server side, and we almost never reach that limit; the MediaWiki servers get saturated first. What fails is the connection timeout, which is set to 3 seconds (normally, establishing a connection should take a few micro- or milliseconds): it took the local node more than 3 seconds just to establish a MySQL connection. The direct cause is usually having too many open connections, which slows down the whole connection handling.
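A toy model of the failure mode described above: if each already-open connection adds a little overhead to establishing the next one, attempts start blowing past the 3-second connect timeout long before max_connections (10000) is anywhere near reached. The per-connection overhead value is made up purely for illustration:

```python
# Toy model: each open connection adds a small fixed overhead (made-up
# value) to the time needed to establish the next connection.
CONNECT_TIMEOUT = 3.0        # seconds, as configured on the master
OVERHEAD_PER_CONN = 0.002    # 2 ms per open connection (illustrative)
BASE_CONNECT_TIME = 0.001    # 1 ms for an unloaded server (illustrative)

def connect_time(open_conns):
    """Estimated time to establish one more connection."""
    return BASE_CONNECT_TIME + open_conns * OVERHEAD_PER_CONN

# Find the concurrency level at which attempts start timing out.
open_conns = 0
while connect_time(open_conns) <= CONNECT_TIMEOUT:
    open_conns += 1

print(f"attempts exceed the {CONNECT_TIMEOUT}s timeout "
      f"at ~{open_conns} open connections")
print("max_connections (10000) is never the limiting factor here")
```

The exact numbers are fiction, but the shape matches the observation: the timeout fires far below the configured connection limit.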

daniel triaged this task as High priority.Jun 2 2016, 12:42 PM
daniel moved this task from incoming to in progress on the Wikidata board.
daniel added a subscriber: daniel.

Taking this into the sprint because it's an ongoing performance issue and is potentially causing data loss. Marius already made a fix.

FTR, it remains unclear why the problem started when it did. The code did not change at that point, afaik. Maybe we just crossed some sort of threshold.

Change 292115 merged by jenkins-bot:
Avoid connection to repo master in UsageUpdater

hoo moved this task from Doing to Done on the Wikidata-Sprint-2016-05-24 board.
hoo removed a project: Patch-For-Review.

Change has been merged on master, can be backported, if needed.

jcrespo added a subscriber: ArielGlenn.

This issue happened again today between 4:26 and 7:20 (I may have done something to end it because I had to do some unrelated server changes):

Screenshot from 2016-06-07 09:39:33.png (465×1 px, 39 KB)

Adding @ArielGlenn because I saw some long-running connections from snapshot1007 (fetching dewiki) at the time - I want to rule out potential issues with that.

However, from the Kibana logs, all issues seem to be rpc (job) wikibase-addUsagesForPage runs creating too much concurrency on s5-master.

Not surprising that this is still an issue: the changes have not been backported/deployed yet.

I can try to arrange a deploy today, if needed.

To anyone deploying this: Also backport the follow-up!!

@hoo, my apologies - I thought it had already been deployed. Let me keep this open until it is deployed, so that third parties can see it can still happen for now; you can move it if you need it on the Wikidata dashboard.

19:08 < jynus> jzerebecki, as far as I can see, it is only affecting itself, and not other connections
19:09 < jynus> (the wikidata job queue executions)

I think this is not needed today, as it does not affect users.

Change 293511 had a related patch set uploaded (by JanZerebecki):
Avoid connection to repo master in UsageUpdater

Change 293511 merged by JanZerebecki:
Avoid connection to repo master in UsageUpdater

The fix for this triggered T116404: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow. I am not sure how to continue from here; we could potentially work around that, but all the ideas I have come with strong disadvantages.

Change 293570 had a related patch set uploaded (by Hoo man):
Update Wikibase: Revert T136598 fixes

Change 293570 merged by jenkins-bot:
Update Wikibase: Revert T136598 fixes

Mentioned in SAL [2016-06-09T20:36:04Z] <hoo@tin> Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (without unrelated T136598 fixes this time) (duration: 01m 51s)

Master connection issues not seen for a week.