
Wikidata master database connection issue
Closed, Resolved · Public

Description

There is a huge number of connections to the master, so many that there are 5000 connection errors per minute.

While I cannot yet identify the reason, most of the failing connections seem to have started at 12:30-12:31 UTC. Most of the failures come from wikibase-addUsagesForPage job execution. Whether that is a symptom (because there are a lot of these jobs) or a cause, I cannot say yet:

https://logstash.wikimedia.org/#dashboard/temp/AVUF9DIZivsygkP3Y44Z

HHVM was restarted yesterday, but much earlier, at 10:25-10:35, so I am discarding that as a cause.

My suspicion points to either a config/code deploy or a change in traffic pattern that we should identify and either support or drop.

Event Timeline

jcrespo created this task. · May 31 2016, 8:55 AM
Restricted Application added subscribers: Zppix, Aklapper. · May 31 2016, 8:55 AM
jcrespo updated the task description. · May 31 2016, 9:03 AM
jcrespo updated the task description. · May 31 2016, 9:04 AM

There are a couple of long-running tasks on the master, from snapshot1003. That is strange, but I am not sure it is an issue.

hoo added a comment. · May 31 2016, 10:17 AM

There are a couple of long-running tasks on the master, from snapshot1003. That is strange, but I am not sure it is an issue.

That's T108929. We should only see 4 concurrent master connections from the snapshot host, thus this is probably not relevant (and will also be fixed soon).

For the actual problem here: I can't see any obvious change in editing patterns (or counts) on any project, nor in entity usages (which that job adds). There seems to be a coincidental change in job rates: https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=6&fullscreen&from=1464603245227&to=1464689645228&var-Job=all but I'm not sure what that tells us.

Can you tell which shards were (primarily) affected? s4 (commons)? I kind of suspect multilingual pages/page views on commons to be responsible, but then again, I don't know.

Sorry, I wasn't specific enough. This issue is only happening on db1049: s5-master (Wikidata master).

If you ask for wikis, this is the distribution:

filter: wikibase-addUsagesForPage and 10.64.16.144

(10.64.16.144 is s5-master)

dewiki (777)
enwiki (691)
wikidatawiki (487)
frwiki (423)
zhwiki (415)
commonswiki (384)
svwiki (375)
itwiki (287)
ptwiki (177)
ruwiki (171)
eswiki (165)
jawiki (160)
shwiki (138)
arwiki (124)
plwiki (110)
Other values (1909)

At about 2016-05-30 14:00, failed pops of the addUsagesForPage job start to show up at about 50/min, and successful pops start to fluctuate more than before.
The database being connected to is wikidatawiki; most of the connection attempts come from the job running for client wikis (T136598#2340744).

Maybe unrelated: Server admin log:
2016-05-30
15:10 logmsgbot: dcausse@tin Synchronized wmf-config: Send wmf.4 search and ttmserver traffic to codfw (duration: 00m 33s)
12:59 gehel: disabling warmers elasticsearch codfw cluster (T133125)
12:35 gehel: re-enabling puppet on elasticsearch codfw cluster (T133125)

Any variation in MediaWiki.jobqueue.inserts_actual.wikibase_addUsagesForPage.rate seems normal, so this can't be a sudden increase in pages rendered.
The change in rates noticed in T136598#2340678 is only temporary; afterwards the rate seems similar to before again.
Of the remaining DB errors (not from wikibase_addUsagesForPage, e.g. https://logstash.wikimedia.org/#dashboard/temp/AVUIrz6-_LTxu7wlGH5_ ), a high percentage go to wikidatawiki.

The sustained high level of errors seems to have stopped at 21:40-21:45 yesterday. The errors continue, but only in smaller spikes. We'll see if they return when user activity grows.

I do not know what happened there:

Change 292115 had a related patch set uploaded (by Hoo man):
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/292115

hoo claimed this task. · Jun 1 2016, 11:29 AM

I looked into this very briefly and found that we unnecessarily open these connections in most job runs. My patch should fix that.
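In outline, the change is roughly this shape (a sketch only; the class skeleton and method names below are illustrative, not the actual Gerrit change): acquire the repo master connection lazily, so job runs that have nothing to write never touch the master at all.

```php
<?php
// Illustrative sketch only; names are hypothetical, not the actual
// Wikibase patch. The idea: instead of opening a repo master connection
// unconditionally on every job run, defer it until there is actually
// something to write.
class UsageUpdaterSketch {

	/** @var callable Returns a writable repo DB handle on demand. */
	private $masterConnectionProvider;

	public function __construct( callable $masterConnectionProvider ) {
		$this->masterConnectionProvider = $masterConnectionProvider;
	}

	/**
	 * @param int $pageId
	 * @param string[] $usages Entity usages to record for the page.
	 */
	public function addUsagesForPage( $pageId, array $usages ) {
		if ( $usages === [] ) {
			// The common case in many job runs: nothing new to record.
			// Returning here means no master connection is opened at all.
			return;
		}

		// Only now pay the cost of connecting to the repo master.
		$dbw = call_user_func( $this->masterConnectionProvider );
		$dbw->insertUsages( $pageId, $usages ); // hypothetical helper
	}
}
```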

Next, one might want to look at the graph of established connections and connection attempts to the wikidatawiki master, and perhaps compare it with the connection limit (even across a master change).

@JanZerebecki - our max_connections is very high (10000 simultaneous connections) because we use connection pooling on the server side. We almost never reach that limit; the MediaWiki servers get saturated first. What fails is the connection timeout, which is set to 3 seconds (when normally establishing a connection takes a few micro- or milliseconds): it took the local node more than 3 seconds just to establish a MySQL connection. The direct cause is usually having too many open connections, which slows down connection handling overall.
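To make that failure mode concrete, here is a minimal probe sketch (host, credentials, and database below are placeholders; this is not an existing tool): a healthy connect completes in micro- to milliseconds, so anything approaching the 3-second timeout means the server is struggling with connection handling.

```php
<?php
// Hypothetical diagnostic sketch; host and credentials are placeholders.
// Measures how long a bare MySQL connection takes to establish. Normally
// this takes micro- to milliseconds; times approaching the 3 s connect
// timeout indicate too many open connections on the server.
$mysqli = mysqli_init();
// Mirror the 3-second connect timeout described above.
$mysqli->options( MYSQLI_OPT_CONNECT_TIMEOUT, 3 );

$start = microtime( true );
$ok = @$mysqli->real_connect( '10.64.16.144', 'probe_user', 'probe_pass', 'wikidatawiki' );
$elapsedMs = ( microtime( true ) - $start ) * 1000;

if ( $ok ) {
	printf( "Connected in %.1f ms\n", $elapsedMs );
	$mysqli->close();
} else {
	printf( "Connect failed after %.1f ms: %s\n", $elapsedMs, mysqli_connect_error() );
}
```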

daniel triaged this task as High priority. · Jun 2 2016, 12:42 PM
daniel moved this task from incoming to in progress on the Wikidata board.
daniel added a subscriber: daniel.

Taking this into the sprint because it's an ongoing performance issue and is potentially causing data loss. Marius already made a fix.

daniel added a comment. · Jun 2 2016, 1:01 PM

FTR, it remains unclear why the problem started when it did. The code did not change at that point, afaik. Maybe we just crossed some sort of threshold.

Change 292115 merged by jenkins-bot:
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/292115

hoo closed this task as Resolved. · Jun 2 2016, 4:27 PM
hoo moved this task from Doing to Done on the Wikidata-Sprint-2016-05-24 board.
hoo removed a project: Patch-For-Review.

The change has been merged on master and can be backported if needed.

hoo added a comment. · Jun 5 2016, 2:07 AM

I made a mistake in the initial patch; fix-up: https://gerrit.wikimedia.org/r/292834

jcrespo reopened this task as Open. · Jun 7 2016, 7:46 AM
jcrespo added a subscriber: ArielGlenn.

This issue happened again today between 4:26 and 7:20 (I may have inadvertently ended it, as I had to make some unrelated server changes):
https://logstash.wikimedia.org/#dashboard/temp/AVUpzaXm_LTxu7wlMMqT

Adding @ArielGlenn because I saw some long-running connections from snapshot1007 (fetching dewiki) at the time; I want to rule out potential issues with that.

However, from the Kibana logs, all issues seem to be RPC (job) executions of wikibase-addUsagesForPage creating too much concurrency on s5-master.

jcrespo moved this task from Triage to In progress on the DBA board. · Jun 7 2016, 9:24 AM
hoo added a comment. · Jun 7 2016, 10:08 AM

Not surprising that this is still an issue: the changes have not been backported/deployed yet.

I can try to arrange a deploy today, if needed.

To anyone deploying this: Also backport the follow-up!!

jcrespo added a comment (edited). · Jun 7 2016, 10:11 AM

@hoo, my apologies; I thought it had already been deployed. Let me keep this open until it is deployed, so that third parties can see it can still happen for now; you can move it on the Wikidata dashboard if you need to.

19:08 < jynus> jzerebecki, as far as I can see, it is only affecting itself, and not other connections
19:09 < jynus> (the wikidata job queue executions)

I think a deploy is not needed today, as the issue does not affect users.

Change 293511 had a related patch set uploaded (by JanZerebecki):
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/293511

Change 293511 merged by JanZerebecki:
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/293511

hoo added a comment. · Jun 9 2016, 8:11 PM

The fix for this triggered T116404 (the EntityUsageTable::getUsedEntityIdStrings query on the wbc_entity_usage table is sometimes fast, sometimes slow); I'm not sure how to continue from here. We could potentially work around that, but all ideas I have come with strong disadvantages.

Change 293570 had a related patch set uploaded (by Hoo man):
Update Wikibase: Revert T136598 fixes

https://gerrit.wikimedia.org/r/293570

Change 293570 merged by jenkins-bot:
Update Wikibase: Revert T136598 fixes

https://gerrit.wikimedia.org/r/293570

Mentioned in SAL [2016-06-09T20:36:04Z] <hoo@tin> Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (without unrelated T136598 fixes this time) (duration: 01m 51s)

jcrespo closed this task as Resolved. · Jun 16 2016, 2:30 PM

Master connection issues have not been seen for a week.