
Wikidata master database connection issue
Closed, Resolved · Public

Description

There is a huge number of connections to the master, so many that there are 5000 connection errors per minute.

While I cannot yet identify the reason, most of the failing connections seem to have started at 12:30-12:31 UTC. Most of the failures come from wikibase-addUsagesForPage job execution. Whether that is a symptom (because there are a lot of these jobs) or a cause, I cannot say yet:

https://logstash.wikimedia.org/#dashboard/temp/AVUF9DIZivsygkP3Y44Z

HHVM was restarted yesterday, but much earlier, at 10:25-10:35, so I am discarding that as a cause.

My suspicion points to either a config/code deploy or a change in traffic pattern that we should identify and either support or drop.

Event Timeline

jcrespo created this task. · May 31 2016, 8:55 AM
Restricted Application added subscribers: Zppix, Aklapper. · May 31 2016, 8:55 AM
jcrespo updated the task description. · May 31 2016, 9:03 AM
jcrespo updated the task description. · May 31 2016, 9:04 AM

There are a couple of long-running tasks on the master, from snapshot1003. That is strange, but I am not sure it is an issue.

hoo added a comment. · May 31 2016, 10:17 AM

There are a couple of long-running tasks on the master, from snapshot1003. That is strange, but I am not sure it is an issue.

That's T108929. We should only see 4 concurrent master connections from the snapshot host, thus this is probably not relevant (and will also be fixed soon).

For the actual problem here: I can't see any obvious change in editing patterns (or counts) on any project, nor in entity usages (which that job adds). There seems to be a coincidental change in job rates: https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=6&fullscreen&from=1464603245227&to=1464689645228&var-Job=all but I'm not sure what that tells us.

Can you tell which shards were (primarily) affected? s4 (commons)? I kind of suspect multilingual pages/page views on commons to be responsible, but then again, I don't know.

Sorry, I wasn't specific enough. This issue is only happening on db1049: s5-master (Wikidata master).

If you ask for wikis, this is the distribution:

filter: wikibase-addUsagesForPage and 10.64.16.144

(10.64.16.144 is s5-master)

dewiki (777)
enwiki (691)
wikidatawiki (487)
frwiki (423)
zhwiki (415)
commonswiki (384)
svwiki (375)
itwiki (287)
ptwiki (177)
ruwiki (171)
eswiki (165)
jawiki (160)
shwiki (138)
arwiki (124)
plwiki (110)
Other values (1909)

At about 2016-05-30 14:00, failed pops of the addUsagesForPage job start to show up at about 50/min, and successful pops start to fluctuate more than before.
The database being connected to is wikidatawiki; most of the connection attempts come from the job running for client wikis (T136598#2340744).

Maybe unrelated: Server admin log:
2016-05-30
15:10 logmsgbot: dcausse@tin Synchronized wmf-config: Send wmf.4 search and ttmserver traffic to codfw (duration: 00m 33s)
12:59 gehel: disabling warmers elasticsearch codfw cluster (T133125)
12:35 gehel: re-enabling puppet on elasticsearch codfw cluster (T133125)

Any variation in MediaWiki.jobqueue.inserts_actual.wikibase_addUsagesForPage.rate seems normal, so this can't be a sudden increase in pages rendered.
The change in rates noticed in T136598#2340678 is only temporary; afterwards the rate seems similar to before again.
Of the remaining DB errors (not from wikibase_addUsagesForPage, e.g. https://logstash.wikimedia.org/#dashboard/temp/AVUIrz6-_LTxu7wlGH5_ ), a high percentage go to wikidatawiki.

The sustained high level of errors seems to have stopped at 21:40-21:45 yesterday. The errors continue, but only in smaller spikes. We'll see if they return when user activity grows.

I do not know what happened there:

Change 292115 had a related patch set uploaded (by Hoo man):
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/292115

hoo claimed this task. · Jun 1 2016, 11:29 AM

I looked into this very briefly and found that we unnecessarily open these connections in most job runs. My patch should fix that.
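In outline, the change is roughly this shape (a sketch only; the class skeleton and method names below are illustrative, not the actual Gerrit change): acquire the repo master connection lazily, so job runs that have nothing to write never touch the master at all.

```php
<?php
// Illustrative sketch only; names are hypothetical, not the actual
// Wikibase patch. The idea: instead of opening a repo master connection
// unconditionally on every job run, defer it until there is actually
// something to write.
class UsageUpdaterSketch {

	/** @var callable Returns a writable repo DB handle on demand. */
	private $masterConnectionProvider;

	public function __construct( callable $masterConnectionProvider ) {
		$this->masterConnectionProvider = $masterConnectionProvider;
	}

	/**
	 * @param int $pageId
	 * @param string[] $usages Entity usages to record for the page.
	 */
	public function addUsagesForPage( $pageId, array $usages ) {
		if ( $usages === [] ) {
			// The common case in many job runs: nothing new to record.
			// Returning here means no master connection is opened at all.
			return;
		}

		// Only now pay the cost of connecting to the repo master.
		$dbw = call_user_func( $this->masterConnectionProvider );
		$dbw->insertUsages( $pageId, $usages ); // hypothetical helper
	}
}
```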

Next, one might want to look at the graph of established connections and connection attempts to the wikidatawiki master, and perhaps compare it with the connection limit (even across a master change).

@JanZerebecki - our max_connections is very high (10000 simultaneous connections) because we use connection pooling on the server side. We almost never reach that limit; the MediaWiki servers get saturated first. What fails is the connection timeout, which is set to 3 seconds (when normally establishing a connection takes a few micro- or milliseconds): it took the local node more than 3 seconds just to establish a MySQL connection. The direct cause is usually having too many open connections, which slows down connection handling overall.
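To make that failure mode concrete, here is a minimal probe sketch (host, credentials, and database below are placeholders; this is not an existing tool): a healthy connect completes in micro- to milliseconds, so anything approaching the 3-second timeout means the server is struggling with connection handling.

```php
<?php
// Hypothetical diagnostic sketch; host and credentials are placeholders.
// Measures how long a bare MySQL connection takes to establish. Normally
// this takes micro- to milliseconds; times approaching the 3 s connect
// timeout indicate too many open connections on the server.
$mysqli = mysqli_init();
// Mirror the 3-second connect timeout described above.
$mysqli->options( MYSQLI_OPT_CONNECT_TIMEOUT, 3 );

$start = microtime( true );
$ok = @$mysqli->real_connect( '10.64.16.144', 'probe_user', 'probe_pass', 'wikidatawiki' );
$elapsedMs = ( microtime( true ) - $start ) * 1000;

if ( $ok ) {
	printf( "Connected in %.1f ms\n", $elapsedMs );
	$mysqli->close();
} else {
	printf( "Connect failed after %.1f ms: %s\n", $elapsedMs, mysqli_connect_error() );
}
```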

daniel triaged this task as High priority. · Jun 2 2016, 12:42 PM
daniel moved this task from incoming to in progress on the Wikidata board.
daniel added a subscriber: daniel.

Taking this into the sprint because it's an ongoing performance issue and is potentially causing data loss. Marius already made a fix.

daniel added a comment. · Jun 2 2016, 1:01 PM

FTR, it remains unclear why the problem started when it did. The code did not change at that point, afaik. Maybe we just crossed some sort of threshold.

Change 292115 merged by jenkins-bot:
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/292115

hoo closed this task as Resolved. · Jun 2 2016, 4:27 PM
hoo moved this task from Doing to Done on the Wikidata-Sprint-2016-05-24 board.
hoo removed a project: Patch-For-Review.

The change has been merged on master and can be backported if needed.

hoo added a comment. · Jun 5 2016, 2:07 AM

I made a mistake in the initial patch; fix-up: https://gerrit.wikimedia.org/r/292834

jcrespo reopened this task as Open. · Jun 7 2016, 7:46 AM
jcrespo added a subscriber: ArielGlenn.

This issue happened again today between 4:26 and 7:20 (I may have inadvertently ended it, as I had to make some unrelated server changes):
https://logstash.wikimedia.org/#dashboard/temp/AVUpzaXm_LTxu7wlMMqT

Adding @ArielGlenn because I saw some long-running connections from snapshot1007 (fetching dewiki) at the time; I want to rule out potential issues with that.

However, from the Kibana logs, all issues seem to be RPC (job) executions of wikibase-addUsagesForPage creating too much concurrency on s5-master.

jcrespo moved this task from Triage to In progress on the DBA board. · Jun 7 2016, 9:24 AM
hoo added a comment. · Jun 7 2016, 10:08 AM

Not surprising that this is still an issue: the changes have not been backported/deployed yet.

I can try to arrange a deploy today, if needed.

To anyone deploying this: Also backport the follow-up!!

jcrespo added a comment (edited). · Jun 7 2016, 10:11 AM

@hoo, my apologies; I thought it had already been deployed. Let me keep this open until it is deployed, so that third parties can see it can still happen for now; you can move it on the Wikidata dashboard if you need to.

19:08 < jynus> jzerebecki, as far as I can see, it is only affecting itself, and not other connections
19:09 < jynus> (the wikidata job queue executions)

I think a deploy is not needed today, as the issue does not affect users.

Change 293511 had a related patch set uploaded (by JanZerebecki):
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/293511

Change 293511 merged by JanZerebecki:
Avoid connection to repo master in UsageUpdater

https://gerrit.wikimedia.org/r/293511

hoo added a comment. · Jun 9 2016, 8:11 PM

The fix for this triggered T116404 (the EntityUsageTable::getUsedEntityIdStrings query on the wbc_entity_usage table is sometimes fast, sometimes slow); I'm not sure how to continue from here. We could potentially work around that, but all ideas I have come with strong disadvantages.

Change 293570 had a related patch set uploaded (by Hoo man):
Update Wikibase: Revert T136598 fixes

https://gerrit.wikimedia.org/r/293570

Change 293570 merged by jenkins-bot:
Update Wikibase: Revert T136598 fixes

https://gerrit.wikimedia.org/r/293570

Mentioned in SAL [2016-06-09T20:36:04Z] <hoo@tin> Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (without unrelated T136598 fixes this time) (duration: 01m 51s)

jcrespo closed this task as Resolved. · Jun 16 2016, 2:30 PM

Master connection issues have not been seen for a week.