
[Bug] EntityUsageTable::touchUsageBatch slow query
Closed, Declined (Public)

Description

I see some slow queries in the logs for touch usages:

UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')

We should still think about how to improve these.
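
One direction worth sketching here (purely illustrative, not what the code currently does): split the batch into smaller IN() lists and commit each chunk separately, so that no single statement has to lock a few dozen rows at once and replication can keep up between chunks.

  -- Hypothetical sketch: the same touch operation split into small, independently
  -- committed chunks; pause between chunks if replication lag is high.
  BEGIN;
  UPDATE wbc_entity_usage SET eu_touched = '20150907064933'
    WHERE eu_row_id IN ('2430343','2430344','2430345','2430346','2430347');
  COMMIT;
  BEGIN;
  UPDATE wbc_entity_usage SET eu_touched = '20150907064933'
    WHERE eu_row_id IN ('2430348','2430349','2430350','2430351','2430352');
  COMMIT;
  -- ...and so on for the remaining row ids.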

Event Timeline

aude raised the priority of this task from to Needs Triage.
aude updated the task description.
aude subscribed.

I'm a bit at a loss here. This query says "here is a set of values of the primary key, go and set field x to value y for the rows with that key". I don't see how this could be improved...

Sep 7 05:03:28 mw1005: SlowTimer [15207ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20141005013634' WHERE eu_row_id = '1266958'
Sep 7 06:49:49 mw1183: SlowTimer [15917ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch ... */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')
Sep 7 06:50:05 mw1002: SlowTimer [15802ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')

By comparison, term queries are significantly slower:

SlowTimer [60001ms] at runtime/ext_mysql: slow query: SELECT /* Wikibase\TermSqlIndex::getMatchingTerms */ term_entity_type,term_type,term_language,term_text,term_weight,term_entity_id FROM wb_terms WHERE ((term_language = 'en' AND (term_search_key LIKE 'foo%' ) AND term_type = 'label' AND term_entity_type = 'item') OR (term_language = 'en' AND (term_search_key LIKE 'unchar%' ) AND term_type = 'alias' AND term_entity_type = 'item')) LIMIT 5000

Updating a single row should not take 15 seconds. Something is quite wrong here.
...and it's not the query.

It could be a queueing issue: if thousands of those queries are sent at the same time (due to bot traffic, for example), they would be enqueued on the server side and executed with controlled concurrency. I could check the binary logs to see if that is the case, and profile the database server when it happens (although I would need to know on which server/wiki it happens, and catch it in the moment).
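
If that hypothesis is right, the symptom should be visible in the process list while the slow timer fires; something along these lines would show it (an illustrative check, not something already done in this ticket):

  -- Illustrative check: many identical touchUsageBatch UPDATEs sitting here,
  -- waiting on locks or in the 'updating' state, would point at queueing and
  -- concurrency rather than at the statement itself.
  SELECT id, time, state, info
    FROM information_schema.PROCESSLIST
   WHERE info LIKE '%touchUsageBatch%'
   ORDER BY time DESC;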

We recently had issues with bots generating legitimate but heavier-than-usual traffic on Wikidata; this could be related.

aude renamed this task from EntityUsageTable::touchUsageBatch slow query to [Bug] EntityUsageTable::touchUsageBatch slow query.Sep 10 2015, 10:41 AM
aude triaged this task as Medium priority.
aude set Security to None.

This may be one of the causes of T122429.

I just looked into this very briefly, and I see two cases where we issue queries like the ones listed above:

  • On LinksUpdateComplete, which should also be throttled so that it doesn't overwhelm the infrastructure
  • In AddUsagesForPageJob… maybe we should throttle that job in order to reduce concurrency.

We might also (I don't actually know) be able to reduce the number or size of these writes by comparing the data first and only updating what actually needs updating.
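
A minimal sketch of that idea, assuming eu_touched only ever moves forward: add a condition so that rows already at or past the new value are simply not touched.

  -- Sketch: only rows whose eu_touched is actually older than the new value
  -- get written; rows that are already up to date produce no modification.
  UPDATE wbc_entity_usage
     SET eu_touched = '20150907064933'
   WHERE eu_row_id IN ('2430363','2430361','2430345')
     AND eu_touched < '20150907064933';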

As per @hoo's request, I am sharing my own opinion here:

This is blocking the rollout of ROW-based replication, which should happen soon. In my own (database-focused) opinion, Wikidata updates (this is one of them) are the worst MediaWiki-related production cluster problem happening right now.

hoo closed this task as Declined (Edited). Feb 4 2016, 5:10 PM
hoo claimed this task.

We decided to completely drop the field in question in T124737; please see T125838 for that.

@jcrespo Can you write down somewhere what problems you see with Wikidata updates at the database level? Perhaps it would be useful to establish a communication channel for this kind of thing. This is the first I've heard about our updates being a blocker for switching to row-based replication.

@jcrespo That describes the problem as "When updating slow special pages from terbium, such as Listredirects, certain rows of the querycache table are deleted and then inserted." It mentions that updating slow query pages is problematic when "Other writes are happening at the same time, such as updating pagelinks or wbc_entity_usage". There is nothing there that indicates that Wikidata operations are causing issues, or what such issues might be. Nothing alerted us to the existence of that ticket, or to the relevance it might have for us.

Please, if you see any problems related to the wb_* and wbc_* tables, let us know. Adding the "Wikidata" umbrella to the ticket is generally enough. If it's not, poke me or Jan or Marius or Adam or Katie...

/* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch 127.0.0.1 */ creates lots of row updates without checking lag on the slaves. Even if a single one were short (which it is not, according to this very ticket), creating many updates without checking replication lag has the same effect. It is problematic when using statement-based replication, but even more so under ROW-based replication: it creates lag, especially over smaller-bandwidth channels such as a WAN.
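
For illustration, "checking lag on the slaves" boils down to something like the following between batches (MediaWiki has its own load-balancer helpers for this; the raw statement is shown only to make the idea concrete):

  -- On each replica, before sending the next batch: if Seconds_Behind_Master
  -- is above some threshold, back off and retry later instead of piling on
  -- more touch updates.
  SHOW SLAVE STATUS;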

It prevents s3 from using ROW-based replication, which in turn means other queries are slowly creating production data corruption that will take years to fix. Why s3? Because most of the updates there have to do with maintaining the wbc_entity_usage table, as there are a lot of small wikis using Wikidata.

Causes of lag are tracked on: T95501

I created that ticket, went on holiday, then investigated more and didn't have the time to update it (it was not trivial to know the exact cause when I first reported it). Hoo asked for my opinion recently, and he told me to share it here. Wikidata is *not the only issue creating lag*, but it is probably the only one where things get worse under ROW-based replication. My past experience of reporting high-impact production database issues using the Wikidata tag is that they get discarded as "low" importance. I am currently asking for extra hardware to isolate Wikidata on its own shard (s8), so it doesn't affect other production data (especially dewiki), as happened during a previous schema change.

@jcrespo Yeah, I understand the problem with updating eu_touched. Marius came up with a way to ditch that field completely, and we'll do that. It should reduce the number of updates a lot. So this worked out OK in the end, I think.
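
For context, dropping the field boils down to a schema change of this shape (the actual change is tracked in T124737 and T125838; this is only an illustration):

  -- Removing the per-row timestamp means touchUsageBatch updates are no longer
  -- needed at all.
  ALTER TABLE wbc_entity_usage DROP COLUMN eu_touched;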

I just want to make sure we have a working communication channel for this kind of thing. When Wikidata is causing issues, I want a big red "you are breaking the wiki" thing on my dashboard somehow. And you should have someone to poke about it, instead of filing a ticket that then gets lost somewhere in the dustbin of project management.

Critical operational issues should not be degraded to "low" priority. If this happens, please poke me (or Lydia or Jan or Marius or Adam or Katie) directly. I recommend you add the "Wikidata" umbrella, and if nothing happens, complain to us. I hope that will improve the situation. We could also establish a monthly meeting with ops, if that would help. But a month is a long time... @Lydia_Pintscher, what do you think?

No more meetings, please! :-) Someone hanging out in #wikimedia-operations or #wikimedia-databases would be enough so I can ping someone about ongoing issues, suggestions, etc.

Some awareness of the work being done on reducing lag and on the multi-datacenter effort is welcome, but not compulsory (you will not be working on that; it is mostly performance/architecture/ops work, but some development blockers may arise from it). Find me on one of those channels at any time. Thank you.

You should also add me and the DBA project to tickets such as T97414 (!), T113763, T114902, T68025, T59750, and to any Gerrit change that involves a schema change (I only comment if there is a big problem); that makes schema changes go faster.

@jcrespo ok, will do! Conversely, you can find us at #wikidata on IRC if there is any trouble.

Yesterday at ~19:00 UTC there was a 2x increase in the number of UPDATEs on almost all wikis. I tracked this down to large amounts of UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ queries. Is there anything that could have caused that? What could have doubled the amount of queries being done?

UPDATE_increase.png (458×1 px, 46 KB)


Yesterday I converted string properties to external-id ones, and after that, at 12:50 (UTC), I bumped the cache epoch on Wikidata. But I don't think that should have had a noticeable impact beyond Wikidata (if at all).