
[Bug] EntityUsageTable::touchUsageBatch slow query
Closed, Declined (Public)

Description

I see some slow queries in the logs for touch usages:

UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')

We should still think about how to improve these.
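
One direction worth sketching here (purely illustrative, not what the code currently does): split the batch into smaller IN() lists and commit each chunk separately, so that no single statement has to lock a few dozen rows at once and replication can keep up between chunks.

  -- Hypothetical sketch: the same touch operation split into small, independently
  -- committed chunks; pause between chunks if replication lag is high.
  BEGIN;
  UPDATE wbc_entity_usage SET eu_touched = '20150907064933'
    WHERE eu_row_id IN ('2430343','2430344','2430345','2430346','2430347');
  COMMIT;
  BEGIN;
  UPDATE wbc_entity_usage SET eu_touched = '20150907064933'
    WHERE eu_row_id IN ('2430348','2430349','2430350','2430351','2430352');
  COMMIT;
  -- ...and so on for the remaining row ids.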

Event Timeline

aude raised the priority of this task from to Needs Triage.
aude updated the task description.
aude subscribed.

I'm a bit at a loss here. This query says "here is a set of values of the primary key, go and set field x to value y for the rows with that key". I don't see how this could be improved...

Sep 7 05:03:28 mw1005: SlowTimer [15207ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20141005013634' WHERE eu_row_id = '1266958'
Sep 7 06:49:49 mw1183: SlowTimer [15917ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch ... */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')
Sep 7 06:50:05 mw1002: SlowTimer [15802ms] at runtime/ext_mysql: slow query: UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ wbc_entity_usage SET eu_touched = '20150907064933' WHERE eu_row_id IN ('2430363','2430361','2430345','2430360','2430348','61591','61590','107383','2430347','2430358','2430343','2430362','2430365','2430351','2430353','2430354','2430344','2430356','2430350','2430366','2430355','2430359','2430352','2430357','2430346','2430349','2430364')

By comparison, term queries are significantly slower:

SlowTimer [60001ms] at runtime/ext_mysql: slow query: SELECT /* Wikibase\TermSqlIndex::getMatchingTerms */ term_entity_type,term_type,term_language,term_text,term_weight,term_entity_id FROM wb_terms WHERE ((term_language = 'en' AND (term_search_key LIKE 'foo%' ) AND term_type = 'label' AND term_entity_type = 'item') OR (term_language = 'en' AND (term_search_key LIKE 'unchar%' ) AND term_type = 'alias' AND term_entity_type = 'item')) LIMIT 5000

Updating a single row should not take 15 seconds. Something is quite wrong here.
...and it's not the query.

It could be a queueing issue: if thousands of those queries are sent at the same time (due to bot traffic, for example), they would be enqueued on the server side and executed with controlled concurrency. I could check the binary logs to see if that is the case, and profile the database server when it happens (although I would need to know on which server/wiki it happens, and catch it in the moment).
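
If that hypothesis is right, the symptom should be visible in the process list while the slow timer fires; something along these lines would show it (an illustrative check, not something already done in this ticket):

  -- Illustrative check: many identical touchUsageBatch UPDATEs sitting here,
  -- waiting on locks or in the 'updating' state, would point at queueing and
  -- concurrency rather than at the statement itself.
  SELECT id, time, state, info
    FROM information_schema.PROCESSLIST
   WHERE info LIKE '%touchUsageBatch%'
   ORDER BY time DESC;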

We recently had issues with bots generating legitimate but heavier-than-usual traffic on Wikidata; this could be related.

aude renamed this task from EntityUsageTable::touchUsageBatch slow query to [Bug] EntityUsageTable::touchUsageBatch slow query.Sep 10 2015, 10:41 AM
aude triaged this task as Medium priority.
aude set Security to None.

This may be one of the causes of T122429.

I just looked into this very briefly, and I see two cases where we issue queries like the ones listed above:

  • On LinksUpdateComplete, which should also be throttled so that it doesn't overwhelm the infrastructure
  • In AddUsagesForPageJob… maybe we should throttle that job in order to reduce concurrency.

We might also (I don't actually know) be able to reduce the number or size of these writes by comparing the data first and only updating what actually needs updating.
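
A minimal sketch of that idea, assuming eu_touched only ever moves forward: add a condition so that rows already at or past the new value are simply not touched.

  -- Sketch: only rows whose eu_touched is actually older than the new value
  -- get written; rows that are already up to date produce no modification.
  UPDATE wbc_entity_usage
     SET eu_touched = '20150907064933'
   WHERE eu_row_id IN ('2430363','2430361','2430345')
     AND eu_touched < '20150907064933';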

As per @hoo's request, I am sharing my own opinion here:

This is blocking the rollout of ROW-based replication, which should happen soon. In my own (database-focused) opinion, Wikidata updates (this is one of them) are the worst MediaWiki-related production cluster problem happening right now.

hoo closed this task as Declined (Edited). Feb 4 2016, 5:10 PM
hoo claimed this task.

We decided to completely drop the field in question in T124737; please see T125838 for that.

@jcrespo Can you write down somewhere what problems you see with Wikidata updates at the database level? Perhaps it would be useful to establish a communication channel for this kind of thing. This is the first I've heard about our updates being a blocker for switching to row-based replication.

@jcrespo That describes the problem as "When updating slow special pages from terbium, such as Listredirects, certain rows of the querycache table are deleted and then inserted." It mentions that updating slow query pages is problematic when "Other writes are happening at the same time, such as updating pagelinks or wbc_entity_usage". There is nothing there that indicates that Wikidata operations are causing issues, or what such issues might be. Nothing alerted us to the existence of that ticket, or to the relevance it might have for us.

Please, if you see any problems related to the wb_* and wbc_* tables, let us know. Adding the "Wikidata" umbrella to the ticket is generally enough. If it's not, poke me or Jan or Marius or Adam or Katie...

/* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch 127.0.0.1 */ creates lots of row updates without checking lag on the slaves. Even if a single one were short (which it is not, according to this very ticket), creating many updates without checking replication lag has the same effect. It is problematic when using statement-based replication, but even more so under ROW-based replication: it creates lag, especially over smaller-bandwidth channels such as a WAN.
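
For illustration, "checking lag on the slaves" boils down to something like the following between batches (MediaWiki has its own load-balancer helpers for this; the raw statement is shown only to make the idea concrete):

  -- On each replica, before sending the next batch: if Seconds_Behind_Master
  -- is above some threshold, back off and retry later instead of piling on
  -- more touch updates.
  SHOW SLAVE STATUS;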

It prevents s3 from using ROW-based replication, which in turn means other queries are slowly creating production data corruption that will take years to fix. Why s3? Because most of the updates there have to do with maintaining the wbc_entity_usage table, as there are a lot of small wikis using Wikidata.

Causes of lag are tracked on: T95501

I created that ticket, went on holiday, then investigated more and didn't have the time to update it (it was not trivial to know the exact cause when I first reported it). Hoo asked for my opinion recently, and he told me to share it here. Wikidata is *not the only issue creating lag*, but it is probably the only one where things get worse under ROW-based replication. My past experience of reporting high-impact production database issues using the Wikidata tag is that they get discarded as "low" importance. I am currently asking for extra hardware to isolate Wikidata on its own shard (s8), so it doesn't affect other production data (especially dewiki), as happened during a previous schema change.

@jcrespo Yeah, I understand the problem with updating eu_touched. Marius came up with a way to ditch that field completely, and we'll do that. It should reduce the number of updates a lot. So this worked out OK in the end, I think.
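
For context, dropping the field boils down to a schema change of this shape (the actual change is tracked in T124737 and T125838; this is only an illustration):

  -- Removing the per-row timestamp means touchUsageBatch updates are no longer
  -- needed at all.
  ALTER TABLE wbc_entity_usage DROP COLUMN eu_touched;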

I just want to make sure we have a working communication channel for this kind of thing. When Wikidata is causing issues, I want a big red "you are breaking the wiki" thing on my dashboard somehow. And you should have someone to poke about it, instead of filing a ticket that then gets lost somewhere in the dustbin of project management.

Critical operational issues should not be degraded to "low" priority. If this happens, please poke me (or Lydia or Jan or Marius or Adam or Katie) directly. I recommend you add the "Wikidata" umbrella, and if nothing happens, complain to us. I hope that will improve the situation. We could also establish a monthly meeting with ops, if that would help. But a month is a long time... @Lydia_Pintscher, what do you think?

No more meetings, please! :-) Someone hanging out in #wikimedia-operations or #wikimedia-databases would be enough so I can ping someone about ongoing issues, suggestions, etc.

Some awareness of the work being done on reducing lag and on the multi-datacenter effort is welcome, but not compulsory (you will not be working on that; it is mostly performance/architecture/ops work, but some development blockers may arise from it). Find me on one of those channels at any time. Thank you.

You should also add me and the DBA project to tickets such as T97414 (!), T113763, T114902, T68025, T59750, and to any Gerrit change that involves a schema change (I only comment if there is a big problem); that makes schema changes go faster.

@jcrespo ok, will do! Conversely, you can find us at #wikidata on IRC if there is any trouble.

Yesterday at ~19:00 UTC there was a 2x increase in the number of UPDATEs on almost all wikis. I tracked this down to large amounts of UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch */ queries. Is there anything that could have caused that? What could have doubled the amount of queries being done?

UPDATE_increase.png (458×1 px, 46 KB)


Yesterday I converted string properties to external-id ones, and after that, at 12:50 (UTC), I bumped the cache epoch on Wikidata. But I don't think that should have had a noticeable impact beyond Wikidata (if at all).