
Large number of CategoryMembershipChangeJob::run updates are failing
Closed, Resolved · Public · PRODUCTION ERROR

Description

After handling an incident (T147747), I saw a large number of jobs complaining in the MediaWiki logs:

DatabaseMysqlBase::lock failed to acquire lock 'CategoryMembershipUpdates:1354575'
https://logstash.wikimedia.org/goto/158cebca05fd93303c3d15d46a1cb0ad (over 400 of those per minute, 99.8% of them on wikidatawiki).
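
For context, MediaWiki's DatabaseMysqlBase::lock() is built on MySQL's named advisory locks, so the failure above is essentially a GET_LOCK() timeout. A minimal sketch of the pattern (the exact lock-name encoding and the timeout value are MediaWiki implementation details, shown here only for illustration):

-- Advisory-lock pattern behind DatabaseMysqlBase::lock() (illustrative values)
SELECT GET_LOCK('CategoryMembershipUpdates:1354575', 3) AS acquired;
-- 1 = acquired, 0 = timed out (what the log line above reports), NULL = error
-- ... perform the category membership update ...
SELECT RELEASE_LOCK('CategoryMembershipUpdates:1354575');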

Before that started happening, I saw several long-running queries, probably due only to the initial incident:

SELECT /* CategoryMembershipChangeJob::run */ rev_id, rev_page, rev_text_id, rev_timestamp, rev_comment, rev_user_text, rev_user, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, rev_content_format, rev_content_model FROM `revision` WHERE rev_page = '7165976' AND (rev_timestamp > '20161009084342' OR (rev_timestamp = '20161009084342' AND rev_id > 0)) ORDER BY rev_timestamp ASC, rev_id ASC
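
As an aside, the (rev_timestamp, rev_id) condition in that query is the usual OR-expansion of a composite-key comparison; the row-constructor form below is logically equivalent (a sketch for readability only; the expanded form is what MediaWiki emits, presumably because older MySQL optimizers handled row constructors poorly):

-- Equivalent row-constructor form of the WHERE clause above (illustration only)
SELECT rev_id, rev_page, rev_timestamp
FROM revision
WHERE rev_page = 7165976
  AND (rev_timestamp, rev_id) > ('20161009084342', 0)
ORDER BY rev_timestamp ASC, rev_id ASC;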

Could this be related to the changes in T133801?

This has been happening at higher rates since 2016-10-07. I believe it also happens on wikis other than wikidatawiki, but at roughly a 1:100 ratio compared to all other wikis combined, including dewiki, which shares resources with wikidatawiki.

Event Timeline

jcrespo triaged this task as Unbreak Now! priority. Oct 10 2016, 5:14 AM

Unbreak Now! because it makes the logs impossible for me to read right now while unrelated ongoing issues are happening. Just stopping the logging would fix that.

Probably related, queries like this are causing thousands of errors a day:

Hits 	Tmax (s) 	Tavg (s) 	Tsum (s) 	Hosts 	Users 	Schemas
11910	33	4	52,440	db1087	wikiuser	wikidatawiki
SELECT /* Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities */ term_entity_type, term_entity_id, term_type, term_language, term_text FROM `wb_terms` WHERE term_entity_type = 'property' AND term_entity_id IN ('21', '143', '244', '214', '227', '31', '569', '268', '27', '106', '1440', '1006', '248', '813', '854', '19', '735', '1087', '585', '641', '1344', '54', '580', '582', '1665', '1412', '2962', '18') AND term_type IN ('label', 'description') AND term_language = 'en' /* ba7efbdd5ad83b5cce3089a4db829fe1 db1087 wikidatawiki 4s */

Blocking all main s5 servers.

It is clear to me that the direct cause of the connection issues is pileups coming from queries such as:

      Id: 6451784424
    User: wikiuser
    Host: 10.64.32.35:38688
      db: wikidatawiki
 Command: Query
    Time: 18
   State: Sending data
    Info: SELECT /* Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities  */  term_entity_type,term_entity_id,term_type,term_language,term_text  FROM `wb_terms`    WHERE term_entity_type = 'property' AND term_entity_id IN ('373','374','248','143','17','131','582','580','625','813','31','646','577','1566','18','214','268','856','2046','94','281','242','47','1082','585')  AND term_type IN ('label','description')  AND term_language = 'en'
Progress: 0.000

These take over the available connections, as each can take 10-30 seconds to run.
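
For reference, a pileup like this can also be spotted from information_schema; a rough sketch (the 10-second threshold and the LIKE pattern are just illustrative):

-- Count connections currently busy with the slow wb_terms query (illustration)
SELECT COUNT(*) AS stuck_connections, MAX(TIME) AS longest_seconds
FROM information_schema.PROCESSLIST
WHERE DB = 'wikidatawiki'
  AND COMMAND = 'Query'
  AND TIME > 10
  AND INFO LIKE '%collectTermsForEntities%';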

We will disable these Jobs for (most of) Wikidata with T126977. I'll try to get this done and deployed ASAP.

Mentioned in SAL (#wikimedia-operations) [2016-10-10T12:06:49Z] <hoo@tin> Synchronized php-1.28.0-wmf.21/extensions/Wikidata: Update Wikibase, add EntityHandler::supportsCategories (T147748) (duration: 02m 25s)

The change has been deployed and the number of Jobs scheduled on Wikidata should drop significantly.

It will take some time before the existing queue has drained, though. You might want to lower the priority to High.

hoo lowered the priority of this task from Unbreak Now! to High. Oct 10 2016, 3:17 PM

Decreasing priority as the queue is down to 0.

Thank you for the quick response; I am already seeing the effect, so I am more than OK with putting this as High or Normal (sorry I didn't see it sooner).

Is the slow Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities related to the job or is it a separate issue?

> Is the slow Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities related to the job or is it a separate issue?

Separate issue… I already looked into this briefly before and think I have a potential fix. Will probably pick this up tomorrow (it's high on my to-do list).

I've benchmarked the above query very briefly (picked db1092 for that, all queries were run five times in a row(!)).

Original query

SELECT /* Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities  */  term_entity_type,term_entity_id,term_type,term_language,term_text  FROM `wb_terms`    WHERE term_entity_type = 'property' AND term_entity_id IN ('373','374','248','143','17','131','582','580','625','813','31','646','577','1566','18','214','268','856','2046','94','281','242','47','1082','585')  AND term_type IN ('label','description')  AND term_language = 'en';

5.324s on average


One query per term_type (the above query, but with term_type = 'description' or term_type = 'label' instead of term_type IN ('label','description')):
0.028s on average (for both queries combined).

I conclude that we should run one query per term type here and simply merge the result in the application code.
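
In other words, the single two-range query becomes two single-range queries like the following (a sketch with a shortened IN list), and the two result sets are merged in PHP:

-- One query per term_type; each branch has a single equality on term_type
SELECT term_entity_type, term_entity_id, term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_type = 'property'
  AND term_entity_id IN ('373','374','248')
  AND term_type = 'label'
  AND term_language = 'en';

SELECT term_entity_type, term_entity_id, term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_type = 'property'
  AND term_entity_id IN ('373','374','248')
  AND term_type = 'description'
  AND term_language = 'en';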

Change 315616 had a related patch set uploaded (by Hoo man):
Use one query per term type in SqlEntityInfoBuilder::collectTermsForEntities

https://gerrit.wikimedia.org/r/315616

> I conclude that we should run one query per term type here and simply merge the result in the application code.

Yes, this is a property of B-tree indexes: two ranges cannot be handled efficiently, as an index can only handle one range at a time. You could even merge both into UNION ALLs, but again, as usual, I suggest going for the largest gain ASAP.
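
For completeness, the UNION ALL variant suggested here would look roughly like this (a sketch with a shortened IN list); each branch keeps a single equality on term_type, so each can use the index on its own:

SELECT term_entity_id, term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_type = 'property'
  AND term_entity_id IN ('373','374','248')
  AND term_type = 'label'
  AND term_language = 'en'
UNION ALL
SELECT term_entity_id, term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_type = 'property'
  AND term_entity_id IN ('373','374','248')
  AND term_type = 'description'
  AND term_language = 'en';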

> You could even merge both into UNION ALLs, but again, as usual, I suggest going for the largest gain ASAP.

Sadly this is still a major headache in tests because we use temporary tables there… which makes MySQL sad.
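
For context on the test problem: MySQL before 8.0 (and the MariaDB versions in use at the time) does not allow a TEMPORARY table to be referenced more than once in a single statement, which is exactly what a UNION ALL over wb_terms would do in the temporary-table-based test setup. A minimal illustration:

-- Illustration of the limitation (MySQL < 8.0 / MariaDB of that era)
CREATE TEMPORARY TABLE t_terms (term_type VARBINARY(16), term_text VARBINARY(255));
SELECT term_text FROM t_terms WHERE term_type = 'label'
UNION ALL
SELECT term_text FROM t_terms WHERE term_type = 'description';
-- fails with ERROR 1137: Can't reopen table: 't_terms'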

thiemowmde moved this task from Review to Done on the Wikidata-Sprint-2016-10-12 board.
thiemowmde removed a project: Patch-For-Review.
thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde subscribed.

I'm moving this to "done" in our Sprint board, but I will leave this ticket open because I do not know if the only patch mentioned here (https://gerrit.wikimedia.org/r/315616) fully resolves the issue. Please have a look.

Change 315616 merged by jenkins-bot:
Use one query per term type in SqlEntityInfoBuilder::collectTermsForEntities

https://gerrit.wikimedia.org/r/315616

Using a UNION ALL does not seem worth the hassle right now (I created a patch, but the test situation is a nightmare).

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM