Page MenuHomePhabricator

{{PAGESINCATEGORY}} returns incorrect value on en-wiki Category:Candidates for speedy deletion
Closed, ResolvedPublicPRODUCTION ERROR

Description

See https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#CSD_backlog? for details.

Basically, the magic word {{PAGESINCATEGORY}} returns far higher numbers for this category (at the time of writing this: 319) then viewing the category does (at the time of writing this: 10) and no one can figure out why.

Event Timeline

Looks to be an ancient problem related to T18036

I manually refreshed the count for just this category. Let's see if it gets bad again.

I manually refreshed the count for just this category. Let's see if it gets bad again.

Count says 355 at the moment with 151 entries actually in the category

Some data:

  • The database (and on-wiki {{PAGESINCATEGORY}}) currently shows 463 pages.
  • The categorylinks table has 31 entries referring to existing pages, and 718 entries referring to pages that don't currently exist.
  • The categorylinks from pages that don't exist have cl_timestamp from 2018-06-01 04:01:47 to 2018-06-06 14:46:15 (UTC).
  • The categorylinks from pages that don't exist all have deletion log entries with log_timestamp between 2018-06-01 08:54:55 and 2018-06-06 15:49:20 (UTC).

All this makes me suspect that the deleteLinks jobs aren't running in a timely manner. @mobrovac or @Pchelolo, can you tell if the enwiki deleteLinks jobs are backlogged in the new EventBus job queue, consistent with these timestamps?

BTW, the discrepancy between 432 extra "count" and 718 broken categorylinks could be explained by the fact that various code paths recount the category whenever the saved "count" is less than 100 and the recount ignores the broken categorylinks. So until the "count" manages to exceed 100 the recounts would keep correcting for the jobs not running, and bad "count" would only pile up once the recounts are no longer triggered.

All this makes me suspect that the deleteLinks jobs aren't running in a timely manner. @mobrovac or @Pchelolo, can you tell if the enwiki deleteLinks jobs are backlogged in the new EventBus job queue, consistent with these timestamps?

Hm... Interesting that the last deleteLinks submitted into kafka I see is from 2018-05-31T15:21:05+00:00 for User:Lothar_Hakelberg/Benutzernavigation on commons.

The graphs for rates of this particular one looks very suspicious as well. I will come back with more data tomorrow EU morning.

I've been looking into this for a while now, but with no luck - the deleteLinks job just doesn't get posted and I couldn't locate any errors in JobQueue or mediawiki logs..

It looks like with HHVM we don't do LinksDeletionUpdates as a job, I was mistaken. Sorry for having you look.

Anomie added a subscriber: aaron.

Digging into this deeper, I see an increase in database deadlocks at the end of April, roughly correlating with the deployment of 1.32.0-wmf.1. Specifically, I'm looking at log messages like these:

2018-06-11 07:46:24 [Wx4oxQpAEMIAAL0WdSoAAADD] mw1313 commonswiki 1.32.0-wmf.7 DBQuery ERROR: Category::refreshCounts   10.64.48.23     1213    Deadlock found when trying to get lock; try restarting transaction (10.64.48.23)        UPDATE  `category` SET
cat_pages = '2',cat_subcats = '0',cat_files = '2' WHERE cat_title = 'Duplicate' {"db_server":"10.64.48.23","db_name":"commonswiki","db_user":"wikiuser","method":"Wikimedia\\Rdbms\\Database::makeQueryException","errno":1213,"error":"Deadlock found when trying to get lock; try restarting transaction (10.64.48.23)","sql1line":"UPDATE  `category` SET cat_pages = '2',cat_subcats = '0',cat_files = '2' WHERE cat_title = 'Duplicate'","fname":"Category::refreshCounts"}
2018-06-11 14:13:40 [Wx6DjQpAAE4AAC30QoIAAABQ] mw1283 commonswiki 1.32.0-wmf.7 DBQuery ERROR: LinksDeletionUpdate::batchDeleteByPK      10.64.48.23     1213    Deadlock found when trying to get lock; try restarting transaction (10.64.48.23)        DELETE FROM `categorylinks` WHERE (cl_from = '65163833' AND cl_to = '1966_coins') OR (cl_from = '65163833' AND cl_to = 'CC-BY-SA-4.0') OR (cl_from = '65163833' AND cl_to = 'Commemorative_10_Mark_coins_of_East_Germany') OR (cl_from = '65163833' AND cl_to = 'Deletion_requests_June_2018') OR (cl_from = '65163833' AND cl_to = 'Karl_Friedrich_Schinkel') OR (cl_from = '65163833' AND cl_to = 'Self-published_work') {"db_server":"10.64.48.23","db_name":"commonswiki","db_user":"wikiuser","method":"Wikimedia\\Rdbms\\Database::makeQueryException","errno":1213,"error":"Deadlock found when trying to get lock; try restarting transaction (10.64.48.23)","sql1line":"DELETE FROM `categorylinks` WHERE (cl_from = '65163833' AND cl_to = '1966_coins') OR (cl_from = '65163833' AND cl_to = 'CC-BY-SA-4.0') OR (cl_from = '65163833' AND cl_to = 'Commemorative_10_Mark_coins_of_East_Germany') OR (cl_from = '65163833' AND cl_to = 'Deletion_requests_June_2018') OR (cl_from = '65163833' AND cl_to = 'Karl_Friedrich_Schinkel') OR (cl_from = '65163833' AND cl_to = 'Self-published_work')","fname":"LinksDeletionUpdate::batchDeleteByPK"}

The second is what would be causing this bug: when the error occurs in LinksDeletionUpdate, the whole transaction has to be rolled back and the update doesn't get retried.

Datein LinksDeletionUpdatein refreshCounts
04-2000
04-2100
04-2200
04-2302
04-2402
04-25024
04-26422
04-2746196
04-28180240
04-29380160
04-30260272

Breaking log messages since June 1 down by wiki,

Wikiin LinksDeletionUpdatein refreshCounts
enwiki31621782
commonswiki10021822
zhwiki0162
kowiki898
ruwiki140
ptwiki08
dewiki06
viwiki02
thwiki20
plwiki02
idwiki02
huwiki02
arwiki02

I see two changes in 1.32.0-wmf.1 that seem like they might be relevant:

The first looks more suspicious to me. It causes Category::refreshCounts() to be called much more often, and for categories that probably aren't empty. On a wiki with heavy deletion activity in a particular category (such as enwiki's Category:Candidates for speedy deletion, which exists specifically to hold pages to be deleted), this seems like it could easily wind up with the refresh's transaction deadlocking with transactions that are trying to delete rows in the category being refreshed.

The second being the cause would require that there be some difference in locking behavior between these two SQL queries, with the second somehow locking more rows to cause the deadlocks:

SELECT  COUNT(*) AS `rowcount`  FROM (SELECT  1  FROM `categorylinks`,`page`    WHERE cl_to = 'Foobar' AND (page_id = cl_from)   LOCK IN SHARE MODE) `tmp_count`
SELECT  COUNT(*) AS `pages`,COUNT( (CASE WHEN page_namespace = '14' THEN 1 ELSE NULL END) ) AS `subcats`,COUNT( (CASE WHEN page_namespace = '6' THEN 1 ELSE NULL END) ) AS `files`  FROM `categorylinks`,`page`    WHERE cl_to = 'Foobar' AND (page_id = cl_from)  LIMIT 1  LOCK IN SHARE MODE

I don't have any explanation for why the orphaned categorylinks rows on enwiki are only showing up since 2018-06-01.

Change 439809 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reduce frequency of refreshCounts() calls in LinksDeletionUpdate

https://gerrit.wikimedia.org/r/439809

Change 439946 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Category: Lock the category row before the categorylinks rows

https://gerrit.wikimedia.org/r/439946

Change 439946 merged by jenkins-bot:
[mediawiki/core@master] Category: Lock the category row before the categorylinks rows

https://gerrit.wikimedia.org/r/439946

Change 440053 had a related patch set uploaded (by Aaron Schulz; owner: Anomie):
[mediawiki/core@wmf/1.32.0-wmf.8] Category: Lock the category row before the categorylinks rows

https://gerrit.wikimedia.org/r/440053

Change 440053 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.8] Category: Lock the category row before the categorylinks rows

https://gerrit.wikimedia.org/r/440053

Mentioned in SAL (#wikimedia-operations) [2018-06-13T14:15:56Z] <anomie@deploy1001> Synchronized php-1.32.0-wmf.8/includes/Category.php: Backporting fix for T195397 ([[gerrit:440053]]) (duration: 01m 00s)

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.
jcrespo subscribed.

This was known to me. This points me I should be more vocal about Wikimedia-production-error I find in the future. Related: T165675

Anomie claimed this task.

This looks likely to be resolved now: the change was backported to wmf.8, and no deadlocks related to this task have been logged since that went out.

I also went through and recounted most categories that were logged as having these deadlocks. I skipped some that have more than 10000 rows, as categories with too many rows take too long to recount. Note that "10000" is an arbitrary limit, I picked a round number that seemed large enough that count inaccuracy wouldn't make that much difference and small enough that the queries completed quickly.

The categories not recounted were:

1commonswiki: Self-published_work
2commonswiki: With_insignia
3commonswiki: Uploaded_with_Mobile/Android
4commonswiki: Template_Unknown_(author)
5commonswiki: Photographs_by_Rob_Croes
6commonswiki: Vector_version_available
7commonswiki: Unidentified_logos
8commonswiki: Uploaded_via_Campaign:wle-es
9commonswiki: With_trademark
10commonswiki: CC-Zero
11commonswiki: Pages_with_maps
12commonswiki: People
13commonswiki: Flickr_images_reviewed_by_FlickreviewR_2
14commonswiki: Uploaded_via_Campaign:wle-in
15commonswiki: Media_needing_category_review_with_2_suggested_categories
16commonswiki: Media_needing_categories_as_of_24_February_2017
17commonswiki: USGov_files_uploaded_by_OceanAtoll
18commonswiki: WLE_2016_Pakistan_unreviewed
19commonswiki: User_page_images
20commonswiki: Music
21commonswiki: CC-BY-2.5
22commonswiki: Uploaded_with_en.wp_upload_wizard
23commonswiki: PD-AR-Photo
24commonswiki: Files_with_no_machine-readable_source
25commonswiki: CC-BY-SA-4.0
26commonswiki: WebM_videos
27commonswiki: Images_with_extracted_images
28commonswiki: PD_US_Government
29commonswiki: Uploaded_with_pattypan
30enwiki: Noindexed_pages
31enwiki: Wikipedia_non-free_files_with_valid_backlink
32enwiki: Wikipedia_non-free_files_with_redirect_backlink
33enwiki: Screenshots_of_television
34enwiki: Vector_images_of_trademarks
35enwiki: Wikipedia_non-free_historic_files
36enwiki: Pages_with_missing_files
37enwiki: Wikipedia_orphaned_files
38enwiki: WikiProject_Vietnam_articles
39enwiki: Redirects_from_modifications
40enwiki: Redirects_from_moves
41enwiki: Wikipedia_soft_redirected_categories
42enwiki: Wikipedia_template_categories
43enwiki: Pages_with_reference_errors
44enwiki: WikiProject_Football_articles
45enwiki: Unprintworthy_redirects
46enwiki: Interlanguage_link_template_link_number
47enwiki: Pages_containing_links_to_subscription-only_content
48enwiki: Wikipedia_license_migration_redundant
49enwiki: Music_infoboxes_with_deprecated_parameters
50enwiki: Wikipedia_license_migration_completed
51enwiki: Webarchive_template_wayback_links
52enwiki: Underpopulated_categories
53enwiki: User_talk_pages_with_Uw-spam2_notices
54enwiki: Wikipedia_sockpuppets
55enwiki: Wikipedia_files_tagged_as_own_work
56enwiki: Pages_translated_from_German_Wikipedia
57enwiki: Pages_to_import_images_to_Wikidata
58enwiki: Suspected_Wikipedia_sockpuppets
59enwiki: Place_of_birth_missing_(living_people)
60enwiki: Declined_AfC_submissions
61enwiki: Pages_translated_from_French_Wikipedia
62enwiki: Singlechart_called_without_song
63enwiki: Pages_with_citations_lacking_titles
64enwiki: Pages_using_citations_with_accessdate_and_no_URL
65enwiki: Pages_using_web_citations_with_no_URL
66eswiki: Wikipedia:Páginas_no_indexadas
67fawiki: رده‌های_پنهان
68sourceswiki: Proofread

Please don't run count(*) + LOCK IN SHARE MODE on the masters or you will create contention. This run was conflicting with HTMLCacheUpdates and created a bunch of fatals: https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1529061367667&to=1529069430000

It looks like that locking was added in 2008 in rMW80a5874828c2: This is a schema change. It's only a table creation, but the table must be… so the counts wouldn't be inaccurate due to concurrent deletions. Normally the method is only called when the category is expected to have very few rows in the first place.

I see most issues arose from updates to things like 'CC-BY-SA-4.0', and 'Self-published_work', which may have dozens of millions of members.

Change 439809 merged by jenkins-bot:
[mediawiki/core@master] Reduce frequency of refreshCounts() calls in LinksDeletionUpdate

https://gerrit.wikimedia.org/r/439809

Vvjjkkii renamed this task from {{PAGESINCATEGORY}} returns incorrect value on en-wiki Category:Candidates for speedy deletion to mfcaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Anomie as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Community_Tech_bot renamed this task from mfcaaaaaaa to {{PAGESINCATEGORY}} returns incorrect value on en-wiki Category:Candidates for speedy deletion.Jul 1 2018, 6:20 AM
Community_Tech_bot closed this task as Resolved.
Community_Tech_bot assigned this task to Anomie.
Community_Tech_bot updated the task description. (Show Details)
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM