
Special:UnconnectedPages for main namespace is slow (ca. 10 seconds)
Closed, ResolvedPublicSecurity

Description

SpecialUnconnectedPages is currently at the top of the slow queries dashboard – it’s not the slowest query (ca. 10 seconds on average), but its comparatively high volume pushes it to the #1 spot by sum time (ca. 900 seconds per hour). While this is currently an acute problem on dewiki, in principle all wikis are vulnerable to this issue. We should fix it.

The query is the one that the special page generates when filtering by a namespace, specifically the main namespace (other namespaces are faster):

SELECT /* Wikibase\Client\Specials\SpecialUnconnectedPages::reallyDoQuery */
  page_id AS `value`, page_namespace AS `namespace`, page_title AS `title`, 0 AS `page_num_iwlinks`
FROM `page`
LEFT JOIN `page_props` ON ((page_id = pp_page) AND pp_propname IN ('wikibase_item','expectedUnconnectedPage'))
WHERE (page_namespace = 0) AND page_is_redirect = 0 AND (pp_propname IS NULL)
ORDER BY value DESC
LIMIT 101

Solution summary:
We introduced a new page prop, unexpectedUnconnectedPage, with a maintenance script to populate it. The page prop is set for non-redirect pages in a Wikibase-enabled namespace that are neither connected to a Wikibase item nor have the __EXPECTED_UNCONNECTED_PAGE__ magic word set. Its value is the negative namespace number; thus, sorting by descending pp_sortkey and pp_page produces first the newest unexpected unconnected pages in namespace 0 (“newest” meaning the ones with the highest page ID), then older pages in that namespace, then newest pages in the next namespace (e.g. 2), and so on. The new page prop makes it easy to efficiently find all unexpected unconnected pages, either within a certain namespace or across all Wikibase-enabled namespaces.
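For illustration, this is roughly the query shape the new page prop enables when filtering by the main namespace – a minimal sketch, assuming core’s pp_propname_sortkey_page index on (pp_propname, pp_sortkey, pp_page); the actual special page additionally joins page to fetch titles:

SELECT pp_page AS value, 0 AS namespace
FROM page_props
WHERE pp_propname = 'unexpectedUnconnectedPage'
  AND pp_sortkey = 0 -- main namespace: the stored value is the negated namespace number
ORDER BY pp_page DESC -- newest (highest page ID) first
LIMIT 101;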

MariaDB [commonswiki]> SELECT NOW() AS asof\G SELECT -pp_sortkey AS ns, COUNT(*) AS count FROM page_props WHERE pp_propname = 'unexpectedUnconnectedPage' GROUP BY pp_sortkey ORDER BY COUNT(*) DESC;
*************************** 1. row ***************************
asof: 2022-10-10 15:39:15
1 row in set (0.001 sec)

+------+---------+
| ns   | count   |
+------+---------+
|   14 | 7162706 |
|    4 | 1134432 |
|   10 |  252933 |
|  486 |   60843 |
|  100 |   54691 |
|    0 |   35032 |
|  102 |   17000 |
|  106 |    6443 |
|  828 |    1021 |
|  460 |     768 |
|   12 |     768 |
|  490 |     160 |
|  104 |      24 |
+------+---------+
16 rows in set (2.648 sec)

(The implementation and migration were complicated by the fact that we initially made the page prop value the positive namespace number, and then had to follow up with additional Gerrit changes and a second round of maintenance script runs to turn it negative for better sorting.)


Wikibase changes:

Config changes:

Maintenance script runs:

Maintenance script runs, second round (inverted namespace):

Details

Author Affiliation
Wikimedia Deutschland

Event Timeline


Even though it's small, it's a rather new feature and might grow. We can also push communities to adopt it more; they might not be aware this exists.

Grow to such an extent that it would become a serious problem? (And we could still remove the page prop later if it does turn out to be a problem…)

What I mean is the opposite: if we go in the direction of changing the page prop to be "unexpected unconnected", then the more we push for this feature, the fewer page props would exist for it. If you think we need to have both for now, I don't think it would cause issues; the page_props table is rather small.

I don’t think we need both ourselves, once the migration is done, but I think there’s some value in the expectedUnconnectedPage page prop, so if there’s no strong need to discard it I’d keep it.

Sure, but if it gets too big, I'll make you normalize it. Deal?

D:

I’d say we could delete the page prop before that, but sure, I could also try to normalize the table at some point ^^ I’m assuming a NameTableStore for all the prop names is okay; enwiki only has 38 different page prop names, so loading them all at once shouldn’t be problematic

Yup, it's a very reasonable and straightforward normalization but the table has not grown to the point that we need to do it.
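For context, the normalization being discussed would look roughly like this – a hypothetical sketch in the NameTableStore style, with made-up table and column names:

CREATE TABLE page_prop_names (
  ppn_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ppn_name VARBINARY(60) NOT NULL,
  UNIQUE KEY ppn_name (ppn_name)
);
-- page_props would then store a small ppn_id instead of repeating
-- the pp_propname string on every row.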

It seems like we have consensus to keep the old page prop around… I'll bring this up in the daily tomorrow and unless there are objections, I'll change my patches to keep it.

I’ve uploaded two config changes to start rolling out the migration stage (WRITE_BOTH, i.e. don’t use it yet): Beta and Test Wikidata clients. I assume these should be okay to deploy next week.

This (currently restricted) task appears to be a duplicate of the (public) T297515. I don't see an obvious way to merge these without making things difficult for both parties involved. So here's a comment instead.

That's technically not the same; the title is misleading. It is about a Toolforge tool developer wanting to adopt the expected unconnected pages feature. It doesn't say anything about the query being slow.

I think once the new expectedUnconnectedPage prop is fully populated, Mike Peel will definitely want to use it (for the same reason we want to use it: query go brrr), but until then we can leave both tasks as they are.

> I’ve uploaded two config changes to start rolling out the migration stage (WRITE_BOTH, i.e. don’t use it yet): Beta and Test Wikidata clients. I assume these should be okay to deploy next week.

With these changes plus a third one deployed, we’re now at WRITE_BOTH everywhere, beta and prod. Beta also has the new page prop everywhere, since the maintenance script to backfill it was run as part of update.php: see beta-update-databases-eqiad #57469. Next steps (could be in either order):

  • switch Beta to MIGRATION_NEW
  • run the maintenance script to backfill the page prop on all production wikis

The maintenance script turned out to have a small bug (bugfix), so beta-update-databases-eqiad #57627 backfilled more rows on some wikis, especially simplewiki.

On Test Wikidata clients, the page prop was populated and is already being used now. I also ran the maintenance script on Commons, which took about fifty minutes in total; in Grafana MySQL Aggregated (s4), you can see a clear impact on the number of rows written (twelve spikes, corresponding to twelve maintenance script runs, see also task description):

Screenshot 2022-04-13 at 17-21-21 MySQL Aggregated - Grafana.png (258×1 px, 57 KB)

However, I’d characterize the impact on maxlag as very much acceptable, at least once I added --batch-size 500 to override the default 1000:

Screenshot 2022-04-13 at 17-21-50 MySQL Aggregated - Grafana.png (258×1 px, 28 KB)

The first three runs were without --batch-size, corresponding to the first two spikes in that screenshot. I recommend we use --batch-size 500 for all remaining maintenance script runs as well.
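For illustration, one batch of the backfill does roughly the following – a sketch only, with made-up batch bounds and an illustrative namespace list; the real logic lives in the maintenance script, which at this stage still wrote the positive namespace number (the second round later inverted it):

INSERT IGNORE INTO page_props (pp_page, pp_propname, pp_value, pp_sortkey)
SELECT page_id, 'unexpectedUnconnectedPage', page_namespace, page_namespace
FROM page
LEFT JOIN page_props ON pp_page = page_id
  AND pp_propname IN ('wikibase_item', 'expectedUnconnectedPage')
WHERE page_id BETWEEN 1 AND 500 -- one --batch-size 500 window
  AND page_is_redirect = 0
  AND page_namespace IN (0, 4, 10, 14) -- illustrative Wikibase-enabled namespaces
  AND pp_propname IS NULL;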

Commons is now at some eight million unexpected unconnected pages – mostly categories, with another significant portion in the project namespace:

MariaDB [commonswiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'unexpectedUnconnectedPage';
+----------+
| COUNT(*) |
+----------+
|  8268893 |
+----------+
1 row in set (2.271 sec)

MariaDB [commonswiki]> SELECT pp_sortkey, COUNT(*) FROM page_props WHERE pp_propname = 'unexpectedUnconnectedPage' GROUP BY pp_sortkey ORDER BY COUNT(*) DESC LIMIT 8;
+------------+----------+
| pp_sortkey | COUNT(*) |
+------------+----------+
|         14 |  6754123 |
|          4 |  1098488 |
|         10 |   248380 |
|        486 |    59286 |
|        100 |    51722 |
|          0 |    32019 |
|        102 |    15881 |
|        106 |     6327 |
+------------+----------+
8 rows in set (2.505 sec)
hoo updated the task description.

The page prop is being written now and the maintenance script ran on all (applicable) wikis.

Now we can read from the new page prop in production (I'll probably make the switch tomorrow).

Afterwards we can change the default in Wikibase and make this public at some point thereafter.

Apparently we set the page prop mistakenly in some cases; the following is a redirect on dewiki:

MariaDB [dewiki]> SELECT * FROM page_props WHERE pp_page = 12196262;
+----------+---------------------------+----------+------------+
| pp_page  | pp_propname               | pp_value | pp_sortkey |
+----------+---------------------------+----------+------------+
| 12196262 | unexpectedUnconnectedPage | 0        |          0 |
+----------+---------------------------+----------+------------+
1 row in set (0.001 sec)

MariaDB [dewiki]> SELECT * FROM page WHERE page_id = 12196262\G
*************************** 1. row ***************************
           page_id: 12196262
    page_namespace: 0
        page_title: Rottenoktett
  page_is_redirect: 1
…
      page_touched: 20220406124246
page_links_updated: 20220406060935
…
1 row in set (0.001 sec)

Given the recent page_links_updated (which we don't touch in our maintenance script), this was probably introduced by ClientParserOutputDataUpdater::setUnexpectedUnconnectedPage.

> Given the recent page_links_updated (which we don't touch in our maintenance script), this was probably introduced by ClientParserOutputDataUpdater::setUnexpectedUnconnectedPage.

Yeah, I suspect we need to check $content->isRedirect() instead of $title->isRedirect() there. (We don’t currently pass the $content into ClientParserOutputDataUpdater – only the $parserOutput, where AFAICT it’s not available – but ParserOutputUpdateHookHandler::onContentAlterParserOutput() has a $content argument, we just need to pass it through.)

> Given the recent page_links_updated (which we don't touch in our maintenance script), this was probably introduced by ClientParserOutputDataUpdater::setUnexpectedUnconnectedPage.
>
> Yeah, I suspect we need to check $content->isRedirect() instead of $title->isRedirect() there. (We don’t currently pass the $content into ClientParserOutputDataUpdater – only the $parserOutput, where AFAICT it’s not available – but ParserOutputUpdateHookHandler::onContentAlterParserOutput() has a $content argument, we just need to pass it through.)

That sounds good to me. Besides that, we will also need a solution for the mistakenly added page props we have now… we could probably tinker with the migration maintenance script to support this or manually force a links update on all pages with the page prop and a recent enough page_links_updated (once the fix is out, of course).

Edit: Disregard this comment, I accidentally checked for non-redirects instead of redirects >.<

> That sounds good to me. Besides that, we will also need a solution for the mistakenly added page props we have now… we could probably tinker with the migration maintenance script to support this or manually force a links update on all pages with the page prop and a recent enough page_links_updated (once the fix is out, of course).

I did a check on stat1007:

while IFS= read -r wiki; do rows=$(sudo -u analytics-wmde analytics-mysql -N "$wiki" <<< "SELECT COUNT(*) FROM page JOIN page_props ON pp_page = page_id AND pp_propname = 'unexpectedUnconnectedPage' WHERE page_is_redirect = 1;"); printf '%8d\t%s\n' "$rows" "$wiki"; done < ~/wikidataclient.dblist | tee >(sort -n | tee ~/T300770-redirects-2)

Only enwiki has over 10k pages that have the page prop despite being redirects, and only eight others have more than 1k:

 1274	ukwiki
 1775	eswiki
 1890	jawiki
 2264	specieswiki
 2350	kowiki
 2979	frwiki
 3309	ruwiki
 3617	zhwiki
13139	enwiki

(Full list at ~lucaswerkmeister-wmde/T300770-redirects-2 on stat1007.) Those numbers might be low enough to just force purges with link updates… but we could also do it with another maintenance script directly at the SQL / page_props level. The numbers will go up slightly over the long Easter weekend, of course, but probably not by very much.
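If we went the direct SQL route, the per-wiki cleanup could look roughly like this – just a sketch of the idea, not something we’ve run:

DELETE pp FROM page_props pp
JOIN page ON page_id = pp.pp_page
WHERE pp.pp_propname = 'unexpectedUnconnectedPage'
  AND page_is_redirect = 1;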

Proposed fix: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/780869

I tested it on mwdebug1002 and it seemed to work – without the patch, a newly created redirect got the unexpectedUnconnectedPage page prop; with the patch, no page prop to be seen.

I think ClientParserOutputDataUpdater::updateTrackingCategories() might have a similar bug, by the way (it also uses $title->isRedirect()) – it looks like dewiki has 2k redirects that are linked to Wikidata but aren’t in the tracking category (though the category still has 18k pages in it, so the feature isn’t completely broken either):

MariaDB [dewiki]> SELECT COUNT(*) FROM page JOIN page_props ON page_id = pp_page AND pp_propname = 'wikibase_item' LEFT JOIN categorylinks ON page_id = cl_from AND cl_to = 'Wikipedia:Weiterleitung_mit_Wikidata-Datenobjekt' WHERE page_namespace = 0 AND page_is_redirect = 1 AND cl_from IS NULL;
+----------+
| COUNT(*) |
+----------+
|     2076 |
+----------+
1 row in set (13.879 sec)

But we can look into that after this is fully fixed.

With the latest config change (SAL), this should finally be fixed. Let’s verify it in the slow-queries log after a few days.

Correction: we should still fix the pagination to not use OFFSET before we close this task (and also clean up the support for the older migration stages, once we’re sure it’s no longer needed).

Slow queries log looking good so far.

> Let’s verify it in the slow-queries log after a few days.

I think we can call this a success:

image.png (327×913 px, 31 KB)

(The tiny blip on 2022-05-05 is T307647: 2022-05-05 Wikimedia full site outage, when all sorts of queries started slowing down – here, it was Title::newFromID and LinkCache::fetchPageRow, two each, with SpecialUnconnectedPages somewhere in the stack trace. Unrelated to this issue.)

Thank you for doing this. It shows a significant reduction in the total number of slow queries, basically cutting it in half!

As far as I’m concerned, we could probably make this task public by now – users have already started to notice the different sort order (T307751 – CCing @Stang because I think they’re trustworthy ^^), and I think third-party wikis are unlikely to be affected.

Thanks @Lucas_Werkmeister_WMDE! Actually a user posted a question on the VP in our community, saying that the current sorting order does not match their maintenance experience: they tend to look for newly created pages without a linked Wikidata item (and to create items for them). I'm not an expert in optimizing query statements, so I can't comment on the patches... Maybe develop a tool on Toolforge to offer the latest "unconnected" pages?

The options we have for efficient database sorting are:

  • let users sort by descending page ID when only a single namespace is selected (which is efficient for the database) – either automatically depending on the selected namespace(s), or with a checkbox that’s only shown/used when the namespace selector isn’t set to “all”
  • let users sort by descending namespace number and page ID (also efficient for the database) – not very desirable because people will see non-article pages first
  • change the page prop so that it contains the negative namespace, and sorting by descending page prop and page ID (in the database) effectively sorts by ascending namespace but descending page ID (i.e. newest pages first within each namespace) – sketched below
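
To make the third option concrete, a sketch of the resulting cross-namespace listing (column alias illustrative):

SELECT pp_page, -pp_sortkey AS ns
FROM page_props
WHERE pp_propname = 'unexpectedUnconnectedPage'
ORDER BY pp_sortkey DESC, pp_page DESC -- ascending namespace, newest pages first within each
LIMIT 101;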

Also, correcting myself from earlier:

> Correction: we should still fix the pagination to not use OFFSET before we close this task

This isn’t possible within the QueryPage framework (which assumes a single numerical offset), so it’ll take more work and should be done later. (It’ll also be a breaking change according to the stable interface policy, since decoupling the special page from the QueryPage framework will remove support for list=querypage&qppage=UnconnectedPages in the API; we’d probably want to add it as a dedicated list= parameter instead, if support for this as an API is needed at all.)
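For reference, OFFSET-free pagination would be a keyset (“seek”) query along these lines, where :last_sortkey and :last_page stand for the values from the last row of the previous page (placeholders, not actual QueryPage code):

SELECT pp_page, -pp_sortkey AS ns
FROM page_props
WHERE pp_propname = 'unexpectedUnconnectedPage'
  AND (pp_sortkey < :last_sortkey
    OR (pp_sortkey = :last_sortkey AND pp_page < :last_page))
ORDER BY pp_sortkey DESC, pp_page DESC
LIMIT 101;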

Given this is merged now, I will migrate the test wikis this week (probably Thursday) and the rest sometime next week (Tuesday?). I'll keep this task updated.

Today we successfully deployed the new format (inverted namespace number as pp_value/pp_sortkey) to the "wikidataclient-test" wikis.

The deploy window today didn't suffice; although only the biggest wikis are done, about two-thirds of the rows have been updated. I've scheduled a new window for tomorrow, 08:30–10:00 UTC (10:30–12:00 UTC+2).

Today when deploying we noticed that sometimes old entries (with positive namespace id as value) can slip back in when a link update is performed from an older parser output (from parser cache). Given this happens rarely, we decided to just ignore this for now and run the maintenance scripts again in a month to fix these cases.

For third parties this could in theory also prove problematic, but given the non-permanent nature of page props and the estimated negligible impact, we can probably safely ignore this edge case here.

This is fully deployed in production now!

Due to the issue explained above, we should re-visit this once all parser caches have expired (which is in 21 days, October 27 11:15 UTC) and run the migration script again, as needed. I'll note this down and take care of it.

For gathering the number of legacy format rows per wiki, one can use:

foreachwikiindblist wikidataclient sql.php --json --query "SELECT COUNT(*) FROM page_props WHERE pp_propname = 'unexpectedUnconnectedPage' AND pp_sortkey > 0;" | sed -E 's/(.*):.*"COUNT.*"([0-9]+)"/\1 \2/' | grep -P '[^:]+ [0-9]+$'

Pipe this to sort -r -k2 -n to get the wikis sorted by rows to update or through grep -oP '\d+' | paste -s -d '+' | bc to get the overall row count.

\o/

Thank you @hoo and @Lucas_Werkmeister_WMDE for staying on top of this!

Once everything else is done, IMHO we should backport the cleanup of this to REL1_39, so that it’s working out of the box in that release (without the config). Maybe we can squash some of the changes together for that. (But at the moment this is blocked on T319390.)

Lucas_Werkmeister_WMDE added a subscriber: Mike_Peel.

I think this is fully resolved at last – Special:UnconnectedPages hasn’t been seen in the slow queries in a while, on dewiki it’s fast regardless of namespace filter, and we backported all this to the upcoming 1.39 release too.

I think we can close this task, make it public, and finally tell @Mike_Peel that we’ve been on-and-off working on the slow database query he identified in T297515 for the better part of this year :D

Oh wow, I didn't realise I'd sent you down a rabbit hole, sorry - and really nice work fixing this!

I just noticed there aren’t any security team members subscribed to this task anymore (afaict) – @Reedy, maybe you can make the task public for us? (Asking you since you moved it into SecTeam-Processed a while ago.)

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)". Oct 11 2022, 3:39 PM
Reedy changed the edit policy from "Custom Policy" to "All Users".

Hi, using the new database structure, I've run into a problem: I think it's not excluding one or both of noindex and hiddencat - which normally should not be connected to Wikidata. Please could someone check this?

The proper solution for this is to add the magic word for expected unconnected pages to the same templates that add hiddencat or noindex (e.g. Template:User page). I'm not 100% sure about hiddencat, for example; some should have the magic word, some shouldn't.

@Ladsgroup The specific case I came across was AfC categories on enwiki, like https://en.wikipedia.org/wiki/Category:AfC_submissions_by_date/07_January_2021 – those should be excluded, but weren't.

Ah, I see, thanks. Looking a bit more, there are examples where they are noindex but probably they do want to be in Wikidata, e.g., https://en.wikipedia.org/wiki/Category:Serial_murders_in_France - so excluding noindex probably wouldn't work well anyway. Will have to look into this further before deciding how the bot should approach this now.

Mentioned in SAL (#wikimedia-operations) [2022-10-27T20:47:00Z] <hoo> Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for commonswiki (T300770)

Mentioned in SAL (#wikimedia-operations) [2022-10-27T20:47:42Z] <hoo> Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for enwiki, enwiktionary (T300770)

I just did the final cleanup of this (see T300770#8286874 / T300770#8290231): we had 18,961 invalid entries for Commons, 84 for enwiki, and 5 for enwiktionary (and none for other wikis).