
Audit query pages on Wikidata
Open, Needs Triage · Public

Description

These are query pages in Wikidata:

They fall into three categories:

  • We should disable them; they don't make sense for Wikidata.
  • We need to move it out of query pages or find a way for it to work. Currently it's disabled or causing issues in production, even though it's needed.
  • It's okay to stay as is.

Query pages show cached data and information for editors, such as clean-up lists. To build the cache, cronjobs in production run really heavy queries once a week (for each special page on each wiki), and they take a really long time to finish (sometimes ten hours). We had to disable those jobs regardless of whether they are important or not. Now it's time to audit them, find a solution for the ones that matter, and avoid wasteful DB queries for special pages that have never been used. I feel the best solution for some of those is to move them out of MediaWiki and get the data from Hadoop instead, but I don't know how that can be done in the analytics cluster.


Original report:

  • AncientPages - last update in November 2019 (said to be updated twice a month)
  • DeadendPages - last update in November 2019 (said to be updated twice a month)
  • FewestRevisions - reported to be slow (T238199), last update in November 2019
  • LonelyPages - seems unlikely any of those 5000 pages will get fixed
  • MostCategories - empty, items cannot be categorized, useless for Wikidata
  • MostInterwikis - empty, seems to only work with "old" interwiki, useless for Wikidata

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 21 2020, 11:10 AM

Update:

  • The first three pages were updated meanwhile.
  • LonelyPages says:

Updates for this page are currently disabled. Data here will not presently be refreshed.

... but got an update in March.

  • The last two are still empty.
  • MostLinkedPages hasn't been updated since September but still shows:

Updates for this page are running twice a month.

I wonder if I did something wrong in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/573969

These QueryPages have been causing issues for Wikidata's database for a long time now. I was planning to create a ticket to go through them, clean up the ones that can be disabled, and try to find a solution for the others.

I will write a full list soon.

This is an important ticket, and thank you for creating it, @matej_suchanek. I hope it's fine that I hijack it.

Ladsgroup renamed this task from Disable or revive some query pages on Wikidata to Audit query pages on Wikidata. · Apr 16 2020, 5:29 PM
Ladsgroup updated the task description.

I hope it's fine that I hijack it.

Totally fine.

Ladsgroup added a comment. · Edited · Apr 17 2020, 12:23 PM

On the API side, these are the request counts I got since the first of April (17 days):

SELECT params['qppage'], params['gqppage'], count(*) AS hitcount
FROM event.mediawiki_api_request
WHERE year = 2020 AND month = 4
  AND params['action'] = 'query'
  AND (params['list'] = 'querypage' OR params['generator'] = 'querypage')
  AND meta.domain = "www.wikidata.org"
GROUP BY params['qppage'], params['gqppage'];

_c0	_c1	hitcount
NULL	NULL	124
DoubleRedirects	NULL	17
NULL	Wantedpages	120
Longpages	NULL	112
NULL	UnconnectedPages	120
NULL	Wantedfiles	120
NULL	Wantedcategories	120
NULL	Wantedtemplates	120
NULL	Longpages	114
NULL	Ancientpages	117
Ancientpages	NULL	118
NULL	Unwatchedpages	115
NULL	DoubleRedirects	66
NULL	BrokenRedirects	18
Shortpages	NULL	114
15 rows selected (361.954 seconds)
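
Because the same page can be requested either through list=querypage (qppage) or generator=querypage (gqppage), each page may appear in two rows above. A minimal sketch, assuming the rows are copied by hand from that output, of how the two columns could be folded into one total per page. The helper name hits_per_page and the "(unspecified)" label for the NULL/NULL row are illustrative choices, not part of any existing tooling:

```python
from collections import Counter

# Rows copied from the query output above: (qppage, gqppage, hitcount).
# None stands for NULL.
ROWS = [
    (None, None, 124),
    ("DoubleRedirects", None, 17),
    (None, "Wantedpages", 120),
    ("Longpages", None, 112),
    (None, "UnconnectedPages", 120),
    (None, "Wantedfiles", 120),
    (None, "Wantedcategories", 120),
    (None, "Wantedtemplates", 120),
    (None, "Longpages", 114),
    (None, "Ancientpages", 117),
    ("Ancientpages", None, 118),
    (None, "Unwatchedpages", 115),
    (None, "DoubleRedirects", 66),
    (None, "BrokenRedirects", 18),
    ("Shortpages", None, 114),
]


def hits_per_page(rows):
    """Sum list= and generator= hits for each query page."""
    totals = Counter()
    for qppage, gqppage, hits in rows:
        # Use whichever parameter was set; label rows where neither was.
        page = qppage or gqppage or "(unspecified)"
        totals[page] += hits
    return totals


totals = hits_per_page(ROWS)
for page, hits in totals.most_common():
    print(f"{page}\t{hits}")
```

This only regroups the numbers already posted; it does not explain where the NULL/NULL requests come from.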

Figuring out the NULL ones and trying to get request numbers for the special pages now.

SELECT uri_path, count(*)
FROM wmf.webrequest
WHERE year = 2020 AND month = 4
  AND agent_type = "user"
  AND uri_host = "www.wikidata.org"
  AND uri_path LIKE '/wiki/Special:%'
  AND lower(uri_path) IN (
    '/wiki/special:ancientpages', '/wiki/special:brokenredirects', '/wiki/special:deadendpages',
    '/wiki/special:doubleredirects', '/wiki/special:fileduplicatesearch', '/wiki/special:listduplicatedfiles',
    '/wiki/special:linksearch', '/wiki/special:listredirects', '/wiki/special:lonelypages',
    '/wiki/special:longpages', '/wiki/special:mediastatistics', '/wiki/special:mimesearch',
    '/wiki/special:mostcategories', '/wiki/special:mostimages', '/wiki/special:mostinterwikis',
    '/wiki/special:mostlinkedcategories', '/wiki/special:mostlinkedtemplates', '/wiki/special:mostlinked',
    '/wiki/special:mostrevisions', '/wiki/special:fewestrevisions', '/wiki/special:shortpages',
    '/wiki/special:uncategorizedcategories', '/wiki/special:uncategorizedpages', '/wiki/special:uncategorizedimages',
    '/wiki/special:uncategorizedtemplates', '/wiki/special:unusedcategories', '/wiki/special:unusedimages',
    '/wiki/special:wantedcategories', '/wiki/special:wantedfiles', '/wiki/special:wantedpages',
    '/wiki/special:wantedtemplates', '/wiki/special:unwatchedpages', '/wiki/special:unusedtemplates',
    '/wiki/special:withoutinterwiki'
  )
GROUP BY uri_path;
/wiki/Special:MostInterwikis	40
/wiki/Special:MediaStatistics	38
/wiki/Special:UnusedTemplates	41
/wiki/Special:ListDuplicatedFiles	35
/wiki/Special:UnwatchedPages	1
/wiki/Special:LongPages	78
/wiki/Special:FileDuplicateSearch	34
/wiki/Special:MostRevisions	39
/wiki/Special:MostLinkedCategories	43
/wiki/Special:MIMESearch	43
/wiki/Special:WantedCategories	58
/wiki/Special:Linksearch	6
/wiki/Special:Unusedcategories	1
/wiki/Special:Unusedimages	1
/wiki/Special:UnusedCategories	39
/wiki/Special:MIMEsearch	2
/wiki/Special:Listredirects	1
/wiki/Special:Withoutinterwiki	1
/wiki/Special:UncategorizedTemplates	37
/wiki/Special:Mostcategories	1
/wiki/Special:WantedTemplates	40
/wiki/Special:MostCategories	42
/wiki/Special:WantedPages	45
/wiki/Special:UncategorizedPages	41
/wiki/Special:ShortPages	48
/wiki/Special:LonelyPages	47
/wiki/Special:AncientPages	50
/wiki/Special:Uncategorizedpages	1
/wiki/Special:WantedFiles	45
/wiki/Special:Ancientpages	1
/wiki/Special:Wantedfiles	1
/wiki/Special:Mostlinkedtemplates	1
/wiki/Special:Mostimages	2
/wiki/Special:DoubleRedirects	44
/wiki/Special:BrokenRedirects	66
/wiki/Special:UncategorizedCategories	45
/wiki/Special:LinkSearch	86
/wiki/Special:ListRedirects	47
/wiki/Special:DeadendPages	47
/wiki/Special:WithoutInterwiki	45
/wiki/Special:FewestRevisions	49
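
The query filters on lower(uri_path) but groups by the raw uri_path, so capitalisation variants of the same page (for example Special:UnusedCategories and Special:Unusedcategories) show up as separate rows. A minimal sketch of merging such variants after the fact, using a handful of rows copied from the output above; the helper name merge_case_variants is hypothetical, and grouping by lower(uri_path) in the query itself would presumably give the same result:

```python
from collections import Counter

# A subset of rows copied from the output above: (uri_path, hits).
ROWS = [
    ("/wiki/Special:UnusedCategories", 39),
    ("/wiki/Special:Unusedcategories", 1),
    ("/wiki/Special:MIMESearch", 43),
    ("/wiki/Special:MIMEsearch", 2),
    ("/wiki/Special:LinkSearch", 86),
    ("/wiki/Special:Linksearch", 6),
    ("/wiki/Special:AncientPages", 50),
    ("/wiki/Special:Ancientpages", 1),
]


def merge_case_variants(rows):
    """Sum hits for paths that differ only in letter case."""
    totals = Counter()
    for path, hits in rows:
        totals[path.lower()] += hits
    return totals


merged = merge_case_variants(ROWS)
for path, hits in merged.most_common():
    print(f"{path}\t{hits}")
```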
Ladsgroup updated the task description. · Apr 17 2020, 3:41 PM

I had to kill updateSpecialPages.php wikidatawiki --override --only=Fewestrevisions as it was causing issues - tracking task: T238199

@Marostegui Hey, can you give us the list of the ones that are the most problematic at the DB level? I know we had to disable some already.

This one, T238199, is the one I have had to kill the most lately.
I also recall MostLinked being pretty heavy as well.

I have raised the priority of T238199 (SpecialFewestRevisions::reallyDoQuery takes more than 9h to run) to High, as it has again caused errors in production.
We need that cronjob either fixed or disabled. Every time it runs we have to kill it. Luckily that has always happened during the week, when we are available, but during weekends it can cause more errors until someone gets to it.

@Marostegui Thanks. Is there any other page that is also causing problems (not enough to be killed, but still enough to be annoying)?