
Query Special Pages periodically fail.
Closed, ResolvedPublic

Description

Maintenance scripts (/usr/local/bin/mwscriptwikiset updateSpecialPages.php) whose database queries run for longer than 1 minute are killed automatically.

Original report:

Cron jobs on all Wikipedias seem to be broken. These used to run twice a week but have not shown up at all since Sept 16th. See this.

Event Timeline


How about other wikis? Turkish Wikipedia does not seem to be getting any updates.

It still hasn't processed all wikis; there is a reason these are done by cron job. Be patient.

Are we talking about manually running cron tasks here?

No.

Same problem with the twice-a-month queries. See "Special:DeadendPages" for example.

krenair@terbium:/var/log/mediawiki/updateSpecialPages$ cat *DeadendPages.log
The specified dblist file, /srv/mediawiki/dblist/s1.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s2.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s3.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s4.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s5.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s6.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/s7.dblist, could not be found.
The specified dblist file, /srv/mediawiki/dblist/silver.dblist, could not be found.

All from Oct 23 01:00.

/dblist/ should be /dblists/, and the code doing this check is wrong in a different way. Separate cause.

Change 249310 had a related patch set uploaded (by Alex Monk):
Fix mwscriptwikiset dblists paths

https://gerrit.wikimedia.org/r/249310

Change 249310 merged by Dzahn:
Fix mwscriptwikiset dblists paths

https://gerrit.wikimedia.org/r/249310

Hello, @Krenair. It still does not work, two run days later.

More and more things just keep turning out to be broken...

krenair@mw1152:/var/log/mediawiki/updateSpecialPages$ head -n 6 s7@17-DeadendPages.log
-------------------------------------
arwiki
-------------------------------------
Deadendpages                   [QueryPage] [Mon Nov  9 01:01:03 2015] [hphp] [28528:7ff8f7192d00:0:000001] [] SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT /* DeadendPagesPage::reallyDoQuery  */  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace IN ('104','0')  AND page_is_redirect = '0'  ORDER BY page_namespace,page_title LIMIT 5000  
[Mon Nov  9 01:02:03 2015] [hphp] [28528:7ff8f7192d00:0:000002] [] SlowTimer [59999ms] at runtime/ext_mysql: slow query: SELECT /* DeadendPagesPage::reallyDoQuery  */  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace IN ('104','0')  AND page_is_redirect = '0'  ORDER BY page_namespace,page_title LIMIT 5000  
FAILED: database error

@Krenair The first two lines are normal warnings.

The last line does not provide any useful information. Basing a query service on cron is in general a bad idea: the MySQL *service* is guaranteed, but that is not true of any particular server, which can be unavailable for many reasons (network problems, hardware failure, maintenance). That is why it is important for the client to periodically refresh its connection information and retry automatically, up to a limited number of times.
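
(That reconnect-and-retry pattern is not shown anywhere in this task; purely as an illustration, here is a minimal Python sketch of what it could look like on the client side, assuming pymysql, an arbitrary retry limit and backoff. It is not MediaWiki's actual implementation.)

# Illustrative sketch only, not MediaWiki code; pymysql, the retry limit
# and the backoff values are assumptions.
import time
import pymysql

def run_with_retry(sql, params=None, max_attempts=3, backoff=5, **conn_args):
    """Run a query, reconnecting and retrying if the server goes away."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            # Open a fresh connection on every attempt.
            conn = pymysql.connect(read_timeout=600, **conn_args)
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except pymysql.err.OperationalError as error:
            # e.g. errno 2013, "Lost connection to MySQL server during query"
            last_error = error
            time.sleep(backoff * attempt)  # simple linear backoff before retrying
        finally:
            if conn is not None:
                try:
                    conn.close()
                except pymysql.err.Error:
                    pass  # connection already gone; nothing to clean up
    raise last_error

# Hypothetical usage:
# run_with_retry(query, host='...', user='wikiadmin', password='...', database='enwiki')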

Having said that, it could be a permission problem (maybe due to the new client?). It would be nice if you could try it manually and see if it fails immediately.

As a separate issue, that particular query could be improved significantly:

MariaDB PRODUCTION s7 localhost arwiki > EXPLAIN SELECT /* DeadendPagesPage::reallyDoQuery  */  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace IN ('104','0')  AND page_is_redirect = '0'  ORDER BY page_namespace,page_title LIMIT 5000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: range
possible_keys: name_title,page_redirect_namespace_len
          key: name_title
      key_len: 4
          ref: NULL
         rows: 1227848
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: pl_from
          key: pl_from
      key_len: 4
          ref: arwiki.page.page_id
         rows: 15
        Extra: Using where; Using index; Not exists
2 rows in set (0.00 sec)

MariaDB PRODUCTION s7 localhost arwiki > EXPLAIN SELECT /* DeadendPagesPage::reallyDoQuery  */  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` FORCE INDEX (name_title) LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace IN ('104','0')  AND page_is_redirect = '0'  ORDER BY page_namespace,page_title LIMIT 5000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: range
possible_keys: name_title
          key: name_title
      key_len: 4
          ref: NULL
         rows: 1227851
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: pl_from
          key: pl_from
      key_len: 4
          ref: arwiki.page.page_id
         rows: 15
        Extra: Using where; Using index; Not exists
2 rows in set (0.00 sec)

The second version takes only 15 minutes for arwiki.

But for larger wikis I would recommend splitting it into several smaller tasks (by page id, for example, as sketched below) so that long-running summary-like jobs are not lost.
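
(A hedged sketch of that split, in Python: the batch width, the run_query() helper, and the final in-memory sort are assumptions for illustration, not how updateSpecialPages.php actually works.)

# Hypothetical sketch: batch the Deadendpages query by page_id ranges so no
# single statement scans the whole page table in one go. BATCH and
# run_query() are assumptions, not MediaWiki code.
BATCH = 100000  # arbitrary page_id range width per chunk

def deadend_pages_batched(run_query, max_page_id):
    """run_query(sql, params) is assumed to return rows as
    (namespace, title, value) tuples."""
    results = []
    for start in range(0, max_page_id + 1, BATCH):
        rows = run_query(
            """
            SELECT page_namespace AS namespace,
                   page_title     AS title,
                   page_title     AS value
            FROM page
            LEFT JOIN pagelinks ON page_id = pl_from
            WHERE pl_from IS NULL
              AND page_namespace IN (0, 104)
              AND page_is_redirect = 0
              AND page_id >= %s AND page_id < %s
            """,
            (start, start + BATCH),
        )
        results.extend(rows)
    # Apply the original query's ORDER BY and LIMIT over the combined chunks.
    results.sort(key=lambda row: (row[0], row[1]))
    return results[:5000]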

I tried it again manually and DeadendPages succeeded on arwiki but still fails on enwiki, and probably others.

I tried it again manually [...] but still fails on enwiki, and probably others.

Does it fail immediately, or does it take some time?

I see it now in the logs. I think it is being killed automatically by MediaWiki (because of lag, or because it exceeds MediaWiki's own limit):

{
  "_index": "logstash-2015.11.11",
  "_type": "mediawiki",
  "_id": "AVD2_z7ZlAIL90ZzKijR",
  "_score": null,
  "_source": {
    "message": "DeadendPagesPage::reallyDoQuery\t10.64.32.23\t2013\tLost connection to MySQL server during query (10.64.32.23)\tSELECT  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace = '0' AND page_is_redirect = '0'  ORDER BY page_title LIMIT 5000  ",
    "@version": 1,
    "@timestamp": "2015-11-11T14:42:13.000Z",
    "type": "mediawiki",
    "host": "mw1152",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_trimmed"
    ],
    "channel": "wfLogDBError",
    "uid": "705acdc",
    "process_id": 10216,
    "wiki": "enwiki",
    "db_server": "10.64.32.23",
    "db_name": "enwiki",
    "db_user": "wikiadmin",
    "method": "DatabaseBase::reportQueryError",
    "errno": 2013,
    "error": "Lost connection to MySQL server during query (10.64.32.23)",
    "sql1line": "SELECT  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace = '0' AND page_is_redirect = '0'  ORDER BY page_title LIMIT 5000  ",
    "fname": "DeadendPagesPage::reallyDoQuery",
    "normalized_message": "DeadendPagesPage::reallyDoQuery\t10.64.32.23\t2013\tLost connection to MySQL server during query (10.64.32.23)\tSELECT  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_fr"
  },
  "sort": [
    1447252933000
  ]
}
Krenair removed a project: Patch-For-Review.

@Superyetkin @IKhitron, thanks to @Krenair's progress, I think the issue has improved, at least partially.

I still see the following recent queries failing on enwiki:

BrokenRedirects
Mostimages
Mostinterwikis
Uncategorizedpages
Uncategorizedimages
Wantedcategories
Wantedfiles
DisambiguationPageLinks
GadgetUsage

I will try to add more debugging to understand the actual reason behind it.

Thank you, @jcrespo, but you should know that all the three-day queries (you mentioned many of them) work well on hewiki; only BrokenRedirects (twice a month) wasn't updated. It's something unique to hewiki, I think.

Yesterday was another run day. Still nothing on hewiki.

It's still the same error as last time, @IKhitron.

I thought it was fixed, based on "I think the issue has improved".

If we thought the issue was fully fixed again, this would have been set back to resolved fixed.

Again, I thought it was only left open because of the enwiki problem.

No, I've been monitoring the errors to understand the impact (it is impacting several wikis). It is on my short term backlog.

Something interesting. Today [[Special:LonelyPages]] wasn't updated on hewiki. The rest of the 72-hour queries worked well.

I see some concerning issues (connection problems) with db1053, one of the databases handling vslow queries, that may or may not be related. I am going to perform some maintenance and an upgrade on it, then reevaluate this issue.

I've debugged it, and this is not a Database/infrastructure problem:

Query starts:

-------------------------------------
enwiki
-------------------------------------

      Id: 2618
    User: wikiadmin
    Host: 10.64.32.13:37239
      db: enwiki
 Command: Query
    Time: 26
   State: Copying to tmp table on disk
    Info: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000
Progress: 0.000

A second connection is used after 1 minute (when the slow timer warns). It seems that the script thinks that the first one has failed:

Mostlinkedtemplates            [QueryPage] [Thu Nov 26 11:32:05 2015] [hphp] [25884:7fa3cb166d00:0:000001] [] SlowTimer [59999ms] at runtime/ext_mysql: slow query: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000

*************************** 8. row ***************************
      Id: 2618
    User: wikiadmin
    Host: 10.64.32.13:37239
      db: enwiki
 Command: Query
    Time: 60
   State: Copying to tmp table on disk
    Info: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000
Progress: 0.000
...
*************************** 10. row ***************************
      Id: 2659
    User: wikiadmin
    Host: 10.64.32.13:37272
      db: enwiki
 Command: Query
    Time: 0
   State: converting HEAP to Aria
    Info: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000
Progress: 0.000
10 rows in set (0.00 sec)

The second timer goes off:

[Thu Nov 26 11:33:05 2015] [hphp] [25884:7fa3cb166d00:0:000002] [] SlowTimer [59999ms] at runtime/ext_mysql: slow query: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000

And after exactly 2 minutes (two slow-query warnings), the script fails:

FAILED: database error

This is MediaWiki killing itself after 1 minute; the maintenance setup is not well configured. The database works as intended; in fact, the queries continue being executed on MySQL (which is bad in itself):

*************************** 10. row ***************************
      Id: 2729
    User: wikiadmin
    Host: 10.64.32.13:37377
      db: enwiki
 Command: Query
    Time: 445
   State: Copying to tmp table on disk
    Info: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000
Progress: 0.000
*************************** 11. row ***************************
      Id: 2770
    User: wikiadmin
    Host: 10.64.32.13:37411
      db: enwiki
 Command: Query
    Time: 385
   State: Copying to tmp table on disk
    Info: SELECT /* MostlinkedTemplatesPage::reallyDoQuery  */  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000
Progress: 0.000

This probably means that all maintenance scripts that take more than 1 minute are failing.
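
(As an aside on those orphaned server-side queries: they keep consuming resources long after the client has given up. Purely as an illustration of the clean-up an operator could script, a sketch that finds and kills long-running wikiadmin statements via information_schema; the user filter and the threshold are assumptions, not existing WMF tooling.)

# Illustration only: find and kill wikiadmin queries running longer than a
# threshold. The user name and threshold are assumptions.
import pymysql

def kill_orphaned_queries(conn, user="wikiadmin", max_seconds=600):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT ID, TIME, INFO
            FROM information_schema.PROCESSLIST
            WHERE USER = %s AND COMMAND = 'Query' AND TIME > %s
            """,
            (user, max_seconds),
        )
        for thread_id, seconds, info in cur.fetchall():
            # Log what is being killed, then terminate only the statement.
            print("killing %s after %ss: %s" % (thread_id, seconds, (info or "")[:80]))
            cur.execute("KILL QUERY %s", (thread_id,))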

From your tests, in logstash we have:

2015-11-26T11:32:05.000Z mediawiki.DBPerformance.INFO

DatabaseBase::query: lost connection to 10.64.32.23; reconnected:
Database.php line 1025 calls wfBacktrace()
Database.php line 1572 calls DatabaseBase->query()
QueryPage.php line 406 calls DatabaseBase->select()
QueryPage.php line 310 calls QueryPage->reallyDoQuery()
updateSpecialPages.php line 84 calls QueryPage->recache()
doMaintenance.php line 103 calls UpdateSpecialPages->execute()
updateSpecialPages.php line 164 calls include()
MWScript.php line 97 calls include()

And exactly a minute later:

2015-11-26T11:33:05.000Z mediawiki.wfLogDBError.ERROR

MostlinkedTemplatesPage::reallyDoQuery	10.64.32.23	2013	Lost connection to MySQL server during query (10.64.32.23)
SELECT  tl_namespace AS namespace,tl_title AS title,COUNT(*) AS value  FROM `templatelinks`    GROUP BY tl_namespace,tl_title ORDER BY value DESC LIMIT 5000

@Joe finally found it: long-running queries were being killed after the server migration due to an obscure and undocumented variable. This was on top of the other issues that had already been resolved. The setting has now been automated in puppet so it does not happen again.
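
(The variable itself is not named in this task. Purely as a hypothetical illustration of the kind of setting involved, MariaDB's max_statement_time, which kills statements exceeding a time limit when set to a non-zero value, is one such knob; a quick way to inspect it and related timeouts might look like this.)

# Hypothetical illustration: the actual variable from this incident is not
# named here. max_statement_time (MariaDB; 0 means no limit) is just one
# example of a setting that silently kills long-running statements.
def show_timeout_settings(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_statement_time'")
        for name, value in cur.fetchall():
            print(name, "=", value)
        cur.execute("SHOW GLOBAL VARIABLES LIKE '%timeout%'")
        for name, value in cur.fetchall():
            print(name, "=", value)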

New queries are being run right now to refresh the reported special pages; however, it will take some time to take effect.

@IKhitron I've just updated https://he.wikipedia.org/wiki/Special:BrokenRedirects as a proof that it is finally working.

If someone has a special interest in a fast-to-generate special page (nothing on commons/enwiki/wikidata), I can run it; if not, I will let the normal schedule update all pages and close this as resolved.

Thank you, @jcrespo, but Broken redirects was not a problem on our wiki. Could you do this for he:Special:DeadendPages, please, so we can see it's really OK? Thank you in advance.

@IKhitron, as you can see on the page, it says:

Updates for this page are currently disabled. Data here will not presently be refreshed.

Which means it has been explicitly disabled from running in the configuration. That is a separate issue from this one, and if you want to ask for it to be reenabled, you need to do so on a separate ticket (a feature request, project mediawiki-configuration, I suppose). I cannot just override that configuration without prior approval.

To clarify, I am not saying that it cannot be enabled; it is just that it cannot be enabled without involving the people who disabled it in the first place (usually that happens due to a performance issue). To ping those people, I ask you to create a different ticket requesting that it be enabled.

Sorry and thanks for your understanding.

I see. Weird. Thank you, @jcrespo.

Actually we use a cron to update DeadendPages despite MW thinking it is disabled (it uses --override). I just ran it for hewiki:

krenair@terbium:~$ mwscript updateSpecialPages.php hewiki --override --only Deadendpages
Deadendpages                   [QueryPage] [Thu Nov 26 17:08:16 2015] [hphp] [16820:7f54000eed00:0:000001] [] SlowTimer [237681ms] at runtime/ext_mysql: slow query: SELECT /* DeadendPagesPage::reallyDoQuery  */  page_namespace AS namespace,page_title AS title,page_title AS value  FROM `page` LEFT JOIN `pagelinks` ON ((page_id=pl_from))  WHERE (pl_from IS NULL) AND page_namespace = '0' AND page_is_redirect = '0'  ORDER BY page_title LIMIT 5000  
got 8 rows in 3m 57.71s
krenair@terbium:~$
Krenair claimed this task.

Thank you, @Krenair. Of course, I can't ask you every two weeks to do this manually, so the new task is still needed.

I expect the cron to work properly next time though.

@IKhitron, @Krenair mentioned that it will now be done regularly, like the rest of the pages.

Oh, I see. Very well, let's wait.
Thank you.

Something weird, @Krenair: it got 8 results, and 5 of them were irrelevant.

That should probably be a separate bug against MediaWiki-Special-pages

IKhitron renamed this task from Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 to Query Special Pages periodically fail. (Dec 18 2015, 1:47 PM)
IKhitron reopened this task as Open.

Hello again. The run day of Dec 11 for all 72-hour queries on hewiki was, surprisingly, Dec 12. There were no updates at all on the run days of Dec 14 and Dec 17. Thank you.

This is currently running on terbium:

UID        PID  PPID  C STIME TTY          TIME CMD
www-data 20519 20510  0 Dec16 ?        00:00:00 /bin/bash /usr/local/bin/foreachwikiindblist /srv/mediawiki-staging/dblists/all.dblist updateSpecialPages.php

Last entry in /var/log/mediawiki/updateSpecialPages.log is this:

enwiki:  GadgetUsage                    [QueryPage] [Fri Dec 18 08:42:18 2015] [hphp] [12275:7f564703ad00:0:000013] [] SlowTimer [86399999ms] at runtime/ext_mysql: slow query: SELECT /* SpecialGadgetUsage::reallyDoQuery  */  up_property AS title,SUM( up_value ) AS value,COUNT( qcc_title ) AS namespace  FROM `user_properties` LEFT JOIN `user` ON ((up_user = user_id)) LEFT JOIN `querycachetwo` ON ((user_name = qcc_title) AND (qcc_type = "activeusers") AND (up_value = 1))  WHERE (up_property LIKE 'gadget-%' )  GROUP BY up_property ORDER BY up_property DESC LIMIT 5000

All of them were updated yesterday. Thank you.

@Krenair: The current GadgetUsage query fails on English Wikipedia because it takes over 24 hours. We merged a new feature flag for it last week (https://gerrit.wikimedia.org/r/#/c/259684/) and will be switching English Wikipedia to a simpler query this week (that doesn't give active user stats). On most wikis, the query takes less than a second, and the next slowest besides English Wikipedia is Commons, which takes less than an hour. Unfortunately, the user_properties table on English Wikipedia is just too bloated (T54777).

Same thing again. I think you have a problem with a timer.

It comes at some weird times, @Krenair. I'm sure the problem is with the timer.

Do you mean the update happens at some times other than expected, @IKhitron?

I don't think that counts as failing. We expect some slowness (as you go further down the list of wikis sorted alphabetically by DB name) due to the way that it has to go through big updates on lots of wikis before getting to others.

If it just ran later, it would be as you said. But if instead of Dec
11, 14, 17, 20, 23, 26
it ran on Dec
12, --, 19, --, 25, --,
I believe it's a serious problem, @Krenair.


Are you sure it's not a TZ issue?

I have no idea what a TZ or cache issue is.

I don't consider this a serious issue. I'm not saying there's no issue. At least you're getting semi-regular runs to update the data. For a long time, we didn't have any.

Please continue to monitor the issue, and see how things look.

As above, it's got to work through a lot of wikis and run a lot of long-running SQL queries. It's not going to be quick.