Page MenuHomePhabricator

Display count of remaining content space errors
Open, HighPublic

Description

Right now, it's really disheartening to know that en.WP has over 5 million obsolete HTML tag errors (low priority). Many of these are caused by e.g. <font> or <tt> or some other attribute in a non-content space.

Given our mission is to produce a Wikipedia that looks nice, it would be nice to get a count of all the errors in the content namespaces.

Event Timeline

Izno created this task.Aug 23 2017, 3:01 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 23 2017, 3:01 PM
ssastry triaged this task as Medium priority.Aug 23 2017, 5:27 PM
Izno updated the task description. (Show Details)Sep 3 2017, 2:04 AM
ssastry raised the priority of this task from Medium to High.Jan 19 2018, 3:28 PM

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

Izno added a comment.Jan 19 2018, 5:44 PM

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

There is a config variable [[https://www.mediawiki.org/wiki/Manual:$wgContentNamespaces | $wgContentNamespaces]] which is the list of all the namespaces of interest for a particular wiki. (I was careful to request content namespaces -- most projects have more than just the mainspace in the content space. ;)

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

There is a config variable [[https://www.mediawiki.org/wiki/Manual:$wgContentNamespaces | $wgContentNamespaces]] which is the list of all the namespaces of interest for a particular wiki. (I was careful to request content namespaces -- most projects have more than just the mainspace in the content space. ;)

Guilty as charged! :-) Will fix.

The current query for comparison:

wikiadmin@db1080(enwiki)>explain select linter_cat, count(*) as count from linter group by linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: NULL
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253288
        Extra: Using index
1 row in set (0.00 sec)

The proposed query:

wikiadmin@db1080(enwiki)>explain select linter_cat, count(*) as count from linter inner join page on page_id=linter_page where page_namespace=0 group by linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: linter_page
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253288
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY,name_title
          key: PRIMARY
      key_len: 4
          ref: enwiki.linter.linter_page
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

Should I be using commons for the worst-case performance scenario? In any case, this doesn't appear to be any worse than the existing query, except we're now running it twice.

I thought it would be nice if we could just cache the per namespace/category counts and sum them for the namespaces we want at run time, but that query has to filesort :/

wikiadmin@db1080(enwiki)>explain select page_namespace, linter_cat, count(*) as count from linter inner join page on page_id=linter_page group by page_namespace, linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: linter_page
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253287
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.linter.linter_page
         rows: 1
        Extra: 
2 rows in set (0.01 sec)

+DBA, We'd appreciate your review on the proposed query (second one) I posted in T173943#3933570. The current plan is that we'd be running both the current query and the proposed query when people view Special:LintErrors. Would that be OK? Alternatively, if we could run the third query, we'd be able to run only that one. But it has to filesort, so I think it's already out of question?

jcrespo added a subscriber: jcrespo.Feb 1 2018, 8:45 AM

This was bad and is still bad for the reasons mentioned here: T184280

Thanks, we'll make sure that the patch uses the estimateRowCount method so the queries aren't slow.

So, estimateRowCount doesn't appear like it will work:

wikiadmin@db1080(enwiki)>select count(*) from linter inner join page on page_id=linter_page where linter_cat=4 and page_namespace=0;
+----------+
| count(*) |
+----------+
|   735992 |
+----------+
1 row in set (5.58 sec)

wikiadmin@db1080(enwiki)>select count(*) from linter inner join page on page_id=linter_page where linter_cat=4;
+----------+
| count(*) |
+----------+
|  3894549 |
+----------+
1 row in set (4.13 sec)

wikiadmin@db1080(enwiki)>explain select count(*) from linter inner join page on page_id=linter_page where linter_cat=4 and page_namespace=0;
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
| id   | select_type | table  | type   | possible_keys                        | key                      | key_len | ref                       | rows    | Extra       |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
|    1 | SIMPLE      | linter | ref    | linter_cat_page_position,linter_page | linter_cat_page_position | 4       | const                     | 8810144 | Using index |
|    1 | SIMPLE      | page   | eq_ref | PRIMARY,name_title                   | PRIMARY                  | 4       | enwiki.linter.linter_page |       1 | Using where |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
2 rows in set (0.00 sec)

wikiadmin@db1080(enwiki)>explain select count(*) from linter inner join page on page_id=linter_page where linter_cat=4;
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
| id   | select_type | table  | type   | possible_keys                        | key                      | key_len | ref                       | rows    | Extra       |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
|    1 | SIMPLE      | linter | ref    | linter_cat_page_position,linter_page | linter_cat_page_position | 4       | const                     | 8810144 | Using index |
|    1 | SIMPLE      | page   | eq_ref | PRIMARY                              | PRIMARY                  | 4       | enwiki.linter.linter_page |       1 | Using index |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
2 rows in set (0.00 sec)

Regardless of the namespace filter, it still gives the same number of rows despite the real values being very different. I think we need to bite the bullet and implement a proper table storing counts.

Izno added a comment.Aug 30 2018, 1:49 PM

FYI, this is maybe a lower priority now since Fireflytools add linter counts.

The maintainer of Fireflytools has abandoned updates. It would be useful for this tool to be maintained centrally.