Page MenuHomePhabricator

Display count of remaining content space errors
Open, HighPublic

Description

Right now, it's really disheartening to know that en.WP has over 5 million obsolete HTML tag errors (low priority). Many of these are caused by e.g. <font> or <tt> or some other attribute in a non-content space.

Given our mission is to produce a Wikipedia that looks nice, it would be nice to get a count of all the errors in the content namespaces.

Event Timeline

ssastry raised the priority of this task from Medium to High.Jan 19 2018, 3:28 PM

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

There is a config variable [[https://www.mediawiki.org/wiki/Manual:$wgContentNamespaces | $wgContentNamespaces]] which is the list of all the namespaces of interest for a particular wiki. (I was careful to request content namespaces -- most projects have more than just the mainspace in the content space. ;)

Change 405327 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

https://gerrit.wikimedia.org/r/405327

There is a config variable [[https://www.mediawiki.org/wiki/Manual:$wgContentNamespaces | $wgContentNamespaces]] which is the list of all the namespaces of interest for a particular wiki. (I was careful to request content namespaces -- most projects have more than just the mainspace in the content space. ;)

Guilty as charged! :-) Will fix.

The current query for comparison:

wikiadmin@db1080(enwiki)>explain select linter_cat, count(*) as count from linter group by linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: NULL
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253288
        Extra: Using index
1 row in set (0.00 sec)

The proposed query:

wikiadmin@db1080(enwiki)>explain select linter_cat, count(*) as count from linter inner join page on page_id=linter_page where page_namespace=0 group by linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: linter_page
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253288
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY,name_title
          key: PRIMARY
      key_len: 4
          ref: enwiki.linter.linter_page
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

Should I be using commons for the worst-case performance scenario? In any case, this doesn't appear to be any worse than the existing query, except we're now running it twice.

I thought it would be nice if we could just cache the per namespace/category counts and sum them for the namespaces we want at run time, but that query has to filesort :/

wikiadmin@db1080(enwiki)>explain select page_namespace, linter_cat, count(*) as count from linter inner join page on page_id=linter_page group by page_namespace, linter_cat\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: linter
         type: index
possible_keys: linter_page
          key: linter_cat_page_position
      key_len: 16
          ref: NULL
         rows: 23253287
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.linter.linter_page
         rows: 1
        Extra: 
2 rows in set (0.01 sec)

+DBA, We'd appreciate your review on the proposed query (second one) I posted in T173943#3933570. The current plan is that we'd be running both the current query and the proposed query when people view Special:LintErrors. Would that be OK? Alternatively, if we could run the third query, we'd be able to run only that one. But it has to filesort, so I think it's already out of question?

This was bad and is still bad for the reasons mentioned here: T184280

Thanks, we'll make sure that the patch uses the estimateRowCount method so the queries aren't slow.

So, estimateRowCount doesn't appear like it will work:

wikiadmin@db1080(enwiki)>select count(*) from linter inner join page on page_id=linter_page where linter_cat=4 and page_namespace=0;
+----------+
| count(*) |
+----------+
|   735992 |
+----------+
1 row in set (5.58 sec)

wikiadmin@db1080(enwiki)>select count(*) from linter inner join page on page_id=linter_page where linter_cat=4;
+----------+
| count(*) |
+----------+
|  3894549 |
+----------+
1 row in set (4.13 sec)

wikiadmin@db1080(enwiki)>explain select count(*) from linter inner join page on page_id=linter_page where linter_cat=4 and page_namespace=0;
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
| id   | select_type | table  | type   | possible_keys                        | key                      | key_len | ref                       | rows    | Extra       |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
|    1 | SIMPLE      | linter | ref    | linter_cat_page_position,linter_page | linter_cat_page_position | 4       | const                     | 8810144 | Using index |
|    1 | SIMPLE      | page   | eq_ref | PRIMARY,name_title                   | PRIMARY                  | 4       | enwiki.linter.linter_page |       1 | Using where |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
2 rows in set (0.00 sec)

wikiadmin@db1080(enwiki)>explain select count(*) from linter inner join page on page_id=linter_page where linter_cat=4;
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
| id   | select_type | table  | type   | possible_keys                        | key                      | key_len | ref                       | rows    | Extra       |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
|    1 | SIMPLE      | linter | ref    | linter_cat_page_position,linter_page | linter_cat_page_position | 4       | const                     | 8810144 | Using index |
|    1 | SIMPLE      | page   | eq_ref | PRIMARY                              | PRIMARY                  | 4       | enwiki.linter.linter_page |       1 | Using index |
+------+-------------+--------+--------+--------------------------------------+--------------------------+---------+---------------------------+---------+-------------+
2 rows in set (0.00 sec)

Regardless of the namespace filter, it still gives the same number of rows despite the real values being very different. I think we need to bite the bullet and implement a proper table storing counts.

FYI, this is maybe a lower priority now since Fireflytools add linter counts.

The maintainer of Fireflytools has abandoned updates. It would be useful for this tool to be maintained centrally.

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.

The maintainer of Fireflytools has abandoned updates. It would be useful for this tool to be maintained centrally.

And resurrected. :D

Change #405327 abandoned by Subramanya Sastry:

[mediawiki/extensions/Linter@master] WIP: Display count of linter errors in the main namespace

Reason:

quarry is good enough to get this info now -- and the original motivation is not as relevant now.

https://gerrit.wikimedia.org/r/405327