Page MenuHomePhabricator

For some categories, cat_pages is less than cat_subcats
Open, Needs TriagePublic

Description

For some categories, cat_pages is less than cat_subcats, for example:

> select * from category where cat_title='Wikipedia_soft_redirected_categories';
+--------+--------------------------------------+-----------+-------------+-----------+
| cat_id | cat_title                            | cat_pages | cat_subcats | cat_files |
+--------+--------------------------------------+-----------+-------------+-----------+
| 795306 | Wikipedia_soft_redirected_categories |     65901 |       65907 |         0 |
+--------+--------------------------------------+-----------+-------------+-----------+

This looks like some kind of bug, as pages count includes subcats count. Checking on enwiki shows there is no pages in that category, only subcategories, so the counts should be identical.

Event Timeline

A bug indeed. Here you go: T18036

There's been no consensus on a manual update mechanism either, see T85696

We could pull this from elasticsearch? One problem though will be that categories on redirects are not accounted for here, and would be hard to count since they don't get their own entry in the index.

Count of categories and subcats:

{
  "size": 0,
  "query": {
    "match": {
      "category.lowercase_keyword": "Wikipedia soft redirected categories"
    }
  }
}

Result: {"took":23,"timed_out":false,"_shards":{"total":15,"successful":15,"failed":0},"hits":{"total":65958,"max_score":0.0,"hits":[]}}

Number of direct subcats:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": {
            "category.lowercase_keyword": "Wikipedia soft redirected categories"
        } },
        { "match": {
            "namespace": 14
        } }
      ]
    }
  }
}

Result: {"took":31,"timed_out":false,"_shards":{"total":15,"successful":15,"failed":0},"hits":{"total":65958,"max_score":0.0,"hits":[]}}

Number of non-subcats:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": {
            "category.lowercase_keyword": "Wikipedia soft redirected categories"
        } }
      ],
      "must_not": [
        { "match": {
            "namespace": 14
        } }
      ]
    }
  }
}

Result: {"took":38,"timed_out":false,"_shards":{"total":15,"successful":15,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}

I guess it's possible in theory to use Elastic, but since we'll need to do it for every single category on every wiki, I wonder what would be the performance costs. I'll try on some of the larger categories and see if it's a problem. I wonder if it's possible (or even worth it) to ask in batches.

Doing it for all categories can also be done, but it needs a different query. Essentially a terms aggregation plus partitioning, like the following. Unfortunately though this wont work with our current indices because it requires a keyword field to aggregate on, and we do keywords with a special analysis chain on text rather than the baked in keywords.

{
   "size": 0,
   "aggs": {
      "my_terms": {
         "terms": {
            "field": "category.lowercase_keyword",
            "include": {
               "partition": 0,
               "num_partitions": 1000
            },
            "size": 10000
         }
      }
   }
}

Aggregation could be avoided though by somehow attaching to the code that updates category counts and running the elastic queries in the normal update pipeline.