Make CirrusSearch word count available via API
Closed, Resolved · Public · 5 Estimated Story Points

Description

CirrusSearch adds a total word count of the wiki to Special:Statistics. As far as I can tell, this word count is not available anywhere else (countContentWords() is only called in Hooks::onSpecialStatsAddExtra()); it would be useful to also have it in the API, e.g. in meta=siteinfo.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Workaround using the CirrusSearch Elasticsearch replicas:

lucaswerkmeister@tools-sgebastion-07:~$ curl -s -XGET -H 'Content-Type: application/json' -d '{"query":{"bool":{"filter":[{"terms":{"namespace":[0]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}' https://cloudelastic.wikimedia.org:8243/enwiki_content/_search | jq -r .aggregations.word_count.value
3901684579
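
For anyone who would rather script this than use curl and jq, here is a minimal Python sketch of the same request. It assumes the same endpoint, index name and aggregation as the command above; the function names are made up for illustration, and `fetch_word_count` will only work from a host that can reach the replicas (e.g. Toolforge).

```python
import json
import urllib.request

# Endpoint copied from the curl command above; assumed reachable only from Toolforge.
CLOUDELASTIC_URL = "https://cloudelastic.wikimedia.org:8243"


def build_word_count_query(namespaces=(0,)):
    """Build the same request body as the curl command:
    filter by namespace, then sum the text.word_count field."""
    return {
        "query": {"bool": {"filter": [{"terms": {"namespace": list(namespaces)}}]}},
        "aggs": {"word_count": {"sum": {"field": "text.word_count"}}},
        "stats": ["sum_word_count"],
    }


def extract_word_count(response):
    """Return the summed word count from a parsed _search response
    (Elasticsearch reports the sum as a float)."""
    return int(response["aggregations"]["word_count"]["value"])


def fetch_word_count(index="enwiki_content", base_url=CLOUDELASTIC_URL):
    """Send the query with a GET body, mirroring `curl -XGET -d ...`."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(build_word_count_query()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_word_count(json.load(reply))
```

Calling `fetch_word_count("enwiki_content")` should then return the same kind of number as the jq output above, though the value will obviously change over time.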

Limitations:

  • only Wikimedia wikis (unless any third-party wikis also have CirrusSearch replicas, I guess?)
  • requires Toolforge access
  • you need to know the right cluster

I packaged that up in a script (P16317) and put the results at P16318 if anyone needs them.

Also, this doesn’t take $wgNamespacesToBeSearchedDefault into account.

I think a good use of this may be plotting word count vs article count, so we can see how the article size in any given wiki is growing.

If you query the _content index, this should take $wgNamespacesToBeSearchedDefault into account (we changed this a couple of years ago so that the variable is taken into account when creating the index).

But I'm all for exposing this value more broadly through existing wiki APIs :)

I have checked the words/article ratio for every Wikipedia and got this graph, which is consistent with the content.

irudia.png (734×1 px, 21 KB)

There seems to be a bug at http://pms.wikipedia.org: it reports 65,846 articles but only 138,187 words, which doesn't seem to be the case if you check a few random pages.
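
To make the anomaly concrete, here is a small sketch of the words/article check using the pms.wikipedia.org figures quoted above. The helper names and the `min_ratio` threshold of 50 are arbitrary assumptions for illustration, not something CirrusSearch defines.

```python
def words_per_article(words, articles):
    """Average number of words per article."""
    if articles <= 0:
        raise ValueError("articles must be positive")
    return words / articles


def looks_undercounted(words, articles, min_ratio=50.0):
    """Flag wikis whose ratio is implausibly low.
    min_ratio is a hypothetical cut-off chosen only for this example."""
    return words_per_article(words, articles) < min_ratio


# Figures for pms.wikipedia.org as reported in this task.
pms_ratio = words_per_article(138187, 65846)
print(f"pmswiki: {pms_ratio:.1f} words per article")
print("suspicious:", looks_undercounted(138187, 65846))
```

That works out to roughly two words per article, which is why the number looks wrong rather than merely small.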

I checked a couple of pages: most of the content of each page is considered auxiliary text, which is not taken into account in the word count. We should probably file a ticket for this specific wiki.

I think the problem may be that they use {{Prinsipi}} and {{Fin}}, which put everything inside a table.

MPhamWMF moved this task from needs triage to Feature Requests on the Discovery-Search board.

Change 800211 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Add cirrussearch-word-count to siteinfo api

https://gerrit.wikimedia.org/r/800211

Change 800211 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add cirrussearch-word-count to siteinfo api

https://gerrit.wikimedia.org/r/800211

Seems to be working: https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json

{
  "batchcomplete": "",
  "query": {
    "statistics": {
      "pages": 56528593,
      "articles": 6548695,
      "edits": 1102455734,
      "images": 897652,
      "users": 44128680,
      "activeusers": 113963,
      "admins": 1032,
      "jobs": 0,
      "cirrussearch-article-words": 4171733694,
      "queued-massmessages": 0
    }
  }
}

Which matches the number on Special:Statistics. 🎉
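
For scripted use, a minimal Python sketch of reading the new field could look like the following. The query parameters and the `cirrussearch-article-words` key come from the response above; the function names and the User-Agent string are placeholders, and returning `None` when the key is missing assumes that wikis without CirrusSearch simply omit it.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_statistics_url(api_url):
    """Build the meta=siteinfo statistics URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def article_words(response):
    """Return the CirrusSearch word count from a parsed siteinfo response,
    or None if the key is absent (assumed to mean CirrusSearch is not installed)."""
    statistics = response.get("query", {}).get("statistics", {})
    return statistics.get("cirrussearch-article-words")


def fetch_article_words(api_url="https://en.wikipedia.org/w/api.php"):
    """Fetch and parse the statistics; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        siteinfo_statistics_url(api_url),
        headers={"User-Agent": "word-count-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as reply:
        return article_words(json.load(reply))
```

Unlike the Elasticsearch workaround, this needs neither Toolforge access nor knowledge of the cluster, only the wiki's api.php URL.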