CirrusSearch adds a total word count of the wiki to Special:Statistics. As far as I can tell, this word count is not available anywhere else (countContentWords() is only called in Hooks::onSpecialStatsAddExtra()); it would be useful to also have it in the API, e.g. in meta=siteinfo.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add cirrussearch-word-count to siteinfo api | mediawiki/extensions/CirrusSearch | master | +22 -9
Event Timeline
Workaround using CirrusSearch elasticsearch replicas:
lucaswerkmeister@tools-sgebastion-07:~$ curl -s -XGET -H 'Content-Type: application/json' -d '{"query":{"bool":{"filter":[{"terms":{"namespace":[0]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}' https://cloudelastic.wikimedia.org:8243/enwiki_content/_search | jq -r .aggregations.word_count.value
3901684579
Limitations:
- only Wikimedia wikis (unless any third-party wikis also have CirrusSearch replicas, I guess?)
- requires Toolforge access
- you need to know the right cluster
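For anyone who would rather script the workaround than use curl, here is a minimal Python sketch of the same request. The query body and the URL are copied from the command above; the function names are made up for illustration, and using POST instead of GET is an assumption (Elasticsearch accepts both for `_search`, but I have not checked how cloudelastic is configured). The same limitations apply, so `fetch_word_count()` will only work from Toolforge.

```python
import json
import urllib.request

# URL taken from the curl command above; cluster and index name are specific to enwiki.
CLOUDELASTIC_SEARCH_URL = "https://cloudelastic.wikimedia.org:8243/enwiki_content/_search"


def build_word_count_query(namespaces):
    """Build the same aggregation body as the curl command: sum text.word_count over the given namespaces."""
    return {
        "query": {"bool": {"filter": [{"terms": {"namespace": list(namespaces)}}]}},
        "aggs": {"word_count": {"sum": {"field": "text.word_count"}}},
        "stats": ["sum_word_count"],
    }


def extract_word_count(response):
    """Pull the summed word count out of a decoded _search response (same path as the jq filter)."""
    return int(response["aggregations"]["word_count"]["value"])


def fetch_word_count(url=CLOUDELASTIC_SEARCH_URL, namespaces=(0,)):
    """Send the query and return the word count. Assumption: POST is accepted like GET."""
    body = json.dumps(build_word_count_query(namespaces)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_word_count(json.load(response))
```

Keeping the query builder and the response parser separate from the network call means they can be checked without Toolforge access.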
I think a good use of this may be plotting word count vs article count, so we can see how the article size in any given wiki is growing.
If you query the _content index, the result should already take $wgNamespacesToBeSearchedDefault into account (a couple of years ago we changed index creation to respect this variable).
But I'm all for exposing this value more broadly through existing wiki APIs :)
I have checked the words/article ratio for every Wikipedia and got this graph, which is consistent with the content.
There seems to be a bug at http://pms.wikipedia.org, because it has 65,846 articles and only 138,187 words, which doesn't seem to be the case if you check random pages.
I checked a couple of pages, and most of their content is considered auxiliary text, which is not taken into account in the word count. We should probably file a ticket for this specific wiki.
I think the problem may be that they use {{Prinsipi}} and {{Fin}}, which put everything inside a table.
Change 800211 had a related patch set uploaded (by EJoseph; author: EJoseph):
[mediawiki/extensions/CirrusSearch@master] Add cirrussearch-word-count to siteinfo api
Change 800211 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add cirrussearch-word-count to siteinfo api
Seems to be working: https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
{
    "batchcomplete": "",
    "query": {
        "statistics": {
            "pages": 56528593,
            "articles": 6548695,
            "edits": 1102455734,
            "images": 897652,
            "users": 44128680,
            "activeusers": 113963,
            "admins": 1032,
            "jobs": 0,
            "cirrussearch-article-words": 4171733694,
            "queued-massmessages": 0
        }
    }
}
Which matches the number on Special:Statistics. 🎉
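With the value in siteinfo, the words/article ratio discussed earlier no longer needs Toolforge. Below is a rough Python sketch, assuming the `cirrussearch-article-words` and `articles` keys shown in the response above. The function names and the User-Agent string are hypothetical placeholders, not part of any existing tool.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_statistics_url(api_base):
    """Build the meta=siteinfo statistics URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return api_base + "?" + urllib.parse.urlencode(params)


def words_per_article(statistics):
    """Compute the words/article ratio from the 'statistics' object of a siteinfo response."""
    return statistics["cirrussearch-article-words"] / statistics["articles"]


def fetch_statistics(api_base):
    """Fetch the statistics object from a live wiki (needs network access)."""
    request = urllib.request.Request(
        siteinfo_statistics_url(api_base),
        # Hypothetical User-Agent; replace it with one identifying your own tool.
        headers={"User-Agent": "word-count-example/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["query"]["statistics"]


# Values copied from the enwiki response above.
sample = {"articles": 6548695, "cirrussearch-article-words": 4171733694}
print(round(words_per_article(sample)))  # prints 637
```

By the same arithmetic, the pms.wikipedia.org numbers quoted above (138,187 words over 65,846 articles) come to roughly 2.1 words per article, which is why that wiki stood out.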