Make CirrusSearch word count available via API
Open, Medium, Public

Description

CirrusSearch adds a total word count of the wiki to Special:Statistics. As far as I can tell, this word count is not available anywhere else (countContentWords() is only called in Hooks::onSpecialStatsAddExtra()); it would be useful to also have it in the API, e.g. in meta=siteinfo.

Event Timeline

Workaround using the CirrusSearch Elasticsearch replicas:

lucaswerkmeister@tools-sgebastion-07:~$ curl -s -XGET -H 'Content-Type: application/json' -d '{"query":{"bool":{"filter":[{"terms":{"namespace":[0]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}' https://cloudelastic.wikimedia.org:8243/enwiki_content/_search | jq -r .aggregations.word_count.value
3901684579

Limitations:

  • only Wikimedia wikis (unless any third-party wikis also have CirrusSearch replicas, I guess?)
  • requires Toolforge access
  • you need to know the right cluster

I packaged that up in a script (P16317) and put the results at P16318 if anyone needs them.
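
For reference, the one-liner above can be wrapped in a small shell function so that the wiki and the namespace list become parameters. This is only a minimal sketch, not the contents of P16317: the function names `build_query` and `word_count` are made up for illustration, and it assumes that the index is named `<dbname>_content` (as in the enwiki example) and that the wiki lives on the same cloudelastic host and port used above.

```shell
#!/bin/sh
# Sketch only; build_query and word_count are hypothetical helper names.

# build_query NAMESPACES
# Print the aggregation request body for a comma-separated list of
# namespace IDs (for example "0" or "0,4").
build_query() {
  printf '{"query":{"bool":{"filter":[{"terms":{"namespace":[%s]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}\n' "$1"
}

# word_count DBNAME [NAMESPACES]
# Query the <DBNAME>_content index and print the summed word count.
# Assumes the same cluster as the enwiki example; other wikis may be
# hosted on a different port. NAMESPACES defaults to "0".
word_count() {
  curl -s -XGET -H 'Content-Type: application/json' \
    -d "$(build_query "${2:-0}")" \
    "https://cloudelastic.wikimedia.org:8243/${1}_content/_search" |
    jq -r .aggregations.word_count.value
}
```

With these definitions, `word_count enwiki` should issue the same request as the command above, and `word_count enwiki 0,4` would sum over namespaces 0 and 4 instead.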

Also, this doesn’t take $wgNamespacesToBeSearchedDefault into account.

I think a good use of this may be plotting word count vs article count, so we can see how the article size in any given wiki is growing.

If you query the _content index, this should take $wgNamespacesToBeSearchedDefault into account (we changed this a couple of years ago so that the variable is taken into account when creating the index).

But I'm all for exposing this value more broadly through existing wiki APIs :)

I have checked the words/article ratio for every Wikipedia and got this graph, which is consistent with the content.

irudia.png (734×1 px, 21 KB)

There seems to be a bug at http://pms.wikipedia.org: it reports 65,846 articles but only 138,187 words, which doesn't seem right if you check a few random pages.
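
To make the anomaly concrete, the ratio for pms.wikipedia.org can be worked out from the two figures quoted above. The helper name `words_per_article` is hypothetical, and this is just the arithmetic, not the script used to produce the graph.

```shell
#!/bin/sh
# words_per_article WORDS ARTICLES
# Print the average number of words per article with two decimals.
words_per_article() {
  # Refuse to divide by zero.
  if [ "$2" -eq 0 ]; then
    echo "article count must be non-zero" >&2
    return 1
  fi
  LC_ALL=C awk -v words="$1" -v articles="$2" \
    'BEGIN { printf "%.2f\n", words / articles }'
}

# Figures reported for pms.wikipedia.org in this thread.
words_per_article 138187 65846   # prints 2.10
```

An average of roughly two words per article is implausible for real pages, which is why the reported word count for this wiki looks wrong.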

I checked a couple of pages, and most of each page's content is considered auxiliary text, which is not included in the word count. We should probably file a ticket for this specific wiki.

I think the problem may be that they use {{Prinsipi}} and {{Fin}}, which put everything inside a table.

MPhamWMF triaged this task as Medium priority. Jun 9 2021, 3:25 PM
MPhamWMF moved this task from needs triage to Feature Requests on the Discovery-Search board.