
XTools has inaccurate prose size counts
Closed, Resolved (Public)

Description

The "Page History" tool reports prose size as a character count and a word count, but it doesn't correctly ignore <style> (via TemplateStyles) and <math> tags. I recently created https://prosesize.toolforge.org/, which has an API; I would suggest pulling counts from there instead. I also wrote a blog post with some more detail.
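To illustrate the kind of filtering being asked for, here is a minimal sketch in Python that drops everything inside <style> and <math> elements before counting. It is not the code used by XTools or by prosesize.toolforge.org; the names `ProseExtractor`, `SKIPPED_TAGS` and `prose_size` are made up for this example, and real prose-size tools apply further rules (which elements count as prose at all) that are not modelled here.

```python
from html.parser import HTMLParser

# Hypothetical set of elements whose text should not count as prose.
SKIPPED_TAGS = {"style", "math"}


class ProseExtractor(HTMLParser):
    """Collects text content, ignoring anything nested in SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def prose_size(html):
    """Return (character_count, word_count) for the visible text."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    words = "".join(parser.chunks).split()
    # Characters are counted on whitespace-normalized text (an assumption).
    return len(" ".join(words)), len(words)


sample = "<p>Hello <style>.a{color:red}</style>world <math><mi>x</mi></math>again</p>"
print(prose_size(sample))
```

With the sample above, the CSS rule and the math content contribute nothing, so only "Hello world again" is counted.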

Event Timeline

Awesome! I might just go off of your code, though, as the XTools Prose API is used quite frequently -- some 50,000+ requests a day, and that's not including HTML requests to the "Page History" tool. XTools already has to scrape the HTML anyway for various other stats.

MusikAnimal moved this task from Backlog to Pending deployment on the XTools board.

I followed your blog post to improve XTools' algorithm. The counts still don't always match, though: sometimes XTools overcounts, sometimes yours does, and it's unclear which is correct. de:Provinzial-Heil- und Pflegeanstalt Allenberg, for example, is off by only two words with the new XTools version. I tried analyzing it line by line and can't see where the differences come from. I did notice, however, that both our implementations may be overcounting words after removing other elements. I lost the example, but sometimes math elements are comma-separated, and we're both just splitting on a space character to count words, so the leftover commas presumably get counted as words. I'm not sure how to reliably remove punctuation that shouldn't be counted, but it's a trivial difference anyway.
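For what it's worth, the overcount from leftover punctuation can be shown with a small Python sketch. This is only an illustration under assumptions, not what either tool does: `count_space_tokens` mimics splitting on a space character, and `count_wordlike_tokens` is a hypothetical stricter rule that only counts tokens containing at least one letter or digit.

```python
def count_space_tokens(text):
    """Count non-empty tokens after splitting on a single space character."""
    return len([token for token in text.split(" ") if token])


def count_wordlike_tokens(text):
    """Count tokens that contain at least one alphanumeric character.

    Hypothetical rule: a bare "," or ";" left behind after removing
    math elements is not treated as a word.
    """
    return sum(
        1 for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


# Text as it might look once comma-separated math elements are stripped out.
leftover = "where , , and are real numbers"

print(count_space_tokens(leftover))
print(count_wordlike_tokens(leftover))
```

The two orphaned commas are counted by the first function and skipped by the second. The stricter rule would still keep tokens like "e.g." or "1,000", but it may mishandle other cases, so it is a starting point rather than a reliable fix.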

The new implementation is on GitHub, should you wish to review it.

Thanks again for filing this bug!

MusikAnimal moved this task from Pending deployment to Complete on the XTools board.