
Stats for en.wikinews.org not working
Closed, Invalid · Public

Description

The stats located here: http://stats.wikimedia.org/wikinews/EN/Sitemap.htm appear not to be working.

Event Timeline

DragonFire1024 raised the priority of this task from to Unbreak Now!.
DragonFire1024 updated the task description. (Show Details)
DragonFire1024 subscribed.
Milimetric lowered the priority of this task from Unbreak Now! to High. Aug 17 2015, 5:28 PM
Milimetric set Security to None.
Milimetric subscribed.

I changed the priority to High as no site is actually down right now. cc-ing Erik as well.

@DragonFire1024 could you be more specific on what wasn't working? Did you see no reports, or outdated reports? FYI, there is always a gap of 2-4 weeks after the closure of a month before all dumps have been generated and processed.

The last time the stats updated was June 30, and the last dump was on July 30. The stats are not updating with the dumps.

We're talking about stats for July, right? Stats for June have been online for a while. So for July the last dump wasn't on July 30, of course. Generation of the relevant dumps starts August 1. The complete dump process took until around August 10. It takes at least a week for Wikistats to process all these dumps. This time I had to rerun the largest dumps, as a temporary workaround needed to process anomalous June dumps was still in the code (my bad). There is a short manual vetting step after all dumps have been processed, before I publish. You can check the draft location next time for an unvetted prerelease; for en.wikinews.org that is https://stats.wikimedia.org/wikinews/EN/draft/TablesWikipediaEN.htm.

Milimetric claimed this task.

@DragonFire1024, it sounds like there's no issue here other than normal workflow. The next version of wikistats aims to update more frequently and in a more automated fashion, but we're just now starting work on that project.

Some background: Wikistats used to publish fully automatically. But once, when a bug got the numbers totally wrong, no one told me; I only learned of it when an article on the issue was posted in the Signpost. So I went back to manual vetting. :-)

Right, the only way to go back to automated would be if we had monitoring in place that would let us know when processing is not going as expected.

Automated monitoring: agreed, but not a trivial task. I experimented with month-over-month (MoM) comparisons and alerts based on thresholds. But we would need flexible thresholds, as smaller wikis have more variance.
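
For what it's worth, here is a minimal sketch of what such a MoM check could look like. The 25% base threshold, the small-wiki cutoff, and the widening heuristic are all illustrative assumptions, not Wikistats code:

```python
# Hypothetical month-over-month (MoM) check with a flexible threshold.
# Thresholds and the widening rule for small wikis are illustrative only.

def mom_alert(current, previous, base_threshold=0.25, small_wiki_cutoff=10_000):
    """Return True if the month-over-month change looks anomalous.

    current, previous: metric values (e.g. monthly edit counts) for a wiki.
    base_threshold:    allowed relative change for large wikis (25% here).
    small_wiki_cutoff: below this volume, widen the threshold because
                       small wikis naturally show more variance.
    """
    if previous == 0:
        return current > 0  # went from nothing to something: worth a look
    change = abs(current - previous) / previous
    threshold = base_threshold
    if previous < small_wiki_cutoff:
        # Widen the allowed band for small wikis (simple heuristic).
        threshold *= 1 + (small_wiki_cutoff / previous) ** 0.5
    return change > threshold


# Example: a 30% drop trips the alert on a large wiki but not on a tiny one.
print(mom_alert(70_000, 100_000))  # True
print(mom_alert(70, 100))          # False, threshold widened for small wiki
```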

Here is an example of such a comparison file. Ideally we would publish these as well, so that others can help in the vetting process. But that would require more solid thresholds than in this example, I guess. Or do you see another way to go about this?

Come to think of it, I would favor always having a category 'other' / 'not counted' or similar. For instance, if we see the percentage of HTML requests marked as pageviews drop, that would be a good incentive to review the definitions. Such a thing happened with http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview2014Q3.htm, where the region 'Unknown' went from 0.1% to 8% in this quarter.
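
To sketch what watching such an explicit 'other' / 'not counted' bucket could look like (the data layout and the 5-percentage-point cutoff below are assumptions for illustration, not how the squid reports are built):

```python
# Hypothetical check: track the share of requests falling into an explicit
# 'other' / 'not counted' bucket, and flag large jumps between reporting
# periods (like 'Unknown' going from 0.1% to 8% in one quarter).

def uncategorized_share(counts_by_category):
    """Fraction of requests not matched by any known category."""
    total = sum(counts_by_category.values())
    return counts_by_category.get("other", 0) / total if total else 0.0

def flag_definition_drift(previous_counts, current_counts, max_jump=0.05):
    """True if the uncategorized share grew by more than max_jump (absolute)."""
    return (uncategorized_share(current_counts)
            - uncategorized_share(previous_counts)) > max_jump

# Example mirroring the country report: 'other' jumps from ~0.1% to ~8%.
q2 = {"US": 500_000, "DE": 300_000, "other": 800}
q3 = {"US": 480_000, "DE": 290_000, "other": 67_000}
print(flag_definition_drift(q2, q3))  # True -> time to review the definitions
```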

I'm thinking lower level, like making sure all the data is where we think it should be, that all the computation we expect to happen actually happens, and so on. I agree that trying to automatically determine which data is correct is always tricky. We could try some of the anomaly detection algorithms we were playing with earlier this year; that might prove useful.
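
As a rough illustration of that kind of lower-level check (the directory layout and file names here are just assumptions, not the actual Wikistats setup):

```python
# Hypothetical sanity check: verify that the expected input dumps and output
# reports exist and are non-empty before publishing a new Wikistats run.

import os

EXPECTED_INPUTS = ["enwikinews-stub-meta-history.xml.gz"]   # illustrative names
EXPECTED_OUTPUTS = ["TablesWikipediaEN.htm"]

def missing_or_empty(directory, filenames):
    """Return the subset of filenames that are absent or zero-length."""
    problems = []
    for name in filenames:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(name)
    return problems

def check_run(dump_dir, report_dir):
    issues = (missing_or_empty(dump_dir, EXPECTED_INPUTS)
              + missing_or_empty(report_dir, EXPECTED_OUTPUTS))
    for name in issues:
        print(f"ALERT: {name} is missing or empty")  # could notify instead
    return not issues

# Example: check_run("/dumps/enwikinews/latest", "/reports/wikinews/EN/draft")
```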