
Problems with Erik Zachte's Wikipedia Statistics
Closed, Declined · Public

Description

I usually prepare the editor statistics for the Hungarian Wikipedia, and over the years I have collected several questions about the stats generated by Erik Zachte [1].

Please be aware that I am not a developer/programmer; I can mostly only read program code. Some of the reports below may come from my misunderstanding or lack of knowledge. Below I have questions and a list of potential bugs. Please answer the questions here and open separate tickets for the real problems.

  1. I didn't find a description or definition of how the number of words is counted, only this sentence: "Total number of words (excl. redirects, html/wiki codes and hidden links)". Does "There is an apple" count as 4 words and "They're children" as 2 words? Does "[[Example]] was [[born]] in [[1999]]" count as 2 or 5 words (is the link markup wiki code)?
  2. I tried to check the answers in the source code, but the links to it are broken.
  3. Why are parts of the tables empty (for example, the Hungarian table since March 2014)? I understand that these columns are generated from the full archive dump, and analyzing it needs more time and resources, but the dump is available. Can we expect these rows to be filled in the near future?
  4. According to the description, column K is "Percentage of articles with at least 2 Kb readable text (see F)", but it links to the page https://stats.wikimedia.org/EN/TablesArticlesGt1500Bytes.htm. Is the limit 1500 bytes or 2048 bytes?
  5. I tried to compare my data analysis with Erik's stats, but it is not easy since I don't know which dumps are used for which rows.
  • If there is a dump from the end of January (for example https://dumps.wikimedia.org/archive/2007/2007-02/huwiki/20070130/), is it used for the Jan 2007 row or for the Feb 2007 row of the table?
  • If there was no dump in a month, which one was used for that month's row?
  • If there was more than one dump in a month, which one was used for that month's row?
  6. Based on the dump file huwiki-20070130-page.sql, I calculated the database size and the average article size. (I know that the definition of an article [2] is different in this case, and therefore the number of articles, the database size and the average article size differ slightly, but the difference is quite small.) According to the Jan 2007 row of the table [1], the Hungarian Wikipedia at that time had 46 k articles (containing at least one internal link, excl. redirects), an average article size (same definition) of 2580 bytes, and a database size (combined size of all articles incl. redirects) of 158 MB.
  • Counting the articles, I get 49400 in the main namespace excl. redirects and 70391 incl. redirects. Using the official definition of an article, 46 k (excl. redirects) looks plausible.
  • Adding up the article lengths, I get 159.8 MB excl. redirects and 160.4 MB incl. redirects. The 158 MB in the table looks plausible again.
  • Calculating the average article size, I get 159.8 MB / 49400 articles = 3391 bytes excl. redirects and 160.4 MB / 70391 = 2389 bytes incl. redirects. The number in the table is 2580 bytes, which is quite far from the calculated 3391 bytes. If I back-calculate the number of articles excl. redirects from the 158 MB (which includes the redirects, but they are only about 0.6 MB and have little effect here) and from the 2580 bytes (the numbers in Zachte's table), I get 64 k. That result looks wrong to me, see the first point above. Is it possible that the script uses the number of articles incl. redirects here? (The back-calculation is spelled out in the sketch at the end of this description.)
  7. https://stats.wikimedia.org/EN/TablesWikipediaHU.htm#anonymous says "All together 747,130 article edits were made by anonymous users, out of a total of 11346,643 article edits (7%)"
  • The commas used are quite misleading: in Hungarian the comma is the decimal separator (747 130,0 and 11 346 643,0), while in the U.S. it is the thousands separator (747,130.0 and 11,346,643.0). The second number follows neither convention.
  • Based on my analysis of the huwiki-20160111-stub-meta-history.xml dump, there were more than 16 million edits by the end of 2015. The actual number of edits is now more than 17.2 million, so 16 million edits at the end of the year looks more realistic than the 11 million edits on the stats.wikimedia.org page.

[1]: https://stats.wikimedia.org/EN/TablesWikipediaHU.htm
[2]: https://www.mediawiki.org/wiki/Analytics/Metric_definitions
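
To make the arithmetic in point 6 easier to check, here is the same back-calculation as a small Python sketch (purely an illustration of my own figures above; I assume 1 MB = 1,048,576 bytes, which is what makes my numbers consistent):

```
# Back-calculation for point 6, using my own figures from the huwiki-20070130 dump.
MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes

size_excl = 159.8 * MB   # summed article size, excl. redirects
size_incl = 160.4 * MB   # summed article size, incl. redirects
count_excl = 49400       # article count in the main namespace, excl. redirects
count_incl = 70391       # article count in the main namespace, incl. redirects

print(int(size_excl / count_excl))  # 3391 bytes per article, excl. redirects
print(int(size_incl / count_incl))  # 2389 bytes per article, incl. redirects

# Article count implied by the table's 158 MB database size and 2580-byte average:
print(int(158 * MB / 2580))         # 64215, i.e. ~64 k, closer to the count incl. redirects
```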

Event Timeline

Hi @Samat,

1 The code is at https://github.com/wikimedia/analytics-wikistats/blob/master/dumps/perl/WikiCountsInput.pm, line 1810 and onwards, sub CollectArticleCounts.
WikiStats is poorly documented (but that fact by itself is pretty well documented, as I point it out every year or so).

2 https://github.com/wikimedia/analytics-wikistats/tree/master/dumps/perl (where are those links you mention?)

3 Wikistats stopped processing full archive dumps about a year ago, as a full dump cycle takes more than a month these days.
Ariel reworked the dump procedure to generate stub dumps as soon as a new month starts.

Additionally I did process full archive dumps on stat1003 on a slower cycle, but that hasn't been maintained for a long while in favor of other work.

Can we expect to see rows filled in the near future? No, not really. Wikistats is approaching end of life. A replacement is direly needed.

4 The URL is misnamed; the page subtitle and description are correct, see https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#distribution. All bins are powers of 2.
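
To illustrate what "all bins are powers of 2" means in practice, a rough Python sketch (my shorthand for the idea, not the actual Wikistats code):

```
# Size bins with power-of-2 boundaries: ..., 512, 1024, 2048, 4096, ... bytes.
# Read this way, the "2 Kb" in the column K description means 2048 bytes, not 1500.
bin_edges = [2 ** k for k in range(6, 16)]   # 64 bytes .. 32768 bytes
print(bin_edges)

def share_at_least_2kb(readable_sizes):
    """Fraction of articles with at least 2 Kb (2048 bytes) of readable text."""
    return sum(s >= 2048 for s in readable_sizes) / len(readable_sizes)

print(share_at_least_2kb([500, 1500, 2048, 4096]))  # 0.5
```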

5 Which dump has been parsed? This is shown at the end of each wiki-specific page,
e.g. https://stats.wikimedia.org/EN/TablesWikipediaEN.htm:
"Dump file enwiki-20160113-stub-meta-history.xml.gz (edits only), size 4.7 GB as gz -> 284 GB
Dump processed till Dec 31, 2015, on server stat1002, ready at Tue-19/01/2016-18:36 after 2 days, 10 hrs, 35 min, 37 sec."

Only completed months are processed, so several dumps occurring in the same month are irrelevant: they only differ in the data for the current month, which is discarded.

Edits that occurred between Jan 1, 2016 and Jan 31, 2016 lead to counts reported as 'Jan 2016'.
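
In code terms the rule is roughly this (a simplified Python sketch of the idea, not the actual Perl):

```
from collections import Counter
from datetime import date

def monthly_edit_counts(edit_dates, dump_date):
    """Count edits per (year, month), discarding the month the dump falls in,
    since that month is not yet complete."""
    counts = Counter()
    for d in edit_dates:
        if (d.year, d.month) == (dump_date.year, dump_date.month):
            continue  # data for the current, incomplete month is discarded
        counts[(d.year, d.month)] += 1
    return counts

# An edit dated 2016-01-15, seen in a dump taken on 2016-02-03, ends up in the 'Jan 2016' row.
print(monthly_edit_counts([date(2016, 1, 15)], date(2016, 2, 3)))
```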

6 The script shouldn't include redirects in the counts. Please be aware that a dump from 2007 yields different counts than a recent dump for the same time period.

7 Did you look at namespace 0 only? Wikistats does, or to be more precise, it looks only at the 'content' namespaces as listed in the API.

11346,643? Yes, a comma is missing. This is supposed to be English notation. The scripts were written at a time when a metric reaching more than 6 digits could not be envisioned. A generic number-formatting routine would of course have been better. I see that on the Magyar version of the reports spaces are used as the separator (https://stats.wikimedia.org/HU/Sitemap.htm). By the way, reports in other languages are a mix of that language and English, as translations are no longer maintained. Sorry.
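
For illustration, a generic routine is a one-liner these days (a Python sketch of what I mean, not what the Perl scripts do):

```
def format_count(n, thousands_sep=","):
    """Insert a thousands separator no matter how many digits the number has."""
    return f"{n:,}".replace(",", thousands_sep)

print(format_count(11346643))       # 11,346,643 (English notation)
print(format_count(11346643, " "))  # 11 346 643 (space separator, as on the HU reports)
```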

I'm not sure I answered all your questions satisfactorily, but we'll take it from here.

Cheers, Erik

Dear Erik,

Thank you for your quick, detailed and helpful answer. Sadly, I can answer you only on Monday.

Samat

Dear Erik,

  2. I found these links here: https://stats.wikimedia.org/index.html#fragment-14 (section Raw Data and Scripts), but they are also on the https://meta.wikimedia.org/wiki/Wikistats page (Source code and External links sections).
  3. I am really sad to read that Wikistats is approaching the end of its life. This page is the most important statistics page, with no real alternative: not only because many of these statistical values are not available elsewhere, but because it is generated for almost every Wikimedia project and language edition, with the option of direct comparison and rankings. If I can help somehow, I will happily do so. I have a powerful computer (running 24/7) and a few TB of free space, enough for at least the Hungarian-language projects.
  5. You mean that the whole table is always generated and updated from the latest dump? (I thought you generated the Jan 2007 row from the dump saved at that time, and only the Jan 2016 row from the huwiki-20160203-stub-meta-history.xml.gz dump.) This method surely causes several side effects and inaccuracies in the stats, for example because of changes in user rights or deleted pages.
  6. According to your description, columns L (number of monthly edits) and M (database size) include redirects, while the other columns (including the average article size) exclude them. I still feel that the average article size is not correct; a few deletions or similar changes since then don't explain the difference between 2580 bytes and 3391 bytes. Could you please take a look at it?
  7. You are right: the mentioned 16 or 17 million edits include every namespace. Sorry. Anyway, could I still help with the translation? If yes, how/where?

Cheers,
Samat

Hi @Samat,

2 I updated the links in both places, thanks for noticing.

3 After some 13 years of Wikistats, I feel my role in maintaining those scripts is coming to an end. And while Wikistats has some serious strengths (I'm glad you mention its equal treatment of projects and languages), it also has serious weaknesses. The major weakness is its lack of maintainability. The code is complex, hardly documented, monolithic, and so on. I feel excused to some degree because I acted as a one-person general statistics department for a long time, so getting things done quickly was important, at the expense of maintainability. But 13 years of patch upon patch took their toll, and the explosive growth of the dumps didn't help either.

Enough for now. I'll follow up on this in another context. Of course the scripts will still be there; I believe @Nemo_bis uses them to process some non-Wikimedia dumps.

5 Yeah, a surprise, isn't it? My usual mantra, in its most concise form: this approach allowed me to add ever more metrics and still have them all collected from day one. It also allowed all data to benefit from bug fixes and policy adjustments (e.g. more content namespaces). The one caveat is that dumps aren't static. They used to be for many years, with an occasional deletion to protect someone's privacy. But of course these days deletions happen all the time. So Wikistats, as it unintentionally came to be, is not about how many edits happened in some year, but about how many edits happened to content that has survived, i.e. is still deemed worthy. There is a slow and steady self-cleaning of our wikis, and Wikistats follows suit, focusing only on what remains. Like I said, unintentional, but IMO a valid way of looking at stats.

6 Not answered for now; I will have to take a look.

7 Let's defer this to a later moment. There are files with language-specific literals, one file per language. But new scripts would certainly do this differently, using TranslateWiki: a far more scalable and robust approach.

Cheers, Erik

Dear Erik,

13 years is a very long time, and I fully understand your answer. I would be happy if we didn't lose the effort you put into writing the scripts, and if your scripts were maintained and developed by others in the future (see T107175 and other requests). Since we would like to see at least the statistics (the full table) for huwiki until we have a better solution, I would like to use your scripts to generate the table (without the ranking part) for the Hungarian projects. May I expect some support from your side (or from somebody else) at the beginning on how this could work? (I would like to run your scripts under Windows 7 on the (full) Wikimedia data dump. Note again: I am not a programmer, only an enthusiastic but inexperienced program user.) A modified version of this description would be really helpful (and not only for me): https://meta.wikimedia.org/wiki/Wikistats#Running_Wikistats_on_your_own_MediaWiki_site

Is it right that you now use only the stub-meta-history dump (not the full dump), and that therefore your statistics no longer match https://www.mediawiki.org/wiki/Analytics/Metric_definitions? For example, the article count would then be the 'raw' number of articles, as the script cannot check whether they contain an internal link or category link, or whether they are redirects or not.
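
As I understand the metric definitions, the 'official' article test needs the page text itself, roughly along these lines (only a sketch of my understanding, not code from Wikistats or anywhere else):

```
import re

def counts_as_article(wikitext, is_redirect):
    """My reading of the 'official' definition: not a redirect and containing at
    least one internal link. A stub-meta-history dump carries no wikitext, so this
    check cannot be applied to it."""
    if is_redirect:
        return False
    return re.search(r"\[\[.+?\]\]", wikitext) is not None
```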

For number 6: I would very much appreciate it if you could check this. I have now checked using a dump from 2012 (huwiki-20120604-page.sql, the latest still-available dump that I can compare with your table), and I got similar results as before:

  • size of all articles is 1.18 GB without and 1.19 GB with redirects,
  • number of articles is 219 k without and 351 k with redirects,
  • average size of articles is 5820 bytes without and 3646 bytes with redirects.

The numbers in the table (May 2012 row) are 1.2 GB (vs. 1.19), 217 k (vs. 219 k) and 4187 bytes (vs. 5820 bytes). My problem is the difference between the last pair of numbers (the average article size).

Thank you for your help and answers.

Samat

It seems to me that most questions here had an answer. :)

Most of them... but not all.
I am really curious about the answer to question 6, and I am still patiently waiting for @ezachte on this.

@Nemo_bis I disagree with closing the ticket now.

The two metrics are incompatible from 2007 onwards.

Column M ('total database size') is the total of raw bytes for all articles.
Column I ('mean bytes') divides the total of printable symbols by the article count, i.e. what remains after removing html, headers (as these are mostly repeated content), links except for the printed part, and some other wikification symbols, with multibyte characters counting as 1.

In WikiCountsInput.pm, function CollectArticleCounts, the former is $size, the latter $size2.

The legend only says "I = Mean size of article in bytes", so that is confusing at the least, if not outright wrong.
"Mean size of rendered article in symbols" is also not right, as by then the templates would have been resolved.
It's more like "Mean size of readable content in the article as stored in the database, in symbols".

For a Wikistats replacement this seems too complex to calculate (these are the most expensive regexps in Wikistats), too fuzzily defined, and hard to grasp. The same goes for the word count, which is also based on sanitized content. I guess I wanted to get close to printable text to allow a better comparison with printed encyclopedias.
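
Very roughly, the distinction looks something like this (an illustrative Python sketch of the idea, far simpler than the actual Perl regexps):

```
import re

def raw_size(wikitext):
    """The ingredient of column M ($size): raw bytes as stored in the database."""
    return len(wikitext.encode("utf-8"))

def readable_size(wikitext):
    """The ingredient of column I ($size2), heavily simplified: drop html, section
    headers and wiki markup, keep only the printed part of links, and count each
    (possibly multibyte) character as one symbol."""
    text = re.sub(r"<[^>]*>", "", wikitext)                   # html tags
    text = re.sub(r"^=+.*?=+\s*$", "", text, flags=re.M)      # section headers
    text = re.sub(r"\[\[[^\]|]*\|([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)           # [[target]] -> target
    text = re.sub(r"'{2,}", "", text)                         # bold/italic markup
    return len(text)                                          # symbols, not bytes

sample = "'''Alma''' egy [[gyümölcs|gyümölcsfajta]]."
print(raw_size(sample), readable_size(sample))
```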

Dear Erik,

Thank you for your clarification. This sounds like a reasonable approach.

It is interesting to see that, according to this, roughly one third of the article content consists of non-printed characters.
It is especially interesting that the templates have been resolved for this measurement. However, I think Wikistats can do nothing with empty templates whose data comes from Wikidata (https://hu.wikipedia.org/w/index.php?title=Donzdorf&action=edit). Or can it?

May I expect some support the first time I try to generate your stats for the Wikimedia wikis? See my previous question (from 1 March).
Or maybe from @Nemo_bis? :)