I regularly compile the editor statistics for the Hungarian Wikipedia, and over the years I have collected several questions about the stats generated by Erik Zachte [1].
Please be aware that I am not a developer/programmer; when it comes to program code, I am mostly read-only. Some of the reports below may therefore stem from my misunderstanding or lack of knowledge. Below I list my questions and the potential bugs. Please answer the questions here and open separate tickets for the real problems.
- I did not find a description or definition of how the number of words is counted, only this sentence: "Total number of words (excl. redirects, html/wiki codes and hidden links)". Does "There is an apple" count as 4 words and "They're children" as 2 words? Does "[[Example]] was [[born]] in [[1999]]" count as 2 or 5 words (i.e. is the link markup treated as wiki code)? See the sketch after the next point.
- I tried to check the answers in the source code, but the links are broken:
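To make the question concrete, here is one possible interpretation as a sketch (my guess, not necessarily what wikistats implements): keep the visible text of internal links, drop the markup, and split on whitespace.

```python
# Sketch of ONE possible word-count interpretation (my assumption, not
# necessarily what wikistats does): keep the text of internal links,
# strip the markup, and split on whitespace.
import re

def count_words(wikitext: str) -> int:
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", wikitext)  # [[a|b]] -> b, [[a]] -> a
    text = re.sub(r"<[^>]+>", " ", text)                               # drop html tags
    return len(text.split())

print(count_words("There is an apple"))                     # 4
print(count_words("They're children"))                      # 2 (apostrophe stays inside the token)
print(count_words("[[Example]] was [[born]] in [[1999]]"))  # 5 under this interpretation
```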
- What is the reason that parts of the tables are empty (for example, the Hungarian table since March 2014)? I understand that these columns are generated from the full archive dump, and analyzing it needs more time and resources, but the dump is available. Can we expect these rows to be filled in in the near future?
- Column K is "Percentage of articles with at least 2 Kb readable text (see F)" according to the description, but it links to the page https://stats.wikimedia.org/EN/TablesArticlesGt1500Bytes.htm. Is the limit 1500 bytes or 2048 bytes?
- I tried to compare my own data analysis with Erik's stats, but it is not easy, since I don't know which dump is used for which row.
- If there is a dump from the end of January (for example https://dumps.wikimedia.org/archive/2007/2007-02/huwiki/20070130/), is it used for the row of Jan 2007 or for the row of Feb 2007 in the table?
- If there was no dump in a month, which one was used for the row of that month?
- If there was more than one dump in a month, which one was used for the row of that month?
- Based on the dump file huwiki-20070130-page.sql, I calculated the database size and the average article size. (I know that the definition of an article [2] is different in this case, and therefore the number of articles, the database size and the average article size differ slightly, but the difference is quite small.) According to the row of Jan 2007 in the table [1], at that time the Hungarian Wikipedia had 46 k articles (that contain at least one internal link, excl. redirects), an average article size (same definition) of 2580 bytes, and a database size (combined size of all articles incl. redirects) of 158 MB.
- If I count the articles, I get 49400 in the main namespace excl. redirects and 70391 incl. redirects. Using the official definition of an article, the 46 k looks plausible (excl. redirects).
- If I add up the lengths of the articles, I get 159.8 MB excl. redirects and 160.4 MB incl. redirects. The 158 MB (incl. redirects) in the table looks plausible again.
- If I calculate the average article size, I get 159.8 MB / 49400 articles = 3391 bytes excl. redirects and 160.4 MB / 70391 = 2389 bytes incl. redirects. The number in the table is 2580 bytes, which is quite far from the calculated 3391 bytes. If I back-calculate the number of articles excl. redirects from the 158 MB (this number includes the size of the redirects, but that is only about 0.6 MB and has little effect here) and from the 2580 bytes (both numbers from Zachte's table), I get 64 k. This result looks wrong to me, see the first point. Is it possible that the script uses the number of articles incl. redirects here? See the sketch below.
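A rough sketch of this cross-check (it assumes the 2007-era column order of the page table: page_id, page_namespace, page_title, page_restrictions, page_counter, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len; the regex has to be adjusted if the CREATE TABLE statement in the dump differs):

```python
# Rough sketch of the cross-check; the column order below is my assumption
# about the 2007-era `page` table. Check the CREATE TABLE statement in the
# dump and adjust the regex if it differs.
import re

ROW = re.compile(
    r"\((\d+),(\d+),"          # page_id, page_namespace
    r"'(?:[^'\\]|\\.)*',"      # page_title (quoted, may contain \' escapes)
    r"'(?:[^'\\]|\\.)*',"      # page_restrictions
    r"\d+,([01]),[01],"        # page_counter, page_is_redirect, page_is_new
    r"[\d.eE+-]+,'\d+',\d+,"   # page_random, page_touched, page_latest
    r"(\d+)\)"                 # page_len
)

n_excl = n_incl = size_excl = size_incl = 0
with open("huwiki-20070130-page.sql", encoding="utf-8", errors="replace") as f:
    for line in f:
        if not line.startswith("INSERT INTO"):
            continue
        for _pid, ns, is_redirect, page_len in ROW.findall(line):
            if ns != "0":      # main namespace only
                continue
            n_incl += 1
            size_incl += int(page_len)
            if is_redirect == "0":
                n_excl += 1
                size_excl += int(page_len)

MB = 1024 * 1024
print(n_excl, size_excl / MB, size_excl / n_excl)  # expected: ~49400, ~159.8, ~3391
print(n_incl, size_incl / MB, size_incl / n_incl)  # expected: ~70391, ~160.4, ~2389

# Back-calculation from the published table values (158 MB and 2580 bytes):
print(158 * MB / 2580)  # ~64,215 articles -> the "64 k" mentioned above
```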
- https://stats.wikimedia.org/EN/TablesWikipediaHU.htm#anonymous says "All together 747,130 article edits were made by anonymous users, out of a total of 11346,643 article edits (7%)"
- The comma usage is quite misleading: in Hungarian the comma is the decimal separator (the numbers would be written as 747 130,0 and 11 346 643,0), while in the U.S. it is the thousands separator (747,130.0 and 11,346,643.0). The second number, "11346,643", follows neither convention.
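For illustration, a minimal sketch of the two consistent formats (I avoid the locale module here so it runs without a Hungarian locale installed; the values are the ones from the quoted sentence):

```python
# Consistent thousands grouping for the two numbers from the quoted sentence.
anon_edits = 747130
total_edits = 11346643

# U.S.-style grouping: comma as thousands separator.
print(f"{anon_edits:,} / {total_edits:,}")   # 747,130 / 11,346,643

# Hungarian-style grouping: space between groups, comma only as decimal separator.
def hu_group(n: int) -> str:
    return f"{n:,}".replace(",", "\u00a0")   # non-breaking space between groups

print(f"{hu_group(anon_edits)} / {hu_group(total_edits)}")  # 747 130 / 11 346 643
```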
- Based on my analysis of the huwiki-20160111-stub-meta-history.xml dump, there were more than 16 million edits by the end of 2015. The current number of revisions is more than 17.2 million, so the 16 million edits at the end of the year looks more realistic than the 11 million article edits on the stats.wikimedia.org page.
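Roughly, such a count can be done by streaming through the stub dump and counting the revision elements; a minimal sketch (assuming the export-0.10 XML schema used by the 2016 dumps):

```python
# Minimal sketch: count revisions in a stub-meta-history dump, overall and
# restricted to the main namespace. Assumes the export-0.10 schema; check the
# xmlns attribute of the <mediawiki> root element and adjust NS if needed.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

total = main_ns = 0
current_ns = None
for _event, elem in ET.iterparse("huwiki-20160111-stub-meta-history.xml"):
    if elem.tag == NS + "ns":
        current_ns = elem.text          # <ns> precedes the page's revisions
    elif elem.tag == NS + "revision":
        total += 1
        if current_ns == "0":
            main_ns += 1
        elem.clear()                    # keep memory bounded on a large dump
    elif elem.tag == NS + "page":
        elem.clear()

print("all namespaces:", total)
print("main namespace:", main_ns)
```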
[1]: https://stats.wikimedia.org/EN/TablesWikipediaHU.htm
[2]: https://www.mediawiki.org/wiki/Analytics/Metric_definitions