Tons of documentation on this task
| Resolved | None | T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics |
| Resolved | None | T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days |
| Open | None | T130256 Wikistats 2.0 |
| Resolved | Halfak | T133957 Socialize metrics standardization work with analytics metrics work |
| Resolved | JAllemandou | T131783 Examine wikistats reports, make a summary of the most granular data needed that would serve all reports |
Notes on this so far: https://etherpad.wikimedia.org/p/wikistats-edits
We are leaning towards loading data from MySQL first, which might mean we need to load data using MediaWiki.
Open questions (that will shape our tasks going forward)
0. What data are we loading in the first instance?
Are we loading only the data needed for the verticals? https://phabricator.wikimedia.org/T131779
Are we loading more data so we can optimize our data extraction?
- Do we need a MediaWiki client to load data from the db? If so, estimate the work.
- Estimate the work of loading the data we need from dumps/HDFS, using Joseph's Altiscale work as a basis.
After reviewing charts and talking with @ezachte, here is a schema-like view of data that would be needed to replicate most of the useful charts.
All of the information discussed here is easily accessible from dumps, except for archived and historical changes (page title changes or contributor rights changes, for instance) and categories, for which template processing would be needed to be correct.
- creation_date - [Date earliest edit]
- Regions associated (see other fact; this data is currently hard-coded in Perl, with some HTML scraping to retrieve numbers)
- project-class [wikipedia, wikitravel, wiktionary ..., otherwikis]
- is_countable (mostly namespace 0, but for some wikis the community decided that some other namespaces should be counted in the articles definition - this list can be retrieved using an API call)
- creation_date [date of 1st revision]
- creator [contributor of 1st revision]
- restrictions [present in dumps, NOT USED in Wikistats]
- engagement_date [date 1st edit]
- rights [Access: A=Admin, B=Bureaucrat, C=Checkuser, D=Developer, O=Oversight, X=bot]
- revert_info [can either be reverted_revision in case of sha1 equality, or whether 'revert' is present in the comment (not sure this last one should be continued because of cross-language issues)]
- count of links (internal, other wikis (really needed now that Wikidata handles cross-wiki links?), binaries, external) [this one is tricky to get using dumps: it implies parsing]
- minor? [present in dumps, NOT USED in wikistats]
- model?? [present in dumps, NOT USED in wikistats]
- format?? [present in dumps, NOT USED in wikistats]
- parent_id [present in dumps, NOT USED in wikistats]
- ------------ To Be Discussed ----------------
- bytes -- easy to get, a good proxy for article size without the issues of counting words or chars
- chars without wikitext -- a better measure of article size, but tricky for languages where characters and words overlap (Japanese for instance)
- words without wikitext -- idem, with the even more difficult job of splitting sentences into words.
About categories: Erik and I agreed that categories, while being very interesting, could be a project in itself.
- categories in edits (count and possibly list) [categories inserted by templates are an issue]
- Other facts
- estimated population
- Languages associated + estimated number of speakers
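The revert_info field above mentions sha1 equality. A minimal sketch of that detection heuristic, assuming revisions arrive in chronological order with their full text (the function and variable names are hypothetical, not Wikistats code):

```python
import hashlib


def find_reverts(revisions):
    """Sha1-identity revert detection: a revision whose text hash
    matches an earlier revision of the same page is treated as a
    revert back to that earlier state.
    `revisions` is a list of (rev_id, text) in chronological order.
    Returns {reverting_rev_id: reverted_to_rev_id}."""
    seen = {}      # sha1 hex digest -> first rev_id producing it
    reverts = {}
    for rev_id, text in revisions:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            reverts[rev_id] = seen[digest]
        else:
            seen[digest] = rev_id
    return reverts


# Example: rev 3 restores the exact text of rev 1.
history = [(1, "stable text"), (2, "vandalism"), (3, "stable text")]
print(find_reverts(history))   # {3: 1}
```

The comment-based detection ('revert' in the edit summary) discussed later in this thread would complement this, since self-reverts and partial reverts never reproduce an identical sha1.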
Content (= countable) namespaces are collected daily via an API call using a Perl file and a Bash file, and written to CSV files.
Namespaces flagged with the string 'content' qualify.
The Perl script adds a few that are not in the API but were historically deemed countable, in particular ns 6 for Commons, which signals binary uploads, the most important events on Commons.
In the past an issue was that new namespaces were invented, articles were moved from ns 0 to the new namespace, and only much later, after a community vote, was that new namespace officially deemed a content namespace; even later some config file was updated and the API informed us. Until that happened the article count could be seen falling in Wikistats.
A good example (one of few) where rebuilding all data every month proved helpful.
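The API-based collection described above can be sketched as follows. The filter assumes the parsed JSON of api.php?action=query&meta=siteinfo&siprop=namespaces&format=json, where content namespaces carry a 'content' key; the function name is hypothetical:

```python
def content_namespaces(siteinfo):
    """Return the namespace ids a wiki marks as content, given the
    parsed JSON reply of
    api.php?action=query&meta=siteinfo&siprop=namespaces&format=json.
    In that reply, content namespaces carry a 'content' key."""
    ns = siteinfo["query"]["namespaces"]
    return sorted(int(k) for k, v in ns.items() if "content" in v)


# Toy reply: ns 0 is content, ns 1 (Talk) is not.
sample = {"query": {"namespaces": {
    "0": {"id": 0, "content": ""},
    "1": {"id": 1},
}}}
print(content_namespaces(sample))  # [0]
```

The historically countable extras the Perl script adds (e.g. ns 6 on Commons) would be merged in by the caller, since the API alone will not report them.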
Bytes, chars, words
As for counting bytes vs chars vs words, here are some considerations.
Surely counting bytes is easy and cheap (which explains its popularity), and Wikistats reports these in e.g. .
Character count is more costly to collect (in fact much more costly with Wikistats' very strict regexp; a good-enough lighter version of that regexp might be advisable).
English is an exception, where most texts contain only 1-byte characters. So for English speakers byte count is a good proxy. Less so for French, Swedish, German, and much much less for ideographic languages.
English 'telephone' and French 'téléphone' are words of equal size in characters, but not in bytes. Arguably, comparisons of text volume between languages are fairer in chars than in bytes.
Word count is really tricky, and maybe too ambitious. But words seem to be the default unit for comparing text volumes, in particular for encyclopedias.
Wikistats went as far as guesstimating conversion ratios for text size 'normalization' by comparing official translations of the US Constitution in English, Japanese, Chinese and some other ideographic languages, and using this (language-dependent) ratio to calculate a 'normalized' word count. Yet only a few ideographic languages underwent this treatment, and its validity can be questioned.
(the kind of HTML table that few people will use, but it can be helpful to detect anomalies, e.g. when the average size of an article drops dramatically)
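A small illustration of the three measures discussed above (the word count here is a naive whitespace split, which is exactly what breaks down for ideographic languages):

```python
def text_sizes(text):
    """Bytes vs chars vs (naively whitespace-split) words for one string."""
    return {
        "bytes": len(text.encode("utf-8")),
        "chars": len(text),
        "words": len(text.split()),
    }


# 'telephone' and 'téléphone' are equal in characters, not in bytes:
print(text_sizes("telephone"))   # {'bytes': 9, 'chars': 9, 'words': 1}
print(text_sizes("téléphone"))   # {'bytes': 11, 'chars': 9, 'words': 1}
```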
Regional codes and number of speakers per language.
I said indeed this was HTML-scraped from Wikipedia, but I confused this with traffic reports, where some demographic data are indeed HTML-scraped from Wikipedia (population count, # internet connections). Those are stored as CSV, and used e.g. in 
For dump-based reports, region codes and number of speakers are taken manually from language articles on English Wikipedia. (#speakers includes secondary speakers where an estimate is available, hence bypassing the page where all languages are listed with native speakers only.) And these are indeed stored in a Perl file. I will export this to CSV when the need arises.
Wikistats groups editors, edits and creations by user type: registered (and logged-in) user, anonymous user, or bot.
Often the XML dump tells which edits have been done by an unregistered or logged-out user, by adding an <ip> tag.
But in early years this wasn't always the case, and IP addresses weren't always a series of numeric triplets (but instead e.g. [username]@comcast.net). Hence Wikistats vets the user name and, if it contains two or more dots, treats it as anon, a few false positives taken for granted (and a handful of names excluded explicitly when users wrote me about being registered names).
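A sketch of that fallback heuristic (the function name is hypothetical; the explicit exception list mentioned above is left out):

```python
def looks_anonymous(user_field):
    """Early-dump fallback for when no <ip> tag is present: treat a
    user field containing two or more dots as an anonymous editor
    (IP address or provider-style name), accepting a few false
    positives."""
    return user_field.count(".") >= 2


print(looks_anonymous("12.34.56.78"))   # True  (numeric triplets)
print(looks_anonymous("JohnDoe"))       # False (registered-looking name)
```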
Note that many users like puns or letter games and make up a nick with 'bot' in it just because they can.
Recap on how Wikistats detects bots:
- Is the name registered as a bot, in other words is there a bot flag in the user group table?
- Does it sound like a bot? (nowadays, on many wikis, only allowed for actual bots)
- Is it known to be an unregistered bot ? (Wikipedia has a list of false negatives at http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots ) I copied that list long ago but do not keep it auto-updated.
- If a name is flagged as a bot on at least 10 wikis, then treat it as such on any wiki within the project. (In the past, when user names could easily collide, this was more relevant.) The basic rationale is that on smaller wikis bot registrations are often forgotten. With SUL it is unlikely that people use the same name as a bot on one wiki and as a regular user on another wiki.
- Three names that sound like bot are hard coded exceptions (people who wrote me to tell me they are human): Paucabot|Niabot|Marbot
Wikistats is certainly more restrictive in 'does it sound like a bot' than what I saw elsewhere.
Perl: if (($user =~ /bot\b/i) || ($user =~ /_bot_/i))
Meaning a name only sounds like a bot to Wikistats when
- 'bot' is at the end of the string or is followed by a non-alphanumeric character
- or is preceded and followed by underscores (in MediaWiki often a placeholder for spaces)
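A Python transliteration of the Perl test above (note that \b does not fire before an underscore, since '_' counts as a word character in regexes; that is what the second alternative is for):

```python
import re


def sounds_like_bot(user):
    """Transliteration of the Perl check
        ($user =~ /bot\b/i) || ($user =~ /_bot_/i)
    'bot' at a word boundary (end of string or followed by a
    non-alphanumeric char), or 'bot' wrapped in underscores."""
    return bool(re.search(r"bot\b", user, re.I) or
                re.search(r"_bot_", user, re.I))


print(sounds_like_bot("SpellBot"))     # True  (bot at end of string)
print(sounds_like_bot("Bot-herder"))   # True  (followed by non-word char)
print(sounds_like_bot("Botany"))       # False (bot followed by letters)
print(sounds_like_bot("my_bot_farm"))  # True  (underscore-wrapped)
```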
(it would be interesting (but too much work right now) to break this down by language. I guess some languages are more prone to have 'bot' in real names than others.)
From a 2014 mail: 7453 / 21589 names with 'bot' in it (35%) are *perhaps* not a bot.
BTW this is an example where complete rebuild of stats is tricky, as bot flags can disappear.
Quote: "revert_info [can either be reverted_revision in case of sha1 equality, or if revert is present in comment (not sure this last one should be continued because of cross-language issues)]"
Do 'cross-language issues' mean the 'REV' acronym in edit comments could be spelled differently in other languages? If so, yes, that's an issue, but leaving these out would under-report reverts on English Wp by 13%, on German Wp by 22%, and on Dutch Wp also by 20%. I'd rather see the list of likely acronyms per language extended (community-curated input file?)
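The community-curated input file suggested here could feed something like the sketch below; the marker words are illustrative guesses, not a vetted per-language list:

```python
# Per-language revert markers, as such a curated input file might
# be loaded into memory (contents here are illustrative only).
REVERT_MARKERS = {
    "en": ["revert", "rv"],
    "de": ["revert", "zurückgesetzt"],
    "nl": ["revert", "teruggedraaid"],
}


def comment_looks_like_revert(comment, lang):
    """Case-insensitive check of an edit summary against the
    language's marker list; unknown languages match nothing."""
    c = comment.lower()
    return any(marker in c for marker in REVERT_MARKERS.get(lang, []))


print(comment_looks_like_revert("Reverted edits by X", "en"))       # True
print(comment_looks_like_revert("Änderungen zurückgesetzt", "de"))  # True
print(comment_looks_like_revert("typo fix", "en"))                  # False
```

Substring matching keeps this simple but would over-match short acronyms like 'rv' inside other words; a real implementation would likely want word-boundary matching per marker.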
Count of links
Counts of links are seldom mentioned anywhere. This is also susceptible to skewing, as many internal links occur in templates (which Wikistats doesn't parse).
If anything I would favor external links only, but there might be a better way to collect these than via the full archive dump (namely the dump [somewiki]-[yyyymmdd]-langlinks.sql.gz).