Let's migrate the current geowiki data to the analytics stack (Druid) so we can have a (thus far) private dashboard on top of it. Any path to making this data public needs to go through migration of the infrastructure first.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Milimetric | T131280 Make aggregate data on editors per country per wiki publicly available
Resolved | | Milimetric | T176996 Private geo wiki data in new analytics stack
Resolved | | Milimetric | T182944 Read the python code and design the Hadoop version
Resolved | | Milimetric | T184759 Sqoop cu_changes table for geowiki
Resolved | | Milimetric | T188113 Oozie job to compute geowiki on top of sqooped data
Resolved | | Milimetric | T191343 Vet new geo wiki data
Resolved | | fdans | T190856 Archive old geowiki data (editors per country) and make it easily available at WMF
Resolved | | fdans | T190059 Turn off old geowiki jobs
Resolved | | Milimetric | T193429 Rename new geowiki to geoeditors
Event Timeline
Moving notes here:
We're still setting up logins for that, but we can look at it together in the meeting and see if it meets your needs.
The UI is not important; the numbers are important.
- Looking over the logic, it seems that Geowiki computes statistics on all activity types, including some system administration activities, so its numbers are inflated by something like 20%-30%. We figure you probably only want to look at actual revisions and page creations, but let us know otherwise.
Asaf not aware of the difference, not interested in actions beyond revisions and page creations.
A corrected measurement would be better, putting it in line with other stats we're gathering, say by language. Things should add up: numbers should be comparable when they should be (fake example: for a country like Turkey, edits per language and edits per country should be similar).
- Geowiki normalizes incomplete months so that you can compare data even if the current month is not over.
We should talk about this and whether it's still useful
Ideally, the new system would still project out numbers for the full month on incomplete months
- Geowiki does not include anonymous editing. Should we include anonymous edits? Categorize them?
Include them in the next system, and categorize them.
- Geowiki excludes bots, should we categorize them?
Exclude them from any default totals, but categorize them so we have the data (maybe exclude them completely because they're mostly just running on Tool Labs).
- Geowiki only includes Wikipedia right now? Yes, it does.
Include geowiki stats for all projects.
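The note above about projecting numbers for incomplete months could be sketched as follows. This is only an illustration under a naive assumption: it scales a month-to-date count linearly by the fraction of the month elapsed. How the existing Geowiki code actually normalizes is not stated in these notes, and the function name `project_full_month` is hypothetical. Linear scaling is also a rough fit for distinct-editor counts, which do not grow linearly over a month.

```python
import calendar
from datetime import date


def project_full_month(count_so_far: float, as_of: date) -> float:
    """Naively project a month-to-date count to the full month.

    Assumption (not from the notes): the data covers day 1 of the month
    through `as_of` inclusive, and activity is spread evenly across days.
    """
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    days_elapsed = as_of.day
    return count_so_far * days_in_month / days_elapsed


# 150 edits counted through April 10 (a 30-day month)
projected = project_full_month(150, date(2018, 4, 10))
```

For a complete month the function returns the count unchanged, so the same code path can serve both finished and in-progress months.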
Data will be updated monthly (?) (maybe data needs to be updated more frequently?)
- join the data with user and user_groups for the bot/non bot distinction
- build the end table of data in Hive with counts (per month or per day)
Does the cu_changes table have data for more than 1 month? Yes, it seems to have 3 months.
per wiki, per month (per day maybe), per country: editors with 5 or more edits, editors with 100 or more edits
table:
date, project, country, editors with 1-4 edits [1-4], editors with 5-99 edits [5-99], editors with 100 or more edits [100+]
How about reportupdater + dashiki?
To clarify with @Ijon: do we use counts of editors with 5 or more and 100 or more edits, or more detailed counts like "number of editors with 1, 2, 3, 5, 100 edits"?
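As a rough illustration of the end table sketched in the notes (date, project, country, and editor counts per activity bucket), here is a minimal in-memory version in Python. It is not the Hive or Oozie implementation. The input shapes, the function names `bucket` and `editors_per_country`, and the `"bot"` group name are assumptions made for the example. Bucket boundaries follow the bracketed ranges in the table note, and bots are simply dropped from the totals rather than categorized.

```python
from collections import Counter, defaultdict

BOT_GROUP = "bot"  # assumed name of the bot user group


def bucket(edit_count: int) -> str:
    """Map an editor's edit count to an activity bucket."""
    if edit_count >= 100:
        return "100+"
    if edit_count >= 5:
        return "5-99"
    return "1-4"


def editors_per_country(changes, user_groups):
    """Count editors per (month, project, country) and activity bucket.

    changes: iterable of (month, project, country, user), one row per edit
             (a stand-in for the sqooped cu_changes data).
    user_groups: dict mapping user -> set of group names.
    """
    edits = Counter()
    for month, project, country, user in changes:
        if BOT_GROUP in user_groups.get(user, set()):
            continue  # bots are excluded from the default totals in this sketch
        edits[(month, project, country, user)] += 1

    table = defaultdict(Counter)
    for (month, project, country, _user), n in edits.items():
        table[(month, project, country)][bucket(n)] += 1
    return {key: dict(counts) for key, counts in table.items()}


changes = (
    [("2018-04", "enwiki", "TR", "alice")] * 5
    + [("2018-04", "enwiki", "TR", "bob")] * 2
    + [("2018-04", "enwiki", "TR", "botty")] * 200
)
result = editors_per_country(changes, {"botty": {"bot"}})
```

If the more detailed counts raised in the question to @Ijon are wanted instead, only `bucket` would need to change.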