Private geo wiki data in new analytics stack
Closed, Resolved | Public | 0 Estimated Story Points

Description

Let's migrate the current geowiki data to Druid in the analytics stack so we can have a (thus far) private dashboard on top of it. Any path to making this data public needs to go through a migration of the infrastructure first.

Event Timeline

Nuria created this task. Sep 28 2017, 5:51 PM
Nuria updated the task description. Sep 28 2017, 5:53 PM
Ijon awarded a token. Sep 29 2017, 4:19 PM
fdans raised the priority of this task from Medium to High. Oct 2 2017, 4:09 PM
fdans moved this task from Incoming to Wikistats on the Analytics board.
fdans removed subscribers: Aklapper, csteipp, Milimetric and 11 others.
Nuria assigned this task to Milimetric. Dec 19 2017, 10:38 PM
Nuria moved this task from Wikistats to Operational Excellence Future on the Analytics board.
Nuria added a comment. Jan 10 2018, 6:42 PM

Moving notes here:

We're still setting up logins for that but we can look at it together in the meeting and see if it meets your needs
The UI is not important, the numbers are important

  • Looking over the logic, it seems that Geowiki computes statistics on all activity types, including some system administration activities, so its numbers are inflated by something like 20%-30%. We figure you probably only want to look at actual revisions and page creations, but let us know otherwise.

Asaf not aware of the difference, not interested in actions beyond revisions and page creations.
A corrected measurement would be better, putting it in line with other stats we're gathering, say by language. Things should add up: numbers should be comparable when they should be, say for a country like Turkey, in which edits per language and edits per country should be similar (fake example).

  • Geowiki normalizes incomplete months so that you can compare data even if the current month is not over.

We should talk about this and whether it's still useful
Ideally, the new system would still project out numbers for the full month on incomplete months
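
As an illustration of what "projecting out" an incomplete month could mean, here is a minimal sketch that scales a partial-month count linearly by the number of days elapsed. The linear rule and the function name `project_full_month` are assumptions for illustration; these notes do not say how Geowiki actually normalizes. Linear scaling fits raw edit counts better than threshold-based editor counts (for example "editors with 5 or more edits"), which do not grow proportionally with time.

```python
import calendar
from datetime import date


def project_full_month(count_so_far: int, as_of: date) -> float:
    """Linearly scale a partial-month count up to a full-month estimate.

    Assumes the data covers the 1st of the month through the end of `as_of`.
    """
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    days_elapsed = as_of.day
    return count_so_far * days_in_month / days_elapsed


# 300 edits in the first 10 days of January (31 days)
print(project_full_month(300, date(2018, 1, 10)))  # 930.0
```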

  • Geowiki does not include anonymous editing, should we include them? Categorize them?

Include them in the next system, categorize them

  • Geowiki excludes bots, should we categorize them?

exclude from any default totals, but categorize them so we have the data (maybe exclude them completely because they're mostly just running on tool labs)

  • Geowiki only includes Wikipedia right now? Yes, it does

include geowiki stats for all projects

Nuria added a comment. Jan 11 2018, 6:43 PM

Data will be updated monthly (?) (maybe data needs to be updated more frequently?)

  • join the data with user and user_groups for the bot/non bot distinction
  • build end table of data in hive with counts (per month or per day)
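
The join step above can be sketched in plain Python as follows. The row shapes are assumptions for illustration: each change row carries a hypothetical `user_id` key (0 meaning an anonymous edit), and group rows use `ug_user` / `ug_group` as in MediaWiki's `user_groups` table. The real job would presumably run as a Hive query rather than in Python.

```python
def categorize_editors(changes, user_groups):
    """Tag each change row as 'anonymous', 'bot' or 'registered'.

    changes:     list of dicts with a 'user_id' key (0 = anonymous; assumed shape)
    user_groups: list of dicts with 'ug_user' and 'ug_group' keys
    """
    # Set of user ids that belong to the 'bot' group
    bot_ids = {ug["ug_user"] for ug in user_groups if ug["ug_group"] == "bot"}

    tagged = []
    for row in changes:
        if row["user_id"] == 0:
            editor_type = "anonymous"
        elif row["user_id"] in bot_ids:
            editor_type = "bot"
        else:
            editor_type = "registered"
        # Copy the row so the input is left untouched
        tagged.append(dict(row, editor_type=editor_type))
    return tagged
```

Keeping the category on every row, instead of dropping bots and anonymous edits up front, leaves the data available while still letting default totals exclude bots.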

Does cu_changes_table have data for more than 1 month? Yes, it seems to have 3 months.

per wiki, per month (per day maybe), per country, 5 or more edits, 100 or more edits

table:
date, project, country, editors with 1-4 edits [1-4], editors with 5-99 edits [5-99], editors with 100 or more edits [100+]
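
A minimal sketch of how such a table could be computed, assuming one input tuple per edit in the form `(month, project, country, editor)`. The function names and the input shape are hypothetical; the bucket labels follow the bracketed ranges in the table layout above.

```python
from collections import Counter, defaultdict


def activity_bucket(edit_count: int) -> str:
    """Map an editor's edit count to one of the three activity buckets."""
    if edit_count >= 100:
        return "100+"
    if edit_count >= 5:
        return "5-99"
    return "1-4"


def editors_by_activity(edits):
    """Count editors per (month, project, country) and activity bucket.

    edits: iterable of (month, project, country, editor) tuples, one per edit.
    """
    # Number of edits made by each editor within each month/project/country
    edits_per_editor = Counter(edits)

    table = defaultdict(Counter)
    for (month, project, country, _editor), n in edits_per_editor.items():
        table[(month, project, country)][activity_bucket(n)] += 1
    return table
```

With exclusive buckets like these, the cumulative "5 or more edits" figure can be derived by adding the 5-99 and 100+ columns.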

How about reportupdater + dashiki?

To clarify with @Ijon: do we use counts of editors with 5 or more and 100 or more edits, or more detailed counts like "number of editors with 1, 2, 3, 5, 100 edits"?

Nuria added a comment. Jan 11 2018, 6:45 PM

We can get started on sqooping the cu_changes_table

fdans edited projects, added Analytics-Kanban; removed Analytics. Feb 26 2018, 5:19 PM
fdans set the point value for this task to 0.
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.
Nuria edited projects, added Analytics; removed Analytics-Kanban. Mar 8 2018, 6:38 PM
Nuria moved this task from Operational Excellence Future to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria closed this task as Resolved. Jul 16 2018, 6:56 PM