"Edit" equivalent of pageviews daily available to use in Turnilo and Superset
Open · High · Public

Description

The ability for non-technical users to manually pivot and visualize pageview data in tools like Turnilo and Superset has been invaluable. While edit data is clearly richer and more complicated, it would be great if we could start with a similar table for edits, using raw edits in place of pageviews. Wikistats 2.0 is a great tool and lets you do some of this, but it doesn't have the flexibility that we often require. Obviously the desired result is to have an equivalent edits dataset available in Turnilo and Superset.

Specifically, the desired dimensions might be:

  • User type: user / bot / group bot / anon
  • VE/Wikitext/Other
  • Project
  • Platform: desktop web, mobile web, iOS, Android
  • Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)
  • Continent
  • Edit namespace

Advanced:

  • Revert status
  • Made by editor with edit-count bucket (1-5, 6-10, ...)
  • Made by editor within 1 day of first edit, 2-14 days, 15-30 days, 90 days, 1 year

I was recently told by @Nuria that this was all possible (and more) with the mediawiki history table, which was removed from turnilo. If that is the case, and the challenge was social, perhaps we can bring that table back and deal with the social issue later. @Neil_P._Quinn_WMF pinging you as you were mentioned as someone who could verify that the above was possible with that table.

JKatzWMF created this task. Dec 5 2018, 3:19 AM
Nuria added a comment (edited). Dec 5 2018, 6:26 PM

Importing mediawiki_history into Turnilo should, I think, be possible. I'm leaving it up to @Neil_P._Quinn_WMF to decide whether this is the best format to answer these questions, as some of the requested dimensions are just not present in MediaWiki datasets.

VE/Wikitext/Other

FYI, the MediaWiki history data does not include tags at this time; this means you cannot tell whether an edit was made via the API, wikitext, or VE. The dataset will include tags once we figure out how to import them performantly, so while this field might not exist in initial imports, it will in later ones. See: https://phabricator.wikimedia.org/T161149

Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)

Aggregated data on editors per country per wiki is already available in both Superset and Druid. This is a different dataset from mediawiki_history (which is not aggregated at all), and it is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geoeditors. Superset dashboard: https://bit.ly/2zHyxMo

Edits-per-country data exists as of January 2018. Data older than January 2018 is also in Druid, but its quality is questionable; that older data is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Archive/Geowiki
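
For reference, something like this Druid native groupBy query should work against that datasource. This is only a sketch: the datasource and field names (geoeditors_monthly, country_code, distinct_editors) are my guesses from the Geoeditors docs, not verified against the actual Druid schema, and summing distinct_editors across wikis will double-count editors active on more than one wiki.

{
  "queryType": "groupBy",
  "dataSource": "geoeditors_monthly",
  "granularity": "all",
  "dimensions": ["country_code"],
  "aggregations": [
    { "type": "longSum", "name": "editors", "fieldName": "distinct_editors" }
  ],
  "intervals": ["2018-01-01/2019-01-01"]
}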

fdans added a subscriber: fdans. Dec 10 2018, 4:52 PM

We'll be working on this in Q3 2019. This is easier to achieve if you don't need the article title.

fdans triaged this task as High priority.

I think a straight import of mediawiki_history isn't the way to go here, because it's designed with a ton of dimensions so analysts can have as much flexibility as possible. I love that, but it's not a great fit for this use case, because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. setting an is_administrator flag if event_user_groups_historical includes sysop).

@Nuria, would that be a feasible strategy? If so, I or another analyst can draft a schema for discussion.

Tbayer added a subscriber: Tbayer. Mon, Dec 31, 9:00 PM
Nuria added a comment (edited). Thu, Jan 3, 3:02 PM

because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

Agreed, I made this point to @JKatzWMF earlier.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. applying the is_administrator flag if event_user_groups_historical includes sysop).
would that be a feasible strategy?

Ya, I think this is the way to go. We'd rather avoid costly transformations of the data, so if we can do the transforms using Druid transforms, that would be best.

In your example, the transform would be as follows (pseudo-code; not sure this would work):

"transformSpec": {

  "transforms": [
    {
      "type": "expression",
      "name": "is_adminitrator",
      "expression": "strpos( cast(event_user_groups_historical, String), sysop) != -1" }
  ]
},

So the next thing to do would be to define the column transformations that would make the data useful, and map those to Druid functions that can be applied to each column upon ingestion. Can you work on defining the simplified version, @Neil_P._Quinn_WMF?
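
To make the bucketed dimensions from the task description concrete, here is one more transform sketch, this time for the edit-count buckets. It assumes Druid's case_searched expression and mediawiki_history's event_user_revision_count field; the bucket boundaries are illustrative, not a tested spec.

"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "edit_count_bucket",
      "expression": "case_searched(event_user_revision_count <= 5, '1-5', event_user_revision_count <= 10, '6-10', event_user_revision_count <= 100, '11-100', '100+')"
    }
  ]
},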

Milimetric added a subscriber: Milimetric.

Assigning this to @Neil_P._Quinn_WMF to provide us the definition of the simplified version.

@Milimetric How quickly will you be able to set the cube up in Turnilo once we provide the transform spec? We're trying to prioritize this among our other work 😁

Nuria added a subscriber: mforns. Fri, Jan 18, 3:54 PM

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. Probably @mforns will be doing this work.

Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. Fri, Jan 18, 6:45 PM

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. Probably @mforns will be doing this work.

Thanks, good to know!

Unassigning myself so we do team triage on it and figure out who should do it (could be me).

Nuria added a comment. Fri, Jan 18, 7:33 PM

The code to ingest this data already exists, but it does not work well due to the number of dimensions and how hard it is (at least for me) to understand the dimensions and measures in the fully denormalized dataset. See:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/druid/load_mediawiki_history.json.template
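
A simplified version would mostly mean cutting the dimensions list in that template down to something like the sketch below. The names here are illustrative only: wiki_db, page_namespace, and revision_is_identity_reverted exist in mediawiki_history, while user_type and edit_count_bucket would be new fields derived via transforms like the ones above.

"dimensionsSpec": {
  "dimensions": [
    "wiki_db",
    "user_type",
    "page_namespace",
    "revision_is_identity_reverted",
    "edit_count_bucket"
  ]
}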