
"Edit" equivalent of pageviews daily available to use in Turnilo and Superset
Open, High, Public

Description

The ability for non-technical users to manually pivot and visualize pageview data in tools like turnilo and superset has been invaluable. While the edit data is clearly richer and more complicated, it would be great if we could start with a similar table for edits, using raw edits in place of pageviews. Wikistats 2.0 is a great tool and lets you do some of this, but doesn't have the flexibility that we often require.

Specifically, the desired dimensions might be (a rough Druid mapping is sketched after the lists below):

  • User/Bot/Group bot/anon
  • VE/Wikitext/Other
  • Project
  • Platform: desktop web, mobile web, iOS, Android
  • Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)
  • Continent
  • Edit namespace

Advanced:

  • Revert status
  • Made by an editor in a given edit-count bucket (1-5, 6-10, ...)
  • Made by an editor within 1 day of their first edit, 2-14 days, 15-30 days, 90 days, or 1 year
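
For illustration, these dimensions might map onto a Druid dimension list roughly like this (a sketch only; all field names here are hypothetical, pending the draft schema below):

  "dimensionsSpec": {
    "dimensions": [
      "project",
      "platform",
      "country",
      "continent",
      "namespace",
      "user_type",
      "editing_interface",
      "revert_status",
      "user_edit_count_bucket",
      "user_tenure_bucket"
    ]
  }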

Key stakeholders

Draft schema

Simplified mediawiki history data for druid - Working Notes

Event Timeline

JKatzWMF created this task. Dec 5 2018, 3:19 AM
Restricted Application added a subscriber: Aklapper. Dec 5 2018, 3:19 AM
Nuria added a comment (edited). Dec 5 2018, 6:26 PM

Importing mediawiki_history into Turnilo should, I think, be possible, leaving it up to @Neil_P._Quinn_WMF to decide whether this is the best format to answer these questions, as some of the requested dimensions are just not present in the mediawiki datasets.

VE/Wikitext/Other

FYI, the MediaWiki history data does not include tags at this time, which means you cannot tell whether an edit was made via the API, wikitext editor, or VE. The dataset will include tags once we figure out how to import them performantly, so while this field might not exist in the initial imports, it will in later ones. See: https://phabricator.wikimedia.org/T161149

Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)

Aggregated data on editors per country per wiki is already available in both Superset and Druid. This is a different dataset from mediawiki_history (which is not aggregated at all), and it is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geoeditors. Superset dashboard: https://bit.ly/2zHyxMo

Edits-per-country data exists as of January 2018. Data older than January 2018 is also in Druid, but its quality is questionable; that older data is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Archive/Geowiki

fdans added a subscriber: fdans. Dec 10 2018, 4:52 PM

We'll be working on this in Q3 2019. This is easier to achieve if you don't need the article title.

fdans triaged this task as High priority.

I think a straight import of mediawiki_history isn't the way to go here, because it's designed with a ton of dimensions so analysts can have as much flexibility as possible. I love that, but it's not a great fit for this use case, because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. applying the is_administrator flag if event_user_groups_historical includes sysop).

@Nuria, would that be a feasible strategy? If so, I or another analyst can draft a schema for discussion.

Tbayer added a subscriber: Tbayer. Dec 31 2018, 9:00 PM
Nuria added a comment (edited). Jan 3 2019, 3:02 PM

because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

Agreed, I made this point to @JKatzWMF earlier.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. applying the is_administrator flag if event_user_groups_historical includes sysop).
would that be a feasible strategy?

Ya, I think this is the way to go. We would rather avoid costly transformations of the data, so if we can do the data transforms using Druid's ingestion-time transforms, that would be best.

In your example, the transform would be as follows (not sure if this would work; pseudo-code):

"transformSpec": {

  "transforms": [
    {
      "type": "expression",
      "name": "is_adminitrator",
      "expression": "strpos( cast(event_user_groups_historical, String), sysop) != -1" }
  ]
},
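
The bucketed dimensions requested above could be derived the same way. For example, edit-count buckets might be computed from mediawiki_history's event_user_revision_count field using Druid's case_searched expression (equally pseudo-code; the field name and the exact behavior of the expression would need to be verified against the docs):

  {
    "type": "expression",
    "name": "user_edit_count_bucket",
    "expression": "case_searched(cast(event_user_revision_count, 'LONG') <= 5, '1-5', cast(event_user_revision_count, 'LONG') <= 10, '6-10', '11+')"
  }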

So the next thing to do would be to define the column transformations that would make the data useful, and to map those to Druid functions that can be applied to each column upon ingestion. Can you work on defining the simplified version, @Neil_P._Quinn_WMF?

Milimetric added a subscriber: Milimetric.

Assigning this to @Neil_P._Quinn_WMF to provide the definition of the simplified version.

@Milimetric How quickly will you be able to set the cube up in Turnilo once we provide the transform spec? We're trying to prioritize this among our other work 😁

Nuria added a subscriber: mforns. Jan 18 2019, 3:54 PM

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. @mforns will probably be doing this work.

Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. Jan 18 2019, 6:45 PM

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. @mforns will probably be doing this work.

Thanks, good to know!

Unassigning myself so we can do team triage on it and figure out who should do it (could be me).

Nuria added a comment. Jan 18 2019, 7:33 PM

The code to ingest this data already exists, but it does not work well due to the number of dimensions and how hard it is to understand the dimensions and measures (at least for me) in the fully denormalized dataset. See:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/druid/load_mediawiki_history.json.template
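
For comparison, the simplified version would mostly mean cutting that template's dimension list down and adding a transformSpec, roughly along these lines (a sketch only, not a working spec; the derived names are the hypothetical ones from the examples above):

  "transformSpec": {
    "transforms": [
      { "type": "expression", "name": "is_administrator", "expression": "..." },
      { "type": "expression", "name": "user_edit_count_bucket", "expression": "..." }
    ]
  },
  "dimensionsSpec": {
    "dimensions": ["project", "namespace", "user_type", "is_administrator", "user_edit_count_bucket"]
  }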

kzimmerman updated the task description. Feb 14 2019, 11:24 PM
kzimmerman edited subscribers, added: MNeisler; removed: mforns, Milimetric, kzimmerman.

@MNeisler will take on the task of creating the schema with guidance from @Neil_P._Quinn_WMF.

kzimmerman moved this task from Triage to Next Up on the Product-Analytics board. Feb 14 2019, 11:28 PM

@MNeisler Nuria mentioned that @mforns will be testing ways to load datasets related to this ask (as I understand it, he's wrapping up some other work first). Can the two of you sync up and coordinate work on this?

@kzimmerman @MNeisler
Sure, we can discuss here or have a meeting, whatever is better for you. I also just talked to @Neil_P._Quinn_WMF about whether we should extract the data from mediawiki_history into an intermediate Hive table and then load from that, or just use Druid transforms to ingest directly from mediawiki_history. I lean towards the second option, because it doesn't need the extra step (a table that would have to be maintained). But let's discuss!

@mforns Thanks! Yes, happy to discuss and coordinate on this. I reviewed this task with @Neil_P._Quinn_WMF today. I'm going to first work on defining our desired dimensions and transformations based on the types of queries we'd want to run and how the data will be used, which should help inform the best method for loading the dataset. I'll reach out to discuss once we have a better idea of the needed transforms, if that works for you.

@MNeisler Cool :]
Here's the Druid transforms expression list, so that you know the possibilities and the limitations: http://druid.io/docs/latest/misc/math-expr.html
Let me know if I can help!

MNeisler moved this task from Next Up to Doing on the Product-Analytics board. Feb 24 2019, 8:39 PM

Hi @MNeisler! We'd like to have this done by the end of this quarter. Is there anything we can do? I can help you build a job that loads the data. Maybe we can have a meeting where you pass me the requirements for the data set.

Hi @mforns!

Thanks for the update on the timeline. A meeting would be great; I'll set up a time for us and @Neil_P._Quinn_WMF to meet this week if possible. I've worked with Neil to identify the simplified list of mediawiki_history dimensions and mapped those to Druid expressions. I'll share it with you soon and we can discuss at the meeting.

@MNeisler
Great, I already accepted the invite. Thank you!