
"Edit" equivalent of pageviews daily available to use in Turnilo and Superset
Closed, Resolved · Public

Description

The ability for non-technical users to manually pivot and visualize pageview data in tools like turnilo and superset has been invaluable. While the edit data is clearly richer and more complicated, it would be great if we could start with a similar table for edits, using raw edits in place of pageviews. Wikistats 2.0 is a great tool and lets you do some of this, but doesn't have the flexibility that we often require.

Specifically, the desired dimensions might be:

  • User/Bot/Group bot/anon
  • VE/Wikitext/Other
  • Project
  • Platform: desktop web, mobile web, iOS, Android
  • Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)
  • Continent
  • Edit namespace

Advanced:

  • Revert status
  • Made by an editor in a given lifetime edit-count bucket (1-5, 6-10, ...)
  • Made by an editor within 1 day of their first edit, 2-14 days, 15-30 days, 90 days, 1 year (see the bucketing sketch after this list)
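
For illustration, here is a minimal HiveQL sketch of how these two bucketed dimensions could be derived from mediawiki_history. The field names (event_user_revision_count, event_user_registration_timestamp) and the finer bucket boundaries are assumptions for illustration only, not a spec:

-- Hedged sketch: field names and the finer bucket boundaries are assumptions.
SELECT
  edit_count_bucket,
  user_tenure_bucket,
  COUNT(*) AS edits
FROM (
  SELECT
    CASE
      WHEN event_user_revision_count <= 5  THEN '1-5'
      WHEN event_user_revision_count <= 10 THEN '6-10'
      ELSE '11+'
    END AS edit_count_bucket,
    CASE
      WHEN DATEDIFF(event_timestamp, event_user_registration_timestamp) <= 1   THEN '1 day'
      WHEN DATEDIFF(event_timestamp, event_user_registration_timestamp) <= 14  THEN '2-14 days'
      WHEN DATEDIFF(event_timestamp, event_user_registration_timestamp) <= 30  THEN '15-30 days'
      WHEN DATEDIFF(event_timestamp, event_user_registration_timestamp) <= 90  THEN '31-90 days'
      WHEN DATEDIFF(event_timestamp, event_user_registration_timestamp) <= 365 THEN '91-365 days'
      ELSE 'over 1 year'
    END AS user_tenure_bucket
  FROM wmf.mediawiki_history
  WHERE event_entity = 'revision'
    AND event_type = 'create'
) bucketed
GROUP BY edit_count_bucket, user_tenure_bucket;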

Key stakeholders

Draft schema

Simplified mediawiki history data for druid - Working Notes

Event Timeline


Importing mediawiki_history into Turnilo should, I think, be possible. I'm leaving it up to @Neil_P._Quinn_WMF to decide whether this is the best format to answer these questions, as some of the requested dimensions are just not present in MediaWiki datasets.

VE/Wikitext/Other

FYI, the MediaWiki history data does not include tags at this time, which means you cannot tell whether an edit was made via the API, the wikitext editor, or VE. The dataset will include tags once we figure out how to import them performantly, so while this field might not exist in initial imports, it will in later ones. See: https://phabricator.wikimedia.org/T161149

Country (I know the individual data gets thrown out after 90 days, but the aggregates could be kept?)

Aggregated data on editors per country per wiki is already available in both Superset and Druid. This is a different dataset from mediawiki_history (which is not aggregated at all), and it is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geoeditors. Superset dashboard: https://bit.ly/2zHyxMo

Edits-per-country data exists as of January 2018; data older than January 2018 is also in Druid, but its quality is questionable. The older data is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Archive/Geowiki

We'll be working on this in Q3 2019. This is easier to achieve if you don't need the article title.

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

I think a straight import of mediawiki_history isn't the way to go here, because it's designed with a ton of dimensions so analysts can have as much flexibility as possible. I love that, but it's not a great fit for this use case, because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. applying the is_administrator flag if event_user_groups_historical includes sysop).

@Nuria, would that be a feasible strategy? If so, I or another analyst can draft a schema for discussion.

because the complexity makes the data harder to use and because a lot of those dimensions wouldn't work well with Druid anyway.

Agreed, I made this point to @JKatzWMF earlier.

So my ideal would be a simplified version of mediawiki_history where we eliminate some dimensions, bucket others, and apply some simple transforms (e.g. applying the is_administrator flag if event_user_groups_historical includes sysop).
would that be a feasible strategy?

Ya, I think this is the way to go; we'd rather avoid costly transformations of the data, so if we can do the transforms using Druid transforms, that would be best.

In your example the transform would be as follows (not sure if this would work, pseudo-code)

"transformSpec": {

  "transforms": [
    {
      "type": "expression",
      "name": "is_adminitrator",
      "expression": "strpos( cast(event_user_groups_historical, String), sysop) != -1" }
  ]
},

So the next thing to do would be to define the column transformations that would make the data useful and map them to Druid functions that can be applied to each column upon ingestion. Can you work on defining the simplified version, @Neil_P._Quinn_WMF?

Milimetric subscribed.

Assigning this to @Neil_P._Quinn_WMF to provide us the definition of the simplified version.

@Milimetric How quickly will you be able to set the cube up in Turnilo once we provide the transform spec? We're trying to prioritize this among our other work 😁

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. Probably @mforns will be doing this work.

That is quite easy; it will just take a few hours to load the data with different transformations and see how it looks in Turnilo. Probably @mforns will be doing this work.

Thanks, good to know!

Unassigning myself so we do team triage on it and figure out who should do it (could be me).

The code to ingest this data already exists, but it does not work well due to the number of dimensions and how hard it is to understand the dimensions and measures (at least for me) in the fully denormalized dataset; see:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/druid/load_mediawiki_history.json.template

kzimmerman edited subscribers, added: MNeisler; removed: mforns, Milimetric, kzimmerman.

@MNeisler will take on the task of creating the schema with guidance from @Neil_P._Quinn_WMF.

@MNeisler Nuria mentioned that @mforns will be testing ways to load datasets related to this ask (as I understand it, he's wrapping up some other work first). Can the two of you sync up and coordinate work on this?

@kzimmerman @MNeisler
Sure, we can discuss here or have a meeting, whatever's better for you. I also just talked to @Neil_P._Quinn_WMF about whether we should extract the data from mediawiki_history into an intermediate Hive table and then load from that one, or just use Druid transforms to ingest directly from mediawiki_history. I lean towards the second option, because it doesn't need the extra step (a table that would have to be maintained). But let's discuss!
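
For concreteness, the intermediate-table option might look roughly like the sketch below: pre-aggregate mediawiki_history into an hourly table that Druid can then ingest as-is. All output column names, the source field names, and the snapshot value are illustrative assumptions, since the schema was still being drafted at this point.

-- Hedged sketch of the intermediate-table option; column names, source fields,
-- and the snapshot partition value are assumptions, not the final schema.
CREATE TABLE IF NOT EXISTS edit_hourly_sketch AS
SELECT
  CONCAT(SUBSTR(event_timestamp, 1, 13), ':00:00')      AS ts,
  wiki_db                                               AS project,
  event_user_is_anonymous                               AS user_is_anonymous,
  ARRAY_CONTAINS(event_user_groups_historical, 'sysop') AS is_administrator,
  page_namespace_is_content                             AS namespace_is_content,
  COUNT(*)                                              AS edit_count
FROM wmf.mediawiki_history
WHERE event_entity = 'revision'
  AND event_type = 'create'
  AND snapshot = '2019-01'    -- hypothetical snapshot partition
GROUP BY
  CONCAT(SUBSTR(event_timestamp, 1, 13), ':00:00'),
  wiki_db,
  event_user_is_anonymous,
  ARRAY_CONTAINS(event_user_groups_historical, 'sysop'),
  page_namespace_is_content;

The trade-off is exactly the one described above: a table like this is easy to reason about and reusable from Hive, but it is one more scheduled artifact to maintain, whereas Druid transforms keep everything inside the ingestion spec.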

@mforns Thanks! Yes, happy to discuss and coordinate on this. I reviewed this task with @Neil_P._Quinn_WMF today. I'm going to first work on defining our desired dimensions and transformations based on the type of queries we'd want to run and how the data will be used, which might help inform the best method for loading the dataset. I’ll reach out to discuss once we have a better idea of the needed transforms if that works for you.

@MNeisler Cool :]
Here's the Druid transforms expression list, so that you know the possibilities and the limitations: http://druid.io/docs/latest/misc/math-expr.html
Let me know if I can help!

Hi @MNeisler! We'd like to have this done by the end of this quarter. Is there anything we can do? I can help you build a job that loads that data. Maybe we can have a meeting and you can pass me the requirements of the data set.

Hi @mforns!

Thanks for the update re the timeline. A meeting would be great - I’ll set up a time for us and @Neil_P._Quinn_WMF to meet this week if possible. I’ve worked with Neil to identify the simplified list of mediawiki_history dimensions and mapped those to druid expressions. I'll share with you soon and we can discuss at the meeting.

@MNeisler
Great, I already accepted the invite. Thank you!

Here's the updated edit table schema with suggested transforms to ingest directly from mediawiki_history into Druid.

@mforns Can you use this to build a job that loads the data for testing? Let me know if you need any additional info or would like to coordinate on any needed changes.

Note there are a few dimensions we'd like to add later, pending changes to the mediawiki_history table (noted under the "dimensions still pending" section in the doc). I'll create a separate task for that.

Thanks a lot @MNeisler!
@fdans took this task and is already working on it; I will reassign the task to him.
Fran, the awesome doc Megan and Neil put together has everything you need, I believe.

@Nuria @MNeisler Oh yea, I got totally confused. Sorry.
Will assign this task to myself and start working on it this week.
Sorry @fdans for the mess.

Change 499917 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Modify mediawiki/history/druid job to ingest a simpler data set to druid

https://gerrit.wikimedia.org/r/499917

Change 499917 abandoned by Mforns:
Modify mediawiki/history/druid job to ingest a simpler data set to druid

Reason:
This is taken care in another patch after refactor.

https://gerrit.wikimedia.org/r/499917

Change 501607 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add oozie job to load edit_hourly to druid

https://gerrit.wikimedia.org/r/501607

Hey all!

The first test of edits_hourly is in Druid: https://turnilo.wikimedia.org/#edits_hourly
And in Hive under wmf.edit_hourly (the extra 's' in the druid datasource name is to be consistent with druid pageviews_hourly vs. hive pageview_hourly, but we can change that).

Regarding performance, I think it runs quick enough. A line-chart visualization for all history (2001-2019) with 1 filter and 1 split takes around 5 seconds.
When we add more dimensions, we might see a decrease in performance, though. Not sure if we should consider changing to a daily base resolution.

IMPORTANT NOTE: There's one extra metric that Turnilo adds by default called 'count'. This metric does not make any sense; it's broken given the way we imported the data and shouldn't be there. We can remove it from the Turnilo config.yaml file once this data set is in production. However, be careful and change it to edit_counts or text_bytes_diff when you test, because it is the selected metric by default (thanks, Turnilo :).
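
To make the warning about 'count' concrete: it presumably counts the pre-aggregated rows that were ingested, not edits, so totals should come from summing the real measures instead. On the Hive side, a query against wmf.edit_hourly might look roughly like this; the column names (ts, project, user_is_anonymous, edit_count) and the snapshot partition are assumptions, not the confirmed schema:

-- Hedged example; column names and the snapshot partition are assumptions.
SELECT
  SUBSTR(ts, 1, 10) AS day,
  project,
  SUM(edit_count)   AS edits
FROM wmf.edit_hourly
WHERE snapshot = '2019-03'    -- hypothetical snapshot partition
  AND ts LIKE '2019-03%'
  AND NOT user_is_anonymous
GROUP BY SUBSTR(ts, 1, 10), project
ORDER BY day, edits DESC
LIMIT 100;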

The hourly granularity on display looks a bit strange (see screenshot). Could it be that we are not loading data every other hour, or that the hour intervals are not continuous somehow, so we have 1-2 and 3-4 but not 2-3?

[Screenshot: Screen Shot 2019-04-05 at 11.32.36 AM.png]

Wow, that's weird.

What I see is that all pm hours are not there... I can see data for am but not pm.
OK, will look into this.

It's a problem with the generation of edit_hourly in Hive.
The timestamp is not well formatted; I was using:

FROM_UNIXTIME(
    UNIX_TIMESTAMP(event_timestamp, 'yyyy-MM-dd hh:mm:ss.sss'),
    'yyyy-MM-dd hh:00:00.0'
) AS dt

But it's converting that to am-only hours; it's an easy fix.
Will fix that on Monday.
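
For reference (a guess at the fix, not the actual patch): in Java/Hive date patterns 'hh' is the 1-12 hour field, while 'HH' is 0-23, which would explain the am-only hours; the fractional-seconds pattern may also need 'S' rather than 'sss'. Something along these lines should behave:

-- Hedged sketch of the fix: 'HH' is the 24-hour field; 'hh' is 1-12 and
-- collapses everything into am hours when no am/pm marker is emitted.
FROM_UNIXTIME(
    UNIX_TIMESTAMP(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S'),
    'yyyy-MM-dd HH:00:00.0'
) AS dt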

Thanks @mforns! And sorry for the delay. I'm reviewing edits_hourly in Turnilo and it looks good to me so far. The hourly resolution doesn't seem to be impacting performance too much when I test adding various splits and filters, so I'd recommend keeping it unless there are any major concerns.

I also noticed that the user_tenure_bucket dimension from the suggested schema was not included in this iteration. Was that due to concerns about inefficiency? Is there a way we can include this dimension (or something similar) in the next iteration? Thanks again!

Thanks @MNeisler,

Cool! Glad that you guys liked it.
Yes, I left user_tenure_bucket for the next iteration, as per Nuria's suggestion in the doc.
User_tenure_bucket was a bit more complex than the other fields, but I checked and I believe it's feasible.

If you're OK then, I will move on with CRs and deploying this version.
And will create another task to add the user_tenure_field and also the edit_tags (Visualeditor, Wikitext, etc.) that will be added to the next snapshot of MediaWiki history.
If you think of any other dimension or metric to include, you could add it to that task.
Would that be OK?

Thanks, @mforns! Looks good to me as well.

If you're OK then, I will move on with CRs and deploying this version.

Sorry, what does CR mean?

And will create another task to add the user_tenure_field and also the edit_tags (Visualeditor, Wikitext, etc.) that will be added to the next snapshot of MediaWiki history.
If you think of any other dimension or metric to include, you could add it to that task.
Would that be OK?

Sounds good to me!

FYI - I created a task earlier to track additional dimensions we'd like to add for the next iteration: https://phabricator.wikimedia.org/T219323. Feel free to edit as needed.

Thank you all!

My understanding is that wider launch (sharing with stakeholders) of this will necessarily be blocked until some data inconsistencies (particularly with identifying content edits) between Data Lake & MediaWiki replicas are resolved - see T218819, where @Neil_P._Quinn_WMF has tracked his investigation & added some findings. I wanted to see if that makes sense with those of you working on this task.

Pinging @Neil_P._Quinn_WMF because I think some of the concerns listed on the ticket (like "many small wikis missing") might have been resolved.

@MNeisler @Neil_P._Quinn_WMF: wanted to check in with your thoughts on timing for release of this dataset/when can we start introducing PMs & stakeholders to it?

I think the main concern at this point are the outstanding data issues; is there anything else we should wrap up before release?

Data issue dependencies:
T218463 - this has been addressed & looks like it will be resolved in the April snapshot
T221338 - looks like this is still a WIP

Change 501607 merged by Mforns:
[analytics/refinery@master] Add oozie job to load edit_hourly to druid

https://gerrit.wikimedia.org/r/501607

FYI, I have filed T224067 for the separate work of creating a Druid cube that allows for counting of editors rather than just edits.
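
For context on why editor counts need a separate cube: edits_hourly rows are pre-aggregated, so edits can be summed from them, but distinct editors cannot be recovered without a user-level identifier (or an approximate sketch of one) at ingestion time. A hedged Hive illustration of the difference, using mediawiki_history field names as assumptions:

-- Hedged illustration: edits are a plain count, editors need a distinct count
-- over user identity; field names and the snapshot value are assumptions.
SELECT
  SUBSTR(event_timestamp, 1, 7)   AS month,
  COUNT(*)                        AS edits,
  COUNT(DISTINCT event_user_text) AS editors
FROM wmf.mediawiki_history
WHERE event_entity = 'revision'
  AND event_type = 'create'
  AND snapshot = '2019-04'    -- hypothetical snapshot partition
GROUP BY SUBSTR(event_timestamp, 1, 7);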

@MNeisler @Neil_P._Quinn_WMF: wanted to check in with your thoughts on timing for release of this dataset/when can we start introducing PMs & stakeholders to it?

I think the main concern at this point are the outstanding data issues; is there anything else we should wrap up before release?

Data issue dependencies:
T218463 - this has been addressed & looks like it will be resolved in the April snapshot
T221338 - looks like this is still a WIP

Belated follow-up: yes, T221338 is still not addressed, which means the content edit counts here are not fully reliable.

@mforns, @MNeisler : I think we have two options:

  1. We hold off on announcing this more broadly until T221338 is resolved. Assuming that will happen shortly (within the next month or so), this seems like the best option.
  2. We remove the namespace dimensions from the cube until the issue is resolved.

I do not think we need to modify anything quite yet; the next import will carry any changes done for namespaces and other fields.

@Nuria is T221338 live? If not, we should keep this task open until we can report on active editors with better accuracy.

@kzimmerman You're right, I think T221338 is not ready yet.
Now, when the data is fixed, we won't need to do anything here; edit_hourly in Hive and edits_hourly in Druid will update automatically.
I'd say it's safe to close, no?

Mostly I was thinking about the accuracy of documentation/communication; given the extent of the impact from T221338, I think it's a blocker for this to be considered available for use.

(I think that's the only blocker; anything else would be extras that could go in another set of tickets.)

Reopening for the reasons described by @kzimmerman. T221338 is the only blocker to closing this.