
vet edit data on the data lake
Open, NormalPublic

Description

The Data Lake [1] is the place we're putting analytics-friendly data in Hadoop. The first data to land there is from the Mediawiki History Reconstruction project. We have computed metrics that power this dashboard [2] and want to vet that the new data hasn't screwed up the metrics compared to their old counterparts in vital signs. The new numbers are close to the old numbers with some notable exceptions. Our analysis is in this spreadsheet [3]. We know the reasons behind the differences and want to work with you (Research) to make sure they're forgivable enough to power Wikistats 2.0.

[1] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
[2] https://analytics.wikimedia.org/dashboards/standard-metrics/
[3] https://docs.google.com/spreadsheets/d/12nHxfp5cerKwAc1Q7W_DudSJ-ZhmynDK6VINzb857zE/edit#gid=1232097690

Event Timeline

Nuria edited projects, added Analytics-Kanban; removed Analytics. Jan 23 2017, 4:53 PM
Milimetric renamed this task from Coordinate with research to vet metrics calculated from edit data lake to Coordinate with research to vet metrics calculated from the data lake. Jan 25 2017, 5:07 PM
Milimetric updated the task description.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
leila added a subscriber: leila. Jan 26 2017, 2:45 AM
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Jan 30 2017, 4:44 PM
Nuria added a comment. Jan 30 2017, 6:35 PM

Moving to "paused" as we wait for feedback from research.

@Nuria let's go over this at our next 1:1, since it goes beyond Erik's involvement (wikistats transition) and we need to scope it. It's unclear at this stage whether we'll be able to help in the coming weeks.

I learned from @Neil_P._Quinn_WMF yesterday that the data lake doesn't know about redirects. If that is indeed the case, I'm curious: how do we discern countable pages (aka articles) from all the rest in so-called 'content' namespaces? (In dumps there is the redirect tag, which isn't so easy to set, as the #REDIRECT tag can be localized into many language versions.) Thx

The data lake knows about the redirect status of a page today (the page_is_redirect field). It doesn't know how page_is_redirect changed over time because we're not yet parsing wikitext. When we do that, we'll add that field, and until then we'll use page_is_redirect_latest as a hopefully decent substitute.
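As a rough illustration of the substitute described above, here is a minimal Python sketch that selects countable pages by keeping only rows that are in a content namespace and are not currently redirects. Only page_is_redirect_latest comes from this thread; the namespace_is_content and page_title keys, the countable_pages helper, and the sample rows are hypothetical and stand in for whatever the real schema provides.

```python
def countable_pages(pages):
    """Keep pages in content namespaces that are not redirects right now.

    Each page is a dict. 'page_is_redirect_latest' is the field named in
    this thread; 'namespace_is_content' is an assumed, illustrative key.
    """
    return [
        p for p in pages
        if p["namespace_is_content"] and not p["page_is_redirect_latest"]
    ]


# Made-up sample rows, for illustration only.
sample = [
    {"page_title": "Amsterdam", "namespace_is_content": True,
     "page_is_redirect_latest": False},
    {"page_title": "A'dam", "namespace_is_content": True,
     "page_is_redirect_latest": True},
    {"page_title": "Talk:Amsterdam", "namespace_is_content": False,
     "page_is_redirect_latest": False},
]

titles = [p["page_title"] for p in countable_pages(sample)]
print(titles)
```

Because the flag reflects only the latest state, a page that was a redirect in the past but is an article today would be counted as an article for its whole history in this sketch.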

@Milimetric Ah OK, that's already as good as the current solution. The dumps don't contain redirect on a per revision level, only the current status. Thanks

Nuria added a comment. Feb 4 2017, 8:25 PM

Pinging @Neil_P._Quinn_WMF so he is aware that redirects are indeed taken care of. Maybe this deserves an entry in some kind of FAQ?

What were the plans for looking at this? We have some new numbers from the public labs import that are just slightly off from our production numbers. So it's even more interesting to look at now :)

Nuria added a comment. Feb 24 2017, 9:28 PM

Research is short on resources this quarter, so we were planning on tackling this at the end of the quarter or early next quarter. cc @DarTar @leila

Nuria renamed this task from Coordinate with research to vet metrics calculated from the data lake to vet metrics calculated from the data lake. May 25 2017, 4:03 PM
Nuria removed a project: Research.
Nuria added a comment. Aug 9 2017, 12:21 AM

Ping @ezachte for after Wikimania is done

Hey @Nuria I spoke to Erik last Friday and he told me he was going to pick this up this week. I'll drop him an email if for whatever reason he missed these pings.

Nuria added a comment. Sep 11 2017, 7:09 PM

Excellent, let's sync up on this once you are back, Erik.

@Nuria Ah I missed these. Sorry about that. A good example of why restructuring my mailbox was dearly needed, so I finally fine-tuned Gmail filters this weekend. So I could start vetting later this week. Shall we do a hangout?

Nuria added a comment. Sep 14 2017, 4:13 PM

Will set up meeting

Please do. Tomorrow any time till 12 AM PDT works for me. Preferably a bit earlier.

Nuria renamed this task from vet metrics calculated from the data lake to vet edit data on the data lake. Sep 18 2017, 4:28 PM
Nuria added a comment. Edited Sep 18 2017, 5:46 PM

Three vettings:

  • import from mediawiki into hadoop, analytics vetted that fairly well
  • metric calculation: metrics calculated on top of the data lake versus wikistats metrics
  • vetting usage of raw data in its current form by the community, as our goal is to put this data in labs for easy community access (is the structure understandable? what would be helpful examples of usage?). The goal is to have a low bar to have questions answered: community members shouldn't have to know a lot about the internals of how mediawiki works in order to answer questions about edits on a given project.

Some links:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits

Nuria reassigned this task from Milimetric to ezachte. Sep 28 2017, 5:36 PM

Assigning to Erik and moving to radar

Nuria edited projects, added Analytics; removed Analytics-Kanban. Sep 28 2017, 5:36 PM
Nuria moved this task from Incoming to Radar on the Analytics board.

I collected feedback in https://phabricator.wikimedia.org/T178591 (I don't know how to link it here as a subtask; I've never done that before)

leila added a comment. Oct 23 2017, 7:16 PM

@Erik_Zachte I just added it as a subtask. For the future, you can click on Edit Related Tasks on the top right-hand side of your page, then click on Edit Subtasks and add the phabricator ticket number to be added as a subtask. If the subtask didn't exist, you would click on Edit Related Tasks and then Create Subtask. Please ping me off-thread if you'd like to indulge in more phabricator tips and tricks. ;)