vet edit data on the data lake
Closed, Declined (Public)

Description

The Data Lake [1] is the place we're putting analytics-friendly data in Hadoop. The first data to land there is from the MediaWiki History Reconstruction project. We have computed metrics that power this dashboard [2] and want to vet that the new data hasn't screwed up the metrics compared to their old counterparts in Vital Signs. The new numbers are close to the old numbers, with some notable exceptions. Our analysis is in this spreadsheet [3]. We know the reasons behind the differences and want to work with you (Research) to make sure they're forgivable enough to power Wikistats 2.0.

[1] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
[2] https://analytics.wikimedia.org/dashboards/standard-metrics/
[3] https://docs.google.com/spreadsheets/d/12nHxfp5cerKwAc1Q7W_DudSJ-ZhmynDK6VINzb857zE/edit#gid=1232097690
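
A minimal sketch of the kind of old-versus-new comparison described above, for illustration only. The function name `notable_differences`, the month keys, the sample values, and the 5% tolerance are all assumptions; the actual analysis lives in the spreadsheet [3].

```python
from typing import Dict, List, Tuple


def notable_differences(
    old: Dict[str, float],
    new: Dict[str, float],
    tolerance: float = 0.05,  # assumed threshold, not taken from the task
) -> List[Tuple[str, float, float, float]]:
    """Return (key, old, new, relative_difference) for every key present in
    both series whose relative difference exceeds the tolerance."""
    flagged = []
    for key in sorted(old.keys() & new.keys()):
        old_value, new_value = old[key], new[key]
        if old_value == 0:
            # No meaningful relative difference against zero: flag any change.
            rel = 0.0 if new_value == 0 else float("inf")
        else:
            rel = abs(new_value - old_value) / abs(old_value)
        if rel > tolerance:
            flagged.append((key, old_value, new_value, rel))
    return flagged


# Hypothetical monthly values for one metric (old source vs. data lake).
old_series = {"2016-01": 100.0, "2016-02": 200.0}
new_series = {"2016-01": 101.0, "2016-02": 260.0}
exceptions = notable_differences(old_series, new_series)
```

With these made-up values only the second month is flagged, since the first differs by 1%.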

Event Timeline

Milimetric renamed this task from Coordinate with research to vet metrics calculated from edit data lake to Coordinate with research to vet metrics calculated from the data lake. Jan 25 2017, 5:07 PM
Milimetric updated the task description.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Moving to "paused" as we wait for feedback from research.

@Nuria let's go over this at our next 1:1, since it goes beyond Erik's involvement (Wikistats transition) and we need to scope it. It's unclear at this stage whether we'll be able to help in the coming weeks.

I learned from @Neil_P._Quinn_WMF yesterday that the data lake doesn't know about redirects. If that is indeed the case, I'm curious: how do we discern countable pages (aka articles) from all the rest in so-called 'content' namespaces? (In dumps there is the redirect tag, which isn't so easy to set, as the #REDIRECT tag can be localized into many language versions.) Thx

The data lake knows about the redirect status of a page today: the page_is_redirect field. It doesn't know how page_is_redirect changed over time, because we're not yet parsing wikitext. When we do that, we'll add that field; until then we'll use page_is_redirect_latest as a hopefully decent substitute.
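
To illustrate the substitute described above, here is a minimal sketch of filtering countable pages by the latest redirect status. Only `page_is_redirect_latest` comes from this thread; the record layout, the `page_id` and `page_namespace` keys, the function name, and the choice of namespace 0 as the only content namespace are assumptions for the example.

```python
from typing import Any, Dict, FrozenSet, Iterable, List


def countable_pages(
    pages: Iterable[Dict[str, Any]],
    content_namespaces: FrozenSet[int] = frozenset({0}),  # assumed default
) -> List[Dict[str, Any]]:
    """Keep pages in a content namespace whose latest known status
    is 'not a redirect'."""
    return [
        page
        for page in pages
        if page["page_namespace"] in content_namespaces
        and not page["page_is_redirect_latest"]
    ]


# Hypothetical records; real rows would carry many more fields.
sample = [
    {"page_id": 1, "page_namespace": 0, "page_is_redirect_latest": False},
    {"page_id": 2, "page_namespace": 0, "page_is_redirect_latest": True},
    {"page_id": 3, "page_namespace": 1, "page_is_redirect_latest": False},
]
articles = countable_pages(sample)
```

Because the flag reflects only the current state, a page that used to be a redirect is counted as an article for its whole history, which is the limitation noted above.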

@Milimetric Ah OK, that's already as good as the current solution. The dumps don't contain redirect on a per revision level, only the current status. Thanks

Pinging @Neil_P._Quinn_WMF so he is aware that redirects are indeed taken care of. Maybe this deserves an entry in some kind of FAQ?

What were the plans for looking at this? We have some new numbers from the public labs import that are just slightly off from our production numbers. So it's even more interesting to look at now :)

Research is short on resources this quarter, so we were planning on tackling this at the end of this quarter or early next quarter. cc @DarTar @leila

Nuria renamed this task from Coordinate with research to vet metrics calculated from the data lake to vet metrics calculated from the data lake. May 25 2017, 4:03 PM
Nuria removed a project: Research.

Hey @Nuria I spoke to Erik last Friday and he told me he was going to pick this up this week. I'll drop him an email if for whatever reason he missed these pings.

Excellent, let's sync up on this once you are back, Erik.

@Nuria Ah I missed these. Sorry about that. A good example of why restructuring my mailbox was dearly needed, so I finally fine-tuned Gmail filters this weekend. So I could start vetting later this week. Shall we do a hangout?

Please do. Tomorrow any time till 12 AM PDT works for me. Preferably a bit earlier.

Nuria renamed this task from vet metrics calculated from the data lake to vet edit data on the data lake. Sep 18 2017, 4:28 PM

Three vettings:

  • import from MediaWiki into Hadoop; Analytics vetted that fairly well
  • metric calculation: metrics calculated on top of the data lake versus Wikistats metrics
  • vetting usage of the raw data in its current form by the community, as our goal is to put this data in labs for easy community access (is the structure understandable? what would be helpful examples of usage?). The goal is to have a low bar for getting questions answered: community members shouldn't have to know a lot about the internals of how MediaWiki works in order to answer questions about edits on a given project.

Some links:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits

Assigning to Erik and moving to radar.

Nuria moved this task from Incoming to Radar on the Analytics board.

I collected feedback in https://phabricator.wikimedia.org/T178591 (I don't know how to link it here as a subtask; I've never done that before).

@Erik_Zachte I just added it as a subtask. For the future, you can click on Edit Related Tasks on the top right-hand side of your page, then click on Edit Subtasks and add the Phabricator ticket number to be added as a subtask. If the subtask didn't exist, you would click on Edit Related Tasks and then Create Subtask. Please ping me off-thread if you'd like to indulge in more Phabricator tips and tricks. ;)

Removing assignee @ezachte as that Phabricator account has been deactivated. (If there are questions, it seems that @erik_zachte could be contacted.)

In the time since we made this task, the Product Analytics and Analytics Engineering teams have been working closely on this dataset. We made some quality improvements and continue to vet together. This task is therefore outdated.