Data analysis with (python) MediaWiki-Utilities -- A unix philosophy-inspired collection of packages
Closed, Resolved (Public)

Description

MediaWiki utilities is a collection of simple, sharp tools for extracting and processing MediaWiki data. These libraries are inspired by the Unix philosophy. Each library is designed to *do one thing and do it well*. The libraries are designed to *work together*. Where applicable, they also include Unix-style command line utilities that *handle text streams, because that is a universal interface*.

In this session, I'll introduce participants to what utilities are already available. Specifically, I'll demo the easy-to-use, high-performance XML parser for processing Wikimedia's massive XML dumps. Then we'll talk about new work on current utilities and the development of new utilities.
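
As a taste, here is a minimal sketch of iterating over a dump with the mwxml package (the file name is just a placeholder; open whatever dump you have locally):

```
import mwxml

# Stream pages and revisions from an XML dump without loading it into memory.
dump = mwxml.Dump.from_file(open("dump.xml"))

for page in dump:
    for revision in page:
        # Each revision carries metadata plus the wikitext; print a tiny summary.
        print(page.title, revision.id, len(revision.text or ""))
```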

New utilities proposals:

  • mwrefs -- Handle <ref> extraction, bibliography extraction and metadata fetching for academic identifiers (a naive sketch follows this list).
  • mwmetrics -- Standardized library for deploying quality and behavioral metric strategies
  • mwviews -- Parsing old view logs, accessing new pageview APIs, etc.
  • mwdiscussions -- Parsing utilities for analyzing discussion pages
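
To make the mwrefs idea concrete, here is a hypothetical and deliberately naive sketch of <ref> extraction; none of this is an existing mwrefs API:

```
import re

# Naive illustration only: real wikitext needs more care (self-closing
# <ref name="..."/> tags, nested templates, comments, etc.).
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)

def extract_refs(wikitext):
    """Return the raw contents of <ref>...</ref> tags in a revision's text."""
    return REF_RE.findall(wikitext)

print(extract_refs("A claim.<ref>{{cite doi|10.1000/182}}</ref> More text."))
```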

Event Timeline

Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Backlog on the Wikimedia-Developer-Summit-2016 board.
Halfak subscribed.
Halfak set Security to None.
Halfak updated the task description.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one:

> By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

Hi, I'm definitely interested in hacking on some of these :)

Yep, might be good. Hope we can integrate some of the ideas from my similar hack.

Hi @Halfak, this proposal focuses on a Summit session, but there is no indication of topics that could be discussed here beforehand, so it is currently missing an active discussion. Note that pre-scheduled Summit sessions are expected to be preceded by online discussion and a plan for reaching conclusions and next steps. It would be good to sort out these problems before the next deadline on November 6.

Hi @Qgil, can you give me an example of the type of discussion you'd like to see? Is discussion here taken as a predictor of participation during the workshop? Maybe you could reply in a PM, since I'll be trying to kick off *some* discussion here so that we can have some space to hack together in Jan.


@Ladsgroup, I've been eyeing up pywikibase as another of the set of Unix-style utilities. There are a couple of things that I ought to file as feature requests and discuss somewhere else, but maybe also discuss here. There's still some pywikibot terminology that remains in the split. E.g. ItemPage -- why not refer to it as an "Item"? Further, it looks like the "get()" method just takes the content as a parameter, but the documentation still says "Fetch all page data, and cache it." It doesn't seem like anything is being fetched. What do you think about spending some time during the hackathon to make a cleaner API/docs?
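
To make the ask concrete, here is a hypothetical sketch of the kind of interface I'm imagining -- the names Item and from_json are illustrative assumptions, not anything pywikibase currently provides:

```
import json

class Item:
    """Hypothetical wrapper for Wikibase item JSON -- illustrative only."""

    def __init__(self, labels=None, claims=None, sitelinks=None):
        self.labels = labels or {}
        self.claims = claims or {}
        self.sitelinks = sitelinks or {}

    @classmethod
    def from_json(cls, doc):
        # Build directly from content the caller already has in hand;
        # nothing is fetched, so the docs would not need to say "fetch".
        return cls(labels=doc.get("labels", {}),
                   claims=doc.get("claims", {}),
                   sitelinks=doc.get("sitelinks", {}))

item = Item.from_json(json.loads(
    '{"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}'))
print(item.labels["en"]["value"])  # Douglas Adams
```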

@Yurik, I've been thinking about your work on graphs and maps in MediaWiki. Any interest in extracting some historical information about their use? Would you see value in having a mwmaps or mwgraph utility? What would you do with them?

@DarTar (see description). Would you be interested in working on mwrefs at the MW Dev. Summit? I figure that we could merge mwcites functionality into it and incorporate the work you have been doing with metadata extraction. Personally, I want some command-line utilities that we can run against the XML dump and the Identifier dump to gather metadata. It would be nice if these utilities could also maintain a local cache so that updating is fast. I have some ideas for making an efficient file-based cache that we might check into version control (or GitHub's Large File Storage).
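
To sketch the caching idea (everything here -- layout, function names -- is an illustrative assumption, not existing mwcites/mwrefs code):

```
import json
import os

CACHE_DIR = "metadata_cache"  # one small JSON file per identifier

def cache_path(id_type, id_value):
    # e.g. metadata_cache/doi/10.1000%2F182.json (identifier made filesystem-safe)
    safe = id_value.replace("/", "%2F")
    return os.path.join(CACHE_DIR, id_type, safe + ".json")

def get_metadata(id_type, id_value, fetch):
    """Return cached metadata, calling fetch(id_type, id_value) only on a miss."""
    path = cache_path(id_type, id_value)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    metadata = fetch(id_type, id_value)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)
    return metadata
```

The one-file-per-identifier layout keeps diffs small, which is what would make checking the cache into version control (or LFS) plausible.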

Maps has a dashboard in Discovery: http://searchdata.wmflabs.org/maps/
I recently tried to add Graphoid stuff to it as well; discussion is pending at https://gerrit.wikimedia.org/r/#/c/247779/

That would be great. I'd also like to write more tests for pywikibase and use pywikibase in pywikibot too. (put up a patch in pywikibot/core)

@Yurik, does the maps dashboard use DB queries? Maybe we can use mwdb to standardize some of those common queries so that they can be run on labs and private replicas all the same (/me grabs at straws)
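
Something like the following is what I have in mind, assuming mwdb's Schema/execute interface stays roughly as it is today (the connection string and query are illustrative):

```
import mwdb

# The same code should run against labsdb replicas or private replicas;
# only the connection string changes.
enwiki = mwdb.Schema("mysql+pymysql://enwiki.labsdb/enwiki_p"
                     "?read_default_file=~/replica.my.cnf")

# A shared module of "standard" queries could replace per-dashboard copies of SQL.
result = enwiki.execute("SELECT page_namespace, COUNT(*) AS pages "
                        "FROM page GROUP BY page_namespace")
for row in result:
    print(row.page_namespace, row.pages)
```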

@Ladsgroup, it might also be nice to have a conversation about the pattern of basic libraries becoming dependencies of pywikibot. There are probably some substantial concerns that I'm not aware of.

@Halfak, @Ironholds is managing dashboarding, not sure what his plans are

We're doing a talk on dashboards; we're not planning on doing anything at the developer summit.

@Ironholds, maybe we could dig in and finish up mwdiscussions -- a library for parsing wikitext discussion syntax. What do you think?

Like I said, not at the summit :(. Frances might be, though, and is definitely interested in the area (and knows Python!)

@Fhocutt, see Ironholds' comment above. We started work (see https://github.com/Ironholds/talk-parser) and then got busy. A hackathon seems like a good opportunity to get un-busy. :D

> @Ladsgroup, it might also be nice to have a conversation about the pattern of basic libraries becoming dependencies of pywikibot. There are probably some substantial concerns that I'm not aware of.

Yeah, I like that :)

@Halfak: for a side project, we ended up improving WikiTalkParser. Some of it is better used as inspiration than re-used in a library, but it works now.

@Fhocutt, my goal is to take the insights from that code on how to process discussions and make something that's powerful and versatile that we can more easily iterate on. Motivation-wise, I think that it would be good if we had better ways for people to explore conversation patterns on Wikis. Too long have individual research labs re-invented discussion parsers. Along with some solid, generalized code, we could start producing periodic data dumps. I'm guessing that your work with @Ironholds would have been immensely easier if a dataset of discussions was already available.

So one thing that we can do now/at the summit is discuss the API structure for such a parser.

Right now, the talk-parser code is aimed towards processing an entire page and returning its Topics. Each topic would contain a threaded sequence of posts so that a reply structure could be understood. We'll need to handle sudden {{outdent}}s and other weirdness. I imagine we can get it right 95% of the time without much work by borrowing some of the hacks from Laniado's code. It would be nice if the same parser (maybe a different *codec*) could be used for processing deletion discussions and RFCs.
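
To seed that discussion, here is a hypothetical sketch of the objects such a parser might return (the names are illustrative, not the talk-parser API):

```
class Post:
    """A single signed contribution, with replies nested beneath it."""

    def __init__(self, author, timestamp, text, replies=None):
        self.author = author
        self.timestamp = timestamp
        self.text = text
        self.replies = replies or []  # threaded reply structure


class Topic:
    """One ==Section== of a discussion page and its threaded posts."""

    def __init__(self, title, posts=None):
        self.title = title
        self.posts = posts or []


# A page would parse into a list of Topics; handling ':' indentation,
# {{outdent}}, and other weirdness is the parser's (hard) job.
reply = Post("UserB", "2015-10-30T21:00:00Z", "Sounds good.")
topic = Topic("Parser API",
              [Post("UserA", "2015-10-30T20:00:00Z",
                    "What should parse() return?", replies=[reply])])
print(topic.title, len(topic.posts), len(topic.posts[0].replies))
```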

@Halfak: Sounds good. I look forward to working on it. And yes, that would have been very helpful, as informative and interesting as the bug-hunting was...

I just generated an updated dataset with @Tarrow using mwcites and will be using his code to start pulling pmid metadata this weekend. :)

@Milimetric just added https://github.com/mediawiki-utilities/python-mwviews

Currently the library only supports the new Pageview API released by Analytics.
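
For reference, a minimal usage sketch against the Pageview API, assuming the client exposes something like article_views() (constructor arguments and parameter names may differ between mwviews versions):

```
from mwviews.api import PageviewsClient

# Identify your tool per the API etiquette; the exact constructor arguments
# may vary between mwviews versions.
client = PageviewsClient(user_agent="mwviews demo <example@example.org>")

# Daily view counts for a couple of articles on English Wikipedia.
views = client.article_views("en.wikipedia",
                             ["Python_(programming_language)", "Unix"],
                             granularity="daily",
                             start="20160101", end="20160107")
for day, counts in sorted(views.items()):
    print(day, counts)
```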

Here are a few bits of functionality that I think this library should support:

  1. Processing hourly view log files (see the sketch after this list)
    • these files are a pain to process, but we could use some simple parallelization to make the work easier
    • there are also some old formatting issues in these files; we could encode our institutional knowledge about that into the processor
  2. Processing redirect (and page move) logs
  3. Accessing alternative API endpoints for view logs like the old counter field in page_props.
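
For item 1 above, a rough sketch of what parsing one hourly file could look like, assuming the classic whitespace-delimited "project title count bytes" format of the gzipped pagecounts files (the helper names are mine):

```
import gzip
from collections import namedtuple

ViewCount = namedtuple("ViewCount", ["project", "title", "views", "bytes_sent"])

def parse_pagecounts(path):
    """Yield ViewCount rows from one hourly pagecounts-*.gz file."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # older files contain occasional malformed lines
            project, title, views, bytes_sent = parts
            try:
                yield ViewCount(project, title, int(views), int(bytes_sent))
            except ValueError:
                continue

# Hourly files are independent of one another, so parallelization is easy,
# e.g. multiprocessing.Pool().imap_unordered(summarize_one_file, paths).
```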

Any thoughts, @Milimetric?

+1 to all the things on your wishlist, @Halfak, and also +1 to this library's approach to the API: https://github.com/Commonists/pageview-api. In short, it does more interesting aggregations.

Hey folks, We'll be scheduling this session in the unconference rooms tomorrow (Tuesday, 5th) at 2PM. Room still TBD. Check https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016 at 8:30 AM PST.

gpaumier updated the task description.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure:

  1. that the session Etherpad notes are linked from this task,
  2. that followup tasks for any actions identified have been created and linked from this task,
  3. to change the status of this task to "resolved".

If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!