Data analysis with (python) MediaWiki-Utilities -- A unix philosophy-inspired collection of packages
Closed, Resolved (Public)

Description

MediaWiki utilities is a collection of simple, sharp tools for extracting and processing MediaWiki data. These libraries are inspired by the Unix philosophy. Each library is designed to *do one thing and do it well*. The libraries are designed to *work together*. Where applicable, they also include Unix-style command line utilities that *handle text streams, because that is a universal interface*.

In this session, I'll introduce participants to what utilities are already available. Specifically, I'll demo the easy-to-use, high-performance XML parser for processing Wikimedia's massive XML dumps. Then we'll talk about new work on current utilities and the development of new utilities.
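
As a taste, here is a minimal sketch of iterating over a dump with the mwxml package (the file name is just a placeholder; open whatever dump you have locally):

```
import mwxml

# Stream pages and revisions from an XML dump without loading it into memory.
dump = mwxml.Dump.from_file(open("dump.xml"))

for page in dump:
    for revision in page:
        # Each revision carries metadata plus the wikitext; print a tiny summary.
        print(page.title, revision.id, len(revision.text or ""))
```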

New utilities proposals:

  • mwrefs -- Handle <ref> extraction, bibliography extraction and metadata fetching for academic identifiers (a naive sketch follows this list).
  • mwmetrics -- Standardized library for deploying quality and behavioral metric strategies
  • mwviews -- Parsing old view logs, accessing new pageview APIs, etc.
  • mwdiscussions -- Parsing utilities for analyzing discussion pages
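
To make the mwrefs idea concrete, here is a hypothetical and deliberately naive sketch of <ref> extraction; none of this is an existing mwrefs API:

```
import re

# Naive illustration only: real wikitext needs more care (self-closing
# <ref name="..."/> tags, nested templates, comments, etc.).
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)

def extract_refs(wikitext):
    """Return the raw contents of <ref>...</ref> tags in a revision's text."""
    return REF_RE.findall(wikitext)

print(extract_refs("A claim.<ref>{{cite doi|10.1000/182}}</ref> More text."))
```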

Event Timeline

Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Backlog on the Wikimedia-Developer-Summit-2016 board.
Halfak subscribed.
Halfak set Security to None.
Halfak updated the task description.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one:

> By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

Hi, I'm definitely interested in hacking on some of these :)

Yep, might be good. Hope we can integrate some of the ideas from my similar hack.

Hi @Halfak, this proposal focuses on a Summit session, but there is no indication of topics that could be discussed here beforehand, so it is currently missing an active discussion. Note that pre-scheduled Summit sessions are expected to be preceded by online discussion and a plan for reaching conclusions and next steps. It would be good to sort out these problems before the next deadline on November 6.

Hi @Qgil, can you give me an example of the type of discussion you'd like to see? Is discussion here taken as a predictor of participation during the workshop? Maybe you could reply in a PM, since I'll be trying to kick off *some* discussion here so that we can have some space to hack together in Jan.


@Ladsgroup, I've been eyeing up pywikibase as another of the set of Unix-style utilities. There are a couple of things that I ought to file as feature requests and discuss somewhere else, but maybe also discuss here. There's still some pywikibot terminology that remains in the split. E.g. ItemPage -- why not refer to it as an "Item"? Further, it looks like the "get()" method just takes the content as a parameter, but the documentation still says "Fetch all page data, and cache it." It doesn't seem like anything is being fetched. What do you think about spending some time during the hackathon to make a cleaner API/docs?
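
To make the ask concrete, here is a hypothetical sketch of the kind of interface I'm imagining -- the names Item and from_json are illustrative assumptions, not anything pywikibase currently provides:

```
import json

class Item:
    """Hypothetical wrapper for Wikibase item JSON -- illustrative only."""

    def __init__(self, labels=None, claims=None, sitelinks=None):
        self.labels = labels or {}
        self.claims = claims or {}
        self.sitelinks = sitelinks or {}

    @classmethod
    def from_json(cls, doc):
        # Build directly from content the caller already has in hand;
        # nothing is fetched, so the docs would not need to say "fetch".
        return cls(labels=doc.get("labels", {}),
                   claims=doc.get("claims", {}),
                   sitelinks=doc.get("sitelinks", {}))

item = Item.from_json(json.loads(
    '{"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}'))
print(item.labels["en"]["value"])  # Douglas Adams
```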

@Yurik, I've been thinking about your work on graphs and maps in MediaWiki. Any interest in extracting some historical information about their use? Would you see value in having a mwmaps or mwgraph utility? What would you do with them?

@DarTar (see description). Would you be interested in working on mwrefs at the MW Dev. Summit? I figure that we could merge mwcites functionality into it and incorporate the work you have been doing with metadata extraction. Personally, I want some command-line utilities that we can run against the XML dump and the Identifier dump to gather metadata. It would be nice if these utilities could also maintain a local cache so that updating is fast. I have some ideas for making an efficient file-based cache that we might check into version control (or GitHub's Large File Storage).
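
To sketch the caching idea (everything here -- layout, function names -- is an illustrative assumption, not existing mwcites/mwrefs code):

```
import json
import os

CACHE_DIR = "metadata_cache"  # one small JSON file per identifier

def cache_path(id_type, id_value):
    # e.g. metadata_cache/doi/10.1000%2F182.json (identifier made filesystem-safe)
    safe = id_value.replace("/", "%2F")
    return os.path.join(CACHE_DIR, id_type, safe + ".json")

def get_metadata(id_type, id_value, fetch):
    """Return cached metadata, calling fetch(id_type, id_value) only on a miss."""
    path = cache_path(id_type, id_value)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    metadata = fetch(id_type, id_value)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)
    return metadata
```

The one-file-per-identifier layout keeps diffs small, which is what would make checking the cache into version control (or LFS) plausible.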

Maps has a dashboard in Discovery: http://searchdata.wmflabs.org/maps/
I recently tried to add Graphoid stuff to it as well; discussion is pending at https://gerrit.wikimedia.org/r/#/c/247779/

That would be great. I'd also like to write more tests for pywikibase and use pywikibase in pywikibot too. (put up a patch in pywikibot/core)

@Yurik, does the maps dashboard use DB queries? Maybe we can use mwdb to standardize some of those common queries so that they can be run on labs and private replicas all the same (/me grabs at straws)
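
Something like the following is what I have in mind, assuming mwdb's Schema/execute interface stays roughly as it is today (the connection string and query are illustrative):

```
import mwdb

# The same code should run against labsdb replicas or private replicas;
# only the connection string changes.
enwiki = mwdb.Schema("mysql+pymysql://enwiki.labsdb/enwiki_p"
                     "?read_default_file=~/replica.my.cnf")

# A shared module of "standard" queries could replace per-dashboard copies of SQL.
result = enwiki.execute("SELECT page_namespace, COUNT(*) AS pages "
                        "FROM page GROUP BY page_namespace")
for row in result:
    print(row.page_namespace, row.pages)
```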

@Ladsgroup, it might also be nice to have a conversation about the pattern of basic libraries becoming dependencies of pywikibot. There are probably some substantial concerns that I'm not aware of.

@Halfak, @Ironholds is managing dashboarding, not sure what his plans are

We're doing a talk on dashboards; we're not planning on doing anything at the developer summit.

@Ironholds, maybe we could dig in and finish up mwdiscussions -- a library for parsing wikitext discussion syntax. What do you think?

Like I said, not at the summit :(. Frances might be, though, and is definitely interested in the area (and knows Python!)

@Fhocutt, see Ironholds' comment above. We started work (see https://github.com/Ironholds/talk-parser) and then got busy. A hackathon seems like a good opportunity to get un-busy. :D

> @Ladsgroup, it might also be nice to have a conversation about the pattern of basic libraries becoming dependencies of pywikibot. There are probably some substantial concerns that I'm not aware of.

Yeah, I like that :)

@Halfak: for a side project, we ended up improving WikiTalkParser. Some of it is better used as inspiration than re-used in a library, but it works now.

@Fhocutt, my goal is to take the insights from that code on how to process discussions and make something that's powerful and versatile that we can more easily iterate on. Motivation-wise, I think that it would be good if we had better ways for people to explore conversation patterns on Wikis. Too long have individual research labs re-invented discussion parsers. Along with some solid, generalized code, we could start producing periodic data dumps. I'm guessing that your work with @Ironholds would have been immensely easier if a dataset of discussions was already available.

So one thing that we can do now/at the summit is discuss the API structure for such a parser.

Right now, the talk-parser code is aimed towards processing an entire page and returning its Topics. Each topic would contain a threaded sequence of posts so that a reply structure could be understood. We'll need to handle sudden {{outdent}}s and other weirdness. I imagine we can get it right 95% of the time without much work by borrowing some of the hacks from Laniado's code. It would be nice if the same parser (maybe a different *codec*) could be used for processing deletion discussions and RFCs.
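
To seed that discussion, here is a hypothetical sketch of the objects such a parser might return (the names are illustrative, not the talk-parser API):

```
class Post:
    """A single signed contribution, with replies nested beneath it."""

    def __init__(self, author, timestamp, text, replies=None):
        self.author = author
        self.timestamp = timestamp
        self.text = text
        self.replies = replies or []  # threaded reply structure


class Topic:
    """One ==Section== of a discussion page and its threaded posts."""

    def __init__(self, title, posts=None):
        self.title = title
        self.posts = posts or []


# A page would parse into a list of Topics; handling ':' indentation,
# {{outdent}}, and other weirdness is the parser's (hard) job.
reply = Post("UserB", "2015-10-30T21:00:00Z", "Sounds good.")
topic = Topic("Parser API",
              [Post("UserA", "2015-10-30T20:00:00Z",
                    "What should parse() return?", replies=[reply])])
print(topic.title, len(topic.posts), len(topic.posts[0].replies))
```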

@Halfak: Sounds good. I look forward to working on it. And yes, that would have been very helpful, as informative and interesting as the bug-hunting was...

I just generated an updated dataset with @Tarrow using mwcites and will be using his code to start pulling pmid metadata this weekend. :)

@Milimetric just added https://github.com/mediawiki-utilities/python-mwviews

Currently the library only supports the new Pageview API released by Analytics.
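
For reference, a minimal usage sketch against the Pageview API, assuming the client exposes something like article_views() (constructor arguments and parameter names may differ between mwviews versions):

```
from mwviews.api import PageviewsClient

# Identify your tool per the API etiquette; the exact constructor arguments
# may vary between mwviews versions.
client = PageviewsClient(user_agent="mwviews demo <example@example.org>")

# Daily view counts for a couple of articles on English Wikipedia.
views = client.article_views("en.wikipedia",
                             ["Python_(programming_language)", "Unix"],
                             granularity="daily",
                             start="20160101", end="20160107")
for day, counts in sorted(views.items()):
    print(day, counts)
```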

Here are a few bits of functionality that I think this library should support:

  1. Processing hourly view log files (see the sketch after this list)
    • these files are a pain to process, but we could use some simple parallelization to make the work easier
    • there are also some old formatting issues in these files; we could encode our institutional knowledge about that into the processor
  2. Processing redirect (and page move) logs
  3. Accessing alternative API endpoints for view logs like the old counter field in page_props.
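
For item 1 above, a rough sketch of what parsing one hourly file could look like, assuming the classic whitespace-delimited "project title count bytes" format of the gzipped pagecounts files (the helper names are mine):

```
import gzip
from collections import namedtuple

ViewCount = namedtuple("ViewCount", ["project", "title", "views", "bytes_sent"])

def parse_pagecounts(path):
    """Yield ViewCount rows from one hourly pagecounts-*.gz file."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # older files contain occasional malformed lines
            project, title, views, bytes_sent = parts
            try:
                yield ViewCount(project, title, int(views), int(bytes_sent))
            except ValueError:
                continue

# Hourly files are independent of one another, so parallelization is easy,
# e.g. multiprocessing.Pool().imap_unordered(summarize_one_file, paths).
```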

Any thoughts, @Milimetric?

+1 to all the things on your wishlist, @Halfak, and also +1 to this library's approach to the API: https://github.com/Commonists/pageview-api. In short, it does more interesting aggregations.

Hey folks, We'll be scheduling this session in the unconference rooms tomorrow (Tuesday, 5th) at 2PM. Room still TBD. Check https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016 at 8:30 AM PST.

gpaumier updated the task description.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure:

  1. that the session Etherpad notes are linked from this task,
  2. that followup tasks for any actions identified have been created and linked from this task,
  3. to change the status of this task to "resolved".

If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!