wikiwho on pywikibot
Open, Low, Public

Description

wikiwho currently depends on https://bitbucket.org/halfak/wikimedia-utilities, which is great at XML dump processing but has only limited API support.

It would be useful to integrate wikiwho with pywikibot so that it can work on live revisions from the wiki.

Event Timeline

jayvdb created this task. Feb 17 2015, 9:54 PM
jayvdb raised the priority of this task from to Normal.
jayvdb updated the task description.
jayvdb added a project: Pywikibot.
jayvdb added subscribers: Evanontario, jayvdb, Unknown Object (MLST) and 2 others.
Restricted Application added a subscriber: Aklapper. Feb 17 2015, 9:54 PM

Eek! And that's an old unmaintained library! See https://github.com/halfak/MediaWiki-Utilities for the current version.

I also implement authorship tracking (http://pythonhosted.org/mediawiki-utilities/lib/persistence.html#mw-lib-persistence) and WikiWho's diffing strategy (http://pythonhosted.org/deltas/detection.html#module-deltas.detection.segment_matcher).

I can promise to respond to bugs and feature requests quickly. :)

@Halfak, wikiwho imports from 'wmf', which appears not to exist in the current version. I can see a few other dependencies on the old version as well. The current version (packaged at https://pypi.python.org/pypi/mediawiki-utilities) does look better, but it appears not to be backwards compatible. It sounds like the first step should be to upgrade the wikiwho code to work with the current mediawiki-utilities...? If so, we can create a new task for that. ;-)

@jayvdb, agreed. mediawiki-utilities is a compatibility-breaking change with substantial performance improvements.

Note also that mediawiki-utilities is only compatible with Python 3.x.

@jayvdb, if you give me some microtasks on this or the other thread (Accuracy Review), I'll be happy to do or attempt them. You told me to give you a poke if none had been created by now. I did notice that the primary mentor for Accuracy Review seems concerned about whether the project is a 2-3 week contribution.

Jsalsman added a comment. Edited Feb 22 2015, 6:56 PM

@Halfak, would you please show a code outline for how to use http://pythonhosted.org/mediawiki-utilities/lib/persistence.html#mw-lib-persistence to obtain the age of a given unique word? For the purposes of example, let's say an article is titled "Economy of Jakarta", the string in question is "GRDP (Gross Regional Domestic Product) was Rp. 566 trillion", and the specific word whose age we want to know is "566". Does it start with dumps and annotate them, or does it do a WikiBlame-style binary search on revisions, for starters?
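
For reference, the WikiBlame-style binary search mentioned in the question could look roughly like the sketch below. This is only an illustration: the function name `first_revision_containing` is made up here, the revisions are assumed to be already available as a list of texts ordered oldest to newest, and the search is only valid under the (strong) assumption that the string stays in the article once it has been added, with no reverts in between.

```python
from typing import Sequence


def first_revision_containing(revisions: Sequence[str], needle: str) -> int:
    """Return the index of the earliest revision containing `needle`, or -1.

    Assumes `revisions` is ordered oldest to newest and that `needle`,
    once added, is never removed again (hypothetical simplification).
    """
    if not revisions or needle not in revisions[-1]:
        return -1
    lo, hi = 0, len(revisions) - 1
    # Invariant: revisions[hi] contains the needle.
    while lo < hi:
        mid = (lo + hi) // 2
        if needle in revisions[mid]:
            hi = mid          # first occurrence is at mid or earlier
        else:
            lo = mid + 1      # first occurrence is after mid
    return lo


if __name__ == "__main__":
    history = ["GRDP was large", "GRDP was Rp. 500 trillion",
               "GRDP was Rp. 566 trillion", "GRDP was Rp. 566 trillion."]
    print(first_revision_containing(history, "566"))
```

On a live wiki the texts would be fetched lazily by index rather than held in a list, which is where the binary search saves requests.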

@Jsalsman

The libraries I linked to are at a higher level of abstraction than I think you are imagining. The implementation details (like using binary search and indexes) are up to you. But since you asked for some code, here's a simple strategy that would generate the answer on demand. Note that I reworked the example to use a real snippet from an article so I could run it to make sure it worked too.

from mw import api
from mw.lib import persistence

# Initialize api session and page state
session = api.Session("https://en.wikipedia.org/w/api.php")
page_state = persistence.State()

# Query for the page's revisions
rev_docs = session.revisions.query(titles={"Antoine Beauvilliers"},
                                   properties={"content", "user", "timestamp", "sha1"},
                                   direction="newer")

# Use the page_state to process the revisions (and store the revision's timestamps)
last_tokens = []  # stays empty if the query returns no revisions
for rev_doc in rev_docs:
    tokens, _, _ = page_state.process(rev_doc.get("*", ""),
                                      rev_doc['timestamp'],
                                      checksum=rev_doc['sha1'])
    last_tokens = tokens

# This gnarly bit of code is just used to find the specific tokens we are looking for
expected = "Of humble parentage, Beauvilliers worked his way up from kitchen boy"
len_expected = len(persistence.tokenization.wikitext_split(expected))
match_ranges = [(i, i+len_expected) for i in range(len(last_tokens))
                                    if "".join(t.text for t in last_tokens[i:i+len_expected]) == expected]

# Print out the tokens and the first revision they appeared in
for start, end in match_ranges:
    for token in last_tokens[start:end]:
        if not token.text.strip():
            continue  # skip whitespace-only tokens
        print("'{0}' was added {1}".format(token.text, token.revisions[0]))

The output looks like this:

'Of' was added 2013-05-24T20:07:27Z
'humble' was added 2013-06-01T05:39:29Z
'parentage' was added 2013-05-24T20:07:27Z
',' was added 2013-05-24T20:07:27Z
'Beauvilliers' was added 2013-05-24T20:07:27Z
'worked' was added 2014-08-29T08:26:55Z
'his' was added 2014-08-29T08:26:55Z
'way' was added 2014-08-29T08:26:55Z
'up' was added 2014-08-29T08:26:55Z
'from' was added 2014-08-29T08:26:55Z
'kitchen' was added 2014-08-29T08:26:55Z
'boy' was added 2014-08-29T08:26:55Z
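
Since the question was about the age of a word, the timestamps printed above can be turned into an age with the standard library. This is a minimal sketch, assuming the timestamp strings have the `YYYY-MM-DDTHH:MM:SSZ` form shown in the output; the helper name `token_age_days` is hypothetical and not part of mediawiki-utilities.

```python
from datetime import datetime, timezone


def token_age_days(added_timestamp: str, now: datetime) -> int:
    """Whole days between the time a token was added and `now`.

    `added_timestamp` is assumed to look like '2013-05-24T20:07:27Z' (UTC).
    `now` must be timezone-aware.
    """
    added = datetime.strptime(added_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    added = added.replace(tzinfo=timezone.utc)
    return (now - added).days


if __name__ == "__main__":
    # Age of the token 'Of' as of the current moment.
    print(token_age_days("2013-05-24T20:07:27Z", datetime.now(timezone.utc)))
```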
Jsalsman closed this task as Resolved. Feb 26 2015, 12:44 AM
Jsalsman claimed this task.

@Halfak, thank you so much for this; what a tremendous help!

@jayvdb, since that clearly works for our purposes, I'm resolving this. And since it clearly means there is no need to touch the dumps, I will take all the dump-related requirements and mentions out of T89416.

jayvdb reopened this task as Open. Feb 26 2015, 1:52 AM

As far as I know, wikiwho functionality (whether by wikiwho or wikimedia-utilities) has not been integrated into pywikibot, which is what this task is about. Reopening.

Jsalsman removed Jsalsman as the assignee of this task. Apr 25 2015, 6:39 PM
Jsalsman set Security to None.

Is it better to do this in pywikibot or mediawiki-utilities?

FaFlo added a subscriber: FaFlo. Jun 4 2015, 5:21 PM

Hi, I only just became aware of this thread. Several things:
(i) A collaborator recently updated wikiwho to run with the newest version of the wikimedia-utilities and Python 3 (see https://github.com/maribelacosta/wikiwho/tree/python3 ).
(ii) Be aware that wikiwho is the *only* solution (apart from Luca de Alfaro's A3 algorithm, which we evaluated as well) that has been soundly tested for the accuracy of its provenance attribution, and that the solution is not trivial in many instances. At least, I'm not aware of any accuracy testing that has been done so far for other approaches. This can be critical if it is used in a real editing scenario by an end user.
(iii) Not so related but maybe helpful: our API already provides authorship information, and you can use that too; example: wikiwho.net/wikiwho/wikiwho_api_api.py?revid=649876382&name=Laura_Bush&format=json&params=author.
(iv) If anyone needs input or has feature requests, we are happy to respond to them. The whole Phabricator line of communication was simply not on my radar, but I will keep an eye on it. Otherwise, drop me a talk page message, an email to f.floeck@gmail.com, or a message on GitHub.
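
As an illustration of point (iii), a request URL of the form given in the example can be assembled with the standard library. This is a sketch only: the `http://` scheme and the helper name `build_authorship_url` are assumptions, the parameter names are copied from the example URL, and no request is actually sent because the response format is not described in this thread.

```python
from urllib.parse import urlencode

# Endpoint path as quoted in the comment; the "http://" scheme is an assumption.
WIKIWHO_ENDPOINT = "http://wikiwho.net/wikiwho/wikiwho_api_api.py"


def build_authorship_url(revid: int, name: str) -> str:
    """Build a query URL mirroring the example in the comment above."""
    query = urlencode({
        "revid": revid,       # revision ID to attribute
        "name": name,         # article title, with underscores for spaces
        "format": "json",
        "params": "author",
    })
    return "{0}?{1}".format(WIKIWHO_ENDPOINT, query)


if __name__ == "__main__":
    print(build_authorship_url(649876382, "Laura_Bush"))
```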

jayvdb lowered the priority of this task from Normal to Low. Jun 5 2015, 1:23 AM

@FaFlo, the recent merges to the wikiwho tree look great!
@Jsalsman, there is your answer; wikiwho using the modern mediawiki-utilities is being maintained. Use it, if your project needs the functionality in that library.

Since wikiwho's excellent functionality can now be accessed through a modern, supported client library, adding wikiwho functionality/support to pywikibot is a low priority.