
Generate partial dumps of wiki articles
Closed, ResolvedPublic

Description

Implement functionality to download dumps of wiki articles for analysis.
Look through the existing Wikiwho API code and refactor it if necessary.

Microtask for T89416 - Accuracy review of Wikipedias

Event Timeline

I'm currently refactoring the wikiwho api code. Thanks @FaFlo for the help! The repo can be found here.
The script to be run is wikiwho_api_api.py.
It can be run as:

python wikiwho_api_api.py <article_name>

It outputs all the revids relating to the article of interest.
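For reference, here is a minimal sketch (not the actual wikiwho_api_api.py code) of how the same list of revision IDs can be pulled directly from the standard MediaWiki query API; the article name is passed on the command line as above:

    import sys
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def get_revision_ids(article_name):
        """Return all revision IDs for an article, oldest first."""
        revids = []
        params = {
            "action": "query",
            "prop": "revisions",
            "titles": article_name,
            "rvprop": "ids|timestamp",
            "rvlimit": "max",
            "rvdir": "newer",
            "format": "json",
        }
        while True:
            data = requests.get(API_URL, params=params).json()
            page = next(iter(data["query"]["pages"].values()))
            revids.extend(rev["revid"] for rev in page.get("revisions", []))
            # Follow the API continuation token until all revisions are fetched
            if "continue" not in data:
                break
            params.update(data["continue"])
        return revids

    if __name__ == "__main__":
        for revid in get_revision_ids(sys.argv[1]):
            print(revid)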

@Jsalsman, next steps for this task? Should we extract the latest addition dates of every word in the article?

@prnk28 is this extracting the word dates yet?

Hadn't we decided that I'd work on this once the review system is ready? Right now we are working on tasks that fall in the July 4 - Aug 3 category in the schedule at T129536, so 'obtaining article dumps' and 'manual list based input' have been pushed back until after what we're working on now.

@prnk28 yes, but please do this now because I'm still stuck on the single-directory refactor of the new design. I will send you and Fabian an update in a few minutes.

The wikiwho code now outputs the time for each revid as well (in JSON format).
Check this output file for an article called 'Kendhoo':
https://github.com/priyankamandikal/wikiwho_api/blob/master/out.txt
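As a rough sketch of how that output could be consumed, assuming the JSON is a flat mapping of revid to MediaWiki-style timestamp (the real structure of out.txt may differ):

    import json
    from datetime import datetime

    # Assumed structure: {"123456": "2006-05-01T12:34:56Z", ...}
    with open("out.txt") as f:
        rev_times = json.load(f)

    # ISO-8601 timestamps sort chronologically as strings
    for revid, ts in sorted(rev_times.items(), key=lambda kv: kv[1]):
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        print(revid, dt)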

@Jsalsman, next steps?
Some form of filtering out words based on importance/relevance?

@Jsalsman, should I close this as resolved? This refactoring has been done for a while now.

@prnk28, please close with a link to code that maps an article name to the mean, median, and standard deviation of the dates each word was added to the article.
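A rough sketch of what that calculation could look like, assuming a list of (word, datetime_added) pairs has already been extracted from the wikiwho output (the extraction step and the sample data below are placeholders, not real results):

    import statistics
    from datetime import datetime, timezone

    def date_stats(word_dates):
        """word_dates: list of (word, datetime) pairs.
        Returns (mean date, median date, stdev in seconds)."""
        # Convert each addition date to a POSIX timestamp so we can do arithmetic
        ts = [dt.timestamp() for _, dt in word_dates]
        mean_ts = statistics.mean(ts)
        median_ts = statistics.median(ts)
        stdev_seconds = statistics.stdev(ts) if len(ts) > 1 else 0.0
        mean_dt = datetime.fromtimestamp(mean_ts, tz=timezone.utc)
        median_dt = datetime.fromtimestamp(median_ts, tz=timezone.utc)
        return mean_dt, median_dt, stdev_seconds

    # Hypothetical example data:
    sample = [
        ("kendhoo", datetime(2007, 3, 1, tzinfo=timezone.utc)),
        ("island", datetime(2009, 6, 15, tzinfo=timezone.utc)),
        ("atoll", datetime(2012, 1, 20, tzinfo=timezone.utc)),
    ]
    mean_dt, median_dt, stdev_s = date_stats(sample)
    print("mean:", mean_dt, "median:", median_dt, "stdev (days):", stdev_s / 86400)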

@Jsalsman, the datetime analysis is part of task T138953.
This task was just for refactoring wikiwho to obtain dates for each revid. Since that's done, I'm closing this task as resolved :)