
Extract and Sort WikiApiary Database
Closed, Resolved · Public

Description

WikiApiary has amassed information on roughly 25,000 public, active MediaWiki sites and 3,000 inactive ones. We will query WikiApiary to sort this information into meaningful categories.

To start, we will look at the active sites to identify the third-party, non-WMF users and refine their classifications beyond "commercial", "university", "government", etc. We would also like to identify the release versions they are running and anonymize the data.
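
A minimal sketch of that first query step, assuming WikiApiary's Semantic MediaWiki `ask` API; the category and property names ("Website", "Is active", "Has MediaWiki version") are assumptions about WikiApiary's schema and should be verified against the live wiki:

```
import requests

API = "https://wikiapiary.com/w/api.php"

def fetch_active_sites(limit=500):
    # Semantic MediaWiki "ask" query; the category and property names
    # here are assumptions and may need adjusting against WikiApiary.
    query = ("[[Category:Website]][[Is active::True]]"
             "|?Has MediaWiki version|limit=%d" % limit)
    resp = requests.get(API, params={
        "action": "ask",
        "query": query,
        "format": "json",
    })
    resp.raise_for_status()
    results = resp.json().get("query", {}).get("results", {})
    for name, data in results.items():
        versions = data.get("printouts", {}).get("Has MediaWiki version", [])
        yield name, (versions[0] if versions else None)

if __name__ == "__main__":
    for site, version in fetch_active_sites(limit=50):
        print(site, version)
```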

For the inactive sites, we would like to know why they are inactive. See T1246: Mentor Google Code-in 2014 Student(s) Who Will Research MediaWiki Sites Classified as Defunct on WikiApiary

Event Timeline

Palexis created this task. Nov 6 2014, 10:37 PM
Palexis raised the priority of this task to Needs Triage.
Palexis updated the task description.
Palexis added a project: Wiki-Release-Team.
Palexis changed Security from none to None.
Palexis updated the task description. Nov 7 2014, 12:00 AM
Palexis updated the task description. Nov 7 2014, 1:27 AM
Dzahn added a subscriber: Dzahn. Nov 7 2014, 2:05 AM

I have about 53,783 wikis on http://wikistats.wmflabs.org/, of which 14,170 are listed as "single" wikis (not part of clusters/hives/wiki farms); the rest are one table per farm. I can provide the release version for most of them. Available formats are HTML, wiki text, CSV, and XML.

@Dzahn, when can we get a CSV file!?! :)

Dzahn added a comment. Nov 7 2014, 2:26 AM

@Dzahn, when can we get a CSV file!?! :)

Start here:

http://wikistats.wmflabs.org/api.php?action=dump&table=mediawikis&format=csv

and here:

http://wikistats.wmflabs.org/largest_csv.php

I can try to make it better (more release version info), though.
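
For reference, a rough sketch of pulling that CSV dump and tallying the release versions it reports; the "version" column name is an assumption, so check the dump's header row first:

```
import csv
import io
from collections import Counter

import requests

DUMP = "http://wikistats.wmflabs.org/api.php?action=dump&table=mediawikis&format=csv"

resp = requests.get(DUMP)
resp.raise_for_status()
reader = csv.DictReader(io.StringIO(resp.text))

# Count release versions, treating blank cells as "unknown".
versions = Counter((row.get("version") or "").strip() or "unknown"
                   for row in reader)
for version, count in versions.most_common(20):
    print("%6d  %s" % (count, version))
```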

@Dzahn, I have the two files: Largest MediaWikis (5,001 rows) and MediaWikis (14,185 rows). Can you provide version info for Largest MediaWikis? In the MediaWikis file, I see version info for only some of the wikis.
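
A quick sketch of that coverage check, assuming the two exports are saved locally as largest.csv and mediawikis.csv with a "version" column (both the file names and the column name are assumptions):

```
import csv

def version_coverage(path, column="version"):
    # Count how many rows in the export actually carry a release version.
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    filled = sum(1 for r in rows if (r.get(column) or "").strip())
    return filled, len(rows)

for path in ("largest.csv", "mediawikis.csv"):
    filled, total = version_coverage(path)
    print("%s: %d/%d rows have version info" % (path, filled, total))
```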

Dzahn added a comment. Nov 7 2014, 3:10 AM

"largest" is only 5001 because of an arbitrary limit in the code, should be way more. and yes, but i need to run scripts for that, i'll check tomorrow to provide something better

This comment was removed by Palexis.
Palexis updated the task description. Nov 8 2014, 12:46 AM
Palexis moved this task from Backlog to Doing on the Wiki-Release-Team board. Nov 11 2014, 4:34 PM
Palexis updated the task description. Nov 11 2014, 5:04 PM
Palexis added a comment. (Edited) Nov 19 2014, 8:05 PM

Exporting data has been limited by bug https://bugzilla.wikimedia.org/show_bug.cgi?id=49203.

@MarkAHershberger, can you elaborate on the fix that you plan? Thanks.

In T1136#19749, @Dzahn wrote:

"largest" is only 5001 because of an arbitrary limit in the code, should be way more. and yes, but i need to run scripts for that, i'll check tomorrow to provide something better

Hi @Dzahn, have you had any success with the scripts? Thanks.

Successfully imported the data file from WikiApiary!

Palexis triaged this task as Medium priority. Dec 8 2014, 3:31 PM
Palexis moved this task from Doing to Ready to Go on the Wiki-Release-Team board. Dec 8 2014, 3:59 PM
Palexis moved this task from Ready to Go to Doing on the Wiki-Release-Team board. Dec 15 2014, 12:43 PM

Need to publish the script I have that does this.

Palexis closed this task as Resolved. Jan 12 2015, 9:33 PM