
Extract and Sort WikiApiary Database
Closed, ResolvedPublic

Description

WikiApiary has amassed information on roughly 25,000 public, active MediaWiki sites and 3,000 inactive ones. We will query WikiApiary to sort this information into meaningful categories.

To start, we will look at the active sites to identify third-party (non-WMF) users and refine their classifications beyond "commercial", "university", "government", and so on. We also want to identify the release versions these sites are running and to anonymize the data.

For the inactive sites, we would like to know why they are inactive. See T1246: Mentor Google Code-in 2014 Student(s) Who Will Research MediaWiki Sites Classified as Defunct on WikiApiary
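WikiApiary runs on Semantic MediaWiki, so one way to pull this data programmatically is its Ask API. Below is a minimal sketch in Python; the endpoint and the property names ("Is active", "Has MediaWiki version") are assumptions about WikiApiary's schema, not verified, so adjust them against the live wiki.

```
import requests

API = "https://wikiapiary.com/w/api.php"  # assumed endpoint

def ask(query):
    """Run a Semantic MediaWiki Ask query and return its results dict."""
    resp = requests.get(API, params={"action": "ask", "query": query, "format": "json"})
    resp.raise_for_status()
    return resp.json().get("query", {}).get("results", {})

# Property names here are assumptions about WikiApiary's schema.
results = ask("[[Category:Website]][[Is active::True]]"
              "|?Has MediaWiki version|limit=500")
for name, data in results.items():
    versions = data["printouts"].get("Has MediaWiki version", [])
    print(name, versions[0] if versions else "unknown")
```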

Event Timeline

Palexis raised the priority of this task to Needs Triage.
Palexis updated the task description.
Palexis added a project: Wiki-Release-Team.
Palexis changed Security from none to None.

I have about 53,783 wikis on http://wikistats.wmflabs.org/, of which 14,170 are listed as "single" wikis (not part of clusters/hives/wikifarms); the rest are stored as one table per wikifarm. I can provide the release version for most of them. Available formats are HTML, wikitext, CSV, and XML.

@Dzahn, when can we get a csv file !?! :)

Start here:

http://wikistats.wmflabs.org/api.php?action=dump&table=mediawikis&format=csv

and here:

http://wikistats.wmflabs.org/largest_csv.php

I can try to make it better (more release-version info), though.
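For anyone following along: the dump endpoint above returns plain CSV, so loading it into Python takes only a few lines. A sketch; the column names depend on the dump's actual header row, so inspect `reader.fieldnames` before relying on any of them.

```
import csv
import io

import requests

resp = requests.get("http://wikistats.wmflabs.org/api.php",
                    params={"action": "dump", "table": "mediawikis", "format": "csv"})
resp.raise_for_status()

# Keys come from the dump's header row; the exact column names are
# not documented in this task, so check reader.fieldnames first.
reader = csv.DictReader(io.StringIO(resp.text))
rows = list(reader)
print(len(rows), "wikis in the dump")
```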

@Dzahn, I have the two files: Largest MediaWikis (5,001 rows) and MediaWikis (14,185 rows). Can you provide version info for the Largest MediaWikis? In the MediaWikis file, I see version info for some of the wikis.

"largest" is only 5001 because of an arbitrary limit in the code, should be way more. and yes, but i need to run scripts for that, i'll check tomorrow to provide something better

This comment was removed by Palexis.

@MarkAHershberger, can you elaborate on the fix that you plan? Thanks.

In T1136#19749, @Dzahn wrote:

"largest" is only 5001 because of an arbitrary limit in the code, should be way more. and yes, but i need to run scripts for that, i'll check tomorrow to provide something better

Hi @Dzahn, have you had any success with the scripts? Thanks.

Successfully imported data file from WikiApiary!

Palexis triaged this task as Medium priority. Dec 8 2014, 3:31 PM

I need to publish the script I have that does this.