
Extract CSV file from Wikipedia/Commons/MediaWiki Special pages
Open, Needs Triage, Public

Description

I propose adding a "CSV download" button to all MediaWiki Special pages, to replace the tiresome screen scraping that is currently needed to extract their lists.

Some examples:

  • During an edit-a-thon I needed a list of the articles created by the participants. I was able to get the list via https://en.wikipedia.org/wiki/Special:Contributions (pages created), but then I had to fall back to screen scraping to produce an Excel list.
  • I needed a list of deleted pages, so I had to obtain it by tiresome screen scraping of Special:Log?type=delete.
  • I needed a list of uploaded files on Commons via "Uploaded files". Again, screen scraping was the only option, followed by the painful removal of all the other comments and dates.

Those CSV lists could then easily be used with QuickStatements, AutoWikiBrowser, AC/DC, or any other tool that accepts lists of objects or pages.

Event Timeline

Geertivp created this task. Mar 25 2020, 3:06 AM
Restricted Application added a project: Wikidata. Mar 25 2020, 3:06 AM
Restricted Application added a subscriber: Aklapper.
  1. One possibility would be to write a MariaDB SQL query and run it via https://quarry.wmflabs.org, but that requires knowing the internal data model of MediaWiki.
Jarekt added a subscriber: Jarekt. May 8 2020, 1:48 PM

I think a CSV download option would be great. Over time I have found that I can often get a list of pages from many MediaWiki Special pages using AutoWikiBrowser, and when that fails there is always https://quarry.wmflabs.org, but a simple CSV download option would be preferable. Even better would be a CSV/Excel download option, as files with UTF-8 or other Unicode characters are tricky to open in a way that keeps them intact.
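
One common workaround for the Excel problem mentioned above is to write the CSV as UTF-8 with a byte order mark, which Excel generally uses to detect the encoding. The following is only a minimal sketch: the function name `rows_to_excel_csv` and the sample titles are made up for illustration and are not part of MediaWiki or any existing tool.

```python
import csv
import io


def rows_to_excel_csv(rows, header):
    """Return CSV content as bytes, encoded as UTF-8 with a leading BOM.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # "utf-8-sig" prepends the byte order mark EF BB BF to the output.
    return buffer.getvalue().encode("utf-8-sig")


# Made-up sample rows with non-ASCII titles.
sample = rows_to_excel_csv(
    [["Zürich", "2020-03-25"], ["Łódź", "2020-05-08"]],
    ["title", "timestamp"],
)
```

Writing `sample` to a file opened in binary mode (`"wb"`) should then give a file that opens in Excel with the accented characters preserved.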

Tgr added a subscriber: Tgr. May 9 2020, 3:32 PM

Log-type special pages are displayed with a pager. How would you handle query limits for CSV download? A separate CSV file for every page?

Bawolff added a subscriber: Bawolff. May 9 2020, 6:20 PM

These pages all have JSON downloads (via the API), so I guess this is asking to add CSV as an output format for the API?

@Bawolff Could you give an example of how to download a JSON file via the API? Then we might be able to convert or import the JSON file to CSV with an external tool.

You can click on the links at https://www.mediawiki.org/w/api.php?action=help&modules=query for examples for different types of queries. As an example, here is a list of my contribs: https://www.mediawiki.org/w/api.php?action=query&list=usercontribs&ucuser=Bawolff&format=json&formatversion=2

The data is not necessarily flat, so converting it to a flat format might be tricky, but at the very least generator modules should be representable as CSV.
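
To illustrate the flattening problem, here is a rough sketch of one possible convention: nested objects become dotted column names and lists are joined with "|". Both the convention and the sample records are assumptions made for this example; they are not something the API defines.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; join list items with '|'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = "|".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def records_to_csv(records):
    """Convert a list of (possibly nested) records into CSV text."""
    rows = [flatten(record) for record in records]
    # Column order: union of all keys, in the order they are first seen.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records, shaped loosely like API list items.
records = [
    {"title": "Foo", "user": "Example", "params": {"tags": ["a", "b"]}},
    {"title": "Bar", "user": "Example", "comment": "cm"},
]
csv_text = records_to_csv(records)
```

Records that lack a column simply get an empty cell, which keeps the output rectangular even when items have different fields.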

For reference, if you wanted to build this into MediaWiki, this sort of thing is implemented as a subclass of ApiFormatBase: https://doc.wikimedia.org/mediawiki-core/master/php/classApiFormatBase.html

Tgr added a comment. May 10 2020, 1:20 PM

Converting API output to CSV is pretty easy with jq:

$ curl -s 'https://www.mediawiki.org/w/api.php?action=query&list=usercontribs&ucuser=Bawolff&format=json&formatversion=2' | jq -r '.query.usercontribs[] | [.user, .title, .timestamp, .comment] | @csv'
"Bawolff","User talk:Bawolff","2020-04-01T15:43:57Z","/* Help: Rollback right */ phab"
"Bawolff","User talk:Bawolff/Reflections on graphs","2020-03-29T18:02:53Z","cm"
...


Another example, for a user's log events (add &letype=delete to the URL to restrict it to the deletion log):

curl -s 'https://fr.wikipedia.org/w/api.php?action=query&list=logevents&leuser=ALDO_CP&format=json' | jq -r '.query.logevents[] | [.user, .title, .timestamp, .comment] | @csv'
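
For scripted use, the same kind of URL can be assembled from a parameter dictionary rather than typed by hand. This is just a sketch: the helper name `deletion_log_url` is made up, and it assumes `letype=delete` and `lelimit=max` are the filters you want.

```python
from urllib.parse import urlencode


def deletion_log_url(api_base, user=None, limit="max"):
    """Build a logevents query URL restricted to the deletion log.

    Hypothetical helper for illustration only.
    """
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "delete",   # only deletion log entries
        "lelimit": limit,     # "max" asks for the largest allowed batch
        "format": "json",
        "formatversion": "2",
    }
    if user is not None:
        params["leuser"] = user  # optional: only actions by this user
    return f"{api_base}?{urlencode(params)}"


url = deletion_log_url("https://fr.wikipedia.org/w/api.php", user="ALDO_CP")
```

The resulting URL can be passed to curl and piped through the same jq filter as above.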

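List all the users who contributed to an article (each request returns at most the API's per-request revision limit, so long histories need continuation):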

curl -s 'https://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Project:Wikidata&format=json&rvlimit=max' |jq -r '.query.pages[].revisions[].user' |sort -u
curl -s 'https://www.wikidata.org/w/api.php?action=query&prop=revisions&titles=Project:Wikipedia&format=json&rvlimit=500' |jq -r '.query.pages[].revisions[].user' |sort -u
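
The one-liners above only see a single batch of revisions. For longer histories, and as one possible answer to the earlier question about pagers and query limits, the API returns a `continue` object whose values are sent back with the next request. Below is a minimal sketch of that loop for the revision case; the function names, the User-Agent string and the injectable `fetch` parameter are assumptions made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(api_base, params):
    """Perform one API GET request and decode the JSON response."""
    request = Request(
        f"{api_base}?{urlencode(params)}",
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "csv-export-sketch/0.1 (example)"},
    )
    with urlopen(request) as response:
        return json.load(response)


def revision_users(api_base, title, fetch=fetch_json):
    """Return the sorted set of users who edited `title`, following continuation."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user",
        "rvlimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    users = set()
    cont = {}
    while True:
        data = fetch(api_base, {**params, **cont})
        for page in data.get("query", {}).get("pages", []):
            for revision in page.get("revisions", []):
                # Revisions with a hidden user have no "user" key.
                if "user" in revision:
                    users.add(revision["user"])
        if "continue" not in data:
            break
        # Send the continuation values back with the next request.
        cont = data["continue"]
    return sorted(users)
```

For example, `revision_users("https://nl.wikipedia.org/w/api.php", "Project:Wikidata")` would walk the whole history rather than only the first batch. The same loop shape should apply to log and contribution lists, with the matching `list=` parameters.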