
Tabular Data support in pywikibot
Open, Low, Public

Description

Dear friends, I've been using Pywikibot for a long time to handle the reporting tasks of the Spanish editions of the Wiki Loves (Monuments, Earth...) contests. See here, for instance. The output is generated by a Pywikibot-enabled Python notebook run from PAWS (I don't know which Pywikibot release it's running, sorry) and relies heavily on pandas and matplotlib to create the content. The "dataset" contains information about the images uploaded as part of the contest (see the dataset here). As you can see, I'm wrapping the content in <pre></pre> to display it, and stripping the first and last lines when the notebook reads it back to update the reports.

I'm not quite happy with this poor man's CSV approach, and when I noticed that there was a way to "properly" store datasets in Commons (Tabular Data) I tried to take advantage of it (BTW, handling Tabular Data from pandas is much harder than handling CSV, but it seems to be the new "standard" in the WM projects). However, after creating the JSON structure in the notebook, turning it into a string and using the regular 'save' method of a 'Page', I was not successful at all. See the results here. It looks like a valid Tabular Data structure (in fact, it is copied from here (just one row)), but it is not displayed as a valid Tabular Data file and, more importantly, if I try to edit it in the regular way, all I get is the following message (and no option to actually edit the page):

Content format not supported.

The content format application/json+pretty is not supported by the content model wikitext.

I've asked the people responsible and the answer was really uninformative.

Do you have any idea what I'm doing wrong? Please let me know what additional information you need. Thanks.

Event Timeline

Discasto created this task. Mar 5 2018, 8:57 AM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Mar 5 2018, 8:57 AM
Discasto updated the task description. Mar 5 2018, 8:57 AM
Mpaa added a subscriber: Mpaa. Mar 6 2018, 12:03 AM

You are not doing anything wrong.
The Page() class does not support all values of:

- rvcontentformat: Serialization format used for rvdifftotext and expected for output of content.
One of the following values: text/x-wiki, application/json, text/plain, text/javascript, text/css

but only the default (I think it is text/x-wiki).

If you want to save JSON, you need to subclass Page() and override the save() method.
I did it in proofreadpage.py. You can use it as reference.

Unfortunately, this works only when saving.
There is currently no way to load using an rvcontentformat other than text/plain.

I tried once but gave up. See https://gerrit.wikimedia.org/r/#/c/224852/.
You can try to resurrect that concept.

Thanks for the info, Mpaa. I guess I can derive a basic saver for Tabular Data from your proofreadpage.py, following Yurik's statements. I'll let you know whether it works. On the other hand, reading is not really a problem; I think I can use the MediaWiki API to retrieve the content. At the end of the day, saving is the actual blocker. Best regards
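For reference, the "read through the MediaWiki API" idea could look roughly like the sketch below. It is only an illustration under assumptions: it talks to the Commons API endpoint directly with the standard library, asks for formatversion=2 output with the main slot, and the helper names (build_query_url, extract_tabular_data, fetch_tabular_data) and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the dataset lives on Wikimedia Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build an action=query URL for the latest revision content of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_tabular_data(response):
    """Pull the page text out of a formatversion=2 response and parse it as JSON."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        raise KeyError("page not found: " + page.get("title", "?"))
    content = page["revisions"][0]["slots"]["main"]["content"]
    return json.loads(content)


def fetch_tabular_data(title):
    """Fetch and parse a Data: page (this performs a network request)."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "tabular-data-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_tabular_data(json.load(reply))
```

A call such as fetch_tabular_data("Data:Example.tab") (a hypothetical title) would return the parsed dict, from which the "schema" and "data" keys can be handed to pandas.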

Mpaa added a comment. Mar 6 2018, 7:13 PM

I do not know about Tabular Data. However, it would be nice to expand the library with new features.
Not being able to get a different contentmodel/format is a Pywikibot limitation that would be nice to remove.
If you feel like 'standardizing' your new class within Pywikibot, you're welcome :-)

Discasto added a comment. Edited Mar 7 2018, 9:26 AM

Hi all, I've successfully posted Tabular Data through Pywikibot following your code :-) However, I haven't used "contentmodel" (Tabular.JsonConfig does not seem to be supported) but only "contentformat" (set to "application/json"). That is, simply overriding the save() method with

kwargs['contentformat'] = 'application/json'

and serializing with json.dumps with ensure_ascii set to False has been enough for writing (see here).
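Put together, the approach described above might be sketched as follows. This is not the code from the notebook, only a minimal illustration: JsonSaveMixin, to_json_text and TabularDataPage are hypothetical names, and the part that actually uses Pywikibot is left in comments because it needs a configured bot account.

```python
import json


class JsonSaveMixin:
    """Hypothetical mixin: force contentformat to application/json on save."""

    def save(self, *args, **kwargs):
        # Per the comment above, setting this one keyword was enough;
        # contentmodel is deliberately left untouched.
        kwargs['contentformat'] = 'application/json'
        return super().save(*args, **kwargs)


def to_json_text(data):
    """Serialize a dict to JSON, keeping non-ASCII characters as they are."""
    return json.dumps(data, ensure_ascii=False)


# Intended use with Pywikibot (not executed here):
#
#   import pywikibot
#
#   class TabularDataPage(JsonSaveMixin, pywikibot.Page):
#       pass
#
#   site = pywikibot.Site('commons', 'commons')
#   page = TabularDataPage(site, 'Data:Sandbox/Example.tab')  # hypothetical title
#   page.text = to_json_text(table)  # table: a Tabular Data dict
#   page.save(summary='Update dataset')
```

Keeping the override in a mixin means the same few lines could also be reused for map data pages, which are stored as JSON as well.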

And yes, there should be a new class called TabularDataPage (and the same for maps). However, I'm far from being a Python expert and feel that my code may not be suitable for inclusion in a professional package such as Pywikibot (moreover, I find Tabular Data a rather inconvenient format, given that I mostly work with pandas, which does not directly support it). On the other hand, I haven't yet investigated how to read Tabular Data, and it may not be that simple (however, I guess that something similar is already being done with Wikidata items, as they're also modeled as JSON structures, so it shouldn't be impossible).
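To illustrate the conversion step from the CSV-style data I already have, here is a rough sketch using only the standard library. It assumes the usual Tabular Data layout (license, description, schema.fields, data); the function name csv_to_tabular_data, the default license and the description text are placeholders, and the exact set of required keys should be checked against the Tabular Data documentation.

```python
import csv
import io


def csv_to_tabular_data(csv_text, types, license_id="CC0-1.0"):
    """Turn CSV text (header row first) into a Tabular Data-style dict.

    `types` maps each column name to 'string', 'number' or 'boolean'.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def convert(value, kind):
        if value == "":
            return None  # empty cell becomes JSON null
        if kind == "number":
            number = float(value)
            return int(number) if number.is_integer() else number
        if kind == "boolean":
            return value.strip().lower() in ("true", "1", "yes")
        return value

    return {
        "license": license_id,  # placeholder default
        "description": {"en": "Converted from CSV"},  # placeholder text
        "schema": {
            "fields": [{"name": name, "type": types[name]} for name in header]
        },
        "data": [
            [convert(cell, types[name]) for name, cell in zip(header, row)]
            for row in body
        ],
    }
```

With pandas, the CSV text could come from DataFrame.to_csv(index=False), so the existing reports would not need to change much.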

To sum up, I'm trying to write some clean code to write and read Tabular Data and will be glad to contribute it to Pywikibot. However, I'm unsure how to go about it. Any guidance will be appreciated.

Xqt triaged this task as Low priority. May 12 2018, 5:47 AM