
Commons limit on data is 2,048 kilobytes
Open, Needs Triage, Public

Description

I was adding data to Commons and got "Error: The text you have submitted is 3,171.888 kilobytes long, which is longer than the maximum of 2,048 kilobytes. It cannot be saved."

This was to create, on Wikipedia, an interactive heat map like this one: https://ourworldindata.org/per-capita-co2

Event Timeline

Aklapper changed the task status from Open to Stalled. Oct 7 2019, 8:59 AM

@Doc_James: Please follow https://www.mediawiki.org/wiki/How_to_report_a_bug and provide context and clearer steps to reproduce - thanks!

Sure. I tried to add a bunch more data, covering another 100 years, here: https://commons.wikimedia.org/wiki/Data:CO2PerCapita.tab

It was 3,171.888 kilobytes long, which generated the error mentioned above.

Thanks! I assume this is about StructuredDataOnCommons. Hence adding a project tag so others can find this task when searching for tasks under that project or looking at that project's workboard.

Aklapper renamed this task from "Commons limit on data" to "Commons limit on data is 2,048 kilobytes". Oct 7 2019, 9:49 AM
Aklapper changed the task status from Stalled to Open.

The Data: namespace is tabular data, not structured data, and as far as I’m aware that’s a separate project.

Correct, this is the tabular data hitting the 2 MB page limit. One relatively simple solution would be to fix the JsonConfig base class to store data as compact rather than pretty-printed JSON (there shouldn't be any externally visible consequences, because JSON is always reformatted before saving). That would immediately increase the effective maximum storage by a significant margin, especially for .map pages: GeoJSON tends to have a lot of small arrays, so when they are broken up across lines and prefixed with lots of spaces, the size grows to several times the original. I suspect Wikibase has had to solve a similar problem storing its items in the MediaWiki engine.
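
For a rough sense of how much pretty-printing inflates this kind of dataset, here is a minimal sketch (the data is made up, and this is not the actual JsonConfig code):

import json

# Made-up rows shaped roughly like a Data: page's "data" array.
rows = [["AFG", 1900 + i, 0.01 * i] for i in range(50000)]
doc = {"license": "CC-BY-4.0", "data": rows}

pretty = json.dumps(doc, indent=4)                # pretty-printed serialization
compact = json.dumps(doc, separators=(",", ":"))  # compact serialization

print(len(pretty), len(compact), round(len(pretty) / len(compact), 1))
# The pretty-printed form typically comes out several times larger than
# the compact one, which matches the size ratios reported in this task.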

It’s already stored compactly:

lucaswerkmeister-wmde@mwmaint1002:~$ mwscript shell.php commonswiki
Psy Shell v0.9.9 (PHP 7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1 — cli) by Justin Hileman
>>> $services = MediaWiki\MediaWikiServices::getInstance();
=> MediaWiki\MediaWikiServices {#208}
>>> $revision = $services->getRevisionStore()->getRevisionByTitle( $services->getTitleParser()->parseTitle( 'Data:CO2PerCapita.tab' ) );
=> MediaWiki\Revision\RevisionStoreRecord {#742}
>>> $services->getBlobStore()->getBlob( $revision->getSlot( 'main' )->getAddress() );
=> "{"license":"CC-BY-4.0","description":{"en":"CO<sub>2</sub> emissions per capita"},"sources":"https://ourworldindata.org/per-capita-co2","schema":{"fields":[{"name":"country","type":"string","title":{"en":"ISO Country Code"}},{"name":"year","type":"number","title":{"en":"Year"}},{"name":"tonnes","type":"number","title":{"en":"tonnes per capita"}}]},"data":[["AFG",1900,0],["AFG",1901,0],["AFG",1902,0],["AFG",1903,0],…

Is it possible to double or triple the maximum allowed size?

@Lucas_Werkmeister_WMDE thanks, but this is very surprising -- I was 99.99% certain it was storing it pretty-printed... Either that, or it did the size limit check on the pretty-printed version before storing. Would it be possible to do a direct SQL query for that data, and also to run a MAX( LEN( data )) to see the largest page in the Data namespace on Commons? Thanks for checking!

I don’t think that’s possible, but you can check the page_len for yourself in Quarry.

So somehow @Sic19 managed to circumvent the size limit check with "Data:Canada/Nunavut.map" (I suppose AWB is old enough that it's not sensitive to the workings of tabular data). Is there a way to make the size limit check apply to the compacted data?

He has not circumvented any size limit check. Data:Canada/Nunavut.map is 826740 bytes long –

lucaswerkmeister-wmde@tools-sgebastion-07:~$ sql commonswiki 'SELECT page_len FROM page WHERE page_namespace = 486 AND page_title = "Canada/Nunavut.map";'
+----------+
| page_len |
+----------+
|   826740 |
+----------+
lucaswerkmeister-wmde@tools-sgebastion-07:~$ curl -s 'https://commons.wikimedia.org/w/index.php?title=Data:Canada/Nunavut.map&action=raw' | wc -c
826740

– which is less than 2048 kilobytes.

He has not circumvented any size limit check.

I contend that something was circumvented when Simon originally created the page, since I cannot make any changes to this page, even though the resulting length of the data, after I 1) added an Inuktitut description and 2) re-compacted the data manually to ~826 KB, should be less than 2 MB:

[Screenshots attached: Annotation 2019-11-05 151547.jpg and Annotation 2019-11-05 151700.jpg]

where the message in red states "Error: The text you have submitted is 6,223.228 kilobytes long, which is longer than the maximum of 2,048 kilobytes. It cannot be saved."

(I suspect that I'm repeating what was demonstrated initially when James created this task and what Yuri last suggested, but) it is this misplaced check that needs to be adjusted so that it examines only the compacted data and not the pretty-printed data.

See my above comment, and @Lucas_Werkmeister_WMDE's response -- while the system stores things in the compact JSON form, the length is checked while it is in the "pretty-printed" format. A way to work around it might be to upload it to the server in the compact form via the API, in which case it might get accepted.
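
A minimal sketch of that workaround with Pywikibot (the local file and page title here are placeholders, not the actual pages from this task):

import json
import pywikibot

site = pywikibot.Site("commons", "commons")

# Re-serialize the dataset without whitespace before submitting it.
with open("dataset.tab.json") as f:  # placeholder local file
    data = json.load(f)
compact = json.dumps(data, separators=(",", ":"), ensure_ascii=False)

page = pywikibot.Page(site, "Data:ExampleDataset.tab")  # placeholder title
page.text = compact
page.save(summary="Upload tabular data as compact JSON")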

The process with which I uploaded this GeoJSON map data uses the AWB CSVLoader plug-in, and the spaces/line breaks were removed during the reformatting from JSON to CSV. So what @Yurik is saying makes sense.

If it is useful, I can provide further details of the AWB upload process.

See my above comment, and @Lucas_Werkmeister_WMDE's response -- while the system stores things in the compact JSON form, the length is checked while it is in the "pretty-printed" format. A way to work around it might be to upload it to the server in the compact form via the API, in which case it might get accepted.

Just tried this for an upload that was running into this problem; it works. It's quite unfortunate that large datasets can only be edited via the API. For non-API edits, the effective maximum size is quite small.

The use of the API works up to a point. I am noticing that for GeoJSON files at or above 250KB I'm getting read timeouts when using Pywikibot. Any way to get past those errors?
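
(One guess, assuming the timeouts come from the client-side HTTP layer rather than from the server rejecting the edit: Pywikibot's socket timeout can be raised before saving, roughly like this.)

import pywikibot
from pywikibot import config

# Give large .map saves more time before the HTTP request is abandoned.
config.socket_timeout = 180  # seconds; adjust to taste

site = pywikibot.Site("commons", "commons")
# ... build the page text and call page.save() as usual ...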

How do I condense a file like this? http://opendata.columbus.gov/datasets/corporate-boundary
It read as 2,936.44 kilobytes. It's a single shape. I tried 'simplifying' the shape into something less detailed, and it then read as 4,807.333 kilobytes.
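
For reference, much of the bulk in a boundary file like that is whitespace and coordinate precision, so a possible approach (file names are placeholders, and rounding does discard some detail) is:

import json

with open("corporate_boundary.geojson") as f:  # placeholder file name
    geo = json.load(f)

def round_coords(node, digits=5):
    # ~5 decimal places is roughly 1 m of precision at the equator.
    if isinstance(node, float):
        return round(node, digits)
    if isinstance(node, list):
        return [round_coords(x, digits) for x in node]
    if isinstance(node, dict):
        return {k: round_coords(v, digits) for k, v in node.items()}
    return node

with open("corporate_boundary.min.geojson", "w") as f:
    # separators=(",", ":") strips the whitespace that pretty-printing adds
    json.dump(round_coords(geo), f, separators=(",", ":"))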

Running into this issue as well... I think.

I was trying to create https://commons.wikimedia.beta.wmflabs.org/wiki/Data:Leesonderzoek2013.tab for testing, but couldn't: "Error: The text you have submitted is 4,052.042 kilobytes long, which is longer than the maximum of 2,048 kilobytes. It cannot be saved."

This is, however, a lie! I entered the exact same input on https://commons.wikimedia.beta.wmflabs.org/wiki/User:AJ/long and I was able to submit it! And according to https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=User:AJ/long&action=info this page is only 1,108,022 bytes long. Not 4,052.042 kilobytes! Only in the Data: namespace does it magically become almost four times longer.

@AlexisJazz per my above comments -- it seems the system pretty-prints the JSON, checks the size, and only then stores it in the compact format. To make it work properly, the system should validate the JSON size only after serializing it in compact form.
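
In other words, the check would need to measure the compact serialization; a sketch of the intended order (not the actual JsonConfig code):

import json

MAX_BYTES = 2048 * 1024  # the 2,048-kilobyte page limit

def fits_within_limit(data) -> bool:
    # Measure the form that is actually stored (compact),
    # not the pretty-printed form shown in the editor.
    compact = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    return len(compact.encode("utf-8")) <= MAX_BYTES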

I found another workaround. Create the page in your userspace (or anywhere else) and move it to the Data: namespace (don't forget the .tab extension). The resulting page doesn't render nicely, but you can query it.

Page import works as well (same result) but requires a Commons administrator.