Description
They are almost 60GB, which is an order of magnitude larger than anything else.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
don't try to abstract things that aren't text or wikitext | mediawiki/extensions/ActiveAbstract | master | +8 -3
Event Timeline
5487381 titles for enwiki abstracts, 5G. 37327702 titles for wikidatawiki abstracts, 59G. The content of wikidata abstracts is mostly garbage though. A sample:
<doc>
<title>Wikidata: Q28</title>
<url>https://www.wikidata.org/wiki/Q28</url>
<abstract>{"type":"item","id":"Q28","labels":{"nb":{"language":"nb","value":"Ungarn"},"en":{"language":"en","value":"Hungary"},"nn":{"language":"nn","value":"Ungarn"},"se":{"language":"se","value":"Ung\u00e1ra"},"de":{"language":"de","value":"Ungarn"},"fr":{"language":"fr","value":"Hongrie"},"it":{"language":"it","value":"Ungheria"},"pl":{"language":"pl","value":"W\u0119gry"},"eo":{"language":"eo","value":"Hungario"},"ru":{"language":"ru","value":"\u0412\u0435\u043d\u0433\u0440\u0438\u044f"},"es":{"language":"es","value":"Hungr\u00eda"},"be-tarask":{"language":"be-tarask","value":"\u0412\u0443\u0433\u043e\u0440\u0448\u0447\u044b\u043d\u0430"},"sgs":{"language":"sgs","value":"Vengr\u0117j\u0117"},"rup":{"language":"rup","value":"Ungaria"},"nan":{"language":"nan","value":"Magyar-kok"},"vro":{"language":"vro","value":"Ungari"},"roa-tara":{"language":"roa-tara","value":"Ungherie"},"yue":{"language":"yue","value":"\u5308\u7259\u5229"},"lzh":{"language":"lzh","value":"\u5308\u7259\u5229"},"nds-nl":{"language":"nds-nl","value</abstract>
<links> </links>
</doc>
Note the growth trend shown at https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaWIKIDATA.htm
- August 2014: 16.1 million articles.
- August 2015: 19 million
- August 2016: 24.6 million
- August 2017: 35.5 million
Relevant code: https://github.com/wikimedia/mediawiki-extensions-ActiveAbstract/blob/master/AbstractFilter.php#L131
I'm not sure how Wikidata abstracts could be meaningful… I can make a (rather bold) suggestion to just emit an empty string in case we're dealing with non-TextContent in AbstractFilter?!
Change 416409 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext
I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract. But your approach is better, in case similar content creeps into other projects. @hoo, what do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?
If this is just NS0 (or content namespaces… which are all Wikibase entity namespaces), this definitely makes sense to me.
Hm… not sure about returning "NONTEXTCONTENT", maybe either omit the <abstract> or do something like <abstract not-applicable="" />?
I've updated it according to your second suggestion (untested though). I prefer to have empty abstract tags in there rather than skip them completely. The file ought to compress down to something pretty tiny at least!
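(For anyone following along, here is a minimal sketch of the kind of guard being discussed. It is not the actual patch; the function name and parameters are made up for illustration, and the real logic lives in AbstractFilter.php.)

```php
<?php
// Hypothetical illustration of the guard discussed above: emit a normal
// <abstract> element only for text/wikitext content, and an empty
// <abstract not-applicable="" /> element for anything else (e.g. Wikibase items).
// This is a sketch, not the code from the change under review.
function writeAbstractElement( string $contentModel, string $text ): string {
	$textModels = [ 'text', 'wikitext' ];
	if ( !in_array( $contentModel, $textModels, true ) ) {
		// Non-text content (wikibase-item, flow-board, json, ...): no abstract.
		return "<abstract not-applicable=\"\" />\n";
	}
	// For text content a real implementation would strip markup and truncate;
	// here we just escape the raw text for brevity.
	return '<abstract>' . htmlspecialchars( $text, ENT_QUOTES ) . "</abstract>\n";
}

// Example: a Wikidata item gets the empty element, a Wikipedia article does not.
echo writeAbstractElement( 'wikibase-item', '{"type":"item","id":"Q28"}' );
echo writeAbstractElement( 'wikitext', 'Hungary is a country in Central Europe.' );
```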
Well, on wikidatawiki in beta, the new code generates a whole lot of <abstract not-applicable="" /> as we expect; on other wikis it produces the usual output. So that looks good.
Now trying to find out about standard xml libraries.
Actually, is this any different than having 'deleted="deleted"' as the attribute when a revision, contributor or comment is no longer available? AFAIK that's not a standard attribute or anything, it's just in our schema. Which reminds me, the change above needs to go into an updated schema too if we agree on it.
Does anyone know where the schema for these xml files lives? I've grepped around in mw core and in the abstract extension repos and found nothing.
Adding @brion to the patchset since he's the last person to do anything substantial (!), all the way back in 2007. Also maybe he knows where the schema is. Also, I have no idea who to ask to get this merged.
Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)
Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)
Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.
We don't have a schema in our repos anywhere that must be updated though, right?
> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)
Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185
> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.
This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P
Right. I'm not sure anything needs changing in the schema though (making the 'abstract' element optional, I guess? Existing code makes it optional if the revision isn't filled in, but that seems unlikely to occur, so consumers may not expect that).
> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)
> Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185
There's the description field, where we could pick a language (English uber alles) and emit "Costa Rican singer". But as a user of the data I'd want the more structured data anyway, probably. :)
I think it's fine to just stub them out blank for now.
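(If we ever did want something human-readable along the lines of the description-field idea above, a rough sketch could pull the description out of the entity JSON, falling back to the label. The function name here is hypothetical and language fallback/error handling are glossed over; this is not part of the change under review.)

```php
<?php
// Rough sketch: given the JSON text of a Wikibase entity, pull out a short
// description in a preferred language, falling back to the label.
// Illustrative only, not part of the ActiveAbstract change discussed here.
function entityDescription( string $entityJson, string $lang = 'en' ): string {
	$entity = json_decode( $entityJson, true );
	if ( !is_array( $entity ) ) {
		return '';
	}
	if ( isset( $entity['descriptions'][$lang]['value'] ) ) {
		return $entity['descriptions'][$lang]['value'];
	}
	// Fall back to the label if there is no description in that language.
	return $entity['labels'][$lang]['value'] ?? '';
}

// Example with a trimmed-down entity: prints "Hungary".
$json = '{"type":"item","id":"Q28","labels":{"en":{"language":"en","value":"Hungary"}}}';
echo entityDescription( $json );
```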
> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.
> This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P
*nod* That may be what it takes. ;)
On ms1001 in the public dumps dir I did this:
list=*wik*
for dirname in $list; do
    echo "doing $dirname"
    zcat "${dirname}/20180320/${dirname}-20180320-stub-articles.xml.gz" \
        | grep -A16 '<ns>0</ns>' | grep '<model>' \
        | grep -v wikitext | grep -v wikibase-item | grep -v wikibase-property
done
These wikis had 1 or several MassMessageListContent entries and nothing else: arwiki, arwikinews, cawikiquote, commonswiki, mkwiki, mrwikisource, orwiki, swwiktionary, ukwiki
The remaining oddities were:
- fiwikimedia with one flow-board entry
- incubatorwiki with one css entry
- mediawikiwiki with one javascript and several flow-board entries (https://www.mediawiki.org/wiki/Topic:Sjtrser51udrcfwr?uselang=en and I have no idea how that is in the main namespace)
- metawiki with 1 text, 2 json and several MassMessageListContent entries
- sewikimedia with 1 flow-board entry
- tawikisource with one proofread-page entry (https://ta.wikisource.org/wiki/%E0%AE%95._%E0%AE%85%E0%AE%AF%E0%AF%8B%E0%AE%A4%E0%AF%8D%E0%AE%A4%E0%AE%BF%E0%AE%A4%E0%AE%BE%E0%AE%B8%E0%AE%AA%E0%AF%8D_%E0%AE%AA%E0%AE%A3%E0%AF%8D%E0%AE%9F%E0%AE%BF%E0%AE%A4%E0%AE%B0%E0%AF%8D_%E0%AE%9A%E0%AE%BF%E0%AE%A8%E0%AF%8D%E0%AE%A4%E0%AE%A9%E0%AF%88%E0%AE%95%E0%AE%B3%E0%AF%8D-1.pdf/158?uselang=en yes it is a redirect page and I have no idea how that came out to be this weird model)
- test2wiki and testwiki with a few pages of various crap each
Looking at this list I think we are good to go.
Definitely. I still think this should be announced, but given the very limited scope we might even get away without a waiting period before applying the change?
Well, I don't mind a waiting period; let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyway. But we need an ETA before I send the email :-)
This is probably somewhere in between an Insignificant change and a Significant change, per the Wikidata:Stable Interface Policy.
Due to this, I think one week notice is enough. Given the data didn't make any sense for Wikidata before, I don't think we need to do special announcements for the Wikidata community.
Change 416409 merged by jenkins-bot:
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext