
Investigate why wikidata abstracts dumps are so large, see if we can reduce the size somehow.
Closed, Resolved (Public)

Description

They are almost 60 GB, which is an order of magnitude larger than anything else.

Event Timeline

ArielGlenn triaged this task as Medium priority. Oct 12 2017, 8:30 AM
ArielGlenn created this task.

5,487,381 titles for the enwiki abstracts, 5 GB; 37,327,702 titles for the wikidatawiki abstracts, 59 GB. The content of the wikidata abstracts is mostly garbage, though. A sample:

<doc>
<title>Wikidata: Q28</title>
<url>https://www.wikidata.org/wiki/Q28</url>
<abstract>{&quot;type&quot;:&quot;item&quot;,&quot;id&quot;:&quot;Q28&quot;,&quot;labels&quot;:{
&quot;nb&quot;:{&quot;language&quot;:&quot;nb&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;en&quot;:{&quot;language&quot;:&quot;en&quot;,&quot;value&quot;:&quot;Hungary&quot;},
&quot;nn&quot;:{&quot;language&quot;:&quot;nn&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;se&quot;:{&quot;language&quot;:&quot;se&quot;,&quot;value&quot;:&quot;Ung\u00e1ra&quot;},
&quot;de&quot;:{&quot;language&quot;:&quot;de&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;fr&quot;:{&quot;language&quot;:&quot;fr&quot;,&quot;value&quot;:&quot;Hongrie&quot;},
&quot;it&quot;:{&quot;language&quot;:&quot;it&quot;,&quot;value&quot;:&quot;Ungheria&quot;},
&quot;pl&quot;:{&quot;language&quot;:&quot;pl&quot;,&quot;value&quot;:&quot;W\u0119gry&quot;},
&quot;eo&quot;:{&quot;language&quot;:&quot;eo&quot;,&quot;value&quot;:&quot;Hungario&quot;},
&quot;ru&quot;:{&quot;language&quot;:&quot;ru&quot;,&quot;value&quot;:&quot;\u0412\u0435\u043d\u0433\u0440\u0438\u044f&quot;},
&quot;es&quot;:{&quot;language&quot;:&quot;es&quot;,&quot;value&quot;:&quot;Hungr\u00eda&quot;},
&quot;be-tarask&quot;:{&quot;language&quot;:&quot;be-tarask&quot;,&quot;value&quot;:&quot;\u0412\u0443\u0433\u043e\u0440\u0448\u0447\u044b\u043d\u0430&quot;},
&quot;sgs&quot;:{&quot;language&quot;:&quot;sgs&quot;,&quot;value&quot;:&quot;Vengr\u0117j\u0117&quot;},
&quot;rup&quot;:{&quot;language&quot;:&quot;rup&quot;,&quot;value&quot;:&quot;Ungaria&quot;},
&quot;nan&quot;:{&quot;language&quot;:&quot;nan&quot;,&quot;value&quot;:&quot;Magyar-kok&quot;},
&quot;vro&quot;:{&quot;language&quot;:&quot;vro&quot;,&quot;value&quot;:&quot;Ungari&quot;},
&quot;roa-tara&quot;:{&quot;language&quot;:&quot;roa-tara&quot;,&quot;value&quot;:&quot;Ungherie&quot;},
&quot;yue&quot;:{&quot;language&quot;:&quot;yue&quot;,&quot;value&quot;:&quot;\u5308\u7259\u5229&quot;},
&quot;lzh&quot;:{&quot;language&quot;:&quot;lzh&quot;,&quot;value&quot;:&quot;\u5308\u7259\u5229&quot;},
&quot;nds-nl&quot;:{&quot;language&quot;:&quot;nds-nl&quot;,&quot;value</abstract>
<links>
</links>
</doc>

Note the growth trend shown at https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaWIKIDATA.htm

  • August 2014: 16.1 million articles
  • August 2015: 19 million
  • August 2016: 24.6 million
  • August 2017: 35.5 million

Relevant code: https://github.com/wikimedia/mediawiki-extensions-ActiveAbstract/blob/master/AbstractFilter.php#L131

I'm not sure how Wikidata abstracts could be meaningful… I can make a (rather bold) suggestion to just drop in an empty string whenever we're dealing with non-TextContent in AbstractFilter?!
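For illustration, that suggestion boils down to a content-class check along the lines of the sketch below. This is a hypothetical sketch only, not the actual AbstractFilter code, and the helper name is made up; TextContent is MediaWiki's base class for text-like content (WikitextContent extends it), so a single instanceof test covers both.

<?php
// Hypothetical sketch of the suggestion above; not the real AbstractFilter entry point.
function abstractSourceText( $content ) {
	if ( !( $content instanceof TextContent ) ) {
		// Wikibase entities serialize to JSON, which is useless as a prose
		// abstract, so emit an empty string instead of the serialized blob.
		return '';
	}
	// getText() in current MediaWiki; older releases used getNativeData().
	return $content->getText();
}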

Change 416409 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext

https://gerrit.wikimedia.org/r/416409

I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract. But your approach is better, in case similar content creeps into other projects. @hoo, what do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?

> I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract.

If this is just NS0 (or content namespaces… which are all Wikibase entity namespaces), this definitely makes sense to me.

> But your approach is better, in case similar content creeps into other projects. @hoo, what do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?

Hm… not sure about returning "NONTEXTCONTENT", maybe either omit the <abstract> or do something like <abstract not-applicable="" />?

I've updated it according to your second suggestion (untested though). I prefer to have empty abstract tags in there rather than skipping them completely. The file ought to compress down to something pretty tiny at least!
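To make the intended behaviour concrete, here is a rough sketch (invented for illustration; the real change is in the Gerrit patch above) of what "emit a stub element for non-text content models" looks like. The content model names are the ones MediaWiki core defines (CONTENT_MODEL_WIKITEXT is 'wikitext', CONTENT_MODEL_TEXT is 'text').

<?php
// Illustrative only, not the merged patch.
function writeAbstractElement( string $contentModel, string $extract ): string {
	$textModels = [ 'wikitext', 'text' ];
	if ( !in_array( $contentModel, $textModels, true ) ) {
		// e.g. wikibase-item, wikibase-property, json, css, javascript ...
		return "<abstract not-applicable=\"\" />\n";
	}
	return '<abstract>' . htmlspecialchars( $extract ) . "</abstract>\n";
}

echo writeAbstractElement( 'wikibase-item', '' );  // <abstract not-applicable="" />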

Well, on wikidatawiki in beta, the new code generates a whole lot of <abstract not-applicable="" /> as we expect; on other wikis it produces the usual output. So that looks good.
Now trying to find out about standard xml libraries.

Actually, is this any different from having 'deleted="deleted"' as the attribute when a revision, contributor or comment is no longer available? AFAIK that's not a standard attribute or anything, it's just in our schema. Which reminds me, the change above needs to go into an updated schema too if we agree on it.

Does anyone know where the schema for these xml files lives? I've grepped around in mw core and in the abstract extension repos and found nothing.

Adding @brion to the patchset since he's the last person to have done anything substantial to this code (!), all the way back in 2007. Also maybe he knows where the schema is. Also, I have no idea who to ask to get this merged.

Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

> Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

We don't have a schema in our repos anywhere that must be updated though, right?

> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185

> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P

>> Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

> We don't have a schema in our repos anywhere that must be updated though, right?

Right. I'm not sure anything needs changing in the schema though (making the 'abstract' element optional, I guess? Existing code makes it optional if the revision isn't filled in, but that seems unlikely to occur, so consumers may not expect that)

>> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

> Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185

There's the description field, where we could pick a language (English uber alles) and emit "Costa Rican singer". But as a user of the data I'd want the more structured data anyway, probably. :)

I think it's fine to just stub them out blank for now.
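If anyone ever does want the description-field approach, a hypothetical helper (field names follow the Wikibase entity JSON format; the function itself is made up for this example) would look roughly like:

<?php
// Hypothetical, not part of AbstractFilter: pull a short description out of
// the serialized entity JSON, falling back to the empty string.
function entityDescription( string $entityJson, string $lang = 'en' ): string {
	$entity = json_decode( $entityJson, true );
	return $entity['descriptions'][$lang]['value'] ?? '';
}

That would yield short strings like the "Costa Rican singer" example above, in whichever language is picked.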

>> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

> This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P

*nod* That may be what it takes. ;)

On ms1001 in the public dumps dir I did this:

# for each wiki, list the NS0 content models other than wikitext and the Wikibase entity models
list=*wik*
for dirname in $list; do echo "doing $dirname"; zcat "${dirname}/20180320/${dirname}-20180320-stub-articles.xml.gz" | grep -A16 '<ns>0</ns>' | grep '<model>' | grep -v wikitext | grep -v wikibase-item | grep -v wikibase-property; done

These wikis had one or several MassMessageListContent entries and nothing else: arwiki, arwikinews, cawikiquote, commonswiki, mkwiki, mrwikisource, orwiki, swwiktionary, ukwiki
The remaining oddities were:

Looking at this list I think we are good to go.

> Looking at this list I think we are good to go.

Definitely. I still think this should be announced, but given the very limited scope we might even get away without a waiting period before applying the change?

Well, I don't mind a waiting period; let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyway. But we need an ETA before I send the email :-)

> Well, I don't mind a waiting period; let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyway. But we need an ETA before I send the email :-)

This is probably somewhere in between an Insignificant change and a Significant change, per the Wikidata:Stable Interface Policy.

Due to this, I think one week's notice is enough. Given that the data didn't make any sense for Wikidata before, I don't think we need a special announcement for the Wikidata community.

Email sent to xmldatadumps-l and wikitech-l.

Change 416409 merged by jenkins-bot:
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext

https://gerrit.wikimedia.org/r/416409

@ArielGlenn Do we want to close this yet, or wait for the first new dumps?

I'd like to wait for the first run. I'll retitle the task then too :-)

I've checked some output files from this month's run and they look good! Closing.

ArielGlenn renamed this task from "Investigate why wikidata abstracts dumps are so large" to "Investigate why wikidata abstracts dumps are so large, see if we can reduce the size somehow". May 8 2018, 7:31 AM
ArielGlenn moved this task from Active to Done on the Dumps-Generation board.