
Investigate why wikidata abstracts dumps are so large, see if we can reduce the size somehow.
Closed, Resolved (Public)

Description

They are almost 60 GB, which is an order of magnitude larger than anything else.

Event Timeline

ArielGlenn triaged this task as Medium priority. Oct 12 2017, 8:30 AM
ArielGlenn created this task.

5,487,381 titles for the enwiki abstracts, 5 GB; 37,327,702 titles for the wikidatawiki abstracts, 59 GB. The content of the wikidata abstracts is mostly garbage, though. A sample:

<doc>
<title>Wikidata: Q28</title>
<url>https://www.wikidata.org/wiki/Q28</url>
<abstract>{&quot;type&quot;:&quot;item&quot;,&quot;id&quot;:&quot;Q28&quot;,&quot;labels&quot;:{
&quot;nb&quot;:{&quot;language&quot;:&quot;nb&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;en&quot;:{&quot;language&quot;:&quot;en&quot;,&quot;value&quot;:&quot;Hungary&quot;},
&quot;nn&quot;:{&quot;language&quot;:&quot;nn&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;se&quot;:{&quot;language&quot;:&quot;se&quot;,&quot;value&quot;:&quot;Ung\u00e1ra&quot;},
&quot;de&quot;:{&quot;language&quot;:&quot;de&quot;,&quot;value&quot;:&quot;Ungarn&quot;},
&quot;fr&quot;:{&quot;language&quot;:&quot;fr&quot;,&quot;value&quot;:&quot;Hongrie&quot;},
&quot;it&quot;:{&quot;language&quot;:&quot;it&quot;,&quot;value&quot;:&quot;Ungheria&quot;},
&quot;pl&quot;:{&quot;language&quot;:&quot;pl&quot;,&quot;value&quot;:&quot;W\u0119gry&quot;},
&quot;eo&quot;:{&quot;language&quot;:&quot;eo&quot;,&quot;value&quot;:&quot;Hungario&quot;},
&quot;ru&quot;:{&quot;language&quot;:&quot;ru&quot;,&quot;value&quot;:&quot;\u0412\u0435\u043d\u0433\u0440\u0438\u044f&quot;},
&quot;es&quot;:{&quot;language&quot;:&quot;es&quot;,&quot;value&quot;:&quot;Hungr\u00eda&quot;},
&quot;be-tarask&quot;:{&quot;language&quot;:&quot;be-tarask&quot;,&quot;value&quot;:&quot;\u0412\u0443\u0433\u043e\u0440\u0448\u0447\u044b\u043d\u0430&quot;},
&quot;sgs&quot;:{&quot;language&quot;:&quot;sgs&quot;,&quot;value&quot;:&quot;Vengr\u0117j\u0117&quot;},
&quot;rup&quot;:{&quot;language&quot;:&quot;rup&quot;,&quot;value&quot;:&quot;Ungaria&quot;},
&quot;nan&quot;:{&quot;language&quot;:&quot;nan&quot;,&quot;value&quot;:&quot;Magyar-kok&quot;},
&quot;vro&quot;:{&quot;language&quot;:&quot;vro&quot;,&quot;value&quot;:&quot;Ungari&quot;},
&quot;roa-tara&quot;:{&quot;language&quot;:&quot;roa-tara&quot;,&quot;value&quot;:&quot;Ungherie&quot;},
&quot;yue&quot;:{&quot;language&quot;:&quot;yue&quot;,&quot;value&quot;:&quot;\u5308\u7259\u5229&quot;},
&quot;lzh&quot;:{&quot;language&quot;:&quot;lzh&quot;,&quot;value&quot;:&quot;\u5308\u7259\u5229&quot;},
&quot;nds-nl&quot;:{&quot;language&quot;:&quot;nds-nl&quot;,&quot;value</abstract>
<links>
</links>
</doc>

Note the growth trend shown at https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaWIKIDATA.htm

  • August 2014: 16.1 million articles
  • August 2015: 19 million
  • August 2016: 24.6 million
  • August 2017: 35.5 million

Relevant code: https://github.com/wikimedia/mediawiki-extensions-ActiveAbstract/blob/master/AbstractFilter.php#L131

I'm not sure how Wikidata abstracts could be meaningful… I can make a (rather bold) suggestion to just drop in an empty string whenever we're dealing with non-TextContent in AbstractFilter?!
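For illustration, that suggestion boils down to a content-class check along the lines of the sketch below. This is a hypothetical sketch only, not the actual AbstractFilter code, and the helper name is made up; TextContent is MediaWiki's base class for text-like content (WikitextContent extends it), so a single instanceof test covers both.

<?php
// Hypothetical sketch of the suggestion above; not the real AbstractFilter entry point.
function abstractSourceText( $content ) {
	if ( !( $content instanceof TextContent ) ) {
		// Wikibase entities serialize to JSON, which is useless as a prose
		// abstract, so emit an empty string instead of the serialized blob.
		return '';
	}
	// getText() in current MediaWiki; older releases used getNativeData().
	return $content->getText();
}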

Change 416409 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext

https://gerrit.wikimedia.org/r/416409

I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract. But your approach is better, in case similar content creeps into other projects. @hoo, what do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?

> I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract.

If this is just NS0 (or content namespaces… which are all Wikibase entity namespaces), this definitely makes sense to me.

> But your approach is better, in case similar content creeps into other projects. @hoo, what do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?

Hm… not sure about returning "NONTEXTCONTENT", maybe either omit the <abstract> or do something like <abstract not-applicable="" />?

I've updated it according to your second suggestion (untested though). I prefer to have empty abstract tags in there rather than skipping them completely. The file ought to compress down to something pretty tiny at least!
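To make the intended behaviour concrete, here is a rough sketch (invented for illustration; the real change is in the Gerrit patch above) of what "emit a stub element for non-text content models" looks like. The content model names are the ones MediaWiki core defines (CONTENT_MODEL_WIKITEXT is 'wikitext', CONTENT_MODEL_TEXT is 'text').

<?php
// Illustrative only, not the merged patch.
function writeAbstractElement( string $contentModel, string $extract ): string {
	$textModels = [ 'wikitext', 'text' ];
	if ( !in_array( $contentModel, $textModels, true ) ) {
		// e.g. wikibase-item, wikibase-property, json, css, javascript ...
		return "<abstract not-applicable=\"\" />\n";
	}
	return '<abstract>' . htmlspecialchars( $extract ) . "</abstract>\n";
}

echo writeAbstractElement( 'wikibase-item', '' );  // <abstract not-applicable="" />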

Well, on wikidatawiki in beta, the new code generates a whole lot of <abstract not-applicable="" /> as we expect; on other wikis it produces the usual output. So that looks good.
Now trying to find out about standard xml libraries.

Actually, is this any different from having 'deleted="deleted"' as the attribute when a revision, contributor or comment is no longer available? AFAIK that's not a standard attribute or anything, it's just in our schema. Which reminds me, the change above needs to go into an updated schema too if we agree on it.

Does anyone know where the schema for these xml files lives? I've grepped around in mw core and in the abstract extension repos and found nothing.

Adding @brion to the patchset since he's the last person to have done anything substantial to this code (!), all the way back in 2007. Also maybe he knows where the schema is. Also, I have no idea who to ask to get this merged.

Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

> Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

We don't have a schema in our repos anywhere that must be updated though, right?

> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185

> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P

>> Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)

> We don't have a schema in our repos anywhere that must be updated though, right?

Right. I'm not sure anything needs changing in the schema though (making the 'abstract' element optional, I guess? Existing code makes it optional if the revision isn't filled in, but that seems unlikely to occur, so consumers may not expect that)

>> Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here. Isn't there something already for the built-in search dropdown and such? But just stubbing them out is probably fine as a preliminary measure. :)

> Trust me, from wikidata entities there is nothing useful that can be gotten out as a text abstract. I stuffed a sample semi-pretty-print-formatted revision text here: F15971185

There's the description field, where we could pick a language (English uber alles) and emit "Costa Rican singer". But as a user of the data I'd want the more structured data anyway, probably. :)

I think it's fine to just stub them out blank for now.
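If anyone ever does want the description-field approach, a hypothetical helper (field names follow the Wikibase entity JSON format; the function itself is made up for this example) would look roughly like:

<?php
// Hypothetical, not part of AbstractFilter: pull a short description out of
// the serialized entity JSON, falling back to the empty string.
function entityDescription( string $entityJson, string $lang = 'en' ): string {
	$entity = json_decode( $entityJson, true );
	return $entity['descriptions'][$lang]['value'] ?? '';
}

That would yield short strings like the "Costa Rican singer" example above, in whichever language is picked.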

>> Should we consider retooling this dump to a more manageable... documented... schema? Would have to find out who depends on the current one though.

> This might be nice future work. I have no idea who relies on this dump though. We could try looking up ips of downloaders but I'm not sure what that would get us, and previous calls of "who uses this?" have fallen on deaf ears. If I were a bit more vicious I would turn them off for a run and see who complained :-P

*nod* That may be what it takes. ;)

On ms1001 in the public dumps dir I did this:

# for each wiki, list the NS0 content models other than wikitext and the Wikibase entity models
list=*wik*
for dirname in $list; do echo "doing $dirname"; zcat "${dirname}/20180320/${dirname}-20180320-stub-articles.xml.gz" | grep -A16 '<ns>0</ns>' | grep '<model>' | grep -v wikitext | grep -v wikibase-item | grep -v wikibase-property; done

These wikis had one or several MassMessageListContent entries and nothing else: arwiki, arwikinews, cawikiquote, commonswiki, mkwiki, mrwikisource, orwiki, swwiktionary, ukwiki
The remaining oddities were:

Looking at this list I think we are good to go.

> Looking at this list I think we are good to go.

Definitely. I still think this should be announced, but given the very limited scope we might even get away without a waiting period before applying the change?

Well, I don't mind a waiting period; let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyway. But we need an ETA before I send the email :-)

> Well, I don't mind a waiting period; let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyway. But we need an ETA before I send the email :-)

This is probably somewhere in between an Insignificant change and a Significant change, per the Wikidata:Stable Interface Policy.

Due to this, I think one week's notice is enough. Given that the data didn't make any sense for Wikidata before, I don't think we need a special announcement for the Wikidata community.

Email sent to xmldatadumps-l and wikitech-l.

Change 416409 merged by jenkins-bot:
[mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext

https://gerrit.wikimedia.org/r/416409

@ArielGlenn Do we want to close this yet, or wait for the first new dumps?

I'd like to wait for the first run. I'll retitle the task then too :-)

I've checked some output files from this month's run and they look good! Closing.

ArielGlenn renamed this task from "Investigate why wikidata abstracts dumps are so large" to "Investigate why wikidata abstracts dumps are so large, see if we can reduce the size somehow". May 8 2018, 7:31 AM
ArielGlenn moved this task from Active to Done on the Dumps-Generation board.