Infoboxes are mistaken for abstracts in page abstract dumps.
Open, Needs Triage, Public

Description

Infobox parts are mistaken for abstracts in page abstract dumps.
For example, line 20 of http://dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-abstract24.xml is:
"<abstract>| birth_place = Veracruz, Mexico</abstract>".

This record is about https://en.wikipedia.org/wiki/Roberto_Bueno_Campos and should be something like:
"<abstract>Roberto Eugenio Bueno Campos (born 21 January 1946) is a Mexican politician from the National Action Party.</abstract>".

Event Timeline

Ebonetti90 raised the priority of this task from to Needs Triage.
Ebonetti90 updated the task description.
Ebonetti90 subscribed.

I came to report this issue and found that it has existed since 2015.
My estimate is that half of the abstracts dataset contains no real information at all, just a few bytes from the infoboxes. A simple parsing issue, I assume.
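
For anyone who wants to check that estimate against a dump file, here is a rough sketch. It assumes the abstract dump is a flat list of `<doc>` elements, each with an `<abstract>` child, and it uses a guessed heuristic: an abstract that begins with `|`, `{{` or `}}` is treated as leftover infobox markup. The names `looks_like_infobox_fragment` and `count_suspect_abstracts` are made up for illustration and are not part of any extension.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed heuristic: real abstracts do not start with template syntax.
INFOBOX_FRAGMENT = re.compile(r"^\s*(\||\{\{|\}\})")


def looks_like_infobox_fragment(abstract):
    """Return True if the abstract text starts like infobox/template markup."""
    return bool(INFOBOX_FRAGMENT.match(abstract or ""))


def count_suspect_abstracts(source):
    """Stream an abstract dump and return (suspect, total) abstract counts.

    `source` is a filename or a binary file object. The <doc>/<abstract>
    layout is an assumption about the dump format.
    """
    suspect = total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "abstract":
            total += 1
            if looks_like_infobox_fragment(elem.text):
                suspect += 1
        elif elem.tag == "doc":
            # Drop the finished record's children to keep memory use low.
            elem.clear()
    return suspect, total


# Two records built from the examples in the task description.
sample = io.BytesIO(
    b"<feed>"
    b"<doc><abstract>| birth_place = Veracruz, Mexico</abstract></doc>"
    b"<doc><abstract>Roberto Eugenio Bueno Campos (born 21 January 1946) "
    b"is a Mexican politician from the National Action Party.</abstract></doc>"
    b"</feed>"
)
suspect, total = count_suspect_abstracts(sample)
print(f"{suspect} of {total} abstracts look like infobox fragments")
```

Passing the path of a downloaded abstract file instead of `sample` should give a rough ratio for that file. The heuristic only catches abstracts that start with template syntax, so the true share of bad abstracts may be higher.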

Hi @ArielGlenn, do you know who maintains this dataset?

"Maintains" is a strong word :-) There's not a real owner for the ActiveAbstract extension; insofar as it's used for dumps, I'd be the person to nag. But I cannot get to this task for the foreseeable future. I would be happy to look at a patch if anyone wants to poke at it though.

Not a problem. Could you possibly point me to the source code?

I agree with @DED in their May 24, 2020 comment. I stumbled upon this while exploring the dumps. Their estimate of half the data being incorrect seems accurate. As such, I would suggest that this dump is worthless and should be eliminated. I can't imagine anybody making use of this in its current form. Producing it is a waste of machine resources, but worse than that, it's a time sink for people like me who are trying to understand the dumps.

Isn't the problem of generating abstracts already solved by the Page-Previews extension?

This feels like a place where it would be valuable to provide an optional dependency on the short description (from Wikidata) or from the TextExtracts extension, which apparently does the same thing just better. And then remove ActiveAbstract.
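
To make the TextExtracts suggestion concrete, here is a small sketch of what asking for an intro extract could look like through the MediaWiki action API (`prop=extracts` with `exintro` and `explaintext`). The helper names `build_extract_url` and `extract_from_response` are hypothetical, the response shape assumes `formatversion=2`, and the canned response below is hand-written for illustration rather than captured from a live request.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, endpoint=API_ENDPOINT):
    """Build an action API URL asking TextExtracts for a plain-text intro."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the text before the first section heading
        "explaintext": 1,    # plain text instead of limited HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,  # pages come back as a list
    }
    return endpoint + "?" + urlencode(params)


def extract_from_response(body):
    """Pull the first page's extract out of a formatversion=2 JSON response."""
    pages = json.loads(body).get("query", {}).get("pages", [])
    if not pages:
        return None
    return pages[0].get("extract")


url = build_extract_url("Roberto_Bueno_Campos")
print(url)

# Hand-written response in the assumed shape, for illustration only.
canned = json.dumps({
    "query": {
        "pages": [
            {
                "title": "Roberto Bueno Campos",
                "extract": "Roberto Eugenio Bueno Campos (born 21 January 1946) "
                           "is a Mexican politician from the National Action Party.",
            }
        ]
    }
})
print(extract_from_response(canned))
```

A dump job would more likely call the extension from PHP than go through HTTP, so this only shows the kind of output that could replace the current abstracts.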

I'm not averse to that if we can determine there are no real users of the data, and there's an acceptable substitute. These dumps aren't needed for making a replica, or for doing analysis of the content.