
German Infoboxes parsed into TextExtract
Open, Lowest, Public

Description

Description:
For certain German TextExtracts, we are seeing parts of the “Staat” and “Fußballklub” Infoboxes included. The problem could exist on German pages with other Infobox templates as well. See examples below.

Steps to Reproduce:

  1. Go to: https://de.wikipedia.org/wiki/Afghanistan?action=raw
  2. Compare raw page from step 1 to text extract json output: https://de.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Crevisions&iwurl=1&titles=Afghanistan&utf8=1&formatversion=2&exlimit=1&explaintext=1&rvprop=ids&redirects=1&converttitles=1
  3. We expect:
  • The text extract should begin with “Afghanistan (paschtunisch und persisch (Dari) افغانستان Afghānestān, offiziell Islamische Republik Afghanistan) ist ein Binnenstaat Südasiens an der Schnittstelle von Süd- zu Zentralasien, der an den Iran, Turkmenistan, Usbekistan, Tadschikistan, die Volksrepublik China und Pakistan grenzt. Drei Viertel des Landes bestehen aus schwer zugänglichen Gebirgsregionen.”
  • Instead, we see an extract beginning with “Vorlage:Infobox Staat/Wartung/NAME-DEUTSCH Afghanistan ist ein Binnenstaat Südasiens an der Schnittstelle von Süd- zu Zentralasien, der an den Iran, Turkmenistan, Usbekistan, Tadschikistan, die Volksrepublik China und Pakistan grenzt.”
  4. We see this occurring on several other German pages, including:

Event Timeline

Hi @Emilywelsch, please see T217566#5009100 for how you can fix this. I'm closing this task as a duplicate.

Whoops. This is about TextExtracts and not about search result previews, sorry! Reopening this task.

Jdlrobson subscribed.

To set expectations: TextExtracts is in maintenance mode, so this is unlikely to get fixed by the Reading Web team.

If the reading web team isn't appropriate, is there another team assigned to work on TextExtracts? What does "maintenance mode" mean in terms of timeline? Thanks!

This piece of software is not actively worked on; "maintenance mode" means that contributed patches to fix a problem will usually get reviewed.

For the Infobox template “Staat” in particular, we see that this problem is widespread across all country pages in DE. Is it possible this is actually an error in the configuration of the Infobox template and not in TextExtracts? We see the template text “Vorlage:Infobox Staat/Wartung/NAME-DEUTSCH” appearing in the extract for all DE states.

Additional examples:

And this list goes on.
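
For anyone who wants to check further pages, the comparison from the steps to reproduce could be scripted roughly as follows. This is only a sketch: the query parameters are a subset of those in the URL from step 2, while `LEAK_MARKER`, the 200-character window, the `User-Agent` string, and all function names are assumptions made for illustration and are not part of TextExtracts.

```python
import json
import urllib.parse
import urllib.request

API = "https://de.wikipedia.org/w/api.php"
# Assumption: leaked template text shows up as "Vorlage:Infobox ..." near the start.
LEAK_MARKER = "Vorlage:Infobox"


def extract_url(title: str) -> str:
    """Build a TextExtracts query URL like the one in step 2 (plain text, one page)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "formatversion": "2",
        "exlimit": "1",
        "explaintext": "1",
        "redirects": "1",
    }
    return API + "?" + urllib.parse.urlencode(params)


def has_infobox_leak(extract: str) -> bool:
    """Heuristic: report a leak if the marker appears in the first 200 characters."""
    return LEAK_MARKER in extract[:200]


def fetch_extract(title: str) -> str:
    """Fetch the plain-text extract for one page (performs a network request)."""
    request = urllib.request.Request(
        extract_url(title),
        headers={"User-Agent": "extract-leak-check-example/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # With formatversion=2, "pages" is a list rather than a dict keyed by page ID.
    return data["query"]["pages"][0].get("extract", "")
```

Calling `has_infobox_leak(fetch_extract("Afghanistan"))` for each page title of interest would then flag the affected pages without comparing the raw wikitext by hand.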

@Aklapper @Jdlrobson If TextExtracts is in maintenance mode, is there anything you'd recommend clients to use instead for programmatically reading Wikipedia pages? Is there another API we should be aware of, or do you recommend a different way entirely to read from Wikipedia without using TextExtracts?

I don't recommend anything else, no.

I'm not sure about your use case and whether this would work with it, but https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_summary__title_ is the preferred and recommended way of obtaining summaries.
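
As a rough illustration of that endpoint, a client could build the summary URL and read the `extract` field of the JSON response along these lines. It is a sketch under assumptions: the same REST path is assumed to be served on the `de.wikipedia.org` host, and the function names and `User-Agent` string are made up for the example.

```python
import json
import urllib.parse
import urllib.request


def summary_url(title: str, lang: str = "de") -> str:
    """Build the REST page summary URL for a title on the given language wiki."""
    # Titles use underscores for spaces; everything else is percent-encoded.
    quoted = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{quoted}"


def fetch_summary_extract(title: str, lang: str = "de") -> str:
    """Fetch the plain-text summary for one page (performs a network request)."""
    request = urllib.request.Request(
        summary_url(title, lang),
        headers={"User-Agent": "summary-example/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("extract", "")
```

For example, `fetch_summary_extract("Afghanistan")` would return the summary text for the German Afghanistan page, which can then be compared with the TextExtracts output.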

That API looks like it fits our use case pretty well actually. This Infobox text doesn't appear in the text extract either, which is an added plus :)

Aklapper triaged this task as Lowest priority. Jan 5 2021, 8:16 AM