
Automatically generate descriptions for items based on their P31 (instance of) values
Open, Needs Triage, Public

Description

As an editor I don't want to maintain redundant descriptions in order to reduce the amount of data to keep track of.
As the dev team we don't want to store a large amount of redundant descriptions in several hundred languages.

Problem:
We have certain classes of Items where the description is more or less the same as the instance of statement and just causes additional maintenance and scaling issues with the query service.

Example:

  • scientific articles
  • chemical compounds

Proposed solution:

  • For any language in which no description exists, we fall back to the value of P31. We continue to use manually set descriptions where available.
  • If a description in a fallback language exists we use that one over the P31 statement.
  • If multiple P31 statements exist then we want to list them all (comma separated?)
  • We only consider best-ranked statements for this.
  • Where does it show up?
    • We do not want these automated descriptions to be materialized in Blazegraph. We can accomplish this by not including them in the dump flavor of the RDF produced by the Linked Data endpoint: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?flavor=dump
    • We do want them in the action API, the Linked Data endpoint (except RDF with flavor=dump), and the database dumps.
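
The resolution order proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the plain-dict data shapes, and the placeholder item IDs are hypothetical and are not part of Wikibase. It also assumes that the generated text uses the P31 labels in the requested language only, which the proposal leaves open.

```python
from typing import Optional

# Hypothetical in-memory shapes (not the Wikibase data model):
#   descriptions: {language code: manually entered description}
#   p31_best:     item IDs of the best-ranked P31 values, in statement order
#   labels:       {item ID: {language code: label}}

def resolve_description(
    descriptions: dict[str, str],
    p31_best: list[str],
    labels: dict[str, dict[str, str]],
    lang: str,
    fallback_chain: list[str],
) -> Optional[dict]:
    # Manual descriptions win: the requested language first,
    # then each fallback language in order.
    for code in [lang, *fallback_chain]:
        text = descriptions.get(code)
        if text:
            return {"language": code, "value": text, "generated": False}

    # Otherwise build a description from the labels of all
    # best-ranked P31 values, comma separated, skipping duplicates.
    names: list[str] = []
    for class_id in p31_best:
        label = labels.get(class_id, {}).get(lang)
        if label and label not in names:
            names.append(label)
    if not names:
        return None  # nothing to show in this language
    return {"language": lang, "value": ", ".join(names), "generated": True}


# Example call with placeholder IDs:
example_labels = {"Q_COMPOUND": {"en": "chemical compound"}}
result = resolve_description({}, ["Q_COMPOUND"], example_labels, "en", [])
```

Whether such a result is serialized would then depend on the output: a dump-flavor RDF writer would skip entries marked as generated, while the action API and the other Linked Data formats would include them.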

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

Things to consider still:

  • If we put it into the Linked Data endpoint and the action API, then people might be inclined to use that for editing and then put the automatically generated description back as a materialized one. We don't want that and might need to introduce a flag to indicate that a description was generated automatically. (e.g. "en": { "language": "en", "value": "chemical compound", "generated": true })
  • We don’t want to have the generated descriptions as triples in Blazegraph, but users might still want to have working ?itemDescription variables via the label service when there’s only a generated description. Should the label service reimplement the description generation logic?
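
To illustrate the first point, a client that reads descriptions and later writes them back could drop anything marked as generated before saving. The sketch below assumes the hypothetical `generated` flag from the example above; no such flag exists in the API today, and the function name is made up for illustration.

```python
def descriptions_to_save(descriptions: dict[str, dict]) -> dict[str, dict]:
    """Keep only manually entered descriptions.

    Entries carrying the (hypothetical) "generated": true flag are
    dropped, so they are never written back as materialized descriptions.
    """
    return {
        code: {"language": entry["language"], "value": entry["value"]}
        for code, entry in descriptions.items()
        if not entry.get("generated", False)
    }


# Example: only the manual German description survives.
fetched = {
    "en": {"language": "en", "value": "chemical compound", "generated": True},
    "de": {"language": "de", "value": "chemische Verbindung"},
}
to_save = descriptions_to_save(fetched)
```

Clients that ignore the flag would still materialize generated descriptions, so the server side might need an equivalent check on save.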

Original report:

T91981 was closed without comment in late 2020 (was it out of staleness?), even though the only objections raised against it were made over the course of a few days in August 2015. At that time Blazegraph was still maintained and there were between 20 and 21 million items in Wikidata (and possibly a sense of optimism in the air regarding how descriptions on individual items would turn out). Now there are more than 97 million items, due primarily to the imports of scientific articles (with astronomical objects coming later, to boot), and we routinely speak of a potential Blazegraph failure and the need to seek alternatives to that software. One way that we might forestall a Blazegraph failure without disturbing people is to reduce the number of excess triples that actually need to be separately stored, and one place from which such triples might be taken out is the set of descriptions.

Like it or not, there are certain classes of items that simply will not get descriptions more imaginative or customized or detailed than the ones which over time have been added to them in different languages. Yet there are users whose entire existence on Wikidata, judging from their edit history, seems to be the addition and maintenance of these repetitive/unimaginative/etc. descriptions, needing to run so many batches of edits just to correct a single letter across millions of items. An automatic description generation mechanism based on language and item class (following a P31/P279+ path, possibly involving a few other selected properties), whose outputs may be adjusted in exactly one place rather than in millions of items separately, would at least free these users of their labors, and would allow us to remove the excess of triples for their corresponding non-automatic but equally repetitive/unimaginative/etc. counterparts.
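
The "adjusted in exactly one place" idea could look roughly like a single table of description texts keyed by class and language, consulted by walking up the subclass (P279) hierarchy from an item's P31 value. This is a sketch only: the table, the subclass map, the function name, and the placeholder IDs are all hypothetical.

```python
from collections import deque
from typing import Optional

def description_for_class(
    class_id: str,
    lang: str,
    templates: dict[str, dict[str, str]],
    subclass_of: dict[str, list[str]],
) -> Optional[str]:
    """Breadth-first walk from a P31 value up its P279 parents,
    returning the first description text found for the language."""
    seen = {class_id}
    queue = deque([class_id])
    while queue:
        current = queue.popleft()
        text = templates.get(current, {}).get(lang)
        if text:
            return text
        for parent in subclass_of.get(current, []):
            if parent not in seen:  # guard against cycles in the hierarchy
                seen.add(parent)
                queue.append(parent)
    return None


# Placeholder IDs: one central entry serves every item in the class
# and in its subclasses.
templates = {"Q_ARTICLE": {"en": "scientific article"}}
subclass_of = {"Q_REVIEW_ARTICLE": ["Q_ARTICLE"]}
text = description_for_class("Q_REVIEW_ARTICLE", "en", templates, subclass_of)
```

Correcting a single letter would then mean editing one entry in `templates` rather than running batches of edits across millions of items.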

Some classes of items that would dearly benefit from such a thing immediately include items for

  1. scientific articles (33,000,000+),
  2. Wikimedia categories (5,000,000+),
  3. Wikimedia templates (~1,000,000),
  4. stars (~3,000,000),
  5. galaxies (~2,000,000),
  6. Unicode characters (~150,000),
  7. researchers (200,000+)

This is already near half the total number of items on Wikidata at the moment, and there are likely more item classes that are missing, and there are likely more items in the noted classes that will add to the above numbers.

Note to developers and other maintainers: It is vigorously beseeched that this task not be closed as a duplicate of the previous task, since circumstances have significantly changed over the last six and a half years.

Tasks that may be obsoleted by this task: T159106: Show P31 in the Wikidata search results, T141553: [feature request] DAB1: add standard description to disambiguation items at Wikidata

Event Timeline

I'm very strongly in favour of having some form of dynamically generated descriptions. The current situation is completely absurd.

Here are most of the items Mahir listed, plus some more that I could think of, along with the number of descriptions which are identical to the corresponding label on the item.

i.e. there are 1.5 billion descriptions which simply duplicate the labels of these 12 items.

That doesn't take into account any slight differences in spelling, capitalisation or language code, e.g. 20 variations of the "Wikimedia category" labels cover another 100 million descriptions.

(I have more thoughts on this, but I'll continue another time)

Bugreporter renamed this task from "Provide auto-generated descriptions for certain classes of items" to "Automatic generate descriptions for items based on their P31 (instance of) values". Mar 28 2022, 8:25 PM
Bugreporter renamed this task from "Automatic generate descriptions for items based on their P31 (instance of) values" to "Automatically generate descriptions for items based on their P31 (instance of) values".
Bugreporter updated the task description.

For people the P106 value may be more useful than P31.

One thing to consider: this may degrade ElasticSearch results.