Change Details

As an editor, I don't want to maintain redundant descriptions, so that there is less data to keep track of.
As the dev team, we don't want to store a large number of redundant descriptions in several hundred languages.

**Problem:**
We have certain classes of Items where the description is more or less the same as the "instance of" (P31) statement and only causes additional maintenance and scaling issues with the query service.

**Example:**
* scientific articles
* chemical compounds

**Proposed solution:**
* For every language in which no description exists, the description falls back to the value of P31. We continue to use the manually set descriptions where available.
* If a description exists in a fallback language, we use that one over the P31 statement.
* If multiple P31 statements exist, we want to list them all (comma-separated?).
* We only consider best-ranked statements for this.
* Where does it show up?
** We do not want these automated descriptions to be materialized in Blazegraph. We can accomplish this by not including them in the dump flavor of the RDF produced by the Linked Data endpoint: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?flavor=dump
** We do want them in the action API, in the Linked Data endpoint (except RDF with flavor dump), and in the database dumps.

**BDD**
GIVEN
AND
WHEN
AND
THEN
AND

**Acceptance criteria:**
*

**Things to consider still:**
* If we put it into the Linked Data endpoint and the action API, then people might be inclined to use that output for editing and write the automatically generated description back as a materialized one. We don't want that, and might need to introduce a flag to indicate that a description was generated automatically (e.g. `"en": { "language": "en", "value": "chemical compound", "generated": true }`).
* We don't want to have the generated descriptions as triples in Blazegraph, but users might still want working `?itemDescription` variables via the label service when there's only a generated description. Should the label service reimplement the description generation logic?

**Original report:**

T91981 was closed without comment in late 2020 (was it out of staleness?), even though the only objections raised against it were made over the course of a few days in August 2015. At that time Blazegraph was still maintained and there were between 20 and 21 million items in Wikidata (and possibly a sense of optimism in the air regarding how descriptions on individual items would turn out). Now there are more than 97 million items, due primarily to the imports of scientific articles in particular (with astronomical objects coming later, to boot), and we routinely speak of [[ https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Blazegraph_failure_playbook | a potential Blazegraph failure ]] and the need to seek alternatives to that software.

One way that we might forestall a Blazegraph failure without disturbing people is to reduce the number of excess triples that actually need to be stored separately, and one place from which such triples might be taken out is the set of descriptions. Like it or not, there are certain classes of items that simply **//will not//** get descriptions more imaginative, customized, or detailed than the ones which have been added to them over time in different languages. Yet there are users whose entire existence on Wikidata, judging from their edit history, seems to be the addition and maintenance of these repetitive/unimaginative/etc. descriptions, and who need to run many batches of edits just to correct a single letter across millions of items.

An automatic description generation mechanism based on language and item class (following a P31/P279+ path, possibly involving a few other selected properties), whose outputs may be adjusted in exactly one place rather than in millions of items separately, would at least free these users of their labors, and would allow us to remove the excess triples for the corresponding non-automatic but equally repetitive/unimaginative/etc. descriptions.

Some classes of items that would dearly benefit from such a thing //immediately// include items for
1) scientific articles (33,000,000+),
2) Wikimedia categories (5,000,000+),
3) Wikimedia templates (~1,000,000),
4) stars (~3,000,000),
5) galaxies (~2,000,000),
6) Unicode characters (~150,000),
7) researchers (200,000+).

This is already near //half// the total number of items on Wikidata at the moment, and //there are likely more item classes// that are missing, and //there are likely more items in the noted classes// that will add to the above numbers.

**//Note to developers and other maintainers: It is vigorously beseeched//** that this task not be closed as a duplicate of the previous task, since circumstances have significantly changed over the last six and a half years.
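
To make the fallback order from the proposed solution concrete, here is a minimal Python sketch. It is an illustration only and is not Wikibase code: `Statement`, `Item`, `best_ranked`, `resolve_description`, and `for_output` are hypothetical names, the `labels` lookup table stands in for whatever label lookup would really be used, and reporting the requested language on a generated description is an assumption. The comma-separated join and the `generated` flag are both still open questions in the task.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    value_id: str          # item ID of the P31 value
    rank: str = "normal"   # "preferred", "normal" or "deprecated"


@dataclass
class Item:
    descriptions: dict = field(default_factory=dict)  # language code -> manual description
    p31: list = field(default_factory=list)           # "instance of" statements


def best_ranked(statements):
    """Preferred statements if any exist, otherwise normal ones; never deprecated."""
    preferred = [s for s in statements if s.rank == "preferred"]
    return preferred or [s for s in statements if s.rank == "normal"]


def resolve_description(item, lang, fallback_chain, labels):
    """Return a description record for `lang`, or None if nothing can be produced.

    `labels` maps item ID -> {language code -> label}; `fallback_chain` lists
    the fallback languages in priority order, excluding `lang` itself.
    """
    languages = [lang, *fallback_chain]

    # A manual description wins: first the requested language, then fallbacks.
    for code in languages:
        if code in item.descriptions:
            return {"language": code, "value": item.descriptions[code]}

    # Otherwise generate one from the labels of the best-ranked P31 values.
    names = []
    for statement in best_ranked(item.p31):
        value_labels = labels.get(statement.value_id, {})
        for code in languages:
            if code in value_labels:
                names.append(value_labels[code])
                break
    if not names:
        return None
    # Assumption: the record reports the requested language and carries a flag.
    return {"language": lang, "value": ", ".join(names), "generated": True}


def for_output(description, flavor=None):
    """Omit generated descriptions from the RDF 'dump' flavor only."""
    if description and description.get("generated") and flavor == "dump":
        return None
    return description


# Usage with placeholder IDs: no manual description, one P31 value.
labels = {"Q_COMPOUND": {"en": "chemical compound"}}
item = Item(p31=[Statement("Q_COMPOUND")])
generated = resolve_description(item, "en", [], labels)
```

In this sketch an editing client could check the `generated` flag before saving, so that a generated description is not written back as a materialized one. Whether such a flag is introduced at all is one of the points still under consideration above.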