Jul 21 2023
I'm surprised that this hasn't received any attention in 15 months. As an update to @Nikki's numbers, there are now on the order of 2.5 BILLION of these bot-generated descriptions. The top 5 alone represent over 2 billion triples. That's a huge waste of resources!
@Manuel when you write:
Jul 20 2023
I have a theory as to where a big chunk of the machine-generated descriptions come from. They are the phrase "Wikimedia category" in hundreds of languages, a textual transcription of the triple instanceOf Q4167836. For example, Catégorie:Naissance à Seri Menanti has a single label in French and the P31 instanceOf claim, which together occupy 802 bytes. Then two bots (Mr.Ibrahembot and Emijrpbot) came along and added another 11.5K (!) of static text (not even anything templated) in 129 languages, none of which have labels for the category.
Is triple count the only important parameter? It seems likely that the descriptions could be larger, on average, than labels.
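To make the size question concrete, here is a minimal sketch of how one might compare the bytes taken by labels and by descriptions in a single entity document. It assumes the usual entity JSON layout, where `labels` and `descriptions` map a language code to a `{"language", "value"}` object. The `term_sizes` helper and the trimmed `sample` entity are hypothetical and for illustration only, not measurements from the real item.

```python
import json

def term_sizes(entity: dict) -> dict:
    """Return the UTF-8 byte size of each term section of an entity document."""
    sizes = {}
    for section in ("labels", "descriptions", "aliases"):
        # Serialize compactly so whitespace does not inflate the numbers.
        encoded = json.dumps(
            entity.get(section, {}), ensure_ascii=False, separators=(",", ":")
        )
        sizes[section] = len(encoded.encode("utf-8"))
    return sizes

# Hypothetical, heavily trimmed entity: one label, three static descriptions.
sample = {
    "labels": {
        "fr": {"language": "fr", "value": "Catégorie:Naissance à Seri Menanti"},
    },
    "descriptions": {
        "en": {"language": "en", "value": "Wikimedia category"},
        "fr": {"language": "fr", "value": "page de catégorie d'un projet Wikimedia"},
        "de": {"language": "de", "value": "Wikimedia-Kategorie"},
    },
}

sizes = term_sizes(sample)
print(sizes)
```

Running the same function over a dump would give a per-section byte total to set beside the triple counts.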
Mar 21 2023
Mar 14 2023
How does one discover what the resolution was? (Apologies if this should be obvious, but I'm used to bug trackers which link the commits back to the issue.)
Feb 9 2023
I vote for full URLs. Also, HTTPS URLs should probably be used throughout in preference to HTTP URLs to save naive clients from the extra latency of a redirect.
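As a rough illustration of the HTTPS preference, the sketch below rewrites the scheme of a full URL before it is emitted. The `prefer_https` helper is hypothetical, and it assumes the host serves the same resource over HTTPS. It should not be applied to identifiers that must match byte-for-byte.

```python
from urllib.parse import urlsplit, urlunsplit

def prefer_https(url: str) -> str:
    """Upgrade a plain http URL to https; leave everything else untouched."""
    parts = urlsplit(url)
    # Only rewrite when no explicit port is given, since a port such as
    # :8080 would not carry over to HTTPS meaningfully.
    if parts.scheme == "http" and parts.port is None:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

print(prefer_https("http://example.org/path?x=1"))
```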
Jan 10 2023
Sep 10 2020
Aug 21 2020
I'm surprised that a private third party proxying such a significant segment of the traffic to Wikidata hasn't prompted the Wikidata Engineering team to take this more seriously.
Jul 4 2020
Jun 30 2020
The documentation claims INCRBYFLOAT was introduced in Redis 2.6.0.
If the manifest has to be constructed by hand, it seems like YAML would be a better format than JSON. They are equivalent from a structural and informational point of view, but YAML is much easier to edit without creating invalid documents.
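One example of the kind of hand-editing mistake I have in mind is a trailing comma left behind after deleting the last entry of an object. A strict JSON parser rejects the whole document. The sketch below uses only Python's standard `json` module. The `is_valid_json` helper and the manifest fragments are hypothetical.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical manifest fragments; the second has a stray trailing comma.
good = '{"name": "example", "version": 1}'
bad = '{"name": "example", "version": 1,}'

print(is_valid_json(good), is_valid_json(bad))
```

The equivalent YAML mapping has no closing braces or separating commas to get wrong, although YAML indentation brings pitfalls of its own.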
Apr 30 2020
The so-called "Freebase" dataset is actually a mix of data from Freebase and a bunch of URLs that were pulled from Google web crawls by an intern as potential "evidence." Those URLs don't have anything to do with the provenance of the data that was in Freebase, which was recorded for every item of data that was written there. Of course it would be silly to suggest a blacklisted site, but I don't believe the intern was provided with a blacklist ahead of time and, as the blacklist was developed after the fact, it hasn't been used to filter what's presented to users.
It would seem that the 2018-03-13 spreadsheet should be adequate to call this task complete. I would recommend including some qualitative understanding of the source of the Freebase data, in addition to the pure curation ratio, when making judgements about which data to use and how. Things like MusicBrainz IDs and ISFDB IDs went through a heavily QA'd reconciliation process and are going to be high quality. Films, and to a lesser extent TV shows, were an area of focus for the Freebase team, so they will generally be both high quality and relatively complete.
Apr 18 2016
It seems bizarre that the utility of this is debated. The solution suggested by Bene sounds simple, straightforward, and useful.
Feb 24 2016
I think these are likely two different bugs. Has anyone looked at either of them in the last 4 months?