
Gather information on users of wb_terms replicas on WMF cloud infrastructure
Open, Normal, Public

Description

wb_terms is replicated on Toolforge, and people can build scripts, tools, etc. using it.

As we're considering phasing out this database table, we want to understand what data is used and how, so we can offer some reasonable replacement(s) for the users.

If you are the author of a tool using the wb_terms replica, or use the replicas in any other way, please provide some basic information on what you are using and how.
If you have any other questions that you think are related, do not hesitate to ask!

Example:
Usage: I have created a tool that finds Wikidata items that have an English label but are missing a label in my native language.
What data I use: I query label data from wb_terms, and then process the results in my tool to find gaps.
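A query along these lines could be sketched as follows. This is a minimal, purely illustrative stand-in: the column names come from this thread, but the in-memory sqlite3 table, sample rows, and target language (`de`) are all assumptions, not the real replica schema or data.

```python
import sqlite3

# Miniature stand-in for the wb_terms replica (column names as cited in this thread).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wb_terms (
        term_full_entity_id TEXT,
        term_entity_type    TEXT,
        term_language       TEXT,
        term_type           TEXT,
        term_text           TEXT
    )
""")
conn.executemany(
    "INSERT INTO wb_terms VALUES (?, ?, ?, ?, ?)",
    [
        ("Q1", "item", "en", "label", "universe"),
        ("Q1", "item", "de", "label", "Universum"),
        ("Q2", "item", "en", "label", "Earth"),  # no German label -> a gap
    ],
)

# Items with an English label but no label in the "native" language (here: de).
gaps = conn.execute("""
    SELECT en.term_full_entity_id
    FROM wb_terms en
    LEFT JOIN wb_terms de
           ON de.term_full_entity_id = en.term_full_entity_id
          AND de.term_type = 'label' AND de.term_language = 'de'
    WHERE en.term_type = 'label' AND en.term_language = 'en'
      AND de.term_full_entity_id IS NULL
""").fetchall()
print(gaps)  # [('Q2',)]
```

The anti-join (`LEFT JOIN … IS NULL`) is the usual way to express "has X but is missing Y" in a single pass over the table.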

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 13 2018, 5:24 PM
WMDE-leszek updated the task description. (Show Details) · Jun 14 2018, 11:38 AM
WMDE-leszek triaged this task as Normal priority. · Jun 14 2018, 12:42 PM

I do not operate any specific tool that uses wb_terms—instead opting to write simple queries—but there are a couple of things I have found useful about that table:

If it is at all possible to do both of those things within the time and space limits imposed on users of query.wikidata.org, I'd be very interested to know how. Most of my attempts to do them with SPARQL have timed out, and any revisions to work around the timeouts have been futile.

Magnus added a subscriber: Magnus. · Jun 20 2018, 9:31 AM

Oh well, there go half my Wikidata-related tools...

Is there any way to grep across all my tools for "wb_terms"? I have no idea which ones use it, I just know it's quite a few :-(

Usage: PetScan

  • Filter for items with all/any/no labels in a set of languages
  • Find items by label
  • Label Wikidata items in result set

Data: term_full_entity_id,term_text,term_type, term_entity_type, term_language

Just within the "wikidata-todo" tool, these sub-tools use wb_terms in various ways:

  • add_name_labels.php
  • artwork_images.php
  • awarder.php
  • beacon.php
  • cloudy_concept.php
  • creator.php
  • dupe_finder.php
  • firstnamebasis.php
  • get_item_names.php
  • important_blank_items.php
  • index.php
  • label_no_instance.php
  • no_statements.php
  • project_stats.php
  • relabel.php
  • related_properties.php
  • sexer.php
  • sparql_rc.php
  • translate_items_with_property.php
  • user_edits.php
  • wdq2graph.php
  • wdq_feed.php

Also, within the same tool, maintenance/cron scripts use it:

  • denkmallisten_sachsen
  • duplicity (multiple scripts)
  • photos_by_employer
  • wikispecies (multiple scripts)

Labs currently lists 132 tools of mine: https://tools.wmflabs.org/admin/tools#!/author/Magnus%20Manske

I will not be able to fix all the ones that use wb_terms once you break it, at least not in a reasonable timeframe. I will instead direct all angry users to @WMDE-leszek :-]

Nikki added a subscriber: Nikki. · Jun 20 2018, 10:49 AM

I don't have any tools, but I have used the wb_terms table in Quarry a number of times. I am usually trying to select all terms for a particular language, select all terms which match a particular regex, or count how many terms there are. Things like finding labels containing disambiguation information, finding descriptions written like sentences, finding terms containing HTML entities, finding labels which have namespace prefixes when they shouldn't or vice versa, finding misspelt words, listing the most common descriptions for a language...

I use the columns term_full_entity_id (or term_entity_id in older queries), term_entity_type, term_language, term_type and term_text.

It is usually not possible to use SPARQL because the queries are too slow and the timeout in the query service is much lower than in Quarry. In particular, querying for all terms in a particular language is very slow; I already created a ticket for that: T167361.
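One such cleanup query (labels carrying a parenthesised disambiguator) could be sketched against a toy copy of the table. Note the hedge: the schema subset, sample rows, and regex are illustrative only, and sqlite3 needs a REGEXP function registered by hand, unlike the MariaDB replicas behind Quarry, which support REGEXP natively.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3 has no built-in REGEXP; register one ("X REGEXP Y" calls regexp(Y, X)).
conn.create_function("regexp", 2, lambda pat, s: re.search(pat, s) is not None)

conn.execute("""
    CREATE TABLE wb_terms (
        term_full_entity_id TEXT, term_language TEXT,
        term_type TEXT, term_text TEXT
    )
""")
conn.executemany("INSERT INTO wb_terms VALUES (?, ?, ?, ?)", [
    ("Q5", "en", "label", "Mercury (planet)"),
    ("Q6", "en", "label", "Mercury"),
    ("Q7", "en", "description", "chemical element"),
])

# English labels that end in a parenthesised disambiguator like " (planet)".
rows = conn.execute("""
    SELECT term_full_entity_id, term_text
    FROM wb_terms
    WHERE term_type = 'label' AND term_language = 'en'
      AND term_text REGEXP ' \\([^)]+\\)$'
""").fetchall()
print(rows)  # [('Q5', 'Mercury (planet)')]
```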

@WMDE-leszek Can you provide some more details on why this is to be phased out?

Thanks @Mahir256, @Magnus, @Nikki for comments, that's very helpful!

@Mahir256 to clarify:

It is also my go-to table whenever there is some cleaning up of labels to be done

Do I interpret correctly that you are generally running such queries regarding labels only? (That's not a leading question, just trying to get more data :))

@Magnus: that's plenty of tools. We'll come up with some solution, for sure.
Regarding PetScan: querying wb_terms happens in regards to labels only, though, right?

I will instead direct all angry users to @WMDE-leszek :-]

That is of course a part of any migration path we'll provide!

@Daniel_Mietchen: simply put, the table has grown too big, and it is not really maintainable. There have been several incidents in recent months related to the table, and all short-term measures that have been attempted to improve the situation have failed. On a more general level, it seems (as can also be seen just by reading the comments on this ticket) that the table serves many diverse needs/use cases, but is not really a perfect fit for any of them.

To clarify, I take it as my failure to communicate this clearly: We are certain the wb_terms SQL table as it is now cannot continue to be used on as big a Wikibase installation as Wikidata. That said, the plan is not to simply phase it out, but to introduce replacements that scale better and better serve particular needs and use cases.
As an example, consider ElasticSearch, which is now used on Wikidata as the search engine because wb_terms was no longer usable as the search index (due to the amount of data in the table).

So thank you all for the input so far, and of course we're waiting for more!
wb_terms is not going to be turned off without an announcement and a reasonable grace period, AND without having replacements in place.

@Magnus: that's plenty of tools. We'll come up with some solution, for sure.
Regarding PetScan: querying wb_terms happens in regards to labels only, though, right?

If you look at http://petscan.wmflabs.org/, specifically the "Wikidata" tab and the "Labels etc." section there, it currently offers LIKE queries on labels, aliases, and/or descriptions.

From the query log, I see that these options, in some form, have been used 9039 times so far. Not a crippling loss, then, but someone would surely be unhappy...

XTools uses wb_terms to report if a page is missing a label and description on Wikidata for the corresponding language. Performance is paramount so if SPARQL is indeed that slow (T197161#4302203) I don't think it would be an option for XTools. If we can efficiently query for it on the replicas via some other table(s) this is perfectly fine for us :)

Nikki added a comment. · Jun 22 2018, 7:15 AM

XTools uses wb_terms to report if a page is missing a label and description on Wikidata for the corresponding language. Performance is paramount so if SPARQL is indeed that slow (T197161#4302203) I don't think it would be an option for XTools. If we can efficiently query for it on the replicas via some other table(s) this is perfectly fine for us :)

Not all queries for terms in SPARQL are slow. The problematic ones are those which involve checking a large number (i.e. millions) of terms to see if they match some criteria (which is not fast in SQL either :)). Because of the way the data is modelled in RDF, this includes queries which try to select all terms in a particular language, even when there aren't many terms for that language: the text and the language are both part of the value and have to be extracted separately using functions (similar to how you might extract the year from a date); they're not indexed separately. Selecting terms for small numbers of items should work fine (e.g. selecting English labels/descriptions for Q1 and Q2, and selecting English labels/descriptions for cats, both take less than a second).
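The point above can be illustrated with a rough Python analogy. This is not real RDF tooling and says nothing about Blazegraph internals; it only mimics the shape of the problem: text and language tag live inside one literal value, so selecting by language means computing a function over every term, where wb_terms can use an index on term_language.

```python
# Rough analogy: in the query service a label is one literal carrying both
# text and language tag, e.g. "Katze"@de (sample values are made up).
literals = ['"cat"@en', '"Katze"@de', '"chat"@fr', '"Hauskatze"@de']

def lang(literal):
    # Like SPARQL's LANG(): peel the tag off the value itself.
    return literal.rsplit("@", 1)[1]

def text(literal):
    # Like SPARQL's STR(): the text is also embedded in the value.
    return literal.rsplit("@", 1)[0].strip('"')

# "All German terms" must evaluate lang() on EVERY literal -- a full scan --
# whereas an indexed term_language column jumps straight to the matches.
german = [text(l) for l in literals if lang(l) == "de"]
print(german)  # ['Katze', 'Hauskatze']
```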

Found another one: Mix'n'match "automatch" (preliminary matches by name, that need to be confirmed by a user). This quickly checks tens/hundreds of thousands of names against Wikidata labels on a regular basis. I am quite certain this would not work with SPARQL, and though it might actually yield better quality using the search API, it's one http(s) query per name, and I don't want to DOS Wikidata, not even considering speed...

I don't want to stop a discussion or anything, just to avoid possible misunderstanding. I didn't mention SPARQL as the considered replacement for wb_terms for tools. I am pretty sure it wouldn't be a good alternative for most of the tools, so we are not even really considering this path :)

@WMDE-leszek I think some tools could actually use SPARQL, if you look for a precise label with known language for example. But I mentioned SPARQL only as one of the other ways we currently have. wbsearchentities is another, and so is the fulltext search API. And wbgetentities of course, if you know the item and just want the label. They all have some potential for certain tasks, but especially when dealing with lots of labels to check, none of them scale up.

As for possible solutions, one approach I could live with would be an HTTP-based API where I can submit a large batch of strings (via POST) plus some parameters (e.g. "items for German labels matching these regexes"), and then get a big mapping back. That would solve the death-by-a-million-HTTP-queries problem with the current APIs, and let you access the data behind the API in any way you see fit. Just a thought.
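The batch-in/mapping-out shape Magnus describes could look roughly like this. Everything here is hypothetical — the request fields, the response shape, and the tiny label store are invented for illustration; no such endpoint exists.

```python
import json
import re

# Hypothetical server-side core of such a batch endpoint: one POST body in,
# one big mapping out, instead of one HTTP request per name.
TERMS = {  # stand-in for the term store: label -> item id (illustrative values)
    "Katze": "Q146",
    "Hund": "Q144",
    "Haus": "Q3947",
}

def batch_match(request_body: str) -> str:
    req = json.loads(request_body)
    patterns = [re.compile(p) for p in req["regexes"]]
    mapping = {
        label: item
        for label, item in TERMS.items()
        if any(p.fullmatch(label) for p in patterns)
    }
    return json.dumps({"language": req["language"], "matches": mapping})

# One request carries the whole batch of patterns.
reply = batch_match(json.dumps({"language": "de", "regexes": ["Ka.*", "H.nd"]}))
print(reply)  # {"language": "de", "matches": {"Katze": "Q146", "Hund": "Q144"}}
```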

@Lucas_Werkmeister_WMDE is amazing and created a script that finds tools that are somehow using wb_terms. The results are at P7299. The script, for reference, is P7298. Thanks Lucas!
We'll use this input as well. But of course more detailed and personal input like from folks above is always helpful!

@WMDE-leszek: no, I also have used wb_terms to clean up descriptions and aliases as well; for example, I will most likely find some time to move gom entries to gom-latn or gom-deva as appropriate (similar things are possible for ks, crh, gan, ku, ruq, shi, ug, and with some difficulty sr, tg, tt, and kk). (Apologies for the delayed reply.)

Vvjjkkii renamed this task from Gather information on users of wb_terms replicas on WMF cloud infrastructure to m2aaaaaaaa. · Jul 1 2018, 1:05 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from m2aaaaaaaa to Gather information on users of wb_terms replicas on WMF cloud infrastructure. · Jul 2 2018, 12:22 PM
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.