Request: expose database tables of the Translate extension to users in replicas on Toolforge (Wikidata, or all Wikis)
Closed, Resolved · Public

Description

I am currently building a tool on Toolforge and would like to query information related to pages managed by the Translate extension [1]. For Wikidata, the tables are currently not accessible, so I request that they be exposed to users in the wiki replicas. I would not mind if this happens for other wikis that use the Translate extension as well, although I do not currently plan to use it anywhere other than Wikidata.

Mind that I have no idea whether there is sensitive information in the tables as the documentation for them on MediaWiki.org is rather incomplete. My impression is that they are not accessible yet because nobody ever asked for it.

[1] https://www.mediawiki.org/wiki/Extension:Translate

Event Timeline

Urbanecm subscribed.

Which DB tables would you like to see in Toolforge? Translate introduces a bunch of tables, and perhaps only some of them are really important for replica users (for instance, translate_stash looks to have temporary data not of interest to users).

Also adding Language-Team (maintainers of MediaWiki-extensions-Translate) and Security-Team to help assess the implications of adding new tables.

I know that the revtag table is definitely required, but I am not exactly sure about the other ones due to the incomplete documentation. I think all permanent/non-temporary content that is not sensitive should be accessible in the replicas, in order to allow maximum possibilities.

Coincidentally, I missed the revtag table on Quarry last week when trying to debug a bug report.

Here is the list of tables used by Translate:

  • translate_cache: not yet created in WMF, similar to object cache, should not be exposed
  • translate_stash: not used in WMF
  • translate_groupstats: could be exposed (e.g. to check statistics)
  • translate_sections: could be exposed, not sure if useful
  • translate_groupreviews: could be exposed, probably useful (e.g. to query group workflow status)
  • translate_messageindex: some internal tracking, should not be exposed
  • translate_reviews: could be exposed (e.g. to query who reviewed what)
  • translate_metadata: could be exposed (e.g. to query settings of translatable pages)
  • revtag: could be exposed
  • translate_tms, translate_tmt, translate_tmf: not used in WMF
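
For anyone scripting against this list, here is a minimal Python sketch that records the recommendations above as a lookup table. The three labels ("expose", "withhold", "unused") and the helper name tables_with are shorthand invented for this sketch; they are not terms used by the Translate extension or the replica configuration.

```python
# Recommendations transcribed from the list above.
# The labels "expose", "withhold" and "unused" are this sketch's own shorthand.
TRANSLATE_TABLES = {
    "translate_cache": "withhold",         # not yet created in WMF, cache-like
    "translate_stash": "unused",           # not used in WMF
    "translate_groupstats": "expose",      # statistics
    "translate_sections": "expose",        # usefulness unclear
    "translate_groupreviews": "expose",    # group workflow status
    "translate_messageindex": "withhold",  # internal tracking
    "translate_reviews": "expose",         # review records
    "translate_metadata": "expose",        # settings of translatable pages
    "revtag": "expose",
    "translate_tms": "unused",             # not used in WMF
    "translate_tmt": "unused",
    "translate_tmf": "unused",
}


def tables_with(label):
    """Return the sorted table names that carry the given label."""
    return sorted(name for name, value in TRANSLATE_TABLES.items() if value == label)


if __name__ == "__main__":
    for label in ("expose", "withhold", "unused"):
        print(label, tables_with(label))
```

Keeping the list as data makes it easy to adjust a single entry if the assessment of a table changes later in the discussion.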

Hey @Nikerabbit and thanks for providing some background on these tables.

You mentioned that translate_messageindex should not be exposed because it contains some internal tracking information. I pulled up an excerpt from that table but I think I would need some help in understanding where you see an issue.

wikiadmin@10.64.16.207(wikidatawiki)> select * from translate_messageindex limit 1\G

*************************** 1. row ***************************
tmi_key: <redacted integer>:help:About_data/1
tmi_value: page-Help:About data|agg-Help

Could you please explain to me why you think this index information is concerning?
That may help me better grasp any security or privacy risk inherent to this specific table.
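
To make the excerpt easier to read, here is a small Python sketch that splits a tmi_value string into its parts. It assumes, based only on the single row shown above, that entries are separated by "|" and that each entry has the form prefix-name (such as page-… or agg-…); the actual storage format may differ, and the function name split_index_value is hypothetical.

```python
def split_index_value(tmi_value):
    """Split a tmi_value string into (prefix, name) pairs.

    Assumption: entries are "|"-separated and each entry looks like
    "<prefix>-<name>", as in the one row quoted in this task.
    """
    pairs = []
    for entry in tmi_value.split("|"):
        # partition splits on the first "-" only, so names may contain hyphens
        prefix, _, name = entry.partition("-")
        pairs.append((prefix, name))
    return pairs


if __name__ == "__main__":
    print(split_index_value("page-Help:About data|agg-Help"))
```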

@Nikerabbit clarified to me through Slack that his concern with the translate_messageindex revolved around the usefulness of exposing that table.

On behalf of Security-Team, I reviewed the privacy risks that exposing the Translate extension’s tables in the replicas may bring about. I’ll share my conclusions below.

Unlike production databases, the replicas do not include tables related to the Translate extension. According to the official documentation, the extension comes with 11 tables, though some of them are not yet created or used, as pointed out earlier by @Nikerabbit. My review therefore focused solely on the tables that exist in production at the time of writing. Most of them pose little to no privacy risk for end users:

  • revtag does not appear to hold any sensitive data such as unique identifiers or similar fingerprints
  • translate_groupreviews indicates the status of translations (in progress, needs proofreading, etc.)
  • translate_sections holds information about translation sections, no PII included
  • translate_groupstats holds some aggregated statistics, no PII included
  • translate_messageindex contains no identifying information

The only table that raised some concerns is translate_reviews. This table contains details about translations as well as the users who did them. However, such information is already publicly available through other means, and it is useful for finding active translators, as shown by this existing query. Furthermore, aside from giving specifics about the on-wiki editing of some usernames, that data in and of itself is not enough to enable a violation of users’ privacy. Should a malicious actor decide to use these details in a harmful way (e.g. by unnecessarily scrutinizing users), the data would provide them no concrete details about the users' identity.

That being said, I would recommend a data minimization approach: expose only tables that have legitimate utility for replica users, as opposed to enabling tables “in order to allow maximum possibilities”. That rule of thumb should be applied to tables that have not yet been created, as well as to those that we reviewed.

With the above in mind, the overall privacy risk of exposing these tables to the replicas was categorized as LOW.

Change 735088 had a related patch set uploaded (by AntiCompositeNumber; author: AntiCompositeNumber):

[operations/puppet@production] wikireplicas: add Translate extension tables

https://gerrit.wikimedia.org/r/735088

Change 735088 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wikireplicas: add Translate extension tables

https://gerrit.wikimedia.org/r/735088

I don't see the tables on Quarry yet. Are there more steps to be done, or will it just take time for them to get replicated there?

The views need to be rebuilt on all of the wiki replicas before the new tables will be exposed.

Nikerabbit claimed this task.

I can run select * from translate_metadata limit 10; for metawiki and wikidatawiki on Quarry \o/. I assume this task is now complete.
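
For Toolforge tools that want to run the same query outside Quarry, a rough Python sketch follows. Several things here are assumptions not stated in this task: the replica host naming scheme (<wiki>.analytics.db.svc.wikimedia.cloud), the _p suffix on the database name, the credentials file at ~/replica.my.cnf, and the availability of the third-party pymysql package. The helper names are hypothetical.

```python
import os

# The query from the comment above.
QUERY = "SELECT * FROM translate_metadata LIMIT 10"


def replica_host(wiki, cluster="analytics"):
    """Build a replica host name (assumed naming scheme, see lead-in)."""
    return f"{wiki}.{cluster}.db.svc.wikimedia.cloud"


def replica_database(wiki):
    """Build the public replica database name (assumed '_p' suffix)."""
    return f"{wiki}_p"


def fetch_rows(wiki):
    """Run QUERY against the replica for one wiki and return all rows."""
    import pymysql  # third-party; imported lazily so the helpers work without it

    conn = pymysql.connect(
        host=replica_host(wiki),
        database=replica_database(wiki),
        # Assumed location of the tool's replica credentials.
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for wiki in ("metawiki", "wikidatawiki"):
        print(wiki, len(fetch_rows(wiki)))
```

The connection is only opened inside fetch_rows, so the name-building helpers can be reused or tested without database access.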