Adapted categories are not saved along with translation
Closed, Resolved · Public

Description

The target categories that a translator adds, or adapts from the source article and decides to keep for publishing, are never saved in the back-end database. So when a translation is restored, these edits to the target categories are lost. This was not a big problem in CX1, since it only adapted categories automatically and had no feature to add or edit categories.

In CX2 we have these features, and there should be a way to keep the edits between translation sessions. We can consider this metadata of the article to publish.

  • Reuse the existing cx_corpora table to store category metadata about the draft translation. Use a special value for the section id to distinguish categories from other translation data stored in the cx_corpora table
  • Develop the API and PHP classes to use the new table
  • Translation restore should fetch this data
  • Publishing should use the categories
  • Any edit (add, remove, reorder) in the categories collection should be auto-saved, just like any other translation section
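
The plan above can be sketched roughly as follows. This is a minimal in-memory illustration, not the extension's code: `DraftStore` and its methods are hypothetical, the row key (translation id, section id, origin) is an assumption about how cx_corpora rows are identified, and the reserved section id is the "CX_CATEGORY_METADATA" value mentioned later in this thread.

```python
import json

# Reserved section id for the categories entry (value taken from the
# discussion in this task; the extension may use a different one).
CATEGORY_SECTION_ID = "CX_CATEGORY_METADATA"


class DraftStore:
    """Hypothetical in-memory stand-in for the cx_corpora table."""

    def __init__(self):
        # (translation_id, section_id, origin) -> content string
        self.rows = {}

    def save_section(self, translation_id, section_id, origin, content):
        # Saving the same key again overwrites the row, like an auto-save.
        self.rows[(translation_id, section_id, origin)] = content

    def save_categories(self, translation_id, categories):
        # The whole list is stored as one JSON string, so order is preserved.
        self.save_section(
            translation_id, CATEGORY_SECTION_ID, "user", json.dumps(categories)
        )

    def restore_categories(self, translation_id):
        content = self.rows.get((translation_id, CATEGORY_SECTION_ID, "user"))
        return json.loads(content) if content is not None else []


store = DraftStore()
store.save_section(1, "section-1", "user", "<p>Translated paragraph</p>")
store.save_categories(1, ["Category:Rivers", "Category:Europe"])
# A reorder is just another auto-save of the whole list.
store.save_categories(1, ["Category:Europe", "Category:Rivers"])
restored = store.restore_categories(1)
```

Because the list is saved as a whole, add, remove, and reorder all go through the same save path, and restoring a draft without saved categories simply yields an empty list.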

Event Timeline

Pginer-WMF triaged this task as Medium priority. Feb 26 2018, 2:35 PM

Change 423007 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Save categories in drafts

https://gerrit.wikimedia.org/r/423007

Why do categories need to be stored in a separate table? Can't they be stored in the DOM using the Parsoid DOM spec for categories (or some extension of it)?

https://www.mediawiki.org/wiki/Specs/HTML/1.6.x#Category_links

We're exploring options for handling draft categories in the best way. What you are proposing is a slightly different version of what this patch proposes, which is to reuse the cx_corpora table instead of creating a new table specifically for storing draft categories.

That patch is probably a better place to continue this discussion and explore counterproposals.

@Petar.petkovic wrote the following in the patch:

Yes, Niklas and I weighed the DB possibilities, including a new table and reusing cx_corpora, which is what this patch implements.

I was almost certain about introducing a new table when Niklas presented his idea of reusing cx_corpora to save metadata under a special section ID.

cx_translations has a similar JSON-encoded construct in translation_progress, so adding a new column for categories could be one possible solution. The downside is that saved categories are meta information about the draft, not about the translation. We never purge cx_translations, and storing categories there doesn't benefit DB size.

What columns would a new table for storing categories have? I can see a translation_id foreign key and a column to store a JSON-encoded BLOB of categories. Connecting categories with their foreign keys would not be beneficial: we don't need to query drafts by the categories used, and queries would be slower.
Maybe we want a table to store metadata separately from corpora data. Can you foresee what could be stored in such a table besides categories? Such a construct could have translation_id and a BLOB for generic metadata, or separate columns for various types of metadata, which would hold many NULL values.

The benefit of storing such metadata in the pre-existing cx_corpora table is avoiding the overhead of creating a new table, which would have to be maintained, just like we (plan to) purge cx_corpora. The downside of such reuse is that it opens the door to future use of the cx_corpora table as a metadata table (which may be a good or a bad thing). Even with this amount of metadata stored in cx_corpora, it no longer serves as a pure corpora table. It is less semantic, and the associated code becomes harder to follow, with small changes made just to get metadata storage working, which in turn damages the code structure.
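
For comparison, the dedicated-table alternative weighed in this comment (a translation_id key plus one JSON-encoded BLOB) might look roughly like the sketch below. It uses an in-memory SQLite database purely for illustration; the table name `draft_metadata`, the column names, and both helper functions are hypothetical and do not come from the extension.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per draft, categories kept as a JSON blob.
conn.execute(
    """
    CREATE TABLE draft_metadata (
        translation_id INTEGER PRIMARY KEY,
        categories BLOB NOT NULL
    )
    """
)


def save_categories(conn, translation_id, categories):
    blob = json.dumps(categories).encode("utf-8")
    # INSERT OR REPLACE keeps a single row per translation_id.
    conn.execute(
        "INSERT OR REPLACE INTO draft_metadata (translation_id, categories) "
        "VALUES (?, ?)",
        (translation_id, blob),
    )


def load_categories(conn, translation_id):
    row = conn.execute(
        "SELECT categories FROM draft_metadata WHERE translation_id = ?",
        (translation_id,),
    ).fetchone()
    return json.loads(row[0].decode("utf-8")) if row else []


save_categories(conn, 42, ["Category:Rivers"])
save_categories(conn, 42, ["Category:Rivers", "Category:Europe"])
loaded = load_categories(conn, 42)
```

The blob is opaque to the database, which matches the point that drafts never need to be queried by category; the cost is one more table to create, maintain, and purge.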

If we use a special section id "CX_CATEGORY_METADATA" with type as "user" as per https://gerrit.wikimedia.org/r/423007:

  1. This will be a section without a source in the cx_corpora table. It introduces a non-obvious assumption about the entries in the table and implies additional code paths in non-obvious places
  2. The table entries are used for our published parallel corpora. In the XML or JSON format we publish, every item should have a source, an unmodified translation, and a modified translation, so we need to filter out the special categories entry. The categories are JSON-encoded, so they do not fit the corpora anyway. Categories are not useful in parallel corpora either?
  3. Ease of implementation is not a concern here.
  4. At this scope of CX2, we don't have any other metadata to store. But there may be a few things, like translation reminders related to old translations. Translation progress is metadata, but it is already in the translations table. We need to think about optional metadata

I am not pushing for a new table, just listing some considerations for when we try to reuse the cx_corpora table.
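
The filtering mentioned in point 2 could be as small as the sketch below. It is only an illustration under assumptions: rows are modeled as plain dictionaries, the origin values "source", "mt", and "user" are assumed names for source text, unmodified translation, and modified translation, and `corpora_entries` is a hypothetical helper, not the extension's dump code.

```python
CATEGORY_SECTION_ID = "CX_CATEGORY_METADATA"


def corpora_entries(rows):
    """Group rows by section id, skipping the reserved categories section."""
    sections = {}
    for row in rows:
        if row["section_id"] == CATEGORY_SECTION_ID:
            # The categories entry has no source text and holds JSON,
            # so it is left out of the parallel corpora.
            continue
        sections.setdefault(row["section_id"], {})[row["origin"]] = row["content"]
    return [{"id": section_id, **parts} for section_id, parts in sections.items()]


rows = [
    {"section_id": "s1", "origin": "source", "content": "Hello"},
    {"section_id": "s1", "origin": "mt", "content": "Hola"},
    {"section_id": "s1", "origin": "user", "content": "Hola!"},
    {
        "section_id": CATEGORY_SECTION_ID,
        "origin": "user",
        "content": '["Category:Rivers"]',
    },
]
entries = corpora_entries(rows)
```

This is the kind of extra code path the comment warns about: every consumer of the table has to know about the reserved section id.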

I'll write down here what I discussed with Petar and what my current thinking is:

  • Categories are, arguably, part of the draft content. They are part of the content in wikitext and in the DOM too, at least for now. We just happen to treat categories in a specific manner in CX. For this reason I think that they should be saved together with the rest of the draft content. If we agree on this way of thinking, we should not call them metadata.
  • I have not seen nor been able to come up with a counterproposal for a new table that is more specific than a blob while being extensible for future metadata. In the corpora table we already have a blob easily available, and I think it makes sense to use it per my first point.

Additional thoughts:

  • We can also save the source categories for consistency and simplicity. For dumps we can either explicitly skip them or reformat them in a manner that is suitable for each format.
  • We should not use cx_translations – that table should stay small and only contain translation work related metadata. The translation progress doesn't really belong there and would be a candidate for moving to a separate translation draft metadata table if we ever create such a thing.

I can agree with the notion that categories are not really translation metadata. Saving source categories will help keep the corpora format consistent.

Change 423007 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] CX2: Save categories in drafts

https://gerrit.wikimedia.org/r/423007

@santhosh - should I look at a different db?

wikiadmin@deployment-db04[wikishared]> select  max(cxc_timestamp) from cx_corpora;
+--------------------+
| max(cxc_timestamp) |
+--------------------+
| 20160816200214     |
+--------------------+
1 row in set (0.00 sec)

Checked in cx2 - categories (adding, deleting, editing) are saved with translation drafts.