Adapted categories are not saved along with translation
Closed, Resolved · Public

Assigned To

Authored By

	santhosh
	Feb 26 2018, 11:44 AM

Description

The target categories that a translator adds or adapts from the source article and decides to keep for publishing are never saved in the back-end database. So when a translation is restored, these edits to the target categories are lost. This was not a big problem in CX1, since it only adapted categories automatically and had no feature to add or edit categories.

In CX2, we have these features, and there should be a way to keep the categories between translation sessions. We can consider them metadata of the article to be published.

Reuse the existing cx_corpora table to store categories metadata about the draft translation. Use a special value for the section id to distinguish categories from other translation data stored in the cx_corpora table
Develop the API and PHP classes to use the new table
Translation restore should fetch this data
Publishing should use the categories
Any edit (add, remove, reorder) in the categories collection should get auto saved just like with any other translation section
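
The plan above can be sketched as follows. This is only an illustration in Python with SQLite, not the PHP code from the patch: the table is a simplified stand-in for cx_corpora, the column set is an assumption, and the helper names (open_store, save_categories, restore_categories) are hypothetical. The section id value "CX_CATEGORY_METADATA" and the "user" origin are taken from the discussion further down.

```python
import json
import sqlite3

# Special section id used to mark the categories row (value from the discussion).
CATEGORY_SECTION_ID = "CX_CATEGORY_METADATA"


def open_store() -> sqlite3.Connection:
    """Create an in-memory, simplified stand-in for the cx_corpora table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE cx_corpora (
            cxc_translation_id INTEGER NOT NULL,
            cxc_section_id TEXT NOT NULL,
            cxc_origin TEXT NOT NULL,
            cxc_content TEXT,
            PRIMARY KEY (cxc_translation_id, cxc_section_id, cxc_origin)
        )
        """
    )
    return conn


def save_categories(conn, translation_id, categories):
    """Store the whole ordered category list as one JSON blob.

    Saving the full list on every change covers add, remove and reorder
    with a single code path, like autosave of any other section.
    """
    conn.execute(
        "INSERT OR REPLACE INTO cx_corpora VALUES (?, ?, 'user', ?)",
        (translation_id, CATEGORY_SECTION_ID, json.dumps(categories)),
    )
    conn.commit()


def restore_categories(conn, translation_id):
    """Return the saved category list, or an empty list if none was saved."""
    row = conn.execute(
        "SELECT cxc_content FROM cx_corpora "
        "WHERE cxc_translation_id = ? AND cxc_section_id = ? "
        "AND cxc_origin = 'user'",
        (translation_id, CATEGORY_SECTION_ID),
    ).fetchone()
    return json.loads(row[0]) if row else []
```

Publishing would then read the same list that restore returns, so the draft and the published page stay in sync.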

Details

	Subject	Repo	Branch	Lines +/-
	CX2: Save categories in drafts	mediawiki/extensions/ContentTranslation	master	+392 -80


Related Objects

Status	Assigned	Task
Open	None	T188605 CX2: Category support
Resolved	• Petar.petkovic	T188615 CX2: Allow users to add new categories of their choice
Resolved	• Petar.petkovic	T188238 Adapted categories are not saved along with translation

Event Timeline

santhosh created this task. · Feb 26 2018, 11:44 AM

Restricted Application added a subscriber: Aklapper. · Feb 26 2018, 11:44 AM

Pginer-WMF triaged this task as Medium priority. · Feb 26 2018, 2:35 PM

• Petar.petkovic subscribed. · Feb 28 2018, 10:00 AM

Pginer-WMF added a project: Language-2018-Jan-Mar. · Feb 28 2018, 11:53 AM

Pginer-WMF moved this task from Backlog to Priority backlog on the Language-2018-Jan-Mar board. · Feb 28 2018, 12:26 PM

• Petar.petkovic claimed this task. · Mar 1 2018, 8:21 AM

Pginer-WMF added a parent task: T188615: CX2: Allow users to add new categories of their choice. · Mar 1 2018, 12:58 PM

Pginer-WMF mentioned this in T188615: CX2: Allow users to add new categories of their choice.

Pginer-WMF moved this task from Needs Triage to CX2 on the ContentTranslation board. · Mar 1 2018, 4:45 PM

• Petar.petkovic merged a task: T120347: Categories are not saved along with translations. · Mar 4 2018, 2:03 AM

• Petar.petkovic added subscribers: Nikerabbit, StudiesWorld.

Liuxinyu970226 subscribed. · Mar 4 2018, 3:38 PM

santhosh mentioned this in T188614: CX2: Support users to adjust category mapping. · Mar 13 2018, 5:52 AM

• Petar.petkovic updated the task description. · Mar 22 2018, 5:46 PM

• Petar.petkovic moved this task from Priority backlog to In Progress on the Language-2018-Jan-Mar board. · Mar 25 2018, 10:28 PM

Pginer-WMF added a project: Language-2018-Apr-June. · Mar 28 2018, 9:59 AM

Change 423007 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Save categories in drafts

https://gerrit.wikimedia.org/r/423007

gerritbot added a project: Patch-For-Review. · Mar 29 2018, 6:25 PM

• Petar.petkovic moved this task from In Progress to In Review on the Language-2018-Jan-Mar board. · Mar 29 2018, 6:26 PM

Why do categories need to be stored in a separate table? Can't they be stored in the DOM using the Parsoid DOM spec for categories (or some extension of it)?

https://www.mediawiki.org/wiki/Specs/HTML/1.6.x#Category_links

Pginer-WMF moved this task from Backlog to In Review on the Language-2018-Apr-June board. · Apr 3 2018, 8:15 AM

Pginer-WMF removed a project: Language-2018-Jan-Mar. · Apr 3 2018, 8:26 AM

In T188238#4097667, @Esanders wrote:

Why do categories need to be stored in a separate table? Can't they be stored in the DOM using the Parsoid DOM spec for categories (or some extension of it)?

https://www.mediawiki.org/wiki/Specs/HTML/1.6.x#Category_links

We're exploring options for dealing with draft categories in the best possible way. What you are proposing is a slightly different version of what's proposed in this patch, which reuses the cx_corpora table instead of creating a new table specifically for storing draft categories.

That patch is probably a better place to continue this discussion and explore counterproposals.

@Petar.petkovic wrote the following in the patch:

Yes, Niklas and I weighed the DB possibilities: a new table, or reusing cx_corpora, which is what this patch implements.

I was almost certain about introducing a new table when Niklas presented his idea of reusing cx_corpora to save metadata under a special section ID.

cx_translations has a similar JSON-encoded construct in translation_progress, so adding a new column for categories could be one possible solution. The downside is that saved categories are meta information about the draft, not about the translation. We never purge cx_translations, and storing categories there doesn't benefit DB size.

What columns would a new table for storing categories have? I can see a translation_id foreign key and some column to store a JSON-encoded BLOB of categories. Connecting categories through their own foreign keys would not be beneficial: we don't need to query drafts by the categories they use, and queries would be slower.
Maybe we want some table to store metadata separately from corpora data. Can you foresee what could be stored in such a table besides categories? Maybe such a construct would have translation_id and a BLOB for generic metadata, or separate columns for various types of metadata, many of which would be NULL.

The benefit of storing such metadata in the pre-existing cx_corpora table is avoiding the overhead of creating a new table, which would have to be maintained, just like we (plan to) purge cx_corpora. What isn't good about this reuse is that it opens the door to future use of cx_corpora as a metadata table (which may be a good or a bad thing). Even with this amount of metadata stored in cx_corpora, it no longer serves as a pure corpora table. It is less semantic, and the associated code becomes harder to follow, with small changes made just to get metadata storage working, which in turn damages the code structure.

If we use a special section id "CX_CATEGORY_METADATA" with type as "user" as per https://gerrit.wikimedia.org/r/423007:

This will be a section without a source in the cx_corpora table. It introduces a non-obvious assumption about the entries in the table and implies additional code paths in non-obvious places.
The table entries are used for our published parallel corpora. In the XML or JSON format we publish, every item should have a source, an unmodified translation, and a modified translation, so we need to filter out the special categories entry. The categories are JSON-encoded, so they do not fit the corpora anyway. Categories are presumably not useful in parallel corpora either?
Ease of implementation is not a concern here.
At this scope of CX2, we don't have any other metadata to store. But there may be a few things, like translation reminders related to old translations. Translation progress is metadata, but it is already in the translations table. We need to think about optional metadata.

I am not pushing for a new table, just listing some considerations when we try to reuse cx_corpora table.
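
To make the filtering concern concrete, here is a minimal Python sketch of the extra code path a corpora export would need. The row shape (dicts with "section_id", "origin" and "content" keys) and the function name corpora_rows_for_dump are assumptions for illustration, not the extension's actual dump code.

```python
# Special section id for the categories entry (value from the discussion).
CATEGORY_SECTION_ID = "CX_CATEGORY_METADATA"


def corpora_rows_for_dump(rows):
    """Return only real corpora rows, dropping the special categories entry.

    Each row is assumed to be a dict with at least a "section_id" key.
    """
    return [row for row in rows if row["section_id"] != CATEGORY_SECTION_ID]
```

Every consumer that treats cx_corpora rows as translation sections would need an equivalent check, which is the non-obvious assumption described above.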

I'll write down here what I discussed with Petar and what is my current thinking:

Categories are, arguably, part of the draft content. They are part of the content in wikitext and in the DOM too, at least for now. We just happen to treat categories in a specific manner in CX. For this reason I think they should be saved together with the rest of the draft content. If we agree on this way of thinking, we should not call them metadata.
I have not seen, nor been able to come up with, a counterproposal for a new table that is more specific than a blob while being extensible for future metadata. In the corpora table we already have a blob easily available, and I think it makes sense to use it, per my first point.

Additional thoughts:

We can also save the source categories for consistency and simplicity. For dumps we can either explicitly skip them or reformat them in a manner that is suitable for each format.
We should not use cx_translations – that table should stay small and only contain translation work related metadata. The translation progress doesn't really belong there and would be a candidate for moving to a separate translation draft metadata table if we ever create such a thing.
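
As a rough illustration of the "reformat them for dumps" option, the Python sketch below collects the source and user category rows into one decoded entry. The row shape and the function name categories_for_dump are hypothetical, and the "source" origin for categories is an assumption that mirrors how other sections are stored.

```python
import json

# Special section id for the categories entry (value from the discussion).
CATEGORY_SECTION_ID = "CX_CATEGORY_METADATA"


def categories_for_dump(rows):
    """Collect saved categories into a {"source": [...], "user": [...]} entry.

    Rows are assumed to be dicts with "section_id", "origin" and a
    JSON-encoded "content". Rows of other sections are ignored.
    """
    entry = {"source": [], "user": []}
    for row in rows:
        if row["section_id"] != CATEGORY_SECTION_ID:
            continue
        if row["origin"] in entry:
            entry[row["origin"]] = json.loads(row["content"])
    return entry
```

With both origins saved, the categories entry has the same source and user structure as any other section, and a dump format can either include this decoded entry or leave it out.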

I can agree with the notion that categories are not really translation metadata. Saving source categories will help keep the corpora format consistent.

Change 423007 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] CX2: Save categories in drafts

https://gerrit.wikimedia.org/r/423007

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-05-08 (1.32.0-wmf.3)). · May 3 2018, 7:00 PM

• Petar.petkovic removed a project: Patch-For-Review. · May 3 2018, 7:23 PM

• Petar.petkovic moved this task from In Review to QA on the Language-2018-Apr-June board.

• Petar.petkovic updated the task description. · May 3 2018, 7:35 PM

@santhosh - should I look at a different db?

wikiadmin@deployment-db04[wikishared]> select  max(cxc_timestamp) from cx_corpora;
+--------------------+
| max(cxc_timestamp) |
+--------------------+
| 20160816200214     |
+--------------------+
1 row in set (0.00 sec)