Content Translation Parallel Corpus API and Dumps have different data
Closed, Resolved · Public

Description

Per the Content Translation MediaWiki page, there are two options for accessing the data from the Content Translation parallel corpus: the API or the dumps. The documentation reads as if the API and the dump files provide the same information and merely offer different ways of accessing it. I've found, however, that the dump file seems to be missing information: both translated content that was not added to the published article (but is included in the API) and content that actually was added.

For instance, the article Gradient boosting was translated from English to Spanish at 2017-10-11T15:57:19Z; it is translation ID 374869.
English version at the time: https://en.wikipedia.org/w/index.php?title=Gradient_boosting&oldid=801498395
Resulting Spanish article: https://es.wikipedia.org/w/index.php?title=Gradient_boosting&oldid=102230661

From the API (https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationcorpora&translationid=374869&striphtml=true), there are seven sections:

{
  "batchcomplete": "",
  "query": {
      "contenttranslationcorpora": {
          "sections": {
              "mwAQ": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "El gradiente que aumenta es una t\u00e9cnica de aprendizaje de la m\u00e1quina para regresi\u00f3n y problemas de clasificaci\u00f3n, el cual produce un modelo de predicci\u00f3n en la forma de un ensemble de modelos de predicci\u00f3n d\u00e9bil, t\u00edpicamente \u00e1rboles de decisi\u00f3n. Construye el modelo en una etapa-moda sensata como otros m\u00e9todos de aumentar , y les generalice por dejar optimizaci\u00f3n de una funci\u00f3n de p\u00e9rdida diferenciable arbitraria.",
                      "timestamp": "2017-09-29T15:26:33Z"
                  },
                  "user": {
                      "engine": null,
                      "content": "El gradiente que aumenta es una t\u00e9cnica de aprendizaje autom\u00e1tico\u00a0utilizado para el an\u00e1lisis de la regresi\u00f3n y para problemas de clasificaci\u00f3n estad\u00edstica, el cual produce un modelo predictivo en la forma de un conjunto de modelos de predicci\u00f3n d\u00e9biles, t\u00edpicamente \u00e1rboles de decisi\u00f3n. Construye el modelo de forma escalonada como lo hacen otros m\u00e9todos de boosting ,\u00a0y los generaliza permitiendo la optimizaci\u00f3n aribitraria de una funci\u00f3n de p\u00e9rdida diferenciable.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  }
              },
              "mwvQ": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "Gradient boosting can be used in the field of learning to rank. The commercial web search engines Yahoo[12] and Yandex[13] use variants of gradient boosting in their machine-learned ranking engines.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "El gradiente que aumenta puede ser utilizado en el campo de aprender a rango. Los motores de b\u00fasqueda de web comerciales Yahoo[12] y Yandex[13] variantes de uso del gradiente que aumenta en su m\u00e1quina-aprendido ranking motores.",
                      "timestamp": "2017-09-29T15:39:39Z"
                  },
                  "user": {
                      "engine": null,
                      "content": "La potenciaci\u00f3n del gradiente puede ser utilizado en el campo de aprendizaje de clasificaci\u00f3n. Los motores de b\u00fasqueda de web comerciales Yahoo[12] y Yandex[13]\u00a0utilizan variantes de gradient boosting en sus motores de b\u00fasqueda.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  }
              },
              "mwxA": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "AdaBoost Random forest xgboost LightGBM",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "AdaBoost Bosque aleatorio xgboost LightGBM",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "user": null
              },
              "undefined07679025be2a0b218e5c0": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "See also",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "Ve tambi\u00e9n",
                      "timestamp": "2017-09-29T15:42:06Z"
                  },
                  "user": {
                      "engine": null,
                      "content": "Ver tambi\u00e9n",
                      "timestamp": "2017-10-11T15:57:19Z"
                  }
              },
              "undefinedb515e6806294a59a15c5a": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "References",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "Referencias",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "user": null
              },
              "undefinedf9f52c9411a2ae24c8cb3": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "Usage",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": {
                      "engine": "Apertium",
                      "content": "Uso",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "user": null
              },
              "mwCw": {
                  "sequenceid": 0,
                  "source": {
                      "engine": null,
                      "content": "The idea of gradient boosting originated in the observation by Leo Breiman[1] that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman[2][3] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4][5] The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  },
                  "mt": null,
                  "user": {
                      "engine": null,
                      "content": "La idea de la potenciaci\u00f3n del gradiente fue originado en la observaci\u00f3n realizada por Leo Breiman[1] que el Boosting puede ser interpretado como un algoritmo de optimizaci\u00f3n en una funci\u00f3n de coste adecuada. El gradiente de regresi\u00f3n expl\u00edcito que aumenta los algoritmos fue posteriormente desarrollado por Jerome H. Friedman[2][3] Simult\u00e1neamente con el gradiente funcional m\u00e1s general que aumenta perspectiva de Llew Mason, Jonathan Baxter, Peter Bartlett y Marcus Frean.[4][5]\u00a0Estos dos \u00faltimos documentos presentaron la visi\u00f3n abstracta de los algoritmos de aumento de potenciaci\u00f3n como algoritmos iterativos de descenso de gradientes funcionales. Es decir, algoritmos que optimizan una funci\u00f3n de coste sobre el espacio funcional mediante la elecci\u00f3n iterativa de una funci\u00f3n (hip\u00f3tesis d\u00e9bil) que apunta en la direcci\u00f3n del gradiente negativo. Esta visi\u00f3n de gradiente funcional de potenciaci\u00f3n ha llevado al desarrollo de algoritmos de potenciaci\u00f3n en muchas \u00e1reas del aprendizaje autom\u00e1tico y estad\u00edsticas m\u00e1s all\u00e1 de la regresi\u00f3n y la clasificaci\u00f3n.",
                      "timestamp": "2017-10-11T15:57:19Z"
                  }
              }
          }
      }
  }
}
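
A minimal sketch of the same query in Python, assuming the requests library (format=json is added for machine-readable output; the other parameters mirror the URL above):

import requests

# Fetch the parallel corpus for translation 374869 via the
# contenttranslationcorpora module, mirroring the API URL above.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": 374869,
        "striphtml": "true",
        "format": "json",
    },
)
sections = resp.json()["query"]["contenttranslationcorpora"]["sections"]
print(len(sections))  # 7 sections, matching the output above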

The dump file, however, contains only four of those sections:

$ zless ../content_translation/cx-corpora.en2es.text.json.gz | grep '374869'
      "id": "374869/mwAQ",
      "id": "374869/mwvQ",
      "id": "374869/undefined07679025be2a0b218e5c0",
      "id": "374869/mwCw",

Perhaps this should be a separate bug, but I'll also note that neither the API nor the dump file indicates that the user who translated the page did in fact include the 'Uso' heading in their article. The dump file includes no information about this, and the API has a section for it under "undefinedf9f52c9411a2ae24c8cb3" but suggests that the content was not added.

Event Timeline

The dumps exclude sections which have no user translation, because that is not useful information in a comparable corpus. It seems the API does not do this filtering.
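
That filtering can be reproduced against the API response from the description. A minimal, hypothetical sketch in Python (assuming the requests library; the filter itself is an assumption based on the behaviour described above, not the actual dump script):

import requests

# Fetch the sections for translation 374869 and keep only those
# with a user translation, as the dumps are said to do.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": 374869,
        "striphtml": "true",
        "format": "json",
    },
)
sections = resp.json()["query"]["contenttranslationcorpora"]["sections"]
published = sorted(sid for sid, s in sections.items() if s.get("user"))
print(published)
# ['mwAQ', 'mwCw', 'mwvQ', 'undefined07679025be2a0b218e5c0']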

For what it's worth, I think in CX2 we do save the user translation even if it is the same as the source or the mt.

> The dumps exclude sections which have no user translation, because that is not useful information in a comparable corpus. It seems the API does not do this filtering.

Thanks @Nikerabbit, that would explain it. So I suppose this is actually a larger question about what the API and dumps should include, or how we should document these differences. In the meantime, I'll update the documentation I've built around this.

Pinging @santhosh to say that I'm happy to update the MediaWiki page as well: https://www.mediawiki.org/wiki/Content_translation/Published_translations

Pginer-WMF claimed this task.
Pginer-WMF subscribed.

I added a note to the documentation to make this more explicit. For the larger question: as this data is consumed, we'll learn more about its usefulness and adapt based on specific requests. For example, our data was recently integrated into the Opus project.

So I'd consider this ticket completed; other, more specific ones can be created if changes are needed.