
Generate template parameter alignments for the selected small wikis
Closed, ResolvedPublic

Description

As part of the Language annual plan (T225298) we want to better support translation in small wikis where Content translation is available as a default tool (out of beta): Bengali (bn), Malayalam (ml), Tagalog (tl), Javanese (jv), Mongolian (mn), and Albanian (sq).

A machine learning approach (T221211) was applied to identify the mappings/alignments of parameters for the most used templates and language pairs (generated alignments). We want to generate additional mappings that are relevant for the target wikis. In particular:

  • id -> jv
  • ru -> mn
  • en -> mn
  • en -> jv
  • en -> bn
  • en -> ml
  • en -> tl
  • en -> sq

Once those are generated, the metadata for these new alignments should be integrated into Content Translation in the same way as the previous alignments (T224721). This could be part of a separate ticket if additional effort is required.

A How-To for the template parameter alignment process is being documented at: https://github.com/kartikm/templatesAlignment/blob/master/How-To.md

Event Timeline

Pginer-WMF triaged this task as Medium priority.Jul 3 2019, 10:43 AM

The format may change after work on T224721 is completed, so it may be good to wait until then.

@diego, Would it be possible to add a Dockerfile to the repo so that a ML-nonexperienced person can get it running and extract the outputs?

@santhosh, I've never tried that (I understand that Docker images are a kind of virtual environment, but honestly, I've never used them). We can try, but remember that the person will need access to our Spark cluster. Do you know if the Docker environment can connect to Yarn?

Reedy renamed this task from Generate template parameter alignments for the selected small wikis to Generate template parameter alignments for the selected small wikis.Sep 16 2019, 1:53 PM
Pginer-WMF raised the priority of this task from Medium to High.Dec 12 2019, 1:34 PM

Currently, pyspark does not seem to be working, so the run is stopped for now. Following up on this with @diego (and with Analytics).

Hey @Ottomata @JAllemandou, can you please check why the Pyspark kernels are not working? I've been trying for a week with the different pyspark kernels on the notebook machines, but the notebook freezes on any command (even if you try non-Spark commands); pure Python works fine. Thx

Problem solved (thanks @elukey and @JAllemandou).

@KartikMistry please go to notebook1003 or install pyspark on stat1007. Check that Kerberos is correctly configured (https://wikitech.wikimedia.org/wiki/SWAP#Kerberos) and then run:

from pyspark.sql.functions import regexp_replace

# Load the wikidata item -> page link snapshot and keep main-namespace pages only.
df = spark.read.parquet('/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204')
df = df[df['page_namespace'] == 0]
# Page titles are stored with underscores; convert them back to spaces.
df = df.withColumn('page', regexp_replace('page_title', '_', ' '))
df = df.select('wiki_db', 'item_id', 'page')
df.write.csv('wikidaItemsForKartik.csv')
# Concatenate the partitioned HDFS output into a single local CSV file.
!hadoop fs -text wikidaItemsForKartik.csv/* > wikidaItemsForKartik.csv

Then in your local script run:

import pandas

# Note: pandas.read_csv, not pandas.read.csv. Also, Spark's write.csv emits no
# header row by default, so supply the column names explicitly.
df = pandas.read_csv('wikidaItemsForKartik.csv', sep=',',
                     header=None, names=['wiki_db', 'item_id', 'page'])

and you will have the data you need.

Current issue:

reading word vectors from vectors/wiki.en.vec
reading word vectors from vectors/wiki.mn.vec
== mn
Traceback (most recent call last):
 File "02alignmentsSpark.py", line 98, in <module>
   df2 = df[df.wiki_db == '%swiki' % lang1_code].join(df[df.wiki_db ==
'%swiki' % lang2_code].withColumnRenamed("page",
"page2").withColumnRenamed("wiki_db", "wiki_db2"),on='item_id')
 File "/home/kartik/.local/lib/python3.7/site-packages/pandas/core/generic.py",
line 5179, in __getattr__
   return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'withColumnRenamed'

As Diego suggested, the Spark DataFrame to pandas DataFrame conversion is not being done properly. I'm looking at it with Diego's help.

In the short term, the solution is to use the code as it was designed: with Spark.
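For reference, the traceback above happens because pandas DataFrames have no `withColumnRenamed` method, which only exists on Spark DataFrames. The failing join has a direct pandas equivalent using `rename()` and `merge()`; a minimal sketch with hypothetical toy data (not real cluster output):

```python
import pandas as pd

# Hypothetical stand-in for the wikidata item/page table exported earlier.
df = pd.DataFrame({
    'wiki_db': ['enwiki', 'mnwiki', 'enwiki'],
    'item_id': ['Q1', 'Q1', 'Q2'],
    'page':    ['Language', 'хэл', 'Earth'],
})

lang1_code, lang2_code = 'en', 'mn'

# pandas has no withColumnRenamed; use rename() for columns and merge() for joins.
left = df[df.wiki_db == '%swiki' % lang1_code]
right = (df[df.wiki_db == '%swiki' % lang2_code]
         .rename(columns={'page': 'page2', 'wiki_db': 'wiki_db2'}))
df2 = left.merge(right, on='item_id')   # inner join on item_id, like the Spark join
print(df2[['item_id', 'page', 'page2']])
```

The default `merge` is an inner join on the given key, matching the behavior of the Spark `join(..., on='item_id')` call in the script.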

@KartikMistry, as you told me, I understand that you want to remove the dependency on Spark in order to run the code on the Labs machines, correct?

Maybe @Ottomata can tell us whether it is possible to access the Spark cluster from the Labs machines?

Maybe @Ottomata can tell us whether it is possible to access the Spark cluster from the Labs machines?

Naw, it's not :/

I've generated templatemapping.db, updated the cxserver Labs instance (cxserver.wmflabs.org), and am testing there.

Number of alignments in the script run:

  • id->jv: 72
  • ru->mn: 31
  • en->mn: 4
  • en->jv: 34
  • en->bn: 29
  • en->ml: 13
  • en->tl: 100
  • en->sq: 40

Change 572769 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Update templatemapping database

https://gerrit.wikimedia.org/r/572769

Change 572769 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Update templatemapping database

https://gerrit.wikimedia.org/r/572769

Change 574768 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2020-02-24-110149-production

https://gerrit.wikimedia.org/r/574768

Change 574768 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2020-02-24-110149-production

https://gerrit.wikimedia.org/r/574768

Mentioned in SAL (#wikimedia-operations) [2020-02-26T05:41:05Z] <kart_> Updated cxserver to 2020-02-24-110149-production (T227183)

@Jpita For testing you can use the following templates:

  • ru->mn: Шаблон:Google books, Шаблон:Cite book
  • en->mn: Template:Infobox album
  • en->jv: Template:Cite book, Template:Cite episode, Template:Cite journal, Template:Cite web
  • en->bn: Template:Cite book, Template:Infobox country, Template:Infobox cricketer
  • en->ml: Template:Infobox person, Template:Infobox cricketer
  • en->tl: Template:Infobox person, Template:Cite episode
  • en->sq: Template:Cite book, Template:Infobox person

It seems id->jv is not showing any results in the DB :/ so I will check that separately.

QA NOTES

  • ru->mn: Шаблон:Google books, Шаблон:Cite book ✅
  • en->mn: Template:Infobox album ☑️exists but it's incomplete
  • en->jv: Template:Cite book, Template:Cite episode, Template:Cite journal, Template:Cite web ✅
  • en->bn: Template:Cite book, Template:Infobox country, Template:Infobox cricketer ✅
  • en->ml: Template:Infobox person, Template:Infobox cricketer ✅
  • en->tl: Template:Infobox person, Template:Cite episode ☑️exists but it's incomplete
  • en->sq: Template:Cite book, Template:Infobox person ✅

Is this the expected outcome @KartikMistry @Pginer-WMF ?
If yes, feel free to move to done

Is this the expected outcome @KartikMistry @Pginer-WMF ?
If yes, feel free to move to done

Incomplete is OK, IMHO, as long as we get some parameters mapped automatically. @Pginer-WMF can confirm about moving the task to done.

Is this the expected outcome @KartikMistry @Pginer-WMF ?

Some more information may be needed to be sure. The alignments document provides the mapping for certain template parameters; are those the ones that were mapped?

Let's take for example:

  • en->mn: Template:Infobox album ☑️exists but it's incomplete

From the list of new mappings, you can check the English-to-Mongolian mapping. In the mappings you'll see the following info for the "Infobox album" template:

"Template:Infobox album": [{"mn": "\u0445\u044d\u043b", "d": 0.082656195408495, "en": "language"}]

This means that the "language" parameter in English corresponds to the "хэл" parameter in Mongolian (the mapping file is JSON, so the non-ASCII name appears as the escape sequence "\u0445\u044d\u043b"). Thus, a specific instance of "Infobox album" is expected to successfully transfer the information stored in the "language" parameter into the corresponding parameter when the template is added to the translation. Other parameters may also be transferred (e.g., if they have the same name in both languages) or not, but that is not relevant for this parameter-mapping approach.
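To see what the escaped parameter name actually is, the mapping entry can be decoded with the standard `json` module (the snippet below wraps the mapping line quoted above in a JSON object for illustration):

```python
import json

# JSON escapes non-ASCII characters as \uXXXX sequences by default;
# decoding the mapping entry recovers the real Mongolian parameter name.
raw = ('{"Template:Infobox album": '
       '[{"mn": "\\u0445\\u044d\\u043b", "d": 0.082656195408495, "en": "language"}]}')
mapping = json.loads(raw)
param = mapping["Template:Infobox album"][0]
print(param["en"], '->', param["mn"])  # prints: language -> хэл
```

So the escaping is just JSON's default `ensure_ascii` serialization, not a problem with the alignment data itself.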

Thus, if the "language" parameter was among those that were mapped, it is OK even if the final template is still incomplete. If you could confirm this for the incomplete cases, we can close this.