
Integrate new section mapping database
Closed, ResolvedPublic


A new approach was used to generate section mappings in T293511. As a result a database was produced capturing those mappings.

This ticket is about integrating this database into Section Translation so that the tool can make use of the new mappings. This is expected to improve the way equivalent sections are identified.

The integration should be done in a way that allows the data from the different approaches (previous CX translations, MT, and this new approach) to be updated independently. That is, we should be able next year to re-generate the database created in T293511 and replace the old contents without removing the mappings coming from CX translations in the process.
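One way to keep the sources independently refreshable is to tag every mapping row with the approach that produced it. The following is only a hypothetical sketch; the `provider` column and its values ("cx", "ml") are illustrative and not part of the actual cxserver schema described in this task:

```python
import sqlite3

# Hypothetical schema: one extra "provider" column recording which
# approach produced each row, so one provider's rows can be replaced
# without touching the others.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (source_language TEXT, target_language TEXT, "
    "source_title TEXT, target_title TEXT, provider TEXT)"
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?, ?)",
    [
        ("en", "es", "History", "Historia", "cx"),
        ("en", "es", "References", "Referencias", "ml"),
    ],
)

def refresh_provider(conn, provider, rows):
    """Replace one provider's mappings, leaving the rest intact."""
    with conn:  # commits on success
        conn.execute("DELETE FROM titles WHERE provider = ?", (provider,))
        conn.executemany("INSERT INTO titles VALUES (?, ?, ?, ?, ?)", rows)

# Re-generating the ML data becomes a delete-and-reload scoped to "ml".
refresh_provider(conn, "ml", [("en", "es", "History", "Historia", "ml")])
```

With a layout like this, next year's regeneration of the T293511 data is a single scoped delete-and-reload that cannot disturb the CX-derived mappings.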

After the integration we may want to re-check the problematic cases captured in T283817 to update their status.

Event Timeline

Pginer-WMF triaged this task as Medium priority. · Apr 27 2022, 8:40 AM
Pginer-WMF created this task.
santhosh changed the task status from Open to In Progress. · May 10 2022, 11:31 AM
santhosh claimed this task.

Change 790654 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Add script to extract the section title mappings from ML based db

Change 790654 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Add script to extract the section title mappings from ML based db

Change 792598 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Section alignment db: Size optimizations

After running the above script to extract the section titles for the language pairs that are used by CX (we have 733 language pairs with at least 10 translations), our section alignment database is 472MB. It used to be ~55MB.

This is a huge sqlite file to keep in a git repository. We also cannot push it to Gerrit because of the 100MB per-file limit (we get the error Object too large (495,554,560 bytes), rejecting the pack. Max object size limit is 104,857,600 bytes.).

This sqlite database has a single table with source language, target language, source title, and target title columns. Using it as part of the Node.js-based cxserver has been quite easy and performant enough so far; even at this large size, it is fast enough.
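For illustration, a lookup against this four-column layout could look like the sketch below. The in-memory rows are made up and the `lookup` helper is hypothetical, but the query shape matches the single-table design described above:

```python
import sqlite3

# Illustrative in-memory copy of the single-table layout; the real
# database has the same four columns but millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (source_language TEXT, target_language TEXT, "
    "source_title TEXT, target_title TEXT)"
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?)",
    [("en", "es", "History", "Historia"),
     ("en", "fr", "History", "Histoire")],
)

def lookup(conn, source_language, target_language, source_title):
    """Find the equivalent target-language section title, if any."""
    row = conn.execute(
        "SELECT target_title FROM titles WHERE source_language = ? "
        "AND target_language = ? AND source_title = ?",
        (source_language, target_language, source_title),
    ).fetchone()
    return row[0] if row else None
```

Because every lookup filters on (source_language, target_language), an index on those columns keeps this fast even at this size.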

A simple option I can think of is splitting this db into smaller db files by source language and conditionally loading them from the code that reads this database for section suggestions. There are 6,692,506 records in total; for source language en we have 1,452,320 records, and for es, 529,625 records. We extracted this from a 2.5GB sqlite db file that the research team provided.

@Ladsgroup, do you have any advice here? (Happy to explain the context if it is not clear.) Is there a hassle-free solution to place this db somewhere?

Yeah, putting it in Gerrit wouldn't work. It's going to bloat the whole repo and its history, and cloning it in the future can cause all sorts of issues for the server (from Apache to Gerrit's db) and the client. If it's data, it belongs in a database, not a VCS.

We don't have a sqlite db in production, but we can quite quickly set up a dedicated MariaDB database in the misc cluster (probably m5 or something like that) and give you credentials for it so you can connect from the cx service. Uploading the data can be done either by sftp to mwmaint or something else, like running a MediaWiki script that populates the db. Worst case, hand it over to us and we do the initial dance, but you will need to build something for future updates.

It's small enough for us to be able to accommodate your db request quickly. We can also have a quick call to go through the details.

m5 should be fine, but connections would need to go through the proxy (i.e. connect to m5-master.eqiad.wmnet).
What's the amount of reads/writes you'd expect?

Thanks @Ladsgroup and @Marostegui. I have placed the SQL dump of the titles table from the sqlite database here - (114 MB)
The whole sqlite db is also placed at (117.8MB)

There won't be any write operations from cxserver to this table; it is read-only from cxserver's side. Data population happens through scripts, which we might run once a year or so to update.
The read traffic is very low at this point, as the section translation feature that will use this db is deployed only to small wikis for logged-in users. Roughly 100 translations per week, but we expect this to grow as we expose the tool to more users. The data is used to suggest sections missing in articles and invite users to translate. This is shown in a dashboard. Here is an example API that uses this data
This is accessed only when a logged-in user on a small set of configured wikis accesses the Special:CX page - so very low traffic at this point in time
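As a rough sketch of how such mappings can drive missing-section suggestions: map each source-article section to its usual target-language title, then keep the ones the target article lacks. The mapping dict and `missing_sections` function below are illustrative, not the actual cxserver implementation:

```python
# Hypothetical in-memory stand-in for the section alignment data:
# (source_language, target_language, source_title) -> target_title.
MAPPINGS = {
    ("en", "es", "History"): "Historia",
    ("en", "es", "References"): "Referencias",
}

def missing_sections(src_lang, tgt_lang, source_sections, target_sections):
    """Suggest target-language sections that exist in the source
    article but are absent from the target article."""
    present = set(target_sections)
    missing = []
    for title in source_sections:
        mapped = MAPPINGS.get((src_lang, tgt_lang, title))
        if mapped is not None and mapped not in present:
            missing.append((title, mapped))
    return missing
```

A dashboard can then invite users to translate exactly the `(source, suggested target)` pairs this returns.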

We will need some changes in cxserver code to access this db and proxy configuration to connect. We will take care of that once DB is set up.

I certainly can put the data there, but I'll let Manuel create the database (or we can do it together so I'd be more confident doing it next time), and we will give you credentials so your tool can connect to it and query it.

Give us:

  • db username
  • db name
  • db grants

@Ladsgroup the db name can be cxserverdb. The others can be whatever you provide when setting up the DB in production.

@KartikMistry I've created the db on m5 and you can query it on m5-master.eqiad.wmnet, the username is cxserver and I put the password in your home directory in deploy1002.

I haven't moved the data there yet, will do it ASAP.

What are the grants needed for this?

I added the GRANT, it's SELECT only for now.

Let's add them to puppet in the m5.sql file (if that wasn't done already)

I'll do it once I've moved the data.

Change 798661 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Add cxserverdb grant

Change 798661 merged by Ladsgroup:

[operations/puppet@production] mariadb: Add cxserverdb grant

Before I move on to inserting the data: this schema

            CREATE TABLE titles
            (source_language TEXT,
            target_language TEXT,
            source_title TEXT,
            target_title TEXT,
            frequency INTEGER);

makes creating the index impossible:

root@db1107.eqiad.wmnet[cxserverdb]> CREATE INDEX title_source_target
    ->         ON titles (source_language, target_language);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes

We should probably use something smaller than TEXT; varbinary(255) is more than enough. Would that work?

(And/or we could use the lang code, which is much, much smaller and reduces the size of the db drastically, but that's your call.)

Oh, it's already the lang code; let's just go with varbinary(63), that's way more than enough.

The data is now there. A few insertions failed due to encoding issues, but it should be done now.

Thanks, @Ladsgroup!

Change 799338 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] WIP: Use hosted mysql database for section suggestion

Change 799338 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use hosted mysql database for section suggestion

Change 801663 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2022-05-31-111430-production

Change 801663 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2022-05-31-123738-production

Change 792598 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Section alignment db: Size optimizations

Change 805726 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2022-06-15-074244-production

Change 805726 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2022-06-15-074244-production