Gather labels as ground truth for section translation
Closed, Resolved · Public

Description
Labels collected through this task will be fed as ground truth to improve the models developed in T182211.

Languages
Based on the data we observed via T185160 and the series of discussions we've had so far, we've chosen the following list of languages: en, ar, fr, es, ru, ja

Interface and questions
Here is what we need the tool to do:

  • Get the user's username (somehow)

We used babel and user tables.

  • Give the user the option to choose which languages they want to translate into from the set of possible languages. (Note: their Babel template already tells us that; however, Babel templates may be out of date, so it's good to give the user the chance to choose more languages if they're comfortable with them.)

We'll be sending out talk page messages with links to users in their preferred language.

  • Show possible translations identified by the algorithm for each of the languages the user is interested to translate to.

Done as autocompletion in Google spreadsheet.

  • Provide an Other or None-of-the-Above option where the user can enter the section translation themselves, with auto-complete option over the list of available sections in that language.

Done as free text input in Google spreadsheet.

  • Send out talk page messages

Blockers

  • [Diego] Provide recommendations to Baha

Event Timeline

leila triaged this task as High priority. Jan 4 2018, 7:57 PM
leila created this task.
leila moved this task from Staged to In Progress on the Research board.
leila updated the task description.

@diego @bmansurov I updated the task description based on what we discussed last week.

bmansurov: please note that I have not specified how the "source" language is chosen. Let us know if you need our input there.

diego: per our discussion, we need the output of your algorithm for all these language (pairs) before we can push this out.

Both: we agreed on February 15 as the deadline by which this goes out. Before that, all pieces should come together, including testing the tool. :)

I've started creating a Django app here.

For posterity, we've opted for a simpler solution and decided to gather data using a spreadsheet.

Here's the script that extracts section titles from the JSON files (see above comments) and orders the results by rank.

@bmansurov In order to identify the best way(s) to reach out to the users, I suggest we start with a small circle of users we already know and ask them to participate (we may not have users in all pairs in this initial sample). These users can enter some translations and report any issue they may see. Then, we can widen the circle and ask more users to help.

Update: We eventually decided to go with collecting the labels using a spreadsheet. bmansurov made it and we just sent it out to 25 WMF staff members to help us with translations and reporting bugs/issues they can spot. The plan is to expand the invitation to community members in a few days and once we are sure what we have up is working and the task is clear.

Hi,

I filled like ~300 rows in es:en and I'm done for now :-) Some feedback:

  • sorting each tab is very helpful. I did that myself, but perhaps you could sort by default
  • there are some duplicate entries (they look the same, I'm not sure if case matters)

Example:

Fase final: Final round
Fase Final: Final round

Hi,

I filled like ~300 rows in es:en and I'm done for now :-)

Thank you! :)

Some feedback:

  • sorting each tab is very helpful. I did that myself, but perhaps you could sort by default

The rows were originally sorted by their frequency of occurrence in that Wikipedia language, but that's not mentioned anywhere. It's okay if you sorted your sheet using alphabetic order. We will keep the rest unchanged though. :)

  • there are some duplicate entries (they look the same, I'm not sure if case matters)

Example:

Fase final: Final round
Fase Final: Final round

Thanks for flagging this. Right now, we're not doing any processing to remove similar/same section titles. The reason that you see both of these is that in es they're both being used very frequently. I'm hoping that as a result of this labeling (and the next one, which will focus on similarity/synonym detection) we can catch this type of issue as well as others (Film and Movie are basically the same in en, for example). Of course, this is an easy fix that we hopefully don't have to put in front of the users to label and fix.
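As an illustration of the "easy fix" mentioned above, here is a minimal sketch of how section titles that differ only by letter case could be grouped before they reach the spreadsheet. The function name `group_case_variants` is hypothetical and this is not the script used in this task; it only assumes the titles are available as a list of strings.

```python
from collections import defaultdict


def group_case_variants(titles):
    """Group section titles that differ only by letter case or spacing."""
    groups = defaultdict(list)
    for title in titles:
        # Collapse runs of whitespace, then casefold for caseless matching.
        key = " ".join(title.split()).casefold()
        groups[key].append(title)
    # Keep only the keys that have more than one distinct spelling.
    return {key: spellings for key, spellings in groups.items()
            if len(set(spellings)) > 1}


titles = ["Fase final", "Fase Final", "Historia"]
variants = group_case_variants(titles)
```

This only catches case and spacing variants; synonyms such as Film and Movie would still need the similarity labeling described above.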

@bmansurov we're ready to send out the messages to the users whose data we exported. Given that the number of users we will contact is limited, email or talk page message is probably the best option. How much time will each of the two options take you to set up?

@leila, do you think we can send emails to users out of the blue? I'm not familiar with how this works. Maybe there's a setting that we need to check before sending emails. Also, we need to make sure our email doesn't end up in spam because we're sending it to multiple people at once. From your experience, how did you guys reach out to users in the past?

@bmansurov can you remind us how many users we're talking about here?

Re sending emails: we need Legal approval, and we need to have the emails translated to the native language of the user or at least a language the user can fluently speak. We also need to check that the user doesn't have the option for receiving emails from other users in Wikimedia unchecked.

We did this back in 2015 (either Ori or Yuvi helped us with it, I think Ori). Otherwise, we should go with talk page messages.

@leila, here are the numbers for the language pairs we're interested in:

language pair   # of users
ar-en           139
ar-es           43
ar-fr           113
ar-ja           8
ar-ru           42
en-es           5991
en-fr           2796
en-ja           203
en-ru           5324
es-fr           2796
es-ja           48
es-ru           236
fr-ja           43
fr-ru           399
ja-ru           55
total           18236

I'll first look into getting the email addresses of users who don't mind receiving emails (as indicated in their settings), because email seems to be our preferred way of contacting them. I'm not sure how long it will take me to do so; I'll update my status after a couple of hours of research. In the meantime, could we ping someone from Legal about this and start the process of translating the email message into the five languages (assuming the original is in English)?

Sending an email using the MediaWiki API seems straightforward: documentation. The API allows us to check whether a user is emailable. I think it will take me 3-4 hours to create a script that allows us to gather users by language and send them emails in their native language. I suppose we'll look at the highest proficiency a user has and use it as the language of the email text.
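A minimal sketch of the approach described above, under stated assumptions: it picks the email language from a user's highest Babel proficiency and builds the parameters for two MediaWiki API requests (`list=users` with `usprop=emailable`, and `action=emailuser`). The helper names and the ranking of "N" (native) above "5" are assumptions made for illustration; no request is actually sent, and obtaining a CSRF token and handling rate limits are left out.

```python
# Babel proficiency levels: "0" to "5", plus "N" for native speakers.
# Ranking "N" above "5" is an assumption made for this sketch.
PROFICIENCY_RANK = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "N": 6}

LANGUAGES = ("en", "ar", "fr", "es", "ru", "ja")


def pick_email_language(proficiencies, available=LANGUAGES):
    """Return the language with the highest proficiency, or None."""
    candidates = {
        lang: level
        for lang, level in proficiencies.items()
        if lang in available and level in PROFICIENCY_RANK
    }
    if not candidates:
        return None
    return max(candidates, key=lambda lang: PROFICIENCY_RANK[candidates[lang]])


def emailable_query_params(usernames):
    """Parameters for checking whether users accept email from other users."""
    return {
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),
        "usprop": "emailable",
        "format": "json",
    }


def emailuser_params(target, subject, text, csrf_token):
    """Parameters for a POST request that emails a single user."""
    return {
        "action": "emailuser",
        "target": target,
        "subject": subject,
        "text": text,
        "token": csrf_token,
        "format": "json",
    }


language = pick_email_language({"en": "3", "ar": "N"})
params = emailable_query_params(["ExampleUserA", "ExampleUserB"])
```

The usernames in the usage lines are placeholders. The parameter dictionaries could be passed to any HTTP client pointed at a wiki's `api.php` endpoint.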

@Tgr We'd like to email many users using the MW API, and I was wondering how we can request a removal of the rate limit. Or is there another way of sending an email to many users?

If it's the exact same email for everyone (or every language group), you could probably use @bd808's sendBulkEmails.php script.

Thanks, @Tgr!

@bd808, we have a list of users from various wikis and we'd like to send bulk emails in six different languages to subsets of these users. What would be the preferred way of running the script for this use case?

The man page I made for it gives an overview: https://www.mediawiki.org/wiki/Manual:SendBulkEmails.php. Actually reading the script should make most of what it is doing pretty obvious if you have other questions. You can run it from any server where mwscript works for your target wikis.

Thanks, @bd808! What are some of the servers where I can run mwscript from? I have access to stats machines, but I think you mean some other servers.

terbium.eqiad.wmnet would be the typical location for running one-off scripts like this. It can also be done from tin.eqiad.wmnet or the codfw deploy server (naos.codfw.wmnet). I don't remember if you need full deployer rights to run mwscript on terbium or if there is a lesser access right that can also do it. The place to start reading is https://wikitech.wikimedia.org/wiki/Production_shell_access and then maybe ping someone from RelEng on IRC to see if they can tell you which rights you'll need.

@bmansurov Thanks. For the languages with more than a couple of hundred people eligible, the best approach would be to stagger the emails. We send a batch of 50 or 100, wait a few days, and send more emails only if we need more help. We can start with the English batch, as we can have the text of the email available almost immediately. I need to run the text by Legal. I know we need an opt-out option.
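The staggering described above can be sketched as splitting the contact list into fixed-size batches that are sent one at a time. The function name and the generated usernames are hypothetical; the batch size of 50 comes from the comment above.

```python
def staggered_batches(usernames, batch_size=50):
    """Yield consecutive slices of the contact list, batch_size users each."""
    for start in range(0, len(usernames), batch_size):
        yield usernames[start:start + batch_size]


# Placeholder usernames for illustration only.
users = [f"User{i}" for i in range(120)]
batch_sizes = [len(batch) for batch in staggered_batches(users)]
```

Each batch would be contacted only after reviewing the response to the previous one.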

@Halfak are you using any email address for users to opt-out of labeling task requests? If you are, maybe we can collect the opt-out for this task as part of the same email and add it to the same database.

I think I'm missing some context here. People opt-in to labeling work by signing their name on a labeling campaign description page. E.g. https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality I then use this page and the associated talk page to communicate with them. If they wanted to opt out, they could just delete their name from the signup list and un-watch the page. So far, I'm not sure that has ever happened.

Contacting people via emails is a complex problem on-wiki because it is not open to oversight in the same way that public on-wiki activity is. The only reason I would imagine emailing to be better is if the content of a request is sensitive/private in nature.

I think I'm missing some context here. People opt-in to labeling work by signing their name on a labeling campaign description page. E.g. https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality I then use this page and the associated talk page to communicate with them. If they wanted to opt out, they could just delete their name from the signup list and un-watch the page. So far, I'm not sure that has ever happened.

Got you. I thought that your initial attempt to raise awareness about the labeling request may happen through email, but it seems it's not the case. You use other open ways to say that you're looking for labeling help. Makes sense, then you don't have to worry about opt-out.

Contacting people via emails is a complex problem on-wiki because it is not open to oversight in the same way that public on-wiki activity is. The only reason I would imagine emailing to be better is if the content of a request is sensitive/private in nature.

Right. We generally keep that option for experiments where if the information is shared more openly, the results of the experiment can be questionable. This is of course not one of those cases. Baha will look into messaging options on talk pages.

Thanks, halfak!

I've requested rights to send mass messages to talk pages.

Also requested creation of a repository for translating email texts, but was suggested to use Meta for this purpose.

I've been granted permission to post mass messages to talk pages. We're ready to post any time the content is ready.

@bmansurov Diego and I did one pass over this task and as we talked about it today, we want to start with en-* given that we have en instructions ready. As we were finalizing the instructions, we figured out that it's best to narrow down the editors we reach out to based on their edit counts in en and *. Can you give us this number for all the pairs that start with en?

@leila, here's the data you've asked for. It includes users' language proficiencies for the languages we're interested in (i.e. ar, en, es, fr, ja, ru), and their edit counts in those language wikis. (The GitHub repo has been updated to include the script for generating this data.)

Here's a sample of data:

username	ja_proficiency	jawiki_editcount	es_proficiency	eswiki_editcount	fr_proficiency	frwiki_editcount	ru_proficiency	ruwiki_editcount	en_proficiency	enwiki_editcount	ar_proficiency	arwiki_editcount
Χ					N				N		3	29
Pablo Tornielli			N	187		7			3		3	65
Passing.Stranger					4	234			4	32	4	24
Brahim-essaidi						4			3	6	4	4
Rida24											5	282
Sophiaeterna					N	4			4		5	4
Nordin far											5	344
Mouath14					N				N		N	1121
Sky xe									5		N	5027
Meriem Mach					N	151			4	58	N	967
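A minimal sketch of how rows in this layout could be narrowed down for one language pair, assuming tab-separated input with the column names shown in the header above. The thread does not say which criterion was used to rank users; scoring by the smaller of the two edit counts is an assumption made here for illustration, and `users_for_pair` is a hypothetical helper. The three sample rows reuse values from the data above, restricted to the en and ar columns.

```python
import csv
import io


def _to_int(value):
    """Convert an edit-count cell to int, treating empty cells as 0."""
    value = (value or "").strip()
    return int(value) if value else 0


def users_for_pair(tsv_text, lang1, lang2, min_edits=1):
    """Return (username, score) for users with a proficiency in both languages.

    The score is the smaller of the two edit counts (an assumed criterion).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    selected = []
    for row in reader:
        prof1 = (row.get(f"{lang1}_proficiency") or "").strip()
        prof2 = (row.get(f"{lang2}_proficiency") or "").strip()
        if not prof1 or not prof2:
            continue  # the user did not declare both languages
        score = min(_to_int(row.get(f"{lang1}wiki_editcount")),
                    _to_int(row.get(f"{lang2}wiki_editcount")))
        if score >= min_edits:
            selected.append((row["username"], score))
    # Most active users first.
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected


header = ["username", "en_proficiency", "enwiki_editcount",
          "ar_proficiency", "arwiki_editcount"]
rows = [
    ["Passing.Stranger", "4", "32", "4", "24"],
    ["Brahim-essaidi", "3", "6", "4", "4"],
    ["Meriem Mach", "4", "58", "N", "967"],
]
sample = "\n".join("\t".join(cells) for cells in [header] + rows)
ranked = users_for_pair(sample, "en", "ar")
```

Taking the minimum favors users who are active in both wikis rather than very active in only one.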

@bmansurov thanks for this. Diego and I spent some time on this and we have a way of listing the top n usernames we can reach out to first.

@diego a few things:

  • Can you share the top 5-10 usernames bmansurov can reach out to in this task based on our discussions the other day?
  • We need professional translation of instructions into all the languages. Can you check out https://office.wikimedia.org/wiki/Translation#Translation_firms and choose a company to reach out to and get quotes? Then you can check with Dario if we have budget for it. The cost will not be massive, and we shouldn't spend our time or volunteers' time on it.
  • Can you prepare a first draft of a message bmansurov can use to send to talk-pages? Please put it in a doc/etherpad and I'll do a pass.

@leila, I've contacted the translation firms; you're in cc.

@bmansurov, let's wait until @leila has done her pass over the invitation letter, and then we can use that letter to directly contact the people listed on the spreadsheet (all of them speak English).

@bmansurov the text at https://etherpad.wikimedia.org/p/InstructionsForSectionMapping is ready to go out. As we discussed, let's send a test message to Bob, Diego and myself on our enwiki talk pages.

Next steps after this:

  • Diego will have the talk page message as well as the text in Get Started translated.
  • Diego will gather the list of users that can be contacted (for non-{en,*} pairs).
  • Baha will send out the message to these users. (Baha, note that for Arabic, we should do one test with our own talk pages, as right-to-left may break a few things in the message that we may have to fix.)

Thank you both.

Turns out I need another right (in addition to sending messages) to create lists. I've requested it here. I'll send the test messages once I'm allowed to create lists.

@leila what should the topic of the message be?

@leila, are you by any chance an admin of a wiki? Turns out I need a special permission to create a list of users according to this. Requesting an admin right seems like a long process.

@bmansurov I am nobody around here. ;) Do you want to check with Quiddity to see what's the best way to handle it?

Thanks, @leila. A wiki admin helped me. I was able to send a message to you, @diego, and @Cervisiarius. Please check if everything's OK and I'll send that message to other users. The only issue I had was that I wasn't able to fill out a placeholder for usernames, so the message doesn't contain the username. I'll look around to see if I can find a solution.

For posterity, I was able to create a list of users by going to https://en.wikipedia.org/wiki/Wikipedia_talk:Mass_message_senders#Mass-mail_subscription_list_shells and moving one of the shells. Then I added the list of users to the newly moved page. Then I went to https://en.wikipedia.org/wiki/Special:MassMessage and added the moved page and other details to send mass messages.

@bmansurov, we need to change the spreadsheet permissions so that people who are not logged in can edit the document. Do you know how to do it?

Here's the list of users I'm sending the message to in about 30mins. Let me know if you want to change anything before I send the message.

Update: the message has been sent out to the users.

Thanks @bmansurov. Let's wait until tomorrow and see how it works. Then, I'll give you 30 more usernames.

Hi @bmansurov

Please find the list of people to be contacted here: https://docs.google.com/spreadsheets/d/1vmTvSFitmsbpFKagLVxR2c2VcY8mBIVRdd_KUa_cIc4/edit?usp=sharing

Here is the code that produces this list: https://github.com/digitalTranshumant/wmf-interlanguage/blob/master/PeopleToContact.ipynb

You can choose any of the languages to send the invitation (there is no preference between lang1 or lang2 columns)

I'm also sharing the Invitations and Instructions via email (subject: Translation / Wikimedia Foundation), and I've copy/pasted them here as well: https://etherpad.wikimedia.org/p/InstructionsForSectionMapping

I've sent the message to 25 more enwiki users. Since the way we were sending messages was wiki-specific, we'd have to go through the process for each wiki. Unfortunately, this doesn't work as I don't understand the other languages. For this reason we should probably leave these messages on user talk pages manually.

@diego, I'll also need the translated message subjects. In English it reads "Help request for mapping section titles".

@bmansurov:

en: Help request for mapping section titles
es: Solicitud de ayuda para mapear títulos de secciones
fr: Aidez-nous à associer des titres de section
ru: Просьба помочь с переводом заголовков разделов

I'll add the Japanese and Arabic ASAP. In the meantime, please send the invitations for all these 3 new languages (es,fr,ru)

I've manually posted messages to frwiki. ruwiki and eswiki are blocking me because they think I'm spamming user pages. I'll try again later. If that doesn't work, then someone else has to post those messages manually.

Any reason for doing this manually? Is there no way to automate this process?

Automation is specific to a wiki. For enwiki, I was able to automate the process, but for other wikis I haven't. See T184212#4150327 for more info.

Update on the stats. So far I've sent the messages to 107 users, and 58 more are remaining. Messages have been sent to users of all five languages. See peopleToContact.xls for details (please don't change it).

Here are the remaining languages:

JP: マッピング・セクションのタイトルのサポート依頼
AR: طلب مساعدة في فرز عنواين الأقسام

@diego I've sent out Arabic and Japanese messages. Total messages sent is 132 and the remaining is 33. Do you think we can wait and see how many responses we get before we try and resolve T184212#4171283?