
Gather labels as ground truth for section synonym detection
Closed, Resolved (Public)

Event Timeline

leila triaged this task as High priority. Jan 4 2018, 7:58 PM
leila created this task.
leila reassigned this task from leila to diego. Edited Feb 28 2018, 8:37 PM
leila moved this task from Backlog to In Progress on the Research board.
leila added a subscriber: Cervisiarius.

@diego I assigned this task to you as you're working on the method for finding/surfacing synonyms now. Feel free to assign it back to me or others as work progresses.

/me is super happy that we made it this far this quarter. \o/ Great job! :)

Please find the data to be labeled here: https://drive.google.com/drive/folders/1pzR3P16ck7FyrE7QgIpcSx1TPumTGA9u?usp=sharing

Those are candidates for synonyms, stratified by section-tfidf-similarity and fastText distance. For more details about the procedure, please check the code here: https://github.com/digitalTranshumant/wmf-interlanguage/blob/master/Synonyms.ipynb
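
For readers who do not want to open the notebook, here is a minimal sketch of what stratifying candidate pairs by two scores could look like. It is not the notebook's code: the function name `stratified_sample`, the tuple layout, the number of bins, and the assumption that both scores are normalized to [0, 1] are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, n_bins=4, per_stratum=2, seed=0):
    """Draw up to `per_stratum` pairs from each (tfidf bin, fastText bin) cell.

    `candidates` is a list of (sec_a, sec_b, tfidf_sim, ft_dist) tuples.
    Assumption: both scores are already normalized to the range [0, 1].
    """
    rng = random.Random(seed)

    def to_bin(value):
        # Clamp so that 1.0 lands in the last bin and negatives in the first.
        return min(max(int(value * n_bins), 0), n_bins - 1)

    strata = defaultdict(list)
    for pair in candidates:
        _, _, tfidf_sim, ft_dist = pair
        strata[(to_bin(tfidf_sim), to_bin(ft_dist))].append(pair)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Take everything when a stratum has fewer pairs than requested.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Sampling per cell rather than from the whole pool keeps rare combinations (for example, high TF-IDF similarity but large fastText distance) represented in the sheets given to labelers.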

@bmansurov: Now we need to upload the sheets, keeping only columns A (Sec_B) and B (Sec_B), and ask volunteers to tag each pair with one of these three categories: synonym, related, not related.

@diego I've created a spreadsheet and invited you, @leila, and Bob to edit. Let me know if you need any other help.

We (@leila and I) have updated the labels; we will now use: same, overlap, and different. We have translated these into Spanish and asked staff and the community for help translating the labels into the other 4 languages.
We have also added 3 columns to collect separate assessments in case reviewers disagree.
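
As a rough illustration of how the three reviewer columns could later be reduced to one ground-truth label per pair, here is a minimal sketch using a strict majority vote. The function name `aggregate` and the rule for ties (return no label) are assumptions for illustration, not something we have agreed on.

```python
from collections import Counter

# The three labels currently in use.
LABELS = {"same", "overlap", "different"}


def aggregate(reviews):
    """Combine reviewer labels for one pair into (label, unanimous).

    `reviews` holds the cell values of the reviewer columns; empty cells
    (None or blank strings) are ignored. Returns (None, False) when there
    are no votes or when no label has a strict majority.
    """
    votes = [r.strip().lower() for r in reviews if r and r.strip()]
    unknown = [v for v in votes if v not in LABELS]
    if unknown:
        raise ValueError(f"unexpected label(s): {unknown}")
    if not votes:
        return None, False

    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    # Require more than half of the votes, so 1-1-1 splits stay unresolved.
    if top_count * 2 > len(votes):
        return top_label, len(counts) == 1
    return None, False
```

For example, `aggregate(["same", "same", "overlap"])` gives `("same", False)`, while a three-way split gives `(None, False)` and can be flagged for discussion.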

@diego based on your latest updates, it seems we are not aiming to collect more labels for now, so I am resolving this task. Please re-open if you disagree.