Gather labels as ground truth for section synonym detection
Closed, Resolved (Public)

Event Timeline

leila triaged this task as High priority.
leila reassigned this task from leila to diego. Edited Feb 28 2018, 8:37 PM
leila moved this task from Backlog to In Progress on the Research board.
leila added a subscriber: Cervisiarius.

@diego I assigned this task to you as you're working on the method for finding/surfacing synonyms now. Feel free to assign it back to me or others as work progresses.

/me is super happy that we made it this far this quarter. \o/ Great job! :)

Please find the data to be labeled here: https://drive.google.com/drive/folders/1pzR3P16ck7FyrE7QgIpcSx1TPumTGA9u?usp=sharing

Those are candidate synonym pairs, stratified by section TF-IDF similarity and fastText distance. For more details about the procedure, please check the code here: https://github.com/digitalTranshumant/wmf-interlanguage/blob/master/Synonyms.ipynb
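
For readers who don't want to open the notebook, here is a rough sketch of what stratifying candidate pairs by two scores could look like. It is not the code from the notebook: the function names, the field names (`tfidf_sim`, `ft_dist`), the equal-width bins, and the assumption that both scores lie in [0, 1] are all made up for illustration.

```python
import random
from collections import defaultdict


def bin_index(score, n_bins):
    """Map a score assumed to lie in [0, 1] to an equal-width bin index."""
    # Clamp so that out-of-range scores and exactly 1.0 land in a valid bin.
    return min(max(int(score * n_bins), 0), n_bins - 1)


def stratified_sample(pairs, score_keys, n_bins=3, per_stratum=1, seed=0):
    """Draw up to `per_stratum` pairs from every non-empty stratum.

    A stratum is the tuple of bin indices of the given scores, so two
    score keys produce a 2-D grid of strata.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        key = tuple(bin_index(pair[k], n_bins) for k in score_keys)
        strata[key].append(pair)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical candidate pairs; the scores are invented for the example.
candidates = [
    {"sec_a": "History", "sec_b": "Background", "tfidf_sim": 0.9, "ft_dist": 0.1},
    {"sec_a": "History", "sec_b": "Origins", "tfidf_sim": 0.8, "ft_dist": 0.2},
    {"sec_a": "Career", "sec_b": "Biography", "tfidf_sim": 0.5, "ft_dist": 0.5},
    {"sec_a": "References", "sec_b": "Climate", "tfidf_sim": 0.1, "ft_dist": 0.9},
]

picked = stratified_sample(candidates, ["tfidf_sim", "ft_dist"])
for pair in picked:
    print(pair["sec_a"], "/", pair["sec_b"])
```

The point of sampling per stratum rather than uniformly is that labelers see pairs from the whole score range, not only the many low-similarity pairs.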

@bmansurov: we now need to upload the sheets, keeping only columns A (Sec_B) and B (Sec_B), and ask volunteers to tag each pair with one of these three categories: synonym, related, not related.

@diego I've created a spreadsheet and invited you, @leila, and Bob to edit. Let me know if you need any other help.

We (@leila and I) have updated the labels; we will now use: same, overlap, and different. We have translated these into Spanish and asked staff and the community for help translating the labels into the other 4 languages.
We have also added 3 columns to collect separate assessments in case reviewers disagree.
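
As a possible way to turn the three assessment columns into a single label later on, here is a minimal sketch using a strict majority vote. Nothing in this task says how disagreements will actually be resolved, so the rule itself, the function name `resolve_label`, and the treatment of empty cells are assumptions.

```python
from collections import Counter

# The three labels agreed on in this task.
LABELS = {"same", "overlap", "different"}


def resolve_label(votes):
    """Return the label chosen by a strict majority of reviewers, else None.

    Empty cells (None or "") are ignored; None is returned when there are
    no votes or when no label has more than half of the votes cast.
    """
    cast = [v for v in votes if v]
    for v in cast:
        if v not in LABELS:
            raise ValueError(f"unknown label: {v!r}")
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    return label if count > len(cast) / 2 else None


# Hypothetical rows: one list of reviewer assessments per section pair.
print(resolve_label(["same", "same", "overlap"]))
print(resolve_label(["same", "overlap", "different"]))
```

Pairs that come back as `None` would still need a manual decision or another reviewer.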

@diego based on your latest updates, it seems we are not aiming to collect more labels for now, so I am resolving this task. Please re-open it if you disagree.