Update Turkish Wikipedia's labeling campaign for 2020
Open, Medium · Public

Description

The old 2016 campaign hasn't seen much activity. Let's update the campaign with a sample from 2020 and have people label it.

Make sure to ping back on the original discussion once this is done. See https://meta.wikimedia.org/wiki/User_talk:Halfak_(WMF)#ORES_in_Turkish

Event Timeline

Halfak created this task. · Jul 7 2020, 7:01 PM
Restricted Application added a project: artificial-intelligence. · Jul 7 2020, 7:01 PM
Restricted Application added a subscriber: Aklapper.
Halfak triaged this task as Medium priority. · Jul 13 2020, 4:36 PM
Halfak moved this task from Untriaged to New development on the Machine Learning Platform board.
Halfak added a subscriber: calbon.

@calbon this is the Wikilabels task we talked about at backlog grooming.

Halfak updated the task description. · Jul 20 2020, 1:57 PM
Halfak added a comment. · Sep 1 2020, 2:59 PM

@Evrifaessa, I've moved to a new job, so I'm not managing the backlog for ORES anymore. For now, @calbon is responsible for prioritizing tasks like this one.

I'm happy to help out when @calbon is ready to take this on though.

calbon added a comment. · Sep 1 2020, 8:20 PM

@Evrifaessa Heyo! I have taken over the team. Right now with Aaron leaving and other things we are down to Kevin and myself. But when some folks start again we can tackle this and other items in the backlog.

Hi @calbon

We are actively trying to develop a counter-vandalism bot at the moment, so it would be of tremendous help to us if this task could be given some precedence / priority in the backlog.

Halfak added a comment. · Sep 8 2020, 6:07 PM

Here's a query that gathers a random sample of 20k revisions from the last year: https://quarry.wmflabs.org/query/47980

The next step is to pull the results of this query into the https://github.com/wikimedia/editquality Makefile and run the autolabel utility. If the number of revisions with "needs_review": true is between 2k and 5k, we're good to go. Load those up into Wikilabels and we can go from there. If that number is too low or too high, we need to adjust the size of the incoming sample.
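The 2k to 5k check can be sketched in a few lines of Python. This is only an illustration: `check_review_count` is a hypothetical helper, not part of editquality, and the proportional rescaling assumes that the share of revisions flagged for review stays roughly constant as the sample size changes.

```python
# Hypothetical helper: the 2k-5k bounds come from the comment above; the
# proportional rescaling is an assumption, not something editquality does.
def check_review_count(needs_review, sample_size, low=2000, high=5000):
    """Return (ok, suggested_sample_size) for a given needs_review count."""
    if low <= needs_review <= high:
        # Within bounds: keep the current sample size.
        return True, sample_size
    if needs_review == 0:
        # No basis for rescaling; a larger sample has to be picked by hand.
        return False, None
    target = (low + high) // 2
    # Assume the needs_review rate is roughly constant across sample sizes.
    suggested = round(sample_size * target / needs_review)
    return False, suggested
```

For example, if only 1,000 of the 20k revisions needed review, this sketch would suggest trying a sample of about 70k.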

calbon added a comment. · Sep 9 2020, 8:56 PM

@Vito-Genovese Sounds good, I am going to put it to the front of the backlog and try to get to it this week or early next week.

@Halfak Thank you so much for your help as always.

@Halfak I followed these instructions and took the following steps:

  1. Got qrun_id: 495204 from the HTML source code of the query that you shared.
  2. Created this link that enables editquality to download the resultset programmatically.
  3. Replaced old resultset with new resultset in the Makefile.
  4. Renamed config trwiki datasets from *20k_2015 to *20k_2019 in the Makefile.

Based on the instructions, the next step would have been to run the autolabel utility. I couldn't run it because the Makefile uses the autolabel utility for other wikis but not for trwiki.

Please let me know how I can move forward from this point.

Thanks!

Nice work on the progress @kevinbazira!

I think a good next step to help us work together would be to get a PR up for the changes you are making in the Makefile. It'll be easier to reference parts of the Makefile that way.

So, trwiki was one of the first wikis we supported, so it's a bit weird. We'll want to borrow some configuration from the other wikis to see what to do here. In this case, fawiki is a good example to work from because, like trwiki, it has two separate labeling campaigns. Also, fawiki is one of the first wikis we worked on, so its config is weird in the same way! See https://github.com/wikimedia/editquality/blob/master/config/wikis/fawiki.yaml

In that config, you can see that only the 2016 sample (the newer one for fawiki) gets the "autolabeled_samples" treatment.

Based on this query I can see that the "trusted_groups" should include:

- sysop
- oversight
- bot
- rollbacker
- checkuser
- abusefilter
- bureaucrat
- flow-bot
- interface-admin
- interface-editor

If that doesn't make the next step clear, get what you have in a PR and we'll take it from there :)

@Halfak, wdym by trusted_groups? what is it used for? and why didn't you include the patroller group in that list?

@Evrifaessa

"trusted_groups" are user groups whose members' edits we don't want to waste your time reviewing. E.g., we can be reasonably sure that admins aren't vandalizing Wikipedia. Is that true for people who are given the Patroller right? Either way, we'll be asking you to review any edits by editors in these "trusted_groups" that were reverted, just in case there was some unintentional damage involved.
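As a rough illustration of the rule described here, the sketch below flags an edit for review when it was reverted or when its author is in none of the trusted groups. It is a simplification for discussion, not editquality's actual autolabel logic, which may apply further criteria; `needs_review` and `TRUSTED_GROUPS` are hypothetical names, and the group list is the one proposed earlier in this thread.

```python
# Group list as proposed earlier in the thread (before the patroller discussion).
TRUSTED_GROUPS = {
    "sysop", "oversight", "bot", "rollbacker", "checkuser", "abusefilter",
    "bureaucrat", "flow-bot", "interface-admin", "interface-editor",
}


def needs_review(user_groups, reverted, trusted_groups=TRUSTED_GROUPS):
    """Simplified rule: review reverted edits and edits by untrusted users."""
    # A user counts as trusted if they belong to at least one trusted group.
    trusted = bool(set(user_groups) & set(trusted_groups))
    return reverted or not trusted
```

Under this rule an unreverted sysop edit is skipped, while a reverted sysop edit and any edit by a user outside the listed groups are both sent for review.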

@kevinbazira

PR is in a good state. I just left the second round of notes. We should add "patroller" to the list of trusted groups if @Evrifaessa confirms that there's enough of a barrier to becoming a patroller that we can trust people with these rights.

Next we'll want to load the "needs_review": true data into Wikilabels. How many revisions in the sample contain that data value? You should be able to find out with some CLI fu like this: cat datasets/trwiki.autolabeled_revisions.20k_2020.json | grep '"needs_review": true' | wc -l

We'll want to pull the data into Wikilabels.
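The grep-based count depends on the exact spacing of `"needs_review": true` in the file. A slightly more robust alternative, sketched below, parses each line as JSON instead. It assumes the dataset holds one JSON object per line (which the grep/wc pipeline also relies on); `count_needs_review` is a hypothetical helper name.

```python
import json


def count_needs_review(lines):
    """Count JSON-lines records whose "needs_review" field is true."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines rather than failing on them.
            continue
        record = json.loads(line)
        if record.get("needs_review") is True:
            count += 1
    return count
```

It could be called as `count_needs_review(open("datasets/trwiki.autolabeled_revisions.20k_2020.json"))`, since an open file iterates line by line.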

@Halfak, we do not have a rollbacker group, so it'll be nonsense to add rollbackers to trusted_groups since we don't have any. Well, I don't see anything bad in including patrollers instead of rollbackers, because the patroller group kinda includes what rollbacker group does. Either way, in my very personal opinion, we can include patrollers in trusted_groups. I'd also like to see @Vito-Genovese's opinions for this.

Indeed. The Patroller user group at trwiki is a combination of 1) the original Patroller user group (introduced by Extension:Patroller), 2) the Editor user group (introduced by Extension:FlaggedRevs), and 3) the rollback right. So, they are certainly trustworthy.

@Halfak and @calbon, any updates for this task?

@kevinbazira where did we end up with this?

kevinbazira added a comment. · Edited · Thu, Sep 24, 6:02 PM

@calbon, I created this PR that added 2020 trwiki data configurations to editquality.

The next step was to move on to Wikilabels. Aaron, who was advising on this, has been traveling lately, as he mentioned at the end of the PR above.

We hope to proceed as soon as he is available.

Sorry for the delay. Just drove across a continent and I'm moving into a new house! I should be able to get back to supporting this task next week.

@Halfak , have you been able to get to this one?

Thanks for the ping. I do still have this on my todo list and I should be able to give Kevin the stuff he needs to get it done this week.

As promised, I've loaded the new labeling campaign. To document the process, I've included a summary of the actions I performed below.

First, let's connect to the wikilabels production VM in labs:

$ ssh wikilabels-02.eqiad.wmflabs

Make a backup of the database just in case you make a mistake.

halfak@wikilabels-02:~$ cd backups/
halfak@wikilabels-02:~/backups$ pg_dump -d u_wikilabels -U u_wikilabels -h wikilabels.db.svc.eqiad.wmflabs -W | gzip -c > ../backups/2020-10-21.sql.gz
Password:

You'll need the password for the database in order to perform this action. I get it from the local config. Note that I have censored the password below. You'll be able to see it when you are on the machine.

halfak@wikilabels-02:~/backups$ cat /srv/wikilabels/config/config/98-database.yaml 
# These credentials are intended to be used on labels.wmflabs.org.  They are
# sensitive and should never be commited to a public repository.
database:
  user: u_wikilabels
  dbname: u_wikilabels
  password: <password>

Now, let's create a new campaign. I run the script from the production code, using the production config. The new_campaign script does the heavy lifting.

halfak@wikilabels-02:~/backups$ sudo -u www-data /srv/wikilabels/venv/bin/wikilabels new_campaign trwiki "Değişiklik kalitesi (3,000 rastgele örnekleme, 2020)" damaging_and_goodfaith DiffToPrevious 1 50  --config=/srv/wikilabels/config/config/
{'active': True, 'created': datetime.datetime(2020, 10, 21, 14, 38, 51, 290336), 'tasks_per_assignment': 50, 'view': 'DiffToPrevious', 'name': 'Değişiklik kalitesi (3,000 rastgele örnekleme, 2020)', 'info_url': None, 'id': 96, 'form': 'damaging_and_goodfaith', 'labels_per_task': 1, 'wiki': 'trwiki'}

You can see that the output contains structured information about the campaign that was just created. The most important bit of information we need is the campaign ID. Here, you see 'id': 96. In this next command, we use the task_inserts script to insert all of the observations that contain "needs_review": true.

halfak@wikilabels-02:~/backups$ cat ../datasets/trwiki.autolabeled_revisions.20k_2020.json | grep '"needs_review": true' | sudo -u www-data /srv/wikilabels/venv/bin/wikilabels task_inserts 96 --config=/srv/wikilabels/config/config/
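If these steps were ever scripted, the campaign ID could be pulled out of the new_campaign output instead of being copied by hand. The sketch below uses a regular expression because the printed dict contains a `datetime.datetime(...)` value, which `ast.literal_eval` cannot parse. `extract_campaign_id` is a hypothetical helper, and the output format is assumed to match the example shown above.

```python
import re


def extract_campaign_id(output):
    """Extract the integer after 'id': in the printed campaign dict."""
    # Match the exact key 'id' (with quotes) so that keys such as
    # 'info_url' or values elsewhere in the dict are not picked up.
    match = re.search(r"'id':\s*(\d+)", output)
    if match is None:
        raise ValueError("no campaign id found in new_campaign output")
    return int(match.group(1))
```

The returned integer is the value to pass as the first argument to task_inserts.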

Now we're done! You can check our work at https://labels.wmflabs.org/ui/trwiki/

Hi @Halfak
Thanks for everything. ORES would be a huge time saver for us.

I started labeling but it shows edits of sysops. I thought we excluded sysops and patrollers.