
Prototype new models to facilitate sockpuppet detection
Closed, Resolved · Public



This task is scheduled to start in Q2. The preparation for it starts in Q1.

Event Timeline

leila created this task.
leila added a subscriber: TBolliger.

Requested a list of attendees and a meeting to sync up and align directions/expectations among Research, Scoring Platform, and Community Tech, prior to reaching out to potential external collaborators.

Notes from our exploratory call with Srijan and multiple WMF teams on August 4:

Summary of results from the meeting: there's generally support for this research. Next steps:

  • Srijan and I will follow up to plan for the start of the research in Q2 (September-December). I'm not sure if this is possible on Srijan's end but we will figure it out in the coming weeks.
  • We will start by understanding the current workflow for detecting sockpuppet accounts.

Srijan says:
"I am already working with Tilen, a visiting PhD student (just like I myself once was :)), on an algorithm to identify bad users on any platform, including Wikipedia. Initial experiments show that the algorithm performs well, including on a Wikipedia vandal identification dataset. The idea is to use it to find any type of bad user, including sockpuppets. I will send you some slides tomorrow so that you get a high-level overview.
The plan is to get the basic framework of the algorithm done before Tilen leaves, which is in late September, and then tune it specifically for Wikipedia after that."

He has also asked whether the tool will work on private data (I communicated that he should assume that's the case) and whether we can learn the details of the current process by which sockpuppets are detected. I created tasks for documentation on Meta (T172796) and for figuring out procedures (T172795).

@srijan Happy 2018! :)

I'm assigning this task to you as you're in charge of it. :) On our end, Dario will remain the point of contact. If you need my help at any point, just ping.

@leila Happy new year to you too!
Definitely, thanks!

Update (No action needed):

Srijan and I met today (meeting notes) and we discussed the state of this task. The task is on a very good track given its complexity; detecting sockpuppets is not an easy task. In the past months, the researchers have tried three models (A, B, and C under Model 1) and managed to raise the AUC from almost random (~0.5) to 0.72. Right now, they're working on Model 2. The biggest challenge at the moment is improving the speed of Model 2 for Wikipedia: because the model relies on every single edit, quite some work is needed to speed it up. Given the state of the model and the work left, the current estimate is that we'd be able to test the new model (hopefully with a much higher AUC) in May or June. This date may need an update if the results come out earlier or later. Let's not treat those dates as final in our operations, but that's the target.
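For context on the AUC numbers above: AUC is the probability that the model scores a randomly chosen true sockpuppet pair higher than a randomly chosen unrelated pair, so ~0.5 means random guessing and 1.0 means perfect ranking. A minimal, dependency-free sketch with made-up scores (none of these numbers come from the actual models):

```python
def auc(pos_scores, neg_scores):
    """ROC AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) score pairs that the model ranks correctly,
    with ties counting as half. 0.5 is random; 1.0 is perfect."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for confirmed sockpuppet pairs (positives)
# and unrelated account pairs (negatives).
positives = [0.9, 0.7, 0.4]
negatives = [0.6, 0.3, 0.2, 0.1, 0.05]
print(auc(positives, negatives))  # ~0.93 for these made-up scores
```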

Thank you for the update, @leila ! Excited to see the results of the new model.

Let me, @SPoore, and/or @PEarleyWMF know if you need help getting feedback from CheckUsers or other sockpuppet hunters. As a reminder, the Anti-Harassment team has committed to building a simple UI to help our users interface with the model, if needed.

@TBolliger thanks! How do you recommend we set aside time for your team to help? On the one hand, being able to test via the simple UI you refer to in May/June is very plausible given the state of the research; on the other hand, it's not 100% clear that the research will be ready. Is there some way you can set aside time on your team's end for this without locking resources completely until we know more?

I've created T189324: Build UI to validate sockpuppet model with users to track this work.

Our team works in 2-week sprints, so this task can interrupt us at any point. If the model is determined ineffective, we can close this ticket as declined. But (more likely and hopefully!) when the model is ready, we can set up a call or email to discuss further details (implementation & what we actually want users to validate).


I also wanted to mention that I've included this work in our Q4 goals:

Do you think Q4 is a reasonable timeframe? Or should it be Q1?

@TBolliger It's hard to commit to Q4 from the research perspective, as we may not be able to make it. It really depends on how the research goes (and I know this is deeply uncertain :/). You can call it out in Q4 goals as a stretch goal, or leave it for Q1, and if you get to it in Q4, we can still report it. Does this work for your workflow?

OK, we'll drop it for Q4. If things get way ahead of schedule we can still work on it :)

leila reopened this task as Open. Edited Mar 28 2019, 8:41 PM

Update time. I have sent the following status update and recommendation for next steps email to a few folks. Putting it here as well for visibility for others.

Summary: we have a feature-based model, based on public edit logs, that can predict whether two usernames are the same with ~65% performance. We will talk with checkusers through Trust and Safety to see how we can move forward with implementation, and also see whether they're interested in the model including features based on private data, which could enhance model performance.

Longer version:

  • The focus of the work has been on working with public edit data only. Access to private data can most probably improve the results since, for example, we would be able to tell whether the IP address or user agent of two accounts is the same. We intentionally decided to start with public data only.
  • Using only public edit data, we have two models: a simple feature-based model that can be easily scaled (~65% performance), and a deep learning model that is much more resource-intensive (~73% performance). The feature-based model is not great in terms of performance, but it is better than random (50%) and can be a good starting point, as it is simple and scalable. We expect that adding private data to it (if checkusers are interested) will enhance performance significantly.
  • What we're currently predicting is the probability that two usernames are the same. We can extend the model to: for a given username x, return a ranked list of all usernames predicted to be the same as x (with some condition to make the search space for pairs smaller; otherwise you have to check x against millions of usernames and compute probabilities for each, which is not scalable).
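To make the pairwise idea and the search-space condition concrete, here is a minimal sketch. All feature names, weights, and the blocking condition (only compare accounts that edited at least one common page) are hypothetical illustrations, not the actual model:

```python
def pair_features(a, b):
    """Hypothetical public-edit-log features for an account pair."""
    shared_pages = len(a["pages"] & b["pages"])
    hour_overlap = len(a["active_hours"] & b["active_hours"])
    return [shared_pages, hour_overlap]

def score_pair(a, b, weights=(0.15, 0.05)):
    """Toy linear score standing in for the trained classifier's
    probability that the two usernames are the same."""
    s = sum(w * f for w, f in zip(weights, pair_features(a, b)))
    return min(s, 1.0)

def ranked_candidates(target, accounts):
    """For a given username, rank all other accounts by predicted
    similarity -- after a blocking step that skips accounts sharing
    no edited pages with the target, to keep the pair space small."""
    scored = []
    for name, acct in accounts.items():
        if name == target["name"]:
            continue
        if not (target["pages"] & acct["pages"]):  # blocking condition
            continue
        scored.append((name, score_pair(target, acct)))
    return sorted(scored, key=lambda t: t[1], reverse=True)

accounts = {
    "A": {"name": "A", "pages": {"P1", "P2"}, "active_hours": {1, 2, 3}},
    "B": {"name": "B", "pages": {"P2"}, "active_hours": {2, 3}},
    "C": {"name": "C", "pages": {"P9"}, "active_hours": {10}},
}
# "C" shares no pages with "A", so blocking removes it before scoring.
ranking = ranked_candidates(accounts["A"], accounts)
```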

Recommendation: we work with checkusers to implement the feature-based model for them, considering their workflows. We then add the private data to the model as a set of features if checkusers are interested.
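The private-data features mentioned above could look like the following sketch. The field names are made up for illustration; they are not the actual CheckUser data schema:

```python
def private_features(a, b):
    """Hypothetical CheckUser-derived features for an account pair:
    whether the two accounts share an IP address or a user agent.
    Field names ("ips", "user_agents") are made up."""
    return [
        int(bool(a["ips"] & b["ips"])),                  # shared IP?
        int(bool(a["user_agents"] & b["user_agents"])),  # shared UA?
    ]
```

These booleans would simply be appended to the public feature vector before scoring, which is why the feature-based model is a convenient starting point.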

I'll follow up with Trust and Safety about this now as I will need to talk about this with checkusers to see what they think is the best way to move forward.

Status update:

  • We presented the current state of the two models at Wikimania 2019.
  • We prepared an email to checkusers with more info for testing and feedback, which PEarley will share with them in a couple of hours.

I expect the next couple of weeks to be spent gathering their feedback and understanding how/if iterations over the model are needed.

We have received the first feedback from a checkuser, and we will need to change one thing in the set of predictions. At the moment we include all accounts, including those that have not edited in the past 90 days, but this information is not actionable for checkusers: the other (sometimes private) information that is kept, and can be used, in these cases is only retained for 90 days. It makes sense to remove these predictions, or to give two outputs and let them filter by less-than-90-day edit activity or not.
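The "two outputs" option could be sketched as below. Field names and the exact retention window are assumptions for illustration:

```python
from datetime import datetime, timedelta

def split_by_recency(predictions, now, window_days=90):
    """Split predicted pairs into actionable (both accounts edited within
    the assumed ~90-day CheckUser data retention window) and stale, so
    checkusers can filter either way. Field names are hypothetical."""
    cutoff = now - timedelta(days=window_days)
    actionable, stale = [], []
    for pred in predictions:
        if pred["last_edit_a"] >= cutoff and pred["last_edit_b"] >= cutoff:
            actionable.append(pred)
        else:
            stale.append(pred)
    return actionable, stale
```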

We will create a new list and share with them.

leila removed subscribers: Tbayer, DarTar.

I reassigned the task to DED. We're still working with srijan on this. The reassignment makes it clearer on our end who we can poke for status updates.

@Niharika I'll assign it to Isaac and he can decide how to merge/update.

leila edited subscribers, added: DED; removed: TBolliger.

Weekly updates:

  • No feedback so far on tool -- looking into ways to reduce barriers to testing with checkusers
  • Started due diligence on making tool code public -- reached out to NK, PE, AS, LZ

Weekly update:

  • Still no feedback
  • Feedback collected from AS about making the code public, but PE requested several more days for discussion before making a decision

Weekly update:

  • Still no feedback
  • Waiting on decision around making code public. Will follow up next week.
  • Meeting with NK/EP to discuss productization in the meantime and we seem to have good agreement there.

Weekly update:

  • Still no feedback from Checkusers -- at this point, I believe the expectation is that we will productize it so they can access the tool directly, which should make it much easier for them to provide feedback.
  • Tool code has been moved to Gerrit:
  • I'm largely just playing a consultation role right now but really excellent progress on productization of the tool as being tracked here: T265722
  • I produced datasets of text diffs and which sections were edited by each user to explore with DD

@Isaac thank you for the update. When we meet next time to talk about the technical items, let's make sure we discuss roadmaps for this line of research and model (something you and I touched on a month ago; we thought there would be more clarity in December to act on it).

Weekly updates:

  • Continued support of productization
  • Regenerated data through all of November -- the whole pipeline took about 20 minutes start to finish, from collecting all the relevant edit history from the cluster to outputting the TSV files the tool uses.
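For illustration, the final step of such a pipeline could emit the tool's TSV input with the standard csv module. The column names and score format here are assumptions, not the tool's actual schema:

```python
import csv
import io

def write_predictions_tsv(rows, fh):
    """Write (user_a, user_b, score) prediction rows as a TSV file.
    Column names are hypothetical, not the tool's actual schema."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerow(["user_a", "user_b", "score"])
    for user_a, user_b, score in rows:
        writer.writerow([user_a, user_b, f"{score:.4f}"])

# Example with made-up usernames and a made-up score.
buf = io.StringIO()
write_predictions_tsv([("ExampleUserA", "ExampleUserB", 0.7312)], buf)
```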

When we meet next time to talk about the technical items, let's make sure we discuss roadmaps for this line of research and model

Sounds good - I will add to the agenda

I'm going to close this task out unless there are any objections -- my work on this has largely been complete for a while now and no issues have come up yet in the productization that would require serious rework of the approach (though plenty of improvements have been made to the stability of the prototype). Future tasks that we might open are:

  • Making updates based on Checkuser feedback
  • Further research into other types of data / modeling that could help inform the ranking.