Page MenuHomePhabricator

Port sock-puppet detection model in-house
Closed, ResolvedPublic

Description

  • Re-implemented most of the code now but missing training data and "embedding" pipeline for users.
  • Gathering recent sock-puppet investigation outcomes for training.
  • Building a new ground truth dataset from archived SPI reports
  • Code handover

Event Timeline

leila triaged this task as High priority.Oct 23 2019, 5:35 PM
leila created this task.

@DED I know a lot of the "porting" is already done by you. Feel free to update the task title and description or resolve it and open a new one for your work in April-June 2020.

  • Re-implemented most of the code now but missing training data and "embedding" pipeline for users.
  • Gathering recent sock-puppet investigation outcomes for training.
  • Building a new ground truth dataset from archived SPI reports

@DED we could use this task to outline the overall work that needs to be done to port the model in-house. Can you list the next steps in the task description?

This is quite interesting, is there more information about this somewhere?
Disclosure: I'm part of the SPI team on enwiki.

This is quite interesting, is there more information about this somewhere?
Disclosure: I'm part of the SPI team on enwiki.

Yes, see https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects

Progress in building the model and updating the code.
Setup a timeline for deployment.

  • Continued progress in building the model and preparing for the demo.
  • Meeting with Amir and Niharika: We discussed the potential of integrating his code, ethical considerations, and the features that can be added/hidden.
  • First model is ready but with relatively low performance (~60% AUC). It was trained on a subset of the data in the english language. Calculating all-time edit diffs remains a challenge for such a large Wiki.
    • Ongoing work on tuning the model to improve the results.
  • Started with itwiki as well. without sentiment analysis.

Hi @DED, thanks for the updates, this is exciting to get the first pass results!

Though the 0.6 AUC is low, what is the performance (precision, recall, F1) on the sockpuppet class? Is the result above for enwiki?

Hello @srijan. I didn't compute these metrics. Basically, processing only parts of enwiki creates an incomplete fingerprint for any user. Unless my current effort in making a pass on the full data succeeds, I plan on sampling users to obtain targeted full edit history.

OK, the point about using the complete user profile is good and valid. The user activity across different language wikis can be additional features too. I'd be interested in seeing the metrics, when you have it, thanks.

updates:

  • I was finally able to process a large enough view of wikipedia history (from 2015 onwards). This should match with the SSO rollout to use the user_text as a unique id across wikis.
  • Transitioned to a new model based on word analysis to accommodate multiple wikis. I'll check what this is capable of. Basically, I gave up on sentiment analysis.

updates:

  • Tested a new model by adding concept-vectors and interaction graph.
  • The model is now slightly more difficult to interpret but achieves a better AUC (75%), using XGBoost.
  • Refactored the data preparation code in Scala. The code is much more scalable and can regenerate the necessary training data in 1 days on our analytics cluster.
  • Discussed with the product team the api endpoints and the potential env. for deployment (ORES ?)

Next steps:

  • features => include talk pages, refine the interaction graph, and dig a bit more into the computation of concept vectors. Also, look into Amir's model.
  • Code => refactor the training to run on the cluster. The above results were limited to 10% of the users. There isn't enough memory to fit 3M enwiki users-data in memory
  • Train with "True Negatives" provided by Niharika.

This is major update. thanks for all the work, especially addressing the scalability component as much as possible.

Great progress, thanks for the updates! Especially exciting to see a huge jump in AUC to 75% and scalability. Let me know how we at Georgia Tech can help.

Cheers,
Srijan

  • Talk pages are now included in the data.
  • I generate a new contribution graph. It's a bipartite of (users) and (wikipages/talkpages) with edit edges (weighted with the number of edits)
  • I tried multiple graph mining algorithms on the contribution graph to detect "sub-communities". So far, these techniques either didn't improve the performances, or the algos didn't scale to the data.

@DED - So how close is this task to "done"? Unfortunately, I can't parse the last remaining acceptance criteria: "Re-implemented most of the code now but missing training data and 'embedding' pipeline for users." Is this ready for Platform Engineering to start building an API around?

@DED - So how close is this task to "done"? Unfortunately, I can't parse the last remaining acceptance criteria: "Re-implemented most of the code now but missing training data and 'embedding' pipeline for users." Is this ready for Platform Engineering to start building an API around?

I have something similar, it's definitely not as good as @DED's model but it's useful. Depends on what do you want to use it for. I'm helping CUs in several wikis with this but I also have some ethical problems with publishing the code/data.

@kaldari The current model is ready, at least as a first iteration. I am in the process of handing over the code and have someone test it internally. @Niharika may know more about the specifics of the deployment responsibilities, is this something you can help with?
Also, we have the same constraints that @Ladsgroup brought up.

@DED - @calbon is going to be heading up the production integration (T259471), in collaboration with @Niharika (as the product owner), so please coordinate with him as well.

I wanted to call folks' attention to this thread in case anyone wants to respond: https://lists.wikimedia.org/pipermail/wikitech-l/2020-August/093681.html
Thanks to @Ladsgroup for raising the question

The initial porting of the model is done. Isaac will lead this work moving forward. I'll resolve this task.

leila updated the task description. (Show Details)