- Re-implemented most of the code, but the training data and the user "embedding" pipeline are still missing.
- Gathering recent sock-puppet investigation outcomes for training.
- Building a new ground-truth dataset from archived SPI (Sockpuppet Investigations) reports.
- Code handover.
| Status | Assignee | Task |
|--------|----------|------|
| Resolved | DarTar | T171251 [Objective 3.1.2] Models for sockpuppet and toxic discussion detection |
| Resolved | Isaac | T171635 Prototype new models to facilitate sockpuppet detection |
| Resolved | DED | T236299 Port sock-puppet detection model in-house |
- Continued progress on building the model and preparing for the demo.
- Met with Amir and Niharika: we discussed the potential of integrating Amir's code, ethical considerations, and which features can be exposed or hidden.
- The first model is ready, but performance is relatively low (~60% AUC). It was trained on a subset of the English-language data. Computing all-time edit diffs remains a challenge for a wiki this large (see the sketch after this list).
- Ongoing work on tuning the model to improve the results.
- Started on itwiki as well, without sentiment analysis.
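
To illustrate why the all-time diffs are expensive: every revision has to be compared against its parent, so the cost grows with the wiki's full revision history. A minimal sketch, assuming revision texts have already been paired with their parents (that pairing is itself the expensive join at enwiki scale):

```python
# Minimal sketch with a hypothetical revision pair: extract the tokens a user
# added in one revision by diffing it against the parent revision. Repeating
# this for every revision in the history is what makes the all-time
# computation expensive on a wiki the size of enwiki.
import difflib

def added_tokens(parent_text: str, new_text: str) -> list[str]:
    """Tokens present in the new revision but absent from its parent."""
    diff = difflib.ndiff(parent_text.split(), new_text.split())
    return [tok[2:] for tok in diff if tok.startswith("+ ")]

# One revision pair as an example.
parent = "the quick brown fox"
new = "the quick brown fox jumps over the lazy dog"
print(added_tokens(parent, new))  # ['jumps', 'over', 'the', 'lazy', 'dog']
```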
OK, the point about using the complete user profile is valid. User activity across the different language wikis could provide additional features too. I'd be interested in seeing the metrics when you have them, thanks.
- I was finally able to process a large enough view of Wikipedia's history (2015 onwards). This window should line up with the SSO rollout, so user_text can serve as a unique ID across wikis (see the first sketch below).
- Transitioned to a new model based on word-level analysis so it can accommodate multiple wikis; I'll see what it is capable of. Basically, I gave up on sentiment analysis.
- Tested a new model that adds concept vectors and an interaction graph as features.
- The model (XGBoost) is now slightly harder to interpret but achieves a better AUC (75%); see the second sketch below.
- Refactored the data preparation code in Scala. The code is much more scalable and can regenerate the necessary training data in one day on our analytics cluster.
- Discussed the API endpoints and the potential deployment environment (ORES?) with the product team.
- Features => include talk pages, refine the interaction graph, and dig deeper into how the concept vectors are computed. Also, look into Amir's model.
- Code => refactor the training to run on the cluster. The results above were limited to 10% of users; there isn't enough memory on a single machine to hold the data for all 3M enwiki users.
- Train with "True Negatives" provided by Niharika.
- Talk pages are now included in the data.
- I generated a new contribution graph: a bipartite graph of users and wiki/talk pages, with edges weighted by the number of edits (see the sketch after this list).
- I tried multiple graph-mining algorithms on the contribution graph to detect "sub-communities". So far, these techniques either didn't improve performance or didn't scale to the data.
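
A minimal sketch of the contribution graph and one of the community-detection attempts, assuming (user, page, edit-count) triples are available. The data below is made up, and Louvain is named only as an example of the kind of algorithm tried:

```python
# Minimal sketch: build the bipartite contribution graph from (user, page,
# n_edits) triples, project it onto users, and look for sub-communities.
# On the real graph, this community-detection step is what struggled to scale.
import networkx as nx
from networkx.algorithms import bipartite, community

edits = [  # placeholder triples: (user, page, number of edits)
    ("user_a", "Page:Foo", 12), ("user_b", "Page:Foo", 7),
    ("user_b", "Talk:Foo", 3), ("user_c", "Page:Bar", 5),
]

G = nx.Graph()
users = {u for u, _, _ in edits}
pages = {p for _, p, _ in edits}
G.add_nodes_from(users, bipartite=0)
G.add_nodes_from(pages, bipartite=1)
G.add_weighted_edges_from(edits)  # weight = number of edits

# Project onto users: two users are linked if they edited the same page.
user_graph = bipartite.weighted_projected_graph(G, users)

communities = community.louvain_communities(user_graph, weight="weight", seed=0)
print(communities)
```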
@kaldari The current model is ready, at least as a first iteration. I am in the process of handing over the code and having someone test it internally. @Niharika may know more about the specifics of the deployment responsibilities; is this something you can help with?
Also, we have the same constraints that @Ladsgroup brought up.
I wanted to call folks' attention to this thread in case anyone wants to respond: https://lists.wikimedia.org/pipermail/wikitech-l/2020-August/093681.html
Thanks to @Ladsgroup for raising the question.