
[SPIKE] Investigate building out automated data pipeline job for Similarusers service
Closed, ResolvedPublic

Description

In T270613, we implemented a PoC to see whether we could orchestrate a tool to assist users in finding similar users on Wikimedia projects.

In this task, we want to revisit that PoC, and based on it, deploy an automated data pipeline on the new platform_eng Airflow server.

Done

  • Review previously proposed architecture
  • Identify issues and challenges with building out job on Data Pipeline - update ticket with findings
  • Review service's storage and schema - consider if appropriate to port to Cassandra

Event Timeline

WDoranWMF renamed this task from Create a data pipeline for Similar Users to [SPIKE] Investigate building out automated data pipeline job for Similarusers service. Aug 30 2022, 2:06 PM
WDoranWMF updated the task description.

Context (mostly as notes for myself): A couple years ago, there was an effort to put together a solution to be able to generate a list of users based on their similarity to another. The idea is to produce supporting evidence when a user with checkuser privilege is determining whether an account is a 'sockpuppet' or not. More context at: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects, and on this epic T259471.

This solution, as of today, consists of:

  1. A manually triggered Jupyter notebook that creates the model. This model is available in a private repo at https://github.com/wikimedia/research-similar-users-model. This repo is private as it contains algorithm details (parameters, cut-off thresholds) that should be kept private to avoid abuse and adversarial attacks. The notebook uses a mix of PySpark and local computing to generate the final model, composed of 3 tables.
  2. A Python Flask-based application that serves an API to do the similarity checks. This code lives at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/similar-users/+/refs/heads/main. (A purely illustrative sketch of such an endpoint follows after this list.)
  3. A MariaDB database instance that is manually fed the output of (1) and is used by (2) to serve requests. This instance was acquired via T268505.
  4. A gerrit patchset PoC that attempts to automate running (1) with an Airflow DAG.
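
To make (2) a bit more concrete, here is a minimal sketch of what a similarity-lookup endpoint of that kind could look like. This is purely illustrative and not the actual similar-users service code: the table name (`user_similarity`), column names, route, and connection handling are all hypothetical.

```
# Illustrative only: NOT the actual similar-users service code.
# Table/column names and connection details are hypothetical.
from flask import Flask, jsonify, request
import pymysql

app = Flask(__name__)


def get_connection():
    # In practice, credentials would come from the service's configuration.
    return pymysql.connect(
        host="localhost",
        user="similarusers",
        password="change-me",
        database="similarusers",
        cursorclass=pymysql.cursors.DictCursor,
    )


@app.route("/similarusers/<user>")
def similar_users(user):
    limit = int(request.args.get("limit", 50))
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT candidate, similarity_score "
                "FROM user_similarity WHERE user = %s "
                "ORDER BY similarity_score DESC LIMIT %s",
                (user, limit),
            )
            rows = list(cur.fetchall())
    finally:
        conn.close()
    return jsonify(rows)
```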

Findings:

  5. Review of previously proposed architecture

A rearchitecture of the solution has been proposed at https://miro.com/app/board/uXjVOcODd8A=/. The main rationale seems to be to align the architecture of this project with the conventions adopted in recent development. This rearchitecture would take significant resources to implement. A more narrowly scoped improvement is to build on the work done in (4) to automate (1), and see whether we can put this automation into production.

  6. Issues and challenges with building out job on Data Pipeline

Automating the pipeline would require us to:
a) port (1) into a PySpark job in its own GitLab repo, to get better code visibility and so that we don't depend on doing notebook computation on the Airflow host;
b) rewrite some of the notebook code to run on Spark rather than locally;
c) ask a domain expert (a Research folk?) for code review of (b);
d) put the Airflow DAG together (a rough sketch follows below);
e) sync up with the previous developers of (2) to make sure we test the right things; and
f) test.

This is a bit of work. Based on previous porting efforts, it could be done in a sprint, allowing time for unexpected issues.
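
To make step (d) a bit more concrete, a DAG along these lines is roughly the shape I have in mind. This is only a sketch under assumptions: the dag_id, schedule, task ids, application path, and the choice of SparkSubmitOperator/PythonOperator are placeholders and don't reflect platform_eng Airflow conventions or the existing PoC patchset.

```
# Sketch only: ids, paths, and schedule are placeholders, not the
# actual platform_eng Airflow setup.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def load_into_mariadb(**context):
    # Placeholder: read the model tables produced by the Spark job and
    # load them (throttled) into the MariaDB instance from (3).
    ...


with DAG(
    dag_id="similarusers_model",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@monthly",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(hours=1)},
) as dag:
    build_model = SparkSubmitOperator(
        task_id="build_model",
        # The ported notebook job from steps (a)/(b); path is hypothetical.
        application="generate_similarusers_model.py",
    )
    load_db = PythonOperator(
        task_id="load_mariadb",
        python_callable=load_into_mariadb,
    )
    build_model >> load_db
```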

  7. Review service's storage and schema - consider if appropriate to port to Cassandra

Given the limited usage of the database (there are very few users with checkuser privilege), a database on a shared MariaDB instance is adequate. However, the current schema doesn't allow use of the API while the data is being refreshed, which can take ~8 hours due to throttling to avoid slamming the database. This is a compelling reason to redesign the way (2) accesses data. We could spike whether a delta update can be done instead. We could also spike whether having two sets of tables, an 'a' set and a 'b' set, and switching between them could eliminate the downtime while one of the sets gets updated. A port to Cassandra doesn't seem necessary in terms of performance or functionality, but it would be reasonable if we wanted to follow newer conventions as discussed in (5).
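
To illustrate the 'a'/'b' idea, one possible shape for the swap step is sketched below. The table names are hypothetical placeholders for the 3 model tables, and the only MariaDB behaviour it relies on is that a multi-table RENAME TABLE is atomic, so API readers see either the old set or the new set, never a partially refreshed mix.

```
# Sketch of an 'a'/'b'-style swap, assuming the refreshed model has been
# loaded into "<table>_staging" copies. Table names are hypothetical.
import pymysql

MODEL_TABLES = ["user_similarity", "user_metadata", "coedit_counts"]  # placeholders


def swap_in_staging_tables(conn):
    renames = ", ".join(
        f"{t} TO {t}_old, {t}_staging TO {t}" for t in MODEL_TABLES
    )
    with conn.cursor() as cur:
        # A multi-table RENAME TABLE is atomic, so the API sees either
        # the previous generation of tables or the new one, never a mix.
        cur.execute(f"RENAME TABLE {renames}")
        # Drop the previous generation once the swap has happened.
        cur.execute("DROP TABLE " + ", ".join(f"{t}_old" for t in MODEL_TABLES))
```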

Thanks for the overview @xcollazo. To the best of my knowledge, this captures the project status well.

I added some pointers to existing phab tasks in the comments below. Please do feel free to close / re-purpose them as you see fit.

> Context (mostly as notes for myself): A couple years ago, there was an effort to put together a solution to be able to generate a list of users based on their similarity to another. The idea is to produce supporting evidence when a user with checkuser privilege is determining whether an account is a 'sockpuppet' or not. More context at: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects, and on this epic T259471.

> This solution, as of today, consists of:
>
> 1. A manually triggered Jupyter notebook that creates the model. This model is available in a private repo at https://github.com/wikimedia/research-similar-users-model. This repo is private as it contains algorithm details (parameters, cut-off thresholds) that should be kept private to avoid abuse and adversarial attacks. The notebook uses a mix of PySpark and local computing to generate the final model, composed of 3 tables.

Github was chosen because gerrit/gitlab did not allow us to set up private repos (at the time).

> 6. Issues and challenges with building out job on Data Pipeline
>
> Automating the pipeline would require us to:
> a) port (1) into a PySpark job in its own GitLab repo, to get better code visibility and so that we don't depend on doing notebook computation on the Airflow host;
> b) rewrite some of the notebook code to run on Spark rather than locally;
> c) ask a domain expert (a Research folk?) for code review of (b);

cc / @Isaac - the original author of the model / data pipeline.
Some improvements were discussed in https://phabricator.wikimedia.org/T277548 (might require re-evaluation).

> 7. Review service's storage and schema - consider if appropriate to port to Cassandra
>
> Given the limited usage of the database (there are very few users with checkuser privilege), a database on a shared MariaDB instance is adequate. However, the current schema doesn't allow use of the API while the data is being refreshed, which can take ~8 hours due to throttling to avoid slamming the database. This is a compelling reason to redesign the way (2) accesses data. We could spike whether a delta update can be done instead. We could also spike whether having two sets of tables, an 'a' set and a 'b' set, and switching between them could eliminate the downtime while one of the sets gets updated. A port to Cassandra doesn't seem necessary in terms of performance or functionality, but it would be reasonable if we wanted to follow newer conventions as discussed in (5).

There was an initial spike (mostly around data volumes) started in the previous iteration of the Generated Data Platform value stream.
https://phabricator.wikimedia.org/T287274
And some historical record of data volumes:
https://phabricator.wikimedia.org/T286036

Needs validation: historically we had to pierce the firewall to allow access to MariaDB prod hosts from stat1004. See https://phabricator.wikimedia.org/T270196.
Effectively, this limited execution of the ingestion process to that host. I don't know if this constraint still applies.

Thanks for looping me in @gmodena !

Indeed, @xcollazo, if you intend to rewrite the notebook job, please don't hesitate to reach out to me with any questions about design/review/etc., as the person who built the initial prototype (and I'm happy to see this being streamlined)!

Thank you for the context @gmodena and @Isaac.

I believe a decision on next step has been reached on T309125.

Closing this spike as done.