
[SPIKE] Investigate building out automated data pipeline job for Similarusers service
Closed, ResolvedPublic

Description

In T270613, we implemented a PoC to see whether we could orchestrate a tool to assist users in finding similar users on Wikimedia projects.

In this task, we want to revisit that PoC, and based on it, deploy an automated data pipeline on the new platform_eng Airflow server.

Done

  • Review previously proposed architecture
  • Identify issues and challenges with building out job on Data Pipeline - update ticket with findings
  • Review service's storage and schema - consider if appropriate to port to Cassandra

Event Timeline

WDoranWMF renamed this task from Create a data pipeline for Similar Users to [SPIKE] Investigate building out automated data pipeline job for Similarusers service. Aug 30 2022, 2:06 PM
WDoranWMF updated the task description.

Context (mostly as notes for myself): A couple years ago, there was an effort to put together a solution to be able to generate a list of users based on their similarity to another. The idea is to produce supporting evidence when a user with checkuser privilege is determining whether an account is a 'sockpuppet' or not. More context at: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects, and on this epic T259471.

This solution, as of today, consists of:

  1. A manually triggered Jupyter notebook that creates the model. This model is available in a private repo at https://github.com/wikimedia/research-similar-users-model. This repo is private as it contains algorithm details (parameters, cut-off thresholds) that should be kept private to avoid abuse and adversarial attacks. The notebook uses a mix of PySpark and local computing to generate the final model, composed of 3 tables.
  2. A Python Flask-based application that serves an API to do the similarity checks. This code lives at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/similar-users/+/refs/heads/main. (A purely illustrative sketch of such an endpoint follows after this list.)
  3. A MariaDB database instance that is manually fed the output of (1) and is used by (2) to serve requests. This instance was acquired via T268505.
  4. A gerrit patchset PoC that attempts to automate running (1) with an Airflow DAG.
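
To make (2) a bit more concrete, here is a minimal sketch of what a similarity-lookup endpoint of that kind could look like. This is purely illustrative and not the actual similar-users service code: the table name (`user_similarity`), column names, route, and connection handling are all hypothetical.

```
# Illustrative only: NOT the actual similar-users service code.
# Table/column names and connection details are hypothetical.
from flask import Flask, jsonify, request
import pymysql

app = Flask(__name__)


def get_connection():
    # In practice, credentials would come from the service's configuration.
    return pymysql.connect(
        host="localhost",
        user="similarusers",
        password="change-me",
        database="similarusers",
        cursorclass=pymysql.cursors.DictCursor,
    )


@app.route("/similarusers/<user>")
def similar_users(user):
    limit = int(request.args.get("limit", 50))
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT candidate, similarity_score "
                "FROM user_similarity WHERE user = %s "
                "ORDER BY similarity_score DESC LIMIT %s",
                (user, limit),
            )
            rows = list(cur.fetchall())
    finally:
        conn.close()
    return jsonify(rows)
```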

Findings:

  5. Review of previously proposed architecture

A rearchitecture of the solution has been proposed at https://miro.com/app/board/uXjVOcODd8A=/. The main rationale seems to be to align the architecture of this project with the conventions adopted in recent development. This rearchitecture would take significant resources to implement. A more narrowly scoped improvement is to build on the work done in (4) to automate (1), and see whether we can put this automation into production.

  6. Issues and challenges with building out job on Data Pipeline

Automating the pipeline would require us to:
a) port (1) into a PySpark job in its own GitLab repo, to get better code visibility and so that we don't depend on doing notebook computation on the Airflow host;
b) rewrite some of the notebook code to run on Spark rather than locally;
c) ask a domain expert (a Research folk?) for code review of (b);
d) put the Airflow DAG together (a rough sketch follows below);
e) sync up with the previous developers of (2) to make sure we test the right things; and
f) test.

This is a bit of work. Based on previous porting efforts, it could be done in a sprint, allowing time for unexpected issues.
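
To make step (d) a bit more concrete, a DAG along these lines is roughly the shape I have in mind. This is only a sketch under assumptions: the dag_id, schedule, task ids, application path, and the choice of SparkSubmitOperator/PythonOperator are placeholders and don't reflect platform_eng Airflow conventions or the existing PoC patchset.

```
# Sketch only: ids, paths, and schedule are placeholders, not the
# actual platform_eng Airflow setup.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def load_into_mariadb(**context):
    # Placeholder: read the model tables produced by the Spark job and
    # load them (throttled) into the MariaDB instance from (3).
    ...


with DAG(
    dag_id="similarusers_model",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@monthly",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(hours=1)},
) as dag:
    build_model = SparkSubmitOperator(
        task_id="build_model",
        # The ported notebook job from steps (a)/(b); path is hypothetical.
        application="generate_similarusers_model.py",
    )
    load_db = PythonOperator(
        task_id="load_mariadb",
        python_callable=load_into_mariadb,
    )
    build_model >> load_db
```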

  7. Review service's storage and schema - consider if appropriate to port to Cassandra

Given the limited usage of the database (there are very few users with checkuser privilege), a database on a shared MariaDB instance is adequate. However, the current schema doesn't allow use of the API while the data is being refreshed, which can take ~8 hours due to throttling to avoid slamming the database. This is a compelling reason to redesign the way (2) accesses data. We could spike whether a delta update can be done instead. We could also spike whether having two sets of tables, an 'a' set and a 'b' set, and switching between them could eliminate the downtime while one of the sets gets updated. A port to Cassandra doesn't seem necessary in terms of performance or functionality, but it would be reasonable if we wanted to follow newer conventions as discussed in (5).
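
To illustrate the 'a'/'b' idea, one possible shape for the swap step is sketched below. The table names are hypothetical placeholders for the 3 model tables, and the only MariaDB behaviour it relies on is that a multi-table RENAME TABLE is atomic, so API readers see either the old set or the new set, never a partially refreshed mix.

```
# Sketch of an 'a'/'b'-style swap, assuming the refreshed model has been
# loaded into "<table>_staging" copies. Table names are hypothetical.
import pymysql

MODEL_TABLES = ["user_similarity", "user_metadata", "coedit_counts"]  # placeholders


def swap_in_staging_tables(conn):
    renames = ", ".join(
        f"{t} TO {t}_old, {t}_staging TO {t}" for t in MODEL_TABLES
    )
    with conn.cursor() as cur:
        # A multi-table RENAME TABLE is atomic, so the API sees either
        # the previous generation of tables or the new one, never a mix.
        cur.execute(f"RENAME TABLE {renames}")
        # Drop the previous generation once the swap has happened.
        cur.execute("DROP TABLE " + ", ".join(f"{t}_old" for t in MODEL_TABLES))
```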

Thanks for the overview @xcollazo. To the best of my knowledge, this captures the project status well.

I added some pointers to existing phab tasks in the comments below. Please do feel free to close / re-purpose them as you see fit.

> Context (mostly as notes for myself): A couple years ago, there was an effort to put together a solution to be able to generate a list of users based on their similarity to another. The idea is to produce supporting evidence when a user with checkuser privilege is determining whether an account is a 'sockpuppet' or not. More context at: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects, and on this epic T259471.

> This solution, as of today, consists of:
>
> 1. A manually triggered Jupyter notebook that creates the model. This model is available in a private repo at https://github.com/wikimedia/research-similar-users-model. This repo is private as it contains algorithm details (parameters, cut-off thresholds) that should be kept private to avoid abuse and adversarial attacks. The notebook uses a mix of PySpark and local computing to generate the final model, composed of 3 tables.

Github was chosen because gerrit/gitlab did not allow us to set up private repos (at the time).

> 6. Issues and challenges with building out job on Data Pipeline
>
> Automating the pipeline would require us to:
> a) port (1) into a PySpark job in its own GitLab repo, to get better code visibility and so that we don't depend on doing notebook computation on the Airflow host;
> b) rewrite some of the notebook code to run on Spark rather than locally;
> c) ask a domain expert (a Research folk?) for code review of (b);

cc / @Isaac - the original author of the model / data pipeline.
Some improvements were discussed in https://phabricator.wikimedia.org/T277548 (might require re-evaluation).

> 7. Review service's storage and schema - consider if appropriate to port to Cassandra
>
> Given the limited usage of the database (there are very few users with checkuser privilege), a database on a shared MariaDB instance is adequate. However, the current schema doesn't allow use of the API while the data is being refreshed, which can take ~8 hours due to throttling to avoid slamming the database. This is a compelling reason to redesign the way (2) accesses data. We could spike whether a delta update can be done instead. We could also spike whether having two sets of tables, an 'a' set and a 'b' set, and switching between them could eliminate the downtime while one of the sets gets updated. A port to Cassandra doesn't seem necessary in terms of performance or functionality, but it would be reasonable if we wanted to follow newer conventions as discussed in (5).

There was an initial spike (mostly around data volumes) started in the previous iteration of the Generated Data Platform value stream.
https://phabricator.wikimedia.org/T287274
And some historical record of data volumes:
https://phabricator.wikimedia.org/T286036

Needs validation: historically we had to pierce the firewall to allow access to MariaDB prod hosts from stat1004. See https://phabricator.wikimedia.org/T270196.
Effectively, this limited execution of the ingestion process to that host. I don't know if this constraint still applies.

Thanks for looping me in @gmodena !

Indeed, @xcollazo, if you intend to rewrite the notebook job, please don't hesitate to reach out to me with any questions about design/review/etc., as the person who built the initial prototype (and I'm happy to see this being streamlined)!

Thank you for the context @gmodena and @Isaac.

I believe a decision on next step has been reached on T309125.

Closing this spike as done.