
[SPIKE][PLACEHOLDER] we need to estimate the effort required to migrate Similarusers' backend to Cassandra
Open, Needs Triage, Public


This is a placeholder task that needs refinement (and possibly a better title).

Similarusers has ETL, lifecycle and access patterns that match the "generated datasets" use case, which might be a fit for Cassandra.

The goal of this SPIKE is to identify which parts of the codebase need changes, and what volumes, throughput and QPS we
should size for.

Schema definition can be found at


The model generates 3 datasets, each stored in a dedicated MySQL table.


June's ingestion run reported the following number of rows per dataset:

| dataset | number of rows | size (GB) |

See for details and history of runs.

Query pattern

At request time, data for a given user_id (key) is retrieved from MariaDB with lookup (SELECT) operations. The result sets are then joined
in memory by the service itself.
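
A minimal sketch of the lookup-then-join pattern described above, for reference while scoping the migration. Cassandra's CQL has no server-side joins, so a client-side join like this one would carry over. The column names (`neighbor_id`, `score`, `edit_count`) and the choice of an inner join are assumptions for illustration only; the actual schema and join logic live in the service code.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, on):
    """Inner-join two lists of dict rows on the column named `on`."""
    # Index the right-hand rows by join key so each left row is matched in O(1).
    index = defaultdict(list)
    for row in right_rows:
        index[row[on]].append(row)

    joined = []
    for row in left_rows:
        for match in index.get(row[on], []):
            # Merge both rows; left-hand values win on column name clashes.
            joined.append({**match, **row})
    return joined


# Hypothetical result sets, as two key lookups for one user_id might return them.
neighbors = [
    {"user_id": 1, "neighbor_id": 7, "score": 0.9},
    {"user_id": 1, "neighbor_id": 8, "score": 0.4},
]
metadata = [
    {"neighbor_id": 7, "edit_count": 120},
    {"neighbor_id": 9, "edit_count": 3},
]

result = inner_join(neighbors, metadata, on="neighbor_id")
```

Only rows present in both result sets survive the join, so `result` here holds a single merged row for `neighbor_id` 7.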

Update schedule

Datasets are generated monthly and loaded in bulk. Ingestion is triggered manually around the 6th of the month (once source datasets are available in Hadoop). We currently throttle ingestion
to approx 5000 rows/sec to alleviate resource contention on the db.

Datasets are assumed read-only. At ingestion time, previous mariadb data is truncated.
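
For comparison when evaluating a Cassandra loader, the throttling described above could be sketched as a paced batch generator. This is not the current ingestion code: `throttled_batches`, its `batch_size` default and the injectable `clock`/`sleep` parameters are hypothetical, and only the 5000 rows/sec figure comes from this task.

```python
import time
from itertools import islice


def throttled_batches(rows, rows_per_sec=5000, batch_size=1000,
                      clock=time.monotonic, sleep=time.sleep):
    """Yield lists of rows, pacing output to roughly `rows_per_sec`."""
    it = iter(rows)
    start = clock()
    sent = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Earliest time at which the rows sent so far are allowed to have gone out.
        due = start + sent / rows_per_sec
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        yield batch
        sent += len(batch)
```

A loader would iterate over `throttled_batches(source_rows)` and write each batch to the database. Pacing against the start time rather than sleeping a fixed interval keeps the average rate near the target even when individual writes are slow.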

Service load

The service runs in Kubernetes in the staging, codfw and eqiad clusters.
Service metrics are available at
The service does not appear to have received requests recently.


We don't have explicit SLOs at this time. The ETL schedule allows for delays (up to a few weeks). We have a 6-hour window during ingestion
when the service won't perform database lookups.
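
As a rough sizing aid, the throttle rate and the ingestion window can be combined into a back-of-the-envelope capacity estimate. This sketch assumes a steady 5000 rows/sec for the whole 6-hour window and that ingestion has to fit inside it; neither assumption is confirmed in this task, and the per-dataset row counts are not plugged in.

```python
ROWS_PER_SEC = 5000   # throttled ingestion rate quoted in this task
WINDOW_HOURS = 6      # window during which the service performs no lookups


def rows_in_window(rows_per_sec, hours):
    """Rows that can be loaded in `hours` at a steady `rows_per_sec`."""
    return rows_per_sec * hours * 3600


def hours_to_ingest(total_rows, rows_per_sec):
    """Hours needed to load `total_rows` at a steady `rows_per_sec`."""
    return total_rows / rows_per_sec / 3600


capacity = rows_in_window(ROWS_PER_SEC, WINDOW_HOURS)
print(capacity)  # 108000000
```

Under these assumptions the window holds about 108 million rows. `hours_to_ingest` gives the reverse estimate once the actual row counts per dataset are known.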