
[SPIKE][PLACEHOLDER] we need to estimate the effort required to migrate Similarusers' backend to Cassandra
Open, Needs Triage, Public

Description

This is a placeholder task that needs refinement (and possibly a better title).

Similarusers has ETL, lifecycle and access patterns that match the "generated datasets" use case, which might be a good fit for Cassandra.

The goal of this SPIKE is to identify which parts of the codebase need changes, and what volumes, throughput and QPS we
should size for.

Schema definition can be found at https://github.com/wikimedia/mediawiki-services-similar-users/blob/main/migrations/create.sql.

Datasets

The model generates 3 datasets, each stored in its own dedicated MariaDB table.

Volumes

June's ingestion run reported the following number of rows per dataset:

dataset        number of rows   size (GB)
Temporal       19932894         0.4
UserMetadata   8898300          0.6
Coedit         120067390        4

See https://phabricator.wikimedia.org/T286036 for details and history of runs.
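
For sizing purposes, the reported volumes can be turned into an approximate average row size per dataset. The sketch below is only a back-of-the-envelope calculation: it assumes 1 GB = 10**9 bytes (the task does not say whether GB or GiB is meant), and it measures the current storage footprint, which will not translate one-to-one to Cassandra's on-disk format.

```python
# Row counts and sizes (GB) as reported for the June ingestion run.
DATASETS = {
    "Temporal": (19_932_894, 0.4),
    "UserMetadata": (8_898_300, 0.6),
    "Coedit": (120_067_390, 4.0),
}

# Assumption: decimal gigabytes. Swap in 2**30 if the sizes are GiB.
BYTES_PER_GB = 10**9


def avg_bytes_per_row(rows: int, size_gb: float) -> float:
    """Average stored bytes per row for one dataset."""
    return size_gb * BYTES_PER_GB / rows


def totals(datasets: dict) -> tuple:
    """Total (rows, size in GB) across all datasets."""
    total_rows = sum(rows for rows, _ in datasets.values())
    total_gb = sum(size for _, size in datasets.values())
    return total_rows, total_gb


if __name__ == "__main__":
    for name, (rows, size_gb) in DATASETS.items():
        print(f"{name}: {avg_bytes_per_row(rows, size_gb):.1f} bytes/row")
    total_rows, total_gb = totals(DATASETS)
    print(f"total: {total_rows} rows, {total_gb:.1f} GB")
```

Coedit dominates both row count and size, so it is likely the dataset that drives capacity planning.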

Query pattern

At request time, data for a given user_id (key) is retrieved from MariaDB with lookup (SELECT) operations. The result sets are then joined
in memory by the service itself.
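
The lookup-then-join pattern can be sketched as follows. This is an illustration only, not the service's code: the three dictionaries stand in for the database tables, and every column name (`neighbor`, `overlap`, `num_edits`, `active_days`) is hypothetical rather than taken from create.sql.

```python
from typing import Any, Dict, List

# In-memory stand-ins for the three tables, keyed by user_id.
# All column names are hypothetical and do not reflect the real schema.
USER_METADATA = {"user_a": {"num_edits": 120}, "user_b": {"num_edits": 45}}
TEMPORAL = {"user_a": {"active_days": 30}, "user_b": {"active_days": 7}}
COEDIT = {"user_a": [{"neighbor": "user_b", "overlap": 12}]}


def similar_users(user_id: str) -> List[Dict[str, Any]]:
    """Fetch co-edit rows for user_id, then join the other datasets in memory."""
    results = []
    # Lookup 1: key-based read of the co-edit rows for the requested user.
    for edge in COEDIT.get(user_id, []):
        neighbor = edge["neighbor"]
        row = {"user": neighbor, "overlap": edge["overlap"]}
        # Lookups 2 and 3: key-based reads, merged into the row by the service.
        row.update(USER_METADATA.get(neighbor, {}))
        row.update(TEMPORAL.get(neighbor, {}))
        results.append(row)
    return results


if __name__ == "__main__":
    print(similar_users("user_a"))
```

Because every read is keyed and the join happens in the service, no server-side JOIN is needed, which is the property that makes a key-value style store a plausible candidate.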

Update schedule

Datasets are generated monthly and loaded in bulk. Ingestion is triggered manually around the 6th of the month (once source datasets are available in Hadoop). We currently throttle ingestion to approx 5000 rows/sec
to alleviate resource contention on the db.

Datasets are assumed read-only. At ingestion time, the previous MariaDB data is truncated.
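
A minimal sketch of a rate-capped bulk load follows, assuming a generic `write_batch` callable as the sink. The function names, the batch size and the injectable `clock`/`sleep` parameters are all hypothetical and are not taken from the existing ingestion scripts. `estimated_hours` gives a rough duration for a sequential load at a constant rate, and real runs may differ.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def throttled_load(
    rows: Iterable,
    write_batch: Callable[[List], None],
    max_rows_per_sec: float = 5000,
    batch_size: int = 1000,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write rows in batches, sleeping so the average rate stays under the cap."""
    start = clock()
    written = 0
    for batch in batches(rows, batch_size):
        write_batch(batch)
        written += len(batch)
        # Earliest point in time at which `written` rows are allowed.
        earliest = start + written / max_rows_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return written


def estimated_hours(total_rows: int, rows_per_sec: float = 5000) -> float:
    """Rough duration of a sequential load at a constant rate."""
    return total_rows / rows_per_sec / 3600
```

For example, `estimated_hours(120_067_390)` estimates the load time for a Coedit-sized table at the default rate. Treat the result as an order-of-magnitude figure, since the 5000 rows/sec rate is itself approximate.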

Service load

The service runs in Kubernetes in the staging, codfw and eqiad clusters.
Service metrics are available at https://grafana.wikimedia.org/d/ybN_naBMk/similar-users?orgId=1&from=now-72h&to=now&refresh=10s
The service does not appear to have received any requests recently.

SLOs/SLAs

We don't have explicit SLOs at this time. The ETL schedule allows for delays (up to a few weeks). During ingestion there is a 6-hour window
in which the service won't perform database lookups.