**This is a placeholder task that needs refinement (and possibly a better title).**
The ETL, lifecycle and access patterns of Similarusers match those of the "generated datasets" category, which might be a fit for Cassandra.
The goal of this SPIKE is to identify which parts of the codebase need changes, and what volumes, throughput and QPS we
should size for.
# Datasets
The model generates 3 datasets, which are stored in 3 dedicated MySQL tables.
## Volumes
June's ingestion run reported the following number of rows per dataset:
| Dataset | Number of rows | Size (GB) |
| --- | --- | --- |
| Temporal | 19932894 | 0.4 |
| UserMetadata | 8898300 | 0.6 |
| Coedit | 120067390 | 4 |
See https://phabricator.wikimedia.org/T286036 for details and history of runs.
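
For sizing, the June figures can be turned into a rough average row size per dataset. The sketch below is only an estimate: it assumes the reported sizes are decimal gigabytes (10^9 bytes), which this task does not state, and it ignores index overhead as well as any storage overhead Cassandra would add.

```python
# Row counts and sizes copied from the June ingestion run reported above.
# Assumption: "GB" means 10**9 bytes; the task does not say GB vs GiB.
DATASETS = {
    "Temporal": (19_932_894, 0.4),
    "UserMetadata": (8_898_300, 0.6),
    "Coedit": (120_067_390, 4.0),
}


def avg_row_bytes(rows: int, size_gb: float) -> float:
    """Average size of one row in bytes, given a row count and a size in GB."""
    return size_gb * 10**9 / rows


total_rows = sum(rows for rows, _ in DATASETS.values())
total_gb = sum(size for _, size in DATASETS.values())

for name, (rows, size_gb) in DATASETS.items():
    print(f"{name}: ~{avg_row_bytes(rows, size_gb):.0f} bytes/row")
print(f"total: {total_rows} rows, {total_gb:.1f} GB")
```

Under these assumptions the rows are small (tens of bytes each), so row count rather than raw size is likely to dominate ingestion cost.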
## Query pattern
At request time, data for a given `user_id` (the key) is retrieved from MariaDB with lookup (SELECT) operations. The result sets are then joined
in memory by the service itself.
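
The access pattern described above can be sketched as one keyed SELECT per dataset followed by an in-memory merge. This is an illustration only: the table and column names are hypothetical (the real schema is not given in this task), and `sqlite3` stands in for MariaDB so the example is runnable on its own.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database. Schema and data are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE coedit (user_id INTEGER, neighbor_id INTEGER, overlap INTEGER);
        CREATE TABLE user_metadata (user_id INTEGER PRIMARY KEY, num_edits INTEGER);
        CREATE TABLE temporal (user_id INTEGER, hour INTEGER, num_edits INTEGER);
        CREATE INDEX coedit_user ON coedit (user_id);
        CREATE INDEX temporal_user ON temporal (user_id);
        """
    )
    conn.executemany("INSERT INTO coedit VALUES (?, ?, ?)", [(1, 2, 10), (1, 3, 4)])
    conn.executemany("INSERT INTO user_metadata VALUES (?, ?)", [(1, 120)])
    conn.executemany("INSERT INTO temporal VALUES (?, ?, ?)", [(1, 9, 30), (1, 14, 90)])
    return conn


def lookup(conn: sqlite3.Connection, user_id: int) -> dict:
    """Run one keyed SELECT per dataset and combine the results in memory."""
    coedit = conn.execute(
        "SELECT neighbor_id, overlap FROM coedit WHERE user_id = ? ORDER BY neighbor_id",
        (user_id,),
    ).fetchall()
    metadata = conn.execute(
        "SELECT num_edits FROM user_metadata WHERE user_id = ?", (user_id,)
    ).fetchone()
    temporal = conn.execute(
        "SELECT hour, num_edits FROM temporal WHERE user_id = ? ORDER BY hour",
        (user_id,),
    ).fetchall()
    # No SQL JOIN: the three result sets are merged by the caller.
    return {
        "user_id": user_id,
        "metadata": metadata,
        "coedit": coedit,
        "temporal": temporal,
    }


if __name__ == "__main__":
    print(lookup(make_demo_db(), 1))
```

If every read really is keyed on `user_id` with no cross-table SQL joins, `user_id` would be a natural candidate for the partition key in a Cassandra data model, though that needs to be confirmed against the actual queries in the codebase.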
## Update schedule
Datasets are generated monthly and loaded in bulk. Ingestion is triggered manually around the 6th of the month, once the source datasets are available in Hadoop.
We currently throttle ingestion to approximately 5000 rows/sec to alleviate resource contention on the database.
Datasets are assumed to be read-only. At ingestion time, the previous MariaDB data is truncated.
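
To make the throttling concrete, here is a minimal sketch of a rate-capped batch loader, plus a helper that estimates load duration from a row count. It is not the existing ingestion code: `throttled_load`, `write_batch` and the batch size are hypothetical names and values, and the 5000 rows/sec cap is treated as a single global limit, which may not match how the throttle is actually applied.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def batches(rows: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def throttled_load(
    rows: Iterable[Sequence],
    write_batch: Callable[[List[Sequence]], None],  # hypothetical sink, e.g. a bulk INSERT
    max_rows_per_sec: float = 5000,
    batch_size: int = 1000,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write rows in batches, sleeping so the average rate stays at or below the cap."""
    start = clock()
    written = 0
    for batch in batches(rows, batch_size):
        write_batch(batch)
        written += len(batch)
        # Earliest moment at which `written` rows may have been sent under the cap.
        earliest = start + written / max_rows_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return written


def estimated_hours(total_rows: int, rows_per_sec: float = 5000) -> float:
    """Duration of a single sequential load at a flat rate, in hours."""
    return total_rows / rows_per_sec / 3600
```

With the June row counts (148,898,584 rows in total), `estimated_hours` gives roughly 8.3 hours for a single sequential stream at a flat 5000 rows/sec. The real figure depends on whether the tables are loaded in parallel and on how the throttle is applied, neither of which is specified here.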
# Service load
The service runs in Kubernetes in the staging, codfw and eqiad clusters.
Service metrics are available at https://grafana.wikimedia.org/d/ybN_naBMk/similar-users?orgId=1&from=now-72h&to=now&refresh=10s
The service does not appear to have received any requests recently.
# SLOs/SLAs
We don't have explicit SLOs at this time. The ETL schedule allows for delays (up to a few weeks). We have a 6-hour window during ingestion
when the services won't perform database lookups.