Change Details

**This is a placeholder task that needs refinement (and possibly a better title).**

Similarusers has ETL, lifecycle, and access patterns that match the "generated datasets" profile, which might make it a fit for Cassandra. The goal of this SPIKE is to identify which parts of the codebase need changes, and what volumes, throughput, and QPS we should size for.

The schema definition can be found at https://github.com/wikimedia/mediawiki-services-similar-users/blob/main/migrations/create.sql.

# Datasets

The model generates 3 datasets, which are stored in 3 dedicated MySQL tables.

## Volumes

The June ingestion run reported the following row counts and sizes per dataset:

| dataset | number of rows | size (GB) |
| --- | --- | --- |
| Temporal | 19932894 | 0.4 |
| UserMetadata | 8898300 | 0.6 |
| Coedit | 120067390 | 4 |

See https://phabricator.wikimedia.org/T286036 for details and history of runs.

## Query pattern

At request time, for a given `user_id` (key), data is retrieved from MariaDB with lookup (SELECT) operations. The result sets are joined in memory by the service itself.

## Update schedule

Datasets are generated monthly and loaded in bulk. Ingestion is triggered manually around the 6th of the month (once the source datasets are available in Hadoop). We currently throttle ingestion to approx. 5000 rows/sec to alleviate resource contention on the database. Datasets are assumed read-only. At ingestion time, the previous MariaDB data is truncated.

# Service load

The service runs in Kubernetes in the staging, codfw, and eqiad clusters. Service metrics are available at https://grafana.wikimedia.org/d/ybN_naBMk/similar-users?orgId=1&from=now-72h&to=now&refresh=10s

The service does not seem to have received requests recently.

# SLOs/SLAs

We don't have explicit SLOs at this time. The ETL schedule allows for delays (up to a few weeks). There is a 6-hour window during ingestion when the service won't perform database lookups.
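
As a rough sizing aid, the volumes and the throttled ingestion rate reported in this task can be combined into a back-of-the-envelope estimate of bulk load time. The sketch below is only illustrative: the row counts, sizes, and the 5000 rows/sec rate come from this task, while the function names and the assumption that tables are loaded sequentially at a constant rate are hypothetical and not taken from the service's code.

```python
# Row counts and sizes from the June ingestion run reported in this task.
DATASETS = {
    "Temporal": {"rows": 19_932_894, "size_gb": 0.4},
    "UserMetadata": {"rows": 8_898_300, "size_gb": 0.6},
    "Coedit": {"rows": 120_067_390, "size_gb": 4.0},
}

# Approximate throttled ingestion rate quoted in this task.
INGEST_ROWS_PER_SEC = 5_000


def total_rows(datasets):
    """Sum the row counts across all datasets."""
    return sum(d["rows"] for d in datasets.values())


def total_size_gb(datasets):
    """Sum the reported on-disk sizes (GB) across all datasets."""
    return sum(d["size_gb"] for d in datasets.values())


def ingestion_hours(rows, rows_per_sec):
    """Hours needed to load `rows` at a constant `rows_per_sec` (assumption)."""
    return rows / rows_per_sec / 3600


if __name__ == "__main__":
    for name, d in DATASETS.items():
        hours = ingestion_hours(d["rows"], INGEST_ROWS_PER_SEC)
        print(f"{name}: {d['rows']:,} rows, ~{hours:.1f} h")
    rows = total_rows(DATASETS)
    hours = ingestion_hours(rows, INGEST_ROWS_PER_SEC)
    print(f"Total: {rows:,} rows, {total_size_gb(DATASETS):.1f} GB, ~{hours:.1f} h")
```

Under these assumptions the three datasets total roughly 149 million rows and about 5 GB, and a sequential load at 5000 rows/sec would take on the order of 8 hours. Since the rate is only approximate, this should be read as an order-of-magnitude figure for capacity planning rather than a measured duration.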