
Set up the ml-cache clusters
Open, Stalled · Needs Triage · Public

Description

We need to add puppet config for the ml-cache nodes (3 nodes per cluster). The hosts will need:

  1. Some Redis instances for the online feature store (more info will likely come from T294434)
  2. Some Redis instances for the score cache (probably more or less what we do for ORES).

Event Timeline

One simple architecture could be something like the following:

online feature store

Since we'll likely use Feast, we'll need to run a Python client in Analytics land (Airflow, Kubernetes, etc.) to materialize data into Redis. We could have a simple master/replica setup between eqiad and codfw.

score cache

Unlike the feature store, we may want a different score cache in each DC. In this case, we could simply have one Redis instance (no replication) on every node.

How the clients are going to reach the various caches is something to discuss. I added a similar note in T294434#7725954

Feast (IIUC) allows specifying a single hostname + port endpoint, and we'll likely want the same for the score cache.

From the SRE point of view, it would be nice to be able to reboot any host with minimal impact on ongoing traffic. One thing to explore is https://github.com/twitter/twemproxy, which we already use for all the MediaWiki Redis shards. The idea is to have a local proxy on every client that knows all the Redis shards and distributes traffic accordingly (for example, using a hash function to decide which data goes where). Twemproxy received some new commits during the past months, but over the last couple of years SRE has tried to move away from it as much as possible (mostly in favor of MCRouter, though that is a Memcached proxy). Adopting this tool could be handy, but we should follow up with SRE first.
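To make the proxy idea concrete, here is a minimal sketch of the kind of consistent-hash distribution a proxy like twemproxy performs (twemproxy's "ketama" mode works in this spirit). Shard names are hypothetical placeholders, not real hosts; in production the proxy would hold the actual Redis host:port list.

```python
import bisect
import hashlib

# Hypothetical shard names; a real twemproxy config would list Redis host:port pairs.
SHARDS = ["ml-cache-a", "ml-cache-b", "ml-cache-c"]

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(shards, vnodes=64):
    """Build a consistent-hash ring with virtual nodes, so that removing one
    shard only remaps the keys that lived on it (instead of reshuffling all keys,
    as naive modulo hashing would)."""
    return sorted((_hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes))

def shard_for(ring, key: str) -> str:
    """Route a key to the first ring point at or after its hash (with wraparound)."""
    points = [p for p, _ in ring]
    idx = bisect.bisect(points, _hash(key)) % len(ring)
    return ring[idx][1]

ring = build_ring(SHARDS)
print(shard_for(ring, "enwiki:goodfaith:12345"))
```

The practical benefit for reboots: when one shard is taken out, keys owned by the surviving shards keep their placement, so only a fraction of the cache goes cold.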

As a reference for the score cache: ORES currently uses a master/replica setup in each datacenter. There is one Redis instance acting as master, and another replicating data from it via the native Redis protocol. ORES points write traffic at the master node; in case of failure, writes need to be moved to the replica host (via a Puppet change).

While reading https://feast.dev/blog/a-state-of-feast/, I started to wonder whether Cassandra could be an alternative to Redis for our use case, since it is supported by Feast. There would be several advantages:

  • Cassandra can work in clusters out of the box, so we wouldn't need any proxy to shard the data from the client side.
  • The clients can hit any node, and the query is routed to the right place by Cassandra itself.
  • Cassandra offers several replication topologies; for example, two clusters in separate DCs can replicate only certain keyspaces between them. It could be convenient for us to have cross-DC replication for the Feast data without having it for the score cache use case (which would, of course, still have intra-DC replication between nodes).

On paper it seems the best solution, but there are drawbacks:

  1. Higher latency than Redis, even if Cassandra can be tuned for reads and still offer good latency.
  2. Data needs to be structured in a certain way (keyspaces, tables, partition keys), since it is not a plain key-value store.
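The per-keyspace replication mentioned above can be sketched as two CQL `CREATE KEYSPACE` statements: one for a hypothetical Feast keyspace replicated to both DCs, one for a local-only score cache. Keyspace names and replication factors here are illustrative assumptions, not a final design; with a real cluster these strings would be executed via the cassandra-driver client.

```python
# Hypothetical keyspace names; DC names match the WMF sites (eqiad/codfw).
# NetworkTopologyStrategy lets each keyspace choose its own per-DC replication,
# which is what enables "replicate the Feast data, keep the score cache local".

feature_store_ddl = """
CREATE KEYSPACE IF NOT EXISTS feast
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'eqiad': 3, 'codfw': 3};
"""

score_cache_ddl = """
CREATE KEYSPACE IF NOT EXISTS score_cache
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'eqiad': 3};
"""

# With a real cluster, something like:
#   from cassandra.cluster import Cluster
#   Cluster(['ml-cache1001.eqiad.wmnet']).connect().execute(feature_store_ddl)
print(feature_store_ddl)
print(score_cache_ddl)
```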

@Eevans Hi! I'd need some help with the initial config of the Cassandra ML cluster if you have time :)

We have three nodes in eqiad and three in codfw (2x2TB SSDs, 256G of RAM, 48 vcores per node), and the cluster will support Lift Wing, our Kubernetes cluster running KServe that will replace ORES in the near future. Lift Wing only serves models via an API; no model training is involved.

We have two use cases for it:

  1. Score cache - basically what ORES currently does with Redis, namely saving a model score to avoid recomputation (a certain wiki rev-id is evaluated with an ML model, a potentially expensive operation, and a score about good/bad faith etc. is returned).
  2. Online feature store - this is a more complex use case: we'll use https://feast.dev/ to periodically load feature datasets from Hadoop into Cassandra (to support fast feature retrieval, etc.). Since Hadoop is only in eqiad, we'll probably load data into the eqiad Cassandra cluster and set up a network topology for the related keyspace to replicate data to codfw (let me know whether this is possible).

For maintainability and flexibility, a three-node Cassandra cluster in each DC seems a good compromise, compared to a Redis sharding setup or a Redis master/replica (which was never deployed at the WMF, IIRC). If what I wrote above makes sense, how many instances should we configure? The two SSDs in each node will be in RAID 1, so I guess that a single instance should be enough, or do you suggest more?

Thanks in advance :)

Plan after a chat with Eric and Filippo: let's start with one instance per node, and then we can experiment with keyspaces and data access patterns. If we need more instances later, we can easily rebuild the cluster.

Change 793707 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cassandra::single_instance: add target_version

https://gerrit.wikimedia.org/r/793707

Change 793714 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad

https://gerrit.wikimedia.org/r/793714

Change 793707 merged by Elukey:

[operations/puppet@production] profile::cassandra::single_instance: add target_version and rack

https://gerrit.wikimedia.org/r/793707

To keep the archives happy - I am having a chat with Eric over email about this cluster and its future usage. The AQS Cassandra cluster should become a multi-tenant/multi-DC cluster able to support various use cases, so we need to decide whether ml-cache is a valid use case for a standalone cluster or not.

I have updated https://gerrit.wikimedia.org/r/793707 with Eric's suggestion, I'll wait for his review before proceeding further.

Reporting here some details about what Eric and I discussed via email. We are basically trying to implement two use cases:

  1. Score cache/storage - we want to store the results of model predictions (for example, from the ORES models) in Cassandra, since they are very expensive to compute. An ORES model predicting whether a revision on a given wiki is damaging can take 500+ms to produce a result, so we want somewhere faster to store model-version/score-result tuples (and other metadata) to reduce the latency. ORES uses a single Redis instance at the moment, but we wanted something more flexible to manage (node reboots, upgrades, load balancing, etc.), and given the next use case we thought Cassandra was a good fit. This looks like a plain caching use case, I know, but Data Engineering is interested in having something to query for historical predictions over time, so we could think about keeping scores indefinitely (to enable comparisons across model versions, etc.).
  2. Online feature store - this is more complicated and we are still trying to wrap our heads around it; every company does it a different way and there are few open source options. The idea is to have a place where feature datasets computed in Data Engineering land (maybe via Spark, etc.) can be loaded into Cassandra periodically to support the Kubernetes ML Serve cluster (basically the replacement of ORES for getting ML predictions from deployed models). https://feast.dev/ supports Cassandra, probably self-managing a keyspace; there is a Python client to load data from $somewhere (Hadoop, Hive, Spark, etc.) and store it in Cassandra. My idea was to have the Feast keyspace replicated to codfw transparently, to avoid Data Engineering having to load the data twice (to both the eqiad and codfw ml-cache clusters). This may be a use case for the new AQS, but it would need to be discussed.
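The online-store contract described above can be sketched in plain Python: a batch job (Spark/Airflow in reality) periodically materializes feature rows keyed by entity, and the serving path does a single low-latency lookup per request. All names here (the feature view, entity ids, field names) are hypothetical illustrations, not the actual Feast API.

```python
# Stand-in for a Cassandra (or Redis) keyspace that Feast would manage.
online_store = {}

def materialize(feature_view: str, rows: list) -> None:
    """Batch-load feature rows from the offline store (e.g. Hadoop) into the
    online store, overwriting any previous values per entity."""
    for row in rows:
        key = (feature_view, row["entity_id"])
        online_store[key] = {k: v for k, v in row.items() if k != "entity_id"}

def get_online_features(feature_view: str, entity_id: str) -> dict:
    """Point lookup at serving time, e.g. from a model server on Lift Wing."""
    return online_store.get((feature_view, entity_id), {})

# Periodic load (in reality scheduled, reading from Hadoop):
materialize("revision_features", [
    {"entity_id": "enwiki:12345", "user_edit_count": 1532, "page_length": 20480},
])

# Serving-time read, just before model inference:
print(get_online_features("revision_features", "enwiki:12345"))
```

The key property this illustrates: writes are bulk and periodic, reads are single-key and latency-sensitive, which is what makes a wide-column store with a simple partition key a plausible backend.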

As discussed with Eric, I tried to get more info about the actual state of implementation of the Cassandra connector in Feast.

https://github.com/feast-dev/feast/pull/475 seems to be an old attempt to add support that went nowhere.

https://github.com/feast-dev/feast/pull/1875 seems to be a second attempt, made by Astra, which ended up in https://pypi.org/project/feast-cassandra. IIUC, Feast seems more inclined to keep external connectors in separate repos.

https://awesome-astra.github.io/docs/pages/tools/integration/feast/ contains more info, especially about the Cassandra setup. IIUC, Feast wants a keyspace and an admin account for it, and self-manages the schema.

The next step is to figure out whether our use cases could be onboarded on AQS (avoiding a new cluster) or whether the ml-cache cluster is needed (maybe with a different name, etc.).

For ML, use case 1) is more pressing since it is a requirement for our new Lift Wing cluster MVP. Use case 2) is more medium/long term, but we'd like to start working on it asap.

@lbowmaker Hi! I had a chat with Eric about the use cases described above, and if you have some time I'd like to understand whether any of them could be onboarded on AQS (or whatever its future name will be :).

To recap, my team is currently thinking of using Cassandra for two use cases:

  1. Score cache - we are moving the ORES (ores.wikimedia.org) infrastructure from an ad-hoc setup to a more standardized Kubernetes one. ORES is a long-standing project offering ML models that "score" MediaWiki articles/revisions/etc. to help bots and automated tools fight vandalism (among other things, but this is one of the biggest use cases). The ML models are a little slow - scoring a revision or an article may take 500/600ms - so we'd like to store those computations in Cassandra to avoid recalculating them every time (this is what ORES currently does as well, but with Redis on dedicated hardware, etc.).
  2. Feature store with https://feast.dev. This is a more long-term and complicated use case; we still don't have a clear idea of what we'll do, and we'd need to experiment before taking a final decision. The feature store should be a fast datastore for feature datasets, basically the input parameters of our models. Feature stores are usually split into two parts: online and offline. The offline part is a more batch/throughput-oriented use case, where a lot of data is needed to train models (we are not there yet, so it is not important for us). The online part provides low-latency access to features for models that, as in our case, back an API or similar. The aforementioned Feast feature store has a plugin for Cassandra, self-managing its own keyspace.

We don't have a clear idea about access patterns for the second use case, but the first one should be relatively straightforward, mostly similar to a fast cache/store. The ML team's idea was to use six nodes (three in eqiad and three in codfw) to create two separate Cassandra clusters, and onboard both use cases on them to start testing. After a chat with Eric, IIUC we may use AQS for the score cache use case, and possibly a separate cluster for the feature store one. What do you think? Any suggestions/advice?

Change 793714 merged by Elukey:

[operations/puppet@production] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad

https://gerrit.wikimedia.org/r/793714

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS buster completed:

  • ml-cache1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206151415_elukey_1755432_ml-cache1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

I tried to run the Cassandra "a" instance on ml-cache1001 and got an error while starting it, since the TLS truststore was not present on disk. I followed https://phabricator.wikimedia.org/T307798#7943863 to generate TLS certs and committed them to the private repo. Not sure whether the TLS settings are mandatory now, but we'll want them anyway, so better to add them sooner rather than later.

Change 806167 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add stub cassandra tls secrets for the ml-cache cluster

https://gerrit.wikimedia.org/r/806167

Change 806167 merged by Elukey:

[labs/private@master] Add stub cassandra tls secrets for the ml-cache cluster

https://gerrit.wikimedia.org/r/806167

Change 806168 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_cache::storage: add TLS settings for Cassandra

https://gerrit.wikimedia.org/r/806168

Change 806168 merged by Elukey:

[operations/puppet@production] role::ml_cache::storage: add TLS settings for Cassandra

https://gerrit.wikimedia.org/r/806168

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS buster completed:

  • ml-cache1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206160932_elukey_1906707_ml-cache1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS buster completed:

  • ml-cache1002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206160921_elukey_1905772_ml-cache1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
root@ml-cache1001:/var/log/cassandra# nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.64.134.8   85.43 KiB  256          65.1%             3b22afff-bc1d-4cbb-a64e-ef5fa2b94ca6  d_f
UN  10.64.130.9   70.87 KiB  256          66.9%             3efa2c48-f241-4b43-8600-b0fe094b415b  a_e
UN  10.64.32.186  85.43 KiB  256          68.1%             7e965dbe-eeee-4ba9-8a5e-5d370df6f53a  b_c

ml_cache eqiad cluster up!

Next step is to bootstrap the codfw cluster, and then we should be done. We should try to figure out if we can use Bullseye and not Buster though.

The reason the AQS*/restbase* Cassandra clusters are still on Buster is because Restbase hasn't been adapted to a more recent Node.js. Since this isn't an issue for ml-cache, Bullseye is most probably fine?

https://cassandra.apache.org/doc/latest/cassandra/new/java11.html mentions that Cassandra only supports Java 11 fully as of 4.0.2, but that's also not an issue because Java 8 is still being provided via the component/jdk8 component for Bullseye (and when we upgrade to Cassandra 4 in the future that can go away).

We'd need Bullseye repos that include Cassandra and jvm-tools (easy enough), and a cassandra-tools-wmf package ported to Bullseye (the current package isn't installable because of a missing dependency on pyyaml).

Disclaimer: There may be other nits I am not aware of.

elukey changed the task status from Open to Stalled. Jun 20 2022, 8:11 AM

Created T310980 to track the effort to get Cassandra on Bullseye :) Let's wait for that task before continuing with the codfw cluster!

Adding some notes from meeting on 6/21 with Chris Albon, Eric Evans, Luca Toscano, Lukasz Sobanski, Matthew Vernon and Luke Bowmaker.

1. Score Cache

  • Confirm this is preemptive (are the ORES scores calculated for every revision create event or is it on demand?)
  • Need to understand size of data and growth (schema should be simple, something like: model, score, wiki, article)
  • Need to understand data retention - how long do we keep scores - just keep latest?

@elukey - if an event on mediawiki.revision-create triggers the calculation of ORES then it looks like in May the average was around 1.5M events per day. Here is the query if you want to play around with this in the Analytics/Hive tables.

SELECT count(*), year, month, day
FROM event_sanitized.mediawiki_revision_create
WHERE year = 2022 AND month = 5
GROUP BY year, month, day

Hi, yesterday I had a meeting with @diego and @MunizaA in the Research Team. We're currently studying ORES model usage from the scores stored in event_sanitized.mediawiki_revision_score, which I think is related to this thread. Here is the query output listing the number of scores produced by ORES models in 2022.
{F35262964}

The damaging/goodfaith models are the most used, at around 136M scores. The itemquality/itemtopic models, designed for Wikidata, were at around 105M. The usage rate is associated with the number of wikis supported by each model; e.g. the damaging/goodfaith models support more wiki projects (36) than the articlequality model (13).

Change 808907 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add configuration for the ml-cache codfw Cassandra cluster

https://gerrit.wikimedia.org/r/808907

@lbowmaker hi! I reviewed https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway and I am wondering whether the score cache use case would fit. Our idea, for the moment, is to have the cache populated once a client requests a particular score that is not already cached. For example: if a request for the goodfaith model, revision 11111, and wiki en lands on Lift Wing, we'd check the score cache and insert the result if not present. The same basic mechanism will also be used to populate the cache preemptively, for example via ChangeProp or Airflow (they will listen to Kafka and act upon revision changes, etc.). We'd like to proceed in this way since we'll also need to emit Kafka events for Eventgate (see T301878).

The Data Gateway seems to be more oriented for use cases where the data is loaded separately to AQS (for example, via Airflow directly), and retrieved via a simple GET under a certain path. Is there space for the use case above too? If so, is there any kind of authentication to use? I didn't find docs but in case feel free to point me to those if things are already written :)
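The check-then-insert flow described above is a classic read-through cache. Here is a minimal sketch of it in plain Python; the key shape (wiki, rev_id, model) and the dict-backed store are illustrative assumptions, standing in for the real Cassandra table and model call.

```python
import time

# Stand-in for the ml-cache Cassandra score table.
score_cache = {}

def expensive_model_call(wiki: str, rev_id: int, model: str) -> dict:
    """Placeholder for the real model inference; ORES-style models can take 500ms+."""
    time.sleep(0.01)
    return {"wiki": wiki, "rev_id": rev_id, "model": model, "score": 0.97}

def get_score(wiki: str, rev_id: int, model: str) -> dict:
    key = (wiki, rev_id, model)
    if key not in score_cache:               # cache miss: compute and insert
        score_cache[key] = expensive_model_call(wiki, rev_id, model)
    return score_cache[key]                  # cache hit: cheap lookup

# A precache worker (ChangeProp/Airflow reacting to revision-create events)
# would simply call get_score() ahead of user traffic, warming the cache.
print(get_score("enwiki", 11111, "goodfaith"))
```

The nice property is that on-demand traffic and preemptive warming share the same code path: the precache trigger is just another request.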

@elukey - in an ideal world we would like to abstract you from a lot of the underlying details of the data storage. For example, we are currently working on a project to write user feedback on image suggestions from a Kafka topic to Cassandra. The feature team just writes to the Kafka topic and our team takes care of writing the data to the Cassandra tables - the Data Gateway exposes that data from an endpoint. See this ticket for more info. This is currently a POC/first attempt at doing this.

Would you be interested if I tried to find some time next week for us to model the end to end flow of all this work with some of the Platform and DE team?

In the past we did the same for image suggestions, see here, this helped us understand what we needed and what we needed to build out.

@lbowmaker hi! I reviewed https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway and I am wondering whether the score cache use case would fit. Our idea, for the moment, is to have the cache populated once a client requests a particular score that is not already cached. For example: if a request for the goodfaith model, revision 11111, and wiki en lands on Lift Wing, we'd check the score cache and insert the result if not present. The same basic mechanism will also be used to populate the cache preemptively, for example via ChangeProp or Airflow (they will listen to Kafka and act upon revision changes, etc.). We'd like to proceed in this way since we'll also need to emit Kafka events for Eventgate (see T301878).

If I understand correctly, you're saying you want a service with typical caching semantics (calculating a result upon request when there is no cached entry, or when otherwise forced), which you'll then use to cache preemptively (triggered by a corresponding event), by making just such a request. Is that right?

The Data Gateway seems to be more oriented for use cases where the data is loaded separately to AQS (for example, via Airflow directly), and retrieved via a simple GET under a certain path. Is there space for the use case above too?

Assuming s/Data Gateway/Generated Data Platform/ here, the two things are nearly (though not precisely) the same as you've described them. If you define the problem as: Processing a stream of events in order to add/update scores to storage for low-latency access, they both match that description. If you define the problem as one of caching, then the semantics around triggering that work, storing the result, and data retention become necessarily specific.

So I guess the question is: Is this a caching use case? Are we treating the data as ephemeral? Is the loss of it an inconvenience (cold start), or a serious issue? Will it be retained in perpetuity, or expire on a TTL (or similar)?

If so, is there any kind of authentication to use? I didn't find docs but in case feel free to point me to those if things are already written :)

No authentication at present, no.

Change 808907 merged by Elukey:

[operations/puppet@production] Add configuration for the ml-cache codfw Cassandra cluster

https://gerrit.wikimedia.org/r/808907

@lbowmaker hi! I reviewed https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway and I am wondering whether the score cache use case would fit. Our idea, for the moment, is to have the cache populated once a client requests a particular score that is not already cached. For example: if a request for the goodfaith model, revision 11111, and wiki en lands on Lift Wing, we'd check the score cache and insert the result if not present. The same basic mechanism will also be used to populate the cache preemptively, for example via ChangeProp or Airflow (they will listen to Kafka and act upon revision changes, etc.). We'd like to proceed in this way since we'll also need to emit Kafka events for Eventgate (see T301878).

If I understand correctly, you're saying you want a service with typical caching semantics (calculating a result upon request when there is no cached entry, or when otherwise forced), which you'll then use to cache preemptively (triggered by a corresponding event), by making just such a request. Is that right?

Yeah, correct, this is the idea we have for the models we are porting from ORES to Lift Wing. They are very slow and calculating a score is expensive (500ms+ at least), so having a cache is essential. We have both internal and external clients, so HTTP caching may not be ideal, and we don't like the idea of using Redis since setting it up for high availability is not great at the moment (and maintenance is harder than with, say, Cassandra).

At the moment we do the same for ORES: ChangeProp calls a special precache endpoint, which causes scores to be saved in Redis. For the ORES models we may need to follow this road, since we don't want to calculate scores more than once with slow models, but in the future this may no longer be true (once we replace the ORES models with something more performant, etc.). So this score cache may be limited to one particular Lift Wing use case, namely the first one, and it could go away once we deprecate/replace the ORES models. This is why I am leaning towards using the ml-cache clusters rather than AQS.

The Data Gateway seems to be more oriented for use cases where the data is loaded separately to AQS (for example, via Airflow directly), and retrieved via a simple GET under a certain path. Is there space for the use case above too?

Assuming s/Data Gateway/Generated Data Platform/ here, the two things are nearly (though not precisely) the same as you've described them. If you define the problem as: Processing a stream of events in order to add/update scores to storage for low-latency access, they both match that description. If you define the problem as one of caching, then the semantics around triggering that work, storing the result, and data retention become necessarily specific.

So I guess the question is: Is this a caching use case? Are we treating the data as ephemeral? Is the loss of it an inconvenience (cold start), or a serious issue? Will it be retained in perpetuity, or expire on a TTL (or similar)?

The data in the cache may be lost in theory, but it would be a huge performance hit for the models that use it (from a few ms to hundreds of ms per request). I don't have a clear idea about data retention yet, but I'd be more inclined to keep values rather than drop them.

If so, is there any kind of authentication to use? I didn't find docs but in case feel free to point me to those if things are already written :)

No authentication at present, no.

I'd prefer to have a simple authentication mechanism when storing results, if possible.

@elukey - in an ideal world we would like to abstract you from a lot of the underlying details of the data storage. For example, we are currently working on a project to write user feedback on image suggestions from a Kafka topic to Cassandra. The feature team just writes to the Kafka topic and our team takes care of writing the data to the Cassandra tables - the Data Gateway exposes that data from an endpoint. See this ticket for more info. This is currently a POC/first attempt at doing this.

Would you be interested if I tried to find some time next week for us to model the end to end flow of all this work with some of the Platform and DE team?

This is really interesting and much clearer, thanks a lot for the explanation. We will definitely have similar use cases in the future as people build new models, so my team will use AQS for sure. As explained above, the score cache may not be a good use case though; the more I think about it, the more I'd use the ML clusters for it. For future models we may not need the score cache anymore, and we'll definitely work towards use cases more similar to the ones you depicted above.

In the past we did the same for image suggestions, see here, this helped us understand what we needed and what we needed to build out.

I'll review the docs, thanks!

@lbowmaker hi! I reviewed https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway and I am wondering if the score cache use case would fit. Our idea, for the moment, is to have the cache populated once a client requests a particular score that is not already cached. For example: if a score for the goodfaith model, revision 11111 and wiki en lands to lift wing, we'd check the score cache and insert the result if not present. This basic mechanism will be also used to populate the cache preemptively, for example using ChangeProp or Airflow (they will listen to kafka and act upon revision change etc..). We'd like to proceed in this way since we'll also need to emit Kafka events for Eventgate (see T301878).

If I understand correctly, you're saying you want a service with typical caching semantics (calculating a result upon request when there is no cached entry, or when otherwise forced), which you'll then use to cache preemptively (triggered by a corresponding event), by making just such a request. Is that right?

Yeah correct, this is the idea that we have for the models that we are porting from ORES to Lift Wing. They are very slow and calculating the score is expensive (500ms+ at least), so having a cache is essential. We have internal and external clients so HTTP caching may not be ideal, and we don't like the idea of using Redis since setting it up for high availability is not great at the moment (and maintenance is more difficult than say Cassandra).

At the moment we do the same for ORES, ChangeProp calls it asking for a special precache endpoint, that causes scores to be saved on Redis. For the ORES models we may need to follow this road since we don't want to calculate scores more than once with slow models, but in the future this may not be true anymore (when we'll replace ORES models with something more performant etc..). So this score cache may be limited to a particular use case of Lift Wing , namely the first one, and it could go away once we deprecate/replace the ORES models. This is why I am leaning towards using the ml-cache clusters rather than AQS.

The Data Gateway seems more oriented towards use cases where the data is loaded into AQS separately (for example, via Airflow directly) and retrieved via a simple GET under a certain path. Is there space for the use case above too?

Assuming s/Data Gateway/Generated Data Platform/ here, the two things are nearly (though not precisely) the same as you've described them. If you define the problem as: Processing a stream of events in order to add/update scores to storage for low-latency access, they both match that description. If you define the problem as one of caching, then the semantics around triggering that work, storing the result, and data retention become necessarily specific.

So I guess the question is: Is this a caching use case? Are we treating the data as ephemeral? Is the loss of it an inconvenience (cold start), or a serious issue? Will it be retained in perpetuity, or expire on a TTL (or similar)?

The data in the cache may be lost in theory, but it would be a huge performance hit for the models that use it (from a few ms to hundreds per request). I don't have a clear idea about data retention yet, but I'd be more inclined to keep values rather than drop them.

If so, is there any kind of authentication to use? I didn't find docs, but feel free to point me to them if things are already written :)

No authentication at present, no.

I'd prefer to have a simple authentication mechanism when storing results if possible..

Sorry, I thought from the context this was referring to the Data Gateway (reads). Storage (for the Platform) is meant to be abstracted as well, but should use some form of authentication, yes.

@elukey - in an ideal world we would like to abstract you from a lot of the underlying details of the data storage. For example, we are currently working on a project to write user feedback on image suggestions from a Kafka topic to Cassandra. The feature team just writes to the Kafka topic and our team takes care of writing the data to the Cassandra tables - the Data Gateway exposes that data from an endpoint. See this ticket for more info. This is currently a POC/first attempt at doing this.

Would you be interested if I tried to find some time next week for us to model the end to end flow of all this work with some of the Platform and DE team?

This is really interesting and much clearer, thanks a lot for the explanation. We will definitely have similar use cases in the future when people build new models, so my team will use AQS for sure. As explained above, the score cache may not be a good use case though; the more I think about it, the more I'd use the ML clusters for it.

FTR, this Platform isn't the only multi-tenant Cassandra cluster; the RESTBase cluster is also available for uses like this.

For future models we may not need the score cache anymore, and we'll definitely work towards use cases more similar to the ones that you depicted above.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2001.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2001.codfw.wmnet with OS buster completed:

  • ml-cache2001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206300812_elukey_959051_ml-cache2001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2002.codfw.wmnet with OS buster completed:

  • ml-cache2002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206300828_elukey_962779_ml-cache2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2003.codfw.wmnet with OS buster completed:

  • ml-cache2003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206300842_elukey_965813_ml-cache2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

codfw cluster up and running on Buster :)

@lbowmaker @Eevans I had a long chat with my team about the AQS cluster and our use cases, we reached some consensus about how to proceed, lemme try to summarize and then I'd love some feedback :)

  • First of all, many thanks for the patience and all the explanations made for us so far. It was really great to discuss our use cases and it was eye opening.
  • The score cache use case is probably something we could call the "ORES score cache" use case, since we only want it because the revscoring/ORES models are very slow, and without a cache we'd see a severe performance penalty even in basic calls. Future models will not follow the ORES patterns and will leverage pre-computed data more actively, which we hope will come from AQS. Our idea, for the moment, is to use both ml-cache clusters for the "ORES score cache" use case, without trying AQS, since it is something that we'll probably only use to support the ORES models until we deprecate them for something newer (the Research team is already working on it).
  • We'd be interested to know how/if it makes sense to replace the current ORES precache/revision-score workflow with something that Data Platform manages (Airflow based, etc..). The context is in T301878, but basically at the moment this happens:

Our idea is to segment the revision-score stream into multiple ones, one for each model type (mediawiki.revision-score-editquality, etc..), and use Lift Wing (not ORES) to compute the scores. Ideally we could replace the complicated ChangeProp workflow with something that Data Platform offers, but this is probably something I'd need to follow up on with @lbowmaker.
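The per-model segmentation above boils down to routing each score event to a stream named after its model type. A small sketch of that routing, assuming the topic naming from the comment (mediawiki.revision-score-&lt;model-type&gt;); the event shape is illustrative, not a real Event Platform schema:

```python
def topic_for(model_type: str) -> str:
    """Map a model type to its hypothetical per-model score stream."""
    return f"mediawiki.revision-score-{model_type}"

def build_score_event(wiki, rev_id, model_type, model_name, score):
    """Assemble an illustrative score event destined for the per-model stream.

    The field names here are placeholders; the real schema would be defined
    in the Event Platform schema repository.
    """
    return {
        "meta": {"stream": topic_for(model_type)},
        "database": wiki,
        "rev_id": rev_id,
        "model": model_name,
        "score": score,
    }
```

A Kafka producer (or ChangeProp replacement) would then publish each event to `event["meta"]["stream"]`, so consumers interested in a single model type subscribe only to that topic.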

  • Long term, we see a lot of overlap between our feature store use case and AQS. We'll test Feast as anticipated, but the ideal goal would be to have people use a single tool instead of multiple ones. So we'll keep AQS in mind when evaluating Feast, so that we don't add another tool on top of what Data Platform is going to offer.

Thanks @elukey for the summary. The TL;DR is that we are trying to build the minimum thing to fix a very specific problem regarding the ORES models. I don't think we have the resources to make a general pre-caching solution, but I also don't think we need to. We just need the ORES models hosted on Lift Wing to be faster until they are replaced. We are calling it Project Racing Stripe ("It isn't actually faster, it just seems faster."), and @klausman is leading it (and came up with the name).

Just to reiterate for posterity's sake: the Platform team's Generated Data Platform (aka AQS) isn't the only option for (Cassandra-based) storage; we have another multi-tenant storage cluster as well (the cluster formerly exclusive to RESTBase). Obviously we'd want to do due diligence with respect to establishing ORES score caching storage size and throughput requirements, but I suspect we have more than enough capacity.

We're in the process of trying to pull the various Cassandra clusters together under the umbrella of SRE Data Persistence. I don't know that that means a team-owned special-purpose Cassandra cluster would be discouraged (that's not for me to say), but it does seem like we'd want to be explicit about that if it's the case.

+1, makes sense. In the ideal scenario the ML team will not use any extra Cassandra clusters for its use cases (especially in the future). I'd keep concerns separated from the Cassandra RESTBase cluster if possible (at least for the moment), and we'll need a special cluster anyway to test https://feast.dev/ and figure out how to proceed.

Informed @LSobanski via email as well, so Data Persistence is aware of this extra new cluster :) I think that, if everybody agrees, this task can be closed and the ML testing phase can start. Once we have all our results we'll show them to people and decide how to proceed. Does that make sense?