
Security API Storage Needs
Closed, Resolved, Public

Description

We will be receiving files as part of the security API implementation. The current test file is 700 MB compressed and will need temporary storage; permanent storage is needed for the 4 GB uncompressed file. There might be a caching layer or mount point that is suitable for this purpose. We are considering the following options, but are not sure which one best fits our use case:

  • /usr/share mounted in the container (similar to MaxMind)
  • swift
  • possible caching layer or other storage options we don't know about

/usr/share might be the easiest, but we're not sure if there's a limit on how many files can be stored there.

Event Timeline

dpifke subscribed.

Some questions:

  • What is the process by which these files are uploaded/updated? How often are they updated?
  • What is accessing them? (MediaWiki?)
  • What is the access pattern? (frequency, etc.)
  • What availability guarantees are needed?

Hey @dpifke - I'll try to answer some of these, though @STran or @Mstyles may have more/better insights:

  • What is the process by which these files are uploaded/updated? How often are they updated?

Ideally, the data would be consumed from a foreign API as a stream via this proposed, WIP processing script, which would run as a cron/systemd job. Updates would be either 24-hour (daily) or 5-minute ("real-time"), depending on the product that is ultimately chosen. The approximate file/stream size is 700 MB compressed for the daily feed and 5 MB compressed for the real-time feed.
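
For concreteness, a rough sketch in TypeScript of what such a scheduled pull might look like. This is not the actual WIP script; the feed URL, output path, and environment variable names are assumptions:

// Hypothetical sketch of the scheduled pull: stream the compressed feed from
// the vendor API and decompress it to local storage without holding the whole
// file in memory. FEED_URL and OUTPUT_PATH are placeholders.
import { createGunzip } from "node:zlib";
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

const FEED_URL = process.env.FEED_URL ?? "https://vendor.example/daily-feed.gz";
const OUTPUT_PATH = process.env.OUTPUT_PATH ?? "/tmp/security-feed.json";

export async function pullFeed(): Promise<void> {
  const response = await fetch(FEED_URL);
  if (!response.ok || !response.body) {
    throw new Error(`Feed request failed with HTTP ${response.status}`);
  }
  await pipeline(
    Readable.fromWeb(response.body as import("node:stream/web").ReadableStream),
    createGunzip(),
    createWriteStream(OUTPUT_PATH),
  );
}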

  • What is accessing them? (MediaWiki?)

A new Wikimedia service based upon the standard service-template-node.

  • What is the access pattern? (frequency, etc.)

See response to question 1.

  • What availability guarantees are needed?

For the data import/write, ideally we would want to alert on any failure and have some kind of minimal retry process. For data access, traffic will likely be extremely low, at least for now, compared to most other data within Wikimedia production. The service is intended to be rolled out to a minimal number of privileged users/bots who will access it via a new MediaWiki right/extension, and possibly later to internal Wikimedia services or MediaWiki extensions.
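
As an illustration of the "minimal retry" part, a hedged sketch of a retry wrapper around the import job; the function names and the alerting mechanism are placeholders, not a settled design:

// Hypothetical retry wrapper for the import: retry with a growing backoff a
// few times, then rethrow so the job exits non-zero and whatever watches the
// cron/systemd unit can alert on the failure.
async function importWithRetry(
  importOnce: () => Promise<void>,
  attempts = 3,
  backoffMs = 60_000,
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await importOnce();
      return;
    } catch (err) {
      if (attempt === attempts) {
        throw err; // final failure: surface it to the scheduler/monitoring
      }
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
}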

As a general rule, storing large files inside containers is a bad idea. MaxMind files are small enough that they can fit into an image without increasing its size significantly. And I would definitely say it's out of the question that we distribute a 4 GB file to all Kubernetes nodes so that it can be share-mounted by this application.

What you need is either an API endpoint or a recurring Kubernetes job to load these files into a datastore that is suitable for querying - so one of {relational database, Cassandra, Elasticsearch}.
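
As a rough illustration of the "recurring Kubernetes job" option, a hedged sketch of a CronJob manifest; the names, image, and schedule are made up, and the real deployment would go through the normal Kubernetes tooling:

# Hypothetical sketch only: a recurring job that pulls the vendor feed and
# loads it into whichever datastore is chosen.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: security-api-feed-import
spec:
  schedule: "15 3 * * *"         # daily pull; "*/5 * * * *" for the real-time feed
  concurrencyPolicy: Forbid      # don't start a new import while one is running
  jobTemplate:
    spec:
      backoffLimit: 2            # minimal retry on failure
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: feed-import
              image: docker-registry.example/security-api-import:latest
              env:
                - name: FEED_URL
                  valueFrom:
                    secretKeyRef:
                      name: security-api-feed
                      key: url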

Swift can store files, but I would assume we have more efficient ways of querying these data than downloading 700 megabytes from Swift every N minutes, with the additional need for fast local ephemeral storage on the containers.

Thanks @Joe, is there a hard limit on file sizes that can be stored inside the container? We might have other options with the file sizes.

Do you have any guidance or examples of other applications that are using data stores? I think we can use our existing service to store the files into the data store.

And I agree, Swift probably is not the best choice for this.

It seems like the simplest way forward for us would be to use the existing Cassandra cluster with a new access for our service.

It seems like the simplest way forward for us would be to use the existing Cassandra cluster with a new access for our service.

Looking at this a bit more, given what's already using the Cassandra cluster - Restbase, AQS, Sessionstore, Maps - I'm not sure the current use-cases for the Security API, or even the envisioned near-future use-cases, necessarily justify using the existing Cassandra cluster as the data store. The existing systems appear to be quite a bit higher-volume than what the Security API would likely be.

Do you have any guidance or examples of other applications that are using data stores? I think we can use our existing service to store the files into the data store.

This might be the best approach for a few different reasons: bundling a Redis/Cassandra service with the Security API for its exclusive use. Elasticsearch is also interesting (I believe Toolhub uses this) but seems maybe less appropriate for the Security API? Anyhow, it would still be nice to find some existing services that use this pattern as examples to follow.

Without knowing more about the type of data and your access patterns, it's hard to provide a good suggestion here. But, more generally:

I would suggest that the most flexible, cheap, proven, and easy-to-use datastore is a relational database like MySQL, which should be the default choice for storing data that has any kind of querying needs. I think in general we should ask ourselves whether there's a good reason why a relational database is not a good fit for our data and our data access patterns, and if we can't answer that question, just use MySQL.

For instance, among services other than MediaWiki, Toolhub and linksrecommendation both have their own dedicated databases on the MySQL cluster for services.

Thanks @Joe, is there a hard limit on file sizes that can be stored inside the container? We might have other options with the file sizes.

There is a general limit on the size of a Docker image, which should in general not exceed 500 MB, although it's not a hard limit.

Without knowing more about the type of data and your access patterns, it's hard to provide a good suggestion here. But, more generally: [...]

Hey @Joe

Thanks for the guidance. The typical data we're looking at will be about 4 GB (uncompressed) of JSON that we retrieve from a foreign API. This data will be retrieved from that API either via a "daily" pull (~700 MB compressed) or several "real-time" (5-minute) pulls (~4 MB compressed each). If the latter product/method is used, we would plan to store those data pulls cumulatively to amass 12 to 24 hours' worth of historical data on our end. The format from the API is a sequence of JSON objects separated by newlines and looks somewhat like this, though the field count can vary a bit:

{
  "ip":"1.2.3.4",
  "services":["string1"],
  "org":"string2",
  "count":-1,
  "data1":"string1",
  "data2":"string2",
  "data3":"string3",
  "data4":true
}

This technically isn't valid JSON as a whole, but it is what the vendor provides, so a transform layer will be applied after pulling the data from their API. We were initially thinking something like Redis/Cassandra would make the most sense here, given their ability to quickly process and store simple, non-relational data and provide fast access. If MySQL/Maria would make the most sense given the size and lack of complexity of this data, with the only real business requirement being quick, accurate storage and retrieval of the data, then we can definitely explore that option.
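
To make the MySQL option a bit more concrete, a hedged sketch of how the transformed records might land in a table. All table, column, and connection names here are illustrative assumptions based on the sample record above, not a settled schema:

// Hypothetical loader: map each newline-delimited vendor record to a row keyed
// by IP, keeping the full record in a JSON column for fields we don't model yet.
// Uses the mysql2 client; schema and connection details are placeholders.
import mysql from "mysql2/promise";

const CREATE_TABLE = `
  CREATE TABLE IF NOT EXISTS security_feed (
    ip VARBINARY(16) NOT NULL,   -- packed IPv4/IPv6 address
    org VARCHAR(255),
    services JSON,               -- small array of service names
    raw JSON NOT NULL,           -- full vendor record as received
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (ip)
  )`;

export async function loadRecords(lines: string[]): Promise<void> {
  const db = await mysql.createConnection({ host: "localhost", database: "security_api" });
  await db.query(CREATE_TABLE);
  for (const line of lines) {
    if (line.trim() === "") continue;
    const record = JSON.parse(line);
    await db.execute(
      `REPLACE INTO security_feed (ip, org, services, raw)
       VALUES (INET6_ATON(?), ?, ?, ?)`,
      [record.ip, record.org ?? null, JSON.stringify(record.services ?? []), line],
    );
  }
  await db.end();
}

A handful of indexed columns plus a catch-all JSON column would cover simple lookups by IP while leaving room for the field variation mentioned above; whether a per-row REPLACE is fast enough for the daily import, or whether bulk loading is needed, would have to be tested.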

Hey @Joe - just wondering if you had any thoughts or guidance regarding my previous comment. If not, I think we'll explore using MySQL as an ad-hoc data store for the Security API service.

sbassett moved this task from In Progress to Our Part Is Done on the Security-Team board.

Per the last recommendation from @Joe at T301428#7730915, we've decided to pursue MySQL/Maria as the primary backend for the Security API. We can re-open this task if we run into any serious blockers with that approach.