Page MenuHomePhabricator

Obtain backups data through means of different API endpoints (api_db)
Open, Needs TriagePublic

Description

This ticket is to document and cover the design of various API endpoints that could be present in the database, along with the SQL query

Basically our program will execute the SQL query with databases v0.4.3 (async databases, uses pymysql in the background), and this data will be passed onto a Pydantic model for verification (we don't really need to do this unless we are inputting the data into a table from our program)

As the table is already created, I tried using SQLAlchemy reflection to grab an ORM of the existing table.

Endpoint 1: To get all the recent backups metadata for each unique section

API endpoint: api/v1/get_recent_backups_data

Proposed SQL query on backups table to get this data:

SELECT id, name, status, source, host, type, section, start_date, end_date, total_size 
FROM backups b 
WHERE b.start_date = (SELECT MAX(start_date) FROM backups b2 WHERE b2.section = b.section)

(I'm not that great with SQL query but if you have a better query to do this, please post it :-D )

Expected JSON output example (same for the others as well):

[
    {
        "id": 860,
        "name": "dump.s2-test.2019-03-06--10-03-06",
        "status": "finished",
        "source": "localhost",
        "host": "es2002.codfw.wmnet",
        "type": "dump",
        "section": "s2-test",
        "start_date": "2019-03-06T10:03:06",
        "end_date": "2019-03-06T10:03:08",
        "total_size": 142031
    },
    {
        "id": 869,
        "name": "snapshot.s1-test.2019-03-06--10-54-56",
        "status": "finished",
        "source": "localhost",
        "host": "es2002.codfw.wmnet",
        "type": "snapshot",
        "section": "s1-test",
        "start_date": "2019-03-06T10:54:56",
        "end_date": "2019-03-06T10:55:10",
        "total_size": 116361965
    },
    {
        "id": 875,
        "name": "dump.section6.2019-03-12--08-30-18",
        "status": "failed",
        "source": "dbstore2001.codfw.wmnet:3316",
        "host": "es2001.codfw.wmnet",
        "type": "dump",
        "section": "section6",
        "start_date": "2019-03-12T08:30:18",
        "end_date": "2019-03-12T08:30:18",
        "total_size": null
    },
]

Endpoint 2: To get stale backups for each section

API endpoint: api/v1/get_stale_backups

Show me backups whose last successful backups were more than X hours ago (X is the freshness criteria

Proposed SQL query

TODO -- we may do the logic in the actual python program rather than putting it in an SQL query

Endpoint 3: To get recent backups with large percentage change.

API endpoint: api/v1/get_inconsistent_backups

Show me backups whose last successful backups sizes had a percentage change of X percentage

Proposed SQL query

TODO - we may do the logic in the actual python program rather than putting it in an SQL query

Event Timeline

A more "elegant" way to do a group-wise maximum is to do something like:

SELECT b1.id, b1.name, b1.status, b1.source, b1.host, b1.type, b1.section,
        b1.start_date, b1.end_date, b1.total_size
FROM backups b1
LEFT JOIN backups b2 ON b1.section = b2.section AND b1.start_date > b2.start_date
WHERE b2.id is null

As it doesn't require subqueries, but it is not a huge issue- the main issue is that both this and your query are not well optimized because they both don't use indexes efficiently (as they don't exist).

However, before we rush to create suitable indexes, there is a few issues with the query (both versions):

  • The query will return ALL results, including those of test or sections we don't care (e.g. on my results we get "test-s1" section which corresponds to backup testing, and we don't care about. We can also assume we could have sections configured with no backups taken yet (e.g. recently setup backup). We should probably query only the sections we get from configuration or ignore non-monitored sections afterwards.
  • We probably want to know the status on both datacenter "eqiad" and "codfw" (there is no datacenter column, although currently we can use the dns name of the host containing the backup), although some backups only happen on one datacenter (again, we should get that from configuration)
  • We probably want to report both types of backups- not all sections have both dumps and snapshots, and they are taken with different frequency. Again, something we can get from configuration.

For example, here is the information of configured backups we want to monitor (most likely we will want to return one result per line, for each dc, section and type):

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/role/common/alerting_host.yaml$33

Thank you @jcrespo
I have not used indexes much before, so I will do a bit of learning on this.

Looking at the alerting_host yaml file, I think we can set up some logic on our side to filter out the sections that we don't care about, or we can write an SQL query to exclude them in the query itself. Which one would be the better option, according to you? I would filter it out in python code as it's easier on my side, but what do you think?

Also, to maintain and support a list of sections that we actually monitor on, I see two options to do this with our api_db program, which
(A) Modify api_db program or SQL query to support new sections OR
(B) Have some way to read this alerting_host yaml file (push it through puppet?) and grab the relevant sections and have the program filter it out based on data from this yaml file.

While option A sounds good, having to repeat the work at two places (in the alerting_host.yaml file and the api_db program) seems a bit inefficient
Option B seems a good opinion only if there was a way to copy this alerting_host.yaml file into the instances that run bernard:api_db (through a puppet script) -- is this possible?

As for the datacenter (eqiad, codfw) -- yes, we can get them from the domain name and we can refer to the yaml file to see the ones that we really need to monitor.

Which one would be the better option, according to you?

The one that works :-)

Before getting too caught in details- My suggestion for you (if you accept it) is to start by having 1 alert/metric working ASAP (e.g. simplest api call possible and simplest html served on frontend using that api) so you have a vertical slice to show it working, and later making it a more complex and complete design. It is better to have 1 very small thing working than multiple complex things and none working, a more iterative design could help you react to changes in needs and a more "continuous delivery" approach.