Several Wikidata Grafana boards missing data before October 2021
Closed, Resolved · Public

Description

On several Wikidata-related Grafana dashboards, metrics that should go back several years appear to have been truncated recently:

In many cases the metrics appear to start on 2021-10-22, but there are also some other dates, such as 2021-10-12 here:

Screenshot 2021-10-26 at 16-49-02 Wikidata Datamodel Lexemes - Grafana.png (258×910 px, 29 KB)

Can we figure out why the old data is gone?

Event Timeline

This might also affect non-Wikidata boards; for example, on API request rate (top 10), the query, parse and stashedit lines all start on 2021-10-21.

Screenshot 2021-10-26 at 16-55-58 API backend summary - Grafana.png (687×1 px, 72 KB)

SAL says that graphite1004 was reimaged to bullseye on 2021-10-21 at 7:56 UTC. @fgiunchedi, do you think this could be related? (Though it wouldn’t really explain why some other metrics seem to have an earlier cutoff date… I’m just guessing here so far.)

@Lucas_Werkmeister_WMDE thank you for the report. Yes, I'm pretty sure the graphite bullseye migration is related. We backfilled graphite1004 from graphite2003 (which in turn was the first host we reimaged, and which we had backfilled from graphite1004); I suspect some metric files were backfilled fully and some others weren't (I don't know why exactly yet).

I verified this by taking one of your examples: for the MediaWiki API, /srv/carbon/whisper/MediaWiki/api/query/executeTiming/sample_rate.wsp has historical data on graphite2003 but not on graphite1004 (as you experienced). So I think what's needed is to run a backfill again (starting with the metrics that we know are missing data); this is a safe operation because data gets merged. I'll try that tomorrow and report back my findings.
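
For reference, a check like this can be scripted with the whisper Python library that ships with Graphite (a minimal sketch, not the exact commands used here; the five-year fetch window is an arbitrary upper bound):

```python
#!/usr/bin/env python3
"""Print the date of the oldest non-null datapoint in a whisper file.

Run on both graphite hosts and compare the output to spot truncated metrics.
"""
import sys
import time

import whisper  # the library backing Graphite's .wsp files

DEFAULT_WSP = "/srv/carbon/whisper/MediaWiki/api/query/executeTiming/sample_rate.wsp"


def earliest_datapoint(path):
    """Return the timestamp of the oldest non-null datapoint, or None."""
    now = int(time.time())
    # fetch() clamps the start to the oldest retained interval, so asking
    # for five years is safe even if the file retains less.
    (start, _end, step), values = whisper.fetch(path, now - 5 * 365 * 86400, now)
    for i, value in enumerate(values):
        if value is not None:
            return start + i * step
    return None


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_WSP
    ts = earliest_datapoint(path)
    print(path, "->", time.strftime("%Y-%m-%d", time.gmtime(ts)) if ts else "no data")
```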

Aklapper renamed this task from "Several Wikidata Grafana boards missing data before October 2022" to "Several Wikidata Grafana boards missing data before October 2021". Oct 27 2021, 8:33 AM

Mentioned in SAL (#wikimedia-operations) [2021-10-27T09:25:17Z] <godog> another run of backfill on graphite1004 - T294355

Status update: the backfill is still ongoing, since I lowered the concurrency.

The good news is that some metrics are already backfilled, e.g. the API backend summary: https://grafana.wikimedia.org/d/000000002/api-backend-summary?viewPanel=31&orgId=1&from=1617235200000&to=1635119999000

The bad news is that I suspect the first backfill on Oct 11th (i.e. when we reimaged graphite2003 and then backfilled it) suffered from the same undetected problem; therefore, for metrics that only have data since Oct 11th (as opposed to Oct 21st), we unfortunately do have data loss.

Status update: I'm running a full audit on all ~4M metric files, looking for similar cases. The backfill from yesterday completed in the meantime, and some metrics were backfilled successfully.

I'll be following up with an incident report about this -- again my apologies for the unexpected data loss during migration and backfill.

In terms of action items: we currently don't back up graphite metric files, mostly due to the sheer number of files and the space they take. However, if a subset of metric files in a directory hierarchy isn't too big (I don't have an exact figure for "big" on hand, but I'd say low tens of thousands of files, to be confirmed), then it should be doable to back it up in Bacula.
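
A rough feasibility check could look like the following sketch (assumptions: the /srv/carbon/whisper root seen above, and that per-hierarchy file count and total size are what matters for Bacula):

```python
#!/usr/bin/env python3
"""Count whisper files and total size per top-level hierarchy under the
carbon data directory, to see which subsets are small enough for Bacula."""
import os

WHISPER_ROOT = "/srv/carbon/whisper"

for entry in sorted(os.scandir(WHISPER_ROOT), key=lambda e: e.name):
    if not entry.is_dir():
        continue
    n_files = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(entry.path):
        for name in filenames:
            if name.endswith(".wsp"):
                n_files += 1
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    print(f"{entry.name}: {n_files} files, {total_bytes / 1e9:.1f} GB")
```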

Audit completed. What I did was count the number of null data points in the year leading up to the graphite2003 reimage (i.e. the first reimage, where the backfill would have first failed), from 2020/10/14 to 2021/10/11 (first column), and the number of nulls after the first reimage, from 2021/10/12 to 2021/10/20 (second column).

The files for which the backfill failed would have a high number of nulls in the first column but a low number in the second (i.e. datapoints are being appended now, but haven't been for the last year). The full list of metrics with more than 10 nulls in the last year but fewer than 10 in the last week is at https://people.wikimedia.org/~filippo/nulls-T294355 (12MB file). I believe these were all the metrics affected by the failed backfill, for which we lost historical data.
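
The audit logic boils down to something like this sketch (an approximation of what was run, using the whisper library and the thresholds described above; the actual tooling may have differed):

```python
#!/usr/bin/env python3
"""Flag whisper files that have many nulls in the year before the first
reimage but few nulls after it, i.e. files whose backfill likely failed."""
import calendar
import os
import time

import whisper

WHISPER_ROOT = "/srv/carbon/whisper"


def ts(date):
    """'2021/10/11' -> Unix timestamp at UTC midnight."""
    return calendar.timegm(time.strptime(date, "%Y/%m/%d"))


def count_nulls(path, start, end):
    try:
        _timeinfo, values = whisper.fetch(path, start, end)
    except whisper.InvalidTimeInterval:
        return 0  # file does not retain data that far back
    return sum(1 for v in values if v is None)


if __name__ == "__main__":
    before = (ts("2020/10/14"), ts("2021/10/11"))  # year leading up to the reimage
    after = (ts("2021/10/12"), ts("2021/10/20"))  # days right after it
    for dirpath, _dirnames, filenames in os.walk(WHISPER_ROOT):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            wsp = os.path.join(dirpath, name)
            nulls_before = count_nulls(wsp, *before)
            nulls_after = count_nulls(wsp, *after)
            if nulls_before > 10 and nulls_after < 10:
                print(nulls_before, nulls_after, wsp)
```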

Draft incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite

Please feel free to integrate/change as needed. I'll be OOO until the 18th, and I'll pick this back up then.

I've sent the incident report up for review. What do you think re: my proposal of adding parts of the hierarchy to Bacula (if it is feasible in terms of number of files; e.g. daily is ~100k files now)?

Sounds like a good idea to me, I can’t judge how much would fit in Bacula. Do you need a list of important metrics (worth backing up)?

Yes, a list of "paths" in the metrics hierarchy would be very helpful.

Re: acceptable file-count limits for Bacula jobs, I'm looping in @jcrespo for guidance/assistance.

The number of files is (within reason) a non-blocker for Bacula, as files are packaged into volumes. It is true that each file is stored as a MySQL record, but that should scale up to dozens of (US) billions, although it may be slow to recover when rebuilding metadata.

The most limiting factor would be the overall size plus backup frequency, for capacity planning. We don't have a lot of temporal data backed up, so I'm not sure if we could come up with a strategy that saves space (e.g. if data is immutable, we may want to avoid full backups every day). What is the file/directory structure? If the data is below e.g. 100 GB I would consider it "small" and not requiring optimization.

The typical backup schedule is incrementals of a set of paths every day, differentials every fortnight, and fulls monthly; however, it is highly customizable per job.

Thank you, that's helpful to know. My hunch is that we'd want every-other-week backups, since this is mainly a safety measure. The file structure is one file per metric for graphite, with the filesystem path mirroring the graphite path (e.g. foo.bar.baz.value becomes /foo/bar/baz/value.wsp on the filesystem). Each file is expected to be around ~100 kB (e.g. the daily top-level directory I mentioned earlier is ~100k files and 11 GB in size).
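
To make the mapping concrete, here is a minimal sketch of the name-to-path translation (the /srv/carbon/whisper root matches the paths seen earlier in this task):

```python
import os

WHISPER_ROOT = "/srv/carbon/whisper"


def metric_to_path(metric):
    """foo.bar.baz.value -> /srv/carbon/whisper/foo/bar/baz/value.wsp"""
    return os.path.join(WHISPER_ROOT, *metric.split(".")) + ".wsp"


def path_to_metric(path):
    """Inverse mapping, e.g. for turning an audit's file list into metric names."""
    rel = os.path.relpath(path, WHISPER_ROOT)
    return rel[: -len(".wsp")].replace(os.sep, ".")


assert metric_to_path("daily.foo.bar") == "/srv/carbon/whisper/daily/foo/bar.wsp"
assert path_to_metric("/srv/carbon/whisper/daily/foo/bar.wsp") == "daily.foo.bar"
```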

One more question, to finally decide between weekly full backups and daily incrementals: do all files mostly change completely, or only a subset of them? Incrementals can only be done at file granularity (a file is backed up in full whenever its path or hash has changed). If value.wsp changes every minute, and there is only one file per value, we will do "weekly full only"; otherwise daily incrementals may be preferred.

If we end up doing weekly fulls, 11 GB × 12 weeks of retention ≈ 130 GB, which we can handle with no issue.

My expectation is that most files we're backing up will change (otherwise it means the metric files are not being updated, which would make the backups less relevant) so definitely +1 for e.g. a weekly full backup

> If we end up doing weekly fulls, 11 GB × 12 weeks of retention ≈ 130 GB, which we can handle with no issue.

Thank you, that's good to know!

I’m not sure I understand the discussion correctly :) do you still need a list of paths to back up, or does it look like we can back up everything now?

I don't have the answer to that question, but whenever any of you have the servers and path(s), you can follow the instructions at https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client to send a preliminary backup proposal to Puppet, and I will help you merge it with the proper setup (e.g. schedule, day, etc.). I think it will be more useful to discuss the details over a patch :-).

> I’m not sure I understand the discussion correctly :) do you still need a list of paths to back up, or does it look like we can back up everything now?

What's "everything" in this context? :) If you are talking about daily then yes it does look like it!

I was thinking of everything, even non-daily stuff, but it looks like daily would actually be enough for us. Manuel created a list of important dashboards in T297145; the topics they use are:

Change 745838 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: backup 'daily' hierarchy

https://gerrit.wikimedia.org/r/745838

Ok! Please see the related review to start backing up the 'daily' hierarchy. I've added @jcrespo too, for sign-off purposes.

Let me give it a deeper look; while the patch by itself looks good as is, I want to check whether a different (non-default) backup policy would be more advantageous in frequency and space. :-)

Manuel added a subscriber: Lydia_Pintscher.

@Lydia_Pintscher: This is unfortunately very bad news: I have just discussed this issue with Lucas and Michael. The short version is that a maintenance attempt apparently failed badly and led to data loss (see incident report). There were no data backups. As a result, we permanently lost data for all metrics that still lack historic data (11 October 2021 and earlier).

Reconstructing the missing data from dumps is theoretically possible in some cases, but it will take a lot of developer resources. Our recommendation is to focus reconstruction attempts on only a very small number of extremely important metrics. If you have a shortlist of such candidates, the team would in the next step try to identify whether reconstruction is possible in theory and what would be needed to make it happen. I opened a dedicated ticket for reconstruction efforts (T297487).

Also, we are asking to get regular backups for the future (see T297145 and this ticket).

@Manuel @Lydia_Pintscher going forward, I suggest also investing resources in switching to Prometheus as the supported metrics system. Graphite is deprecated and in "life support" mode while all producers (essentially MediaWiki and related services) are being ported over. Thanks!

Thank you for the suggestion @fgiunchedi! Do we have an explanation somewhere of how to do this?

Sure, no problem! My understanding is that these metrics are published/pushed somewhat infrequently by background jobs, so a good starting point would be https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway) . Happy to provide more guidance/info on T297494 as well, though.
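
For example, a periodic job could push its value with the official prometheus_client library along these lines (a minimal sketch; the gateway address, job name, metric name and value are illustrative placeholders, not production values):

```python
#!/usr/bin/env python3
"""Push a metric computed by a periodic job to the Prometheus Pushgateway."""
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
lexemes = Gauge(
    "wikidata_datamodel_lexemes_total",  # hypothetical metric name
    "Number of Lexemes, as computed by a periodic maintenance job",
    registry=registry,
)
lexemes.set(123456)  # placeholder value; a real job would compute this

# Each push replaces the previously pushed values for this job grouping.
push_to_gateway("pushgateway.example.org:9091", job="wikidata-datamodel", registry=registry)
```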

Thank you for the offer! We will come back to it!

Change 745838 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: backup 'daily' hierarchy, with weekly frequency, every Monday

https://gerrit.wikimedia.org/r/745838

fgiunchedi claimed this task.

I'm tentatively resolving the task, since all short-term mitigations are completed; feel free to reopen if something is amiss.

Running Jobs:
Console connected using TLS at 13-Dec-21 09:20
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
396417  Back Full      4,568    412.9 M graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily is running
396418  Back Full          0         0  graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily is running
====
Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
396417  Full     108,320    11.70 G  OK       13-Dec-21 09:34 graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
396418  Full     108,320    11.70 G  OK       13-Dec-21 09:35 graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily

https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily&from=1639384883943&to=1639388483943