Page MenuHomePhabricator

Store unsampled API and XFF logs
Closed, ResolvedPublic

Description

Apparently unsampled API logs were disabled by Ori in December, and XFF logs were disabled by Reedy in January. I don't understand how we can respond to abuse and DoS attacks without these logs.

When we bought Fluorine, it was sized so as to have a sufficient disk I/O to store unsampled Apache access logs with a short retention time: https://rt.wikimedia.org/Ticket/Display.html?id=2400 . Unfortunately this was never implemented -- it would still be useful in my opinion. Instead, we stored XFF and API logs, and tuned the retention time of the API logs so that they fit on the disk.

If the API logs stop fitting on the disk on fluorine, the first thing to try should be to reduce the retention time. This is currently 30 days, see files/misc/scripts/mw-log-cleanup in puppet.

If fluorine is now so undersized, 2.5 years on, that we can't even store 7 days of API access logs, then we should buy new hardware for it.

Event Timeline

tstarling created this task.Feb 3 2015, 3:28 AM
tstarling raised the priority of this task from to Needs Triage.
tstarling updated the task description. (Show Details)
tstarling added a subscriber: tstarling.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 3 2015, 3:28 AM
tstarling updated the task description. (Show Details)Feb 3 2015, 3:29 AM
tstarling set Security to None.
Joe added a subscriber: Joe.Feb 3 2015, 6:00 AM

Honestly I don't see the point in creating a godzilla 130 GB file every day.

A correct way to tackle this is probably rotating the file more often than daily, and keep 7 days of retention if possible.

In T88393#1010638, @Joe wrote:

Honestly I don't see the point in creating a godzilla 130 GB file every day.

If we knew what HHVM server(s) were involved in T87645, that would help a with isolating that bug. That information would have been in the API log if it existed.

Another plausible scenario is:

  • The site goes down
  • 15 minutes later, the site comes back up
  • 5 minutes after that, initial ops response begins. Maybe MySQL fell over from overload but nobody knows where the queries came from. Will it happen again? The API log tells you wall clock execution time, it's possible to pull out slow and/or high volume API queries and model their effect on MySQL load. Maybe it was an accidental DoS -- then you have the username in the API log and you can ask them nicely not to do it again. Maybe it was deliberate DoS, then you have the IPs and full details of the vulnerability in the API log. If it wasn't API requests, we're screwed, especially without the XFF log, but at least with API DoS we have some hope of working out what happened. Grepping a 130GB log is not easy, but when all hands are on deck, it's a minor problem, compared to not having any logs at all.

A correct way to tackle this is probably rotating the file more often than daily, and keep 7 days of retention if possible.

Rotating hourly or whatever would have been fine in the case of T87645 since I know the time of the event to within a few seconds. I think it would generally be beneficial.

I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be replaced by 50 lines of your favourite scripting language.

ori added a subscriber: ori.Feb 5 2015, 12:56 AM

I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be replaced by 50 lines of your favourite scripting language.

When there are multiple viable options, I ask myself what our donors would want us to do. In this case, I think they'd prefer that we just get bigger hard drives sooner than spend developer time. Daily rotation of the un-sampled logs was fine except when rotated files were left un-gzipped.

Increasing fluorine's storage capacity so we have additional breathing room seems very much worth the investment, especially since it has only recently pushed against the limits after two years of loyal service. If we do that, and if we have a cron job pick up any files in archive/ that logrotate failed to compress for whatever reason, we could close this task and feel good about it, IMO.

Andrew triaged this task as Normal priority.Feb 8 2015, 9:48 PM
ori added a comment.Feb 9 2015, 10:09 PM
In T88393#1016697, @ori wrote:

if we have a cron job pick up any files in archive/ that logrotate failed to compress for whatever reason, we could close this task and feel good about it, IMO.

Done in https://gerrit.wikimedia.org/r/#/c/188720/

hoo added a subscriber: hoo.Feb 10 2015, 9:12 PM
Andrew raised the priority of this task from Normal to High.Mar 10 2015, 9:08 PM

Change 195673 had a related patch set uploaded (by Andrew Bogott):
Reduce lifetime of api logs to 20 days.

https://gerrit.wikimedia.org/r/195673

Change 195673 abandoned by Andrew Bogott:
Reduce lifetime of api logs to 20 days.

https://gerrit.wikimedia.org/r/195673

Change 195677 had a related patch set uploaded (by Andrew Bogott):
Purge api-feature-usage logs older than 90 days.

https://gerrit.wikimedia.org/r/195677

Andrew added a subscriber: Andrew.Mar 10 2015, 9:20 PM

Bigger drives strike me as (relatively) cheap and easy solution for this.

Change 195677 merged by Andrew Bogott:
Purge api-feature-usage logs older than 90 days.

https://gerrit.wikimedia.org/r/195677

OK, current log retention policy looks like this:

API logs: 30 days
api-feature-usage logs: 90 days
xff logs: 88 days
everything else: 180 days

If we were to just declare '30 days for everything' then we can live with fluorine on existing hardware for a good long while. Can subscribers to this ticket please comment as to which, if any, of the logs need to go back more than 30 days?

Alternatively if the /current/ log rentention is already too short, then we probably need to order new servers pronto.

Andrew added a subscriber: Anomie.Mar 11 2015, 6:02 PM
hoo added a comment.Mar 11 2015, 6:03 PM

IMO 30 days are enough for API logs (and probably also for XFF logs, although I think we decided to no longer collect these at all?). Other logs (like exception and fatal logs) should be retained longer than 30 days (90 days minimum).

The job runner logs could also be retained for 30 days only.

also note that fluorine has another 300+GB free in the vg

root@fluorine:/a/mw-log/archive# vgs
  VG   #PV #LV #SN Attr   VSize VFree  
  vg0    2   1   0 wz--n- 2.19t 384.17g

so at ~14GB/day of compressed logs it seems we have plenty for at least 30 days indeed, I think we should go for that and then tackle sampled vs unsampled, see how much we're generating daily and plan capacity accordingly (assuming fluorine can withstand the unsampled udp log stream)

Krenair added a subscriber: Krenair.
mark added a subscriber: mark.Mar 31 2015, 12:15 PM

Let's use the extra 300 GB now, and procure additional hardware for it. If we're talking about just a few TB of additional storage, it's not worth spending engineering time on.

looks like after removing the stray uncompressed log files and the current retention we're stable in terms of disk usage:

api-feature-usage logs: 90 days

I've already said this elsewhere, but I'll put it here too so it can be found more easily in the future: 30 days would be fine for this log file.

I'm also hoping to reduce the flood of entries to that log in the reasonably near future.

so fluorine disk space is stable now after cleaning up uncompressed logs, what's left is unsampled vs sampled. if we go unsampled a daily rotated log file is unpractical IMO, sticking sth like cronolog might be a solution

bd808 added a subscriber: bd808.

@fgiunchedi why is daily rotation not practical for unsampled logs? Too big?

@Andrew yes, difficult to grep and compress and trim if needed

just for reference, sampling is defined in $wgDebugLogGroups in InitialiseSettings.php and currently at 1000 for api logs and xff is disabled

Joe added a comment.Apr 24 2015, 8:05 AM

@Andrew, @fgiunchedi is no one working on this? It is an old ticket, marked as high priority and it's unassigned.

no I don't think anyone is working on this, I mostly worked on it when on clinic duty, my plate is full already (of stateful problems, no less)

Is there anything that actually needs doing besides just removing the 'sample' from the 'api' entry in wmgMonologChannels?

That's leaving the speculation in T88393#1173311 that a daily file is "unpractical" for the future, since unpractical is better than useless.

bd808 added a comment.Apr 24 2015, 3:48 PM

My patch in rOMWC2680380cba02: debug logging: Convert to Monolog logging restores unsampled xff logs to fluorine. I left api sampled but that is easy to fix. If sampling is removed from the api log it should still be excluded from Logstash by leaving 'api' => array( 'logstash' => false ), in wmgMonologChannels.

ok even with unsampled xff fluorine grows at ~12G/day with ~800G free, if we're short on space again we can either move to another machine or replace fluorine's 500G disks with bigger ones

Change 206865 had a related patch set uploaded (by Anomie):
Remove sampling of api.log

https://gerrit.wikimedia.org/r/206865

Change 206865 merged by jenkins-bot:
Remove sampling of api.log

https://gerrit.wikimedia.org/r/206865

Anomie closed this task as Resolved.Apr 29 2015, 3:16 PM
Anomie claimed this task.
BBlack reopened this task as Open.Apr 29 2015, 3:24 PM
BBlack closed this task as Resolved.
BBlack reassigned this task from Anomie to RobH.
BBlack edited projects, added procurement; removed Patch-For-Review.
BBlack added a subscriber: BBlack.

Well I was going to keep this open for procuring disks and then it got closed while I was doing so. Maybe better put off in another task, so doing that instead :)

greg moved this task from INBOX to Done on the Release-Engineering-Team board.May 23 2015, 2:14 PM
sbassett moved this task from Backlog to Done on the Security-Team board.Jun 11 2019, 7:22 PM