Store unsampled API and XFF logs
Closed, ResolvedPublic
Assigned To

Authored By

	• tstarling
	Feb 3 2015, 3:28 AM

Description

Apparently unsampled API logs were disabled by Ori in December, and XFF logs were disabled by Reedy in January. I don't understand how we can respond to abuse and DoS attacks without these logs.

When we bought Fluorine, it was sized to have sufficient disk I/O to store unsampled Apache access logs with a short retention time: https://rt.wikimedia.org/Ticket/Display.html?id=2400 . Unfortunately this was never implemented -- it would still be useful in my opinion. Instead, we stored XFF and API logs, and tuned the retention time of the API logs so that they fit on the disk.

If the API logs stop fitting on the disk on fluorine, the first thing to try should be to reduce the retention time. This is currently 30 days, see files/misc/scripts/mw-log-cleanup in puppet.
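A retention cut of this kind boils down to deleting archived files older than a cutoff. The following is a minimal sketch in Python, not the actual mw-log-cleanup script from puppet; the function name `purge_old_logs`, the `api.log-*` glob pattern used in the usage note, and the use of file mtime as the age are all assumptions made for illustration.

```python
import time
from pathlib import Path


def purge_old_logs(archive_dir, pattern, max_age_days, now=None, dry_run=True):
    """Delete files in archive_dir matching pattern whose mtime is older than max_age_days.

    Returns the sorted list of paths that were removed (or, with dry_run=True,
    that would have been removed).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for path in sorted(Path(archive_dir).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Under these assumptions, shortening retention is a one-argument change, e.g. `purge_old_logs("/a/mw-log/archive", "api.log-*", 7)` instead of `30`. The default `dry_run=True` only reports what would be deleted.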

If fluorine is now so undersized, 2.5 years on, that we can't even store 7 days of API access logs, then we should buy new hardware for it.

Details

Subject	Repo	Branch	Lines +/-
Remove sampling of api.log	operations/mediawiki-config	master	+1 -1
Purge api-feature-usage logs older than 90 days.	operations/puppet	production	+3 -0
Reduce lifetime of api logs to 20 days.	operations/puppet	production	+2 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		fgiunchedi	T88393 Store unsampled API and XFF logs
		Resolved		fgiunchedi	T92417 Investigation if Fluorine needs bigger disks or we retain too much data

Event Timeline

• tstarling created this task.Feb 3 2015, 3:28 AM

• tstarling raised the priority of this task from to Needs Triage.

• tstarling updated the task description.

• tstarling added projects: MediaWiki-Core-Team, acl*sre-team.

• tstarling subscribed.

Restricted Application added a subscriber: Aklapper.Feb 3 2015, 3:28 AM

• tstarling updated the task description.Feb 3 2015, 3:29 AM

• tstarling set Security to None.

Honestly I don't see the point in creating a godzilla 130 GB file every day.

A correct way to tackle this is probably rotating the file more often than daily, and keep 7 days of retention if possible.

In T88393#1010638, @Joe wrote:

Honestly I don't see the point in creating a godzilla 130 GB file every day.

If we knew which HHVM server(s) were involved in T87645, that would help with isolating that bug. That information would have been in the API log if it existed.

Another plausible scenario is:

The site goes down
15 minutes later, the site comes back up
5 minutes after that, initial ops response begins. Maybe MySQL fell over from overload but nobody knows where the queries came from. Will it happen again? The API log tells you wall clock execution time, it's possible to pull out slow and/or high volume API queries and model their effect on MySQL load. Maybe it was an accidental DoS -- then you have the username in the API log and you can ask them nicely not to do it again. Maybe it was deliberate DoS, then you have the IPs and full details of the vulnerability in the API log. If it wasn't API requests, we're screwed, especially without the XFF log, but at least with API DoS we have some hope of working out what happened. Grepping a 130GB log is not easy, but when all hands are on deck, it's a minor problem, compared to not having any logs at all.
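Pulling slow and/or high-volume API clients out of a large log can be done in one streaming pass. The sketch below is only an illustration: the tab-separated field layout (timestamp, server, ip, user, elapsed_ms, query) is HYPOTHETICAL and not taken from the real api.log, and both function names are made up here.

```python
from collections import defaultdict


def summarize_api_log(lines):
    """Aggregate request count and total wall-clock time per (user, ip).

    Assumes a HYPOTHETICAL tab-separated layout:
        timestamp, server, ip, user, elapsed_ms, query
    The real api.log format may differ; adjust the field positions to match.
    """
    totals = defaultdict(lambda: [0, 0.0])  # (user, ip) -> [requests, total_ms]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or truncated lines
        _, _, ip, user, elapsed_ms, _ = fields[:6]
        try:
            ms = float(elapsed_ms)
        except ValueError:
            continue  # skip lines whose timing field is not a number
        entry = totals[(user, ip)]
        entry[0] += 1
        entry[1] += ms
    return totals


def heaviest_clients(totals, n=10):
    """Return the n clients with the largest total wall-clock time, heaviest first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(user, ip, count, total_ms) for (user, ip), (count, total_ms) in ranked[:n]]
```

Because `summarize_api_log` accepts any iterable of lines, an open file object can be passed directly, so even a very large log is read line by line rather than loaded into memory.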

A correct way to tackle this is probably rotating the file more often than daily, and keep 7 days of retention if possible.

Rotating hourly or whatever would have been fine in the case of T87645 since I know the time of the event to within a few seconds. I think it would generally be beneficial.

Legoktm subscribed.Feb 3 2015, 10:11 PM

• MZMcBride subscribed.Feb 4 2015, 4:33 AM

I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be replaced by 50 lines of your favourite scripting language.
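As a rough idea of what such a replacement script might look like, here is a minimal Python sketch of hourly rotation by rename. It is not an existing tool; the function names, the `name-YYYYMMDDHH` archive naming, and the `reopen` callback (how the writing process is told to reopen its file) are assumptions.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def hourly_archive_name(log_path, when):
    """Build an archive file name such as api.log-2015020315 from the hour in `when`."""
    return f"{Path(log_path).name}-{when.strftime('%Y%m%d%H')}"


def rotate(log_path, archive_dir, when=None, reopen=None):
    """Move log_path into archive_dir under an hourly name.

    reopen is an optional callable that tells the writing process to reopen
    its output file (for example by sending it a signal); how that is done
    depends on the log writer and is left to the caller.
    Returns the archived path, or None if there was nothing to rotate.
    """
    when = when or datetime.now(timezone.utc)
    src = Path(log_path)
    if not src.exists() or src.stat().st_size == 0:
        return None
    os.makedirs(archive_dir, exist_ok=True)
    dest = Path(archive_dir) / hourly_archive_name(src, when)
    if dest.exists():
        raise FileExistsError(f"refusing to overwrite {dest}")
    os.rename(src, dest)  # atomic when source and destination share a filesystem
    if reopen is not None:
        reopen()
    return dest
```

Run once an hour from cron, this would yield one archive file per hour, which a separate job could then compress and expire.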

In T88393#1016638, @tstarling wrote:

I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be replaced by 50 lines of your favourite scripting language.

When there are multiple viable options, I ask myself what our donors would want us to do. In this case, I think they'd prefer that we just get bigger hard drives sooner than spend developer time. Daily rotation of the un-sampled logs was fine except when rotated files were left un-gzipped.

Increasing fluorine's storage capacity so we have additional breathing room seems very much worth the investment, especially since it has only recently pushed against the limits after two years of loyal service. If we do that, and if we have a cron job pick up any files in archive/ that logrotate failed to compress for whatever reason, we could close this task and feel good about it, IMO.
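The cron job idea could be sketched roughly as follows in Python. This is an illustration under assumptions, not the script used in production: the function name `compress_stray_logs` is invented, and it treats every file in the directory without a `.gz` suffix as a rotated file that still needs compressing.

```python
import gzip
import shutil
from pathlib import Path


def compress_stray_logs(archive_dir, suffix=".gz"):
    """Gzip every regular file in archive_dir that does not already end in suffix.

    Returns the list of compressed paths that were created.
    """
    created = []
    for path in sorted(Path(archive_dir).iterdir()):
        if not path.is_file() or path.name.endswith(suffix):
            continue
        target = path.with_name(path.name + suffix)
        if target.exists():
            continue  # a compressed copy already exists; leave both for a human to check
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
        path.unlink()  # remove the original only after the compressed copy is written
        created.append(target)
    return created
```

Running it a second time is a no-op, which makes it safe to schedule repeatedly.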

ori mentioned this in rOPUP6f4f681fe16d: mw-log-cleanup: find and compress uncompressed rotated files.Feb 5 2015, 2:41 AM

Andrew triaged this task as Medium priority.Feb 8 2015, 9:48 PM

In T88393#1016697, @ori wrote:

if we have a cron job pick up any files in archive/ that logrotate failed to compress for whatever reason, we could close this task and feel good about it, IMO.

Done in https://gerrit.wikimedia.org/r/#/c/188720/

hoo subscribed.Feb 10 2015, 9:12 PM

Andrew raised the priority of this task from Medium to High.Mar 10 2015, 9:08 PM

Change 195673 had a related patch set uploaded (by Andrew Bogott):
Reduce lifetime of api logs to 20 days.

https://gerrit.wikimedia.org/r/195673

gerritbot added a project: Patch-For-Review.Mar 10 2015, 9:10 PM

Change 195673 abandoned by Andrew Bogott:
Reduce lifetime of api logs to 20 days.

https://gerrit.wikimedia.org/r/195673

Change 195677 had a related patch set uploaded (by Andrew Bogott):
Purge api-feature-usage logs older than 90 days.

https://gerrit.wikimedia.org/r/195677

Bigger drives strike me as a (relatively) cheap and easy solution for this.

Andrew added a project: hardware-requests.Mar 10 2015, 9:20 PM

Change 195677 merged by Andrew Bogott:
Purge api-feature-usage logs older than 90 days.

https://gerrit.wikimedia.org/r/195677

Andrew mentioned this in rOPUPc57ea39296b3: Purge api-feature-usage logs older than 90 days..Mar 10 2015, 9:23 PM

Andrew removed a project: hardware-requests.Mar 11 2015, 5:51 PM

OK, current log retention policy looks like this:

API logs: 30 days
api-feature-usage logs: 90 days
xff logs: 88 days
everything else: 180 days

If we were to just declare '30 days for everything' then we can live with fluorine on existing hardware for a good long while. Can subscribers to this ticket please comment as to which, if any, of the logs need to go back more than 30 days?

Alternatively, if the /current/ log retention is already too short, then we probably need to order new servers pronto.

Andrew added a subscriber: Anomie.Mar 11 2015, 6:02 PM

IMO 30 days are enough for API logs (and probably also for XFF logs, although I think we decided to no longer collect these at all?). Other logs (like exception and fatal logs) should be retained longer than 30 days (90 days minimum).

The job runner logs could also be retained for 30 days only.

Anomie mentioned this in T92653: API returning a 503 error for the same query.Mar 13 2015, 11:38 PM

fgiunchedi mentioned this in T94396: flamegraph (xenon) is using most of fluorine's memory.Mar 30 2015, 10:56 AM

also note that fluorine has another 300+GB free in the vg

root@fluorine:/a/mw-log/archive# vgs
  VG   #PV #LV #SN Attr   VSize VFree  
  vg0    2   1   0 wz--n- 2.19t 384.17g

so at ~14GB/day of compressed logs it seems we have plenty for at least 30 days indeed, I think we should go for that and then tackle sampled vs unsampled, see how much we're generating daily and plan capacity accordingly (assuming fluorine can withstand the unsampled udp log stream)
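The capacity reasoning here is simple arithmetic, sketched below in Python. The helper names are made up, and the sketch assumes a constant daily volume and treats the "GB" of the daily estimate and the "g" reported by vgs as the same unit.

```python
def steady_state_gb(daily_gb, retention_days):
    """Disk footprint once retention is full: each day adds daily_gb and drops the oldest day."""
    return daily_gb * retention_days


def days_of_headroom(free_gb, daily_gb):
    """How many days of logs fit into free_gb at a constant daily_gb."""
    return free_gb / daily_gb


# Figures quoted in this comment: ~14 GB/day compressed, 384.17g free in the VG.
footprint_30_days = steady_state_gb(14, 30)    # 420 GB held at steady state
extra_days = days_of_headroom(384.17, 14)      # a bit over 27 days of additional logs
```

If the ~14 GB/day figure covers everything subject to a 30-day retention, the steady-state footprint is about 420 GB, and the unallocated VG space alone would absorb roughly another 27 days of logs at that rate.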

Krenair added a subscriber: Jalexander.Mar 30 2015, 11:29 AM

Krenair subscribed.

Let's use the extra 300 GB now, and procure additional hardware for it. If we're talking about just a few TB of additional storage, it's not worth spending engineering time on.

looks like after removing the stray uncompressed log files and the current retention we're stable in terms of disk usage:

screenshot_vWjyUZ.png (371×744 px, 41 KB)

In T88393#1110251, @Andrew wrote:

api-feature-usage logs: 90 days

I've already said this elsewhere, but I'll put it here too so it can be found more easily in the future: 30 days would be fine for this log file.

I'm also hoping to reduce the flood of entries to that log in the reasonably near future.

so fluorine disk space is stable now after cleaning up uncompressed logs, what's left is unsampled vs sampled. if we go unsampled a daily rotated log file is unpractical IMO, sticking in something like cronolog might be a solution

• bd808 added a project: MediaWiki-Debug-Logger.Apr 3 2015, 6:44 PM

• bd808 subscribed.

• RobLa-WMF edited projects, added Release-Engineering-Team, Security-Team; removed MediaWiki-Core-Team.Apr 8 2015, 10:19 PM

@fgiunchedi why is daily rotation not practical for unsampled logs? Too big?

@Andrew yes, difficult to grep and compress and trim if needed

just for reference, sampling is defined in $wgDebugLogGroups in InitialiseSettings.php and currently at 1000 for api logs and xff is disabled
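To make the effect of a sampling factor concrete, here is a small Python sketch of 1-in-N sampling. It illustrates the general idea only and is not MediaWiki's logging code; `should_log` is a hypothetical name.

```python
import random


def should_log(sample, rng=random):
    """Return True for roughly 1 in `sample` calls; sample=1 logs everything."""
    return rng.randint(1, sample) == 1
```

With a factor of 1000, any individual request has about a 0.1% chance of appearing in the log, so a sampled log is fine for aggregate statistics but will almost never contain the one request you are trying to trace.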

@Andrew, @fgiunchedi is no one working on this? It is an old ticket, marked as high priority and it's unassigned.

no I don't think anyone is working on this, I mostly worked on it when on clinic duty, my plate is full already (of stateful problems, no less)

Is there anything that actually needs doing besides just removing the 'sample' from the 'api' entry in wmgMonologChannels?

That's leaving the speculation in T88393#1173311 that a daily file is "unpractical" for the future, since unpractical is better than useless.

My patch in rOMWC2680380cba02: debug logging: Convert to Monolog logging restores unsampled xff logs to fluorine. I left api sampled but that is easy to fix. If sampling is removed from the api log it should still be excluded from Logstash by leaving 'api' => array( 'logstash' => false ), in wmgMonologChannels.

ok even with unsampled xff fluorine grows at ~12G/day with ~800G free, if we're short on space again we can either move to another machine or replace fluorine's 500G disks with bigger ones

Change 206865 had a related patch set uploaded (by Anomie):
Remove sampling of api.log

https://gerrit.wikimedia.org/r/206865

Change 206865 merged by jenkins-bot:
Remove sampling of api.log

https://gerrit.wikimedia.org/r/206865

Anomie mentioned this in rOMWC96f7fc76bd9e: Remove sampling of api.log.Apr 29 2015, 3:08 PM

Anomie closed this task as Resolved.Apr 29 2015, 3:16 PM

Anomie claimed this task.

Well I was going to keep this open for procuring disks and then it got closed while I was doing so. Maybe better put off in another task, so doing that instead :)

BBlack mentioned this in T97537: Procure bigger disks for fluorine logs.Apr 29 2015, 3:26 PM

BBlack mentioned this in T92417: Investigation if Fluorine needs bigger disks or we retain too much data.Apr 29 2015, 3:33 PM

fgiunchedi claimed this task.Apr 29 2015, 3:45 PM

fgiunchedi closed subtask T92417: Investigation if Fluorine needs bigger disks or we retain too much data as Resolved.May 6 2015, 7:02 AM

greg moved this task from INBOX to Done on the Release-Engineering-Team board.May 23 2015, 2:14 PM

sbassett moved this task from Incoming to Our Part Is Done on the Security-Team board.Jun 11 2019, 7:22 PM

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:41 PM