
Swift container for performance flame graphs (ArcLamp)
Open, Needs Triage, Public

Description

The ArcLamp pipeline collects stack traces (via Redis) and produces the flame graphs seen at https://performance.wikimedia.org/php-profiling/.

Right now, the processed logfiles and SVGs are stored in /srv/xenon on webperf1002, and processing runs via cron on that host. We would like to run the pipeline in a more distributed manner (T227026), increase retention (T200108), and not require bespoke backup/restore/failover procedures for this data. I believe the path forward is thus to store this data in Swift. The analytics cluster was also considered; however, the data needs to be externally available via HTTPS on the performance site.

I've done some initial work towards rewriting the cron job to read/write data from a local Swift instance on my laptop. I would like to start running this on real data, initially in parallel with the current pipeline.
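
For concreteness, here is a minimal sketch (not the actual prototype) of what writing one output file to a local Swift instance could look like with python-swiftclient; the auth endpoint, credentials, and container name are placeholders for a laptop dev setup:

```python
from swiftclient.client import Connection

# Placeholder credentials for a local tempauth-style Swift (not production values).
conn = Connection(
    authurl='http://127.0.0.1:8080/auth/v1.0',
    user='test:tester',
    key='testing',
)

# Upload one generated flame graph so it can later be fetched over HTTP.
with open('2020-02-10.excimer.all.svg', 'rb') as f:
    conn.put_object(
        'arclamp-svgs',                  # hypothetical container name
        '2020-02-10.excimer.all.svg',
        contents=f,
        content_type='image/svg+xml',
    )
```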

This task is to determine replication (and other) settings and create Swift container(s) for this data. As I do not have admin rights on the Swift cluster, some SRE input and assistance is requested.

Event Timeline

dpifke created this task. · Feb 10 2020, 7:19 PM
Restricted Application added a subscriber: Aklapper. · Feb 10 2020, 7:19 PM
aaron added a subscriber: aaron. (Edited) · Feb 11 2020, 2:44 AM

Looks like hieradata/(swift|codfw)/params.yaml needs updating, along with the private puppet repo (beforehand).

I'd suggest "performance:arc-lamp" as the Swift account:user. It could have admin rights for "performance" (analogous to our main MediaWiki user, which can create containers automatically). Manual container creation can be done via https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_container (the "swift" client tool is just doing the HTTP POST for you here).
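
As a hedged illustration of that manual step, the equivalent of creating a container and allowing public reads via python-swiftclient might look like this (auth endpoint, credentials, and container name are placeholders):

```python
from swiftclient.client import Connection

conn = Connection(
    authurl='https://swift.example.org/auth/v1.0',  # placeholder auth endpoint
    user='performance:arc-lamp',
    key='REDACTED',
)

# Create the container and set a world-readable ACL (.r:*) so the proxy
# can serve its objects without authentication.
conn.put_container(
    'arclamp-logs',                                 # hypothetical container name
    headers={'X-Container-Read': '.r:*'},
)
```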

Great to see this work! Re: authentication and permissions, it is indeed as @aaron outlined: we'd be creating a user that can create containers and upload files at will.

A bunch of questions to get a better idea of the dataset: how big are you expecting this data to get over time? Does the data need to be replicated in codfw too (assuming you'd be starting to write in eqiad)? Our standard replication within a datacenter is 3x copies. We also have a standard rate limit of 30/s write operations (PUT/DELETE/POST); mentioning it in case it is relevant. Finally, in terms of access to the files, I'd imagine the containers would be public for reads and accessed by reverse-proxying via the webserver on the performance website?

Looking at yesterday's (2020-02-11) output, it was about 8 GB of (uncompressed) logs and 14 MB of SVGs, and about 800 files total. We can control the sampling interval to regulate how big these get, so let's assume it's relatively constant. I'll have to check if there's a reason we don't compress the logs; I feel like we should, which would dramatically reduce this. (I just now tried gzip -1 on one set of logs, and they went from 4 GB to 479 MB.)
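
A rough sketch of what that compression step could look like before the files are written out or uploaded; file names are examples, and level 1 matches the quick gzip -1 test above:

```python
import gzip
import shutil

def compress_log(src_path, dst_path, level=1):
    """Write a gzip-compressed copy of src_path to dst_path."""
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb', compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)

# Example: compress one day's log before upload.
compress_log('2020-02-11.excimer.all.log', '2020-02-11.excimer.all.log.gz')
```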

Yes, we want this data to be available in both data centers. Right now, the entire pipeline is duplicated in both places: we read from Redis twice, and generate two sets of files, one in each location. We could save some resources if replication happened at the file level instead. The Redis pubsub channel tracks what data has been consumed, so it shouldn't be too difficult to make processing fail over reliably.

At 800 files/day, we'll be well under the rate limit (assuming it's per-file and not per-block). The only time I could think it'd be an issue would be if the pipeline breaks and we have to backfill data, but so long as exceeding the rate limit generates a unique error, we can add logic to back off if we get throttled.
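
For illustration, a back-off wrapper along those lines might look like the following; the status codes checked (429/503) are an assumption about how throttling would surface, not confirmed behaviour of the cluster:

```python
import time

from swiftclient.exceptions import ClientException

def put_with_backoff(conn, container, name, contents, max_tries=5):
    """Upload an object, retrying with exponential back-off if throttled."""
    delay = 1
    for attempt in range(max_tries):
        try:
            return conn.put_object(container, name, contents=contents)
        except ClientException as e:
            # Assumed throttling responses; adjust once the real behaviour is known.
            if e.http_status in (429, 503) and attempt < max_tries - 1:
                time.sleep(delay)
                delay *= 2
            else:
                raise
```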

Are public Swift containers directly accessible? If so, I can think of points both for and against a reverse proxy, instead of linking directly. If Swift reads can only happen from inside, then the reverse proxy is definitely how we'll go.

aaron added a comment. · Feb 13 2020, 1:19 AM

Compression seems doable. LZMA works well per https://phabricator.wikimedia.org/T235455#5837382. arclamp-grep would have to change, though; maybe grep(fname, search_string) could stream the compressed log object contents to lzcat and loop through the resulting lines.
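
One possible shape for that change, sketched with Python's lzma module instead of shelling out to lzcat; the connection object and container/object names are hypothetical:

```python
import lzma

def grep(conn, container, obj_name, search_string):
    """Yield matching lines from an xz-compressed log object, streamed from Swift."""
    _headers, chunks = conn.get_object(container, obj_name, resp_chunk_size=1 << 20)
    decompressor = lzma.LZMADecompressor()
    buf = b''
    for chunk in chunks:
        buf += decompressor.decompress(chunk)
        *lines, buf = buf.split(b'\n')
        for line in lines:
            if search_string.encode() in line:
                yield line.decode('utf-8', 'replace')
    # Check whatever is left after the final chunk.
    if buf and search_string.encode() in buf:
        yield buf.decode('utf-8', 'replace')
```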

@fgiunchedi How hard is it to set up hourly swift-repl container sync for eqiad <=> codfw for such a container? That seems doable. It might be interesting to try https://docs.openstack.org/swift/latest/overview_container_sync.html for this. It has far less traffic than the MediaWiki upload repos, so it would be lower risk. My concern would be the fact that the daemon that listens to redis will keep appending JSON lines to the objects representing the "current" buckets (daily, hourly) for each endpoint. That might trigger a lot of naive "copy the whole file" writes each time some JSON profiling lines are flushed.

As long as there is proper buffering of JSON lines into chunks to periodically flush, I don't see the object write rate being a problem.
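
A minimal sketch of that kind of buffering, with illustrative thresholds and a hypothetical flush callback standing in for the code that rewrites the "current" Swift object:

```python
import time

class BufferedLineWriter:
    """Accumulate lines and flush in chunks to keep the object write rate low."""

    def __init__(self, flush_cb, max_lines=1000, max_seconds=60):
        self.flush_cb = flush_cb          # e.g. a function that rewrites the Swift object
        self.max_lines = max_lines
        self.max_seconds = max_seconds
        self.lines = []
        self.last_flush = time.monotonic()

    def add(self, line):
        self.lines.append(line)
        now = time.monotonic()
        if len(self.lines) >= self.max_lines or now - self.last_flush >= self.max_seconds:
            self.flush_cb('\n'.join(self.lines) + '\n')
            self.lines.clear()
            self.last_flush = now
```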

AFAIK, and having looked at puppet today, Swift is only indirectly accessible from the outside. We have upload.wikimedia.org/X, which falls back to a random/close ms-fe* Swift proxy server via DNS discovery. The custom WSGI middleware we have, rewrite.py, runs in the proxy server and reroutes the request as follows (a simplified sketch follows the list):

  • auth requests (/auth) or supposedly-authenticated (AUTH_) URLs pass directly to the core Swift response handler (a valid account name in the URL and the presence of a valid token header are enforced)
  • known public paths recognized by rewrite.py are rewritten to unauthenticated URLs for the corresponding Swift containers (within the "mw" account) and passed to the core Swift response handler (401 if the container does not have .r:* in its x-container-read ACL)
  • for any other path, an error response is returned
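
A much-simplified illustration of those three branches; the prefixes and helper below are placeholders, not rewrite.py's actual rules:

```python
# Illustrative stand-in for the public paths rewrite.py actually recognizes.
PUBLIC_PREFIXES = ('/wikipedia/commons/',)

def is_known_public_path(path):
    return path.startswith(PUBLIC_PREFIXES)

def route(path):
    # The '/v1/AUTH_' prefix check is an assumption about how authenticated URLs look.
    if path.startswith('/auth') or path.startswith('/v1/AUTH_'):
        return 'pass to core Swift handler (account/token checks enforced)'
    if is_known_public_path(path):
        return 'rewrite to unauthenticated URL for the mapped "mw" container'
    return 'error response'
```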

> Looking at yesterday's (2020-02-11) output, it was about 8 GB of (uncompressed) logs and 14 MB of SVGs, and about 800 files total. We can control the sampling interval to regulate how big these get, so let's assume it's relatively constant. I'll have to check if there's a reason we don't compress the logs; I feel like we should, which would dramatically reduce this. (I just now tried gzip -1 on one set of logs, and they went from 4 GB to 479 MB.)

Sounds good. From T200108, even without compression it seems the whole dataset is going to be ~200 GB, which is OK.

> Yes, we want this data to be available in both data centers. Right now, the entire pipeline is duplicated in both places: we read from Redis twice, and generate two sets of files, one in each location. We could save some resources if replication happened at the file level instead. The Redis pubsub channel tracks what data has been consumed, so it shouldn't be too difficult to make processing fail over reliably.

> At 800 files/day, we'll be well under the rate limit (assuming it's per-file and not per-block). The only time I could think it'd be an issue would be if the pipeline breaks and we have to backfill data, but so long as exceeding the rate limit generates a unique error, we can add logic to back off if we get throttled.

OK! Yeah, the rate limit doesn't seem to be a problem in practice; we can whitelist the account and/or bump the limits if it comes to that.

> Are public Swift containers directly accessible? If so, I can think of points both for and against a reverse proxy, instead of linking directly. If Swift reads can only happen from inside, then the reverse proxy is definitely how we'll go.

As @aaron mentioned, the containers are accessible via our frontend caching layer; in the case of Commons/upload.wikimedia.org the caching layer talks directly to Swift for historical reasons, and we have a custom middleware in Swift to translate container names. For new use cases, having the service/webserver reverse-proxy to Swift is recommended, so the request flow (all HTTPS) will be: clients -> frontend caches -> apache on webperf -> swift.

> @fgiunchedi How hard is it to set up hourly swift-repl container sync for eqiad <=> codfw for such a container? That seems doable. It might be interesting to try https://docs.openstack.org/swift/latest/overview_container_sync.html for this. It has far less traffic than the MediaWiki upload repos, so it would be lower risk. My concern would be the fact that the daemon that listens to redis will keep appending JSON lines to the objects representing the "current" buckets (daily, hourly) for each endpoint. That might trigger a lot of naive "copy the whole file" writes each time some JSON profiling lines are flushed.

Not hard to set up container sync; in fact, that's what we are doing for the docker registry (cf. T214289, T227570, plus puppet). Although I'm confused about the JSON file you mentioned: I was under the impression that only output SVGs and logs would be stored in Swift (i.e. https://performance.wikimedia.org/arclamp/logs/). In terms of failover, I don't know offhand if it is possible to keep the same container name two-way synchronized (assuming only one writer at a time, or that the object names don't collide).
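
For reference, a hedged sketch of the standard Swift container-sync setup, pointing one cluster's container at its peer; the realm path, endpoints, account, and shared key are placeholders:

```python
from swiftclient.client import Connection

eqiad = Connection(
    authurl='https://swift-eqiad.example.org/auth/v1.0',  # placeholder
    user='performance:arc-lamp',
    key='REDACTED',
)

# Point the eqiad container at its codfw counterpart; a mirror-image header
# on the codfw side would be needed for two-way sync.
eqiad.post_container('arclamp-logs', headers={
    # Realm and cluster names come from container-sync-realms.conf (assumption).
    'X-Container-Sync-To': '//realm/codfw/AUTH_performance/arclamp-logs',
    'X-Container-Sync-Key': 'shared-secret',
})
```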

In tactical/practical terms: I can assist with this and can review patches but my time spent on swift is limited (mostly maintenance) as swift maintainership itself is currently not funded.

HTH!

Change 572129 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] Add Swift user for ArcLamp

https://gerrit.wikimedia.org/r/572129

I submitted a patch which I *think* does what's needed to create the user, less the private keys. I don't know if there's more to it than this, but hopefully it's a starting point.

Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.

Change 572129 merged by Filippo Giunchedi:
[operations/puppet@production] Add Swift user for ArcLamp

https://gerrit.wikimedia.org/r/572129

Mentioned in SAL (#wikimedia-operations) [2020-02-19T08:14:09Z] <godog> roll restart swift proxies - T244776

In case it is useful: we have set up a separate Swift cluster to host Prometheus long-term data (https://wikitech.wikimedia.org/wiki/Thanos) which, unlike the main Swift cluster, is multi-site; let me know if that's something you'd be interested in trying.