
Update arclamp to active/active architecture
Open, MediumPublic

Description

Currently arclamp uses Redis in an active/standby architecture (active in eqiad, standby in codfw).

This has a few downsides: for instance, failover is manual and involves commits in a few different places, and the architecture lacks durability at the site level.

Since we're in the process of dedicating hardware to arclamp/redis (T327277), let's evaluate possible approaches to deploy arclamp with a basic active/active architecture.

For the purposes of this task, I think it'll be best to focus on minimally invasive, config-only changes as much as possible.


current status: draft, gathering options, please edit/add notes/feedback/etc

Option: independent redis instances on arclamp hosts

  • Deploy individual redis instances to arclamp1001 and arclamp2001
  • Direct writes to the site local redis instance
  • Add/update arclamp-log jobs to a per-site layout, e.g. both arclamp hosts run:
    • arclamp-log.py /etc/arclamp-log-excimer-eqiad.yaml -- redis_host: arclamp1001
    • arclamp-log.py /etc/arclamp-log-excimer-codfw.yaml -- redis_host: arclamp2001
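As a sketch, those two per-site config files might look something like this (only the `redis_host` values come from the bullets above; `redis_channel` is taken from the default mentioned later in the thread, and the file contents overall are a hypothetical placeholder, not the real arclamp-log config schema):

```yaml
# /etc/arclamp-log-excimer-eqiad.yaml (hypothetical sketch)
redis_host: arclamp1001
redis_channel: arclamp

# /etc/arclamp-log-excimer-codfw.yaml (hypothetical sketch)
redis_host: arclamp2001
redis_channel: arclamp
```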

In theory this would remove the current manual failover steps, with the possible wrinkle that two arclamp-log.py processes writing to the same files may have side effects (I've not tested this), and low-activity thresholds may need adjusting.


Event Timeline

herron triaged this task as Medium priority.

In general LGTM!

Though I don't think outputting to the same files from two different processes is going to work without corruption (i.e. we'd have to coordinate writes to file somehow, or write to distinct log files?).

Keeping the write coordination within arclamp-log.py seems desirable to me, meaning we'd have to be able to read messages from more than one Redis and write the "joined" stream to files. I don't know offhand how easy or hard that would be to achieve, though!

This approach seems like it could work, though arclamp-log.py would need some exclusive locks around the log.write() calls. The additional reordering of log entries with an active/active approach should not cause problems since we use the timestamps in the JSON entries and sort on that.
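As a rough sketch of the exclusive locking suggested above (the names `write_entry` and `write_lock` are hypothetical illustrations, not from the actual arclamp-log.py code):

```python
# Hypothetical sketch: serializing appends from multiple reader threads
# with an exclusive lock, so log lines never interleave mid-entry.
import threading

write_lock = threading.Lock()

def write_entry(log_file, entry):
    # Only one thread at a time may append to the shared file handle.
    with write_lock:
        log_file.write(entry + "\n")
        log_file.flush()
```

Entries may still arrive out of order across threads, but as noted above that is handled downstream by sorting on the timestamps in the JSON entries.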

I like the idea of needing fewer config changes on switchover and avoiding cross-datacenter operations within MediaWiki requests (even postsend). Refining the primary/standby approach would also be fine by me. It's really a question of which one is simpler.

I think it'll be easier to reason about the system if the tracelog files remain "owned" by a single process. There is also the pruning logic that runs as part of that, which seems odd to run multiple copies of, especially as it gives the incorrect appearance that multiple retention settings can be configured when in fact one would be ineffective. Providing two config files with duplicate settings that "must" be identical is added complexity and confusion we can do without.

Having said that, I think that also applies to the notion of reading from multiple Redis instances and merging the results. This increases the number of places where races and discrepancies are expected from ~1 to ~4, making it a practical certainty that sharing a flame graph URL on performance.wikimedia.org would show different numbers to different people depending on where they view it, which seems confusing. E.g. if I link one from a task and mention there are 1312 samples, you might see 1311 in the SVG if you read it from a geo location closer to Codfw. @aaron correctly pointed out in our triaging meeting that these discrepancies don't matter functionally in terms of the service we provide. The variance is fine, whether it's data lost from both buckets, or non-determinism in which bucket a sample falls. But having it consistently be inconsistent in which way we're inconsistent, I think, exposes internal guts too much and puts the burden on the end user more than I'd like.

Perhaps an incremental step could be to indeed provision Redis in both data centers, but follow the primary DC in terms of which one we write to and read from. Today there are no switchover steps for arclamp; similar to Statsd and Graphite, we tend to keep it in Eqiad and/or handle it in a separate switchover. If we provision arclamp-redis in both DCs we can start doing a switchover for it anytime ahead of MediaWiki's switchover.

This would have the benefit that we keep a single flow of data, and thus keep the output more deterministic. One thing that complicates the switch with one read source is that it requires a restart (which loses a second of data, given no resumption), and the switch also isn't atomic between MW and ArcLamp (which loses another few seconds of data). This is acceptable for Arc Lamp, but I agree it'd be neater if we supported multiple read sources. That way, we could just always read from both and not care which one is primary, and thus have one less thing to switch/restart.

Thank you for the extensive reasoning and context @Krinkle @aaron !

I see what you mean re: having inconsistent data, something I hadn't thought about. At steady state I believe the data should be the same though? i.e. at the hourly buckets both arclamp hosts would have read the same data from both Redis pubsub channels, if I'm not mistaken?

We'll definitely be provisioning Redis on both arclamp hosts as part of T327277. I'm not convinced the interim option of following the primary DC is worth the effort compared to the status quo (i.e. switching over arclamp independently of MW, by flipping MW itself and then arclamp).

Supporting multiple read sources would indeed be nice, with the main benefit of requiring no actions on switchover. I have looked at the code, and something else I didn't consider the other day is the added complexity, as we would be moving from a single-threaded Python process to async/multiple threads. Overall I think it'd be manageable though: for example, one read thread per Redis, all writing to a queue, with the main thread in charge of reading the queue and serializing writes to files.
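The fan-in layout described here can be sketched roughly as follows. This is a minimal illustration, not the actual patch: plain iterables stand in for the Redis pub/sub subscriptions, and the names `reader` and `drain` are hypothetical.

```python
# Hypothetical sketch: one reader thread per source, a shared bounded
# queue, and the main thread as the single consumer/writer.
import queue
import threading

def reader(source, out_queue):
    # One thread per Redis instance: push every received message onto
    # the shared queue. `source` is any iterable standing in for a
    # pub/sub subscription.
    for message in source:
        out_queue.put(message)

def drain(sources, timeout=1.0):
    # Main thread: merge messages from all sources into one stream,
    # preserving a single writer for the log files.
    q = queue.Queue(maxsize=1000)  # bounded, as in the proposal
    threads = [threading.Thread(target=reader, args=(s, q)) for s in sources]
    for t in threads:
        t.start()
    merged = []
    while True:
        try:
            merged.append(q.get(timeout=timeout))
        except queue.Empty:
            break  # no source produced anything within the timeout
    for t in threads:
        t.join()
    return merged
```

In the real service the main loop would write each dequeued entry to the tracelog files instead of collecting them in a list, keeping file ownership with a single thread.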


So I was curious what this would look like and gave it a try here (WIP but definitely reviewable; contains kinda-unrelated changes too): https://gerrit.wikimedia.org/r/q/topic:multiple-redis

The gist is that reading from pubsub.subscribe(config.get('redis_channel', 'arclamp')) is changed to read from a queue.Queue. The (bounded) queue is fed by multiple Redis connections, each doing pubsub.subscribe. Semantics stay the same, in the sense that if the queue is empty for a certain time then the process exits. In this case, however, a single Redis losing its connection would make its thread die, so Redis reconnections are also handled as a side effect, for improved resiliency (e.g. arclamp-log.py will start up and attempt (re)connections to all Redis instances). For ease of deployment, the configuration is backwards-compatible with what we have now. Let me know what you think!
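The reconnect-instead-of-die behaviour described here could be structured roughly like the wrapper below. This is a hedged sketch of the idea, not the code in the Gerrit patch: `listen_with_reconnect`, `connect`, and `handle` are hypothetical names, and a plain callable stands in for the Redis pub/sub setup.

```python
# Hypothetical sketch: keep one per-Redis reader thread alive across
# connection failures by retrying the (re)connect with a delay, rather
# than letting the thread die on the first ConnectionError.
import time

def listen_with_reconnect(connect, handle, max_retries=None, delay=1.0):
    # `connect` establishes a subscription and returns an iterable of
    # messages; `handle` consumes each message (e.g. puts it on a queue).
    attempts = 0
    while max_retries is None or attempts < max_retries:
        try:
            for message in connect():
                handle(message)
            return  # source closed cleanly
        except ConnectionError:
            attempts += 1
            time.sleep(delay)  # back off before reconnecting
```

With max_retries left as None, each reader thread keeps retrying forever, which matches the described start-up behaviour of attempting (re)connections to all Redis instances.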

We might want to do this next year together with a rewrite of the arc-lamp suite into PHP.


Thank you! I think the approach looks great overall, and am in support of making the incremental improvement to what we have today. If/when you want to resurrect the patches I'd be happy to give a review.


Thanks, sounds like something to consider, although it is also quite a lot larger in scope than what I had imagined for this task. I'd be happy to add my thoughts on the rewrite task though; local redundancy in particular comes to mind.

Ack. For now, we have a decade or so behind us with no issues around the Redis arc-lamp setup that we're aware of, or that were worth avoiding for the level of service we expect to provide with the flame graphs. During switchovers, it has stayed on Eqiad on mwlog1001 so far, and I think that's fine to continue given it's out of the critical path. Having it away from mwlog and having a cold standby in the secondary DC are fine incremental improvements. Beyond that I'd rather push back for a few quarters while we prioritise other work. I'd prefer not to have others do the work instead, because this is a good opportunity for new team members to feel connected with the stack and make their contributions, even if only through code review and testing when the time comes.

lmata subscribed.

It seems like this is not a priority, so we'll postpone it for now. We can revisit it when the time comes.