
[Spike] Plan ingress system to use with new Kafka topic for landing page and impression data
Closed, Resolved · Public · 1 Story Point

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 23 2018, 7:27 PM

It has been suggested that we use python-kafka now... Would this be a complement to, or a replacement for, the current Python scripts? How would they interact?

If we move only to python-kafka (potentially a good idea, because that would mean one less snowflake to maintain), but there are limits on how it can munge the info, should we try munging more on the client, to ensure full data equivalence?

It looks like python-kafka is definitely the way to go for receiving data from a Kafka producer and processing it in Python. It looks like we can adapt the current Python script to receive data this way and still keep much of the processing code as-is. It's also probably a good opportunity to remove cruft and refactor parts of the script.

We'll have to tune for scalability... It looks like Kafka consumers normally run continuously, but maybe not? The current script inserts data into the DB in chunks, so I assume we'd keep that aspect of it.
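To make that concrete, here's a rough sketch of what an adapted consumer could look like, assuming the kafka-python and PyMySQL libraries. The broker, topic, group, database, table, and column names below are all placeholders (not our actual config), and the munging step would be the existing script's logic:

```
# Rough sketch only: broker, topic, group, database, table, and column names
# are placeholders, not our real config. Assumes kafka-python and PyMySQL,
# and EventLogging-style messages where schema fields live under 'event'.
import json

import pymysql
from kafka import KafkaConsumer

BATCH_SIZE = 500  # insert in chunks, as the current script does

consumer = KafkaConsumer(
    'eventlogging_LandingPageImpression',      # placeholder topic name
    bootstrap_servers=['kafka-broker:9092'],   # placeholder broker
    group_id='frack_consumer0',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

db = pymysql.connect(host='db-host', user='fr_user',
                     password='CHANGE_ME', database='fr_analytics')
batch = []

for message in consumer:
    event = message.value
    # The existing munging/validation code from the current script would go here.
    batch.append((event['event'].get('landingpage'), event['event'].get('country')))
    if len(batch) >= BATCH_SIZE:
        with db.cursor() as cur:
            cur.executemany(
                'INSERT INTO landing_page_impressions (landing_page, country) '
                'VALUES (%s, %s)',
                batch,
            )
        db.commit()
        batch = []
```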

At this point, I can't think of any arguments for staying with the file-to-python-to-mysql pipeline... and indeed, I think that was one of the weaker points in the existing system, no?

Following up on the question in the task description:

limits on how it can munge the info, should we try munging more on the client, to ensure full data equivalence?

No, it can do anything. It's just a Python library that we'd import and use to ingress the data, instead of reading files, as we currently do.

Here's one possible issue: backfill in case of failure. In the current system, I don't think files with this data are ever deleted from the FR cluster, so if there's a problem with any part of the pipeline downstream of the files, we have practically no time limit on our ability to backfill.

On the other hand, Kafka EL messages stick around in that system for only 8 days. Though the same data is in Hive for 90 days, I don't know if you can easily recreate the original Kafka messages from it. So possibly only an 8-day window, at least for low-effort backfill, I guess.

Another possible issue is fault tolerance. Janky though it may be, the current system is quite fault-tolerant, at least if, say, the DB goes down for a bit. How much effort would it be to build the same degree of fault tolerance into a direct Kafka-to-MySQL pipeline? What would we gain in exchange for doing so? Would it be worth it?
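One pattern that might get us most of the way there (just a sketch, under the same placeholder-name assumptions as above, using kafka-python): turn off auto-commit and only commit offsets back to Kafka after a successful DB write, so a MySQL outage just pauses consumption and the consumer picks up where it left off.

```
# Sketch of at-least-once handling: offsets are committed only after a
# successful MySQL insert, so a brief DB outage pauses consumption instead
# of losing events. Topic, broker, and group names are placeholders.
import time

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging_CentralNoticeImpression',   # placeholder topic name
    bootstrap_servers=['kafka-broker:9092'],  # placeholder broker
    group_id='frack_consumer0',
    enable_auto_commit=False,                 # we commit manually below
)

def insert_rows(rows):
    """Placeholder for the existing chunked MySQL insert logic."""
    raise NotImplementedError

for message in consumer:
    while True:
        try:
            insert_rows([message.value])
            consumer.commit()   # record progress only once the write succeeded
            break
        except Exception:       # e.g. MySQL briefly unavailable
            time.sleep(10)      # back off and retry; Kafka retains the messages
```

The tradeoff is that the Kafka retention window discussed above becomes the hard limit on how long such an outage could last before we'd lose data.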

Seems one option is to use Apache Flume with a JDBC sink...

Streaming Kafka Messages to MySQL Database

The Flume JDBC sink used in that example doesn't seem very actively maintained, though.

Another consideration is, we may want to seriously revamp this whole analytics system one day, including what data is stored, and how it's stored and queried. Any effort we expend jiggling the current infrastructure around might be better used for that more ambitious task. As in, maybe we should change as little as possible, and instead think about how to get a much better system in the medium-term?

As in, maybe we should change as little as possible, and instead think about how to get a much better system in the medium-term?

Possibly! A huge part of the Event Data Platform is to make it easier to get events into different stores, including MySQL, most likely using Kafka Connect.

Not sure if you've looked into it, but EventLogging's MySQL 'consumer' handles batching, retries, and duplicates when consuming from Kafka and writing into MySQL. It will work for any EventLogging schema. The main downside is you don't have control over the name of the tables (they will be named SchemaName_<REVISION_ID>), and any new schema revisions will result in new tables being created.

If you have to do any data munging, it shouldn't be too difficult to write EventLogging 'plugins' to transform data before it is handed to the MySQL 'consumer' process.

If this sounds good to you, I'd be happy to show you how to make it all work. The simplest case would be something like:

eventlogging-consumer 'kafka:///broker1,broker2?topic=MyTopicName&group_id=frack_consumer0' 'mysql://username:password@hostname:3306/database_name'

Are you already emitting these FR events? If so, they are likely going to be in the EventLogging MySQL database already :) (unless we've blacklisted them for volume, in which case they will only be in Hive).

The main downside is you don't have control over the name of the tables (they will be named SchemaName_<REVISION_ID>), and any new schema revisions will result in new tables being created.

Thanks much!!!! Hmmmm, I'm afraid as a first step, that won't work, since we need 100% compatibility with the current MySQL schema.

Are you already emitting these FR events? If so, they are likely going to be in the EventLogging MySQL database already :) (unless we've blacklisted them for volume, in which case they will only be in Hive).

Not yet, but emitting them is just a config change away. There will be a high volume--does that mean that you'll need to blacklist ingress into EventLogging MySQL?

Yes, please let us know the schema and timeline for this, thank youuuu

Yes, please let us know the schema and timeline for this, thank youuuu

The database schema is here. It's got quite a lot of janky cruft that's impossible to spot without a detailed review of the code and how it's used. Same goes for the Python scripts that read log files to ingress data into the database.

We don't have an exact timeline, but it needs to be fully tested, at scale, ahead of the large FR campaigns later this year.

Ah, when I say schema, I mean the name of the EventLogging schema, so that we can blacklist it from MySQL ingestion in the log database.

From T189820: Create job to deliver the eventlogging_CentralNoticeImpression topic:

Just curious though. Are you sure you want to just write this stuff to a log file? Are you just putting it there for a bit? The less hacky pipeline for yall would be kafka -> into MySQL directly.

So, just to summarize, it seems there are really two options:

A) Continue with a system as similar as possible to the present. So, continue to use Kafkatee to write log files, and slurp them up with the current Python scripts. This will involve switching to the new EventLogging Kafka topics and possibly adapting the config a bit. For testing, we'd run two Kafkatees in parallel.

B) Improve the ingress part of the pipeline by adapting the Python scripts to read directly from the Kafka topics.

Advantages to (A) are that we get more incremental change and stay firmly within scope. If I understand correctly from the other ticket, there may be complications for Operations? (@cwdent?) (Note that so far, a hard cutoff has not been considered... unless plans change, we'll be first ingressing into a parallel testing database, initially at a lower sample rate, to be sure the data is the same...)

(B) has the advantage of being less hacky... Specifically, I imagine, it'd be more efficient, and also, it's a more standard type of Kafka pipeline, so we'd get some upstream support. However, it's a bigger change, and has different implications regarding backfill ability. (Currently, the files are kept indefinitely. So if we need to backfill following an outage anywhere downstream of the files, we can do so, and it's easy, because we have the data in the same exact format that we normally get it in.)

I'd like to suggest that we stick with (A). (So, backtracking from my previous suggestion!!!) That keeps us firmly within scope. Changes like (B) are a good idea for the medium term... We could look at (B) or other similar improvements as part of a bigger roadmap. That way we can take time to scope out requirements and coordinate with other planned improvements in the Analytics infrastructure (see @Ottomata's comment about Kafka Connect here).

Thoughts?

Thanks much!!! :)

Sounds fine to me! And might make sense given the larger Modern Event Platform roadmap.

Re. backfilling: Kafka only keeps topics for 7 days (this can be expanded on a per-topic basis), but we also keep all this data in Hadoop for 90 days. If a post-7-day backfill were needed, the data could be obtained from Hadoop.

@AndyRussG nothing too bad for ops, some packaging and puppet, I am working on that now

OK! So we'll go with (A), that is, continue to use Kafkatee to write files, and slurp them with the python script.

For testing, then, we'll need the second Kafkatee instance and the test database.

After that's working, in the medium-term, we can look at improving the whole system for stability, and maybe also for near-realtime data like we now have for Druid, while checking carefully all requirements.

Thanks much!!!!

Ejegg closed this task as Resolved. · Jun 5 2018, 8:15 PM
Ejegg triaged this task as Normal priority.