New scripts to ingress data from Kafkatee into MySQL
Closed, Resolved
Public
4 Estimated Story Points

Assigned To
Authored By
AndyRussG
May 25 2018, 3:08 PM
Description

As per T192839, for sampled impression and landing page data we won't ingress data directly from the Kafka topic into the database, but rather will write files from the stream and will read those, as in the legacy system.

However, the format of the new files is pretty different from the old ones. Also, the legacy python scripts that processed data in the old format are pretty crufty. So, instead of writing new code to re-create the legacy format and feed it to the crufty legacy scripts, we'll re-do the legacy scripts to read the new format.
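
For illustration only, here is a minimal sketch of the general idea of reading events back from files written off the stream. It is not the FRUEC code. It assumes the files hold one JSON object per line, which this task does not confirm, and the function name `parse_event_lines` and the field names in the sample data are hypothetical.

```python
import io
import json
from typing import Iterable, Iterator


def parse_event_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON line; skip blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: bad lines are skipped. A real consumer would
            # more likely count or log them.
            continue
        if isinstance(event, dict):
            yield event


# Hypothetical sample standing in for a file written from the stream.
sample = io.StringIO(
    '{"event_type": "landingpage", "landingpage": "Example"}\n'
    "\n"
    "not json\n"
    '{"event_type": "centralnotice", "banner": "B1"}\n'
)
events = list(parse_event_lines(sample))
```

Because the parser accepts any iterable of lines, the same function works on an open file handle or on an in-memory sample like the one above.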

This should make the system more maintainable and stable, so it's definitely within scope for this switchover.

We may wish to make some minor changes in the database schema, but we should ensure that queries currently used will continue to work.

Thanks!!

Event Timeline

Change 457090 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Implement purge-incomplete for landingpage and output stats

https://gerrit.wikimedia.org/r/457090

Change 457091 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Rename package fr_user_event_consumer -> fruec

https://gerrit.wikimedia.org/r/457091

Change 457272 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Add event_type arg to log_file_mapper.get_lastest_time()

https://gerrit.wikimedia.org/r/457272

Change 457275 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] [WIP] Inline doc and comments

https://gerrit.wikimedia.org/r/457275

Change 457691 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Flag SQL template constants as private

https://gerrit.wikimedia.org/r/457691

Change 455189 merged by Ejegg:
[wikimedia/fundraising/FRUEC@master] Add LandingPage test data

https://gerrit.wikimedia.org/r/455189

@Ejegg I'm seeing a patch merged here. Is this task Pending Deployment?

Hi! That's just the first patch in the series of patches. That patch itself was pretty isolated, but I'd suggest review be performed on the last in the series. Thanks!!!!

Change 463110 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Remove testing field (for banner previews) from CNEvents

https://gerrit.wikimedia.org/r/463110

Change 516062 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Move object cache to separate submodule

https://gerrit.wikimedia.org/r/516062

Review should look at the current "HEAD" of the series of Gerrit patches: https://gerrit.wikimedia.org/r/#/c/wikimedia/fundraising/FRUEC/+/516062/

Thanks!!!

Also noting here locations of review comments:

Change 524101 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[wikimedia/fundraising/FRUEC@master] Fix typos in inline docs and comments

https://gerrit.wikimedia.org/r/524101

It looks like some of the tasks that have been +2'ed for code review also need to be +2 verified. I guess this is necessary since we don't have any CI running at all.

Thanks!!!

Change 455869 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Add landingpage event processing

https://gerrit.wikimedia.org/r/455869

Change 456434 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Truncate strings to DB column limit

https://gerrit.wikimedia.org/r/456434

Change 456664 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Refactor stats output

https://gerrit.wikimedia.org/r/456664

Change 456672 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Print a friendly message confirming config

https://gerrit.wikimedia.org/r/456672

Change 457090 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Implement purge-incomplete for landingpage and output stats

https://gerrit.wikimedia.org/r/457090

Change 457091 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Rename package fr_user_event_consumer -> fruec

https://gerrit.wikimedia.org/r/457091

Change 457272 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Add event_type arg to log_file_mapper.get_lastest_time()

https://gerrit.wikimedia.org/r/457272

Change 457691 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Flag SQL template constants as private

https://gerrit.wikimedia.org/r/457691

Change 457275 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Inline doc and comments

https://gerrit.wikimedia.org/r/457275

Change 463110 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Remove testing field (for banner previews) from CNEvents

https://gerrit.wikimedia.org/r/463110

Change 516062 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Move object cache to separate submodule

https://gerrit.wikimedia.org/r/516062

Change 524101 merged by Jgleeson:
[wikimedia/fundraising/FRUEC@master] Fix typos in inline docs and comments

https://gerrit.wikimedia.org/r/524101

All patches in the chain are now merged and ready for deployment!

I'm wondering, is this really pulling from the files produced by kafkatee, or do we get events directly from Kafka?

Ah, I see from the code! It looks great, seems to be parsing the kafkatee logfiles and writing to a nearly identical schema. That would answer the next question I had, which is whether random scripts like the WMDE banner impression export will continue to work under the new system. Thanks for doing this!

Ah, I see from the code! It looks great, seems to be parsing the kafkatee logfiles and writing to a nearly identical schema.

Thanks! Yeah, for now we decided to keep ingressing from files written by kafkatee. I think that an eventual switch to near-realtime direct consumption of Kafka streams won't be too much extra work... And in that case, we'll still have the log files as backup for backfill (for periods beyond the time that events are retained in Kafka).

That would answer the next question I had, which is whether random scripts like the WMDE banner impression export will continue to work under the new system. Thanks for doing this!

Yeah, that's the plan, in any case! BTW regarding the schema changes and backward compatibility, please see this draft specification. Any comments are most welcome, on the talk page or the related task, T196563. See also the proposed SQL for the schema change, in comments here: T196564.

Thanks again for digging in!!! :)

Done! Bwahahahahahah :)