
Understanding first day: write analytics queries in advance
Closed, Resolved (Public)


In the spirit of "test driven development", it is valuable to write Hive queries in advance of finalizing and deploying our EventLogging schemas. This will ensure that we'll be collecting all the data we need from the multiple schemas we intend to use.

Event Timeline

Now that @kostajh, @nettrom_WMF, and I have mostly defined the new schema and the existing schemas we intend to use, this task is ready to be worked on.

In preparation for this task, I looked into whether EventLogging data for our EditorJourney schema would be available in the Data Lake while being tested (i.e. on betalabs). My understanding is that the raw data will be available as it goes through Kafka; see the relevant part of the EL (EventLogging) documentation.

However, I cannot find evidence that the data is flowing in. Maybe that is related to the schema data showing up in the client-side event logging, as mentioned in T205759#4704855?

It's probably time for one of us to meet with someone from Release-Engineering-Team about 1) why all data is in client-side-events instead of all-events, and 2) why we're not seeing this data in the Data Lake.

@MMiller_WMF : I see that I maybe shouldn't phrase loose thoughts as questions when there's a dozen other people subscribed to a task, sorry! It was mainly intended as a mental note that these two tasks appear to be connected, so I should keep an eye on the progress made over there since it looked like they're working on it. But, I could also make sure everyone knows about all the symptoms, so thanks for tagging @kostajh!

@kostajh : The data won't be readily available in the Data Lake during testing; instead, the raw data (from Kafka) is stored in Hadoop. The EL doc I referred to shows how to turn that into a queryable table. However, there's no indication of data in /mnt/hdfs/wmf/data/raw/eventlogging. I'd expect there to be an eventlogging_EditorJourney directory there with the raw data in it, and that directory doesn't exist.
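For anyone who wants to repeat that check, here is a minimal sketch, not a definitive tool. It assumes HDFS is mounted at the path quoted above and that raw directories follow the `eventlogging_<SchemaName>` naming pattern mentioned in the comment. The helper names `raw_schema_dir` and `has_raw_data` are hypothetical, made up for illustration.

```python
from pathlib import Path

# Path quoted in the comment above; assumes HDFS is mounted locally there.
RAW_ROOT = Path("/mnt/hdfs/wmf/data/raw/eventlogging")


def raw_schema_dir(schema: str, root: Path = RAW_ROOT) -> Path:
    """Return the directory where raw data for a schema is expected.

    Assumes the 'eventlogging_<SchemaName>' naming pattern.
    """
    return root / f"eventlogging_{schema}"


def has_raw_data(schema: str, root: Path = RAW_ROOT) -> bool:
    """True if the schema's raw directory exists and holds at least one file."""
    schema_dir = raw_schema_dir(schema, root)
    if not schema_dir.is_dir():
        return False
    # Walk the tree lazily; any() stops at the first file found.
    return any(p.is_file() for p in schema_dir.rglob("*"))


if __name__ == "__main__":
    schema = "EditorJourney"
    status = "found" if has_raw_data(schema) else "missing or empty"
    print(f"{raw_schema_dir(schema)}: {status}")
```

Passing a different `root` makes it easy to try the check against a scratch directory before pointing it at the real mount.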

I wrote many of the queries during testing, but also found that queries don't necessarily translate easily between the MariaDB testing environment we have for EventLogging and the Data Lake, where the production data ends up. Something to note for future projects, while we wait for a fully functional staging environment.