
Return to real time banner impressions in Druid
Closed, ResolvedPublic

Description

Prior to May, we had near real-time banner impression data brought into Druid that could be accessed via Druid/Turnilo.

During the update to Druid 0.11, we were informed that we would lose real-time banner impressions for a few weeks as you moved from Tranquility to the Kafka Indexing Service.

Since then, I've not heard anything about when this might return.

Event Timeline

We're considering a new version of these jobs for Q3

fdans triaged this task as Medium priority. Sep 6 2018, 4:45 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
elukey added a project: User-Elukey.
elukey added a subscriber: JAllemandou.

@Jseddon hi! I should be able to work on this next quarter; @JAllemandou had a prototype working fine a couple of months ago, so it should only be a matter of productionizing it. Any particular deadline on your end, or is it (as I imagine) just having it before Q4?

@AndyRussG Hi! I'd like to ask you a couple of questions before starting to work on this task: are eventlogging_CentralNoticeImpression and eventlogging_LandingPageImpression good authoritative sources for banner impression data, instead of filtering webrequest logs?

Hi @elukey! Thanks!!! Eventually eventlogging_CentralNoticeImpression will replace webrequest logs as the main data source on CentralNotice activity. It's not there yet: the new pipeline (which includes more than just EventLogging) still has to be finished and tested at scale. Here's a task for the Druid switchover: T186048: Adapt ingress of CN data into Druid to EventLogging-based impression recording. The event is currently switched on globally at a 1% sample rate. That's the rate it will likely stay at for most community campaigns; for Fundraising, it should eventually get to 100%.

eventlogging_LandingPageImpression is not related to CentralNotice or banners; it's for another user entry point for donations (mostly from e-mails).

All right, so IIUC for the moment I should just use webrequest :)

Hi @AndyRussG! During our offsite we discussed this use case, and we decided that we'll support ingesting the eventlogging_CentralNoticeImpression Kafka topic into Druid, but we'd prefer not to build any more Spark jobs to "filter" the banner impressions out of the webrequest stream. Would that be acceptable for your team? I know that the event is not completely ready and is sampled for now, but it might be ready for your peak season?

Hi!!! Many apologies for the delay here...

I think it makes sense to build the realtime data consumer based on the EventLogging stream. The only drawback would be that initially, until the new pipeline is ready, it would all be at a 1% sample rate. (Again, when the new pipeline is switched on, that stream would get the same sample rates as currently available via beacon/impression and webrequest, that is, 100% on Fundraising campaigns and 1% on almost everything else.)

Quick question: if you take that approach, how long would it be before another (existing) job backfills with the full data from beacon/impression? Hours? Days?

Thanks much and apologies again for the delay! :)

@elukey so just to confirm, it's fine to go ahead and use the eventlogging_CentralNoticeImpression stream for this. Thanks so much again and apologies again for the delays in replying!!!

P.S. Also, pls lmk if you have any questions about the new stream... :)

@AndyRussG sorry for the lag, but I had to clarify some details with Joseph :)

So first of all, we'd need to upgrade Druid to 0.12.3 before proceeding, to have a robust Kafka Indexing Service (T206839). We should be able to do it relatively soon.

When this is done, we'd prefer to keep the webrequest and eventlogging data streams separate for various reasons (schema discrepancies, inconsistencies, etc.). So we propose the following:

  • we keep the daily webrequest-based banner impression indexing job, which will remain available in Turnilo as it is now.
  • we create a new job for eventlogging_CentralNoticeImpression, composed of two parts: a "realtime" indexation that pulls data as it comes through from Kafka (see the sketch below), and another one that runs hourly/daily on the same data. We'll be able to backfill only the data that we have on HDFS for eventlogging_CentralNoticeImpression, but as @joal noticed, the size of each month's data varies a lot, so we were wondering about the consistency of the data over the past months.

Let me know!
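
For context, a quick way to peek at the source stream would be something like this (a sketch, assuming kafkacat is available on a host that can reach the jumbo brokers):

# Print the last 5 messages from the CentralNoticeImpression topic, then exit
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_CentralNoticeImpression -o -5 -e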

the size of each month's data varies a lot, so we were wondering about the consistency of the data over the past months.

Thanks!!!!!! That sounds fine. Client-side data was activated and de-activated in various ways at various times, and was only activated globally at 1% on August 14th. Since, eventually, only Fundraising campaigns will be at 100%, we can continue to expect a significant amount of variation.

(Please lmk if you see anything that doesn't seem consistent with this... Thx again!!!)

Milimetric raised the priority of this task from Medium to High. Oct 18 2018, 5:25 PM
Milimetric added a project: Analytics-Kanban.

Druid upgraded on druid100[1-3], so we can finally start looking into this :)

Quick update: we are working on the batch (daily/hourly) ingestion workflow for the eventlogging data; there are still a couple of things that we need to fix before it's working. After that we'll try to add the "real-time" ingestion part :)

@AndyRussG @Jseddon Hi! So I have something to show to you in: https://turnilo.wikimedia.org/#event_centralnoticeimpression

We have a tool called Eventlogging2Druid now that automagically imports data from Eventlogging/Hive and indexes it into Druid. We indexed only a couple of days (Oct 27th/28th) in Druid as an example, but you can play with the data and see if it fits your needs. A couple of notes:

  • the tool currently ingests all dimensions as strings, including numeric fields like the sample rate;
  • the "normalized request count" metric (event counts weighted by 1/sample rate) is not included.

While the first point is relatively easy to fix in the tool, the latter is a bit more cumbersome due to some limitations of Turnilo/Druid. We calculated that value before using a longer and less automated process in the Analytics Refinery (via Oozie) that we'd prefer not to use anymore, sticking instead with the new "standard", Eventlogging2Druid. This of course assumes you are not heavily dependent on the normalized request count metric; otherwise let us know.

If you like the new datasource in Turnilo, this will be the schedule:

  • We are going to fix T208589 and add a regular job for hourly/daily indexation of event_centralnoticeimpression, backfilling data where possible. This should be ready in a couple of weeks.
  • In the meantime, we'll experiment with the "Realtime" indexation from Kafka events as they are ingested, to provide a more granular view of banner data (like the minutely segments that we were providing before).

The important part is to agree as soon as possible on a schema for the data shown in Druid/Turnilo, so we can rely on it; otherwise making changes as we go might become a problem.

Let me know!

Cool!!! Thanks so much for working on this! :) Nice that there's now something automagic for this...

Could you point me to the configuration used for this test and the source code for the new tool?

There are a few changes that would be needed regarding which event properties are included. Specifically, we'd need to add country and region (as sent on the event) as well as impressionEventSampleRate. Also, recordImpressionSampleRate is not needed (it's just the sample rate used for logging on the old system).

While geocoded data might be nice someday, the country and region data points sent on the event are more important, since they show the criteria used by CentralNotice for banner and campaign selection (based on the Geo cookie on the client).

The normalized count metric is important, since it tells us (estimated) actual event counts for all campaigns that are not running at a 100% sample rate. Currently all campaigns are at 1%; in the future, Fundraising campaigns will be at 100%, and most others will stay at 1%. So, the normalized count is the only way to compare actual events across campaigns.

We could add an event property with the value calculated per event--that is, the 1/sample rate. I imagine that float value could be put directly in Druid, which would then just have to sum all the values in the results of each query. Can Druid/Turnilo do that? That way, perhaps we could still have the metric with the new tool?
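
To make the normalization arithmetic concrete (illustrative numbers only):

# weight per event = 1 / sample_rate
# at a 1% sample rate: weight = 1 / 0.01 = 100
# 350 sampled events => estimated actual impressions = 350 * 100 = 35,000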

Thanks so much and many apologies once again for the delay!!!!

Could you point me to the configuration used for this test and the source code for the new tool?

The scala code is in the Analytics Refinery; I'll let the author (@mforns) follow up properly in this task :)

There are a few changes that would be needed regarding which event properties are included. Specifically, we'd need to add country and region (as sent on the event) as well as impressionEventSampleRate. Also, recordImpressionSampleRate is not needed (it's just the sample rate used for logging on the old system).

While geocoded data might be nice someday, the country and region data points sent on the event are more important, since they show the criteria used by CentralNotice for banner and campaign selection (based on the Geo cookie on the client).

I have updated the sample data in turnilo with the above! :)

The normalized count metric is important, since it tells us (estimated) actual event counts for all campaigns that are not running at a 100% sample rate. Currently all campaigns are at 1%; in the future, Fundraising campaigns will be at 100%, and most others will stay at 1%. So, the normalized count is the only way to compare actual events across campaigns.

We could add an event property with the value calculated per event--that is, the 1/sample rate. I imagine that float value could be put directly in Druid, which would then just have to sum all the values in the results of each query. Can Druid/Turnilo do that? That way, perhaps we could still have the metric with the new tool?

There are currently two main limitations that we faced:

  1. the EL2Druid tool considers all the dimensions as strings by default, even though Druid is able to handle int/float/double dimensions. There was a fix by Marcel in https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/472013, but after a bit of testing we realized that it wouldn't really have worked (for the reason below).
  2. Turnilo offers very nice expression constructs for shaping dimensions/measures (http://plywood.imply.io/expressions), but the gotcha is that it doesn't compute anything by itself; instead it offloads everything to Druid (understandably so, since non-trivial aggregations should not be done in JS by a UI tool). Druid handles these by accepting JavaScript and executing it on the data, but this feature is disabled by default since it is considered insecure for production: the code doesn't run in a sandbox and has full privileges, so it's a potential security hole (the opt-in is sketched below).
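
For reference, this is roughly the opt-in involved, assuming Druid's standard common.runtime.properties layout; it is shown only to illustrate why we leave it off:

# common.runtime.properties on each Druid service
# JavaScript-based aggregators/extractions are disabled by default for security;
# enabling them is a cluster-wide opt-in.
druid.javascript.enabled=true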

The solution that you proposed might work, though: if 1/sample-rate were available in the event itself, we could ingest it into Druid as a measure (and hence set it correctly as float/double) and possibly plot it in Turnilo without too many problems. @mforns what do you think?

Super nice low-tech solution :) I like that!
@elukey: Let's test first that we can parse and aggregate double precision as a metric (I don't see why not, but better to test first). We can probably use the current sample rate for that, even if the resulting metric makes no sense.

Confirmed that it works! I used recordImpressionEventSampleRate as the measure, and everything works like a charm (caveat: the datasource in Turnilo needs to be set with no introspection).

@elukey

I think you can set introspection: autofill-dimensions-only; then at least you don't need to configure the dimensions, only the measures (which are few).
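
For example, the relevant Turnilo data cube configuration might look roughly like this (a YAML sketch; the introspection values come from Turnilo's docs, while the measure names and formulas are assumptions based on the metrics discussed here):

dataCubes:
  - name: event_centralnoticeimpression
    clusterName: druid
    # Let Turnilo discover dimensions, but define the few measures by hand
    introspection: autofill-dimensions-only
    measures:
      - name: event_count
        formula: $main.sum($event_count)
      - name: event_normalized_count
        formula: $main.sum($event_normalized_count)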

@AndyRussG
The generic EventLoggingToDruid code is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/EventLoggingToDruid.scala
And the per-data-set jobs are defined in puppet, here: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/druid_load.pp

@AndyRussG Would it be possible to add the new 1/sample-rate field to the schema?

@AndyRussG
We discussed this in our daily meeting, and decided to modify our codebase to adapt to your needs, so that we can ingest 1/sampleRate as a new Druid measure.
So, you would not need to add anything to the schema. See: T210099

List of dimensions that we are going to grab from the Eventlogging event:

event_campaign,event_banner,event_project,event_uselang,event_bucket,event_anonymous,event_statusCode,event_device,event_country,event_region,event_impressionEventSampleRate
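
As a quick sanity check on those fields, something like this Hive query could work (the event database/table name and partition columns are assumptions, based on how refined EventLogging data is usually laid out in the warehouse):

-- Peek at a few refined CentralNoticeImpression events
SELECT
  event.campaign,
  event.banner,
  event.country,
  event.impressionEventSampleRate
FROM event.centralnoticeimpression
WHERE year = 2018 AND month = 11 AND day = 1
LIMIT 10;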

I have launched a realtime job indexing values flowing in Kafka. Data can be seen here (note the event normalized count metric :)):

https://turnilo.wikimedia.org/#test_kafka_event_centralnoticeimpression

@AndyRussG, @Seddon could you please let us know if the data looks OK, and whether some fields are unneeded, for instance?
Many thanks.

For reference, here is the request sent to druid for realtime ingestion:

curl -L -X POST -H 'Content-Type: application/json' -d '{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "test_kafka_event_centralnoticeimpression",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "flattenSpec": {
          "useFieldDiscovery": false,
          "fields": [
            "dt",
            { "type": "path", "name": "event_anonymous", "expr": "$.event.anonymous" },
            { "type": "path", "name": "event_banner", "expr": "$.event.banner" },
            { "type": "path", "name": "event_bannerCategory", "expr": "$.event.bannerCategory" },
            { "type": "path", "name": "event_bucket", "expr": "$.event.bucket" },
            { "type": "path", "name": "event_campaign", "expr": "$.event.campaign" },
            { "type": "path", "name": "event_campaignCategory", "expr": "$.event.campaignCategory" },
            { "type": "path", "name": "event_campaignCategoryUsesLegacy", "expr": "$.event.campaignCategoryUsesLegacy" },
            { "type": "path", "name": "event_country", "expr": "$.event.country" },
            { "type": "path", "name": "event_db", "expr": "$.event.db" },
            { "type": "path", "name": "event_device", "expr": "$.event.device" },
            { "type": "path", "name": "event_impressionEventSampleRate", "expr": "$.event.impressionEventSampleRate" },
            { "type": "path", "name": "event_project", "expr": "$.event.project" },
            { "type": "path", "name": "event_recordImpressionSampleRate", "expr": "$.event.recordImpressionSampleRate" },
            { "type": "path", "name": "event_region", "expr": "$.event.region" },
            { "type": "path", "name": "event_result", "expr": "$.event.result" },
            { "type": "path", "name": "event_status", "expr": "$.event.status" },
            { "type": "path", "name": "event_statusCode", "expr": "$.event.statusCode" },
            { "type": "path", "name": "event_uselang", "expr": "$.event.uselang" },
            "recvFrom",
            { "type": "path", "name": "ua_browser_family", "expr": "$.userAgent.browser_family" },
            { "type": "path", "name": "ua_browser_major", "expr": "$.userAgent.browser_major" },
            { "type": "path", "name": "ua_device_family", "expr": "$.userAgent.device_family" },
            { "type": "path", "name": "ua_is_bot", "expr": "$.userAgent.is_bot" },
            { "type": "path", "name": "ua_is_mediawiki", "expr": "$.userAgent.is_mediawiki" },
            { "type": "path", "name": "ua_os_family", "expr": "$.userAgent.os_family" },
            { "type": "path", "name": "ua_os_major", "expr": "$.userAgent.os_major" },
            { "type": "path", "name": "ua_wmf_app_version", "expr": "$.userAgent.wmf_app_version" },
            "webHost",
            "wiki"
          ]
        },
        "timestampSpec": {
          "column": "dt",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "event_anonymous",
            "event_banner",
            "event_bannerCategory",
            "event_bucket",
            "event_campaign",
            "event_campaignCategory",
            "event_campaignCategoryUsesLegacy",
            "event_country",
            "event_db",
            "event_device",
            "event_impressionEventSampleRate",
            "event_project",
            "event_recordImpressionSampleRate",
            "event_region",
            "event_result",
            "event_status",
            "event_statusCode",
            "event_uselang",
            "recvFrom",
            "ua_browser_family",
            "ua_browser_major",
            "ua_device_family",
            "ua_is_bot",
            "ua_is_mediawiki",
            "ua_os_family",
            "ua_os_major",
            "ua_wmf_app_version",
            "webHost",
            "wiki"
          ]
        }
      }
    },
    "transformSpec": {
      "transforms": [
        {
          "type": "expression",
          "name": "event_inverseRecordImpressionSampleRate",
          "expression": "1 / event_recordImpressionSampleRate" }
      ]
    },
    "metricsSpec": [
      {
        "name": "event_count",
        "type": "count"
      },
      {
        "name": "event_normalized_count",
        "type": "doubleSum",
        "fieldName": "event_inverseRecordImpressionSampleRate"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "SECOND"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "eventlogging_CentralNoticeImpression",
    "consumerProperties": {
      "bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092"
    },
    "taskCount": 1,
    "replicas": 3,
    "taskDuration": "PT10M"
  }
}' http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor
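
Once created, the supervisor's state can be checked via the standard supervisor status endpoint (same overlord host as the POST above):

# Check the status of the realtime supervisor
curl -L druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor/test_kafka_event_centralnoticeimpression/status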

Hey, just started following this ticket. BTW, I really like the idea of including the sample rate in the event. The Better Use of Data working group has been talking about doing this as a general convention for all analytics events. For Modern Event Platform analytics events (Q4 maybe?) we'd like to add the sampling settings for each event in a field set by a TBD convention, e.g. meta.sampling_settings: { rate: 0.001, key: random } or something like that.

DStrine lowered the priority of this task from High to Medium. Dec 18 2018, 8:47 PM

This has not been mission critical for us. Fr-tech will get back to this in the new year.

I see, @DStrine and @AndyRussG; next time around let's please make it clear on the ticket that this is not critical/important. We assumed it was, per the ticket's description, but it sounds like that was not the case. Perhaps that was our misunderstanding; let's just make sure to be on the same page going forward.

@JAllemandou can we keep the code to initiate the real time ingestion in the refinery repo? I think we just need a bit of documentation pointing to it. Once we document and commit the code, I think this ticket can be closed. cc @elukey

Per our conversation in standup, we are going to kill the job that imports from Kafka directly and use EventLogging-to-Druid instead; there were some useful learnings, but it does not look like we need both jobs.

Change 480956 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add druid-kafka task example in banner_activity

https://gerrit.wikimedia.org/r/480956

Druid-kafka-supervisor task and how-to added to refinery in oozie/banner_activity/druid folder (https://gerrit.wikimedia.org/r/480956).

Code merged; once @JAllemandou kills the job, we can close this ticket.

Change 480956 merged by Joal:
[analytics/refinery@master] Add direct kafka-to-druid ingestion example

https://gerrit.wikimedia.org/r/480956

Job killed from druid1001.eqiad.wmnet using:

# Get supervisor ID
curl -L druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor
# ["test_kafka_event_centralnoticeimpression"]

# Shut down the supervisor using the shutdown command (the newer API defines a terminate command, but it did not work)
curl -L -X POST -H "Content-Type: application/json" druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor/test_kafka_event_centralnoticeimpression/shutdown
# {"id":"test_kafka_event_centralnoticeimpression"}

# Check supervisor ID is gone
curl -L druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor
# []

Closing ticket, as it did not seem FR was using this data; the data source in Turnilo is still present but will not be updated.

Dear @Nuria, @JAllemandou, @elukey, @mforns,

Thank you so much for all your work on this. It is hugely appreciated. Many apologies if there was a misunderstanding about priorities, and for the lack of timely review. I think part of what happened might be that the priority on this task was set before some urgent work (including security-related tasks) came up, but no one got around to communicating this to you. Apologies again.

Also, just to note, data about banners is important to more than just WMF Fundraising. In fact, CentralNotice displays more community banners than fundraising ones. @Jseddon works a lot with community campaigns and chapter fundraisers, and I think (please correct me if I'm wrong) he uses data from Turnilo to check the health of those campaigns (as do the FR banner teams).

Closing ticket, as it did not seem FR was using this data; the data source in Turnilo is still present but will not be updated.

It's true that FR is not yet using data from the CentralNoticeImpression event, relying instead on the old beacon/impression calls. However, it is important that we make the switch, and work will continue in that direction. So, all the work done on this task is incredibly helpful!!!! I think the real-time job was almost there. Below are some comments (finally).

For reference, here is the request sent to druid for realtime ingestion

Looks fantastic! It'll be incredibly useful to get that UA data there. I don't know what sort of load Druid is experiencing... If there's a need to improve performance, some dimensions could be removed. Specifically, event_campaignCategoryUsesLegacy (can be easily determined from other data already in the event), event_result (legacy field that can also be derived from other data), and event_recordImpressionSampleRate (just the sample rate for the old call to beacon/impression) could all go. (We put them in the event just in case they're needed for debugging, but we can always get them via Hive.)

Finally, there's a small mistake in the calculation for the normalized count. It should use event_impressionEventSampleRate, which is the sample rate for these events, rather than event_recordImpressionSampleRate (usually not the same value). (Really nice that you were able to include that calculation in the pipeline, btw.)
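
In other words, the fixed transform would presumably look like this (a sketch derived from the spec above; the event_normalized_count doubleSum would then reference the new transform name as its fieldName):

    "transformSpec": {
      "transforms": [
        {
          "type": "expression",
          "name": "event_inverseImpressionEventSampleRate",
          "expression": "1 / event_impressionEventSampleRate"
        }
      ]
    }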

I don't think it's for me to re-open this task, but it would be really amazing if the last point could be fixed and the job turned back on. :)

Thanks so much and apologies once again!!!!! :)

@AndyRussG I think it will be worth opening a new ticket explaining what data you need and what it is used for. This ticket was for a real time job that consumed data from the Kafka eventlogging_CentralNoticeImpression topic. It seems that you do not really need real time data and that you could use the regular eventlogging ingestion, which is delayed a couple of hours from real time. Either way, it will be helpful to open a ticket that explains in detail what is needed by your team. We have not heard from @Jseddon since the ticket was created, so while it seems the data might be useful in theory, it does not seem to be used.

I think @Jseddon would still like this but the holiday work and time off have been a factor. I know a bit of his schedule. He might not be available to talk here until later next week.

Either way, it will be helpful to open a ticket that explains in detail what is needed by your team

OK, that makes a lot of sense. Hopefully, we can take some time to hash out details about where to go with CentralNotice data and prioritize based on specific needs...

Thanks so much once again, all!!!