
Provide aggregated banner impression counts
Closed, ResolvedPublic

Description

In order to import impression data for our fundraising banners into our own database, we need it to be publicly available. This ticket is meant to resolve T243092 (and T255446).

Banner impression data is regularly aggregated in banner_activity_minutely, but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

Acceptance Criteria

  • Banner impression data is collected and aggregated in time spans of 15 minutes for all banners that
    • match the regular expression B\d{2}_WMDE
    • send a beacon to a uri_host matching (en|de)\.(m\.)?wikipedia.org
  • There is one file for each time span.
  • The file is named banner_impressions_YYYYMMDD_hhmm.csv, using the beginning of the time span.
  • The file contains comma-separated values:
    • Banner name
    • Impression count (extrapolated)
  • The files are published in analytics.wikitech.org
  • Files older than 30 days are deleted regularly.
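
For illustration only, here is a minimal HiveQL sketch of what one such 15-minute aggregation could look like (hypothetical span starting 2021-10-20 16:00 UTC); the extraction of the banner name from the beacon's query string and the handling of the extrapolation factor are assumptions here, not part of these criteria. Its two output columns (banner name, impression count) would be written to banner_impressions_20211020_1600.csv:

-- illustrative sketch only, not the production query
select regexp_extract(uri_query, '[?&]banner=([^&]+)', 1) as banner,
       count(*) as impressions  -- any extrapolation factor would be applied here or downstream (assumption)
from wmf.webrequest
where uri_path = "/beacon/impression"
  and uri_host rlike "^(en|de)\\.(m\\.)?wikipedia\\.org$"
  and regexp_extract(uri_query, '[?&]banner=([^&]+)', 1) rlike "B\\d{2}_WMDE"
  and year = 2021 and month = 10 and day = 20 and hour = 16
  and dt >= "2021-10-20T16:00:00Z" and dt < "2021-10-20T16:15:00Z"
group by regexp_extract(uri_query, '[?&]banner=([^&]+)', 1);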

Event Timeline

@kai.nissen @WMDE-leszek

Q. My first assumption is that all data on the WMDE banner impressions will be found in the wmf.webrequest table where uri_path = "/beacon/impression". Please correct me if I am wrong.

Next, @kai.nissen once informed me - in the context of the New Editors campaigns - that all WMDE banner names contain the string WMDE; the ticket's description says so as well.
However, today I find banner=wle_2021_de from uri_host = "de.wikipedia.org".
Q. Are we simply going to count anything found with uri_path = "/beacon/impression" and uri_host = "de.wikipedia.org" (including mobile), or are we going to look for specific banners only? If the latter holds, then how do we recognize the specific banners that we are interested in?

Banner impression data is regularly aggregated in banner_activity_minutely, but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

I was not able to find the banner_activity_minutely table in Hive: it is not in the wmf database, nor in event. Where does it live?

... but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

Well, there is probably nothing that can be done about it. Namely:

The max(dt) (i.e. the latest timestamp), queried at approximately 09:42 UTC, was 2021-05-25T08:59:59Z:

select max(dt) from wmf.webrequest where uri_path = "/beacon/impression" and uri_host = "de.wikipedia.org" and year = 2021 and month = 5 and day = 25;

which means that wmf.webrequest has a delay of approximately 40 minutes (which we already knew) - because of pre-processing, I guess.
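
(Just for illustration - this is not part of any pipeline - the current lag in minutes could be computed directly, e.g.:

select round((unix_timestamp() - unix_timestamp(max(dt), "yyyy-MM-dd'T'HH:mm:ss'Z'")) / 60) as lag_minutes from wmf.webrequest where uri_path = "/beacon/impression" and uri_host = "de.wikipedia.org" and year = 2021 and month = 5 and day = 25;

i.e. the difference between the wall-clock time at query time and the latest dt found in the partition.)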

I thought that maybe the wmf_raw.webrequest table can help:

select max(dt) from wmf_raw.webrequest where uri_host like "%de.wikipedia%" and uri_path = "/beacon/impression" and year = 2021 and month = 5 and day = 25;

but

_c0
2021-05-25T08:59:59Z

(run on Tue 25 May 2021 at approximately 09:49:39 AM UTC).

  • No improvement. So it is Kafka processing that takes time, and we only get to observe hourly data intakes with a delay of approximately 40 minutes. Please take into account that the 40 minutes I am mentioning is an empirical result; I have no idea whether it can sometimes also be an hour, two hours, or ten minutes.

Thus, I would say that you need to get in touch with Analytics on the following two requests:

... but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

and

For all banners containing "WMDE" in their name, the banner impression data is collected and aggregated in time spans of 15 minutes.

From what I can see, a more reasonable demand would be to go for hourly aggregates with approximately one hour of delay.
But if that does not work for you: the WMF Data Engineers are the only ones who might be able to tell you what to do.

Thanks for looking into this, @GoranSMilovanovic!

Q. My first assumption is that all data on the WMDE banner impressions will be found in the wmf.webrequest table where uri_path = "/beacon/impression". Please correct me if I am wrong.

Next, @kai.nissen once informed me - in the context of the New Editors campaigns - that all WMDE banner names contain the string WMDE; the ticket's description says so as well.
However, today I find banner=wle_2021_de from uri_host = "de.wikipedia.org".

This banner is part of the Wiki Loves Earth campaign and does not need to be taken into account. Banners used in WMDE's New Editors campaigns are always prefixed with WMDE. Fundraising banners are prefixed with a string matching the regular expression B\d{2}_WMDE. Now that you mention it: we only need data for fundraising banners. I'll update the task description accordingly. By the way, we also need data from de.m.wikipedia.org, en.wikipedia.org and en.m.wikipedia.org.

Q. Are we simply going to count anything found with uri_path = "/beacon/impression" and uri_host = "de.wikipedia.org" (including mobile), or are we going to look for specific banners only? If the latter holds, then how do we recognize the specific banners that we are interested in?

Are we able to filter based on the payload passed to the beacon? It should contain the banner name.
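
(The banner name is indeed part of the beacon's query string - cf. banner=wle_2021_de above - so, as an illustrative sketch rather than the final implementation, it could be pulled out of uri_query roughly like this:

select distinct regexp_extract(uri_query, '[?&]banner=([^&]+)', 1) as banner from wmf.webrequest where uri_path = "/beacon/impression" and uri_host = "de.wikipedia.org" and year = 2021 and month = 5 and day = 25 and hour = 8;

which lists the distinct banner names seen in one hourly partition; the same expression could then be filtered against B\d{2}_WMDE.)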

Banner impression data is regularly aggregated in banner_activity_minutely, but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

I was not able to find the banner_activity_minutely table in Hive: it is not in the wmf database, nor in event. Where does it live?

It's in druid. At least, that's what Superset tells me.

... but with a delay of ~2 hours, which does not meet the requirement of WMDE's team Kampagnen und Fundraising.

... it is Kafka processing that takes time, and we only get to observe hourly data intakes with a delay of approximately 40 minutes. ...

From what I can see, a more reasonable demand would be to go for hourly aggregates with approximately one hour of delay.
But if that does not work for you: the WMF Data Engineers are the only ones who might be able to tell you what to do.

A delay of 40 minutes may even be fine for the fundraising team. But as you said, this might also be more, especially during the fundraising campaign. I assume the data you queried is all from Hive? T255446: Investigate showing realtime the eventlogging banner stream (currently sampled at 1%) suggests that the data is taken directly from Kafka.

@kai.nissen

As for T255446, which

"... suggests that the data is taken directly from Kafka."

essentially everything we have is taken from Kafka at some point (see the webrequest docs). T255446 - if I understand it correctly - is meant to make Kafka feed Druid directly with banner impression data. That would allow us to observe banner impression data well before the processing and sanitization pipelines - which the data must pass through before entering the webrequest table (the table we use to pick up banner impressions for our campaigns) - have completed.

Now, the question, as I see it, is:

  • do you want me to develop a reporting pipeline as described in this ticket, relying on Hive/webrequest, with all of its constraints (give or take a ~40-minute delay), or
  • do we want to wait for T255446 to be implemented and then work with Druid?

Please let me know. Thanks.

kai.nissen renamed this task from Provide aggregated banner impression counts in real-time to Provide aggregated banner impression counts. Aug 1 2021, 4:35 PM

Alright, we can drop the "real-time" requirement and use the data as soon as it has been ingested from Kafka.

The files will be picked up regularly by a cron job running on WMDE's fundraising infrastructure. It relies on the file name convention mentioned in the task description ("The file is named banner_impressions_YYYYMMDD_hhmm.csv, using the beginning of the time span."), so the file name is crucial for picking up the data.

@kai.nissen Thank you, Kai, but where are your banner_impressions_YYYYMMDD_hhmm.csv files found? Is it some path on the Analytics Client servers? Or elsewhere?

The ticket description explains:

The files are published in analytics.wikitech.org

but no path is provided.

Also, I am not sure I understand exactly what I need to provide:

  • if the banner_impressions_YYYYMMDD_hhmm.csv files are already being produced somewhere,
  • then what is needed beyond the Acceptance Criteria, which (at least in my understanding) are already satisfied by the mere production of those files?

Please clarify. Thank you!

Edit. Ok, maybe from this:

Alright, we can drop the "real-time" requirement and use the data as soon as it has been ingested from Kafka.

I take it that I need to work with wmf.webrequest to produce the banner_impressions_YYYYMMDD_hhmm.csv files? But it sounds like the files are already being produced:

The files will be picked up regularly by a cron job running on WMDE's fundraising infrastructure.

@kai.nissen Edit: OK, I got it - the WMDE fundraising infrastructure will be picking up the files from analytics.wikitech.org regularly, and we choose to go with wmf.webrequest to produce the files. I am on it. Sorry for my misunderstanding!

@GoranSMilovanovic None of what is described in this task is done, yet. The outcome of this task should be that the files are created and published in a folder on analytics.wikitech.org.

Change 730052 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T283015

https://gerrit.wikimedia.org/r/730052

Change 730052 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T283015

https://gerrit.wikimedia.org/r/730052

@kai.nissen

Tested successfully, running in production from the stat1008 Analytics Client, publishing to the public directory https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/WMDE_Banners

Here are the details and my comments:

  • The https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/WMDE_Banners directory is not regularly updated: TRUE. We would need to talk about this with the WMF Analytics team; this is not the first time I have seen this happen. The system updates regularly, hour by hour, from stat1008: the files are copied to /srv/published/datasets/wmde-analytics-engineering/WMDE_Banners/, from where they should be synced to the public directory every 15 minutes (I think it is 15 minutes), but more often than not I cannot see the public directory being updated (again, similar things happen in other public directories under https://analytics.wikimedia.org/). There is nothing that we (WMDE) can do about that.
  • The update script runs on the following schedule:
    • every hour, at the 5th minute of the hour, from my crontab,
    • it sends a HiveQL query to wmf.webrequest with the parameters exactly as described in the ticket description,
    • and obtains the data beginning at this_hour - 2h, e.g.: if the script starts at 18:05:00, it will obtain the data from 16:00:00 to 16:59:59 (see the example query after this list).
  • Given that it takes some time for our Analytics systems to ingest data from Kafka into wmf.webrequest, I think this is the safest possible approach: it (a) assures data completeness and (b) offers the best possible trade-off between data recency and completeness.
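
For a concrete (hypothetical) example of one such run: if the script fires at 2021-10-20 18:05 UTC, it targets the hour = 16 partition, and a single hourly query can still be bucketed into the four 15-minute spans required by the Acceptance Criteria. This is only a sketch under those assumptions, not the actual production code (which is in the Gerrit change above):

select floor(cast(substr(dt, 15, 2) as int) / 15) as quarter,  -- 0..3, i.e. the 15-minute span within the hour
       regexp_extract(uri_query, '[?&]banner=([^&]+)', 1) as banner,
       count(*) as impressions
from wmf.webrequest
where uri_path = "/beacon/impression"
  and uri_host rlike "^(en|de)\\.(m\\.)?wikipedia\\.org$"
  and regexp_extract(uri_query, '[?&]banner=([^&]+)', 1) rlike "B\\d{2}_WMDE"
  and year = 2021 and month = 10 and day = 20 and hour = 16
group by floor(cast(substr(dt, 15, 2) as int) / 15),
         regexp_extract(uri_query, '[?&]banner=([^&]+)', 1);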

@kai.nissen The situation with T293112 seems to have improved; take a look at the public dir.

If the updates continue to show up in the public directory during the day, I might close T293112.

@kai.nissen @Tobi_WMDE_SW

Do we need anything else here?

N.B. There will be a reboot of the stat1008 Analytics Client where we run this today. That will result in an unavoidable loss of a small amount of data.

@GoranSMilovanovic Thanks for this. I had a look at the published files and as far as I can tell they look fine. We will start to retrieve the data.

N.B. There will be a reboot of the stat1008 Analytics Client where we run this today. That will result in an unavoidable loss of a small amount of data.

That (and how you described the workflow in your previous comment) makes me think that whenever the cron job fails for some reason or just does not run, we have a gap in the data.

@kai.nissen

That (and how you described the workflow in your previous comment) makes me think that whenever the cron job fails for some reason or just does not run, we have a gap in the data.

At some point, I could introduce a module to check the completeness of the data and re-run the ETL for any missing pieces, perhaps?

kai.nissen removed GoranSMilovanovic as the assignee of this task.
kai.nissen added a subscriber: GoranSMilovanovic.