
Verify kafkatee use for fundraising logs on erbium
Closed, Resolved · Public · 1 Story Point

Description

I'd like to turn off udp2log (webrequest)! Fundraising is one of the last holdouts using this.

I've set up a kafkatee instance on erbium that is now consuming from Kafka and outputting to files in /a/log/fundraising-kafkatee, in the exact same format that udp2log was. These files should be a drop-in replacement for the udp2log-generated ones. I'd like someone from Fundraising Tech to verify that these files will continue to work for them if we write them to the same file paths that fundraising currently uses.

@awight, I think Jeff Green mentioned that you should be the one to do this. Eh? :)

Details

Related Gerrit Patches:
wikimedia/fundraising/tools/DjangoBannerStats (master): Need debugging
wikimedia/fundraising/tools/DjangoBannerStats (master): Reinstate insane hardcoded config
wikimedia/fundraising/tools/DjangoBannerStats (master): Don't override log dir config

Event Timeline

(There are a very large number of changes, so older changes are hidden.)

Hi! I've checked this by comparing, from erbium:

/a/log/fundraising/logs/buffer/2015/bannerImpressions-sampled100.tsv-20150601-053001.log.gz
vs.
The same 15-minute block in /a/log/fundraising-kafkatee/bannerImpressions-sampled100.tsv.log

Results to date:

  • Our analytics job, DjangoBannerStats, seems to correctly slurp and analyze the data from kafkatee.
  • My only concern so far is a difference in the number of entries for that 15-minute period: the udp2log file has 480 entries, while kafkatee gave us 430 entries for the same period. Is the sampling rate the same?

My only concern so far is a difference in the number of entries for that 15-minute period: the udp2log file has 480 entries, while kafkatee gave us 430 entries for the same period. Is the sampling rate the same?

So, there are other reasons this could happen, like a kafkatee bug, but I think it can also be explained by the way sampling is done. Sampling is done on the full incoming stream, not on the filter. I.e. sampling is done before filtering. Every 100th http request is passed to udp-filter, which is filtering for Special:RecordImpression. It is highly unlikely that the 100th request that udp2log sees is the same 100th request that kafkatee sees.
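
To illustrate (a toy sketch in Python, not the actual udp2log/kafkatee code): two consumers that each keep every 100th request before filtering, but start at different points in the stream, end up with similar counts and almost entirely different rows.

# Toy illustration: sampling happens before filtering, so two consumers that
# start at different offsets in the same stream keep different subsets.
import random

def make_stream(n=100_000, hit_rate=0.01, seed=0):
    """Simulate n request URLs; roughly hit_rate of them are Special:RecordImpression."""
    rng = random.Random(seed)
    return [f"/wiki/Special:RecordImpression?id={i}" if rng.random() < hit_rate
            else f"/wiki/SomeOtherPage?id={i}"
            for i in range(n)]

def sample_then_filter(stream, rate=100, offset=0):
    """Keep every rate-th request starting at offset, then filter for
    Special:RecordImpression (sample first, filter second)."""
    return [url for i, url in enumerate(stream)
            if i % rate == offset and "Special:RecordImpression" in url]

stream = make_stream()
udp2log_like = sample_then_filter(stream, offset=0)
kafkatee_like = sample_then_filter(stream, offset=1)   # different starting point

print(len(udp2log_like), len(kafkatee_like))           # similar counts...
print(len(set(udp2log_like) & set(kafkatee_like)))     # ...but no shared rows here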

landingpages.tsv is not sampled, so every request to the 'wikimediafoundation.org,donate.wikimedia.org' domains should be there. Since this is unsampled, it should be easier to compare udp2log and kafkatee outputs. If you check the same 15-minute time period for landingpages.tsv, you will see that udp2log has 24988 entries and kafkatee has 25046 entries; kafkatee actually has 58 more. This is more like the behavior I am used to seeing when comparing the two, as udp2log is inherently lossy[2]. We don't claim 100% message delivery in Kafka (YET! because of some technical details in the way we are producing messages from remote datacenters), but it is usually better than udp2log.

[2] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=erbium%7Cgadolinium&mreg[]=%5Epacket_loss_average%24&z=large&gtype=line&title=%5Epacket_loss_average%24&aggregate=1&r=hour
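
For reference, here is roughly how such a spot check can be done (a sketch only; it assumes the timestamp is the third whitespace-separated field in ISO-8601 form, which may need adjusting for the actual log layout; the paths are the ones compared above, and the same check applies to landingpages.tsv):

# Count the lines in a log file whose timestamp falls inside one 15-minute window.
import gzip

def count_in_window(path, start, end, ts_field=2):
    opener = gzip.open if path.endswith(".gz") else open
    count = 0
    with opener(path, "rt", errors="replace") as f:
        for line in f:
            fields = line.split()
            if len(fields) > ts_field and start <= fields[ts_field] < end:
                count += 1
    return count

window = ("2015-06-01T05:15:00", "2015-06-01T05:30:00")
print(count_in_window(
    "/a/log/fundraising/logs/buffer/2015/bannerImpressions-sampled100.tsv-20150601-053001.log.gz",
    *window))
print(count_in_window(
    "/a/log/fundraising-kafkatee/bannerImpressions-sampled100.tsv.log", *window))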

Thanks, interesting!

Sampling is done on the full incoming stream, not on the filter. I.e. sampling is done before filtering. Every 100th http request is passed to udp-filter, which is filtering for Special:RecordImpression. It is highly unlikely that the 100th request that udp2log sees is the same 100th request that kafkatee sees.

OK...! I do understand these aren't the same samples. :) I wonder what difference in the number of requests is likely, given the differences in the pipelines. Here are comparisons for a few more blocks:

Period             udp2log entries   kafkatee entries
20150601-011501    480               535
20150601-013001    481               497
20150601-014501    478               497

...so the variance does go both ways.
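
As a back-of-the-envelope reference, here is how much two 1:100 samples of roughly the same traffic could differ from sampling noise alone (a toy binomial model that assumes each pipeline's selection behaves like an independent random sample; not a statement about the actual pipelines, which also differ in udp2log packet loss):

# If ~48,000 matching requests reach both pipelines in a 15-minute window and
# each independently keeps about 1 in 100, how far apart could the two counts
# drift from sampling noise alone?
import math

underlying = 48_000                                # rough guess implied by ~480 sampled entries
p = 1 / 100
expected = underlying * p                          # ~480 per pipeline
sd_one = math.sqrt(underlying * p * (1 - p))       # ~22 for a single pipeline's count
sd_diff = math.sqrt(2) * sd_one                    # ~31 for the difference between pipelines

print(f"expected ~{expected:.0f} per file; difference sd ~{sd_diff:.0f}")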

Maybe it'd be worth it to set up a second, temporary instance of DjangoBannerStats, just to compare the results over a period?

Also: how complex is the code that does this? If it's really simple and we're pretty certain nothing unexpected is happening, maybe the above step would be overkill?

Thanks again!!

In those cases, there are more requests in kafkatee than in udp2log, yes? That is generally expected, right?

In those cases, there are more requests in kafkatee than in udp2log, yes? That is generally expected, right?

Yes, that makes sense, then, given udp2log lossiness...

It sounds to me like we should try a parallel DB filled up by Kafkatee for a little while. Any objections?

Another quick question... would it be possible to get a more detailed description of the two pipelines (old and new) and the differences between them, if any? (Looking back on this thread, I think I may have misread some of your comments about that...) I'm just wondering if apparently innocuous differences in how sampling is done (for example, the order of sampling and filtering, or the universe that sampling is drawn from) might have consequences for how the stats should be processed. Thanks!!

(Any thoughts, @ellery?)

@ellery: Ping, we're waiting to see if you think we should care about the statistical differences between udp2log and kafkatee?

Restricted Application added a subscriber: Matanya. · View Herald Transcript · Jul 22 2015, 10:28 PM
awight claimed this task. Jul 22 2015, 10:41 PM
atgo triaged this task as Medium priority. Aug 12 2015, 11:01 PM

@ellery: Can you let us know what you think about the impact on statistics?

@Jgreen mentions that the new log pipeline will be updated in realtime, so we should reconsider our 15-minute job period.

Jgreen added a parent task: Restricted Task. Sep 16 2015, 5:18 PM
Jgreen added a comment (edited). Sep 16 2015, 5:24 PM

The new kafkatee-based banner log pipeline is up and running on americium.frack.eqiad.wmnet, which exports stored banner logs by NFS to barium, so you can use this for testing/comparison:

barium:/archive/udplogs -- udp2log pipeline, as always
barium:/archive/banner_logs -- new kafkatee-based pipeline, as it will be going forward

Also see Ottomata's comment above about stored offset vs tail; the americium collectors are configured to use stored offset.

I'll respond later with a more detailed report but here are some initial results based on some simple simulations (these may change once I dig deeper):

  1. For a typical testing scenario, the difference between the two methods is insignificant.
  2. Sampling at 1:100 is far too sparse for most use cases. The estimates of the true banner impression counts from this sampled data are too variable. We try to detect 10% changes in donation rates; if the estimates of banner impressions are only within 15% of the true number with 95% confidence, then this becomes a fool's errand. For reasonable testing scenarios, I'm seeing that we are within 10% of the true number with 95% confidence. We should be within only a few percent. Really, we should not be sampling at all. (A rough simulation of this point is sketched below.)
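
For reference, a minimal stand-in for the kind of simulation behind point 2 (not the linked notebook's analysis; the 50,000 impressions per test window is an illustrative assumption):

# How far off is an impression count reconstructed from 1:100 vs 1:10 sampling?
import numpy as np

def error_95(true_impressions, rate, trials=100_000, seed=0):
    """95th percentile of the relative error when a 1:rate sample is scaled back up."""
    rng = np.random.default_rng(seed)
    kept = rng.binomial(true_impressions, 1 / rate, size=trials)
    rel_err = np.abs(kept * rate - true_impressions) / true_impressions
    return np.quantile(rel_err, 0.95)

for rate in (100, 10):
    print(f"1:{rate}: 95% of estimates within ~{error_95(50_000, rate):.1%} of the truth")
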
Ottomata added a comment (edited). Sep 16 2015, 8:48 PM

Ellery, call your Senator and tell her you want a realtime streaming analytics cluster.

ellery added a comment (edited). Sep 18 2015, 7:25 PM

Is that really necessary? If so, this is an important use case you can use for your pitch :).

Anyway, here is a more detailed analysis:

https://github.com/ewulczyn/wmf/blob/master/fr/one_off_notebooks/SRI_Sampling.ipynb

Feel free to use the code to play around with the relevant parameters. It turns out that the two methods can differ significantly for some reasonable testing scenarios. What is the highest sampling rate we can afford?

What is the highest sampling rate we can afford?

Largely unknown. Load is pretty low at the current traffic level and sampling rates. How about we try changing it from 1:100 to 1:10 and see how that looks under full banner traffic?

Ok, this is switched for all americium collectors.

Hey @Jgreen checking to see if the impression numbers should be sampled differently in the test we ran today. & heads up @atgo

Hey @Jgreen checking to see if the impression numbers should be sampled differently in the test we ran today. & heads up @atgo

Yes--the logs collected with the new/testing pipeline have been sampled at 1:10 since midday 9/21.

@Jgreen do the banner impression numbers in the table pgheres.bannerimpressions correctly reflect this change?

@Jgreen do the banner impression numbers in the table pgheres.bannerimpressions correctly reflect this change?

No, just the collected/stored logs on americium. No changes have been made to the old pipeline-to-database setup.

@Jgreen Does that mean the data in pgheres.bannerimpressions is incorrectly scaled? If not, when will the new logs on americium feed pgheres.bannerimpressions?

@atgo Can you help figure out where this fits in the sprint plan?

hey @ellery - right now we've got it in Q2, which means we'll hopefully look at it before the new year. I have some concerns about rocking the boat too close to December. Any thoughts on that?

Jgreen added a comment (edited). Sep 24 2015, 10:09 PM

@Jgreen Does that mean the data in pgheres.bannerimpressions is incorrectly scaled?

I haven't had any involvement with log processing, only collection/storage. Just to be super clear, here is my understanding of what we have in play.

  1. erbium udp2log collector (1:1 & 1:100) --> netapp --> barium:/archive/udplogs/2015/*
  2. erbium "testing" kafkatee (1:1 & 1:100) --> erbium:/a/log/fundraising-kafkatee/*
  3. americium "the way of the future" kafkatee (1:1 & 1:10) --> barium:/archive/banner_logs/2015/*

#1 is what we've had historically and, as far as I know, what is being consumed into the pgehres database. Still collecting at 1:100 or 1:1, no changes have been made there in the timeframe of this project.

#2 is a prototype kafkatee instance set up by @Ottomata for testing/debugging. Logs collected this way aren't rotated; they just append to files on disk on erbium. I believe @AndyRussG did by-hand comparisons and log-processor compatibility testing using data collected by this stream.

#3 is the collector I just finished building on a dedicated machine in the fundraising cluster; the permanent home. This is the pipeline I switched from 1:100 to 1:10 earlier this week.

If not, when will the new logs on americium feed pgheres.bannerimpressions?

That switch can be thrown as soon as we're confident that the #3 pipeline above is behaving comparably to #1.

And then after the dust settles we shut off #1 and #2 and commence rejoicing and partying.

Everything Jgreen is saying is true ;) There's no effect on existing banner impression counts in the pgehres table. Furthermore, when we do make the change, the count column will still be correct--that column is already scaled by the sampling rate.

We'll give lots of notice before changing the numbers in the pgehres table.

Furthermore, when we do make the change, the count column will still be correct--that column is already scaled by the sampling rate.

How does that work? Is it detecting the scaling rate from the log file name (i.e. *-sampled100.tsv-*)? I was wondering this the day that I flipped the sampling rate, because I initially forgot to adjust the target file name.
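
If it is filename-based, I'd expect something roughly like this (an illustrative guess in Python, not DjangoBannerStats' actual code):

# Pull the 1:N rate out of the "-sampledN" part of the file name and scale raw counts by it.
import re

def sample_rate_from_filename(path, default=1):
    match = re.search(r"-sampled(\d+)", path)
    return int(match.group(1)) if match else default

rate = sample_rate_from_filename(
    "/a/log/fundraising/logs/buffer/2015/bannerImpressions-sampled100.tsv-20150601-053001.log.gz")
raw_count = 480               # e.g. matching lines for one banner in one file
print(raw_count * rate)       # what would land in the scaled `count` column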

Hey guys - what are the next steps for getting the tighter sampling into pgheres? Inquiring minds (@MeganHernandez_WMF) want to know. This seems like a bit of a secondary inquiry to this task, so if we need to spin out a new one just lmk.

Ping - what do we need to do to get 1:10 into pgheres?

Ping - what do we need to do to get 1:10 into pgheres?

We are waiting on @ellery to let us know that the new collection is looking statistically sane. After that I think it's a trivial configuration change to switch to the new pipeline.

One comment on the change from 1:100 to 1:10--this means we're collecting something like 10X the amount of log data per day, and if we decide we need to stay at this sampling rate we will run out of long term storage space much faster than planned. I think we're fine for the upcoming year, but something to keep in mind.

1:10 is already much better. Pgheres has a campaign field, which is crucial. Are you able to get the campaign name associated with a banner impression with the new scheme? If not, there needs to be a table mapping banners to campaigns.

1:10 is already much better. Pgheres has a campaign field, which is crucial. Are you able to get the campaign name associated with a banner impression with the new scheme? If not, there needs to be a table mapping banners to campaigns.

The log format hasn't changed, so if we were able to get this with the old scheme the new scheme should be no different.

@ellery: just noting that there's still an open question for you, which is blocking us. I don't understand the details well enough to even paraphrase, but I'll try--the old logs sampled at a predictable interval (i % 100 = 0), and the new logs are actually randomly sampled. I think. Does this mess with your analysis? Also, we seem to be getting a lot more loglines; we're not sure why, but *maybe* it's because we're no longer experiencing UDP packet loss. Anything you want us to do about the discontinuity when we switch over?

@awight Thanks for double checking, I understand the difference. See this notebook for an analysis of the difference:

https://github.com/ewulczyn/wmf/blob/master/fr/one_off_notebooks/SRI_Sampling.ipynb

And yes, the numbers from UDP clearly underestimate the number of impressions (as compared to the unsampled logs). Compare the impression numbers in cells 5 and 6 in this notebook for an example:

https://github.com/ewulczyn/wmf/blob/master/fr/dashboard/python/dashboard_instances/B1516_0826_en6C.ipynb

@awight Thanks for double checking, I understand the difference. See this notebook for an analysis of the difference:
https://github.com/ewulczyn/wmf/blob/master/fr/one_off_notebooks/SRI_Sampling.ipynb
And yes, the numbers from UDP clearly underestimate the number of impressions (as compared to the unsampled logs). Compare the impression numbers in cells 5 and 6 in this notebook for an example:
https://github.com/ewulczyn/wmf/blob/master/fr/dashboard/python/dashboard_instances/B1516_0826_en6C.ipynb

@ellery & @awight this sounds optimistic. Does anything need to be done about the impact of shifting from interval to random sampling, or can we declare the new pipeline OK for immediate consumption?

Is this task being tracked on two tickets? Anyway, you can make the change as far as I'm concerned, as long as the campaign names are included in the schema as before (unless there is some other way to map banners to campaigns in one of the other tables). Also, make sure to multiply the counts by 10, or whatever rate ends up being used.

Confirmed that the campaign is intact. All the pipeline does is store URLs in a file; the banner impression loader job is what's responsible for parsing these URLs and importing them into the database.

Also, for the record we are now talking about beaconImpressions files, not bannerImpressions. E.g. /archive/banner_logs/2015/beaconImpressions-sampled10.tsv-20151020-184501.log.gz
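
For anyone following along, the loader's job is conceptually something like this (a sketch only; the query parameter names and the example URL are illustrative, not a verified schema):

# Pull banner/campaign/country out of a logged impression URL.
from urllib.parse import urlparse, parse_qs

def parse_impression(url):
    qs = parse_qs(urlparse(url).query)
    return {key: qs.get(key, [None])[0] for key in ("banner", "campaign", "country")}

example = ("https://meta.wikimedia.org/beacon/impression"
           "?banner=B1516_0826_en6C&campaign=ExampleCampaign&country=US")
print(parse_impression(example))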

Confirmed that the campaign is intact. All the pipeline does is store URLs in a file; the banner impression loader job is what's responsible for parsing these URLs and importing them into the database.

I'm still stuck on whether this relates to "the old logs sampled at a predictable interval i % 100 = 0, and the new logs are actually randomly sampled." Is there something about the change to random sampling that affects how the banner impression loader is able to line up campaigns? If not, and all concerns are accounted for, let's fire the missiles and slay the old pipeline!

Special:BannerLoader currently does not include the campaign as a query parameter. Special:RecordImpression did include this. The campaign either needs to be added to Special:BannerLoader, or there needs to be a table maintained by FR tech that maps banners to campaigns.

Special:BannerLoader currently does not include the campaign as a query parameter. Special:RecordImpression did include this. The campaign either needs to be added to Special:BannerLoader, or there needs to be a table maintained by FR tech that maps banners to campaigns.

I don't understand, AFAIK we aren't collecting Special:BannerLoader. Here's what the new kafkatee pipeline collects:

output pipe 10 /usr/bin/udp-filter -F ' ' -p Special:RecordImpression >> /srv/kafkatee/bannerImpressions-sampled10.tsv.log
output pipe 10 /usr/bin/udp-filter -F ' ' -p Special:BannerRandom >> /srv/kafkatee/bannerRequests-sampled10.tsv.log
output pipe 10 /usr/bin/udp-filter -F ' ' -p beacon/impression >> /srv/kafkatee/beaconImpressions-sampled10.tsv.log
output pipe 1 /usr/bin/udp-filter -F ' ' -d wikimediafoundation.org,donate.wikimedia.org >> /srv/kafkatee/landingpages.tsv.log

The prototype one mentioned above collected the same, just at a 1:100 sampling rate instead of 1:10.

And the old udp2log pipeline:

pipe 100 /usr/bin/udp-filter -F '\t' -p beacon/impression >> /a/log/fundraising/logs/beaconImpressions-sampled100.tsv.log
pipe 100 /usr/bin/udp-filter -F '\t' -p Special:RecordImpression >> /a/log/fundraising/logs/bannerImpressions-sampled100.tsv.log
pipe 100 /usr/bin/udp-filter -F '\t' -p Special:BannerRandom >> /a/log/fundraising/logs/bannerRequests-sampled100.tsv.log

All the same stuff right?

Hmm. I thought Special:RecordImpression was retired and replaced by Special:BannerLoader. Andy or Adam will know the details.

@ellery, sadly we haven't yet been able to ditch Special:RecordImpression for Special:BannerLoader. Still waiting on a few other changes for that.

@Jgreen, most of that looks right. I didn't know about the landingpages.tsv.log bit--is that processed somewhere? Also, Special:BannerRandom is no longer used, so we can just get rid of that.

Will the procedure then just be to pipe the new logs to the existing pgheres.bannerimpressions table instead? Will there be any disruption during the switchover? I imagine the plan isn't to set up a parallel pgheres database?

@Jgreen, most of that looks right. I didn't know about the landingpages.tsv.log bit--is that processed somewhere? Also, Special:BannerRandom is no longer used, so we can just get rid of that.

Great, I'll remove Special:BannerRandom now to save us a bit of system load and space. I don't know about landingpages.tsv.log either; it's being collected by udp2log, so we included it.

We need to get this cut over ASAP, as it is blocking Tech Ops in several important ways.

We do want landingpages.tsv.log; it's processed, and statistics about the landing page impressions are stored in the database. Maybe check whether @ellery and @MeganHernandez_WMF find these numbers useful?

I have never used data on landing page impressions (I only use banner impressions and clicks).

We have used landing page impressions in the past and I think we do use this for email. We should summarize this thread for Caitlin & Peter to double check.

On a process note, I missed this thread. Email volume is very high right now and it's only getting busier. Can you please use a clear subject line if you need me to look at something? This was not clear to me: [Maniphest] [Changed Subscribers] T97676: Verify kafkatee use for fundraising logs on erbium

Thank you!

awight added a subscriber: Pcoombe. Nov 5 2015, 12:11 AM

I think we're prepared to make this change now. The sample rate is parsed out of the filenames, so that part of the switchover should be transparent.

The one catch here is that the import job is not idempotent, and we don't want to either miss or double-count anything. Maybe we can doctor the last old files and the first new files by hand, so that they splice nearly perfectly? There'll be a tiny glitch at the splice point. @Pcoombe, is it better to take campaigns down for this, or just to notify everyone that clickthrough rates could be affected for maybe a 1-hour window and keep the campaigns up?

@Jgreen: does the splicing sound reasonable to you? There's also the question of rolling back if this doesn't work... we might have to recover using a database replica, and don't want to lose data from the old pipeline in this case.
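
The splice I have in mind is conceptually something like this (a rough sketch; the timestamp field position, cutover value, and file names are placeholders to check against the real log format before doing anything):

# Take lines before the cutover from the last udp2log file and lines at/after
# the cutover from the first kafkatee file, so nothing is missed or imported twice.
CUTOVER = "2015-11-12T00:00:00"   # placeholder cutover timestamp
TS_FIELD = 2                      # assumed position of the timestamp in each log line

def splice(old_path, new_path, out_path):
    with open(out_path, "w") as out:
        with open(old_path) as old:
            out.writelines(l for l in old
                           if len(l.split()) > TS_FIELD and l.split()[TS_FIELD] < CUTOVER)
        with open(new_path) as new:
            out.writelines(l for l in new
                           if len(l.split()) > TS_FIELD and l.split()[TS_FIELD] >= CUTOVER)

splice("last_udp2log_file.tsv", "first_kafkatee_file.tsv", "spliced_for_import.tsv")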

Jgreen added a comment. Nov 5 2015, 2:59 PM

Maybe we can doctor the last old files and the first new files by hand, so that they splice nearly perfectly?
@Jgreen: does the splicing sound reasonable to you? There's also the question of rolling back if this doesn't work... we might have to recover using a database replica, and don't want to lose data from the old pipeline in this case.

Sure, seems pretty straightforward. Let's choose a cutover maintenance window and we'll make it happen.

@awight Sounds like it would be safest to just take campaigns down if it's only for a short window. Please let me know when you plan to do this. Excited to see this happening!

Also let's keep the landingpages info if we can.

Jgreen added a comment. Nov 6 2015, 6:02 PM

@awight Sounds like it would be safest to just take campaigns down if it's only for a short window. Please let me know when you plan to do this. Excited to see this happening!

At this point it's just a matter of timing, and we need to get it done. We're blocking the Tech Ops high-priority goal of encrypting PII in cross-datacenter transit.

Also let's keep the landingpages info if we can.

Not a problem.

Change 252761 had a related patch set uploaded (by Awight):
Don't override log dir config

https://gerrit.wikimedia.org/r/252761

Change 252761 merged by Awight:
Don't override log dir config

https://gerrit.wikimedia.org/r/252761

Change 252862 had a related patch set uploaded (by Awight):
Reinstate insane hardcoded config

https://gerrit.wikimedia.org/r/252862

Change 252862 merged by Awight:
Reinstate insane hardcoded config

https://gerrit.wikimedia.org/r/252862

Change 252875 had a related patch set uploaded (by Awight):
Need debugging

https://gerrit.wikimedia.org/r/252875

Change 252875 merged by Cdentinger:
Need debugging

https://gerrit.wikimedia.org/r/252875

Jgreen closed this task as Resolved. Nov 13 2015, 12:33 AM

(07:14:09 PM) awight: Jeff_Green: so the new data is good, and is correctly multiplied by 10 and not 100.
(07:15:06 PM) awight: and it looks like everything is in good shape to run without a babysitter.

mmodell removed a subscriber: awight. Jun 22 2017, 9:38 PM