
FRUEC: Detailed comparison of events in old and new log files for banner impression pipeline
Open, Needs Triage · Public · 2 Story Points

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 29 2019, 7:25 PM
AndyRussG renamed this task from FRUEC: Debug minor discrepancy in banner impression data between old and new pipelines to FRUEC: Detailed comparison of events in old and new log files for banner impression pipeline. · Wed, Nov 20, 3:58 PM
AndyRussG added a comment. (Edited) · Mon, Nov 25, 6:00 AM


Here are some results:

  • Over three samples, the new pipeline logs consistently contain fewer events than expected. The shortfall is between 12% and 15% (of events in the old logs) across the three samples. (Updated: previously reported as between 11% and 15%, but there was a small mistake in the calculations.)
  • To verify that this was not due to a problem with Kafkatee, I counted HTTP requests in Hive for the sample windows. On this point, things look mostly ok; for both pipelines, the number of events in the log files corresponds pretty closely to the number of events in the wmf.webrequest table for the three samples. Note that for both pipelines, the logs are only a 10% sample of actual beacons sent from clients.
  • I tried to match log entries one-to-one between the two pipelines; however, this didn't provide much useful data, since, again, the logs are only a sample of actual events. The number of entries that did match was approximately what one would expect in this situation.
  • I also checked EventLogging's error log for invalid events, to see whether a significant number of events were failing validation. That was not the case.
  • Next step: load actual web request logs from Hive for both pipelines, and match them one-to-one, to see if there's anything distinctive about entries in the old pipeline that don't have matching entries in the new pipeline.
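As a quick sanity check of the one-to-one matching attempt: a sketch of the expected match rates, assuming the two pipelines' 10% Kafkatee log samples are drawn independently (the numbers here are illustrative, not from the task):

```python
# Assuming the old and new pipelines' 10% server-side log samples are
# independent, a given client event lands in BOTH log files only rarely.
sample_rate = 0.10
p_in_both = sample_rate * sample_rate  # ~1% of client events

# Conditional on an event being in the old log, the chance it also appears
# in the new log is just the new log's sample rate. So among N old-log
# events we'd expect about N * 0.10 matches.
n_old_log_events = 10_000  # illustrative count
expected_matches = n_old_log_events * sample_rate
print(p_in_both, expected_matches)
```

This is why matching the sampled log files directly couldn't say much: roughly 90% of entries on either side are expected to be unmatched even when nothing is wrong.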
AndyRussG added a comment. (Edited) · Wed, Nov 27, 5:47 AM

Just did one-to-one merges using web request logs in Hive, in both directions, using fairly large samples in both cases.

The general result was that many old logs had no corresponding new log entry, and almost no new log entries were without a corresponding old log entry. That is:

  • About 11% of the old pipeline web requests did not have corresponding web requests for the new pipeline.
  • About 0.025% of the new pipeline web requests did not have corresponding web requests for the old pipeline.

For checking old logs, I looked at a campaign that had 100% client-side sample rate on the new pipeline, and 1% client-side sample rate on the old pipeline. In that setup, all old pipeline web requests should have had a corresponding new pipeline web request. Query: P9756.

For checking new logs, I did the opposite: looked at campaigns with 100% client-side sample rate on the old pipeline and 1% client-side sample rate on the new pipeline, such that all new pipeline requests should have had a corresponding old pipeline request. Query: P9757.
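The actual merges were done as Hive queries (P9756, P9757), but the shape of the orphan-finding step can be sketched as a pandas-style anti-join; the column name and values below are illustrative stand-ins for whatever property pairs events across pipelines:

```python
import pandas as pd

# Toy data: three old-pipeline events, only one of which has a matching
# new-pipeline event. `event_id` is a hypothetical pairing key.
old = pd.DataFrame({"event_id": ["a", "b", "c"]})
new = pd.DataFrame({"event_id": ["b"]})

# Left anti-join: keep old-pipeline events with no new-pipeline counterpart.
merged = old.merge(new, on="event_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
print(len(orphans))  # 2
```

With a 100% client-side sample rate on one pipeline, any orphan found this way points to an event genuinely missing from that pipeline, rather than to sampling noise.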

Differences found in orphaned old pipeline events:

  • 28% of orphaned events were GET requests, vs. 10% GET requests overall
  • 63% of orphaned events came from Windows clients, vs. 39% Windows clients overall

No significant differences for the following data points:

  • Average number of hits per IP
  • ISP
  • Average number of hits per referrer

Detailed stats and methodology

Windows for log samples:

(1) 20191105 21:00:00-22:59:59
(2) 20191110 11:00:00-18:59:59
(3) 20191120 17:00:00-17:59:59

As in T236835, I downloaded log files for a time window wider than that of the sample, then filtered for events with timestamps in the desired window. This is necessary because events don't always land in the log file that corresponds to the event's timestamp.
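A minimal sketch of that filtering step, assuming the log data is loaded into a dataframe with a parsed timestamp column (the column name `dt` and the rows are illustrative; the window is Sample 1's):

```python
import pandas as pd

# Load events covering a wider window than the sample, then keep only
# events whose own timestamp falls inside the sample window.
events = pd.DataFrame({
    "dt": pd.to_datetime([
        "2019-11-05 20:59:59",  # before the window
        "2019-11-05 21:30:00",  # inside the window
        "2019-11-05 23:00:00",  # after the window
    ])
})
start = pd.Timestamp("2019-11-05 21:00:00")
end = pd.Timestamp("2019-11-05 22:59:59")
in_window = events[(events["dt"] >= start) & (events["dt"] <= end)]
print(len(in_window))  # 1
```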

(A) Jupyter notebook for Sample 1:

(B) Jupyter notebook for verifying total events in log files vs. events in Hive, and checking invalid events:

(C) Jupyter notebook for orphaned old log events in Hive:

Cells in the notebooks are referenced by the number in square brackets to the left of each cell.

Loading and initial processing of log files

I copied files from civi1001, filtered on timestamp for the sample window, loaded the data into Pandas, created some useful columns, and dropped events that are clearly bots (as identified by is_bot or nocookies, in the new and old logs, respectively). See Jupyter notebook (A), cells 1-9.
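The bot-dropping step might look like the following sketch; the column semantics are assumptions for illustration (a boolean is_bot in the new logs, and nocookies == 1 marking likely bots in the old logs):

```python
import pandas as pd

# Toy dataframes standing in for the loaded new- and old-pipeline logs.
new_logs = pd.DataFrame({"event": [1, 2, 3], "is_bot": [False, True, False]})
old_logs = pd.DataFrame({"event": [4, 5], "nocookies": [1, 0]})

# Drop events flagged as clear bots in each pipeline's own way.
new_clean = new_logs[~new_logs["is_bot"]]
old_clean = old_logs[old_logs["nocookies"] == 0]
print(len(new_clean), len(old_clean))  # 2 1
```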

Checking for IPs with a large number of hits

We can sometimes identify bots or scripts by repeated hits from single IPs. Using this method, it seems they're not a huge issue in the CN pipeline.

Max number of hits from a single IP:

                               Sample 1   Sample 2   Sample 3
New max hits from a single IP        25          7          3
New total events                  34680      18183       6362
Old max hits from a single IP       105         51         55
Old total events                  91899      42685     184217

See Jupyter notebook (A) cells 10-12.
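The per-IP hit check above amounts to counting rows per IP and looking at the top of the distribution; a sketch with illustrative data (column name `ip` is an assumption):

```python
import pandas as pd

# Toy log data: repeated hits from a single IP can signal a bot or script.
logs = pd.DataFrame({"ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"]})

# Hits per IP, sorted descending; the max is the heaviest single IP.
hits_per_ip = logs["ip"].value_counts()
print(hits_per_ip.max())  # 3
```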

Divergence from expected number of events

Client side sample rates vary for both old and new pipelines. Sample rates can be changed from their default 1% on a per-campaign basis.

By looking at campaigns and sample rates for both pipelines, we can determine what the ratio of total new-to-old events should be, then see how far the actual ratio diverged from that expectation. Here's the procedure used:

  • Group old and new log dataframes by unique permutations of campaign name, old pipeline sample rate and new pipeline sample rate; count events for each permutation.
  • Merge the grouped dataframes.
  • For each permutation, calculate the expected ratio of new-to-old log events in the sample window, then calculate how much the actual ratio diverged from the expected ratio.
  • Remove permutations with fewer than 100 events in either pipeline, since small samples could skew the data.
  • Calculate overall average divergence of the actual ratio by multiplying the divergence for each permutation by the number of old pipeline events for that permutation, summing the results, then dividing by the total old pipeline events in the dataframe.
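The steps above can be sketched as follows; all numbers and column names are illustrative, not taken from the samples:

```python
import pandas as pd

# One row per (campaign, old sample rate, new sample rate) permutation.
perms = pd.DataFrame({
    "old_events": [1000, 500],
    "new_events": [880, 470],
    "old_rate":   [0.01, 0.01],  # client-side sample rates
    "new_rate":   [0.01, 0.01],
})

# Drop permutations with fewer than 100 events in either pipeline.
perms = perms[(perms["old_events"] >= 100) & (perms["new_events"] >= 100)]

# Expected vs. actual new-to-old ratio, and the divergence between them.
expected_ratio = perms["new_rate"] / perms["old_rate"]
actual_ratio = perms["new_events"] / perms["old_events"]
perms["divergence"] = (actual_ratio - expected_ratio) / expected_ratio

# Overall divergence, weighted by old-pipeline event counts.
overall = (perms["divergence"] * perms["old_events"]).sum() / perms["old_events"].sum()
print(round(overall, 4))
```

With these toy numbers the permutations diverge by -12% and -6%, giving a weighted overall divergence of -10%.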

By this method, over the three samples, we get 12%-15% fewer events than expected in the new logs. (This assumes the number of events in the old logs is as expected.)

Divergence of the number of events in the new log from the number expected:


See Jupyter notebook (A) cells 15-19.

Checking Kafkatee via webrequest Hive table

Log files for both new and old CN pipelines contain a 10% sample of actual events sent from clients. The full data is accessible in Hive in the webrequest table. By comparing the number of events in Hive to the number of events in the log files, we can check whether Kafkatee is dropping events when it writes the sampled log files.

The result is that the number of events in the Kafkatee-generated, sampled log files can differ slightly from what would be expected based on data in Hive. It seems somewhat more likely to be lower than expected, but not by a huge amount. As per discussions with @Jgreen, this could be due to network issues, especially at peak times.

So, in summary, Kafkatee is not an issue for new-vs-old pipeline discrepancies, but possible packet dropping causing a minor loss of event records is worth monitoring and investigating further. See T239564.

                         Log events   Events in Hive   Log divergence from expected
Sample 1, new pipeline        34680           347720                         -0.26%
Sample 1, old pipeline        91899           912390                          0.71%
Sample 2, new pipeline        18183           184572                         -1.49%
Sample 2, old pipeline        42685           429539                         -0.63%
Sample 3, new pipeline         6362            64339                         -1.12%
Sample 3, old pipeline       184217          1837888                          0.23%

See Jupyter notebook (B), up to cell 10.
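The divergence column in the table is just the sampled-log count compared against one tenth of the Hive count. A worked example using the Sample 1 new-pipeline figures:

```python
# Divergence of sampled-log event counts from the count expected from Hive,
# given the 10% Kafkatee sample. Numbers are Sample 1, new pipeline.
log_events = 34680
hive_events = 347720

expected = hive_events * 0.10          # 34772.0 events expected in the log
divergence = (log_events - expected) / expected
print(f"{divergence:.2%}")  # -0.26%
```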

Invalid events

Another possible issue with the new pipeline might have been event schema validation. Invalid events don't make it into the normal Kafka streams, so they wouldn't be picked up by the new pipeline. For Sample 1, only one invalid event appears in Hive for the CentralNoticeImpression schema. So, this is not the cause of missing events.

See Jupyter notebook (B), cell 12.

Pairing events to analyze missing new pipeline events

We can analyze missing events from either pipeline by matching pairs of events for a campaign where the client-side sample rate is 100% for the pipeline whose missing events we want to understand. Any event from the lower-sample-rate pipeline that lacks a match in the 100%-sample-rate pipeline carries information about an event missing from the latter.

As explained above, almost no events seem to be missing from the old pipeline.

To analyze missing new pipeline events, I had to go to Hive, since sampling in the log files meant most events in those files wouldn't have a match. (See Jupyter notebook (A), cells 21-24, for an initial attempt using just the log files.)

To get meaningful data, I had to do the merge query directly in Hive, since the query results wouldn't fit in memory as a Pandas dataframe. I used the randomcampaign property to pair events, for 2019-11-05 1:00:00 to 22:59:59, when a campaign with a 100% new-pipeline sample rate was active.

Results are described above: missing new log events (a.k.a. "orphaned" old log events) were a lot more likely to be from Windows clients and use GET requests.

See Jupyter notebook (C).
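The orphan-characterization step (comparing, say, the share of GET requests among orphans vs. overall) can be sketched like this; the toy data is built to reproduce the 28%-vs-10% GET figure, and the column name `http_method` is an assumption:

```python
import pandas as pd

# Toy data: all old-pipeline events vs. the orphaned subset.
all_events = pd.DataFrame({"http_method": ["GET"] * 10 + ["POST"] * 90})
orphans = pd.DataFrame({"http_method": ["GET"] * 28 + ["POST"] * 72})

# Share of GET requests in each population; a large gap suggests the
# missing events are not a uniform random subset.
overall_get = (all_events["http_method"] == "GET").mean()
orphan_get = (orphans["http_method"] == "GET").mean()
print(f"{orphan_get:.0%} orphaned GET vs. {overall_get:.0%} overall")  # 28% orphaned GET vs. 10% overall
```

The same comparison works for any categorical property (OS family, referrer, ISP), which is how the Windows skew above was found.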