
FRUEC: Debug large discrepancy in data in initial test run.
Closed, ResolvedPublic4 Estimated Story Points

Description

A test run of FRUEC using only data on 2019-09-20 showed a large discrepancy for Landing Page events and Banner Impressions between legacy and testing databases:

MariaDB [(none)]> select count(*) from pgehres.landingpageimpression_raw where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00';                                      
+----------+
| count(*) |
+----------+
|     2401 |
+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> select count(*) from fruec.landingpageimpression_raw where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00';                                        
+----------+
| count(*) |
+----------+
|     1076 |
+----------+
1 row in set (0.01 sec)
MariaDB [(none)]> select sum(count) from pgehres.bannerimpressions where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00';
+------------+
| sum(count) |
+------------+
|     720530 |
+------------+
1 row in set (0.03 sec)

MariaDB [(none)]> select sum(count) from fruec.bannerimpressions where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00';
+------------+
| sum(count) |
+------------+
|     255530 |
+------------+
1 row in set (0.02 sec)

Both systems ingressed the same number of files for the period compared in the above queries:

MariaDB [(none)]> select count(*) from pgehres.squidlog where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:15:01';                                                                                                       
+----------+
| count(*) |
+----------+
|       18 |
+----------+
1 row in set (0.34 sec)

MariaDB [(none)]> select count(*) from fruec.files where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:15:01';
+----------+
| count(*) |
+----------+
|       18 |
+----------+
1 row in set (0.00 sec)

Event Timeline

There's a wide difference in the number of events in the files being consumed:

andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/beaconImpressions-sampled10.tsv.20190920-200001.log.gz | wc -l
13183
andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/centralnotice-impressions.20190920-200001.json.sampled10.log.gz | wc -l
3841
andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/landingpages.tsv.20190920-200001.log.gz  | wc -l
2531
andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/landingpage-impressions.20190920-200001.json.log.gz | wc -l
159

For CentralNotice impressions, this is to be expected, since the client-side sample rate is lower in the new pipeline for most campaigns, and FRUEC should compensate for this. However, there shouldn't be this difference for LandingPage impressions.

Just verified on production that $wgFundraiserLandingPageELSampleRate is 1, and that LandingPage events are being sent correctly, at least from Chrome.

Also a large difference in log file contents for landing pages for a more recent date:

andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/landingpages.tsv.20191010-200001.log.gz  | wc -l
9814
andyrussg@civi1001:~$ zcat /srv/archive/banner_logs/2019/landingpage-impressions.20191010-200001.json.log.gz | wc -l
847

Working on the assumption that the cause or causes of the discrepancy could be different for LandingPage events than for CentralNotice ones: it seems that the discrepancy in CentralNotice event counts can be explained by the fact that neither FRUEC nor the legacy script takes into account the client-side sample rate.

(So, the above statement, "the client-side sample rate is lower in the new pipeline for most campaigns, and FRUEC should compensate for this", is wrong.)

Evidence that this is at least one cause of the discrepancy for CentralNotice events is that impression counts ingressed into the test database are wildly off for Fundraising campaigns, but seem pretty close for community campaigns; counts are just a bit lower for those campaigns in fruec. For community campaigns, client-side sample rates are 1% for both pipelines.

And for Fundraising campaigns, counts in the fruec database are just a bit less than 1% of the counts in pgehres. That also makes sense, since for those campaigns, the old pipeline has a 100% sample rate and the new one, 1%.

To make comparing the data easier, let's start taking into account client-side sample rate in FRUEC. Here is a task: T235461.
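As a rough sketch of what that compensation could look like (the two-column input format below is invented purely for illustration; real FRUEC events carry the rate inside their JSON payload), each event would be weighted by the inverse of its client-side sample rate:

```shell
# Toy input: one line per event, as "count sample_rate" pairs.
# This layout is hypothetical; it only illustrates the weighting idea.
printf '1 0.01\n1 0.01\n1 1\n' > /tmp/events.txt

# Weight each event by 1/sample_rate to estimate the true impression count:
# two 1%-sampled events count as ~100 impressions each, plus one unsampled.
awk '{ s += $1 / $2 } END { print s }' /tmp/events.txt
# → 201
```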

The fact that the counts are a bit lower than you'd expect for both types of campaign, given the client-side sample rate settings, indicates that there may be other issues causing less drastic differences in counts between the two pipelines. But let's fix this issue first.

(I still don't have an explanation for the difference in LandingPage counts.)

Looking at LandingPage counts, it seems there is either a significant difference in the contents of the log files for the two pipelines, or in how the scripts filter events, or both.

MariaDB [(none)]> select count(*) as count, landingpage from pgehres.landingpageimpression_raw where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00' GROUP BY landingpage ORDER BY count DESC;
+-------+--------------------------------------------+
| count | landingpage                                |
+-------+--------------------------------------------+     
|  2169 | default~default~default~default~default    |                                   
|   225 | default~default~JimmyQuote~default~default |
|     2 | default~default~FaceBook~default~default   |
|     2 | default~default~default~default~variable1  |
|     2 | default~default~Twitter~default~default    |
|     1 | WMFJA1                                     |
+-------+--------------------------------------------+
6 rows in set (0.02 sec)

MariaDB [(none)]> select count(*) as count, landingpage from fruec.landingpageimpression_raw where timestamp >= '2019-09-20 20:00:00' and timestamp < '2019-09-20 22:00:00' GROUP BY landingpage ORDER BY count DESC;
+-------+--------------------------------------------+                                   
| count | landingpage                                |                                   
+-------+--------------------------------------------+                                   
|   468 | default~default~default~default~default    |                                   
|   193 | default~default~JimmyQuote~default~default |
|   102 | Thank_You/ja                               |
|    79 | Thank_You/fr                               |                                                                                                                                        
|    77 | Thank_You                                  |
|    34 | Ways_to_Give                               |
|    20 | FAQ                                        |
|    20 | Ways_to_Give/ja                            |
|    10 | Tax_deductibility/ja                       |
|    10 | Template:ThankYouEGMT                      |
|     9 | Matching_Gifts                             |
|     8 | Cancel_or_change_recurring_giving/ja       |
|     8 | Problems_donating/ja                       |
|     7 | Ways_to_Give/fr                            |
|     7 | Thank_You/he                               |
|     5 | Thank_You/it                               |
|     5 | Problems_donating                          |
|     3 | Tax_deductibility                          |
|     3 | Thank_You/es                               |
|     2 | Ways_to_Give/sv                            |
|     1 | Ways_to_Give/de                            |
|     1 | Ways_to_Give/es                            |
|     1 | Problems_donating/fr                       |
|     1 | Ways_to_Give/it                            |
|     1 | Ways_to_Give/pt-br                         |
|     1 | Tax_deductibility/fr                       |
+-------+--------------------------------------------+
26 rows in set (0.01 sec)

AndyRussG set the point value for this task to 4. (Oct 23 2019, 4:45 PM)

Made some progress on figuring out what the difference is between old and new logs. It looks like there are a lot of duplicate entries in the old log files.

I downloaded and unzipped old and new log files for the same 3-hour period (2019-10-20 from 19:00 to 21:59). Then, I filtered to examine entries with timestamps from only the middle hour of that period.

Counting only log entries with direct hits on Special:LandingPage, there is a huge difference in the counts. However, if you count unique IP addresses, they're almost the same!

$ cat old/*.log | awk '$9 ~ /^http:\/\/donate.wikimedia.org\/w\/index.php\?title=Special:LandingPage/' | awk '$3 ~/^2019-10-20T20/' | wc -l
15639

$ cat old/*.log | awk '$9 ~ /^http:\/\/donate.wikimedia.org\/w\/index.php\?title=Special:LandingPage/' | awk '$3 ~/^2019-10-20T20/' | awk '{print $5}' |  sort | uniq | wc -l
733

$ cat new/*.log | rg '"dt": "2019-10-20T20' | wc -l
1108

$ cat new/*.log | rg '"dt": "2019-10-20T20' | jq '.ip' | sort | uniq | wc -l
789

So actually it seems the problem is not duplicate entries in the old logs, but rather a few IP addresses hammering the site with some sort of script that doesn't run client-side code, and so doesn't generate any client-side events.
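One quick way to surface those hammering IPs (a sketch over a toy log; it assumes the same whitespace-separated layout as above, with the IP in field 5) is to rank IPs by request count:

```shell
# Toy log in the same column layout as the old squid-style logs (field 5 = IP).
printf 'a b t d 10.0.0.1\na b t d 10.0.0.1\na b t d 10.0.0.1\na b t d 10.0.0.2\n' > /tmp/lp.log

# Rank IPs by number of requests, heaviest first; on the real data this would
# run on old/*.log with the same URL and timestamp filters as below.
awk '{ print $5 }' /tmp/lp.log | sort | uniq -c | sort -rn | head
```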

Here we can see that most of the lines in the old log for hits on Special:LandingPage are from IP addresses with more than 100 requests in the hour-long window:

$ cat old/*.log | awk '{if ( ( $9 ~ /title=Special:LandingPage/ || $9 ~ /wiki\/Special:LandingPage/ ) && $3 ~/^2019-10-20T20/ ) print}' | wc -l
15782

$ cat old/*.log | awk '{if ( ( $9 ~ /title=Special:LandingPage/ || $9 ~ /wiki\/Special:LandingPage/ ) && $3 ~/^2019-10-20T20/ ) print $5}' | sort | uniq -c | awk '$1 > 100' | awk '{s+=$1}END{print s}'
14504

(The first awk filters the URL for the two wiki URL formats that would go to Special:LandingPage, and for the correct timestamp, and outputs only the IP address. We then sort the IP addresses and count uniques, then filter for the counts greater than 100, then sum those.)
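The same count-and-threshold logic can also be collapsed into a single awk pass (a sketch over a toy log; the real invocation would keep the URL and timestamp filters above, and the >100 threshold):

```shell
# Toy log, field 5 = IP, as in the real old-pipeline logs.
printf 'a b t d 10.0.0.1\na b t d 10.0.0.1\na b t d 10.0.0.2\n' > /tmp/one_pass.log

# Count hits per IP in an associative array, then sum the hits from IPs above
# a threshold (>1 here, standing in for the >100 used on the real data).
awk '{ hits[$5]++ }
     END { for (ip in hits) if (hits[ip] > 1) s += hits[ip]; print s + 0 }' /tmp/one_pass.log
# → 2
```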

In the old log, hits on Special:LandingPage for IPs with fewer than 3 requests:

$ cat old/*.log | awk '{if ( ( $9 ~ /title=Special:LandingPage/ || $9 ~ /wiki\/Special:LandingPage/ ) && $3 ~/^2019-10-20T20/ ) print $5}' |  sort | uniq -c | awk '$1 < 3' | awk '{s+=$1}END{print s}'
786

Here is, I think, a similar filter on the new logs:

$ cat new/*.log | jq 'select( .dt | test("2019-10-20T20") ) | select( .event.landingpage | test( "LandingPage" ) ) | .ip' | sort | uniq -c|  awk '$1 < 3' | awk '{s+=$1}END{print s}'
426

(Here, we use jq to parse the JSON entries, filter on the landingpage and timestamp properties, and output only the ip property. The rest is the same as end of the bash expression for the old log.)

So, there is still a discrepancy after filtering like this, but we've narrowed down the problem a lot, I think...

I've dug deeper into the Landing Page discrepancy, comparing sequences of log entries from the same IPs in both old and new pipelines.

Here is the Jupyter notebook I used for this. (To view, click the "Download" link and open in a browser.)

So, for Landing Page data, the main difference between the old and new pipelines is that the new pipeline generates events via Javascript that runs in the user's browser, whereas the old one looks at server-side webrequest events.

This means that it's likely there will always be more events coming through the old pipeline, for two reasons:

  • Some bots/automated scripts are difficult to detect, and are surely generating events that make it through the old pipeline and into the db, without running the JS to generate the new events.
  • Some users may reload the page when there's some network problem, or for another reason, before their browser is able to run the Javascript that generates the event for the new pipeline, generating additional server-side webrequest events.

Nonetheless, I haven't seen anything in the log files to suggest that legitimate Landing Page impressions for actual users are not getting through the new pipeline.

On the contrary, for the reasons mentioned, data from the old pipeline may be inflated. Since, so far, I've found no evidence of a generalized bug that would prevent the JS from generating events on certain platforms or pages, I'd suggest that the new pipeline data should be considered more accurate than data in the old pipeline.

It's possible to dig into the log files even further. If we choose to do that, maybe we could focus on verifying that no client-side bug exists...

Thanks much!!

This task was about the large-scale discrepancy, which I think we can consider to be solved. There are still smaller unexplained differences between data we're getting from the old and new pipelines. I've created new tasks to investigate that: T236835 and T236834.

Looks good to me; it makes sense that the new pipeline wouldn't pick up the bot activity.

I'm interested in seeing what happens with the two new tasks!