
FRUEC: Debug minor discrepancy in landing page data between old and new pipelines
Open, Needs Triage · Public · 4 Story Points

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Oct 29 2019, 7:26 PM
AndyRussG claimed this task. · Nov 4 2019, 7:49 AM
AndyRussG added a comment. (Edited) · Nov 4 2019, 8:03 AM

I've dug into this some more, improving filtering compared to what was done for T235284. I've got some more clarity about what the differences are, but it's still pretty ugly.

Here is the Jupyter notebook. (Starts out the same as the notebook for the above-mentioned bug. To view, click on "Download" and open in a browser.)

Also, here's an etherpad with some notes. I'll continue to update the notes there... also pasting what I have so far here.

tl;dr: In the new logs for the hour under study, out of 1079 events, 147 couldn't be matched to entries in the old logs. And in the old logs for the same hour, out of 1442 events, 515 couldn't be matched to entries in the new logs.

Known discrepancies in logs

  • Old logs are a firehose of all HTTP requests to donate wiki.
  • Even some wiki page HTTP requests don't have corresponding events in the new pipeline, for various reasons:
    • Some requests are from clients that don't run the JavaScript that generates the new log data. This includes bots, spiders, automated hacking scripts, and web-accelerating proxies.
    • Some wiki pages containing JS are background-loaded from landing pages. These are fetched using the action=raw URL parameter, and do not cause the new-event-generating JS to run.
    • The JS resources with the code to generate the new events may not load, for example, due to network issues, or the user reloading the page or navigating away before the JS loads and runs.
  • On titles with a trailing /en (such as "ThankYou/en"), the URL is re-written without the trailing "/en" on the client, causing the old and new logs to have different landingpage strings.
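For reference, the title normalization used later in the analysis (removing the trailing language code so old- and new-log landingpage strings can be compared) might look like this in Python; the helper name and the exact pattern are illustrative, not from the notebook:

import re

def strip_language_suffix(title):
    # Drop a trailing language code (e.g. "ThankYou/en" -> "ThankYou");
    # assumes codes look like "/en" or "/pt-br". Illustrative sketch only.
    return re.sub(r'/[a-z]{2}(-[a-z]+)?$', '', title)

print(strip_language_suffix('ThankYou/en'))  # ThankYou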

Known discrepancies in ingress scripts

  • Old script only ingresses Special:LandingPage titles and a few unused titles. We can adapt FRUEC to do the same, or just filter on those when we do data comparisons via database queries.
  • Old script only ingresses requests that use the ugly URL format (title=...), while the new pipeline takes events from both MediaWiki URL formats. It's not yet clear whether this will cause issues in the actual data users will wish to query; it could also just be an unfortunate limitation of the old script.

Clues, mysteries and other notes

  • In the new logs for the hour under study, out of 1079 events, 147 couldn't be matched to entries in the old logs.
    • Unmatched entries in the new logs are mostly from Thank_You pages visited from France on an iPhone or iPad.
    • Possible explanations for unmatched new log entries:
      • Someone hitting the EventLogging URL directly; seems unlikely, since it's from a lot of different IP addresses.
      • Incorrect code for matching entries in the Jupyter notebook.
      • Kafkatee problem that removes some entries on the old pipeline.
      • Different IPs sometimes shown in old vs. new pipelines, which would also cause corresponding entries not to match in the Jupyter notebook.
      • Some other issue or difference in processing on the Analytics cluster.
  • In the old logs for the hour under study, out of 1442 events, 515 couldn't be matched to entries in the new logs.
AndyRussG added a comment. (Edited) · Nov 4 2019, 5:56 PM

Trying different options for better matching... Just a small improvement from allowing new log timestamps that are closest to the old log ones, that is, removing the requirement that the new log event come after the old log one. With this method, we get 136 unmatched events in the new log, and 510 in the old one.

However, the numbers go up to 144 and 516 if we limit the time difference to 2 minutes maximum.

Matching on landing page, country, language, and other event fields, with old log timestamps always earlier than the new log timestamps, by at most 30 seconds, we get:

Unmatched in new log: 58
Unmatched in old log: 427

Here's a summary of the situation regarding discrepancies between the new and old pipelines for Landing Pages (this also includes measures obtained in T235284).

  • The discrepancies appear from the start of the pipelines, and are related to the different ways events are generated (old events are from server logs, and new events are generated client-side using EventLogging).
  • There are many more events in the old logs, due to hacking attempts, bots, web-accelerating proxies and network and browser issues that prevent JS from running.
  • Some events in the logs for the new pipeline don't have matching events in the old logs. While there aren't many (far fewer than the number of extra events in the old logs), I don't have a convincing explanation for this.
  • FRUEC correctly ingresses into the database all the events it processes, with just a few exceptions (see T237997 and T238023). With a couple of minor fixes in place, no discrepancies should be due to FRUEC issues.

These results were obtained using logs for two periods: 2019-10-20 20:00:00-20:59:59 and 2019-11-01 12:00:00-16:59:59. Results were consistent for both periods. (Exception: tests of FRUEC were only performed using logs from the first period.)

We can pretty safely assume that the new pipeline is more accurate than the old one, but that the data will never be identical. See also T237553.

(Full methodology and more details to follow.)

AndyRussG added a comment. (Edited) · Mon, Nov 18, 11:29 AM

Detailed stats and methodology

Windows for log samples:

(1) 2019-10-20 20:00:00-20:59:59 UTC
(2) 2019-11-01 12:00:00-16:59:59 UTC

Since events don't always land in the log file that corresponds to the event's timestamp, I downloaded files for a time window much wider than that of the sample, then filtered for events with timestamps in the desired window.
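As a rough illustration of that windowing step, here's a minimal pandas sketch (the combined input file, the dt column name, and the exact window bounds are assumptions for the example):

import pandas as pd

# Load events from a wider range of log files than the sample window...
events = pd.read_json('combined_logs.json', lines=True)  # hypothetical combined file
events['dt'] = pd.to_datetime(events['dt'], utc=True)

# ...then keep only events whose own timestamp falls inside the window.
window_start = pd.Timestamp('2019-10-20 20:00:00', tz='UTC')
window_end = pd.Timestamp('2019-10-20 20:59:59', tz='UTC')
sample = events[(events['dt'] >= window_start) & (events['dt'] <= window_end)]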

Jupyter notebook:

Cells in the notebook are referenced below by the number in square brackets to the left of each cell.

Extra events in old log: unfiltered and filtered by repeated IP

For this part of the analysis, I looked at hits on Special:LandingPage, which is the page that displays donation forms. However, the same results should hold up for hits on other pages.

(The old ingress script only records events from a few specific pages; of those, Special:LandingPage is the only page currently in use.)

In both samples, the old logs have more than an order of magnitude more Special:LandingPage events than the new logs. The explanation I have for this, as stated above, is: hacking attempts, bots, web-accelerating proxies, and network and browser issues that prevent JS from running.

I was able to filter out many invalid events from the old logs by removing those from IPs with many hits. (Note that, in theory, this might also remove some valid events.)

Total Special:LandingPage events: Sample 1

old [1] | new [2] | new/old
15782 | 516 | 3.27%

Total Special:LandingPage events: Sample 2

old | new | new/old
114902 | 3363 | 2.93%

Events for Special:LandingPage from IPs with less than 100 such events: Sample 1

old [3] | new [4] | new/old
1278 | 516 | 40.38%

Events for Special:LandingPage from IPs with less than 100 such events: Sample 2

old | new | new/old
7276 | 3363 | 46.22%

Percentage of events for Special:LandingPage from IPs with 100 or more such events: Sample 1

old | new
91.90% | 0.00%

Percentage of events for Special:LandingPage from IPs with 100 or more such events: Sample 2

old | new
93.67% | 0.00%

Processing details
Here are examples of the commands run to obtain these stats:

# [1] Total Special:LandingPage events, old log
$ cat 20191020T20/old/*.log | awk '{if ( ( $9 ~ /title=Special:LandingPage/ || $9 ~ /wiki\/Special:LandingPage/ ) && $3 ~ /^2019-10-20T20/ ) print}' | wc -l

# [2] Total Special:LandingPage events, new log
$ cat 20191020T20/new/*.log | jq 'select( .dt | test("2019-10-20T20") ) | select( .event.landingpage | test( "LandingPage" ) ) | .ip' | wc -l

# [3] Events for Special:LandingPage from IPs with less than 100 such events, old log
$ cat 20191020T20/old/*.log | awk '{if ( ( $9 ~ /title=Special:LandingPage/ || $9 ~ /wiki\/Special:LandingPage/ ) && $3 ~ /^2019-10-20T20/ ) print $5}' | sort | uniq -c | awk '$1 < 100' | awk '{s+=$1}END{print s}'

# [4] Events for Special:LandingPage from IPs with less than 100 such events, new log
$ cat 20191020T20/new/*.log | jq 'select( .dt | test("2019-10-20T20") ) | select( .event.landingpage | test( "LandingPage" ) ) | .ip' | sort | uniq -c | awk '$1 < 100' | awk '{s+=$1}END{print s}'

(This was uncovered in T235284; what's new here is just that I applied the same filtering to the second sample, so we see the result is similar in both samples.)

More thorough filtering and preparation of logs

New and old logs for both samples were loaded into Pandas data frames in a Jupyter notebook. Events were then filtered on more specific criteria.

Here's what I filtered out of the old logs (a pandas sketch of these filters follows the two lists below):

  • Events from IPs with more than 100 hits (as above).
  • Events from non-wiki pages (i.e., page resources).
  • Events with non-OK HTTP status.
  • Events with an action URL param (to filter out action=raw, used in background calls to fetch wiki pages containing JS).
  • Events from devices with no cookies for donate wiki. (Normally, requests go through Special:FundraiserRedirector, which sets cookies. So, filtering for no cookies helps eliminate scripts and bots.)

Filtered out of the new logs:

  • Events from IPs with more than 100 hits (as above; in practice there were none).
  • Events identified by the EventLogging system as generated by bots.
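To make the filters above concrete, here's a minimal pandas sketch, assuming the old and new logs have been loaded into dataframes named old and new. Column names like http_status and is_bot are assumptions; nocookies, has_action, and url_type correspond to the derived columns described under "Processing details" below.

# Old log: drop high-frequency IPs, non-wiki-page requests, non-OK statuses,
# action= requests, and requests from clients with no donate-wiki cookies.
ip_counts = old['ip'].value_counts()
busy_ips = ip_counts[ip_counts > 100].index
old_purged = old[
    ~old['ip'].isin(busy_ips)
    & (old['url_type'] == 'wiki_page')   # hypothetical value for wiki pages
    & (old['http_status'] == 200)
    & ~old['has_action']
    & ~old['nocookies']
]

# New log: drop identified bots and high-frequency IPs (none in practice).
new_ip_counts = new['ip'].value_counts()
new_purged = new[
    ~new['ip'].isin(new_ip_counts[new_ip_counts > 100].index)
    & ~new['is_bot']
]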

Total events following more thorough filtering: Sample 1

old | new | new/old
1463 | 1107 | 75.67%

Total events following more thorough filtering: Sample 2

old | new | new/old
8404 | 5233 | 62.27%

Results are slightly worse, though similar, if we add further filtering to allow only events from landing pages ingressed by the old script. (In practice, this was the same as adding filtering on Special:LandingPage, as above.)

Total events following more thorough filtering, only ingressed landing pages: Sample 1

old | new | new/old
825 | 515 | 62.42%

Total events following more thorough filtering, only ingressed landing pages: Sample 2

old | new | new/old
5791 | 3250 | 56.12%

Processing details
In addition to loading the logs into Pandas and filtering, we add some derived columns here to make matching easier later on. Following is a summary of processing, with references to the cells in the Jupyter notebook with the actual code.

  • Create a single old log file filtering on event timestamp [cell 3]
  • Load old log events into dataframe [cell 4]
  • Remove unneeded columns, create nocookies, source and id columns in old logs dataframe [cell 5]
  • Parse URL in old logs to create columns equivalent to those in the new logs, as well as url_type and has_action (see the sketch after this list) [cell 6]
  • Create a single new log file filtering on event timestamp [cell 9]
  • Load new log file into a dataframe [cell 10]
  • Create source and id columns in new logs dataframe [cell 11]
  • Create a purged old logs dataframe, filtering out non-wiki pages, non-OK statuses, events from IPs with more than 100 requests, nocookie and action requests, and remove trailing language code (like in "ThankYou/en") from landing page titles (to allow matching, since these are inconsistent between old and new logs) [cell 12]
  • Create a purged new logs dataframe, filtering out events from identified bots and from IPs with more than 100 requests, and remove trailing language code from landing page titles. [cell 13]
  • Output sizes of purged dataframes. [cells 14 and 15]
  • Further filter, allowing only landing pages ingressed by legacy script. [cell 20]
  • Output sizes of further filtered dataframes. [cells 21 and 22]
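As an illustration of the URL-parsing step (cell 6), here's a rough sketch of deriving new-log-style columns from an old-log request URL. The query parameter names (country, uselang) and the url_type values are assumptions for the example, not the notebook's actual code:

from urllib.parse import urlparse, parse_qs

def parse_request_url(url):
    # Handle both the pretty (/wiki/Title) and ugly (title=...) URL formats.
    parsed = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    if parsed.path.startswith('/wiki/'):
        title = parsed.path[len('/wiki/'):]
        url_type = 'pretty'
    else:
        title = params.get('title', '')
        url_type = 'ugly'
    return {
        'landingpage': title,
        'url_type': url_type,
        'has_action': 'action' in params,
        'country': params.get('country'),
        'language': params.get('uselang'),
    }

print(parse_request_url(
    'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=FR&uselang=fr'))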
Comparing and matching old and new filtered logs

First, I concatenated the purged sets of old and new logs (including all wiki pages, not just those ingressed by the old script), sorted by ip and timestamp, and examined the results in a spreadsheet, to see if anything stood out. Nothing did.

Then I tried a few ways of matching up old and new logs, settling finally on matching ip, landing page and associated fields from the URL, and timestamps within 20 seconds of each other.

For many events, this technique found clear one-to-one matches. However, both new and old logs also contain "orphaned" events, that is, events without a corresponding event in the other log.

There are more orphaned events in the old logs; however, there is also a significant number in the new logs. The orphans in the new logs are more worrisome, since I haven't found a good explanation for them.
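Here's a minimal sketch of this kind of nearest-timestamp matching using pandas.merge_asof. It assumes both purged frames carry a common dt timestamp column and the source column created earlier; the matching keys beyond ip and landingpage are simplified:

import pandas as pd

# merge_asof requires both frames to be sorted by the time key.
old_sorted = old_purged.sort_values('dt')
new_sorted = new_purged.sort_values('dt')

matched = pd.merge_asof(
    new_sorted, old_sorted,
    on='dt',                        # nearest timestamps...
    tolerance=pd.Timedelta('20s'),  # ...within 20 seconds
    direction='nearest',
    by=['ip', 'landingpage'],       # these fields must agree exactly
    suffixes=('_new', '_old'),
)

# New-log events with no old-log partner within tolerance are "orphans".
orphaned_new = matched[matched['source_old'].isna()]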

Orphaned events, after attempted matching, following thorough filtering (described above): Sample 1

  | old | new
Total orphaned | 530 | 174
Orphaned as a percentage of events (after thorough filtering) | 36.23% | 15.72%

Orphaned events, after attempted matching, following thorough filtering (described above): Sample 2

  | old | new
Total orphaned | 3574 | 403
Orphaned as a percentage of events (after thorough filtering) | 42.53% | 7.70%

Processing details

  • Concatenate old and new purged log dataframes. [cell 16]
  • Reorganize columns of new dataframe. [cell 17]
  • Sort by ip and timestamp. [cell 18]
  • Export new dataframe to csv for analysis in spreadsheet. [cell 19]
  • One attempt at matching events, eventually not used. [cell 23]
  • Another attempt at matching events, eventually not used. [cell 24]
  • Match events from old and new logs; this is the method used to obtain data shown here. Also, create a dataframe of orphaned new events, for further analysis. [cell 28]
Double-checking filtering and matching

To double-check that the orphaned new events were not an artefact of a mistake in matching or some other aspect of processing, I created a merged dataframe of old and new events, with columns indicating which events had corresponding entries in the old or new logs, or in both. I examined the dataframe in a spreadsheet, but didn't see any orphaned events with likely matches that had been omitted by the Pandas processing.
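A sketch of that kind of double-check using an outer merge with an indicator column (the key columns are simplified for the example):

# indicator=True adds a _merge column: 'both' for matched events,
# 'left_only' / 'right_only' for orphans in the old or new logs.
combined = old_purged.merge(
    new_purged,
    on=['ip', 'landingpage'],
    how='outer',
    indicator=True,
    suffixes=('_old', '_new'),
)
print(combined['_merge'].value_counts())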

Processing details

  • Merge old purged log dataframe with information on which old log events have matches and which don't. [cell 34]
  • Merge in new events, including information on which new events have matches and which don't. [cell 35]
Explaining orphaned events in new logs

Why are there orphaned events in the new logs? In theory, they shouldn't exist, since new events are only generated when the user arrives at a landing page and the JavaScript for that page is loaded and run. Every time that happens, there should be an HTTP request for the base HTML of that page, and that request should generate an event in the old log.

The most likely explanation I can think of is that there are networks, proxies or clients that cache Donate Wiki content, even though they shouldn't. Cache headers for Donate Wiki seem correct; they are private, s-maxage=0, max-age=0, must-revalidate.
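For reference, one quick way to spot-check those headers (a sketch using the Python requests library; the exact URL is illustrative):

import requests

resp = requests.head(
    'https://donate.wikimedia.org/wiki/Special:LandingPage',
    allow_redirects=True,
)
print(resp.headers.get('Cache-Control'))
# Expected: private, s-maxage=0, max-age=0, must-revalidate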

The caching explanation also seems plausible because, in both samples, mobile platforms are more prevalent in orphaned new logs than in new logs overall. That is, if ISPs are caching responses and not calling our servers as they should, it seems reasonable that the practice would be most frequent with mobile carriers, since bandwidth constraints might provide an incentive to do so. However, I couldn't find any specific reports of mobile carriers (or any other ISPs) caching when they shouldn't.

Mobile operating systems (iOS + Android) share in new logs: Sample 1

Orphaned new log entries | New log entries overall
74.71% | 46.16%

Mobile operating systems (iOS + Android) share in new logs: Sample 2

Orphaned new log entries | New log entries overall
55.00% | 33.67%
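The shares above were computed along these lines (a sketch reusing the frames from the sketches above; the os_family column name, from EventLogging's user-agent data, is an assumption):

# Fraction of events whose OS family is iOS or Android.
mobile = ['iOS', 'Android']
orphan_share = orphaned_new['os_family'].isin(mobile).mean()
overall_share = new_purged['os_family'].isin(mobile).mean()
print(f'{orphan_share:.2%} vs {overall_share:.2%}')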

Predictably, mobile browser share is also higher in orphaned new log entries. I also checked differences in shares for specific landing pages; on this point, there is a difference in Sample 1 (where new log orphans are more common for Thank You pages than Special:LandingPage), but not in Sample 2.

Other possible explanations include: someone hitting the EventLogging beacon URL directly (which doesn't seem likely, looking at the orphans' URL param content), issues in our infrastructure, or mistakes in processing for this analysis.

Processing details

  • Browser share in new orphaned logs and new purged logs. [cells 36 and 37]
  • OS family. [cells 38 and 39]
  • Landing page. [cells 40 and 41]
Double-checking FRUEC

In addition to trying to analyse and explain the differences in log entries written by the old and new pipelines, I checked the FRUEC side of the pipeline. This was to ensure that no additional discrepancies in the data were added by FRUEC itself.

To do this, I used FRUEC to consume the sample logs, then queried the resulting contents of the landingpageimpression_raw table. The result of that query was loaded into a dataframe and compared with the dataframe generated directly from the log files.
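A rough sketch of that comparison (the connection string is a placeholder, and the join key is an assumption based on the id columns created earlier in the notebook):

import pandas as pd
import sqlalchemy

# Query the rows FRUEC wrote, then compare against the log-derived frame.
engine = sqlalchemy.create_engine('mysql+pymysql://user:pass@host/fruec')  # placeholder DSN
db_events = pd.read_sql('SELECT * FROM landingpageimpression_raw', engine)

merged = db_events.merge(new_purged, on='id', how='outer', indicator=True)
print((merged['_merge'] != 'both').sum())  # entries without a counterpart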

I found just a small number of log entries that FRUEC did not process exactly as the legacy system does. This is due to a very minor bug that we should fix. (See T237997.)

(This test was only performed on logs from Sample 1.)

Processing details

  • Query landingpageimpression_raw and load the contents into a dataframe. [cell 29]
  • Set default country. [cell 31]
  • Merge the dataframes obtained via database and log files. [cell 32]
  • Output the number of non-matching entries. [cell 33]
AndyRussG changed the point value for this task from 2 to 4. · Mon, Nov 18, 11:30 AM
AndyRussG moved this task from Doing to Review on the Fundraising Sprint A Wrinkle in Timezones board.
mpopov added a subscriber: mpopov. · Wed, Nov 20, 5:43 PM
Isaac added a subscriber: Isaac. · Thu, Nov 21, 8:00 PM

I'm not sure if this is at all pertinent, but we spent some time trying to debug a situation where we were missing about 10% of EventLogging data from people taking a survey served via the QuickSurveys tool. In essence, for 10% of readers, the tool was displaying the surveys correctly and we knew they had taken the survey, but we never got the EventLogging events that we should have. We ultimately decided that it was a mixture of adblock (which has settings that allow the JavaScript etc. to show the survey but block EventLogging) and, in our case, people could right-click off the page to take the survey, and that wasn't triggering EventLogging as expected.

T220627#5641946

Hi @Isaac! Thanks so much for pointing this out. I don't think it explains why we have extra, unmatched events in our new LandingPage pipeline, which does use EventLogging, vs. the old pipeline, which uses server logs of requests for the pages' base HTML. However... it does look like a possible explanation for the missing entries in the new pipeline for CentralNotice (banner system). In that case we do have a means of looking at details of missing EventLogging events. Please see T236834#5696044 and the two comments below that. I'll also post on the task you mentioned. Thanks again! :)