
Popups EventLogging events occasionally invalid
Closed, Declined · Public

Description

Looking at some eventlogging logs, I see occasional invalid events for the Popups schema. It looks like the totalInteractionTime field is sometimes set to null, but it is required to be an integer.

2017-11-13 15:26:32,297 [8480] (MainThread) Unable to validate: ?%7B%22event%22%3A%7B%22pageTitleSource%22%3A%22South%20Korea%20in%20the%20Vietnam%20War%22%2C%22namespaceIdSource%22%3A0%2C%22pageIdSource%22%3A40906410%2C%22isAnon%22%3Atrue%2C%22popupEnabled%22%3Atrue%2C%22pageToken%22%3A%2275877e98a26f57c1%22%2C%22sessionToken%22%3A%22d855c363bf2ceb19%22%2C%22hovercardsSuppressedByGadget%22%3Afalse%2C%22previewCountBucket%22%3A%221-4%20previews%22%2C%22totalInteractionTime%22%3Anull%2C%22action%22%3A%22dismissed%22%2C%22linkInteractionToken%22%3A%22106b1f8f3420bbb1%22%2C%22pageTitleHover%22%3A%22White_Horse_Division%22%2C%22namespaceIdHover%22%3A0%2C%22previewType%22%3A%22page%22%2C%22perceivedWait%22%3A834%7D%2C%22revision%22%3A16364296%2C%22schema%22%3A%22Popups%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D;	cp4027.ulsfo.wmnet	2682473	2017-11-13T15:26:31	-	"XXXXXXXX" (None is not of type 'integer')
Developer notes

Decoded event data:

{
  "event": {
    "pageTitleSource": "South Korea in the Vietnam War",
    "namespaceIdSource": 0,
    "pageIdSource": 40906410,
    "isAnon": true,
    "popupEnabled": true,
    "pageToken": "75877e98a26f57c1",
    "sessionToken": "d855c363bf2ceb19",
    "hovercardsSuppressedByGadget": false,
    "previewCountBucket": "1-4 previews",
    "totalInteractionTime": null,
    "action": "dismissed",
    "linkInteractionToken": "106b1f8f3420bbb1",
    "pageTitleHover": "White_Horse_Division",
    "namespaceIdHover": 0,
    "previewType": "page",
    "perceivedWait": 834
  },
  "revision": 16364296,
  "schema": "Popups",
  "webHost": "en.wikipedia.org",
  "wiki": "enwiki"
}

Event Timeline

@Tbayer, I CCed you on this because I don't know the proper person to do so. Could you route it to where it should go?

Seems to be stemming from the actionTypes.ABANDON_END reducer case.

@phuedx, @Jhernandez: what's a good/fast way in Redux to figure out where the problem may be happening in the chain of events?

Sorry, I don't understand the question.

The event data is populated in src/reducers/eventLogging.js; you can look for totalInteractionTime and see where it is created and what could cause it to be null.

For dismissed, the event seems to be created in createClosingEvent (src/reducers/eventLogging.js#L75).

From looking at that, I'm not sure what could make the totalInteractionTime be null, since Math.round seems to be unable to produce null:

> Math.round(null)
0
> Math.round()
NaN
> Math.round(undefined)
NaN

There are also a bunch of unit tests, so if we find a bug we should add a test (/tests/node-qunit/reducers/eventLogging.test.js#L679-L828).


It would be good to know for debugging and prioritizing:

  • How often does this happen
  • Browser/OS stats
  • List of events in the same session for the page token, to try and reproduce the user actions
How often does this happen[?]

I'd argue that we need an answer to this question before assessing priority relative to other tasks.

AFAIK events that can't be validated aren't written to the backing store (in the case of Popups, Hadoop and not MySQL). @Ottomata: Can you confirm this?

Another concern I have is that the schema dashboard doesn't appear to show this validation error: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=Popups&from=1510531200000&to=1510617600000&panelId=10&fullscreen

@Ottomata: If it is the case that invalid events aren't written to the backing store, then could you help us get an idea of how frequently this error was occurring (errors/$unitTime)?

Another concern I have is that the schema dashboard doesn't appear to show this validation error

Huh, I'm not totally sure where this error metric is supposed to come from...

If it is the case that invalid events aren't written to the backing store, then could you help us get an idea of how frequently this error was occurring (errors/$unitTime)?

Check this out! We have been working on auto-refinement of all eventlogging data into Hive (announcement to come out next week). Invalid events are written to a Kafka topic, eventlogging_EventError, with the EventError schema. Soooo, if you have Hive access:

select count(*) from event.eventerror where year = 2017 and month=11 and day in (20, 21) and event.schema = "Popups";
_c0
6

Note that I only productionized this refinement process yesterday, so there isn't a lot of data in Hive. At this moment, the above query should be looking at about the last 24 hours of data.

So, not very often. Only about 6 invalid events in the last ~24 hours.

...

So, not very often. Only about 6 invalid events in the last ~24 hours.

Interesting, but the Popups schema has a very low event rate in general currently, because the experiment was deactivated last week (T178500). It would be good to know the ratio of errors to correctly logged events during the time of the experiment.

There used to be a grafana dashboard based on the EventError topic. @mforns, do you know if that still exists?

Aside from that hope, we have some processor logs on the eventlogging server going back about a week, but that's it.

Between 2017-11-14T06:35 and 2017-11-15T06:35:

[@eventlog1001:/home/otto/processor_logs_2017-11-14] $ cat ./* | grep Popups | grep 'Unable to validate' | wc -l
11891

Ah, so I've been test refining stuff occasionally over the last month, and found some EventError data in my test Hive database. It looks like I have full days of EventErrors for Nov 2nd.

select count(*) from eventerror where event.schema = 'Popups' and year=2017 and month=11 and day=2;
_c0
11573

The bulk of the errors are on Windows; you can see that by looking at the unparsed events in the eventerror table in the "otto_json_refine_test" database.

There used to be a grafana dashboard based on the EventError topic.

Ooops, I meant to say 'There used to be a Logstash/Kibana dashboard based on the EventError topic'.

ovasileva lowered the priority of this task from High to Medium. Nov 28 2017, 5:52 PM

@Ottomata, is there a way to get answers to Joaquin's comments at https://phabricator.wikimedia.org/T180356#3773858?

We'd like to investigate this and get to the bottom of the problem, but we're not sure of the process for doing so, so we'll need some handholding about where to look and how to look. Specifically, we are interested in user agents and frequency. Can we query dropped events somewhere? Sorry in advance if this is a dumb question!

The bulk of the errors are on Windows; you can see that by looking at the unparsed events in the eventerror table in the "otto_json_refine_test" database.

How can we look at this further? In particular, we're interested in knowing the specific user agents/browsers to see if there is any pattern there.

@Jdlrobson: see my comments above; you can look at the errors in detail in the eventerror table in @Ottomata's test database on the cluster. The bulk of the errors are on Windows.

Ya, check the event.eventerror table in Hive, where event.schema = 'Popups'. The tricky bit is that the original event data is in 'raw' form, i.e. the URL-encoded JSON that was sent in the query params of the GET request.

Since you'll need some code to parse the rawEvent, it might be easier to do this in Spark rather than directly in Hive. You can use Spark in Scala or Python. Assuming you'd prefer Python, here's a way!

On stat1005, run pyspark2 to get a Python Spark shell. Then:

from urllib import unquote

def decode_raw_event(s):
    """Replace %xx escapes by their single-character equivalent."""
    bytes = s.strip('?').replace('+', ' ').encode('utf-8')
    return unquote(bytes).decode('utf-8')

q = 'select event.rawEvent from event.eventerror where year=2017 and month=11 and day=22 and event.schema = "Popups"'

# Execute the SQL using SparkSQL.  This returns a Spark DataFrame.
result_df = spark.sql(q)

# Map each row (RDD) in the DataFrame from the raw event string to the decoded version:
raw_events = result_df.rdd.map(lambda row: decode_raw_event(row.rawEvent))

# Print out a decoded raw event that failed to validate.
# NOTE: These are not JSON; they are the full log lines sent by varnishkafka.
# The first bit of each line is the JSON string of the event that was sent by the client.

print(raw_events.take(1)[0])
...
{"event":{"pageTitleSource":"Shark","namespaceIdSource":0,"pageIdSource":43617,"isAnon":true,"popupEnabled":true,"pageToken":"XXXXXX","sessionToken":"XXXXXX","hovercardsSuppressedByGadget":false,"previewCountBucket":"21+ previews","totalInteractionTime":null,"action":"dismissed","linkInteractionToken":"XXXXX","pageTitleHover":"Jersey_Shore_shark_attacks_of_1916","namespaceIdHover":0,"previewType":"page","perceivedWait":731},"revision":16364296,"schema":"Popups","webHost":"en.wikipedia.org","wiki":"enwiki"};	XXXXX	3451610	2017-11-22T14:04:41	-	"Mozilla/5.0 XXXXXXX"

Or, even easier, just dump out the whole table and use old-fashioned Unix commands; that is what I did to look at user agents.

hive -f select.hql > out.txt will dump the whole table into out.txt (which works for small tables).

The select.hql file contains:

use otto_json_refine_test;

select * from eventerror where event.schema = 'Popups' and year=2017 and month=11 and day=2;

I'm not seeing any errors on 13 November when querying the eventerror table for the Popups schema [1] (which is odd, because one is included in the description).
In fact, I'm only seeing 40 errors for the whole month of November. Am I reading this correctly? Did we purge errors before that date, or does this truly reflect the month?

These 40 errors come from 14 unique session IDs (clients), all of which were anonymous, distributed as follows:
[15, 10, 6, 4, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1 ]

All those events happened after we disabled EventLogging on the 15th and before the 24th (https://phabricator.wikimedia.org/T178500), which is strange. Given they are anonymous, it's possible certain cached pages are still opted into the experiment.

1 event includes the new timestamp field, which has not been deployed anywhere (it is scheduled for deployment in https://phabricator.wikimedia.org/T181493), so that is also very strange. (How?!)

I see no correlation between browsers and the errors; 3 are on OS X and the rest are on Windows NT 10.0.

I'd need more data to work out what's going on here, since the number of errors is so low.

[1] select event.rawEvent from event.eventerror where year=2017 and month=11 and event.schema = "Popups"

From looking at that, I'm not sure what could make the totalInteractionTime be null, since Math.round seems to be unable to produce null:

> Math.round(null)
0
> Math.round()
NaN
> Math.round(undefined)
NaN
> JSON.stringify(NaN)
null

Data is submitted to EventLogging as JSON. Many JavaScript values (such as functions, undefined, NaN, and others) do not have a JSON equivalent. Properties with such values are either omitted during serialisation or, in the case of NaN, replaced with null.
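For example, a quick illustration of those serialisation rules (not taken from the Popups code):

// undefined and function properties are dropped by JSON.stringify,
// while a NaN property is serialised as null.
JSON.stringify( {
	omitted: undefined,
	alsoOmitted: function () {},
	totalInteractionTime: NaN
} );
// => '{"totalInteractionTime":null}'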

@Jdlrobson, see my prior comments; the table with the errors is in Otto's test database, not in the event database.

Select:

use otto_json_refine_test;

select * from eventerror where event.schema = 'Popups' and year=2017 and month=11 and day=2;

Errors:

  75 Mozilla/5.0 (X11; CrOS x86_64 9765.85.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.123 Safari/537.36\
  76 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
  88 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36\
 105 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 106 Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 110 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36\
 114 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 118 Mozilla/5.0 (Windows NT 10.0; WOW64; rv
 135 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv
 136 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv
 174 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 175 Mozilla/5.0 (Windows NT 6.1; rv
 194 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063\
 206 Mozilla/5.0 (Windows NT 6.1; WOW64; rv
 223
 223 Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv
 405 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 411 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36\
 530 Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
 587 Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv
 992 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\
1205 Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv
2856 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36\

Argh, sorry, this makes a lot more sense now. I thought otto_json_refine_test was some file that Otto was using that I didn't have access to; I didn't realise it was a Hive database.

That said, even when I use otto_json_refine_test I can only find events on the 1st, 2nd and 3rd of November.
For those 3 days I'm seeing the following numbers of dropped events:

18240
211572
36494

447 are from a strange webhost, en.wiki.admicos.cf; I've excluded them from the analysis. Of the remaining events:

All 25862 events are from anonymous users.
19891 are dwelledButAbandoned events, 5970 are dismissed, and 1 is pageLoaded.

  • 15969 of the 25862 events (excluding the strange web host) occur when popups is disabled. All the events where popups is disabled are dwelledButAbandoned events.
  • When popups is enabled, there are 9893 errors. The majority (5970) are dismissed events, 3922 are dwelledButAbandoned. 8525 of these occur on Windows NT. 6990 are Chrome.

So I think there are a couple of problems here:

  1. When popups gets disabled, something goes wrong with the EventLogging of dwelledButAbandoned. What's curious is that a dwelledButAbandoned event should not be possible when popups is disabled. I think T181859 captures the problem here.
  2. There is a smaller number of events where popups is enabled that are dropped for dwelledButAbandoned and dismissed events. I've not looked into them yet, but I can't see a pattern there other than that they occur for anonymous users. Maybe this is occurring for cached pages. Until we can see some more recent data and T181859 is fixed, I can't draw any conclusions.

Is it possible to have an entire dump, or a later dump, for the month of November? There are several things I want to know that I don't seem to be able to get from this data and that would have helped with problem #2:

  • Whether erroneous events are still being submitted despite EventLogging being disabled on the 15th
  • The frequency of errors from anonymous users where popups is enabled, per day (a 3-day sample is too small): is it dropping over time?

What's curious is that a dwelledButAbandoned event should not be possible when popups is disabled.

Thanks for digging into this, @Jdlrobson!

IIRC we should be seeing dwelledButAbandoned events when the user is in the control condition or they have disabled Page Previews. These events represent the user dwelling on a link and then abandoning it before a preview appears. When Page Previews is disabled, a preview simply won't appear.

That said, even when I use otto_json_refine_test I can only find events on the 1st, 2nd and 3rd of November.

otto_json_refine_test has been a scratch database for me to test importing EventLogging events into Hive. I never imported regularly into it, so any data that is in there is real data, but is likely pretty incomplete.

event.eventerror only has data since Nov 20; this system is new and only about 98% ready, so we haven't announced it for prod yet.

Hey, another problem:

Field timestamp is repeated with multiple non StructType types: StructField(timestamp,DoubleType,true) , StructField(timestamp,LongType,true))

This is an error message I'm getting from the Hive refinement step. It seems that clients are emitting the new timestamp field sometimes as integers and sometimes as decimals. This works in JSON because the number type doesn't care, but it is really important to know the type in strongly typed systems. This error means that the refinement failed for the hours of data that have both types in them. It looks like the first successful hour of data contained only decimal values, so the timestamp field in Hive is of type double.
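For illustration, here is a sketch of how mixed JSON number types can arise (my assumption about the cause, not taken from the Popups code): a rounded timestamp serialises as an integer literal, while a value with a fractional part keeps its decimal point.

// Sketch (assumption): one code path rounds the timestamp, another does not,
// so the refined Hive column sees both LongType and DoubleType values.
var t = 123456789.5; // e.g. a millisecond timestamp with a fractional part
JSON.stringify( { timestamp: Math.round( t ) } ); // '{"timestamp":123456790}'   -> inferred as a long
JSON.stringify( { timestamp: t } );               // '{"timestamp":123456789.5}' -> inferred as a double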

Can y'all make sure the client code always sends decimal numbers? Adding a .0 to the end of the number would be good enough! :)

Making a separate task about ^^

These events represent the user dwelling on a link and then abandoning it before a preview appears.

Got it. It looks like we need to send totalInteractionTime regardless (or make it optional). Let's continue talking about that in T181859.

I've hit a wall with this. My initial hypothesis in T181859 didn't turn out to be correct.

The only pattern I'm seeing is that totalInteractionTime is being sent as null for some anonymous users with dwelledButAbandoned events (mostly, but not exclusively, Chrome on Windows NT) when page previews is disabled. I can't reproduce it locally.

According to the code, this field is defined in two places:

totalInteractionTime: Math.round( interaction.finished - interaction.started )
totalInteractionTime: Math.round( action.timestamp - state.interaction.started )

My guess is that somehow one of the operands is undefined and the result resolves to NaN (and is thus logged as null, since NaN has no JSON equivalent),

e.g.

Math.round( undefined - 1000 ) or Math.round( 1000 - undefined )
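Put together, the suspected chain would look like this (a sketch of the hypothesis above, not a confirmed reproduction):

// Sketch of the hypothesis: if interaction.started is undefined, the
// subtraction yields NaN, which JSON.stringify then turns into null.
var interaction = { finished: 1834 }; // started is missing for some unknown reason
var event = {
	totalInteractionTime: Math.round( interaction.finished - interaction.started )
};
JSON.stringify( event ); // => '{"totalInteractionTime":null}'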

Ultimately, these all seem to use timestamp, which, if I read the code correctly, comes from this function:

/**
 * Returns timestamp since the beginning of the current document's origin
 * as reported by `window.performance.now()`.
 * @return {number|null}
 */
function getCurrentTimestamp() {
	if ( window.performance && window.performance.now ) {
		// return an integer; see T182000
		return Math.round( window.performance.now() );
	}
	return null;
}

Given this can return null, is it possible for window.performance.now to be returning undefined in Chrome 62? No, right?
At the very least, it would be good to verify whether getCurrentTimestamp is actually returning null unexpectedly, and to add better error handling for when it does (or use [[ https://github.com/wikimedia/mediawiki/blob/870103526c6e9bc6dd9750c89e51f9ac3f2a6481/resources/src/startup.js#L14 | mw.now ]], which seems to have a failsafe for when window.performance is not available).


Update:
@bmansurov pointed out that timestamp is actually generated by:

/**
 * Mixes in timing information to an action.
 *
 * Warning: the `baseAction` parameter is modified and returned.
 *
 * @param {Object} baseAction
 * @return {Object}
 */
function timedAction( baseAction ) {
	baseAction.timestamp = mw.now();

	return baseAction;
}

It thus doesn't seem possible for it to be undefined unless timedAction is not called.
Is there an event combination where it may not be called? I notice, for instance, that previewDwell is not a timedAction. Is there a possible situation where events run out of sequence? I'm not sure how to check that; we don't log the last event in EventLogging.

Isn't currentTimestamp only used as the timestamp property of the data object?


Given we're not logging events any more, can we decline this task, @Ottomata?