Page MenuHomePhabricator

NULL-values for useragent column in event.searchsatisfaction
Closed, ResolvedPublic5 Estimated Story Points

Description

Starting on 2020-08-03 at 17:00 UTC and ending on 2020-08-07 some time after 18:00 UTC, the useragent struct in event.searchsatisfaction is NULL for all events, as seen in the output of the Hive query below. Given that it's populated at other times, I'd expect it to be for that time period too.

Query:

SELECT SUBSTR(meta.dt, 1, 13) AS log_time, count(*) AS num_events
FROM event.searchsatisfaction
WHERE year = 2020
AND month = 8
AND day >= 3
AND useragent IS NOT NULL
GROUP BY SUBSTR(meta.dt, 1, 13)
ORDER BY log_time
LIMIT 250

Output:

log_time	num_events
NULL	613
2020-06-24T23	6
2020-08-03T00	888514
2020-08-03T01	871628
2020-08-03T02	857297
2020-08-03T03	780824
2020-08-03T04	742786
2020-08-03T05	743775
2020-08-03T06	839824
2020-08-03T07	981014
2020-08-03T08	1112431
2020-08-03T09	1194447
2020-08-03T10	1173086
2020-08-03T11	1254520
2020-08-03T12	1372443
2020-08-03T13	1517743
2020-08-03T14	1591469
2020-08-03T15	1582655
2020-08-03T16	1489345
2020-08-07T18	1295397
2020-08-07T19	1289413
Time taken: 64.221 seconds, Fetched: 21 row(s)

I also checked the event.homepagemodule table (which is a client-side EventLogging table), but did not find a gap there. So I suspect this is an issue with event.searchsatisfaction because it's an Event Platform table.

Event Timeline

Hmm, we encountered a different but similar smelling problem in T259924. I reverted the change I made on Aug 3 early this evening. I'll investigate this (and backfill) tomorrow.

BTW, is user_agent_map non Null? It should have mostly the same data.

BTW, is user_agent_map non Null?

user_agent_map is fine.

Backfilling searchsatisfaction and templatewizard, which are the only two legacy EL tables migrated to Event Platform.

sudo -u analytics /usr/bin/spark2-submit --name refine_eventlogging_legacy_backfill_T259944 --class org.wikimedia.analytics.refinery.job.refine.Refine --files /etc/hive/conf/hive-site.xml,/etc/refinery/refine/refine_eventlogging_legacy.properties,/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar,/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar --master yarn --deploy-mode cluster --queue production --driver-memory 8G --executor-memory 4G --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:hive-jdbc-1.1.0-cdh5.10.0.jar:hive-service-1.1.0-cdh5.10.0.jar --conf spark.dynamicAllocation.maxExecutors=64  --principal analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA --keytab /etc/security/keytabs/analytics/analytics.keytab /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.129.jar --config_file refine_eventlogging_legacy.properties --since=2020-08-03T13:00:00 --until=2020-08-07T20:00:00 --ignore_done_flag=true --ignore_failure_flag=true

application_1596639839773_24138

Backfill successful.

20/08/09 21:35:39 INFO Refine: Successfully refined 103 of 103 dataset partitions into table `event`.`SearchSatisfaction` (total # refined records: 119750916)
20/08/09 21:35:39 INFO Refine: Successfully refined 103 of 103 dataset partitions into table `event`.`TemplateWizard` (total # refined records: 13074)
presto:event> select useragent.browser_family, user_agent_map['browser_family'] from templatewizard where year=2020 and month=8 and day=3 and hour=17 limit 10;
 browser_family |  _col1
----------------+---------
 Chrome         | Chrome
 Chrome         | Chrome
 Firefox        | Firefox
 Chrome         | Chrome
 Chrome         | Chrome
 Edge           | Edge
 Chrome         | Chrome
 Chrome         | Chrome
 Chrome         | Chrome
 Chrome         | Chrome
(10 rows)

Today I'll try to figure out what happened here as part of T259924.

Ah of course. This is not related to the array of structs bug described in T259924. This is instead because the Refine transform function that adds the legacy useragent struct column only does so if the DataFrame being transformed as a useragent field, and in T255818 we stopped merging in the Hive table early on in the process, so the field doesn't exist.

I will fix this by also checking the destination Hive table schema to see if it has a legacy useragent field that should be populated.

Change 619298 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] refine - Add legacy useragent column if field exists in event schema or in Hive

https://gerrit.wikimedia.org/r/619298

mforns triaged this task as High priority.
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
mforns added a project: Analytics-Kanban.

Thanks for taking care of this so quickly, @Ottomata, very much appreciated!

BTW, is user_agent_map non Null? It should have mostly the same data.

I didn't check, mostly because user_agent_map doesn't have the is_bot field, which my queries are using to remove bot queries. Not sure if user_agent_map.device_family = "Spider" would be a better approach for Event Platform tables. Or if user_agent_map should get an is_bot field added?

Change 619298 merged by Ottomata:
[analytics/refinery/source@master] refine - Add legacy useragent column if field exists in event schema or in Hive

https://gerrit.wikimedia.org/r/619298

Nuria set the point value for this task to 5.