
[BUG] userAgent missing from all EventLogging analytics Hive tables between 2018-11-29 and 2018-12-14
Closed, Resolved · Public · 8 Estimated Story Points · Bug Report

Description

Starting from Nov 29 2018, the user agent of the MobileWikiAppSearch table is not being parsed, and thus all fields in the useragent struct are NULL. This doesn't seem to be an app-side problem: iOS had a release on Nov 30 but Android didn't, and most data in the table comes from Android.

SELECT DISTINCT useragent
FROM event.mobilewikiappsearch
WHERE year=2018 and month=11 and day=30
---
useragent
{"browser_family":null,"browser_major":null,"browser_minor":null,"device_family":null,"is_bot":null,"is_mediawiki":null,"os_family":null,"os_major":null,"os_minor":null,"wmf_app_version":null}

This bug prevents the Search Platform team's dashboard from being updated and leads to the drop in the overall search clickthrough rate (T211306).

Can someone from Analytics Engineering look into it? Thanks!

Event Timeline

fdans triaged this task as High priority.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Update: This is happening in several other tables as well:

mobilewikiappdailystats
mobilewikiappiosfeed
mobilewikiappiosloginaction
mobilewikiappiosreadinglists
mobilewikiappiossearch
mobilewikiappiossessions
mobilewikiappiossettingaction
mobilewikiappiosuserhistory

Seems this is also affecting non-iOS EventLogging schemas, e.g. ReadingDepth and NavigationTiming.
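
A quick spot check against one of those schemas, using the same pattern as the query in the description (the event.navigationtiming table name assumes the usual event.<schema> naming), can confirm whether its useragent struct is all-NULL too:

SELECT DISTINCT useragent
FROM event.navigationtiming
WHERE year=2018 AND month=11 AND day=30;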

uuid="8522d275ebcb5f7cbd82f44ca27c6dcf"

The raw data looks correct:

{"dt": "2018-11-30T10:00:00Z", "event": {"action": "results", "app_install_id": "SOME", "client_dt": "2018-11-30T19:00:00+0900", "invoke_source": 4, "language": "de", "number_of_results": 4, "session_token": "d653ebc1-4eab-4473-80e8-5306c6a278d4", "time_to_display_results": 571, "type_of_search": "full"}, "ip": "SOME", "recvFrom": "cp5010.eqsin.wmnet", "revision": 18204643, "schema": "MobileWikiAppSearch", "seqId": 182964798, "userAgent": {"browser_family": "Android", "browser_major": "6", "browser_minor": "0", "device_family": "Generic Smartphone", "is_bot": false, "is_mediawiki": false, "os_family": "Android", "os_major": "6", "os_minor": "0", "wmf_app_version": "2.7.264-r-2018-11-05"}, "uuid": "8522d275ebcb5f7cbd82f44ca27c6dcf", "wiki": "dewiki"}

But refined data is wrong:

2018-11-30T10:00:00Z {"action":"results","appInstallID":null,"numberOfResults":null,"position":null,"searchSessionToken":null,"source":null,"timeToDisplayResults":null,"typeOfSearch":null,"sessionToken":null,"ts":null,"app_install_id":"SOME","client_dt":"2018-11-30T19:00:00+0900","invoke_source":4,"language":"de","number_of_results":4,"session_token":"d653ebc1-4eab-4473-80e8-5306c6a278d4","time_to_display_results":571,"type_of_search":"full"} NULL 18204643 MobileWikiAppSearch NULL {"browser_family":null,"browser_major":null,"browser_minor":null,"device_family":null,"is_bot":null,"is_mediawiki":null,"os_family":null,"os_major":null,"os_minor":null,"wmf_app_version":null} 8522d275ebcb5f7cbd82f44ca27c6dcf dewiki 153.250.45.144 GEOCODED_DATA 2018 11 30 10
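
For comparison, the refined row for this event can be pulled directly with a query like the one below; the column names are the lower-cased forms of the raw capsule fields (userAgent, recvFrom, seqId), so treat the exact names as an assumption rather than the authoritative table schema.

SELECT dt, recvfrom, seqid, useragent
FROM event.mobilewikiappsearch
WHERE year=2018 AND month=11 AND day=30 AND hour=10
  AND uuid='8522d275ebcb5f7cbd82f44ca27c6dcf';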

cc @Ottomata so we can look at this tomorrow.

This is indeed a bug! Am investigating! Not yet sure why this is happening.

Do we know when this started?

It isn't just useragent; recvfrom and seqid are also null.
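
To pin down when this started, a per-hour breakdown along these lines (same assumed column names as above) should show the first hour in which those fields go NULL:

SELECT day, hour,
       COUNT(*) AS events,
       SUM(IF(useragent.os_family IS NULL, 1, 0)) AS null_useragent,
       SUM(IF(recvfrom IS NULL, 1, 0)) AS null_recvfrom,
       SUM(IF(seqid IS NULL, 1, 0)) AS null_seqid
FROM event.mobilewikiappsearch
WHERE year=2018 AND month=11 AND day IN (28, 29, 30)
GROUP BY day, hour
ORDER BY day, hour;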

FYI this started happening during the last refinery-source deployment around 2018-11-29T17:41:00Z. I did make a change to the refinement code then, but I'm not yet sure how it is related. Still investigating.

Change 479743 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Normalize DataFrame before converting it to widen typed schema

https://gerrit.wikimedia.org/r/479743

Change 479743 merged by Ottomata:
[analytics/refinery/source@master] Normalize DataFrame before converting it to widen typed schema

https://gerrit.wikimedia.org/r/479743

Mentioned in SAL (#wikimedia-analytics) [2018-12-14T19:50:34Z] <ottomata> starting refinery release deploy process for refinery 0.0.82 to fix T211833

Mentioned in SAL (#wikimedia-operations) [2018-12-14T20:07:26Z] <otto@deploy1001> Started deploy [analytics/refinery@ef1f7c6]: deploying refinery-source 0.0.82 with fix for T211833

Change 479753 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump refinery-job to 0.0.82 for refine.pp jobs

https://gerrit.wikimedia.org/r/479753

Mentioned in SAL (#wikimedia-operations) [2018-12-14T20:13:30Z] <otto@deploy1001> Finished deploy [analytics/refinery@ef1f7c6]: deploying refinery-source 0.0.82 with fix for T211833 (duration: 06m 04s)

Change 479753 merged by Ottomata:
[operations/puppet@production] Bump refinery-job to 0.0.82 for refine.pp jobs

https://gerrit.wikimedia.org/r/479753

Thanks for the swift action on this! Let's make sure that other users of EL data become aware of the issue as well, e.g. via an announcement on Analytics-l. Will it be possible to backfill the missing data?

Yup, backfilling now, will send email for sure.

This affects all top-level table fields (the ones from EventCapsule) that have capital letters in their names, e.g. userAgent, recvFrom, and seqId (illustrated by the query after the commands below). Re-refining all EventLogging Hive data since 2018-11-29T17:00:00.

# Remove _REFINED success flag files
# (table names are listed from the current working directory, skipping those starting with "mediawiki")
for el_table in $(ls | grep -vE '^mediawiki'); do
    (echo "Removing _REFINED flags for $el_table"
    sudo -u hdfs hdfs dfs -rm /wmf/data/event/$el_table/year=2018/month=11/day=29/hour={17,18,19,20,21,22,23}/_REFINED
    sudo -u hdfs hdfs dfs -rm /wmf/data/event/$el_table/year=2018/month=11/day=30/hour=*/_REFINED
    sudo -u hdfs hdfs dfs -rm /wmf/data/event/$el_table/year=2018/month=12/day=30/hour=*/_REFINED) &
done

# Submit re-Refine job:
sudo -u hdfs /usr/local/bin/refine_eventlogging_analytics --since 2018-11-29T16:00:00
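
To illustrate the pattern, a query along these lines (column names assumed as earlier in this thread) should show the capsule fields whose raw names contain capital letters fully NULL during the broken period, while the all-lowercase ones remain intact:

SELECT
  SUM(IF(useragent.os_family IS NULL, 1, 0)) AS null_useragent,  -- raw name: userAgent
  SUM(IF(recvfrom IS NULL, 1, 0))            AS null_recvfrom,   -- raw name: recvFrom
  SUM(IF(seqid IS NULL, 1, 0))               AS null_seqid,      -- raw name: seqId
  SUM(IF(wiki IS NULL, 1, 0))                AS null_wiki,       -- all lowercase, unaffected
  SUM(IF(revision IS NULL, 1, 0))            AS null_revision,   -- all lowercase, unaffected
  COUNT(*)                                   AS events
FROM event.mobilewikiappsearch
WHERE year=2018 AND month=11 AND day=30;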

Mentioned in SAL (#wikimedia-analytics) [2018-12-14T20:45:10Z] <ottomata> re-refining all hive EventLogging tables since 2018-11-29T17:00:00. - T211833

Ottomata set the point value for this task to 8.
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata added a subscriber: Nuria.

Mentioned in SAL (#wikimedia-analytics) [2018-12-17T15:23:18Z] <ottomata> re-running refine_eventlogging_analytics with --ignore_done_flag (backfilling didn't complete properly on friday) - T211833

I can see this is fixed for 11/30, but I think the refine of the data is still running.

Yeah, the backfill job is still running but seems stuck. Investigating...

Mentioned in SAL (#wikimedia-analytics) [2018-12-18T19:36:49Z] <ottomata> re-running refine_eventlogging_backfill again for days in december - T211833

Ottomata renamed this task from [BUG] User agent parsing error for MobileWikiAppSearch table to [BUG] userAgent missing from all EventLogging analytics Hive tables between 2018-11-30 and 2018-12-14. · Dec 18 2018, 8:15 PM
Ottomata renamed this task from [BUG] userAgent missing from all EventLogging analytics Hive tables between 2018-11-30 and 2018-12-14 to [BUG] userAgent missing from all EventLogging analytics Hive tables between 2018-11-29 and 2018-12-14.

The days in 2018-11 are done. Here's how I launched a re-refine for days in 2018-12. This loops over every day between Dec 1 and Dec 14 and spawns a refine_eventlogging_analytics backfill job for each day.

# on an-coord1001, launch 14 individual refine jobs in parallel.
# This avoids launching one huge refine job.

for day in $(seq -w 01 14); do
  echo "Backfilling $day"
  echo ""

  job_name=refine_eventlogging_analytics_backfill_2018-12-${day}_T211833

  cmd="/usr/bin/spark2-submit --name $job_name --class org.wikimedia.analytics.refinery.job.refine.Refine --files /etc/hive/conf/hive-site.xml,/etc/refinery/refine/refine_eventlogging_analytics.properties,/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar,/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar --master yarn --deploy-mode cluster --queue production --driver-memory 8G --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:hive-jdbc-1.1.0-cdh5.10.0.jar:hive-service-1.1.0-cdh5.10.0.jar --conf spark.dynamicAllocation.maxExecutors=64  /home/otto/refinery-source/refinery-job/target/refinery-job-0.0.83-SNAPSHOT.jar --config_file refine_eventlogging_analytics.properties --since 2018-12-${day}T00:00:00 --until 2018-12-${day}T23:00:00 --ignore_done_flag=true --ignore_failure_flag=true"

  (
    echo $cmd 
    echo ""
    $cmd >> /var/log/refinery/$job_name 2>&1
  ) &
  sleep 2
done

Alright! Looking waayyy better this time. Backfilling should be done.

@mforns feel free to schedule the sanitize job.

@chelsyx the useragent data should be back. Can you confirm?

@Ottomata Confirmed that useragent data is back. Thanks for the fix!
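
For anyone re-checking their own tables, a per-day sanity check over the affected range, sketched here for MobileWikiAppSearch (adjust the table name as needed), should now report zero, or close to zero, rows with an unparsed useragent:

SELECT year, month, day,
       COUNT(*) AS events,
       SUM(IF(useragent.os_family IS NULL, 1, 0)) AS null_useragent
FROM event.mobilewikiappsearch
WHERE (year=2018 AND month=11 AND day >= 29)
   OR (year=2018 AND month=12 AND day <= 14)
GROUP BY year, month, day
ORDER BY year, month, day;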

Restricted Application changed the subtype of this task from "Task" to "Bug Report". · Oct 15 2019, 11:10 AM