agent_type field does not work for anything except last few hours
Closed, Invalid, Public

Description

I tried to move Zero analytics to the new table and decided to test the wonderful new fields like agent_type, but it only works on the most recent hours of data. This query is OK:

select distinct agent_type from webrequest where webrequest_source='mobile' and year=2015 and month=4 and day=10 and hour=23;

This one fails:

select distinct agent_type from webrequest where webrequest_source='mobile' and year=2015 and month=4 and day=10 and hour=13;

At first I thought the problem was specific to hour 13, but every earlier hour also fails.

Caused by: java.lang.IllegalStateException: Column agent_type at index 28 does not exist in {dt=dt, uri_path=uri_path, accept_language=accept_language, range=range, client_ip=client_ip, x_analytics_map=x_analytics_map, x_cache=x_cache, content_type=content_type, is_pageview=is_pageview, geocoded_data=geocoded_data, x_analytics=x_analytics, x_forwarded_for=x_forwarded_for, cache_status=cache_status, response_size=response_size, hostname=hostname, record_version=record_version, uri_query=uri_query, uri_host=uri_host, ip=ip, http_method=http_method, http_status=http_status, time_firstbyte=time_firstbyte, user_agent_map=user_agent_map, sequence=sequence, user_agent=user_agent, referer=referer}

Even simpler - this fails:

select * from webrequest where webrequest_source='mobile' and year=2015 and month=4 and day=10 and hour=23 limit 10;

Event Timeline

Yurik set the priority of this task to Needs Triage.
Yurik updated the task description.
Yurik subscribed.
Yurik set Security to None.
Ironholds claimed this task.
Ironholds subscribed.

That's not a bug. The complexity of regenerating ~60 days of data, where a day is 24*60*125000 rows, is extreme, and adding new fields means doing just that: regenerating the entire thing. As such, the decision was made to add the field to the table definition and only populate actual values going forward from the point at which the patch was merged. This was true of the is_pageview calculation, the user agent data and the geolocation elements previously added, and is still true now.
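To put the figures in the comment above into numbers, here is a small sketch that evaluates the expression exactly as written. The comment does not state the unit of the 125000 figure, and "~60 days" is approximate, so the totals are rough; the variable names are illustrative only.

```python
# Figures copied from the comment above; the unit of 125000 is not stated there.
ROWS_PER_DAY = 24 * 60 * 125000
DAYS_RETAINED = 60  # "~60 days" in the comment, so this is approximate

total_rows = ROWS_PER_DAY * DAYS_RETAINED

print(f"{ROWS_PER_DAY:,} rows per day")                   # 180,000,000 rows per day
print(f"{total_rows:,} rows over {DAYS_RETAINED} days")   # 10,800,000,000 rows over 60 days
```

Taken at face value, that is on the order of ten billion rows to regenerate for each new field.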

@kevinator or @Ottomata, is there an easy and *quick* way to check the version? I tried

hive -e "select record_version from wmf.webrequest where webrequest_source='mobile' and year=2015 and month=4 and day=12 and hour=10 limit 1;" 2>/dev/null

hoping that it would quickly export one value that I can use in a bash/python script. Alas, it started the whole Hadoop job and took forever, plus it gave tons of extra output (that wasn't part of the data).

You should just use it (or the date) in a conditional. Make sure either that the partition date is after 2015-04-10, or that record_version = "0.0.3". Otherwise, don't use the field :)
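The date-based conditional suggested above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested recipe: the helper names `has_agent_type` and `build_query` are hypothetical, the cutoff is taken directly from the comment (and is conservative, since hour 23 of 2015-04-10 was reported to work), and substituting a NULL placeholder for older partitions is an assumption about what the caller wants.

```python
from datetime import date

# Assumption taken from the comment above: only partitions dated after
# 2015-04-10 are treated as carrying agent_type.
AGENT_TYPE_CUTOFF = date(2015, 4, 10)


def has_agent_type(year, month, day):
    """Return True if the partition date is strictly after the cutoff."""
    return date(year, month, day) > AGENT_TYPE_CUTOFF


def build_query(year, month, day, hour, source="mobile"):
    """Build a HiveQL query that references agent_type only when it is safe."""
    if has_agent_type(year, month, day):
        column = "agent_type"
    else:
        # Older partitions get a NULL placeholder so the result shape is unchanged.
        column = "CAST(NULL AS STRING) AS agent_type"
    return (
        f"select distinct {column} from wmf.webrequest "
        f"where webrequest_source='{source}' "
        f"and year={year} and month={month} and day={day} and hour={hour};"
    )


# A recent partition selects the real column; an older one gets the placeholder.
print(build_query(2015, 4, 12, 10))
print(build_query(2015, 4, 10, 13))
```

Keying on the partition date avoids launching a job just to read record_version, at the cost of hard-coding the cutoff in the script.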