
Make geocoded data and chosen client_ip available as fields in refined webrequest data
Closed, Resolved · Public

Event Timeline

Ottomata created this task. Feb 12 2015, 9:45 PM
Ottomata raised the priority of this task from to Needs Triage.
Ottomata updated the task description.
Ottomata added subscribers: Ottomata, JAllemandou.
Restricted Application added a subscriber: Aklapper. Feb 12 2015, 9:45 PM

When this is done, the wmf.webrequest Hive table will have the following new fields:

Obtained using ClientIPUDF or IpUtil methods:

  • client_ip

Obtained using GeocodedDataUDF or Geocode methods, by geocoding the client_ip:

  • continent
  • country_code
  • country
  • subdivision
  • city
  • postal_code
  • latitude
  • longitude
  • timezone

It may make sense to keep the geocoded data in a single map field rather than as top-level fields; I am not sure.
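
To make that trade-off concrete, here is a rough HiveQL sketch of how the same query would read against each layout. The `geocoded_data` column name in the map variant is hypothetical and was not chosen in this task.

```sql
-- Layout A: geocoded values as top-level columns (names from the list above).
SELECT country_code, COUNT(*) AS requests
FROM wmf.webrequest
GROUP BY country_code;

-- Layout B: geocoded values in a single map<string,string> column.
-- "geocoded_data" is a hypothetical column name used only for illustration.
SELECT geocoded_data['country_code'] AS country_code, COUNT(*) AS requests
FROM wmf.webrequest
GROUP BY geocoded_data['country_code'];
```

If the map were declared as map<string,string>, latitude and longitude would come back as strings and would need a CAST to be used as numbers; top-level columns can carry their own types.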

You will need to:

  • Alter the (Parquet-formatted) wmf.webrequest table in such a way that existing data without these fields still works in SELECT statements (default values? Is this even possible?).
  • In the refinery repository, modify the create_webrequest_table.hql file to reflect the schema changes.
  • In the refinery repository, modify oozie/webrequest/refine/refine_webrequest.hql to use the UDFs to populate the new fields.
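
The steps above might look roughly like the following HiveQL. This is a sketch under several assumptions, not the actual refinery code: the column types, the function names `get_client_ip` and `get_geocoded_data`, the `example.hive.*` class paths, the UDF argument lists, the source table `wmf_raw.webrequest`, and its `ip` and `x_forwarded_for` columns are all placeholders. It also assumes the geocoding UDF returns a map keyed by the field names listed above.

```sql
-- Step 1: add the new columns to the existing table.
-- Rows written before the change would normally read back as NULL for these
-- columns, but whether that holds for Parquet depends on the Hive version,
-- so this needs to be verified on the cluster.
ALTER TABLE wmf.webrequest ADD COLUMNS (
    client_ip    string COMMENT 'Chosen client IP',
    continent    string COMMENT 'Geocoded from client_ip',
    country_code string COMMENT 'Geocoded from client_ip',
    country      string COMMENT 'Geocoded from client_ip',
    subdivision  string COMMENT 'Geocoded from client_ip',
    city         string COMMENT 'Geocoded from client_ip',
    postal_code  string COMMENT 'Geocoded from client_ip',
    latitude     double COMMENT 'Geocoded from client_ip',
    longitude    double COMMENT 'Geocoded from client_ip',
    timezone     string COMMENT 'Geocoded from client_ip'
);

-- Step 2: register the UDFs (class paths are placeholders).
CREATE TEMPORARY FUNCTION get_client_ip     AS 'example.hive.ClientIPUDF';
CREATE TEMPORARY FUNCTION get_geocoded_data AS 'example.hive.GeocodedDataUDF';

-- Step 3: resolve the client IP once, geocode it once, then fan out the fields.
SELECT
    r.client_ip,
    r.geo['continent']                 AS continent,
    r.geo['country_code']              AS country_code,
    r.geo['country']                   AS country,
    r.geo['subdivision']               AS subdivision,
    r.geo['city']                      AS city,
    r.geo['postal_code']               AS postal_code,
    CAST(r.geo['latitude']  AS double) AS latitude,
    CAST(r.geo['longitude'] AS double) AS longitude,
    r.geo['timezone']                  AS timezone
FROM (
    SELECT s.client_ip, get_geocoded_data(s.client_ip) AS geo
    FROM (
        -- Assumed inputs: the raw IP and the X-Forwarded-For header.
        SELECT get_client_ip(w.ip, w.x_forwarded_for) AS client_ip
        FROM wmf_raw.webrequest w
    ) s
) r;
```

Nesting the subqueries keeps each UDF call to one evaluation per row; in the real refine query these expressions would sit alongside the existing columns in the INSERT into wmf.webrequest.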

Once the changes have been reviewed and merged, we will re-submit the oozie job to populate the new data when it runs.

Not sure, but this may be helpful once we upgrade (hopefully today):
https://issues.apache.org/jira/browse/HIVE-6456

kevinator triaged this task as Medium priority.
kevinator edited projects, added Analytics-Kanban; removed Analytics-Engineering.
kevinator set Security to None.
kevinator moved this task from Next Up to In Progress on the Analytics-Kanban board.
ggellerman closed this task as Resolved. Mar 2 2015, 3:08 PM
ggellerman added a subscriber: ggellerman.