
Make geocoded data and chosen client_ip available as fields in refined webrequest data
Closed, Resolved, Public

Event Timeline

Ottomata raised the priority of this task from to Needs Triage.
Ottomata updated the task description.
Ottomata added subscribers: Ottomata, JAllemandou.

When this is done, the wmf.webrequest Hive table will have the following new fields:

Obtained using ClientIPUDF or IpUtil methods:

  • client_ip

Obtained using GeocodedDataUDF or Geocode methods, by geocoding the client_ip:

  • continent
  • country_code
  • country
  • subdivision
  • city
  • postal_code
  • latitude
  • longitude
  • timezone

It may make more sense to keep the geocoded data in a single map-typed field rather than in separate top-level fields. I am not sure.
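
To make the trade-off concrete, here is a rough HiveQL sketch of the two layouts. The scratch table names (geo_top_level, geo_as_map) and the geocoded_data column name are hypothetical, and the column types are assumptions. Nothing here is the agreed schema.

```sql
-- Layout A (hypothetical scratch table): one top-level column per geocoded value.
CREATE TABLE geo_top_level (
    client_ip     STRING,
    continent     STRING,
    country_code  STRING,
    country       STRING,
    subdivision   STRING,
    city          STRING,
    postal_code   STRING,
    latitude      DOUBLE,   -- assumed type
    longitude     DOUBLE,   -- assumed type
    timezone      STRING
) STORED AS PARQUET;

-- Layout B (hypothetical scratch table): all geocoded values in one map column.
CREATE TABLE geo_as_map (
    client_ip      STRING,
    geocoded_data  MAP<STRING, STRING>   -- assumed column name and value type
) STORED AS PARQUET;

-- The same question asked of each layout: requests per country.
SELECT country_code, COUNT(*) AS requests
FROM geo_top_level
GROUP BY country_code;

SELECT geocoded_data['country_code'] AS country_code, COUNT(*) AS requests
FROM geo_as_map
GROUP BY geocoded_data['country_code'];
```

With the map layout the table schema would not need to change if further geocoded values were added later, but every value is a string, so latitude and longitude would need a CAST at query time. Top-level columns can each carry their own type.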

You will need to:

  • Alter the (Parquet-formatted) wmf.webrequest table in such a way that previous data without these fields still works in SELECT statements. (Can default values be supplied? Is this even possible?)
  • In the refinery repository, modify the create_webrequest_table.hql file to reflect the schema changes.
  • In the refinery repository, modify oozie/webrequest/refine/refine_webrequest.hql to use the UDFs to populate the new fields.
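
The steps above might look roughly like the following HiveQL, assuming the top-level-column layout. This is only a sketch under several assumptions: the UDF class package, the function names get_client_ip and get_geocoded_data, the UDF argument lists, the map return type and its keys, the source columns ip and x_forwarded_for, and the ${...} variables are all hypothetical and need checking against the refinery code.

```sql
-- Step 1: append the new columns to the existing table.
-- Whether older Parquet files then read these columns as NULL should be
-- verified on the installed Hive version before relying on it.
ALTER TABLE wmf.webrequest ADD COLUMNS (
    client_ip     STRING  COMMENT 'Chosen client IP',
    continent     STRING  COMMENT 'Geocoded from client_ip',
    country_code  STRING  COMMENT 'Geocoded from client_ip',
    country       STRING  COMMENT 'Geocoded from client_ip',
    subdivision   STRING  COMMENT 'Geocoded from client_ip',
    city          STRING  COMMENT 'Geocoded from client_ip',
    postal_code   STRING  COMMENT 'Geocoded from client_ip',
    latitude      DOUBLE  COMMENT 'Geocoded from client_ip (assumed type)',
    longitude     DOUBLE  COMMENT 'Geocoded from client_ip (assumed type)',
    timezone      STRING  COMMENT 'Geocoded from client_ip'
);

-- Step 2: register the UDFs (jar variable and class package are assumptions).
ADD JAR ${refinery_hive_jar};
CREATE TEMPORARY FUNCTION get_client_ip
    AS 'org.wikimedia.analytics.refinery.hive.ClientIPUDF';
CREATE TEMPORARY FUNCTION get_geocoded_data
    AS 'org.wikimedia.analytics.refinery.hive.GeocodedDataUDF';

-- Step 3: compute client_ip once, geocode it once, then unpack the values.
-- Assumes get_geocoded_data returns a MAP<STRING, STRING> keyed by field name.
SELECT
    g.client_ip,
    g.geo['continent']                 AS continent,
    g.geo['country_code']              AS country_code,
    g.geo['country']                   AS country,
    g.geo['subdivision']               AS subdivision,
    g.geo['city']                      AS city,
    g.geo['postal_code']               AS postal_code,
    CAST(g.geo['latitude']  AS DOUBLE) AS latitude,
    CAST(g.geo['longitude'] AS DOUBLE) AS longitude,
    g.geo['timezone']                  AS timezone
FROM (
    SELECT c.client_ip, get_geocoded_data(c.client_ip) AS geo
    FROM (
        SELECT get_client_ip(ip, x_forwarded_for) AS client_ip
        FROM ${source_table}
    ) c
) g;
```

In refine_webrequest.hql these expressions would be added to the existing select list rather than run as a standalone query. The nested subqueries are there so that each UDF is evaluated once per row instead of once per output column.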

Once the changes have been reviewed and merged, we will re-submit the oozie job to populate the new data when it runs.

Not sure, but this may be helpful once we upgrade (hopefully today):
https://issues.apache.org/jira/browse/HIVE-6456

kevinator triaged this task as Medium priority.
kevinator edited projects, added Analytics-Kanban; removed Analytics-Engineering.
kevinator set Security to None.
kevinator moved this task from Next Up to In Progress on the Analytics-Kanban board.
ggellerman added a subscriber: ggellerman.