
Issues querying table in Hive
Closed, Declined · Public

Description

Hi all,

I am trying to run a query in both Hue and beeline, which is returning 0 rows (unexpected behavior). Earlier this week, I was trying to execute an ALTER statement in the same database, which generated a permissions error:
Permission denied: user=eyener, access=WRITE, inode="/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02"

Database: CPS
Usage: Following instructions outlined by a former Fundraising Analytics member, found in the wiki below, to test the transfer of banner history data for the Big English campaign

Wiki documentation: https://wikitech.wikimedia.org/wiki/Analytics/Fundraising#EN6C_Banner_History_Documentation

Hue/beeline query:
USE cps;
SELECT json_string FROM centralnoticebannerhistory20191202 where year = 2019 and month = 12
and day = 02 and hour = 1;

ALTER statement:
ALTER TABLE CentralNoticeBannerHistory20191202
ADD PARTITION (year=2019, month=12, day=02, hour=00)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/00';

Please let me know if you need any additional information, and thank you for the help.

My best,
Erin

Event Timeline

EYener created this task. Feb 6 2020, 2:23 PM

Hi @EYener,

  • About the query:
use cps;
show partitions centralnoticebannerhistory20191202;


OK
partition
year=2019/month=12/day=02/hour=16
year=2019/month=12/day=02/hour=17
year=2019/month=12/day=02/hour=18
year=2019/month=12/day=02/hour=19
year=2019/month=12/day=02/hour=20
year=2019/month=12/day=02/hour=21
year=2019/month=12/day=02/hour=22
year=2019/month=12/day=02/hour=23
Time taken: 0.055 seconds, Fetched: 8 row(s)

--> The 0-row result from your query comes from the fact that there is no partition for the hour you query (hour = 1; the earliest partition is hour = 16).

Working example (don't forget to put a LIMIT in your query so you don't flood your shell):

USE cps;
SELECT json_string FROM centralnoticebannerhistory20191202 where year = 2019 and month = 12
and day = 02 and hour = 16 LIMIT 10;
  • About the ALTER statement:

From a stat machine:

hdfs dfs -ls /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/00
ls: `/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/00': No such file or directory

hdfs dfs -ls /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/
Found 8 items
drwxr-xr-x   - analytics analytics          0 2019-12-02 17:05 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
drwxr-xr-x   - analytics analytics          0 2019-12-02 18:06 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/17
drwxr-xr-x   - analytics analytics          0 2019-12-02 19:06 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/18
drwxr-xr-x   - analytics analytics          0 2019-12-02 20:05 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/19
drwxr-xr-x   - analytics analytics          0 2019-12-02 21:05 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/20
drwxr-xr-x   - analytics analytics          0 2019-12-02 22:05 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/21
drwxr-xr-x   - analytics analytics          0 2019-12-02 23:06 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/22
drwxr-xr-x   - analytics analytics          0 2019-12-03 00:08 /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/23

--> The ALTER statement fails because the location doesn't exist: there is no hour 00 folder for that day (the earliest is 16).
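
For reference, here is what a corrected version of the statement would look like, pointing at an hour folder that actually exists (hour 16, per the listing above). This is a sketch adapted from the original statement, not a command from the thread:

USE cps;
ALTER TABLE centralnoticebannerhistory20191202
ADD PARTITION (year=2019, month=12, day=2, hour=16)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16';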

EYener added a comment. Feb 6 2020, 3:56 PM

Thank you, @JAllemandou! I've learned several new Hive features and commands.

I have one additional (small) question. Using your guidance, I was able to CREATE and ALTER a table for CentralNoticeBannerHistory20191215 with partitions for hour 00 and 01. When I run the ALTER statement in beeline, I see a response of:

INFO : Completed executing command(queryId=hive_20200206154040_b9a46b91-7080-4996-bee6-fc1e574a3680); Time taken: 0.027 seconds
INFO : OK
No rows affected (0.05 seconds)

When I query the table at partition hour = 00, for instance, I can see there are over 42K rows. Is there any way to assess the volume of data being added by the ALTER statement?

Thank you for your help!

fdans moved this task from Incoming to Ops Week on the Analytics board. Feb 6 2020, 5:56 PM

@EYener hiya! It seems you are using the raw json data. It'd be better to use the refined table in the event database: event.CentralNoticeBannerHistory. See also https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_&_Hive

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> show create table event.CentralNoticeBannerHistory;
...
CREATE EXTERNAL TABLE `CentralNoticeBannerHistory`(
  `dt` string,
  `event` struct<i:string,l:array<string>,n:bigint,e:string,r:double>,
  `recvfrom` string,
  `revision` bigint,
  `schema` string,
  `seqid` bigint,
  `useragent` struct<browser_family:string,browser_major:string,browser_minor:string,device_family:string,is_bot:boolean,is_mediawiki:boolean,os_family:string,os_major:string,os_minor:string,wmf_app_version:string>,
  `uuid` string,
  `webhost` string,
  `wiki` string,
  `ip` string,
  `geocoded_data` map<string,string>,
  `topic` string)
PARTITIONED BY (
  `year` bigint,
  `month` bigint,
  `day` bigint,
  `hour` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/wmf/data/event/CentralNoticeBannerHistory'
...
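
For example, the equivalent of the earlier raw-data query against the refined table would look something like this (a sketch, with columns picked from the schema above; adjust to whichever fields you actually need):

USE event;
SELECT dt, event, wiki
FROM CentralNoticeBannerHistory
WHERE year = 2019 AND month = 12 AND day = 2 AND hour = 16
LIMIT 10;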

Hi again @EYener,
Partitions in Hive are a SQL representation of folders. Adding a partition only tells Hive which folder to look in for the files belonging to those partition values (for instance, the data for partition (year=2019, month=12, day=02, hour=16) lives in the folder /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16). This scheme lets Hive limit the amount of data it reads for queries on big tables.
Something else to know: the default way for Hive to encode partition information in folder names is field=value (for instance /my/table/year=2020/month=12/...), but Hive also allows linking partitions to paths that are not field=value structured (as in the first example).
Because of that, adding a partition doesn't give you any information about data size; it only links partition values to a folder. The easiest way to get an idea of the data size without querying is to look at the folder size. For instance, on a stat machine:

hdfs dfs -du -s -h /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
10.9 M  32.8 M  /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
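
You can also confirm from Hive itself which folder a given partition points at; a sketch using standard Hive syntax, not a command from the thread:

USE cps;
-- The Location field in the output shows the HDFS folder the partition is linked to.
DESCRIBE FORMATTED centralnoticebannerhistory20191202 PARTITION (year=2019, month=12, day=2, hour=16);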
EYener added a comment. Feb 6 2020, 7:56 PM

Thanks for the suggestion, @Ottomata - I'll take it back to the team and see what makes sense. Since we've been using the json_string format since 2016, it might make sense to have the 2019 data in the same format as well.

EYener added a comment. Feb 6 2020, 9:00 PM

I'm curious, @Ottomata and @JAllemandou, whether there is an elegant solution for dynamically filling partitions. I.e., once a table is created with the partition columns declared and a main location established, is the best way to fill it to issue individual ALTER statements by day, hour, etc.? Or is there a better way to accomplish this?

Thank you for the help!

Heh, this is one of the reasons not to use the raw data; partitions are added automatically to the refined tables in the event database. The refined table also includes pre-geocoded information.

I think there isn't a good way to do it with the raw data. Hive does have a way to automatically add partitions from paths on disk, but the raw data hierarchy doesn't include the names of the partitions, so it has no way to know, e.g., that '2020' in an 'hourly/2020/01/02/05' path should be the year partition field.
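
(The mechanism in question is presumably MSCK REPAIR TABLE, which only works when folders follow the field=value convention; a minimal sketch of how it would apply to such a layout, with a hypothetical table name:

-- Works only if the table's data lives under paths like
-- /my/table/year=2020/month=01/day=02/hour=05
MSCK REPAIR TABLE my_partitioned_table;
-- Hive scans the table's location and registers any field=value folders it finds as partitions.
)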

We do have some Oozie workflows that add partitions to raw data as it comes in, but setting this up for your own special table would require you to learn Oozie and maintain your own Oozie jobs (which you can do!)
See also https://github.com/wikimedia/analytics-refinery/tree/master/oozie/util/hive/partition/add and an example usage in the webrequest load job.

But really, this data is already partitioned (and stored in a format much more efficient for querying) in the event database.

Nothing to add to what Andrew said. Adding partitions for folders that don't follow the field=value convention needs to be done 'manually' (it can be done through a script, but with explicit single-partition commands, as sketched below).
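
For instance, a minimal sketch of such a script (my example, not from the thread; the table name and paths are taken from the earlier comments):

#!/bin/bash
# Generate one ADD PARTITION statement per hour folder that actually exists.
BASE=/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02
for h in $(seq -w 0 23); do
  if hdfs dfs -test -d "$BASE/$h"; then
    echo "ALTER TABLE cps.centralnoticebannerhistory20191202 ADD IF NOT EXISTS PARTITION (year=2019, month=12, day=2, hour=${h#0}) LOCATION '$BASE/$h';"
  fi
done
# Pipe the generated statements into beeline or the hive CLI to execute them.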

Nuria added a subscriber: Nuria. Feb 17 2020, 5:24 PM

Let's not use this data; closing the ticket. The data is in the event database, in the CentralNoticeBannerHistory table, and has been for a few years.

Nuria closed this task as Declined. Feb 17 2020, 5:24 PM