
Cirrus query clicks cron job for dropping partitions older than 90 days has started failing
Closed, Resolved · Public

Description

This has failed twice so far, on May 21 and 22, 2019.

Per the email alerts, the command run is:

export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python && /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily >> /var/log/refinery/drop-query-clicks.log

The error message is:

/bin/sh: 1: cannot create /var/log/refinery/drop-query-clicks.log: Permission denied

The cron job comes from puppet, in modules/profile/manifests/analytics/refinery/job/data_purge.pp.
Looking at that file, I see there is one other user of refinery-drop-hive-partitions (and one call that uses ensure => absent). The structure is a bit different, though: cirrus uses cron where the other uses profile::analytics::systemd_timer. Additionally, the other provides a -f argument with a log file rather than piping stdout.

Does the cirrus job for ensuring we delete data in a timely manner need to be updated to some new standard for running recurring tasks in the analytics cluster, and perhaps be aligned with it?
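
For illustration, the failure mode can be reproduced with the redirection alone, since cron's /bin/sh opens the log file as the job user before the command ever runs (a sketch; the analytics user is taken from later comments on this task):

# The >> redirection is performed by cron's shell as the job user, so a log
# file that user cannot create or append to fails before any refinery code runs:
sudo -u analytics sh -c ': >> /var/log/refinery/drop-query-clicks.log'
# /bin/sh: 1: cannot create /var/log/refinery/drop-query-clicks.log: Permission denied

# The other caller avoids this by handing the log path to the script itself
# via its -f argument instead of relying on shell redirection:
/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions \
    -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily \
    -f /var/log/refinery/drop-query-clicks.log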

The related changes to the other user (mediawiki-raw-cu-changes-drop-month) are:

I97446c4f702a1ffa0db8ee07e4e0fb80cf3fe2ec : Use standard logging approach similar to sqoop job
I8528b190774922165971aee9a1fbc0a3e0fdc953 : profile::analytics::refinery::job::data_purge: move crons to timers

It's also possible the above are simply related changes, but not the cause of this exact failure.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Volans subscribed.

Checked with @Ottomata, no need for Operations here, removing the tag.

Change 512235 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Convert cirrus data retention from cron to systemd

https://gerrit.wikimedia.org/r/512235

Error message changed today:

Traceback (most recent call last):
  File "/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions", line 153, in <module>
    hive.drop_partitions(table, partition_specs_to_drop)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 190, in drop_partitions
    return self.query(q, use_tempfile=True)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 355, in query
    out = self.script(f.name, check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 365, in script
    return self._command(['-f', script], check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 372, in _command
    return sh(cmd, check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/util.py", line 121, in sh
    .format(command_string, p.returncode), stdout, stderr)
RuntimeError: ('Command: hive --service cli --database discovery -f /tmp/tmp-hive-query-cFTKZi.hiveql failed with error code: 1', '', 'Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nlog4j:WARN No such property [maxBackupIndex] in org.apache.log4j.DailyRollingFileAppender.\n\nLogging initialized using configuration in file:/etc/hive/conf.analytics-hadoop/hive-log4j.properties\nOK\nTime taken: 0.874 seconds\nFAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Got exception: java.io.IOException Failed to move to trash: hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/hour=11\n')

The formatted exception is:

Command: hive --service cli --database discovery -f /tmp/tmp-hive-query-cFTKZi.hiveql failed with error code: 1
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
log4j:WARN No such property [maxBackupIndex] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/conf.analytics-hadoop/hive-log4j.properties
OK
Time taken: 0.874 seconds
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Got exception: java.io.IOException Failed to move to trash: hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/hour=11

@EBernhardson

This might be related to the recent change of user.
Before, all those crons and systemd timers were executed by the hdfs user.
This last week everything was migrated to the analytics user.
However, it's weird, because the puppet code specifies the analytics user,
and the log file that appears in the message also belongs to that user.
Those seem correct.

Regarding the second error, maybe the analytics user is missing permissions to delete folders belonging to ebernhardson:analytics-search-users.
I'm not sure, but I think the analytics user has fewer permissions than the hdfs user.
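
One way to verify this (a sketch, assuming a shell on a host with HDFS client access; the path comes from the error above):

# Inspect ownership of the partition directory the job failed to trash:
hdfs dfs -ls /wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19
# "Move to trash" is a rename into the calling user's HDFS .Trash directory,
# so the analytics user needs write permission on the source's parent directory.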

Change 512235 merged by Gehel:
[operations/puppet@production] Convert cirrus data retention from cron to systemd

https://gerrit.wikimedia.org/r/512235

Change 512702 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Convert cirrus data retention from cron to systemd.

https://gerrit.wikimedia.org/r/512702

Really sorry for the trouble @EBernhardson, of course I broke it and went away for a few days of vacation without realizing :(

elukey@stat1007:~$  ls -l /mnt/hdfs/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/
total 80
drwxrwxr-x 4 ebernhardson analytics-search-users 4096 Feb 19 12:55 hour=10
drwxrwxr-x 4 ebernhardson analytics-search-users 4096 Feb 19 13:58 hour=11
[...]

Can we move the timer to a more suitable user? We are trying to reduce usage of the hdfs user as much as possible :)

Quick info: the timer is currently failing due to the permissions of /etc/hive/conf.analytics-hadoop/hive-site.xml, which are now:

elukey@an-coord1001:~$ ls -l /etc/hive/conf.analytics-hadoop/hive-site.xml
-r--r----- 1 hive analytics 6813 Jan  7 10:42 /etc/hive/conf.analytics-hadoop/hive-site.xml

If possible, I'd like to move this timer back to the analytics user. What do you think?

Hm. If this is already running as the hdfs user, then moving to the analytics user makes sense. However, it should be possible for folks to run jobs as their team's system user. The problem with that here is that we are using the hive-site.xml on an-coord, which has passwords in it. One day maybe we'll move job launching to a different server.

We have an analytics-search user, a member of the analytics-search group. It was created primarily because we needed an account the search platform team could sudo to in order to submit oozie jobs. Would it make sense to also run the systemd timer under this user and add it to the analytics group for read access to hive-site.xml? Alternatively, perhaps the analytics user could be added to the analytics-search group; that would be sufficient to run the timer as analytics without expanding what analytics-search can read.
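
Either option can be sanity-checked on the host before touching puppet (a sketch; these assume a shell on an-coord1001):

# Current group memberships of the candidate user:
id analytics-search

# Who can already read hive-site.xml via the analytics group:
getent group analytics

# Whether the candidate user can read the config as things stand:
sudo -u analytics-search head -n 1 /etc/hive/conf.analytics-hadoop/hive-site.xml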

I'd rather not mix the groups/users if possible. The hive-site.xml on an-coord1001 is important since it contains passwords, but only because the hive-server/metastore daemons need it (they run on that host). Maybe we could move this job to stat1004 or stat1007, run it via the analytics-search user, and exploit the fact that the same file there has the following permissions (no passwords):

elukey@stat1007:~$ ls -l /etc/hive/conf/hive-site.xml
-r--r--r-- 1 hive hdfs 5882 Nov 29 12:46 /etc/hive/conf/hive-site.xml

Moving it to stat1007 seems reasonable to me. A test run of the script using sudo -u analytics-search shows the data purge completing as expected (see the sketch below). I expect that logging to /var/log/refinery on stat1007 will fail, since the analytics-search user does not have write access there. Since this isn't really refinery, should we create a new log directory, /var/log/analytics-search or some such, to emit to?

Additionally, we need to bikeshed a place in puppet to put these data purge timers that don't run on an-coord*.
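
For reference, the test run and the proposed log directory look roughly like this (a sketch; the exact invocation is paraphrased from the task description, and the directory ownership follows the discussion above):

# Manual test of the purge as the proposed user (completed as expected):
sudo -u analytics-search /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions \
    -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily

# Proposed log directory, writable by the job user:
sudo install -d -o analytics-search -g analytics-search /var/log/analytics-search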

+1 for /var/log/analytics-search, seems reasonable to me. It should also be very easy to do with the current puppet code. I think we could create a dedicated profile and add the timer to it? Then include the profile in stat1007's role.
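
Once the dedicated profile is applied on stat1007, the result can be checked with standard systemd tooling (unit name as it appears in the status output further down):

# Confirm the timer is installed and see its next scheduled run:
systemctl list-timers 'search-drop-query-clicks*'

# Inspect the most recent run of the service unit:
systemctl status search-drop-query-clicks.service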

Change 513038 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce profile::analytics::search::data_drop

https://gerrit.wikimedia.org/r/513038

Change 512702 merged by Gehel:
[operations/puppet@production] Convert cirrus data retention from cron to systemd.

https://gerrit.wikimedia.org/r/512702

Change 513038 merged by Elukey:
[operations/puppet@production] Introduce profile::analytics::search::jobs

https://gerrit.wikimedia.org/r/513038

On stat1007:

● search-drop-query-clicks.service - Drop cirrus click logs from Hive/HDFS following data retention policies.
   Loaded: loaded (/lib/systemd/system/search-drop-query-clicks.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-05-30 11:38:21 UTC; 13s ago
  Process: 45955 ExecStart=/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions -d 90 -D discovery -t quer

 Main PID: 45955 (code=exited, status=0/SUCCESS)    <==============

May 30 11:37:29 stat1007 systemd[1]: Started Drop cirrus click logs from Hive/HDFS following data retention policies..
elukey@stat1007:~$ ls -l /var/log/analytics-search/
total 4
-rw-r--r-- 1 analytics-search analytics-search 588 May 30 11:38 drop-query-clicks.log

elukey@stat1007:~$ cat /var/log/analytics-search/drop-query-clicks.log
2019-05-30T11:37:29 INFO   Looking for partitions to drop for discovery.query_clicks_hourly...
2019-05-30T11:37:49 INFO   Dropping 40 Hive partitions from table discovery.query_clicks_hourly
2019-05-30T11:38:02 INFO   No partition directories need removed for table discovery.query_clicks_hourly
2019-05-30T11:38:02 INFO   Looking for partitions to drop for discovery.query_clicks_daily...
2019-05-30T11:38:15 INFO   Dropping 2 Hive partitions from table discovery.query_clicks_daily
2019-05-30T11:38:21 INFO   No partition directories need removed for table discovery.query_clicks_daily

@EBernhardson let me know if it is ok or if something is missing :)

Everything looks to be running appropriately. Thanks!

fdans claimed this task.