
Cirrus query clicks cron job for dropping partitions older than 90 days has started failing
Closed, Resolved · Public

Description

This has failed twice so far, on May 21 and 22, 2019.

Per the email alerts, the command run is:

export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python && /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily >> /var/log/refinery/drop-query-clicks.log

The error message is:

/bin/sh: 1: cannot create /var/log/refinery/drop-query-clicks.log: Permission denied

The cron job comes from puppet, in modules/profile/manifests/analytics/refinery/job/data_purge.pp.
Looking at that file, I see there is one other user of refinery-drop-hive-partitions (and one call that uses ensure => absent). The structure is a bit different, though: cirrus uses cron where the other uses profile::analytics::systemd_timer. Additionally, the other provides a -f argument with a log file rather than piping stdout.

Does the cirrus job for ensuring we delete data in a timely manner need to be updated to some new standard for running recurring tasks in the analytics cluster, and perhaps be aligned with it?
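
For illustration, the failure mode can be reproduced with the redirection alone, since cron's /bin/sh opens the log file as the job user before the command ever runs (a sketch; the analytics user is taken from later comments on this task):

# The >> redirection is performed by cron's shell as the job user, so a log
# file that user cannot create or append to fails before any refinery code runs:
sudo -u analytics sh -c ': >> /var/log/refinery/drop-query-clicks.log'
# /bin/sh: 1: cannot create /var/log/refinery/drop-query-clicks.log: Permission denied

# The other caller avoids this by handing the log path to the script itself
# via its -f argument instead of relying on shell redirection:
/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions \
    -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily \
    -f /var/log/refinery/drop-query-clicks.log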

The related changes to the other user (mediawiki-raw-cu-changes-drop-month) are:

I97446c4f702a1ffa0db8ee07e4e0fb80cf3fe2ec : Use standard logging approach similar to sqoop job
I8528b190774922165971aee9a1fbc0a3e0fdc953 : profile::analytics::refinery::job::data_purge: move crons to timers

It's also possible the above are simply related changes, but not the cause of this exact failure.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Volans subscribed.

Checked with @Ottomata, no need for Operations here, removing the tag.

Change 512235 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Convert cirrus data retention from cron to systemd

https://gerrit.wikimedia.org/r/512235

Error message changed today:

Traceback (most recent call last):
  File "/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions", line 153, in <module>
    hive.drop_partitions(table, partition_specs_to_drop)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 190, in drop_partitions
    return self.query(q, use_tempfile=True)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 355, in query
    out = self.script(f.name, check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 365, in script
    return self._command(['-f', script], check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 372, in _command
    return sh(cmd, check_return_code)
  File "/srv/deployment/analytics/refinery/python/refinery/util.py", line 121, in sh
    .format(command_string, p.returncode), stdout, stderr)
RuntimeError: ('Command: hive --service cli --database discovery -f /tmp/tmp-hive-query-cFTKZi.hiveql failed with error code: 1', '', 'Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nPicked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\nlog4j:WARN No such property [maxBackupIndex] in org.apache.log4j.DailyRollingFileAppender.\n\nLogging initialized using configuration in file:/etc/hive/conf.analytics-hadoop/hive-log4j.properties\nOK\nTime taken: 0.874 seconds\nFAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Got exception: java.io.IOException Failed to move to trash: hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/hour=11\n')

The formatted exception is:

Command: hive --service cli --database discovery -f /tmp/tmp-hive-query-cFTKZi.hiveql failed with error code: 1
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
log4j:WARN No such property [maxBackupIndex] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/conf.analytics-hadoop/hive-log4j.properties
OK
Time taken: 0.874 seconds
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Got exception: java.io.IOException Failed to move to trash: hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/hour=11

@EBernhardson

This might be related to the recent change of user.
Before, all those crons and systemd timers were executed by the hdfs user.
This last week everything was migrated to the analytics user.
However, it's weird, because the puppet code specifies the analytics user,
and the log file that appears in the message also belongs to that user.
Those seem correct.

Regarding the second error, maybe the analytics user is missing permissions to delete folders belonging to ebernhardson:analytics-search-users.
I'm not sure, but I think the analytics user has fewer permissions than the hdfs user.
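
One way to verify this (a sketch, assuming a shell on a host with HDFS client access; the path comes from the error above):

# Inspect ownership of the partition directory the job failed to trash:
hdfs dfs -ls /wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19
# "Move to trash" is a rename into the calling user's HDFS .Trash directory,
# so the analytics user needs write permission on the source's parent directory.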

Change 512235 merged by Gehel:
[operations/puppet@production] Convert cirrus data retention from cron to systemd

https://gerrit.wikimedia.org/r/512235

Change 512702 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Convert cirrus data retention from cron to systemd.

https://gerrit.wikimedia.org/r/512702

Really sorry for the trouble @EBernhardson, of course I broke it and went away for a few days of vacation without realizing :(

elukey@stat1007:~$  ls -l /mnt/hdfs/wmf/data/discovery/query_clicks/hourly/year=2019/month=2/day=19/
total 80
drwxrwxr-x 4 ebernhardson analytics-search-users 4096 Feb 19 12:55 hour=10
drwxrwxr-x 4 ebernhardson analytics-search-users 4096 Feb 19 13:58 hour=11
[...]

Can we move the timer to a more suitable user? We are trying to reduce usage of the hdfs user as much as possible :)

Quick info: the timer is currently failing due to the permissions of /etc/hive/conf.analytics-hadoop/hive-site.xml, which are now:

elukey@an-coord1001:~$ ls -l /etc/hive/conf.analytics-hadoop/hive-site.xml
-r--r----- 1 hive analytics 6813 Jan  7 10:42 /etc/hive/conf.analytics-hadoop/hive-site.xml

If possible, I'd like to move this timer back to the analytics user. What do you think?

Hm. If this is already running as the hdfs user, then moving to the analytics user makes sense. However, it should be possible for folks to run jobs as their team's system user. The problem with that here is that we are using the hive-site.xml on an-coord, which has passwords in it. One day maybe we'll move job launching to a different server.

We have an analytics-search user, a member of the analytics-search group. It was created primarily because we needed an account the search platform team could sudo to in order to submit oozie jobs. Would it make sense to also run the systemd timer under this user and add it to the analytics group for read access to hive-site.xml? Alternatively, perhaps the analytics user could be added to the analytics-search group; that would be sufficient to run the timer as analytics without expanding what analytics-search can read.
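
Either option can be sanity-checked on the host before touching puppet (a sketch; these assume a shell on an-coord1001):

# Current group memberships of the candidate user:
id analytics-search

# Who can already read hive-site.xml via the analytics group:
getent group analytics

# Whether the candidate user can read the config as things stand:
sudo -u analytics-search head -n 1 /etc/hive/conf.analytics-hadoop/hive-site.xml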

I'd rather not mix the groups/users if possible. The hive-site.xml on an-coord1001 is important since it contains passwords, but only because the hive-server/metastore daemons need it (they run on that host). Maybe we could move this job to stat1004 or stat1007, run it via the analytics-search user, and exploit the fact that the same file there has the following permissions (no passwords):

elukey@stat1007:~$ ls -l /etc/hive/conf/hive-site.xml
-r--r--r-- 1 hive hdfs 5882 Nov 29 12:46 /etc/hive/conf/hive-site.xml

Moving it to stat1007 seems reasonable to me. A test run of the script using sudo -u analytics-search shows the data purge completing as expected (see the sketch below). I expect that logging to /var/log/refinery on stat1007 will fail, since the analytics-search user does not have write access there. Since this isn't really refinery, should we create a new log directory, /var/log/analytics-search or some such, to emit to?

Additionally, we need to bikeshed a place in puppet to put these data purge timers that don't run on an-coord*.
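
For reference, the test run and the proposed log directory look roughly like this (a sketch; the exact invocation is paraphrased from the task description, and the directory ownership follows the discussion above):

# Manual test of the purge as the proposed user (completed as expected):
sudo -u analytics-search /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions \
    -d 90 -D discovery -t query_clicks_hourly,query_clicks_daily

# Proposed log directory, writable by the job user:
sudo install -d -o analytics-search -g analytics-search /var/log/analytics-search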

+1 for /var/log/analytics-search, seems reasonable to me. It should also be very easy to do with the current puppet code. I think we could create a dedicated profile and add the timer to it? Then include the profile in stat1007's role.
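
Once the dedicated profile is applied on stat1007, the result can be checked with standard systemd tooling (unit name as it appears in the status output further down):

# Confirm the timer is installed and see its next scheduled run:
systemctl list-timers 'search-drop-query-clicks*'

# Inspect the most recent run of the service unit:
systemctl status search-drop-query-clicks.service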

Change 513038 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce profile::analytics::search::data_drop

https://gerrit.wikimedia.org/r/513038

Change 512702 merged by Gehel:
[operations/puppet@production] Convert cirrus data retention from cron to systemd.

https://gerrit.wikimedia.org/r/512702

Change 513038 merged by Elukey:
[operations/puppet@production] Introduce profile::analytics::search::jobs

https://gerrit.wikimedia.org/r/513038

On stat1007:

● search-drop-query-clicks.service - Drop cirrus click logs from Hive/HDFS following data retention policies.
   Loaded: loaded (/lib/systemd/system/search-drop-query-clicks.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-05-30 11:38:21 UTC; 13s ago
  Process: 45955 ExecStart=/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions -d 90 -D discovery -t quer

 Main PID: 45955 (code=exited, status=0/SUCCESS)    <==============

May 30 11:37:29 stat1007 systemd[1]: Started Drop cirrus click logs from Hive/HDFS following data retention policies..
elukey@stat1007:~$ ls -l /var/log/analytics-search/
total 4
-rw-r--r-- 1 analytics-search analytics-search 588 May 30 11:38 drop-query-clicks.log

elukey@stat1007:~$ cat /var/log/analytics-search/drop-query-clicks.log
2019-05-30T11:37:29 INFO   Looking for partitions to drop for discovery.query_clicks_hourly...
2019-05-30T11:37:49 INFO   Dropping 40 Hive partitions from table discovery.query_clicks_hourly
2019-05-30T11:38:02 INFO   No partition directories need removed for table discovery.query_clicks_hourly
2019-05-30T11:38:02 INFO   Looking for partitions to drop for discovery.query_clicks_daily...
2019-05-30T11:38:15 INFO   Dropping 2 Hive partitions from table discovery.query_clicks_daily
2019-05-30T11:38:21 INFO   No partition directories need removed for table discovery.query_clicks_daily

@EBernhardson let me know if it is ok or if something is missing :)

Everything looks to be running appropriately. Thanks!

fdans claimed this task.