
Make hdfs-rsync process sub-folders recursively
Closed, Resolved (Public)

Description

Currently the hdfs-rsync lib only works at the top level of a folder. As a result, if the content of a source subfolder changes, the change is not synchronized to the destination, because the subfolder itself is seen as unchanged.
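
To make the problem concrete, here is a minimal sketch in Python. It is not the hdfs-rsync code (that works on HDFS and the local FS through its own library); it runs on the local filesystem only, and the helper names `files_differ`, `stale_top_level` and `stale_recursive` are hypothetical. Comparing files by size alone is a simplification chosen for the example.

```python
import os


def files_differ(src_file, dst_file):
    """Size-only comparison (a simplification made for this sketch)."""
    return (not os.path.isfile(dst_file)
            or os.path.getsize(src_file) != os.path.getsize(dst_file))


def stale_top_level(src, dst):
    """Look only at the direct children of src.

    A subfolder is only checked for existence on dst, so changes
    inside it go unnoticed.
    """
    stale = []
    for name in sorted(os.listdir(src)):
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if os.path.isdir(s):
            if not os.path.isdir(d):
                stale.append(name)
        elif files_differ(s, d):
            stale.append(name)
    return stale


def stale_recursive(src, dst):
    """Walk the whole src tree and compare every file with its dst counterpart."""
    stale = []
    for root, dirs, files in os.walk(src):
        dirs.sort()  # deterministic traversal order
        rel_root = os.path.relpath(root, src)
        for name in sorted(files):
            rel = os.path.normpath(os.path.join(rel_root, name))
            if files_differ(os.path.join(src, rel), os.path.join(dst, rel)):
                stale.append(rel)
    return stale
```

With a file changed only inside a subfolder that exists on both sides, `stale_top_level` reports nothing while `stale_recursive` reports the changed file, which is the gap this task asks to close.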

Event Timeline

fdans triaged this task as High priority.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Change 557052 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Deploy analytics/hdfs-tools/deploy to hadoop clients

https://gerrit.wikimedia.org/r/557052

Change 557099 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/hdfs-tools/deploy@master] Add hdfs-rsync wrapper script

https://gerrit.wikimedia.org/r/557099

Change 557099 merged by Ottomata:
[analytics/hdfs-tools/deploy@master] Add hdfs-rsync wrapper script

https://gerrit.wikimedia.org/r/557099

Change 557117 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/hdfs-tools/deploy@master] Add hdfs-tools 0.0.1 .jar with git-fat

https://gerrit.wikimedia.org/r/557117

Change 557117 merged by Ottomata:
[analytics/hdfs-tools/deploy@master] Add hdfs-tools 0.0.1 .jar with git-fat

https://gerrit.wikimedia.org/r/557117

Change 557052 merged by Ottomata:
[operations/puppet@production] Deploy analytics/hdfs-tools/deploy to hadoop clients

https://gerrit.wikimedia.org/r/557052

Yahoo!

[@stat1007:/home/otto] $ which hdfs-rsync
/usr/local/bin/hdfs-rsync
[@stat1007:/home/otto] $ hdfs-rsync --help
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
hdfs-rsync Provides an rsync equivalent to copy data between HDFS and local FS.
Usage: hdfs-rsync [options] src...[dst]


To prevent mistakes between local and remote filesystems, src and dst have to be provided as
fully qualified absolute URIs, for instance file:/home or hdfs:///user/hive.

src should be globs leading to some existing files and dst should be a folder, existing
or at a path where it can be created.

Note: a trailing slash in src (as in /example/src/) is changed to a pattern matching only the
      directory content (as in /example/src/*) mimicking standard rsync behavior.

Syntax for filter/include/exclude rules is similar to the standard rsync one:
 * One rule per command-line option.
 * Order matters as inclusion/exclusion is done picking the first matching pattern.
 * Include/exclude options expect patterns only (use filter if you need modifiers)
 * Extraneous files to be deleted on dst follow include/exclude rules. Use --delete-excluded
   to NOT use the rules on those files.
 * the filter rule format is: RULE[MODIFIERS] PATTERN
     - RULE is either '+' (include) or '-' (exclude).
     - MODIFIERS are optional and can be '!' (negative pattern match) or '/' (match pattern
       against full path even if no '/' or '**'), or both.
     - PATTERN is the pattern to match.
   Note: A single space is expected between the rule and modifier char sequence and the pattern.
   Note: Use quotes around the rules when you use wildcard patterns to prevent the shell
         interpreting them.

 About patterns:
 * Pattern wildcard characters are: '*', '**', '?', (see rsync doc). You can escape wildcards
   using '\'. We don't reproduce the character-class wildcards as their definition is not
   present in documentation.
 * Patterns starting with a '/' are anchored, meaning they match only from the root of the
   transfer  (similar to '^' in regex).
 * Patterns with a trailing '/' match only directories.
 * Patterns containing '/' (not counting trailing '/') or '**' are matched against the full
   path (including leading directories). Otherwise it is matched only against the final
   component of the filename.

  --help                Prints this usage text
  --dry-run             Only log instead of actually taking actions (default: false)
  -v, --verbose         Add verbosity to logging (DEBUG)
  -q, --quiet           Remove verbosity from logging (WARN)
  -r, --recursive       Recurse into directories (default: false)
  -p, --perms           Preserve permissions (default: false)
  -t, --times           Preserve modification times (default: false)
  --times-diff <value>  Milliseconds by which modificationTimes can differ and still be considered equal (default: 1000)
  --size-only           Skip files that match in size (default: false)
  -I, --ignore-times    Don't skip files that match size and time (default: false)
  --delete              Delete extraneous files from dst dirs (default: false)
  --delete-excluded     Delete extraneous files from dst dirs even if present in exclude rule (default: false)
  --chmod <value>       affect file and/or directory permissions
  --filter <value>      Add a filter rule
  --include <value>     Add inclusion pattern (this is an alias for: --filter '+ PATTERN')
  --exclude <value>     Add exclusion pattern (this is an alias for: --filter '- PATTERN')
  src...[dst]           Fully qualified URI, one or more sources followed by zero or one destination
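
For anyone wondering how the include/exclude rules in the help text combine, here is a rough Python sketch of the "first matching pattern wins" logic and of the pattern rules listed under "About patterns". It is not taken from hdfs-tools; the function names are hypothetical. Assumptions: modifiers ('!' and '/') are left out, a path matched by no rule is included (as in standard rsync), an unanchored full-path pattern may match the tail of the path (also as in standard rsync), and paths are given relative to the transfer root without a leading '/'.

```python
import re


def pattern_to_regex(pattern):
    """Translate '*', '**', '?' and backslash escapes into a regex string."""
    out, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped wildcard is literal
            i += 2
        elif pattern.startswith("**", i):
            out.append(".*")        # '**' may cross '/' boundaries
            i += 2
        elif c == "*":
            out.append("[^/]*")     # '*' stays within one path component
            i += 1
        elif c == "?":
            out.append("[^/]")      # '?' is one character other than '/'
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "".join(out)


def pattern_matches(pattern, path, is_dir=False):
    """Apply the anchoring, trailing-slash and full-path rules to one pattern."""
    if pattern.endswith("/"):       # trailing '/' matches only directories
        if not is_dir:
            return False
        pattern = pattern[:-1]
    anchored = pattern.startswith("/")
    if anchored:
        pattern = pattern[1:]
    full_path = anchored or "/" in pattern or "**" in pattern
    regex = pattern_to_regex(pattern)
    if not full_path:               # match the final path component only
        return re.fullmatch(regex, path.rsplit("/", 1)[-1]) is not None
    if anchored:                    # match from the root of the transfer
        return re.fullmatch(regex, path) is not None
    return re.fullmatch("(?:.*/)?" + regex, path) is not None


def parse_rule(rule):
    """Parse 'RULE PATTERN' (no modifiers) into (include, pattern)."""
    head, sep, pattern = rule.partition(" ")
    if head not in ("+", "-") or not sep or not pattern:
        raise ValueError(f"unsupported rule: {rule!r}")
    return head == "+", pattern


def is_included(rules, path, is_dir=False):
    """Return the verdict of the first rule whose pattern matches path."""
    for rule in rules:
        include, pattern = parse_rule(rule)
        if pattern_matches(pattern, path, is_dir):
            return include
    return True  # assumption: no matching rule means the path is included
```

For example, with the rules `["+ *.csv", "- data/**"]`, the path `data/x.csv` is included because the first rule matches before the exclusion is reached, while `data/x.bin` is excluded. The sketch does not model the fact that excluding a directory also stops recursion into it.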