Executing the script causes this error:
Running guard: /a/refinery-source/guard/tools/../MediaFileUrlParser/run_guard.sh
Error The input TSV file '/a/squid/archive/sampled/sampled-1000.tsv.log-20170403.gz' does not exist
Discussion on the Analytics mailing list:
The refinery-source guards (of which there is only one) were made by Christian to do integration testing on real, recent data, to ensure that things like all our different webrequest UDF parsers keep working in case the raw webrequest log format changes. The one that exists is called the MediaFileUrlParser guard, and it expects there to be sampled TSV logs from which it will read data. MediaFileUrlParser doesn’t need sampled TSV lines; it only needs full URLs. However, since the sampled TSV files existed and were already sampled, I guess Christian decided to extract URLs from them so that sampling for this test wouldn’t have to be done a second time. Now that we don’t have sampled TSV files anymore, this job is failing. To fix this, if we want to fix it, we should probably turn it into a regular MapReduce or Spark job of some kind that works with the data in Hadoop.
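Since the guard only needs a sample of full URLs, the sampling step could look roughly like the sketch below. This is a plain-Python illustration, not the MapReduce or Spark job itself and not the existing guard code. The helper names `full_url` and `sample_urls`, the separate host/path/query fields, and the `https` scheme are all assumptions made for the example; a real job would read its records from the data in Hadoop.

```python
import random
from typing import Iterable, List, Optional


def full_url(uri_host: str, uri_path: str, uri_query: str = "") -> str:
    """Assemble a full URL from separate host, path, and query fields.

    Hypothetical helper: the field names and the https scheme are assumptions.
    """
    return f"https://{uri_host}{uri_path}{uri_query}"


def sample_urls(urls: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Reservoir-sample up to k URLs from a stream of unknown length.

    Uses the classic "Algorithm R": the first k items fill the reservoir,
    and item i (0-based, i >= k) replaces a random slot with probability k/(i+1).
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, url in enumerate(urls):
        if i < k:
            reservoir.append(url)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = url
    return reservoir
```

The sampled URLs could then be fed to the MediaFileUrlParser guard in place of the lines it used to extract from the sampled TSV files.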