Broken /a/refinery-source/guard/ script on stat1002
Open, High, Public


Executing the script causes this error:

Running guard: /a/refinery-source/guard/tools/../MediaFileUrlParser/
Error The input TSV file '/a/squid/archive/sampled/sampled-1000.tsv.log-20170403.gz' does not exist

Discussion on the Analytics mailing list:

The refinery-source guards (of which there is only one) were made by Christian to do integration testing on real, recent data, to ensure that things like our various webrequest UDF parsers keep working in case the raw webrequest log format changes. The one that exists is called the MediaFileUrlParser guard, and it expects there to be sampled TSV logs from which it will read data. MediaFileUrlParser doesn’t need sampled TSV lines; it only needs full URLs. However, since the sampled TSV files existed and were already sampled, I guess Christian decided to extract URLs from them so that sampling for this test wouldn’t have to be done a second time.

Now that we don’t have sampled TSV files anymore, this job is failing. If we want to fix this, we should probably turn it into a regular MapReduce or Spark job of some kind that works with the data in Hadoop.
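
To illustrate what a guard decoupled from the sampled TSV files might look like, here is a minimal sketch in plain Python. It is not the existing guard and not a MapReduce or Spark job: it only shows the logic of sampling full URLs from some stream and checking that a parser still handles them. All names (sample_urls, run_guard, toy_parse, the 1% failure threshold) are hypothetical, and toy_parse is a stand-in, not the real MediaFileUrlParser.

```python
import random
from typing import Callable, Iterable, List, Optional


def sample_urls(urls: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k URLs from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, url in enumerate(urls):
        if i < k:
            reservoir.append(url)
        else:
            # Pick a slot in [0, i]; keep the new URL only if it lands in the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = url
    return reservoir


def run_guard(
    urls: Iterable[str],
    parse: Callable[[str], Optional[str]],
    k: int = 1000,
    max_failure_ratio: float = 0.01,  # hypothetical threshold
) -> bool:
    """Return True if the parser handles enough of the sampled URLs."""
    sample = sample_urls(urls, k)
    if not sample:
        # Missing input is treated as a guard failure, not a silent pass.
        return False
    failures = sum(1 for url in sample if parse(url) is None)
    return failures / len(sample) <= max_failure_ratio


def toy_parse(url: str) -> Optional[str]:
    """Stand-in parser (assumption): returns the path after the upload host, or None."""
    prefix = "https://upload.wikimedia.org/"
    if url.startswith(prefix) and len(url) > len(prefix):
        return url[len(prefix):]
    return None
```

In a Hadoop-based version, the urls argument would presumably come from the webrequest data rather than from a local file, so the guard would no longer depend on the sampled archive existing.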
elukey created this task. Jun 3 2017, 6:12 AM
elukey added a subscriber: faidon.
Nuria raised the priority of this task from Normal to High.
Nuria moved this task from Incoming to Dashiki on the Analytics board.
Nuria added a subscriber: Nuria.

Change 357372 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Disable role::analytics_cluster::refinery::job::guard

Change 357372 merged by Elukey:
[operations/puppet@production] Disable role::analytics_cluster::refinery::job::guard

Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. Jul 10 2017, 4:06 PM
fdans moved this task from Backlog (Later) to Deprioritized on the Analytics board. Jan 8 2018, 5:02 PM
Joe added a subscriber: Joe.

@elukey is this still ongoing? It's still open with high priority.