
Broken /a/refinery-source/guard/ script on stat1002
Closed, Resolved (Public)


Executing the script causes this error:

Running guard: /a/refinery-source/guard/tools/../MediaFileUrlParser/
Error The input TSV file '/a/squid/archive/sampled/sampled-1000.tsv.log-20170403.gz' does not exist

Discussion on the Analytics mailing list:

The refinery-source guards (of which there is only one) were written by Christian to run integration tests on real, recent data, to ensure that things like our various webrequest UDF parsers keep working if the raw webrequest log format changes. The one that exists is called the MediaFileUrlParser guard, and it expects sampled TSV logs to be present, from which it reads its data. MediaFileUrlParser doesn’t need sampled TSV lines; it only needs full URLs. However, since the sampled TSV files already existed, I guess Christian decided to extract URLs from them so that sampling for this test wouldn’t have to be done a second time.

Now that we don’t have sampled TSV files anymore, this job is failing. If we want to fix it, we should probably turn it into a regular MapReduce or Spark job of some kind that works with the data in Hadoop.
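As a rough illustration of the point that the guard only needs full URLs, here is a minimal Python sketch that separates URL extraction from the parser check, so the URLs could come from any source (for example a sample drawn in Hadoop) rather than from a TSV file on disk. Everything in it is an assumption for illustration: the names `extract_urls`, `run_guard`, and `looks_like_media_url` are hypothetical, the `url_column` default is not the documented log layout, and the stand-in check is not the real MediaFileUrlParser.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class GuardResult(NamedTuple):
    """Counts collected by one guard run."""
    checked: int
    failed: int

    @property
    def failure_ratio(self) -> float:
        return self.failed / self.checked if self.checked else 0.0


def extract_urls(tsv_lines: Iterable[str], url_column: int = 8) -> Iterator[str]:
    """Yield the URL field of each tab-separated line.

    url_column is an assumed position, chosen for illustration only.
    Lines with too few fields are skipped.
    """
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > url_column:
            yield fields[url_column]


def looks_like_media_url(url: str) -> bool:
    """Hypothetical stand-in for the real parser: accept upload.wikimedia.org URLs."""
    return url.startswith(("http://upload.wikimedia.org/",
                           "https://upload.wikimedia.org/"))


def run_guard(urls: Iterable[str], parse: Callable[[str], bool]) -> GuardResult:
    """Run parse over every URL and count the ones it rejects or crashes on."""
    checked = 0
    failed = 0
    for url in urls:
        checked += 1
        try:
            ok = parse(url)
        except Exception:
            ok = False  # a crashing parser counts as a failure
        if not ok:
            failed += 1
    if checked == 0:
        # Missing input should be a loud error, not a silent pass.
        raise ValueError("guard received no URLs to check")
    return GuardResult(checked, failed)
```

With this split, `run_guard` does not care where the URLs come from; only the code that produces the `urls` iterable would need to change when the sampled TSV files go away.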

Event Timeline

Nuria raised the priority of this task from Medium to High. (Jun 5 2017, 3:36 PM)
Nuria moved this task from Incoming to Dashiki on the Analytics board.
Nuria added a subscriber: Nuria.

Change 357372 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Disable role::analytics_cluster::refinery::job::guard

Change 357372 merged by Elukey:
[operations/puppet@production] Disable role::analytics_cluster::refinery::job::guard

Joe added a subscriber: Joe.

@elukey is this still ongoing? It's still open with High priority.

fgiunchedi claimed this task.
fgiunchedi added a subscriber: fgiunchedi.

Boldly resolving: the class has been removed from Puppet in I830a80fd7eb.