From https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/6#note_13763:
Let's figure out independently of this PR what is a good place on HDFS to put this FILTER_PARQUET parquet file.
@xcollazo, what about setting the paths with VariableProperties, much as we do with the conda artifact?
Something like helper = var_props.get('helper', '/path/to/hdfs')
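To make the override idea concrete, here is a minimal sketch. It does not use the real VariableProperties class: `VarProps` is a hypothetical stand-in that only mimics a `get(name, default)` lookup, and both paths are placeholders.

```python
class VarProps:
    """Hypothetical stand-in for VariableProperties (not the real class).

    It only models the behaviour discussed here: an override, if present,
    wins over the default passed to get().
    """

    def __init__(self, overrides=None):
        self._overrides = dict(overrides or {})

    def get(self, name, default):
        # Return the override when one is set, otherwise the default.
        return self._overrides.get(name, default)


# Placeholder default, same as in the comment above.
DEFAULT_HELPER_PATH = '/path/to/hdfs'


def helper_path(var_props):
    """Resolve the HDFS path of the static file for a DAG run."""
    return var_props.get('helper', DEFAULT_HELPER_PATH)


# No override: the DAG uses the default location.
default_path = helper_path(VarProps())

# Temporary override (placeholder path), e.g. while the default is being moved.
overridden_path = helper_path(VarProps({'helper': '/path/to/new/hdfs'}))
```

With something like this, the default lives in the DAG code and a temporary override can point a run at a different location without a code change.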
Yes, that makes sense.
I had two concerns with the current way of using static files:
@mfossati's suggestion takes care of (1), as we could temporarily override the path while we change the default to the new location.
For (2): We can also come up with some namespacing strategy like:
/user/analytics-platform-eng/structured-data/image_suggestions/ for image_suggestions static files
and
/user/analytics-platform-eng/structured-data/section_topics/ for section_topics static files
etc.
I think the above would make it clear that these static files should not be touched by other folks.
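A rough sketch of how the namespaced paths could be built, assuming the base directory above. The helper names and the example file name are hypothetical, only there for illustration.

```python
import posixpath

# Base directory taken from the namespacing proposal above.
STATIC_BASE = '/user/analytics-platform-eng/structured-data'


def static_dir(project):
    """Return the HDFS directory holding a project's static files."""
    if not project or '/' in project:
        # Keep every project confined to its own sub-directory.
        raise ValueError(f'invalid project name: {project!r}')
    return posixpath.join(STATIC_BASE, project) + '/'


def static_file(project, filename):
    """Return the full HDFS path of one static file of a project."""
    return posixpath.join(static_dir(project), filename)


image_suggestions_dir = static_dir('image_suggestions')
section_topics_dir = static_dir('section_topics')
# 'filter.parquet' is a made-up file name for illustration.
example_file = static_file('section_topics', 'filter.parquet')
```

Keeping the base in one constant means each project only supplies its own name, so the files of different projects cannot end up mixed in the same directory.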
This was implemented in the DAGs, see:
Closing.