
Decide on file naming and organization structure
Closed, Resolved · Public

Description

There are several categories of file and data which we'll handle in this scraper:

| Name | Container format(s) | Content format | Example filename | Description |
| --- | --- | --- | --- | --- |
| HTML dump (tarball) | tar + gzip + split | JSON lines, Parsoid-RDFa | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | The canonical upstream data source. Each contains all the articles from one wiki, for one namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same content as the tarball. We may manually decompress these files when debugging, but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip (TODO) | JSON lines | hawiki-references-20230320.ndjson | The output of HtmlPageParser: a map of summary statistics for each page in a dump. |
| Sample | any | any | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
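
For orientation, here is a minimal sketch of how the container and content formats nest, in Python. The dump path and the "name" field are illustrative assumptions, and split parts would need to be concatenated back into one tarball first:

```
import json
import tarfile

# Hypothetical single (unsplit) dump file following the naming above.
DUMP = "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz"

# Stream the gzipped tarball; each member holds JSON lines, one article per line.
with tarfile.open(DUMP, "r:gz") as tar:
    for member in tar:
        part = tar.extractfile(member)
        if part is None:  # skip non-file members
            continue
        for line in part:
            article = json.loads(line)
            print(article.get("name"))  # field names vary by dump version
```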

There are several dimensions of organization to consider for each file type.

| Name | Example value | Description |
| --- | --- | --- |
| snapshot date | 20230320 | The date the dump run completed, shared across all wikis in the job (i.e. the maximum of revision timestamps). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | The level of detail that one line of text corresponds to. |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |

This task is complete when we agree on a directory structure and file naming convention.

Iterating on filesystem organization is annoying and error-prone, so the first draft should be close to final.

Assume that the filesystem is mounted over an NFS share, maybe even separate shares for inputs and outputs. The input share will probably have an existing directory structure that should be hardcoded into the scraper.


Draft proposal to edit and extend:

inputs/
  <snapshot>/
    <wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz

aggregates/
  <snapshot>/
    <wiki>-references-<snapshot>.ndjson.gz

reports/
  <snapshot>/
    all-wikis-references-statistics-<snapshot>.csv
    all-wikis-references-templates-<snapshot>.csv

sampled/
  inputs/
    <snapshot>/
      <wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.ndjson.gz
  aggregates/
    <snapshot>/
      <wiki>-references-<snapshot>-sample<count>.ndjson.gz
  reports/
    <snapshot>/
      all-wikis-references-statistics-<snapshot>-sample<count>.csv
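
To make the convention concrete, here is a sketch of path construction for this tree in Python. The mount points are assumptions (inputs may live on a separate read-only share, per the NFS note above):

```
from pathlib import Path

# Assumed mount points; the real shares may differ.
INPUT_ROOT = Path("/mnt/dumps/inputs")
OUTPUT_ROOT = Path("/mnt/scraper")

def input_path(snapshot: str, wiki: str) -> Path:
    return INPUT_ROOT / snapshot / f"{wiki}-NS0-{snapshot}-ENTERPRISE-HTML.json.tar.gz"

def aggregate_path(snapshot: str, wiki: str) -> Path:
    return OUTPUT_ROOT / "aggregates" / snapshot / f"{wiki}-references-{snapshot}.ndjson.gz"

def report_path(snapshot: str, topic: str) -> Path:
    # topic is e.g. "statistics" or "templates"
    return OUTPUT_ROOT / "reports" / snapshot / f"all-wikis-references-{topic}-{snapshot}.csv"

def sampled_aggregate_path(snapshot: str, wiki: str, count: int) -> Path:
    return OUTPUT_ROOT / "sampled" / "aggregates" / snapshot / f"{wiki}-references-{snapshot}-sample{count}.ndjson.gz"
```

For example, aggregate_path("20230320", "hawiki") yields /mnt/scraper/aggregates/20230320/hawiki-references-20230320.ndjson.gz.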

Reasoning:

  • Inputs are streamed from a dedicated, read-only dumps source (fileshare or web) and have their own predetermined structure, which we should treat as opaque here.
  • Input vs. outputs must be the highest-level partitioning because it crosses filesystem boundaries.
  • Snapshots are siblings because more than one snapshot set can be present.
  • Aggregate file names should include a meaningful term, "references", because these data are the per-article reference summaries for each wiki.
  • Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here, so the formats should be user-friendly.
  • File naming is verbose enough to convey meaning even out-of-context.
  • I went back and forth about "sampled". Reproducing the tree under it is a bit odd, but it seems like a safe and clear partition, and I think of it as a bit of a scratch directory compared to the others.

Event Timeline

awight updated the task description.

Observations and remarks:

  • What's ".ndjson"?
  • Organizing inputs by date sounds good. When changes are made in this directory, it's most probably either adding or removing a set of files with the same date. While this can be done with wildcards, it's easier when it's a directory.
  • What about e.g. "aggregated/" instead of "references/"?
  • I think the intermediate files should have a date as well, and it's good to keep them around for a while. That makes it much easier to tweak reports later, or to generate reports that compare data from two different intermediate files.

Otherwise this looks great. Are there specific questions left to discuss?

Thanks for the review!

  • What's ".ndjson"?

Newline-delimited JSON. In other words, JSON lines.
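
Each line is a standalone JSON document, so the gzipped intermediates can be streamed record by record, for example (field names are made up):

```
import gzip
import json

with gzip.open("hawiki-references-20230320.ndjson.gz", "rt") as f:
    for line in f:
        page = json.loads(line)  # e.g. {"page_id": 123, "ref_count": 7}
```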

When changes are made in this directory it's most probably either adding or removing a set of files with the same date.

That's exactly it, thanks for putting it into words!

  • What about e.g. "aggregated/" instead of "references/"?

My reasoning was that "references" is the focused analysis task we're doing now, so we can give it a meaningful name. We might use the same framework for answering more questions in another domain, so we could leave space for siblings. "aggregated" is a bit like "data" in generality, but I think our ideas are perfectly complementary! How about "/aggregated/references"? Or YAGNI?

  • I think the intermediate files should have a date as well.

Yeah, I think you're absolutely right, also about keeping multiple intermediate files. I'll update the task description with examples.
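
For instance, once the intermediates carry dates, a report comparing two snapshots becomes a small script. A rough sketch (the older date and the "ref_count" field are placeholders):

```
import gzip
import json

def total_refs(path: str) -> int:
    # Sum one per-page statistic across a dated intermediate file.
    with gzip.open(path, "rt") as f:
        return sum(json.loads(line).get("ref_count", 0) for line in f)

old = total_refs("aggregates/20230220/hawiki-references-20230220.ndjson.gz")
new = total_refs("aggregates/20230320/hawiki-references-20230320.ndjson.gz")
print(f"hawiki reference count change: {new - old:+d}")
```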