There are several categories of files and data that we'll handle in this scraper:
| Name | Container format(s) | Content format | Example filename | Description |
| --- | --- | --- | --- | --- |
| HTML dump (tarball) | tar + gzip + split | JSON lines, Parsoid-RDFa | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | The canonical upstream data source. Each contains all the articles from one wiki for one namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same content as the tarball. We may manually decompress these files when debugging, but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip (TODO) | JSON lines | hawiki-references-20230320.ndjson | The output of HtmlPageParser: a map of summary statistics for each page in a dump. |
| Sample | any | any | hawiki-sample100.ndjson | Smaller samples of each file type, which we may take while prototyping the processor. |
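To make the dump format concrete, here is a minimal sketch of streaming page records out of one tarball, assuming Python's standard `tarfile` and `json` modules; the helper name `iter_dump_pages` is hypothetical and not part of the scraper:

```python
import json
import tarfile


def iter_dump_pages(dump_path):
    """Yield one JSON object per article line from an Enterprise HTML dump tarball.

    Hypothetical helper: assumes each tar member is a newline-delimited JSON file.
    Split multi-part dumps would need to be concatenated before opening.
    """
    with tarfile.open(dump_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as ndjson_stream:
                for line in ndjson_stream:
                    yield json.loads(line)


# Example: count Main-namespace pages in one wiki's dump.
print(sum(1 for _ in iter_dump_pages("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")))
```

Streaming the tar members line by line avoids extracting the whole dump to disk, which matters for the larger wikis.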
There are several dimensions of organization to consider for each file type.
| Name | Example value | Description |
| --- | --- | --- |
| snapshot date | 20230320 | Date the dump run completed, shared across all wikis in the job (so effectively the maximum of the revision timestamps). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | What level of detail does one line of the file correspond to? |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |
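For concreteness, these dimensions could travel through the pipeline as one small value object instead of loose strings; this is only a sketch, and the name `FileCoordinates` is an assumption rather than an existing class:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileCoordinates:
    """Hypothetical bundle of the organizational dimensions above."""

    snapshot: str                        # e.g. "20230320"
    wiki: str                            # e.g. "hawiki"
    file_type: str                       # "tarball" / "summary" / "sample" / ...
    aggregation: str                     # "page" or "wiki"
    stage: str                           # "raw input" / "intermediate result" / "final output"
    sample_count: Optional[int] = None   # set only for sampled files
```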
This task is complete when we agree on a directory structure and file naming convention.
Iterating on filesystem organization is annoying and error-prone, so the first draft should already be close to final.
Assume that the filesystem is mounted over an NFS share, possibly even separate shares for inputs and outputs. The input share will probably have an existing directory structure that should be hardcoded into the scraper.
Draft proposal to edit and extend:
    inputs/
        <snapshot>/
            <wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz
    aggregates/
        <snapshot>/
            <wiki>-references-<snapshot>.ndjson.gz
    reports/
        <snapshot>/
            all-wikis-references-statistics-<snapshot>.csv
            all-wikis-references-templates-<snapshot>.csv
    sampled/
        inputs/
            <snapshot>/
                <wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.ndjson.gz
        aggregates/
            <snapshot>/
                <wiki>-references-<snapshot>-sample<count>.ndjson.gz
        reports/
            <snapshot>/
                all-wikis-references-statistics-<snapshot>-sample<count>.csv
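One way to keep the convention in a single place would be a path-building helper that encodes the draft layout; this is a hedged sketch, where `aggregate_path` and the `root` mount point are assumptions rather than existing code:

```python
from pathlib import Path
from typing import Optional


def aggregate_path(root: Path, snapshot: str, wiki: str,
                   sample_count: Optional[int] = None) -> Path:
    """Return the intermediate-summary path under the draft layout above.

    `root` is the output share's mount point; sampled files land under the
    mirrored sampled/ tree with a -sample<count> suffix.
    """
    stem = f"{wiki}-references-{snapshot}"
    if sample_count is None:
        return root / "aggregates" / snapshot / f"{stem}.ndjson.gz"
    return root / "sampled" / "aggregates" / snapshot / f"{stem}-sample{sample_count}.ndjson.gz"


# aggregate_path(Path("/mnt/out"), "20230320", "hawiki")
#   -> /mnt/out/aggregates/20230320/hawiki-references-20230320.ndjson.gz
# aggregate_path(Path("/mnt/out"), "20230320", "hawiki", sample_count=100)
#   -> /mnt/out/sampled/aggregates/20230320/hawiki-references-20230320-sample100.ndjson.gz
```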
Reasoning:
- Inputs are streamed from a dedicated, read-only dumps source (fileshare or web), and have their own predetermined structure which we should treat as opaque here.
- Inputs vs. outputs must be the highest-level partitioning because it crosses filesystem boundaries.
- Snapshots are siblings because more than one snapshot set can be present.
- Aggregate file names should include the meaningful term "references" because these data are the summarized references per article across a wiki.
- Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here, so formats should be user-friendly.
- File naming is verbose enough to convey meaning even out-of-context.
- I went back and forth about "sampled". Keeping it as a top-level partition is safe and clear; reproducing the whole tree underneath it is a bit odd, but I think of it as a bit of a scratch directory compared to the others.