
allow for comments in headers for CSV files
Closed, Resolved · Public

Description

Having the extension ignore comments in the headers of CSV files allows for using files with embedded metadata as per the Model for Tabular Data and Metadata on the Web.

In my use case, users upload CSV files to the wiki and use get_file_data to create views on them through wiki pages, supplemented by other information entered in the wiki.

It's useful to have self-documenting CSV files, for when they are downloaded and used independently.

Event Timeline

@ahmad - how do you include comments in a CSV file?

Just in the header, before the rows begin, preceded by hashes.

There is an example in the specification linked to above.

I'm not sure any spreadsheets or the like can currently produce or even ignore these, but since it is a proposed standard for tabular metadata, CSV files published for others, both humans and machines, to consume could include them.
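For illustration, such a file might look like this (a made-up example, not the one from the specification): a few hash-prefixed metadata lines, then the header row, then the data rows.

    # publisher: Example Organisation
    # modified: 2019-05-01
    # license: CC0-1.0
    id,name,value
    1,alpha,10
    2,beta,20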

Okay, I get it. I don't know what the best approach is here - it would be easy to have the code just ignore lines that start with a "#", but on the other hand, commented lines like this are not part of the standard CSV specification, and there's always the small chance that a row is actually supposed to start with a "#".

One other option is to have something similar to the parameter "json offset=", which is used to let #get_web_data know to skip the first X characters of the JSON. Could there be a similar parameter, "csv rows offset=" or something, to have the code skip the first X rows of the CSV? Or would it be too difficult to know exactly how many comment rows each file will have?
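For what it's worth, either approach is simple in principle. A minimal sketch in Python (the extension itself is PHP; the function names and file handling here are made up for illustration):

    import csv
    import itertools

    def read_csv_ignoring_comments(path):
        # Option 1: drop every line that starts with '#'.
        # Caveat: this would also drop a genuine data row that
        # happens to begin with '#', which is the concern above.
        with open(path, newline='') as f:
            lines = (line for line in f if not line.startswith('#'))
            return list(csv.DictReader(lines))

    def read_csv_with_row_offset(path, offset):
        # Option 2: skip a fixed number of leading lines, as a
        # "csv rows offset=" parameter would.
        with open(path, newline='') as f:
            return list(csv.DictReader(itertools.islice(f, offset, None)))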

I do not think there's a way to tell beforehand how many comment lines to expect in the header, since the idea is to include an arbitrary schema description.

So perhaps an option to account for such comment headers, or not at all, would be more useful than an option to offset rows. Or perhaps a new CSV type, just as there is now one with column headers and another without. If users know they are using the CSV specification for metadata, they are likely to properly escape hashes in the data.

On another note, an option to offset rows might be useful anyway. I'm currently suffering poor performance while looking up rows with filters in a large CSV file of 15K lines / 3.5 MB. So, assuming the data is sorted, perhaps offsetting to a given row before starting the granular lookup (which, I assume, entails reading each line to test the data restrictions, i.e. the filters) might increase efficiency? I'm guessing here, as I haven't actually looked at the implementation.

I was just considering splitting the file horizontally (on rows), which is really unjustified from the perspective of data design, since splitting the file vertically (on columns) has barely improved the situation from "timing out and running out of memory" to "crawling with full resource exhaustion".

I understand this is not meant to be an efficient DB engine, but it fits my application perfectly within the model of the wiki.

Thanks, Yaron. You're always responsive.

Thinking of it, I don't think a row offset will help in my case, since I cannot know in advance how many rows there are before the sought one. But it might be useful in other cases.

Do all of your CSV files have thousands of rows? If so, it sounds like you have a bigger problem than comments in the files.

If the goal is to store the data in the wiki, what I would recommend is: install the Data Transfer and Cargo extensions, use Data Transfer to import all of the CSV rows into template calls in the wiki (they can all go on one page), and then use Cargo to store all that template data in the database, so that it can be queried in the wiki. That may sound like overkill, but I actually think it's the simplest approach for you that would let you actually access all this data.

You are right. The performance issue is separate from the comments. I just brought it up, prompted by your suggestion of a row_offset option.

Your suggestion is not overkill compared to what I did: I have already used Data Transfer, but instead of importing the data into the wiki, I opted to create pages whose titles are each a key from the CSV file, and whose content is a template that uses External Data to fetch the rest of the information from the CSV based on the key.

My aim is to allow an authoritative user to modify the CSV and re-upload it to the wiki, without having to re-import and update all the pages. The user also, so far, prefers to work with an external tool to change data in the CSV in bulk.

The CSV contains the intrinsic information of each object, which is supplemented with in-wiki annotations in the longer term.

I use SMW instead of Cargo just because I'm more comfortable using it. I'm also not sure of the final overall data structure, and I have the impression that SMW is more flexible.

At least, that was the theory. But with the performance bottleneck, I was considering importing the data into the wiki directly, as you suggest, and re-importing from the updated CSV as needed, which runs as a batch but does conclude successfully. Either that, or splitting the table until it has stabilised, before finally importing the data into the wiki.

Will adding an |ignore regex= parameter to {{#get_web_data:}} help? For comments starting with hashes, set |ignore regex=/#.*$/m.

Also, now you can try to use |offset=, if you know how many lines of comments there are.
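For example, a call along these lines (the URL and the column mappings are placeholders; see the External Data documentation for the full parameter list):

    {{#get_web_data:
     url=https://example.org/objects.csv
     |format=CSV with header
     |data=name=name,value=value
     |ignore regex=/#.*$/m
    }}

If the file has a known, fixed number of comment lines, |offset= (e.g. |offset=3) should do the same job.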

Yaron_Koren claimed this task.

I think this problem can be considered fixed, with the "offset" parameter.