
Commons uploads: support Data and Document formats
Open, Needs Triage · Public · BUG REPORT

Description

This is an umbrella bug report for a range of high-impact, long-standing bugs which prevent sharing, finding, and collaborating on documents and data.
See also @JeanFred's original tracking bug T44725, linked to over 200 other tickets, including hundreds of formats not covered here.

The problem:

  • It is very hard to share data files or document files on Wikimedia projects. This limits the efficacy of community members, and of outreach to new communities (Who would join a community that rejects every format of their work?).
  • In the rare cases where a data file (in the Commons:Data namespace) or document (as PDF) is shared on Commons, it is hard to find, as the search interface hides it.
  • Adding new format support is easy, and the highest-impact work we do, but inertia + lack of clear process has made it surprisingly rare. After two decades we have support for only 20 filetypes. The only media category where we support the most common formats (directly or via transcoding) is Images.
  • Many filetypes that our communities depend on to run the projects and do daily research are blocked from upload to our wikis -- pushing them to use non-free services (such as Dropbox or Google), or non-public ones (such as email).

Issues blocking uploading data + documents:

0) Organize related tickets and data: Add FileTypes tag for new requests + analyses.

There are hundreds of file formats we should eventually support; the hard-to-follow response to dozens of past bugs on related issues is partly due to the lack of a single queue for this work.
The "how to add support for new filetypes" guide is a great start; a tag here will help. Work on things like finessing tabular-data support (into a new namespace) (on Commons only) might also merit the tag.

  • 0.1) Reopen tickets for gathering relevant data. T77796 - data on what unsupported filetypes are being uploaded
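As a rough illustration of the kind of data 0.1 asks for, here is a minimal sketch that tallies file extensions from a list of attempted upload names. The input list and the function name `tally_extensions` are hypothetical; how such names would actually be collected (logs, abuse filters, etc.) is not specified in this task.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_extensions(filenames):
    """Count file extensions (case-insensitive), most frequent first.

    `filenames` is a hypothetical list of attempted upload names.
    """
    counts = Counter()
    for name in filenames:
        # ".CSV" and ".csv" are treated as the same format.
        ext = PurePosixPath(name).suffix.lower().lstrip(".")
        counts[ext or "(none)"] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = ["budget.CSV", "survey.csv", "minutes.odt", "README"]
    for ext, n in tally_extensions(sample):
        print(ext, n)
```

A table like this, split by supported vs. unsupported extensions, could help prioritize which formats to tackle first.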

1) Add Data and Documents as categories in Commons search.

(Currently it has Images, Audio, Video, and Other Media -- add Documents, Data, before Other)

  • 1.1) Make these search .tab and .map datasets in the Data namespace - T252327
  • 1.2) Allow searching Newfiles to filter by format. - T66768
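To make the proposed grouping concrete, here is a minimal sketch of mapping a file name to one of the search categories, with Documents and Data placed before Other Media. The extension table is an illustrative assumption, not the list Commons search actually uses, and `search_category` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Hypothetical mapping for illustration only; the real category
# definitions would live in the search configuration.
CATEGORY_BY_EXTENSION = {
    "png": "Images", "jpg": "Images", "svg": "Images",
    "ogg": "Audio", "flac": "Audio", "wav": "Audio",
    "webm": "Video", "ogv": "Video",
    "pdf": "Documents", "djvu": "Documents",
    "tab": "Data", "map": "Data",
}


def search_category(filename: str) -> str:
    """Return the search category for a file name, defaulting to Other Media."""
    ext = PurePosixPath(filename).suffix.lower().lstrip(".")
    return CATEGORY_BY_EXTENSION.get(ext, "Other Media")


if __name__ == "__main__":
    for name in ["Data:Population.tab", "Report.PDF", "Model.stl"]:
        print(name, "->", search_category(name))
```

Filtering Newfiles by format (1.2) could reuse the same extension lookup.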

2) Add upload support for essential document file formats

(Currently we support only PDF, arguably the least open of all of these, and DJVU)

  • 2.1) Add .RDF support
  • 2.2) Add .ODT support - T45154 (for all ODF formats)
  • 2.3) Add .EPUB support - is there an underlying problem w/ zipped formats? T252250
  • 2.4) Add .ODP support (presentations) - T45154 (for all ODF formats)
  • 2.5) Review other OO formats - OASIS (T4089)

3) Add upload support for essential data file formats

(Currently we support NO STRUCTURED DATA FORMATS AT ALL, despite using them in every technical part of our workflow)
"There is currently a major issue with storing statistical data in Wikidata, which would be solved if we could upload the data to Commons as Tabular Data files." - @NavinoEvans on T181319

  • 3.1) Add .CSV support - in use across wikimedia, just not as files
  • 3.2) Add .JSON support - in use on many other MW instances, see T68036
  • 3.3) Add .XML support (see also Music XML + Lilypond: T214023, T208494)
  • 3.4) Add .SQLite support (widely used, selected as an archival format by the Library of Congress)
  • 3.5) Add .ODS support - T45151
  • 3.6) Update related conversations. How to deal with open datasets, Talk:Allowable file types
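For context on what "uploading as Tabular Data" involves today, here is a minimal sketch of converting CSV text into a JSON structure modeled on the Data: namespace .tab layout (license, description, schema fields, data rows). The exact key names and allowed types are my assumption from memory of that format and should be checked against the Tabular Data documentation; `csv_to_tabular` and the default license are hypothetical.

```python
import csv
import io
import json


def _to_number(cell):
    """Return int or float for a numeric cell, else None."""
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            continue
    return None


def csv_to_tabular(csv_text, description, license_id="CC0-1.0"):
    """Build a tabular-data-style dict from CSV text (assumed layout).

    Assumes a header row and rectangular data (no ragged rows).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    fields, numeric = [], []
    for i, title in enumerate(header):
        # A column is numeric only if every cell in it parses as a number.
        is_number = all(_to_number(row[i]) is not None for row in body)
        numeric.append(is_number)
        fields.append({
            "name": title.strip().lower().replace(" ", "_"),
            "type": "number" if is_number else "string",
            "title": {"en": title.strip()},
        })

    data = [
        [_to_number(cell) if numeric[i] else cell for i, cell in enumerate(row)]
        for row in body
    ]
    return {
        "license": license_id,
        "description": {"en": description},
        "schema": {"fields": fields},
        "data": data,
    }


if __name__ == "__main__":
    text = "Year,Country,Population\n2020,Iceland,366425\n2021,Iceland,372520\n"
    print(json.dumps(csv_to_tabular(text, "Example dataset"), indent=2))
```

That a conversion step like this is needed at all is the point of 3.1: the data is already in use across the projects, just not as files.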

Event Timeline

Restricted Application added a subscriber: Aklapper.
Sj updated the task description.

https://commons.wikimedia.org/wiki/Commons:Project_scope#Must_be_a_media_file will be a barrier to implementation of most of these requests.

There are existing tasks covering most of these requests; they should be added as subtasks.

Sj added a subscriber: NavinoEvans.
Sj updated the task description.
Sj added a subscriber: JeanFred.

Linking properly to JeanFred's broader tracking bug for file format support.

A full family-tree of bugs and tracking bugs might look something like:

  1. not-yet-extant Parent bug: Commons uploads: support common non-image formats
    • ref: T44725 + the hundreds of file-format-support bugs it tracks
    • ref: bugs filed against the idea of file-format filters
    • ref: bugs about the difficulty / lossiness of attempts to map existing freely-licensed files (in an unsupported format) onto something Commons accepts
  2. This bug: Commons uploads: support Data and Document formats
    • ref: related file-format support bugs (small subset of 44725)
    • ref: bugs about difficulty importing datasets to WD (when they should really be files on Commons)
    • ref: bugs about lossiness of attempts to import from freely licensed documents into wikitext
    • ref: non-format-specific bugs about the difficulty of importing datasets into Commons:Data: namespace, or the lack of File: namespace metadata + templates for such datasets
Sj renamed this task from "Commons uploads: add Data and Document categories, and major file formats for each (Umbrella)" to "Commons uploads: support Data and Document formats". · Dec 30 2021, 6:06 PM
Sj updated the task description.

Comment from a discussion this week about "what workflows the wikiverse means to support" and "what Commons is for" (vs. other free-knowledge media collections): it occurs to me that "what file formats can be discoverably shared as free knowledge" is not a Commons question, though it is often simplified to "what can be uploaded to Commons" --> "what current {Commons + WM} workflows can easily support".

So perhaps the community-consensus-needed, commons, structured-data tags are implementation details, for which there are alternatives if those details make implementation unnecessarily complicated. (We probably can't avoid upload-wizard, as long as 99% of uploads come through it.)