Allow filtering for type when search in data namespace
Closed, ResolvedPublic

Description

We would like to create different datatypes on Wikidata for map and tab data within the data namespace.
Therefore we need to have the possibility to filter when searching in that namespace.
We would like to filter something like pagetype:map or pagetype:tab, similar to filetype:drawings.

QUESTION: should we expose the internal distinction between content model and media type, between uploaded media and page content? Or should we rather hide that distinction?

Event Timeline

The pages in the Data namespace really are not files, so I suggest something other than "filetype". Maybe "datatype" or something?

  1. For all use cases that matter they are files. The only difference is that they are guaranteed to not be binary, and therefore can be edited and shown in diffs.
  2. The Cirrus search keywords filetype:… and filemime:… already exist. Please do not invent new ones with exactly the same semantics.
  3. Both file types currently allowed in the Data:… namespace do have MIME types: "application/vnd.geo+json" and "application/vnd.datapackage+json" (the latter is not fully approved at the moment, see https://github.com/frictionlessdata/specs/issues/333).

The simplest solution is to expose the MIME type via filemime:… and the file extension via filetype:….

maybe we can allow searching by content model?

Please keep in mind that (at least for MediaWiki) things in the Data namespace are *not* files; they do not *have* a file type, MIME type, or media type.

Page content has a content model, and a content format. Yes, this distinction is kind of silly on a conceptual level, and should go away eventually. But for now, uploaded media and page content are two fundamentally separate and different things.

So instead of exposing filetype: or filemime: based on the file extension in the title, MediaWiki should just always expose the content model of all pages (but not the content format, that's internal).

daniel renamed this task from Allow filtering for filetype when search in data namespace to Allow filtering for type when search in data namespace. Jan 26 2017, 2:34 PM
daniel updated the task description.

In the Data namespace, we have Tabular.JsonConfig and Map.JsonConfig as content models, but as Daniel suggests, searching by content model could be a more general thing, backed by a field that MediaWiki exposes.

You keep repeating this. I'm sure the developers who need to know this know this. But I, as a user, do not care. There are MIME types defined for whatever is in the Data: namespace right now. Cirrus supports a keyword to search for MIME types. So I, as a user, expect a search for filemime:application/vnd.geo+json to work, without learning a new keyword just because some file types are in another namespace for reasons I, as a user, don't care much about.

Implementing a keyword that allows searching for the content model is fine, but:

  • This must be enabled for both the File: and the Data: namespace.
  • This will most probably clash with the filemime: keyword, because some content models look like/are MIME types too.

As a user, IMHO, files are handled quite differently from content pages like tabular and map data, and I would not want to conflate the two when searching.

@thiemowmde btw, file_mime is already indexed for File content (think you know this already)

I feel like I still have not made my point clear.

When I, as a user, am using the search interface, I have more than one way to specify the namespace I want to limit my request to: I can use prefix:Data:, use the boolean flags, or just start my query with Data:.

What I, as a user, do not need are two keywords that allow filtering by MIME type in one namespace, but stop working for no obvious reason the moment I limit my search to another namespace. This is entirely counter-intuitive and a huge source of confusion and frustration.

I would even go so far as to argue that filemime: should be ditched and merged with filetype:, because there is zero overlap in the strings the two accept. The frontend should have a single keyword, and the backend should figure out whether what the user provided is a MIME type in the Data: or File: namespace, or a file type that covers more than one MIME type. My prototype already does this (via JavaScript pre-parsing): https://de.wikipedia.org/wiki/Benutzer:TMg/advancedSearch.js

Quickly talked to @daniel:

  • What Cirrus stores internally should be in different fields. There is no disagreement on that.
  • What I'm talking about above is that we do not have to expose these fields individually to the user. The backend code can understand that, for example, filemime:application/vnd.geo+json is about a MIME type that is known to be in the Data: namespace, and magically use the proper Cirrus field, even if it is not called filemime internally.
  • I suggest the following algorithm for a unified filetype:… keyword that accepts everything. The order is "most specific" to "least specific". filemime:… can be an alias for filetype:… with this algorithm.
    1. Compare the user's input with a list of known MIME types in the relevant namespaces (only Data: and File: for a start). Either use filemime or filedata (or whatever this new field is called) internally, depending on the namespace.
    2. Compare the user's input with a list of known content types. Use the internal Cirrus field for the content type.
    3. Compare the user's input with a list of known file types (like "audio" and "video"). Use filetype internally.
  • A minor edge case is that pages in the File: namespace have two MIME types, one for the file and one for the file description page. We must make sure File: pages are not indexed by their application/x-wiki MIME type.
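
The lookup order suggested above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not CirrusSearch code: the lookup tables, the `content_model` field name, the `image/png` entry, and the function `resolve_filetype_keyword` are hypothetical, while `filemime`, `filedata`, and `filetype` are the field names mentioned in the list. The File: description page edge case is not handled here.

```python
from typing import Optional, Tuple

# Hypothetical lookup tables; real lists would come from configuration.
# Maps a known MIME type to the namespace it is known to live in.
KNOWN_MIME_TYPES = {
    "application/vnd.geo+json": "Data",
    "application/vnd.datapackage+json": "Data",
    "image/png": "File",  # assumed example entry for the File: namespace
}

# Content models named in this task.
KNOWN_CONTENT_MODELS = {"Tabular.JsonConfig", "Map.JsonConfig"}

# Coarse file types named in this task.
KNOWN_FILE_TYPES = {"audio", "video"}

# Internal field per namespace ("filedata" is the placeholder name from the list).
MIME_FIELD_BY_NAMESPACE = {"File": "filemime", "Data": "filedata"}


def resolve_filetype_keyword(value: str) -> Optional[Tuple[str, str]]:
    """Map user input to (internal field, normalized value), most specific first."""
    value = value.strip()
    lowered = value.lower()

    # Step 1: known MIME types, routed to a field by namespace.
    namespace = KNOWN_MIME_TYPES.get(lowered)
    if namespace is not None:
        return (MIME_FIELD_BY_NAMESPACE[namespace], lowered)

    # Step 2: known content models (matched exactly, case-sensitive).
    if value in KNOWN_CONTENT_MODELS:
        return ("content_model", value)

    # Step 3: known coarse file types.
    if lowered in KNOWN_FILE_TYPES:
        return ("filetype", lowered)

    # Unknown input: let the caller decide how to report it.
    return None
```

With these tables, `resolve_filetype_keyword("application/vnd.geo+json")` resolves to the Data: MIME field, while `resolve_filetype_keyword("video")` falls through to the coarse file type. Hard-coding the tables is exactly the special-casing concern raised in the replies, so a real implementation would need them to be configurable.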

The backend code can understand that, for example, filemime:application/vnd.geo+json is about a MIME type that is known to be in the Data: namespace, and magically use the proper Cirrus field, even if it is not called filemime internally.

This is technically possible but looks awfully like a Wikimedia-specific thing, which I don't like having in the code. So I wonder how we can make it flexible enough that we don't have to special-case each MIME type.

filemime:… can be an alias for filetype:… with this algorithm.

Don't like this: MIME type and MediaWiki file type are rather different things. I'd leave MIME alone. It's pretty clear what it is, and it should be used for this specific purpose, MIME types. The fact that it has a domain distinct from filetype doesn't mean the two have to be united.

daniel removed Yurik as the assignee of this task. Jan 26 2017, 6:56 PM
daniel added a subscriber: Yurik.

@Smalyshev can we just do T156371: ContentHandler should expose the content-model to search engines.? The indexing should be trivial. What keyword to use, and how, may need some more discussion. But filtering explicitly by content model would be useful.

Yes, adding a field with content model is very easy. I can do the patch.
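
For illustration only, the idea of a content model field can be sketched in a few lines of Python. This is not the actual patch: the document layout, the field name `content_model`, and both function names are assumptions made up for this example.

```python
from typing import Dict, List


def build_search_document(title: str, namespace: str, content_model: str, text: str) -> Dict[str, str]:
    """Build a hypothetical search document that always carries the content model."""
    return {
        "title": title,
        "namespace": namespace,
        "content_model": content_model,  # assumed field name
        "text": text,
    }


def filter_by_content_model(documents: List[Dict[str, str]], content_model: str) -> List[Dict[str, str]]:
    """Keep only documents whose content model matches exactly."""
    return [doc for doc in documents if doc["content_model"] == content_model]


# Hypothetical sample pages, one per content model mentioned in this task.
documents = [
    build_search_document("Data:Example.map", "Data", "Map.JsonConfig", "{}"),
    build_search_document("Data:Example.tab", "Data", "Tabular.JsonConfig", "{}"),
]
maps_only = filter_by_content_model(documents, "Map.JsonConfig")
```

Because the field is set for every page regardless of namespace, the same filter would work outside Data: as well, which is the more general behaviour asked for earlier in the thread.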

WMDE-leszek claimed this task.
WMDE-leszek triaged this task as Medium priority.
WMDE-leszek moved this task from Monitoring to Done on the Wikidata-Former-Sprint-Board board.
WMDE-leszek added a subscriber: WMDE-leszek.

@Jonas says T156371 is the same as this one.
According to him, the patches that implemented the feature in question are https://gerrit.wikimedia.org/r/#/c/334412/ and https://gerrit.wikimedia.org/r/334715.
Based on this information I am resolving this ticket.