Create and deploy an extension that implements an authenticated key-value store.
Open, High, Public

Description

Background

Our mobile apps (namely the Android app, for now) will need to synchronize certain data with the user's account on Wikipedia, so that the user's data will persist across different devices, and be accessible from different platforms. This includes various preferences that the user sets within the app, as well as more complex user-created structures like private reading lists, which are currently being developed client-side.

While it is possible to use userjs preferences for storing this information, it becomes impractical for the more complex data (such as reading lists), because all userjs options are transmitted with each pageview for a logged-in user, which would make pageview payloads inefficient for heavy users of these features.

Proposal

Implement a simple, private, per-user key-value storage API.

Each user will have their own keyspace, and the keyspace used will always be that of the currently authenticated user. There will be no access to other users' storage except by logging into the other user's account (e.g. in a separate session). This avoids one of the major complaints about Gather: because Gather lists were publicly visible, they required policing for policy violations, which the community was not inclined to perform.

The store will provide no "revision history" and no logging: when a value is updated or deleted, the old value is erased without possibility of recovery. Logging and history are needed only when a resource may be changed by multiple users or is publicly visible; neither is the case here, and omitting them significantly reduces the complexity of the implementation.

Operations supported on the store will minimally include get, set, add, and delete. Ideally, compare-and-swap (CAS) will be supported for modifications, and ideally batch operations (e.g. multiple gets or sets in one request) will be allowed.

Open questions

  • Should this be implemented as a MediaWiki action API endpoint or a RESTBase service?
    • As a MediaWiki action API endpoint, it would be available in all MediaWiki installations without further effort and could potentially reuse existing code for communicating with storage backends. @Anomie will likely write and maintain it in this case.
    • As a RESTBase service, it might be easier to integrate a backend that isn't already supported by MediaWiki, and the input format wouldn't necessarily be constrained to the equivalent of HTTP form posts. A developer willing to create and maintain it would need to be found.
  • What backend should be used to store the data?
    • If we go the action API route, the easy solution would be an SQL table, much like the existing user_properties table. On the other hand, with a little effort we could abstract the backend so that different storage solutions can be plugged in without rewriting everything; in that case, would it be best to use an existing abstraction such as BagOStuff, or to create a new one?
  • What limits should be placed on the implementation?
    • Key length? (For comparison, user_properties limits keys to 255 bytes.)
    • Value length? (For comparison, user_properties limits values to 65535 bytes.)
    • Total number of keys or total value size (per user)?
  • Should there be one store per wiki, or a global store? Or, in other words, should using the store require a centralized account?
  • Should expiration be supported?
  • Should enumeration of keys be supported? For example, "return all keys with prefix 'foo'".
  • Should non-string values be natively supported in some manner?
    • We recommend no. Clients may store non-string values in a serialized format (e.g. JSON), or they may use one key per value and an additional "index" key if necessary (see the sketch after this list).
  • Should "tagging" be natively supported in some manner?
    • We recommend no. Clients wanting tagging can implement it easily enough on top of the plain storage by using a key to store the list of keys having a particular tag (also covered in the sketch after this list).
  • Does anyone have ideas for preventing misuse (cf. Commons being used for illegal file sharing) besides setting a relatively low limit on total data per user?

Related Objects

Yurik removed a subscriber: Yurik. Mar 3 2016, 6:19 PM
Tgr added a comment. Mar 3 2016, 7:04 PM
  • Do we need expiry? Should there be a maximum expiry?

Also, do we need CAS operations?

  • Do we need the ability to list all keys present for a user?

Listing all keys for a given user in a given namespace (i.e. matching by key prefix) is probably more useful.

  • Should this really be an extension, or is "authenticated key-value store" a feature that belongs in core?

"Start as an extension, move to core if it turns out to be useful" is generally a good approach IMO.

Tgr added a comment. Mar 3 2016, 7:08 PM

I'm in agreement with your response to the comments on that thread.

I'm in agreement with my response too :) Just wanted to point out that this will require discussion (probably an RfC, which is a good idea anyway).

As for userjs, it's something of a hidden feature - if you don't know about it already, you probably won't find out about it from the API documentation because you don't know where to look. Not that I would expect any real benefit from security by obscurity.

brion added a subscriber: brion. Mar 9 2016, 9:14 PM

Key-value stores wouldn't seem to scale well for things like reading lists which might expand to hundreds/thousands/more entries. (We didn't think 30k+ watchlists would happen back in the day either but they are very much a thing.)

Is there a process issue with deploying database tables that's being worked around here?

Anomie added a comment. Mar 9 2016, 9:41 PM

Is there a process issue with deploying database tables that's being worked around here?

The use case is that the mobile apps want cloud storage of their bookmarks lists (and possibly other stuff in the future) so they can sync them between devices.

I believe their current plan is to store the "reading list" as a JSON blob under one key (or split across multiple keys if it exceeds whatever the DB size limit is), but I might be mistaken. $PLAN[-2] was to try to use Gather's API for their reading lists, but (1) Gather is going away, at least in the short term, and (2) it only supports titles on the local wiki, and they want cross-wiki title lists. $PLAN[-1] was to do the current plan using userjs user options since that already exists in the action API, until it was noticed that options are included in the JS for all web UI page views which would be major bloat.

Another potential plan would be for someone to make a service of some sort to more explicitly store whatever data they want to store, instead of doing it with JSON blobs in a generic key-value store.

Tgr added a comment. Mar 9 2016, 9:47 PM

Is there a process issue with deploying database tables that's being worked around here?

It's more of an issue of coming up with some temporary solution relatively quickly (mobile app plans were built on Gather, inadequate as it was, and are blocked on finding a replacement now), and trying to do it in a way that's useful to other people as well. Eventually saved pages would get their own API and DB, once we have a better idea of what that should look like.

brion added a comment. Mar 9 2016, 9:47 PM

Another potential plan would be for someone to make a service of some sort to more explicitly store whatever data they want to store, instead of doing it with JSON blobs in a generic key-value store.

That's what I'd tend to recommend for the sanity of future maintainers. Consider also the need to expose the same feature and data set through the web.

brion added a comment. Mar 9 2016, 9:53 PM

It's more of an issue of coming up with some temporary solution relatively quickly (mobile app plans were built on Gather, inadequate as it was, and are blocked on finding a replacement now), and trying to do it in a way that's useful to other people as well. Eventually saved pages would get their own API and DB, once we have a better idea of what that should look like.

Adding a backend and API for a generic cross-wiki key-value store versus a purpose-built cross-wiki table for a reading list probably involves about the same amount of boilerplate.

Just remember that if you start with the key-value store, you have to do all that work twice and implement another migration step on the client.

If it'll be useful for other things, then that may be an acceptable trade-off. Just keep in mind future maintenance, scalability, and cross-platform feature parity. :)

FWIW, MassMessage already has cross-wiki target lists stored in a structured JSON format that are editable as wiki pages: https://meta.wikimedia.org/wiki/MassMessage/Lists/Wikipedia_Library :)

RobLa-WMF moved this task from Inbox to Backlog on the TechCom-RFC board. Mar 10 2016, 12:19 AM
Tgr added a comment. Mar 10 2016, 6:42 AM

FWIW, MassMessage already has cross-wiki target lists stored in a structured JSON format that are editable as wiki pages: https://meta.wikimedia.org/wiki/MassMessage/Lists/Wikipedia_Library :)

That does not help with private reading lists though (unless you mean they should share some kind of interface).

Adding a backend and API for a generic cross-wiki key-value store versus a purpose-built cross-wiki table for a reading list probably involves about the same amount of boilerplate.

Just remember that if you start with the key-value store, you have to do all that work twice and implement another migration step on the client.

Indeed, using a key-value store is more work. On the other hand it remains useful even if plans change, which tends to happen around the WMF :) I don't have a strong opinion either way.

RobLa-WMF mentioned this in Unknown Object (Event). May 4 2016, 7:33 PM
RobLa-WMF triaged this task as Low priority. Jun 8 2016, 6:57 PM
RobLa-WMF added a subscriber: RobLa-WMF.

Belated priority update discussed in E187: RFC Meeting: triage meeting (2016-05-25, #wikimedia-office) (see log at P3179)

Anomie updated the task description. Jul 19 2016, 5:40 PM

Maybe related to T134811? At least it seems to be an instance of the same API (and we already did a BagOStuff implementation for it).

Maybe related to T134811? At least it seems to be an instance of the same API (and we already did a BagOStuff implementation for it).

That looks like a backend for storing stuff from PHP, while this is about a frontend in either the action API or RESTBase.

Usually the best and fastest way to store key-value pairs is Redis. I don't know about its security, though.

The easy solution would be an SQL table, much like the existing user_properties table

Please, whatever you use, do not use the main s* shards; this probably doesn't need 20 redundant copies, nor JOINing with core metadata. This is more like a parsercache, x1, or external storage: on its own separate, isolated backend. Also, please provide (when the feature matures) an estimate of the disk space needed to provision the required hardware.

brion added a comment. Jul 19 2016, 7:06 PM

The easy solution would be an SQL table, much like the existing user_properties table

Please, whatever you use, do not use the main s* shards; this probably doesn't need 20 redundant copies, nor JOINing with core metadata. This is more like a parsercache, x1, or external storage: on its own separate, isolated backend. Also, please provide (when the feature matures) an estimate of the disk space needed to provision the required hardware.

It's not like the parser cache, because the data must be retained and cannot be regenerated from source data.

It's not like external storage, because the data is not immutable and needs to be able to change.

It is indeed not like the watchlist, because the data won't be joined.

But I'm not sure we need to go inventing a new storage backend, do we? It sounds like the data will be relatively small and mutable, similar to user preference storage but with a larger limit and without requiring all data items to be packed into a single item.

An SQL table on the CentralAuth wiki sounds like an easy setup. @jcrespo, what's the situation with redundancy and space for SQL on CentralAuth stuff?

brion added a comment. Jul 19 2016, 7:08 PM

(E.g., this should presumably be per user account and *NOT* per wiki.)

But I'm not sure we need to go inventing a new storage backend, do we?

I wasn't suggesting you invent anything new. A table will certainly work (all of those are tables, BTW). I was suggesting not putting it on an s* shard if it doesn't need joins and the access pattern is very different; somewhere like x1 instead.

Immature features (I assume this is a new one) can lead to outages, so everything we can separate is a plus for high availability. E.g. if you put it next to CentralAuth, a heavy user misusing the service or a vulnerability could bring down login for all wikis, because it lives on the same server (something that has happened in the past). So that would be the worst place to put a new feature.

Regarding redundancy: if you put it on x1, it will be replicated 5 times (10 if you take into account the RAID 10); if you put it on s1, it will be replicated 20 times (40 if we take into account the RAID 10!): https://dbtree.wikimedia.org/ There is a cost that depends on the redundancy level: parsercaches have no redundancy (not even disk mirroring, for obvious reasons), x1 has more redundancy, and enwiki has high redundancy (mostly due to load, rather than high availability). That is the hidden cost I wanted to mention here; I was not suggesting setting up a new backend, just giving examples of different redundancy levels.

x1 would probably be ideal for this, but if you plan to store 1TB of data (that is why I asked), we could set up an "x2".

brion added a comment. Jul 19 2016, 7:29 PM

But I'm not sure we need to go inventing a new storage backend, do we?

I wasn't suggesting you invent anything new. A table will certainly work (all of those are tables, BTW). I was suggesting not putting it on an s* shard if it doesn't need joins and the access pattern is very different; somewhere like x1 instead.

Gotcha. Yeah, a secondary cluster is probably fine per the rest of your note. :)

Immature features (I assume this is a new one) can lead to outages, so everything we can separate is a plus for high availability. E.g. if you put it next to CentralAuth, a heavy user misusing the service or a vulnerability could bring down login for all wikis, because it lives on the same server (something that has happened in the past). So that would be the worst place to put a new feature.

*nod* It needs to be treated conceptually as part of CentralAuth, I think -- in that it's going to be global user data, not tied to a single wiki -- but the storage need not be directly connected to the auth data, and separateness is good.

Regarding redundancy: if you put it on x1, it will be replicated 5 times (10 if you take into account the RAID 10); if you put it on s1, it will be replicated 20 times (40 if we take into account the RAID 10!): https://dbtree.wikimedia.org/ There is a cost that depends on the redundancy level: parsercaches have no redundancy (not even disk mirroring, for obvious reasons), x1 has more redundancy, and enwiki has high redundancy (mostly due to load, rather than high availability). That is the hidden cost I wanted to mention here; I was not suggesting setting up a new backend, just giving examples of different redundancy levels.

x1 would probably be ideal for this, but if you plan to store 1TB of data (that is why I asked), we could set up an "x2".

Sounds like x1 would indeed be ideal then; some redundancy is needed (you wouldn't want to lose that data on purpose) but it's not our mission-critical content.

Data should be much less than 1TB from what I understand, so a separate cluster shouldn't be necessary.

Deskana added a subscriber: Deskana.

To clarify:

  • A core feature that needs to join against existing core data? Let's put it on the s* shards, divided by wiki.
  • An independent feature that does not need to join against existing data and is common to all wikis? Let's separate it from the s* shards (we have multiple options in that case). CentralAuth and meta both live on s7 with eswiki and other wikis; they are not physically separated!

So if this is an arbitrary k/v store, how is it going to handle things like page moves, deletions, etc.?

brion added a comment. Jul 19 2016, 8:02 PM

So if this is an arbitrary k/v store, how is it going to handle things like page moves, deletions, etc.?

That'd be a question specifically for a reading list implementation built on top of a k/v store; depending on what it's doing, it may or may not have to do anything in particular to handle them. The reading lists are meant for offline-capable usage on-device, and currently are simply not synchronized across devices; I believe the intent is to allow multiple devices (or a single device across a wipe/reinstall) to resynchronize their local reading lists from the remote one.

Handling whether a page got deleted or moved is up to the device-side implementation of the reading list feature. It might simply do nothing in particular; it might detect 'this page was deleted or moved' and respond somehow, such as by replacing the page or marking it as dead. But that's entirely a device-side action, as I understand it. Could perhaps be server-"accelerated" by doing a bunch of page status checks from the server-side copy of the list and returning a short list of 'pages needing handling', but I don't know if that's something that's planned or necessary.

I find this task difficult to follow.

MediaWiki core already has:

  • a per-user list in the form of Special:Watchlist
  • public page lists in the form of categories
  • per-user preferences that work with MediaWiki extensions and gadgets

Regarding user preferences, how do OAuth applications handle this?

What's needed here that MediaWiki core does not provide or could not be extended to provide? Can the task description be updated to better explain the specific use-cases envisioned here ("certain data" really isn't helpful)?

Some of the comments regarding requirements in this task seem shortsighted. Can you really say that the data won't ever need to be joined? It seems like joining against the page table to check for page existence for reading lists, for example, is an obvious use-case. Not having any ability to undo or track changes also seems foolish given that you're dealing with users and user input. (While watchlists are similar in not tracking changes or allowing removed entries to be easily re-added, this is more of a bug than a feature.)

RobLa-WMF raised the priority of this task from Low to Needs Triage. Jul 20 2016, 4:17 AM
RobLa-WMF moved this task from Backlog to Inbox on the TechCom-RFC board.

We originally set the priority and put this in the backlog before @Anomie wrote the description and sent his update to wikitech-l.

If the request here is to share mobile app configuration between devices, I would like to better understand why people feel that storing and managing outside/external client applications' preferences should be the purview or responsibility of MediaWiki.

In thinking a bit more about this task and as already alluded to in previous comments, if MediaWiki's involvement is needed here, it already has both authenticated and unauthenticated key–value stores. The revision/text/page/user database tables provide one version. The watchlist/page/user tables provide a second version. The categorylinks/category/page tables provide a third version. The user_properties/user tables provide a fourth version.

I think pinning Gather's failure on having public lists dramatically misses the point that both Gather's implementation and deployment were poorly managed. A better example to look at might be MassMessage and its use of ContentHandler. With the type of flexibility that I think is being sought here—support for both blobs of structured data and support for things that might be a bit more arbitrary—MediaWiki's page objects already provide this. As an added bonus, they also come with versioning, monitoring, content suppression capability, anti-abuse features, limited write restrictions, and even more limited read restrictions.

If the limited write and read restrictions are truly problematic, I'd much rather see these two very common feature requests properly addressed instead of trying to work around them by building yet another separately managed set of tables.

I can't help but think of https://www.mediawiki.org/wiki/Everything_is_a_wiki_page. While using wiki pages undeniably has its own set of challenges, it immediately answers open questions in this task about limiting key and value sizes (wiki pages have a maximum page.page_title length and a maximum page content size) or listing keys by prefix (wiki page titles already support listing by prefix due to the unique index we place on page.page_title). Wiki pages also support "tagging" via the categorization system. (The categorization system is another piece of MediaWiki infrastructure that could desperately use love instead of building out yet another feature to indefinitely support.)

brion added a comment. Jul 20 2016, 6:43 PM

User preferences are not good for this sort of usage as they're packed into a single blob that gets shipped around, etc. Here, we want a separate blob that doesn't get shipped around in other places and may grow arbitrarily large (though will usually be either not present for a given user, or relatively small).

Watchlist doesn't cover the case because it's not the watchlist, it's a separate list. (And I don't know how much other data folks may want to store.)

Pages are conceivable as a backing store, but our data management model makes pages public by default, whereas preferences and watchlists are not. It also introduces a revisioning model that is explicitly not asked for here. It also pushes the storage for the data into a combination of primary page/revision database and the primary ES text store, something data folks are asking to avoid.

As far as I know, tagging, categories, and sorting anything in any particular order are not requirements asked for here.

Could perhaps be server-"accelerated" by doing a bunch of page status checks from the server-side copy of the list and returning a short list of 'pages needing handling', but I don't know if that's something that's planned or necessary.

If it were something the apps teams wanted to do, they'd need to write an extension that hooks the appropriate events and does whatever it is they want done. It's well outside the scope of this RFC.

I find this task difficult to follow.

MediaWiki core already has:

  • a per-user list in the form of Special:Watchlist
  • public page lists in the form of categories

While "reading lists" is one of the reasons the app teams want this, this isn't an implementation of page lists.

  • per-user preferences that work with MediaWiki extensions and gadgets

This is discussed in the description. See the second paragraph under "Background".

Regarding user preferences, how do OAuth applications handle this?

If you mean the app preferences, that's outside the scope of MediaWiki and OAuth. This would be for something like the configuration of the font size in the mobile app.

If you're meaning permission for OAuth consumers to access the stored data, most likely there would be a new user right that would have to be granted for an OAuth consumer to be able to make use of the store.

Can you really say that the data won't ever need to be joined?

Yes.

It seems like joining against the page table to check for page existence for reading lists, for example, is an obvious use-case.

If they want to do that with their reading lists, then they would need to implement something else to store them. BTW, such a thing would be complicated by the fact that their "reading lists" plan is to have one list with titles from multiple wikis, so a join would be complicated in any case. That's all well outside the scope of this proposal.

Not having any ability to undo or track changes also seems foolish given that you're dealing with users and user input. (While watchlists are similar in not tracking changes or allowing removed entries to be easily re-added, this is more of a bug than a feature.)

User preferences don't have history or undo either. I can't think of anything that's user-private that does, actually.

RobLa-WMF triaged this task as High priority. Jul 20 2016, 8:23 PM
RobLa-WMF claimed this task.

We originally set the priority and put this in the backlog before @Anomie wrote the description and sent his update to wikitech-l. We're speaking about it in E234.

Tgr added a comment. Jul 20 2016, 8:27 PM

Watchlist doesn't cover the case because it's not the watchlist, it's a separate list. (And I don't know how much other data folks may want to store.)

Global watchlists (T126641: [RFC] Devise plan for a cross-wiki watchlist back-end) are close enough and abstracting them into something more generic that can store multiple lists would be another way to fulfill the reading list use case. Maybe worth discussing at E235: ArchCom RFC Meeting W29: Devise plan for a cross-wiki watchlist back-end (2016-07-20, #wikimedia-office)? It would probably be more complex (especially given the slightly different requirements for the different types of lists) than a simple key-value store, though.

brion added a comment. Jul 20 2016, 8:30 PM

One open question that I see coming up in quick discussions is the abuse question -- there are I think two main areas of potential abuse for single-user-access blob storage:

  1. denial of service (storing lots of data for the lulz)
  2. inappropriate file sharing (store copyrighted or illegal files as blobs, share an account)

I suspect both methods can already be abused somewhat with user prefs and other things, though an overly simplistic per-user blob store could increase size limits (not sure if/what limits are on prefs now offhand). In both cases, an abuse tool would need client-side code of some kind (JS running on site, or some client tool) since it's not directly HTTP-addressable.

Formatting requirements could make it harder to store arbitrary file blobs, but nothing makes it impossible. (Eg, store your evil data as base64 strings in a JSON structure.)

(not sure if/what limits are on prefs now offhand)

There's no limit on number of entries or total size, as far as I know.

How would everyone feel about discussing this in next week's ArchCom RFC office hour (E237)?

How would everyone feel about discussing this in next week's ArchCom RFC office hour (E237)?

Works for me.

User preferences are not good for this sort of usage as they're packed into a single blob that gets shipped around, etc. Here, we want a separate blob that doesn't get shipped around in other places and may grow arbitrarily large (though will usually be either not present for a given user, or relatively small).

We could fix/change this architecture. I filed T140858 specifically about reconsidering outputting every user option into the page HTML.

I find this task difficult to follow.

MediaWiki core already has:

  • a per-user list in the form of Special:Watchlist
  • public page lists in the form of categories

While "reading lists" is one of the reasons the app teams want this, this isn't an implementation of page lists.

Let's say that tomorrow you had this authenticated key–value store implemented, what would you use it for specifically? I get the feeling that every time someone asks for a use-case, there's weirdly a lot of hand-waving.

  • per-user preferences that work with MediaWiki extensions and gadgets

This is discussed in the description. See the second paragraph under "Background".

Quoting that paragraph:

While it is possible to use userjs preferences for storing this information, it becomes impractical for the more complex data (such as reading lists), because all userjs options are transmitted with each pageview for a logged-in user, which would make pageview payloads inefficient for heavy users of these features.

So we already have an authenticated per-user key–value store. Could you, for example, add a new key/prefix, similar to userjs, that simply gets omitted from the HTML?

Regarding user preferences, how do OAuth applications handle this?

If you mean the app preferences, that's outside the scope of MediaWiki and OAuth. This would be for something like the configuration of the font size in the mobile app.

You say that app preferences are outside the scope of MediaWiki... what do you want to use this authenticated key–value store for, exactly? I'm still lost.

Let's say that tomorrow you had this authenticated key–value store implemented, what would you use it for specifically?

What would I use it for? Nothing. I'm not involved with the mobile apps.

So we already have an authenticated per-user key–value store. Could you, for example, add a new key/prefix, similar to userjs, that simply gets omitted from the HTML?

Such a thing would be possible, but would make the code handling user options even more complex.

Regarding user preferences, how do OAuth applications handle this?

If you mean the app preferences, that's outside the scope of MediaWiki and OAuth. This would be for something like the configuration of the font size in the mobile app.

You say that app preferences are outside the scope of MediaWiki... what do you want to use this authenticated key–value store for, exactly? I'm still lost.

MediaWiki and the OAuth extension don't care how an app stores its user preferences, although this proposed API would be one way that an app could do so.

To clarify a few points, in anticipation of E237...

The Android app actually already uses userjs preferences for storing some of our user settings within the app. The only thing that prevented us from using userjs for saving reading lists is the fact that they're transmitted unconditionally with every pageview.

So, technically all we would need is something that's equivalent to user-js, except not sent with every pageview (sent only upon specific request). We wouldn't need it to support tagging / categories / sorting / expiration / etc. A lot of that can be done client-side, if necessary.

Either that, or...

If we single out reading lists as a use case, the "ultimate" way of saving them is to implement multiple watchlists in MediaWiki, allowing watchlists to be named and to be cross-wiki. But AFAIK this is still a long way off.

So, a good intermediate solution should balance being painless to implement in the backend with being flexible and general enough to support the needs of the apps (and other clients) in the short to possibly long term.

Tgr added a comment (edited). Jul 27 2016, 6:32 PM

We could fix/change this architecture. I filed T140858 specifically about reconsidering outputting every user option into the page HTML.

There are many problems with using user_properties:

  • they are loaded on every request, which would lead to memory and performance issues if they were used more heavily
  • they are output into the HTML on every request
  • there is no way to query them individually via the API (all options will be output into the API response)
  • the values are limited to 65535 bytes, which is actually not that much for storing something like article lists (page titles are limited to 255 bytes, so at most a few hundred article names fit into a single record, even with an efficient format rather than, e.g., JSON with metadata, which would otherwise be a much more convenient choice); a chunking workaround is sketched after this list
  • they are not cross-wiki (you can get around that by selecting a specific wiki and using that, like the mobile apps do with meta, but there are many problems with that: users not being logged in on that wiki, users not having an attached account on that wiki, having to build assumptions about Wikimedia's farm setup into supposedly generic software)
  • in general, mixing a key-value store (userjs-*) with an internal configuration store seems like questionable design.

In short, user_properties is aimed at storing a small amount of internal per-wiki configuration settings which are needed very often. The key-value store would be aimed at storing large amounts of global data which are needed infrequently. It could replace userjs-* keys, although those do not seem heavily used anyway.

(FWIW, enwiki users now have 58 different userjs-* keys, out of which 10 seem to be used by more than a handful of users. Mostly those seem to be related to WMF products, with the exception of Wikipedia:igloo. Other large wikis use them even less.)

Anomie updated the task description. Jul 27 2016, 8:50 PM
Anomie updated the task description.
daniel added a subscriber: daniel. Jul 27 2016, 9:40 PM

There is a proposed standard called "RemoteStorage" that seems to fit the bill pretty well: https://remotestorage.io/. It defines an OAuth-protected online storage API based on a REST interface. The idea is that clients should be able to choose where they store their data, and storage providers should all talk the same protocol. The spec seems to be pretty mature; see https://datatracker.ietf.org/doc/draft-dejong-remotestorage/. I'm not sure how mature the available implementations are, but even if we end up writing our own K/V storage interface, we should perhaps follow this spec. Or at least evaluate it.

GWicke added a subscriber: GWicke. Edited Jul 27 2016, 9:41 PM

Another point we should consider is the requirement for data format versioning & migration. While it is fine to handle all validation & migration in a single client, doing the same consistently across a large number of clients would be a challenge at best. Essentially, unconstrained blobs would make the data private to a single client. Even a single client like the app can run into problems. For example, an old version of the Android app might not have the code to gracefully deal with newer formats unless formats are only ever changed in backwards-compatible ways, and unknown information is carefully preserved.

All of this would be less of an issue if we handled schema validation & migration on the server. Setting up a key-value bucket for each use case along with a schema & documentation is pretty easy, and I think would provide a better balance of flexibility, stability & usability. The per-usecase separation would allow us to document & version each API, deprecate & drop APIs that are no longer needed, and set appropriate quotas & rate limits per use case.

Great IRC conversation today! I've posted the log in the event (E237). Gergo's concerns outlined in T128602#2499662 sound like dealbreakers for the user_properties approach. There's a more careful summarization of E237 that one of us should do, but it seems like either a new DB table, a RESTbase backend, or some sort of 3rd party storage system are the most viable alternatives we discussed (right?)

Here's the beginning of a detailed comment @jcrespo made on E237:

In E237#2836, @jcrespo wrote:

For the specific use case of reading lists (this is not for the general case, everything you mentioned regarding generic store solution still stands):

Jaime then proposes a schema and shows how we might use it. If I read this correctly, Jaime's comments seem to comport with what @tstarling suggested during the meeting:

21:28:19 <TimStarling> my vote is to just add a table
...
21:28:46 <TimStarling> avoid joins so that you can hack up some cross-server thing later if need be, query groups or something

Both of you seem to be suggesting that we avoid overengineering something to try solving the specific case of better watchlist management, correct?

Hopefully these two goals aren't mutually exclusive:

  1. Accelerating progress on our watchlist data architecture
  2. Providing mobile apps (and other client software) generic key-value storage for ease of developer prototyping of account-specific features
daniel added a comment (edited). Aug 3 2016, 8:21 PM

I would like to push a bit more on looking into the IETF RemoteStorage spec proposal. It seems to fit the bill quite well. I believe that RemoteStorage should be evaluated to answer two questions:

  • can we use an existing RemoteStorage implementation to cover the needs put forth by this RFC? (And if not, why not?)
  • If we can't use an existing implementation, should we implement the RemoteStorage protocol as specified? (And if not, why not?)

From skimming the spec, it seems like a good approach, and designed exactly for our use case. So we should at least consider going with a (semi-)well known spec.

I would like to push a bit more on looking into the IETF RemoteStorage spec proposal.

It's worth pointing out a couple of things about that spec:

  • It's an Internet Draft. Those are fairly easy to produce, and all have this disclaimer: Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
  • It doesn't appear to be the product of a working group. One would have to dig further to figure out what that means.
  • The version you're citing says "Expires: 15 December 2014". Fortunately, there appears to be a more up-to-date version of it available: https://datatracker.ietf.org/doc/draft-dejong-remotestorage/

My inclination is also to break our habit of attempting to improve upon crufty inventions like the wheel. That said, this particular spec doesn't appear to be ready (based on my cursory investigation).

There are probably some good lessons to be learned from the document, and from the discussions around it. It's probably worth at least a skim. Does it make sense to include this in the reference material for this RFC?

  • If we can't use an existing implementation, should we implement the RemoteStorage protocol as specified? (And if not, why not?)

If our own implementation is in the action API, "why not?" is because that draft's request/response paradigm (using HTTP verbs and statuses) doesn't match the action API's (which uses HTTP as a transport, with its own verbs and status-reporting mechanism) at all. OTOH, the draft's paradigm seems ideally suited to RESTBase as I understand it.

Tgr added a comment (edited). Sep 14 2016, 3:18 AM

I took some time to look into remoteStorage. The community behind the spec seems reasonably active: there are multiple people involved in maintaining the draft, and regular pull requests / wiki edits / forum posts; there are at least four independent server implementations, and lots of clients. (See github:remotestorage/spec:source.txt for the latest version of the spec, community.remotestorage.io and wiki.remotestorage.io for the community, and unhosted.org for the wider vision.)

There are two design choices that make it a poor match for our use case: they are aiming for more of a file server than a key-value store, and they are aiming for public stores (i.e. a store might be tied to a specific user, but not to a specific application or application provider). Which makes sense; TBH the point of trying to conform to a public data read/write standard (with capability discovery and a protocol for asking user permission and whatnot) for implementing our own data storage is entirely beyond me.

More specifically,

  • remoteStorage stores documents, not strings; that results in a bunch of requirements that we otherwise wouldn't need (the ability to store item metadata such as content-type, ability to handle chunked PUT requests, a rather complex versioning scheme)
  • remoteStorage implementations must be able to list the keys. This is implemented in a way that imitates a directory tree, which would force us to care about "directories" not growing too large; that would mean applications could not choose their keys freely and could be forced into some sort of hashed-subdirectory scheme.
  • Authentication is by CORS and OAuth 2.0 bearer tokens, or Kerberos. CORS is problematic if we ever want to use the K-V store in browsers, since IE9's XDomainRequest does not support REST verbs; we could avoid that problem by hosting on the same domain, but that would make it even more pointless than it already is. OAuth means we would have to host two OAuth servers, one for MediaWiki and one for the remoteStorage service, both with their own authorization dialogs, which seems super confusing. (Or I suppose we could set up another endpoint where a client can exchange a session ID for a bearer token... of course, that would mean not using any of the reference implementations.) Kerberos is only mentioned in passing as an alternative and I'm not familiar with it, so no idea how that would work.
Tgr added a comment. Sep 14 2016, 4:02 AM

I'll try to summarize the options and their status as I understand them:

  1. do not build a key-value store; instead, write a dedicated domain-specific API every time we need authenticated data storage (for the current apps need that resulted in this RfC, that would probably mean some sort of multiple-watchlists feature in core).
    • That seems like the ideal long-term solution. I mainly see the key-value store as a rapid prototyping tool that would allow us to quickly build client features that need data storage, and cheaply modify or discard them as our understanding of use cases changes. (Yes, the WMF shares the standard problem where prototypes are promoted to final products without any change out of laziness / lack of resources. That does not mean prototypes are a bad idea; not following up on them is.)
    • In the shorter term, I worry that we would build an API without a good understanding of what we need from it (we certainly did that with Gather, which had a somewhat similar scope). Also, there is some major work happening on watchlists already; trying to work in parallel on cross-wiki watchlists and multiple watchlists would be unlikely to work out well.
  2. build it as a new action API module.
    • This still seems like the reasonable thing to me. We could freeride on a lot of things that MediaWiki and the API already provide (authentication, DB handling with cross-wiki access, continuations/batching, the centralauthtoken API, etc.), so it would be fairly easy to do.
  3. add it to the options / userinfo APIs (the user_props table) with some hacks, e.g. don't embed options starting with a _ into the page HTML
    • still seems like a bad idea to me for the reasons expressed in T128602#2499662 (most fundamentally, it would be a big pile of hacks and I don't see what the advantage would be compared to doing it cleanly; it's doubtful that it would be significantly less work)
  4. build it as a RESTBase service.
    • This also looks like a reasonable option, although I don't know enough about RESTBase to really judge (we get a lot of things for free with the action API; not sure how much that'd be true for RESTBase). In any case, that would be out of scope for Reading Infrastructure; maybe MCS could do that.
  5. use some external tool, such as a server that implements remoteStorage.
    • remoteStorage seems to be a poor match per T128602#2635622; I don't think any other alternative came up.

The RfC was stalled on making a decision (and on reviewing the remoteStorage draft which is now done). What would be the process for moving it forward?

He7d3r added a subscriber: He7d3r. Oct 17 2016, 12:54 PM
dr0ptp4kt moved this task from Backlog to Next Up on the Reading-Admin board.

I see remoteStorage has some downsides. I guess it also depends on how much you'd like to give users the ability to choose their own server to store their data on, rather than 'just' putting it on a Wikimedia server.

If remoteStorage is still a contender, perhaps we could do something together in that area with Nextcloud. We'd like to give people the ability to store data in a location of their choice (be it a private Nextcloud server or one at a provider), and I can imagine a scenario where Wikimedia would run a Nextcloud instance to store this (and other?) data for users while enabling users to instead pick a server of their own choosing as the store. There are already some 5-10 million Nextcloud users (I'm counting its predecessor, ownCloud, here too), and we would be interested in having remoteStorage support.

Even without remoteStorage, with a more custom API, Nextcloud might be a solution for data storage: we have a rather nice app development interface which would perhaps make it easier to support various data storage needs. WebDAV, CalDAV, and CardDAV are already built in. And scaling Nextcloud is a pretty 'solved' problem; see https://docs.nextcloud.com/server/10/admin_manual/installation/deployment_recommendations.html - note that the numbers there assume every user connects 2-3 times per minute to the server (sync clients...), which, for an app that only queries for bookmarks on use, means you can probably scale about 100x more easily.

Just some food for thought ;-)

Aklapper removed RobLa-WMF as the assignee of this task. Nov 7 2016, 11:11 PM