
Create a new endpoint which returns articles in need of a description
Closed, ResolvedPublic

Description

The first stage of the App Editor Tasks feature requires two API endpoints which return the following:

  • A random entity in Wikidata with a Wikipedia article but not a description in the requested language
  • A random entity in Wikidata with a description in language A but not language B (where Wikipedia articles corresponding to the entity exist in both languages)

The DB queries required to derive candidates meeting these criteria are far too heavy to run on end-user demand, so we'll need to pregenerate and cache results somewhere.

Event Timeline

bearND created this task.Oct 9 2018, 2:33 AM
Restricted Application added a subscriber: Aklapper.Oct 9 2018, 2:33 AM
bearND updated the task description. (Show Details)Oct 9 2018, 2:35 AM
Mholloway added a subscriber: Tgr.EditedOct 9 2018, 2:59 PM

I think we'd be better off building the functionality we need into the MediaWiki API to support this, even if we end up wrapping it in a RESTBase endpoint for the apps to consume. AIUI, WDQS is way too heavy/slow to support this use case; and the random article approach, while more promising at first blush, doesn't seem sustainable. I'd expect the number of articles without descriptions to decrease over time (and probably fairly quickly once this feature is launched), so that we'd need to be constantly increasing either the number of random pages requested of the MW API, or making more retries per end-user request to get an article without a description, or both.

As an example of how the MW API can handle this better, consider list=pageswithprop. The inverse of this, "pageswithoutprop," is exactly what we want, and shouldn't be too difficult to build. (That said, the implementation would be a bit more complex, since pageswithprop deals exclusively with pageprops, and page descriptions aren't pageprops (except in the case of local descriptions on enwiki), but the principle is similar.)

I suppose another alternative would be to periodically query WDQS to gather a large pool of description-less pages that could then be fed to the apps on demand, but I think we'd be better off just relying on the MW API.

+@Tgr for other MW API-based suggestions.

Mholloway updated the task description. (Show Details)Oct 9 2018, 3:01 PM
Tgr added a comment.Oct 9 2018, 8:31 PM

Will the API look for titles missing a description in some arbitrary user language, or in wiki content language? If there is no Wikidata description in the content language, but there is a local override, should that be taken into account?

Yes, local overrides should be taken into account (eventually); it's probably good enough for a prototype to skip them at first. I think the MW API should be able to indicate whether the description is central or local, and possibly filter on it, too.

I think this is only for the wiki content language for the first stage. For a later stage we'll also want something like given a pair of languages A and B, give me some articles which have a description in language A but not in B, so the user can translate the description.

Tgr added a comment.Oct 10 2018, 4:48 PM

Wikidata descriptions are stored in the central Wikibase server DB, and overrides are stored in the enwiki DB, so real-time filtering on records having neither is probably not possible (short of making MediaWiki duplicate that data somehow); cross-server queries are not entirely impossible in MariaDB, but they're probably not something we want to get into. So there would have to be some two-stage process that fills up a queue with prospective entries based on a Wikidata query, then filters it based on the enwiki query, then discards entries as they get returned by the API.

Alternatively, just make sure there aren't many items which have a local override but no central description (having some bot automatically copy the local overrides to Wikidata does not seem too dangerous) and do the whole thing on Wikidata. (And Commons for image captions, soonish, I suppose? From an API POV it's the same thing, entity descriptions.)

The Wikidata query would be something like:

    SELECT ips_site_page title, ips_item_id wikidata_id
    FROM wb_items_per_site items_on_wiki
    LEFT JOIN wb_terms items_with_description
      ON concat('<prefix>', ips_item_id) = term_full_entity_id
      AND term_language = '<lang>'
      AND term_type = 'description'
    WHERE ips_site_id = '<wikidb>'
      AND ips_site_page NOT LIKE 'Talk:%'
      AND ... term_row_id IS NULL

(This assumes that the API only returns items which have an article on the target wiki. For items without an article it's probably not possible to explain to users what they are supposed to describe. Even if the item has a label in the target language, that's not much to go on. It would be possible to translate descriptions from another language, but that seems like a different use case.) The EXPLAIN for that is

+------+-------------+------------------------+------+----------------------------------------+-----------------------+---------+------------------+----------+-------------------------+
| id   | select_type | table                  | type | possible_keys                          | key                   | key_len | ref              | rows     | Extra                   |
+------+-------------+------------------------+------+----------------------------------------+-----------------------+---------+------------------+----------+-------------------------+
|    1 | SIMPLE      | items_on_wiki          | ref  | wb_ips_item_site_page                  | wb_ips_item_site_page | 34      | const            | 16975016 | Using index condition   |
|    1 | SIMPLE      | items_with_description | ref  | term_full_entity,term_search_full,tmp1 | term_search_full      | 103     | const,func,const |        1 | Using where; Not exists |
+------+-------------+------------------------+------+----------------------------------------+-----------------------+---------+------------------+----------+-------------------------+

which is not really something that can be run on a user request. (Maybe there's a better plan but I doubt "items for which this other table does not have any row with type=description" is something that can be well-supported with indexes.) So I guess some queue would be needed anyway?
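The two-stage queue process sketched above can be illustrated with in-memory stand-ins for the two query results. This is only a sketch; the function names, entity IDs, and data structures here are hypothetical, not part of any existing codebase:

```python
def build_candidate_queue(wikidata_missing_desc, enwiki_local_overrides):
    """Stage 1: start from entities the Wikidata query says lack a central
    description. Stage 2: filter out those with a local override on enwiki."""
    return [qid for qid in wikidata_missing_desc
            if qid not in enwiki_local_overrides]

def serve_next(queue):
    """Discard entries as they get returned by the API."""
    return queue.pop(0) if queue else None

# Illustrative entity IDs only.
wikidata_missing_desc = ["Q1", "Q2", "Q3", "Q4"]
enwiki_local_overrides = {"Q2"}

queue = build_candidate_queue(wikidata_missing_desc, enwiki_local_overrides)
print(queue)             # entities lacking both central and local descriptions
print(serve_next(queue))
```

The point of the pregenerated queue is that both expensive queries run offline; the per-request work is just popping the head of the queue.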

Mholloway added a subscriber: phuedx.EditedNov 16 2018, 10:03 PM

I had a chat with @phuedx about this yesterday after the Audiences-Platform sync, and he pointed out that there are a couple of projects extant and running in production that already do something very close to this:

There's the Recommendation-API (https://meta.wikimedia.org/wiki/Recommendation_API), which does something so close to what's requested in the design docs (giving a list of articles that exist in lang A and not in lang B) that I'd guess it inspired this feature, and I see that it apparently live-queries WDQS in the course of processing a user request, so maybe that's not out of the question after all. As far as I can tell there's no edge caching or RESTBase storage of responses.

https://github.com/wikimedia/mediawiki-services-recommendation-api/blob/master/lib/translation.js#L196-L231

Example: https://es.wikipedia.org/api/rest_v1/data/recommendation/article/creation/translation/en/Detroit

There's also MediaWiki-extensions-GettingStarted (https://www.mediawiki.org/wiki/Extension:GettingStarted), the purpose of which is to provide suggested editing tasks for new editors. It fetches tasks from article lists maintained in Redis which are populated based on membership in certain configured categories (such as "All articles needing copy edit").

Example: https://en.wikipedia.org/w/api.php?action=query&list=gettingstartedgetpages&gsgptaskname=copyedit&gsgpcount=1

This could provide a good model for us to follow in the event we need to maintain a pool of suggested edit tasks.

I've stood up a test server at https://edit-actions.wmflabs.org for suggested description additions, implementing the same logic as @Dbrant's client-side code.

The endpoints are:

Get description for $lang: https://edit-actions.wmflabs.org/{lang}.wikipedia.org/v1/needs/description
Get term with description in $srcLang that needs description in $dstLang: https://edit-actions.wmflabs.org/{srcLang}.wikipedia.org/v1/needs/description/in/{dstLang}

Examples:

https://edit-actions.wmflabs.org/es.wikipedia.org/v1/needs/description
https://edit-actions.wmflabs.org/en.wikipedia.org/v1/needs/description/in/de

Mholloway raised the priority of this task from Normal to High.Jan 2 2019, 5:10 PM
LGoto added a subscriber: LGoto.Jan 2 2019, 5:21 PM
Mholloway added a comment.EditedJan 5 2019, 2:33 PM

As mentioned before, there are a few projects already out there implementing something rather similar to this. I took a closer look at the ones I know about.

Extension:GettingStarted

Status: Active (deployed in Wikimedia production)
Backend: Redis
Release status: stable
GettingStarted is a MediaWiki extension written to provide suggested editing tasks to newly registered editors. It includes both a frontend component providing several interfaces for suggesting editing tasks to users, and a backend component for generating, storing, and serving pages to suggest. It currently supports generating page suggestions via either configured page categories or the results of a MW API morelike query. Page suggestions generated from categories are stored as sets in Redis to be served to clients in a performant way. Additional page suggestion engines can be added by implementing the PageSuggester interface.

Extension:WikiGrok

Status: Archived
Backend: MySQL
Release status: unmaintained
WikiGrok is a MediaWiki extension written as an early exploration of microcontributions on mobile devices. Based on its configuration, it prompts the user to confirm already-existing claims in Wikidata about the article subject, or to confirm new claims generated from linked properties. Strategies for adding questions are modeled as "Campaigns" and added by subclassing the abstract Campaign class. WikiGrok was discontinued and the extension was archived following user testing.

Recommendation API

Status: Active (running in Wikimedia production)
Backend: n/a
Stability: unstable
The recommendation API is a service-template-node-based service written to support the GapFinder project. Its goal is to provide editors with personalized recommendations of editing tasks. It currently supports recommending articles for translation that exist in one language but not in another (based on the associated Wikidata item), or "missing" articles for addition in the specified language. To get translation recommendations, it requests a set of pages from the MediaWiki API (using either the provided seed title or the set of most-viewed articles), then performs a follow-up query to WDQS to filter those that already have an article in the target language.


After looking these over, I think building on the GettingStarted extension is a promising way forward. It provides a lot of the needed scaffolding, and has run in Wikimedia production for years. Furthermore, its concept of storing pregenerated page suggestion sets in Redis fits our needs well here, where the SQL queries needed to support the product requirements run far too long to perform on-demand.

WikiGrok is built for a rather different use case, and I am skeptical about the ability of the recommendation API/WDQS to handle the more complex queries needed here on-demand.

One argument against building this new functionality into GettingStarted is that the feature under construction here isn't aimed at new users, but rather at providing further contribution suggestions to established editors who have shown a certain level of interest and expertise. This could be addressed by breaking out GettingStarted's backend suggestion creation and storage component into a more generic standalone extension on which it would depend; we would use the new, more generic task suggestion extension directly here.

I'll outline a strawman proposal based on GettingStarted and Redis in detail in a comment to follow.

Thanks for documenting your findings, Michael 👍

Mholloway added a comment.EditedJan 7 2019, 5:01 PM

The first stage of the App Editor Tasks feature requires two API endpoints, which return the following:

  1. A random entity in Wikidata with a Wikipedia article but not a description in the requested language
  2. A random entity in Wikidata with a description in language A but not language B (where Wikipedia articles corresponding to the entity exist in both languages)

@Tgr's example query above returns the set of Wikidata entities with Wikipedia articles but not descriptions in a given language. Changing the final IS NULL to IS NOT NULL returns the set of Wikidata entities with both Wikipedia articles and descriptions in a given language. Based on my testing on analytics-store.eqiad.wmnet, these queries each regularly take ~15 minutes or more to run.

Running both of these queries with a maintenance script for each language of interest and storing the resulting entity IDs as sets in Redis (e.g., <lang>-has-desc and <lang>-no-desc) would allow us to support both of these requirements in O(1) time via SRANDMEMBER in the general case. (We could also build into the maintenance script whatever logic is necessary to use local descriptions rather than Wikidata descriptions for enwiki.)

For (1), we would simply call SRANDMEMBER on the set of entities/articles with no description in the specified language.

For (2), we would attempt to call SRANDMEMBER on a lazily-created set comprised of the intersection of <langA>-has-desc and <langB>-no-desc. If it doesn't exist yet, we create it with SINTERSTORE and retry. (On my laptop, generating this set from two large sets (es-has-desc and en-no-desc) takes approximately 0.5 seconds.)

Based on my testing, this reliably produces high-quality results.
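The SRANDMEMBER/SINTERSTORE scheme can be simulated with plain Python sets; in production the same operations would be issued to Redis, but the logic is identical. The key names follow the hypothetical <lang>-has-desc / <lang>-no-desc convention above, and the entity IDs are illustrative:

```python
import random

# Stand-in for Redis; each key maps to a set of entity IDs.
store = {
    "es-has-desc": {"Q42", "Q64", "Q90"},
    "de-no-desc": {"Q64", "Q90", "Q100"},
}

def srandmember(key):
    """Like Redis SRANDMEMBER: a random member of the stored set."""
    s = store.get(key)
    return random.choice(list(s)) if s else None

def sinterstore(dest, key_a, key_b):
    """Like Redis SINTERSTORE: persist the intersection as a new set."""
    store[dest] = store[key_a] & store[key_b]
    return len(store[dest])

# Case (2): description exists in es but is missing in de.
# The derived set is created lazily on the first request that needs it.
dest = "es-has-desc:de-no-desc"
if dest not in store:
    sinterstore(dest, "es-has-desc", "de-no-desc")
print(srandmember(dest))  # one of Q64 / Q90
```

Once the derived set exists, subsequent requests for the same language pair skip straight to the SRANDMEMBER call.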


Potential issues:

  • This will take a substantial amount of memory. For example, as of last Friday, es-has-desc contains 1,036,527 members and requires more than 42 MB to store in Redis. (en-has-desc (generated based on Wikidata) contains 5,776,537 members and requires ~296 MB to store in Redis, and could easily be of similar or even greater size when using local descriptions.) Between the initial sets and the lazily generated intersection sets, overall memory usage could easily reach several GB. I don't know if this is a problem in our prod Redis installation (and don't know who to ask about that). Central to the issue here is that large sets do not have the benefit of some of the optimizations that small sets and other data types do in Redis (see here under the heading "Redis Sets" for discussion). I'd welcome suggested alternate representations of the data in Redis that would support the required queries with similar performance but with better memory efficiency.
  • This requires deciding on a strategy for keeping the sets stored in Redis up to date.
    • The simplest way would be to regenerate the primary sets periodically (every 24h?) and delete any derived intersection sets involving the updated language upon regeneration. But if going this route, since any given item's state of having or not having a description may have changed since the set was generated, we would likely have to re-check suggested items against the DB immediately before serving.
    • Alternatively, after initial generation, we could update all relevant existing sets on the fly each time a Wikidata entity is edited. This would require determining how frequently Wikidata is edited and whether the resulting Redis write load from such an update strategy would be acceptable.
Mholloway updated the task description. (Show Details)Jan 7 2019, 5:30 PM
SBisson added a subscriber: SBisson.Jan 9 2019, 5:08 PM
Mholloway added a subscriber: Joe.

@Joe I am considering expanding the usage of Redis for storing pre-generated editor task suggestion sets via the GettingStarted extension. The plan is described in detail in T206504#4859812 above. A key difference is that the amount of data we plan to store is much larger than the extension stores at present. If the estimates in T158239#3223921 are still accurate, GettingStarted currently stores ~2,500 page IDs (~40KB); while with the proposed expansion we would be storing sets of Wikidata entity IDs with up to millions of members and requiring hundreds of MB to store for outliers on the large end. All told, I would expect to require between 5 and 10 GB as a rough estimate.

Will this level of usage be a problem in our current Redis setup? Are there alternatives we should be considering? I know there is a long-standing open task (T158239) to improve GettingStarted's data storage strategy, so maybe this is also a good time to revisit that. Thank you!

Restricted Application added a project: Growth-Team.Jan 10 2019, 8:08 PM

@Dbrant Do you happen to have statistics on the current rate of in-app description editing? That would be useful to know here, too.

Mholloway updated the task description. (Show Details)Jan 10 2019, 8:25 PM

Here is what we would like to receive, ideally, in these API responses:

  • For the endpoint that returns an article that's missing a description, the standard "summary" response is good, but it would also be nice to get the wikibase number of the article. It's true that Wikidata can look up an article by title, but having the Q-number would be more concise, and there are certain Wikidata APIs that can only accept Q-numbers. In fact (off topic) it might be a good idea to augment the current /page/summary endpoint with the wikibase number (if it's not there already?).
  • For the endpoint that returns an article that needs its description translated from A to B, it would be nice to get the summary in *both* languages A and B, which would include the current description in the source language A. (and the wikibase number as well)
  • If it's simple enough to return *multiple* candidates (for either endpoint) in a single response, that would be great. Or perhaps the ability to specify the number of candidates as part of the request URL.

The number of description edits made from the app is currently on the order of ~500 per day.

Mholloway added a comment.EditedJan 14 2019, 10:38 PM

@Dbrant I'm sort of surprised we haven't been providing the Wikibase ID in the summary all along, but it turns out we've only been including it in the mobile-sections lead. I've got a patch up to add it to the summary, too. https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mobileapps/+/484297/

I updated the testing API running at edit-actions.wmflabs.org in response to your suggestion above to include summaries for both langs A and B in that response. A new v2 version is added that changes the request semantics a bit to better match the other scenario, so that the request domain language is always the language in which the summary is needed, and the "description exists in lang B" piece can be specified with an optional srcLang query parameter. Summaries are then provided in the src and dst keys for the pages associated with the entity for both languages.

For example: https://edit-actions.wmflabs.org/es.wikipedia.org/v2/needs/description?srcLang=de

Without the srcLang parameter it just returns a summary for an article that needs a description in that language, the same as in v1, and the original v1 endpoints are also still available.

I held off on returning multiple candidates for now, since with the current algorithm we'd be returning a more or less random number of results per request, and I'm not sure how useful that is. I'll update for that when we've finalized the suggestion storage backend.

Mholloway updated the task description. (Show Details)Jan 14 2019, 10:39 PM
Tgr added a comment.Jan 15 2019, 6:40 AM

Maybe it would be worth coming up with an abstract specification for the queue API (it seems clear that we need some kind of queue API and can't always generate tasks in real time) and choosing, based on that, whether it should be driven by Redis or, say, Kafka or MySQL. Questions that specification should answer:

  • What use-case expectations does fetching a random element from the queue stand for? Ensuring no two users get the same data? Ensuring the same user does not get the same data again and again?
  • What language filter is a translation feature interested in? A specific source and target language? Any source and target within some set of languages that the user speaks? A specific target (the wiki the user wants to improve) but a set of potential sources?
  • What other task features can we anticipate that do not fit into the "different task types in different queues" model? (E.g. article topics?)

jcrespo added a comment.

I wonder why Redis. I understand the need for caching, but recently the x1 section was expanded to accommodate reading-list needs, and 10 GB is small compared to the reading lists and cx-translation (in-progress translation) needs, which involve roughly the same amount of data. I'm not against using other technology, but this looks very similar to the above-mentioned features, or to the pre-cached Special:* list pages. Redis has issues with cross-dc replication, and it is slowly being removed (jobqueue was, sessions next).

Mholloway added a comment.EditedJan 17 2019, 2:14 AM

@Tgr I don't really think we need to order our candidate sets as queues, though we could. We want to serve users with random members of pregenerated sets of candidate Wikibase entities. We don't want to serve a candidate that no longer meets the selection criteria. We'd prefer not to serve the same item to more than one user around the same time, lest their edits collide. We'd prefer not to serve the same item to the same user over and over.

Given the size of the pregenerated sets here (up to millions of members), if we are truly serving random members, I don't anticipate edit collisions or repeats being a serious problem. Removing items from the set upon serving them (or popping them off the queue) eliminates the possibility of serving a candidate item that's already been served, but also makes the served item unavailable even if the user ultimately decides not to make an edit. Leaving the item in the set reduces the risk of depleting our candidate sets unnecessarily but means we have to lazily check the item to be served for validity first, and try again with a new random item if necessary. And we'd likely have to do that in any case even if removing candidates on serving, since the candidate might have been edited since pregeneration anyway despite not having been served by this API. (FWIW, Redis provides functions for retrieving random members of stored sets both with and without removing the chosen member, so this particular consideration wouldn't really counsel either for or against using Redis in any case, except maybe in its favor insofar as it would make it really easy to switch strategies if we run into problems with one or the other.)

I suppose moving a served candidate to the back of a queue without removing it would also reduce the risk of quickly re-serving the same candidate to the same user or another in a short time; in any case, both the risk of this happening when serving random candidates from a large set, and the overhead of ordering so as to mitigate the risk, seem pretty small. It probably doesn't really matter.
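The "leave candidates in the set, but lazily re-check validity before serving" strategy discussed above can be sketched as a retry loop. This is only an illustration: still_valid stands in for the real DB check, and the names and IDs are made up:

```python
import random

def serve_candidate(candidates, still_valid, max_tries=10):
    """Pick random members, lazily validate, drop stale ones, retry."""
    for _ in range(max_tries):
        if not candidates:
            return None
        pick = random.choice(list(candidates))
        if still_valid(pick):
            return pick           # left in the set; may be served again later
        # Stale: e.g. a description was added since pregeneration.
        candidates.discard(pick)
    return None

pool = {"Q1", "Q2", "Q3"}
# Pretend Q2 acquired a description since the set was generated.
print(serve_candidate(pool, lambda qid: qid != "Q2"))
```

Since stale members are discarded as they are discovered, the set self-cleans between full regenerations.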

I am not confident in my ability to predict what the product folks are going to want tomorrow, but I would think that most of the scenarios you described could be supported with some simple PHP logic around a data store that provides candidates based on single languages X and Y, or simply handled on the client.

@Tgr @jcrespo I mainly looked at Redis because it's what GettingStarted is already using, and it turned out that it would also work really well for what we want to do here. I don't have any emotional attachment to Redis if we can achieve the same results with another tool, but the thing that makes it especially nice here, that I don't know how we'd replicate in (say) MySQL, is that Redis can very quickly calculate the intersection of two large sets, and persist the result as a new set. This allows us to lazily create derived sets for entities meeting criteria like "has a description defined in Spanish but not German," rather than having to pregenerate and store candidate sets for every possible language pair, which would mean pregenerating and storing tens of thousands of candidate sets (assuming the set of supported languages isn't restricted), most of which wouldn't actually be used.

JTannerWMF moved this task from Inbox to External on the Growth-Team board.Jan 17 2019, 9:10 PM

Change 485988 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/extensions/GettingStarted@master] Add maintenance script to populate DB tables with entity description info

https://gerrit.wikimedia.org/r/485988

Mholloway updated the task description. (Show Details)Jan 23 2019, 12:47 AM
Joe added a comment.Jan 23 2019, 6:40 AM

Just a general comment:

before we decide what storage engine we want to use for a particular piece of software, we should answer the following questions:

  • What are the latency requirements?
  • What is the data access model? (relational or key-value)
  • What are your persistence requirements for this data?
  • How is this data going to be accessed? Read-Write? Read-Only?
  • Does the data need to be written only when editing or via batch population - or rather be populated at runtime? (this is important for multi-dc support)
  • Does the data need to be consistent everywhere?
  • What is the total data size, the average record size, and the query rate you expect?

and then we can help you pick one that best fits your needs, based on your answers to these questions.

Also, given it's a MediaWiki extension, I guess we should focus more on the storage interface you use in MediaWiki (I guess it's MainObjectStash if you want to use "redis") as its backend can, and will, change.

Joe added a comment.Jan 23 2019, 6:46 AM

So, while I wait for responses to my above questions, I don't think Redis is the right answer to your needs, at least in the context of us moving to multi-dc.

Mholloway added a comment.

before we decide what storage engine we want to use for a particular piece of software, we should answer the following questions:

  • What are the latency requirements?

It's not unusual for the apps to wait up to several hundreds of milliseconds for a response from RESTBase endpoints serving MCS-generated content, and I think comparable latency here is acceptable. For example, as of now, the /feed/featured response to external requests over the past 24 h has had an average p50 latency of 401 ms, and an average p95 latency of 880 ms.

As of now, I am considering p50 latency of < 500 ms and p95 < 1 s to be the absolute minimum acceptable performance here, and aiming for something more like p50 < 250 ms and p95 < 500 ms (or better, of course).

(I should note that the total latency will also include whatever additional work is necessary to gather the additional info described in T206504#4873268; in the existing testing prototype that's done by fetching and returning RESTBase page summaries for the wiki articles associated with the candidate Wikidata entities, though providing the required info in the form of RESTBase page summaries isn't a requirement per se.)

  • What is the data access model? (relational or key-value)

I could imagine a design working with either model, depending on what's realistically available. Redis is fundamentally a key-value store, but provides data types (here, sets) that offer useful operations. The data could also be structured relationally, as is being discussed in the current patch set, or it could simply be stored somewhere as a (very) large set of simple keys with boolean values (though that's probably not the best option).

  • What are your persistence requirements for this data?

It doesn't need to survive catastrophic system failure, or even a service restart. It should not be evicted simply to free up resources for something else.

  • How is this data going to be accessed? Read-Write? Read-Only?
  • Does the data need to be written only when editing or via batch population - or rather be populated at runtime? (this is important for multi-dc support)

I envision this data being populated and then updated only periodically (perhaps daily, or even less frequently, depending partly on how long it takes the underlying queries to run.) At runtime, it would be read-only.

I had also thought about trying to actively update the stored data at runtime in response to Wikidata edits, but this would be more complicated and probably isn't necessary.

  • Does the data need to be consistent everywhere?

It should, but only eventually (not immediately).

  • What is the total data size, the average record size, and the query rate you expect?

The total data size and average record size depend on the answers above, so I'll have to defer until there's consensus on those. That being said, it looks like the data being stored will be some combination of Wikibase entity IDs (minus the 'Q' prefix) and booleans/tinyints.

As for the query rate, per T206504#4873276, around 500 Wikidata description edits are made per day in the Android app. This data will be fetched to populate a view shown only on demand to "experienced" editors, so I would expect the query rate to be the same or lower.

and then we can help you pick one that best fits your needs, based on your answers to these questions.
Also, given it's a MediaWiki extension, I guess we should focus more on the storage interface you use in MediaWiki (I guess it's MainObjectStash if you want to use "redis") as its backend can, and will, change.

I welcome all advice on translating concrete requirements to MediaWiki storage interfaces!

  • Does the data need to be consistent everywhere?

It should, but only eventually (not immediately).

Actually, upon further reflection, I think I'd go further and say that consistency across DCs is not a requirement here, so long as we are able to store and periodically update the data on a per-DC basis.

Also, given it's a MediaWiki extension, I guess we should focus more on the storage interface you use in MediaWiki (I guess it's MainObjectStash if you want to use "redis") as its backend can, and will, change.

MediaWikiServices::getInstance()->getMainObjectStash() looks like it comes closest among MW's object cache interfaces to what I'm looking for, based on its description in the ObjectCache class comments. But the existing use of Redis in MediaWiki-extensions-GettingStarted isn't as a generic object cache; it's relying on functionality unique to Redis, specifically the set data type and SRANDMEMBER function. I'd planned to do likewise, or alternatively to use its list data type and associated functions to implement a queue.
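To illustrate why a generic get/set stash is a poor fit here: with only opaque blob storage, picking a random member or popping a queue head means round-tripping and rewriting the whole collection client-side, whereas Redis performs these operations server-side. A rough sketch, with a plain dict standing in for a generic stash (all names here are illustrative, not MediaWiki APIs):

```python
import random

stash = {}  # stand-in for a generic get/set object stash

def stash_srandmember(key):
    """Generic stash: must fetch and deserialize the entire set client-side."""
    members = stash.get(key, set())
    return random.choice(list(members)) if members else None

def stash_pop_queue(key):
    """Emulating a queue pop costs a full read-modify-write cycle."""
    queue = stash.get(key, [])
    if not queue:
        return None
    head, rest = queue[0], queue[1:]
    stash[key] = rest  # not atomic: concurrent readers could pop the same item
    return head

stash["en-no-desc"] = {"Q7", "Q8"}
stash["en-no-desc-queue"] = ["Q7", "Q8"]
print(stash_srandmember("en-no-desc"))
print(stash_pop_queue("en-no-desc-queue"))  # "Q7"
```

With million-member sets, the full-fetch cost and the lack of atomicity are exactly what Redis's native set and list commands avoid.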

Change 485988 abandoned by Mholloway:
Add maintenance script to populate DB tables with entity description info

Reason:
I'm going to start a fresh patch set coding up what you've outlined in your Jan 28 5:54 PM comment.

(BTW, I think the article_exists column is redundant since it should always be true for every item in the table, so I'll leave it out.)

https://gerrit.wikimedia.org/r/485988

Mholloway updated the task description. (Show Details)Feb 20 2019, 10:38 PM
Mholloway closed this task as Resolved.Mar 23 2019, 10:27 PM
Mholloway added a comment.EditedMay 13 2019, 6:39 PM

Redis has issues with cross-dc replication, and it is slowly being removed (jobqueue was, sessions next).

I'm confused about the status of Redis in our infrastructure. Just to be clear, is the plan to (eventually) phase it out completely? I'd emphasize that not all current usages of Redis are as a generic key-value store. As noted in T158239#3223921, the GettingStarted extension is using (extremely useful) Redis-specific set functionality for its edit suggestion engine. It will have to undergo significant changes to move to MySQL or a generic object cache interface. I'd planned to do the same or similar to what GettingStarted is doing here, which was my reason for preferring Redis.

IMO it would be a shame to lose access to Redis completely just because it doesn't happen to be suitable for some of the bigger previous use cases like the job queue and session storage. I understand from T212129 that current users of Redis (via getMainObjectStash or otherwise) may have mistaken assumptions about DC setup and the persistence guarantees (or lack thereof) in effect, but that seems solvable.

Also, my understanding is that Redis' cross-DC replication issues could be solved with dynomite, and indeed I see a (recently added) reference to such a setup in the class comments to WANObjectCache.php, so now I'm doubly confused about what's planned for Redis.

To turn this around: given our current infrastructure, if I want to maintain an in-memory queue of Wikibase item IDs in a way that's compatible with an active-active DC setup, what options do I have?

Joe added a subscriber: daniel.May 14 2019, 6:12 AM

Redis has issues with cross-dc replication, and it is slowly being removed (jobqueue was, sessions next).

I'm confused about the status of Redis in our infrastructure. Just to be clear, is the plan to (eventually) phase it out completely? I'd emphasize that not all current usages of Redis are as a generic key-value store. As noted in T158239#3223921, the GettingStarted extension is using (extremely useful) Redis-specific set functionality for its edit suggestion engine. It will have to undergo significant changes to move to MySQL or a generic object cache interface. I'd planned to do the same or similar to what GettingStarted is doing here, which was my reason for preferring Redis.

No, we're not dismissing Redis. But it can and shall only be used as a dc-local cache, not as either a persistent datastore or expecting cross-dc replication. In the case of GettingStarted, direct access to Redis without the use of a MW storage interface is kind of the problem.

IMO it would be a shame to lose access to Redis completely just because it doesn't happen to be suitable for some of the bigger previous use cases like the job queue and session storage. I understand from T212129 that current users of Redis (via getMainObjectStash or otherwise) may have mistaken assumptions about DC setup and the persistence guarantees (or lack thereof) in effect, but that seems solvable.

We will probably be moving all those use cases to a multi-dc-aware k-v storage, with latencies at least an order of magnitude higher than Redis's. But if we need to read and write from both datacenters, I don't see alternatives. It should also be noted that those usages went via RedisBagOStuff, which was basically a k-v interface.

Also, my understanding is that Redis' cross-DC replication issues could be solved with dynomite, and indeed I see a (recently added) reference to such a setup in the class comments to WANObjectCache.php, so now I'm doubly confused about what's planned for Redis.

Dynomite is a 100-pound gorilla in terms of complexity and the operational understanding it requires. I did test it for memcached replication, and that function is completely broken. Frankly, I don't see the effort of making Dynomite work as worthwhile:

  • If you want a fast, in-memory, replicated storage we have memcached. It doesn't have the fancy data types redis has, but it's hardly impossible to use it for the same purpose.
  • If you want a less fast, reliable, eventually consistent multi-dc storage, you have cassandra via kask

so I'm not sure the burden of maintaining the N-th storage system would be justified.
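The first option above (plain memcached, no set data type) can indeed cover the random-pick use case with some bookkeeping. A hedged sketch, with a plain dict standing in for a memcached client: candidates are stored under numbered keys next to a counter key, and a random index is drawn on read. The key names are made up, and a real version would use memcached's atomic incr for the counter and would have to tolerate eviction of individual slots.

```python
import random

cache = {}  # stand-in for a memcached client's get/set

def push_candidate(prefix, item):
    """Append under an indexed key. memcached has no set type, so we keep
    a counter key plus numbered slots (a real version would use atomic
    incr on the counter to avoid races)."""
    n = cache.get(f"{prefix}:count", 0)
    cache[f"{prefix}:{n}"] = item
    cache[f"{prefix}:count"] = n + 1

def random_candidate(prefix):
    """Pick a random slot, roughly analogous to Redis SRANDMEMBER.
    Returns None if the pool is empty (or the slot was evicted)."""
    n = cache.get(f"{prefix}:count", 0)
    if n == 0:
        return None
    return cache.get(f"{prefix}:{random.randrange(n)}")
```

This illustrates "hardly impossible": the same behavior is achievable, just not for free.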

To turn this around: given our current infrastructure, if I want to maintain an in-memory queue of Wikibase item IDs in a way that's compatible with an active-active DC setup, what options do I have?

If you don't need cross-dc consistency or replication, Redis is perfectly OK as a tool. I'm just not very happy about the exception GettingStarted represents (using a direct connection rather than an existing MW interface).
Frankly, we're trying to unify the storage interfaces in MediaWiki and its extensions, and repeating GettingStarted's non-compliant behaviour would take us in the opposite direction.

I think it could be argued MW needs a Redis-specific storage interface, or model what we need into an interface that can have a redis and/or a db backend. @daniel might have opinions on this as well.

If you want a fast, in-memory, replicated storage we have memcached. It doesn't have the fancy data types redis has, but it's hardly impossible to use it for the same purpose.
If you want a less fast, reliable, eventually consistent multi-dc storage, you have cassandra via kask

I'd argue there is a third option, memcache + MySQL (replicated), which is what is used for the parser cache (so already deployed and standardized): memcache provides fast local access, and MySQL provides persisted, replicated storage.
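The memcache + MySQL layering described here is the classic cache-aside pattern. A minimal sketch, with dicts standing in for a dc-local memcached and a replicated MySQL table (the function names and the "description" payload are illustrative only):

```python
cache = {}  # stand-in for dc-local memcached
db = {}     # stand-in for a replicated MySQL table

def get_description(item_id):
    """Cache-aside read: try the local cache first, fall back to the
    replicated DB, and repopulate the cache on a miss (the same shape
    as the parser cache's memcached + SQL layering)."""
    if item_id in cache:
        return cache[item_id]
    value = db.get(item_id)
    if value is not None:
        cache[item_id] = value
    return value

def put_description(item_id, value):
    """Write through to the DB (source of truth); update the local
    cache. Remote DCs pick the value up via DB replication on their
    next cache miss."""
    db[item_id] = value
    cache[item_id] = value
```

The DB remains the source of truth, so each DC's cache can be treated as purely local and disposable.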

Thanks for the info, @Joe and @jcrespo. That makes sense about dynomite, I just wasn't sure if it had been considered. I think my next step is to see what I can do with ObjectCache::getMainStashInstance() and a little bit of bookkeeping.
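The "little bit of bookkeeping" needed to get a FIFO queue out of a generic key-value stash can be done with head/tail counters. A hedged sketch (the class and key names are hypothetical; a real version over something BagOStuff-like would need TTL handling and atomic increments for concurrent access):

```python
class StashQueue:
    """Sketch of a FIFO queue over a generic key-value stash, using
    head/tail counter keys as bookkeeping. Illustrative only: ignores
    expiry and concurrency, which a MainObjectStash-backed version
    would have to handle (e.g. via atomic incr())."""

    def __init__(self, stash, prefix):
        self.stash = stash    # dict stand-in for the k-v stash
        self.prefix = prefix  # namespace for this queue's keys

    def _get(self, key, default=0):
        return self.stash.get(f"{self.prefix}:{key}", default)

    def push(self, item):
        """Write the item under the tail index, then advance the tail."""
        tail = self._get("tail")
        self.stash[f"{self.prefix}:item:{tail}"] = item
        self.stash[f"{self.prefix}:tail"] = tail + 1

    def pop(self):
        """Read and delete the item at the head index, then advance it."""
        head, tail = self._get("head"), self._get("tail")
        if head >= tail:
            return None  # queue empty
        item = self.stash.pop(f"{self.prefix}:item:{head}")
        self.stash[f"{self.prefix}:head"] = head + 1
        return item

stash = {}
q = StashQueue(stash, "descless")
q.push("Q42")
q.push("Q64")
```

Nothing here needs Redis-specific data types, which is what makes the approach portable across stash backends.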