
Provision search endpoint for SDC. Requirements from Product Team.
Open, Needs Triage, Public

Description

Eventually, we want one query service that contains both Commons and Wikidata data. According to Stas, "unless we get much stronger servers or can improve it by 3x-4x, it still might be an issue to update a single database from two sources like Wikidata & Commons - both are rather high traffic from what I understand".
Until we can improve the performance or get stronger servers, we want a separate Commons query service. (You should still be able to retrieve files using queries based on Commons and Wikidata statements.)

Existing requirements
Using SPARQL queries, be able to retrieve files based on any combination of existing statements in Commons or Wikidata, and present them the way WDQS does.

Example queries

  • Give me all featured/high-quality images of paintings in the Louvre by artists from the Netherlands
  • Give me a count of all images of monuments uploaded last year
  • Give me all images that depict the Wikidata Q-item for any sort of real-life hedgehog animal (but not Sonic the Hedgehog or the heraldic hedgehog icon)
  • Give me all images that depict a hedgehog with a license statement with the value "CC0"
  • Give me all images that are a digital representation of "Starry Night"
  • Give me all artwork with a 'date depicted [P2913]' value of July 4, 1776 but an 'inception [P571]' date before the year 1800, and show a timeline of those artworks (a rough query sketch follows this list)
  • Something for patrollers, depending on how Commonists want to structure patrolling statements.
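
To make the expected query shape concrete, here is a rough sketch of the 'date depicted / inception' example in Python, run against the existing WDQS endpoint since the Commons endpoint does not exist yet. Only the property IDs P2913 and P571 come from the list above; the endpoint choice, date handling, and result limit are illustrative assumptions and ignore Wikidata's date-precision subtleties.

```
# Sketch of "date depicted = 1776-07-04, inception before 1800".
# Runs against WDQS as a stand-in for the future Commons service.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?work ?workLabel ?inception WHERE {
  ?work wdt:P2913 ?dateDepicted .   # date depicted (P2913)
  ?work wdt:P571 ?inception .       # inception (P571)
  FILTER(?dateDepicted = "1776-07-04T00:00:00Z"^^xsd:dateTime)
  FILTER(YEAR(?inception) < 1800)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

# WDQS predefines the wdt:, xsd:, wikibase: and bd: prefixes, so the query
# can be sent as-is; a descriptive User-Agent is good practice for WMF endpoints.
endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="sdc-query-example/0.1 (illustrative sketch)")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row.get("workLabel", row["work"])["value"], row["inception"]["value"])
```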

Timeline
As soon as possible

Front facing functionality
It should look and behave exactly like WDQS, except with the Commons logo instead of the WD logo, and maybe a different "Run" button color.

Development environment
This one is for @Smalyshev.

Event Timeline

Nuria created this task. · Apr 26 2019, 3:10 AM
Nuria renamed this task from "Provision sparql endpoint for SDC. Requirements from Product Team." to "Provision search endpoint for SDC. Requirements from Product Team.". · Apr 29 2019, 7:56 PM
Abit updated the task description. · Apr 30 2019, 5:40 PM
Abit added a subscriber: Ramsey-WMF.
Abit updated the task description. · Apr 30 2019, 5:56 PM
Ramsey-WMF updated the task description. · Apr 30 2019, 6:32 PM
Abit updated the task description. · Apr 30 2019, 10:35 PM
Abit updated the task description. · Apr 30 2019, 10:43 PM
Abit added a subscriber: Abit.

Alright, @Smalyshev and @Ramsey-WMF, I've done my best at starting this ticket. Please adjust as needed :)

Abit added a subscriber: Gehel. · Apr 30 2019, 10:58 PM

Answering the questions @Gehel asked in an email:

  • It is still unclear to me if this new service should be purely an external service or if it will also be used as an internal endpoint (maybe with MediaWiki as a client?). Experience on WDQS shows that we want to keep internal synchronous traffic separated from external / asynchronous traffic, which means more clusters and more servers if we need to serve both use cases.

We want Commons and Wikidata users to be able to run queries on a site similar to query.wikidata.org. I don't think we need people to be able to run queries outside of the WMF projects, for now. (But we will probably want people to be able to do so when we have the ideal combined Wikidata + Commons service.) @Ramsey-WMF, please correct or elaborate :)

  • The current public WDQS endpoint (query.wikidata.org) is completely free to access. This is problematic for this kind of service. We should think about a solution to manage access to the service. The usual model we have for these kinds of services is to limit access to WMCS only, potentially with authentication linked to WMCS accounts. It is hard to add restrictions to a service after the fact, so we might want to have them from the start.

That works for me and @Ramsey-WMF, I think, for this initial iteration of the query service.

  • The volumetry is also unclear to me (dataset size, query load). Do we have any idea on those? (Yes, a rough guesstimate is enough).

It's really hard to say; we don't know what the demand will be yet. How about a guesstimate by ballpark analogy? If WDQS is AT&T Park, home of the San Francisco Giants, capacity 42,000 people, the Commons query service is Raley Field, home of the Sacramento River Cats, capacity 15,000 people.

Jheald added a subscriber: Jheald. · Apr 30 2019, 11:40 PM
dcausse added a subscriber: dcausse. · May 2 2019, 6:29 AM

The outcomes of the chat today seem to be:

  1. We're making a separate Commons query service, on separate servers (see the federation sketch after this list).
  2. The estimate for the data size: the number of items will eventually be the same as Wikidata's, but the item size will probably be substantially smaller. It may take some time to get there, but probably no longer than 1-2 years.
  3. The query load from power editors will probably be lower than on Wikidata, since the editor count is similar but most Commons editors are uploaders, not SDC editors (at least for now).
  4. The query load from individual consumers might be as high as Wikidata's, since more people may be looking for various images for their work; but the Wikidata load is dominated by bots now, so the query load is expected to be lower, at least initially.
  5. In the first stage of the work, we will set up a server on VPS as a prototype, but then we will have to migrate to a production setup, unless we can have some kind of VPS-on-real-hardware setup.
  6. The minimal production config is 6 servers, with specs close to the initial Wikidata servers. We can cut requirements (while simultaneously cutting the level of declared support/reliability, i.e. single cluster, less redundancy, etc.), but that's the minimum for full production.
  7. We will only have a public endpoint, no internal endpoint. We could make it require login, but that's just more work, so we're not doing it for now.
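
Since the Commons service will run separately from WDQS (point 1) but the requirement is still to query Commons and Wikidata statements together, the likely mechanism is SPARQL 1.1 federation via a SERVICE clause. A rough sketch of what such a query might look like from a client, with heavy caveats: the Commons endpoint URL is a placeholder (the service does not exist yet), and the RDF modelling of SDC statements (assumed here to be plain depicts [P180] triples pointing at Wikidata items) is not decided in this task; only the SERVICE mechanism and the Wikidata property IDs are standard.

```
# Federation sketch: a hypothetical Commons query service pulls extra item
# data from WDQS via a SPARQL 1.1 SERVICE clause. Endpoint URL and SDC
# modelling are placeholders, not the decided design.
from SPARQLWrapper import SPARQLWrapper, JSON

COMMONS_ENDPOINT = "https://commons-query.example.org/sparql"  # placeholder URL

QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?file ?creator WHERE {
  # Assumed modelling: Commons MediaInfo entities carry depicts (P180)
  # statements whose values are Wikidata items.
  ?file wdt:P180 ?item .

  # Federate out to WDQS for statements about the depicted item.
  SERVICE <https://query.wikidata.org/sparql> {
    ?item wdt:P170 ?creator .   # creator (P170)
  }
}
LIMIT 50
"""

endpoint = SPARQLWrapper(COMMONS_ENDPOINT, agent="sdc-federation-sketch/0.1")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()   # only meaningful once the endpoint exists
```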
Abbe98 added a subscriber: Abbe98. · May 14 2019, 11:23 AM