
Provision search endpoint for SDC. Requirements from Product Team.
Closed, ResolvedPublic

Assigned To
None
Authored By
Nuria
Apr 26 2019, 3:10 AM

Description

Eventually, we want one query service that contains both Commons and Wikidata data. According to Stas, " unless we get much stronger servers or can improve it by 3x-4x it still might be an issue to update single database from two sources like wikidata & commons - both are rather high traffic from what I understand".
Until we can improve the performance or get stronger servers, we want a separate Commons query service. (You should still be able to retrieve files using queries based on Commons and Wikidata statements.)

Existing requirements
Using SPARQL queries, be able to retrieve files based on any combination of existing statements in Commons or Wikidata, and present them like the WDQS does.

Example queries

  • Give me all featured/high-quality images of paintings in the Louvre by artists from the Netherlands
  • Give me a count of all images of monuments uploaded last year.
  • Give me all images that depict the Wikidata item for any sort of real-life hedgehog animal (but not Sonic the Hedgehog or the heraldic hedgehog icon).
  • Give me all images that depict a hedgehog and have a license statement with the value "CC0".
  • Give me all images that are a digital representation of "Starry Night".
  • Give me all artwork with a 'date depicted [P2913]' value of July 4, 1776 but an 'inception [P571]' date before the year 1800, and show a timeline of those artworks.
  • Something for patrollers, depending on how Commonists want to structure patrolling statements.
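As a rough illustration, the hedgehog + CC0 example above might look like the following on a WDQS-style endpoint. This is a sketch only: it assumes the Commons service reuses WDQS's built-in prefixes (`wdt:`, `wd:`), and the Q-ids are illustrative placeholders, not verified item IDs.

```sparql
# Sketch assuming WDQS-style built-in prefixes (wdt:, wd:).
# P180 = depicts, P275 = copyright license; both Q-ids below are illustrative.
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q25407 .    # depicts: hedgehog (Q-id illustrative)
  ?file wdt:P275 wd:Q6938433 .  # license: CC0 (Q-id illustrative)
}
```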

Timeline
As soon as possible

Front facing functionality
It should look and behave exactly like WDQS except with the Commons logo instead of the WD logo, maybe with a different "Run" button color.

Development environment
This one is for @Smalyshev.

Related Objects

Event Timeline

Nuria renamed this task from Provision sparql endpoint for SDC. Requirements from Product Team. to Provision search endpoint for SDC. Requirements from Product Team. (Apr 29 2019, 7:56 PM)
Abit subscribed.

Alright, @Smalyshev and @Ramsey-WMF , I've done my best at starting this ticket. Please adjust as needed :)

Answering the questions @Gehel asked in an email:

  • It is still unclear to me if this new service should be purely an external service or if it will also be used as an internal endpoint (maybe with MediaWiki as a client?). Experience on WDQS shows that we want to keep internal synchronous traffic separated from external / asynchronous traffic, which means more clusters and more servers if we need to serve both use cases.

We want Commons and Wikidata users to be able to run queries on a site similar to query.wikidata.org. I don't think that we need people to be able to run queries outside of the WMF projects, for now. (But we will probably want people to be able to do so when we have the ideal combined Wikidata + Commons service.) @Ramsey-WMF, please correct or elaborate :)

  • The current public WDQS endpoint (query.wikidata.org) is completely free to access. This is problematic for this kind of service. We should think about a solution to manage access to the service. The usual model we have for these kinds of services is to limit access to WMCS only, potentially with authentication linked to WMCS accounts. It is hard to add restrictions to a service after the fact, so we might want to have them from the start.

That works for me and @Ramsey-WMF, I think, for this initial iteration of the query service.

  • The expected volume is also unclear to me (dataset size, query load). Do we have any idea about those? (Yes, a rough guesstimate is enough.)

It's really hard to say; we don't know what the demand will be yet. How about a guesstimate by ballpark analogy? If WDQS is AT&T Park, home of the San Francisco Giants, capacity 42,000, then the Commons query service is Raley Field, home of the Sacramento River Cats, capacity 15,000.

The outcomes of the chat today seem to be:

  1. We're making a separate Commons query service, on separate servers.
  2. The estimate for the data size: the number of items will eventually be the same as Wikidata's, but the item size will probably be substantially smaller. It may take some time to get there, but probably no longer than 1-2 years.
  3. The query load from power editors would probably be lower than on Wikidata, since the editor count is similar but most Commons editors are uploaders, not SDC editors (at least for now).
  4. The query load from individual consumers might be as high as Wikidata's, since more people may be looking for various images for their work; but Wikidata's load is dominated by bots now, so the query load is expected to be lower, at least initially.
  5. In the first stage of the work, we will set up a server on VPS as a prototype, but then we will have to migrate to a production setup, unless we can have some kind of VPS-on-real-hardware setup.
  6. The minimal production config is 6 servers, with specs close to the initial Wikidata servers. We can cut requirements (while simultaneously cutting the level of declared support/reliability - i.e. single cluster, less redundancy, etc.), but that's the minimum for full production.
  7. We will only have a public endpoint, no internal endpoint. We could make it require login, but that's just more work, so we're not doing it for now.

Some of the use cases described here are already supported by search (wbstatement keywords, etc...). We are not going to work on a new SPARQL endpoint before we have a scaling strategy for the current WDQS. It looks like the remaining use cases described here might be better served not by a SPARQL endpoint, but by a more specific service.

Not providing the promised SPARQL endpoint for Structured Data on Commons is effectively pulling the plug on the whole project. You would effectively kill it, because the volunteers who are currently investing quite a bit of time in getting the data in would walk away. What is the point of converting data if I can't get it out? How do I make the maps, the timelines, the bubble graphs, the recently added photos in my area, and countless other things? The SPARQL endpoint offers at least two things the current search doesn't have:

  1. Going through the tree. With search I can only do one level (P180->something), not multiple (P180+->something).
  2. Table based and visual representations. The SPARQL endpoint at https://query.wikidata.org/ offers a whole range of ways to output the data.
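The first point, traversing the tree, is what SPARQL property paths provide. A sketch of that on a WDQS-style endpoint (built-in `wdt:`/`wd:` prefixes assumed; the item ID is an illustrative placeholder):

```sparql
# "Going through the tree": follow depicts (P180), then walk up
# subclass-of (P279) any number of levels with a property path.
SELECT ?file WHERE {
  ?file wdt:P180 ?depicted .        # file depicts some item
  ?depicted wdt:P279* wd:Q729 .     # item is (a subclass of) "animal" (Q-id illustrative)
}
```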

What's the timeline for this? This task is almost a year old with state "as soon as possible".

Not providing the promised SPARQL endpoint for Structured data on Commons is effectively pulling the plug on the whole project.

We will still build a tool to get out Commons data, one that lets you navigate the ontology tree and have similar output formats.  Fixing WDQS before it busts is higher priority, so that is what the Search Platform team is working on right now.  In addition, the Commons Query Service is currently blocked because we don't want to build a new query service until we know what the solution for the old one is.

So that basically puts the whole project on hold. For how long? A week? A month? A year? Maybe even longer? As pointed out by several people at https://commons.wikimedia.org/wiki/Commons_talk:Structured_data#On_SPARQL , the Wikidata query service has grown to a completely different scale. You are currently completely blocking something because of something that might happen in the future. Stas already showed with https://sdcquery.wmflabs.org/ that it's easy to set up an initial version.

This is all about priorities. I'm not very happy about the fact that resources are being spent on toys like https://commons.wikimedia.org/wiki/Special:SuggestedTags#popular that get a lot of community backlash, but not on things like this. We have a relatively small group of volunteers working on this now. We can't keep telling them that it's all going to be great if major parts of the project, like the SPARQL service, are put on hold indefinitely.

The GLAMs that I interact with also ask very impatiently for a SPARQL endpoint, as a means for themselves and for supporting Wikimedians to check and maintain their own collections on Commons. WDQS is a crucial component of refined batch maintenance and editing tools as well (think PetScan, TABernacle), and it powers structured data-driven lists and galleries, which have been quite important for several SDC GLAM pilots.

Engineering a completely new search facility for Commons Data rather than using SPARQL is a *stupid* *waste* *of* *time* *and* *resources*.

Also, it will be very challenging to come up with a solution that can handle trees, qualifiers, and combinations, as well as extracting data from Wikidata -- the whole design of SDC rests on items that have their properties in Wikidata statements.

Even the most basic search -- show me items with tags that match this *class* of items on Wikidata -- requires incorporating information from Wikidata about which items have that relation: ?item wdt:P31 ?class

SPARQL gives you this for free, through federation. It's also a beautifully clear language. We have an outstanding UI for it, ready to go, out of the box. And it's what the entire community knows and uses on a daily basis, so there'd be no muddle between two systems.
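A federated query of that shape might look like the following sketch (WDQS-style built-in prefixes assumed; the SERVICE URL is the existing public Wikidata endpoint):

```sparql
# Sketch: a Commons endpoint delegating the class check to Wikidata via SERVICE.
SELECT ?file ?class WHERE {
  ?file wdt:P180 ?item .                        # file depicts some item
  SERVICE <https://query.wikidata.org/sparql> {
    ?item wdt:P31 ?class .                      # the item's class, fetched from Wikidata
  }
}
```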

WDQS handles upwards of 5 million queries a day. In contrast, CQS is going to be used by a handful of power users who desperately need it to iteratively evolve and develop the data modelling -- which doesn't happen at the moment, and has stalled the entire project, because right now we are completely blind without a CQS. That's why the need for CQS is *critical*, even though it may only have a few tens of users. The query-load problems WDQS faces *simply* *do* *not* *apply*. CQS is not a system for end users. It is not what is going to power end-user applications, and it does not face the same capacity issues. The updating bottleneck is *not* a priority issue for CQS.

But CQS is *critically* needed for the community to be able to see what they are doing, to grow the data model, and to prototype the sort of cool searches and refinements to show off what the data is capable of.

See also comments building up in thread at https://commons.wikimedia.org/wiki/Commons_talk:Structured_data#On_SPARQL

Multichill calls it exactly right:
Not providing the promised SPARQL endpoint for Structured data on Commons is effectively pulling the plug on the whole thing.

[SELF-EDIT: EXPRESSION OF INCREDULITY AND SOME DISPLEASURE THAT WAS HERE, REMOVED]

We needed CQS last year, not this year. It's *far* more important than anything in the GUI. It is an absolute critical priority. If you can't deliver, then hand the ticket back to WMDE and pay them to do it.

I would like to remind everyone in this discussion that we have a Code of Conduct.

I agree with earlier posts and would say I have zero motivation to contribute to SDoC until such a query service is available to me: working on paintings, I find Wikidata workflows much easier to monitor, and lacking that for Commons, I can't make any overviews or execute comparative checks.

I'm aware of how old this ticket is and the pain it's caused trying to move it forward. There seems to be some miscommunication going on here and it's escalating, I'm going to work with some folks on clarifying intentions.

That would be great. I think the frustration is due to expectations management. I guess we were imagining that we would just run the same code as Wikidata, so that once turned on, SDC would have a GUI similar to Wikidata's, with the same capabilities: SPARQL interface, constraint verification, support for all the basic datatypes, a Lua interface, etc. That should have been the baseline, followed by possible additional features.

In the meantime, we are still adding basic features Wikidata had ages ago; for example, it was only today that I was able to use the GUI to add units to a "quantity" type property (see T239474). We are also still waiting for the SPARQL interface, constraint support, the ability to link to statements (T241338), etc. While we wait on those capabilities, SDC development seems to concentrate on a new GUI, which is much clunkier than Wikidata's, and on tools like computer-aided tagging (which I support, but would prefer to see done after we finish turning on all the features we now expect from Wikibase). It is a bit as if we were starting a new language edition of Wikipedia by first turning on the MediaWiki software from 15 years ago and then filing tickets to add all the features we expect from current Wikipedia.

That would be great. I think the frustration is due to expectations management.

+1. It's pretty hard to stay excited about SDoC when there's no proper way to view the hard work that is being done with editing all images, *especially* when there's no way to take Wikidata items into the same query. What's the point of tagging all the reproductions of the Mona Lisa with a 'depicts' statement if there's no way to do a query on them? Or having a completely separate, other way of doing that?

Sorry for the misunderstanding here. It was never our intention not to provide a SPARQL endpoint for Commons. But given the trouble we have at the moment with WDQS, we are focusing on stabilising our existing services before adding new ones. With the work we are doing on WDQS, we are learning about things we should be doing better (like the rework of the whole update chain from Wikidata to WDQS). We don't want to repeat the same mistakes, and we want to start from a sane situation. I understand that the delay is frustrating, but duplicating a service that is not performing well does not sound very appealing to me. Yes, I understand that the expected load on a SPARQL endpoint for Commons is very different from what we currently have on WDQS, but what we see on WDQS are also more fundamental problems that need to be fixed.

Thank you all for your patience!

Yes, there's a cost to you of providing a service based on current WDQS, that then has to be ripped out for a new version based on WDQS 2.

But consider how little cost that change is for users (since what they interact with will be essentially unchanged - SPARQL for SPARQL, like for like), and consider how much they will be able to do with the endpoint in that time. There is so much that the community could move forwards on now, and this is such a blocker for us.

Thank you all for your patience!

Can you give an indication how much longer you're going to test our patience? Weeks? Months? Years?

We recognize how important this is and will be meeting over the next few weeks to determine a plan forward and a timeline. We'll begin providing regular updates here and on the talk thread as we have more information in the next couple of weeks.

Here is a brief update. @Keegan will also be posting an update to the talk thread today.

  • The work to create a SPARQL endpoint for Commons has been re-prioritized. Our teams will be working on it over the next few months and the search team is currently estimating the work involved.
  • The first release will be a beta endpoint that will be updated via weekly dumps. Caveats will include limited performance, expected downtimes, and no interface, naming, or backward compatibility stability guarantees.
  • We do plan to move this to production, but we don't have a timeline on that yet.
  • The SPARQL endpoint for Commons will be restricted via a light and unobtrusive form of authentication, so that we can contact abusive bots / users and block them selectively (as a last resort) when needed. More details on this to come.
  • We want to emphasize that while we do expect a SPARQL endpoint to be part of a medium to long term solution, it will only be part of that solution. Even once the SPARQL endpoint is production-ready, it will still have limitations in terms of timeouts, expensive queries, and federation. Some use cases will need to be migrated, over time, to better solutions — once those solutions exist.

I'm super happy and relieved to hear this. Thank you!

  • We want to emphasize that while we do expect a SPARQL endpoint to be part of a medium to long term solution, it will only be part of that solution. Even once the SPARQL endpoint is production-ready, it will still have limitations in terms of timeouts, expensive queries, and federation. Some use cases will need to be migrated, over time, to better solutions — once those solutions exist.

Understood. These concerns do seem to be focused on technical limitations, am I correct? In terms of context, I think I haven't expressed clearly enough in the thread above that a SPARQL endpoint is also quite important for Wikimedia Commons to be considered a serious citizen of the Linked Open Data web, as it is currently a rather standard style of endpoint for LOD repositories worldwide, including the ability to do federated queries across repositories. It's an important element for Wikimedia projects to be taken seriously in the worldwide GLAM+ knowledge infrastructure. For future better solutions (again, I assume, technically more stable?), I hope we'll stay committed to being compliant with what the rest of the semantic web is doing.

As for a SPARQL endpoint being challenging for many end users from an UX perspective - I would say: don't underestimate our end users and community. I have been wildly surprised at how many people in and around the Wikimedia community - not just a narrow group of power users, but I would say also lots of 'regular' contributors! - have mastered SPARQL and actively use it. A lot of it also thanks to the great work on the Wikidata Query Helper and several volunteer-built tools to make building queries easier.

As for a SPARQL endpoint being challenging for many end users from an UX perspective - I would say: don't underestimate our end users and community.

+1. SPARQL is definitely not for the 'average' user; fortunately we don't have many average users on our projects. Also note that a SPARQL endpoint can easily serve as the foundation of tools that are more user-friendly; for example, my own VizQuery uses the Wikidata SPARQL service. Adapting it to a future stable SDoC / Commons endpoint would literally be a one-line code change. It already works on the old beta service (if that still works).

For me the big issue is quality control. As more people and bots add more statements, I am concerned about people finding multiple creative ways to store the same type of information, resulting in multiple parallel data models. A SPARQL endpoint and property constraints are a way to check that everybody uses the same data model and to address issues quickly, but we are still in the dark.

Constraints are live on production, so in theory they'll Just Work® when the endpoint is up.

Here is another brief update:

The SPARQL endpoint for Commons is moving along well. The data has been loaded and federation is working. Up next is work on authentication, automated data reload, and necessary UI updates. We’re hoping to have an initial release in about four more weeks, though surprises may cause additional delays - we’ll keep you updated.

@CBogen cool, thanks for the update. Really looking forward to be able to test something!

FYI: this work is still on track to be completed by the end of this month. We'll continue to provide updates as we get closer.

At this point it looks like we're a little over two weeks out from release:

  • OAuth code is ready but still needs to be merged
  • Deploying configuration changes to the UI in order to customize the title, logo, examples, etc. for the Commons Query Service proved to be a little more complicated than expected, but is nearly ready
  • The "wd" prefix needs to be changed for Commons for clarity purposes. It is taking longer than expected to do this in the dumps, so we are temporarily doing this at load time, which delayed things a bit.
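The prefix change mentioned in the last point might end up looking something like the declarations below. This is a hypothetical sketch: the task only says the "wd" prefix will be changed for clarity, and the "sdc" name and entity URI here are assumptions, not announced choices.

```sparql
# Hypothetical prefix block for the Commons Query Service:
# a distinct prefix for Commons MediaInfo entities, with wd:/wdt: kept
# for federated Wikidata terms. Names and URIs are illustrative.
PREFIX sdc: <https://commons.wikimedia.org/entity/>
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
```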

Sometime next week the data should be loaded in, and after that comes approximately one more week of testing. We're almost there.