
Feedback from Quarry and PAWS users, and other wiki editors affected by the new architecture
Open, HighPublic

Description

See wikitech:News/Wiki_Replicas_2020_Redesign

In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. This has some impact on features and usage patterns.

If you run SQL queries against the Wiki Replicas via Quarry, PAWS, or some other way, these changes may impact you:

  • Cross database JOINs will no longer work
    • Unfortunately, this means joins across databases, for instance from enwiki to commonswiki or wikidatawiki, are no longer possible
  • Can only query a database if you are connected to it explicitly
    • For example, connecting to the enwiki host to query against avwiki will no longer work

If you think you could be affected by these changes, please post a comment linking to the code that runs the SQL queries. We can help diagnose whether you would be affected and, where possible, how to change the code so that it doesn't break when the new system goes live.

Event Timeline

I've created this task as the point of contact for feedback, since I'm posting an item to Tech News. Any other feedback is also welcome.

It seems that some of my queries on Quarry will be severely affected by this change. What can be done for cases like this?

Hi @Superyetkin, thanks for commenting. The specific examples are very useful in figuring out ways forward.

In the case of that query, AFAICT it is using joins as a way to make multiple queries, but they are not really related, so the migration is fairly straightforward. There are a couple of options I can think of:

  1. Make separate Quarry queries, one for each database, and join the results manually
  2. Write some code that makes the separate queries and prints all the results together (for example using PAWS, or a script that you run locally or on Toolforge)
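The second option can be sketched in a few lines of Python. This is a hypothetical illustration, not anyone's actual query: sqlite3 in-memory databases stand in for the two per-wiki replica connections, and the table and column names (`page`, `wb_items_per_site`) are simplified stand-ins for the real schemas.

```python
import sqlite3

# Two separate connections stand in for two per-wiki replica hosts
# (in production these would be separate MariaDB connections, one per wiki).
enwiki = sqlite3.connect(":memory:")
wikidata = sqlite3.connect(":memory:")

# Toy data in place of the real replica tables.
enwiki.execute("CREATE TABLE page (page_id INTEGER, page_title TEXT)")
enwiki.executemany("INSERT INTO page VALUES (?, ?)", [(1, "Foo"), (2, "Bar")])
wikidata.execute(
    "CREATE TABLE wb_items_per_site (ips_item_id INTEGER, ips_site_page TEXT)")
wikidata.executemany("INSERT INTO wb_items_per_site VALUES (?, ?)", [(100, "Foo")])

# Step 1: run one query per database, on its own connection...
pages = enwiki.execute("SELECT page_id, page_title FROM page").fetchall()
items = dict(wikidata.execute(
    "SELECT ips_site_page, ips_item_id FROM wb_items_per_site").fetchall())

# Step 2: ...then join the two result sets manually in application code,
# which is what the cross-database JOIN used to do on the server.
joined = [(pid, title, items.get(title)) for pid, title in pages]
print(joined)
```

The manual join here is a dictionary lookup keyed on the join column, which is usually the simplest replacement when one side of the join fits in memory.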

Does that make sense? Let me know if you need help figuring out the best way forward, or if you have other examples that could be problematic.

Some generic examples of how I am using cross-database joins. The problem is not really whether I know how to recreate those using code; it is that if cross-database joins go away, they become a lot more complex to implement (i.e. more code to write). They are also a lot slower to run if I first have to fetch all the data locally and do the matching manually, where now the work is done transparently by the SQL server.

Find pages without authority control templates but with suitable wikidata-property set.

Find pages without commonscat template in fiwiki but with commons category over 10 files in Wikimedia Commons.

Find the usernames and timestamps of users who have contributed photos to the public art project's art list, for computing statistics.
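The fiwiki/Commons example above can often be approximated off-cluster by splitting the join into two queries plus a batched `IN (...)` filter. A minimal sketch, again with sqlite3 standing in for the fiwiki and commonswiki replica connections and with hypothetical, simplified table names:

```python
import sqlite3

fiwiki = sqlite3.connect(":memory:")
commons = sqlite3.connect(":memory:")

# Toy stand-ins: fiwiki pages already filtered to those lacking the
# commonscat template, and Commons categories with their file counts.
fiwiki.execute("CREATE TABLE page (page_title TEXT)")
fiwiki.executemany("INSERT INTO page VALUES (?)",
                   [("Helsinki",), ("Espoo",), ("Vantaa",)])
commons.execute("CREATE TABLE category (cat_title TEXT, cat_files INTEGER)")
commons.executemany("INSERT INTO category VALUES (?, ?)",
                    [("Helsinki", 500), ("Espoo", 3)])

# Step 1: candidate page titles from fiwiki.
titles = [t for (t,) in fiwiki.execute("SELECT page_title FROM page")]

# Step 2: check Commons category sizes with an IN (...) filter,
# doing in two queries what one cross-database JOIN used to do.
placeholders = ",".join("?" * len(titles))
big_cats = {t for (t,) in commons.execute(
    f"SELECT cat_title FROM category "
    f"WHERE cat_title IN ({placeholders}) AND cat_files > 10", titles)}

result = [t for t in titles if t in big_cats]
print(result)
```

For large candidate sets the `IN (...)` list would itself need to be chunked into batches, which is part of the extra complexity being discussed here.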

What I would find useful is if, after the migration, there were at least some level of support for joins between any other database AND some of the tables from, for example, Commons and Wikidata.

I understand that the largest tables, like revision, logging, comment, etc., need to go because of disk-space constraints, but there are other tables that are highly useful in a cross-database-join context.

For example:

  • page
  • page_props
  • image_compat
  • imagelinks
  • globalimagelinks
  • category
  • categorylinks
  • pagelinks
  • externallinks
  • redirect
  • geo_tags
  • wb_*

etc...

Thanks for providing some of your queries @Zache, real examples are very useful.

Some generic examples of how I am using cross-database joins. The problem is not really whether I know how to recreate those using code; it is that if cross-database joins go away, they become a lot more complex to implement (i.e. more code to write).

This is definitely the case, and it is very annoying indeed, but it is necessary in order not to lose the replicas entirely right now.

There are some use cases, like joins against Commons, that are almost impossible to do off-cluster because of the size and update frequency of the Commons DB. This is known, and, even if it becomes more cumbersome, we would like those use cases to at least remain feasible to code.

They are also a lot slower to run if I first have to fetch all the data locally and do the matching manually, where now the work is done transparently by the SQL server.

Indeed, slower, and in some cases infeasible, because the servers time out trying to transfer large result sets for download (as in the cases I mentioned above). Then you have to start paginating, and things become unreasonably complex and slow to run.
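The pagination mentioned here usually means keyset pagination: restarting each query from the last seen primary key, so no single query runs long enough to time out. A runnable sketch of the pattern, with sqlite3 standing in for a replica connection and a toy `page` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(i, f"Page_{i}") for i in range(1, 11)])

def fetch_all_paginated(conn, batch_size=3):
    """Keyset pagination: fetch rows in small batches, each query
    resuming after the last page_id seen, instead of one huge query."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT page_id, page_title FROM page "
            "WHERE page_id > ? ORDER BY page_id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]  # resume point for the next batch

all_rows = list(fetch_all_paginated(conn))
print(len(all_rows))
```

Keyset pagination is preferred over `LIMIT ... OFFSET` here because the offset variant rescans skipped rows on every batch and gets slower as it goes, which is part of why paginated jobs become "unreasonably long to run".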

[...]
What I would find useful is if, after the migration, there were at least some level of support for joins between any other database AND some of the tables from, for example, Commons and Wikidata.

I saw you already commented on the other OLAP task, which is great; we are exploring options there for things we could do, including some experiments and feasibility investigations. If something looks promising, we will pursue it as soon as we can.

Here is a more complex example. As far as I can see, rewriting this in pure application logic requires lots of developer work, if such an implementation is feasible at all.