
Feedback from Quarry and PAWS users, and other wiki editors affected by the new architecture
Open, HighPublic

Description

See wikitech:News/Wiki_Replicas_2020_Redesign

In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. This has some impact on features and usage patterns.

If you run SQL queries against the Wiki Replicas via Quarry, PAWS, or some other way, these changes may impact you:

  • Cross database JOINs will no longer work
    • Unfortunately, this means joins across databases, for instance from enwiki to commonswiki or wikidatawiki, are no longer possible
  • Can only query a database if you are connected to it explicitly
    • For example, connecting to the enwiki host to query against avwiki will no longer work

If you think you could be affected by these changes, please post a comment linking to the code that runs the SQL queries. We can help diagnose whether you would be affected and, where possible, how to change the code so that it doesn't break when the new system goes live.

Event Timeline

I've created this task as the point of contact for feedback, since I'm posting an item to Tech News. Any other feedback is also welcome.

It seems that some of my queries on Quarry will be severely affected by this change. What can be done for cases like this?

Hi @Superyetkin, thanks for commenting. The specific examples are very useful in figuring out ways forward.

In the case of that query, AFAICT it is using joins as a way to make multiple queries, but they are not really related, so the migration is fairly straightforward. There are a couple of options I can think of:

  1. Make separate Quarry queries, one for each database, and join the results manually
  2. Write some code that makes the separate queries and prints all the results together (for example using PAWS, or a script that you run locally or on Toolforge)
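The second option can be sketched in a few lines of Python. This is a hypothetical illustration, not anyone's actual query: sqlite3 in-memory databases stand in for the two per-wiki replica connections, and the table and column names (`page`, `wb_items_per_site`) are simplified stand-ins for the real schemas.

```python
import sqlite3

# Two separate connections stand in for two per-wiki replica hosts
# (in production these would be separate MariaDB connections, one per wiki).
enwiki = sqlite3.connect(":memory:")
wikidata = sqlite3.connect(":memory:")

# Toy data in place of the real replica tables.
enwiki.execute("CREATE TABLE page (page_id INTEGER, page_title TEXT)")
enwiki.executemany("INSERT INTO page VALUES (?, ?)", [(1, "Foo"), (2, "Bar")])
wikidata.execute(
    "CREATE TABLE wb_items_per_site (ips_item_id INTEGER, ips_site_page TEXT)")
wikidata.executemany("INSERT INTO wb_items_per_site VALUES (?, ?)", [(100, "Foo")])

# Step 1: run one query per database, on its own connection...
pages = enwiki.execute("SELECT page_id, page_title FROM page").fetchall()
items = dict(wikidata.execute(
    "SELECT ips_site_page, ips_item_id FROM wb_items_per_site").fetchall())

# Step 2: ...then join the two result sets manually in application code,
# which is what the cross-database JOIN used to do on the server.
joined = [(pid, title, items.get(title)) for pid, title in pages]
print(joined)
```

The manual join here is a dictionary lookup keyed on the join column, which is usually the simplest replacement when one side of the join fits in memory.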

Does that make sense? Let me know if you need help figuring out the best way forward, or if you have other examples that could be problematic.

Some generic examples of how I am using cross-database joins. The problem is not really whether I know how to recreate those using code; it is that if cross-database joins go away, they become a lot more complex to implement (i.e. more code to write). They are also a lot slower to run if I first have to fetch all the data locally and do the matching manually, where now the work is done transparently by the SQL server.

Find pages without authority control templates but with suitable wikidata-property set.

Find pages without commonscat template in fiwiki but with commons category over 10 files in Wikimedia Commons.

Find the usernames and timestamps of users who have contributed photos to the public art project's art list, for computing statistics.
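The fiwiki/Commons example above can often be approximated off-cluster by splitting the join into two queries plus a batched `IN (...)` filter. A minimal sketch, again with sqlite3 standing in for the fiwiki and commonswiki replica connections and with hypothetical, simplified table names:

```python
import sqlite3

fiwiki = sqlite3.connect(":memory:")
commons = sqlite3.connect(":memory:")

# Toy stand-ins: fiwiki pages already filtered to those lacking the
# commonscat template, and Commons categories with their file counts.
fiwiki.execute("CREATE TABLE page (page_title TEXT)")
fiwiki.executemany("INSERT INTO page VALUES (?)",
                   [("Helsinki",), ("Espoo",), ("Vantaa",)])
commons.execute("CREATE TABLE category (cat_title TEXT, cat_files INTEGER)")
commons.executemany("INSERT INTO category VALUES (?, ?)",
                    [("Helsinki", 500), ("Espoo", 3)])

# Step 1: candidate page titles from fiwiki.
titles = [t for (t,) in fiwiki.execute("SELECT page_title FROM page")]

# Step 2: check Commons category sizes with an IN (...) filter,
# doing in two queries what one cross-database JOIN used to do.
placeholders = ",".join("?" * len(titles))
big_cats = {t for (t,) in commons.execute(
    f"SELECT cat_title FROM category "
    f"WHERE cat_title IN ({placeholders}) AND cat_files > 10", titles)}

result = [t for t in titles if t in big_cats]
print(result)
```

For large candidate sets the `IN (...)` list would itself need to be chunked into batches, which is part of the extra complexity being discussed here.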

What I would find useful is if, after the migration, there were at least some level of support for joins between any other database AND some of the tables from, for example, Commons and Wikidata.

I understand that the largest tables, like revision, logging, comment, etc., need to go because of disk-space constraints, but there are other tables that are highly useful in a cross-database-join context.

For example:

  • page
  • page_props
  • image_compat
  • imagelinks
  • globalimagelinks
  • category
  • categorylinks
  • pagelinks
  • externallinks
  • redirect
  • geo_tags
  • wb_*

etc...

Thanks for providing some of your queries @Zache, real examples are very useful.

Some generic examples of how I am using cross-database joins. The problem is not really whether I know how to recreate those using code; it is that if cross-database joins go away, they become a lot more complex to implement (i.e. more code to write).

This is definitely the case, and it is very annoying indeed, but it is necessary in order not to lose the replicas entirely right now.

There are some use cases, like joins against Commons, that are almost impossible to do off-cluster because of the size and update frequency of the Commons DB. This is known, and, even if it becomes more cumbersome, we would like those use cases to at least remain feasible to code.

They are also a lot slower to run if I first have to fetch all the data locally and do the matching manually, where now the work is done transparently by the SQL server.

Indeed, slower, and in some cases infeasible, because the servers time out trying to transfer large result sets for download (as in the cases I mentioned above). Then you have to start paginating, and things become unreasonably complex and slow to run.
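The pagination mentioned here usually means keyset pagination: restarting each query from the last seen primary key, so no single query runs long enough to time out. A runnable sketch of the pattern, with sqlite3 standing in for a replica connection and a toy `page` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(i, f"Page_{i}") for i in range(1, 11)])

def fetch_all_paginated(conn, batch_size=3):
    """Keyset pagination: fetch rows in small batches, each query
    resuming after the last page_id seen, instead of one huge query."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT page_id, page_title FROM page "
            "WHERE page_id > ? ORDER BY page_id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]  # resume point for the next batch

all_rows = list(fetch_all_paginated(conn))
print(len(all_rows))
```

Keyset pagination is preferred over `LIMIT ... OFFSET` here because the offset variant rescans skipped rows on every batch and gets slower as it goes, which is part of why paginated jobs become "unreasonably long to run".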

[...]
What I would find useful is if, after the migration, there were at least some level of support for joins between any other database AND some of the tables from, for example, Commons and Wikidata.

I saw you already commented on the other OLAP task, which is great; we are exploring options there for things we could do, including some experiments and feasibility investigations. If something looks promising, we will pursue it as soon as we can.

Here is a more complex example. As far as I can see, rewriting this in pure application logic requires lots of developer work, if such an implementation is feasible at all.