Page MenuHomePhabricator

Consider granting `CREATE TEMPORARY TABLES` to labsdbuser
Closed, DeclinedPublic

Description

In T156869: Design a method for keeping user-created tables in sync across labsDBs we decided not to allow user created tables on the new Wiki Replica cluster. The reasoning for the decision is sound (can't guarantee replication and that blocks maintenance & load balancing).

There are a class of analytics-style queries that are much easier to do with an intermediary table however rather than a sub-select, union, or in-app join. Would it be reasonable to allow things like create temporary table foo ( foo_id varchar(30) ); to be performed on these servers? The upstream documentation says:

You can use the TEMPORARY keyword when creating a table. A TEMPORARY table is visible only within the current session, and is dropped automatically when the session is closed. This means that two different sessions can use the same temporary table name without conflicting with each other or with an existing non-TEMPORARY table of the same name. (The existing table is hidden until the temporary table is dropped.)

Event Timeline

I am going to put this on hold until the 14th when Jaime is back, because I cannot remember if this GRANT was excluded for some specific reason when we set up the new replicas or just because it was not defined specifically (as users on the other replicas have GRANT ALL PRIVILEGES for their userdbs).

Hi,

We have been discussing this ticket during our meeting and we don't have a clear picture of what problem you are trying to solve here.
Could you give us an example of what is broken and/or what you would like to achieve with these temporary tables?

Thanks!

Marostegui triaged this task as Medium priority.Nov 13 2017, 1:13 PM

An example from a recent created report:

  1. Table 1: Article titles with GROUP BY normalized titles HAVING COUNT(*) > 1
  2. Table 2: All links from disambiguation pages and set-index articles with redirects resolved, sorting and indexing by (target_page, dab_page)
  3. These tables are combined and outputted into the report

The initial version using a single query and was projected to be 12x slower (6-8 hours vs 30 minutes).

@MZMcBride did you have some reports that used TEMPORARY TABLE as well? I've lost track at this point of who should be involved in some of these conversations.

@Dispenser Do you have a link to a code example?, that will be easier to discuss (the link you added is an ongoing discussion without much code, unless I have missed it). We are not arguing against There are a class of analytics-style queries that are much easier to do with an intermediary table, we just want to know an example of code run to provide ways of doing it similarly, if not exactly the same way. A link to a git repo of code "no longer working" would be what we are looking for.

@MZMcBride did you have some reports that used TEMPORARY TABLE as well? I've lost track at this point of who should be involved in some of these conversations.

I briefly researched temporary tables, but figured that if regular tables weren't allowed on the same hosts, temporary tables (in memory or on disk) wouldn't be allowed either. I was hoping T173511: Implement technical details and process for "datasets_p" on wikireplica hosts would solve my problem, so I've stopped researching alternative solutions for now.

@Dispenser Do you have a link to a code example?

missing_entries.sql

I also use temporary tables with LOAD DATA, here's one that reads data from files and JOINs it with recent changes.

@Dispenser but on that missing_entries.sql there are tons of things that wouldn't work with the current and new replicas if I am not missing something.
You are basically using your user (I assume yours) database, which will no longer be there as these hosts are RO.
ie:

DROP TABLE IF EXISTS s52690__p.all_articles, s52690__p.dabs, s52690__p.dabs2page, s52690__p.si, s52690__p.si2page;
CREATE TABLE s52690__p.all_articles

There are no user databases hosts on these new replicas.
Same for the LOAD DATA, they won't work.

@Marostegui The idea is to create those tables (or some more generic) in advance in sanitarium- like with T59617.

@Dispenser but on that missing_entries.sql there are tons of things that wouldn't work with the current and new replicas if I am not missing something.
You are basically using your user (I assume yours) database, which will no longer be there as these hosts are RO.

Presumably, this is only because that's the database to which @Dispensor has CREATE TEMPORARY TABLES access, and any database to which such access was granted would work -- perhaps with a bit more care not to shadow any "real" tables.

ie:

DROP TABLE IF EXISTS s52690__p.all_articles, s52690__p.dabs, s52690__p.dabs2page, s52690__p.si, s52690__p.si2page;
CREATE TABLE s52690__p.all_articles

I don't see that bit in the current code, just:

CREATE TEMPORARY TABLE s52690__p.all_articles

There are no user databases hosts on these new replicas.

It seems that's actually irrelevant; CREATE TEMPORARY TABLE is perfectly happy with non-existent databases.

Upstream says:

TEMPORARY tables have a very loose relationship with databases (schemas). Dropping a database does not automatically drop any TEMPORARY tables created within that database. Also, you can create a TEMPORARY table in a nonexistent database if you qualify the table name with the database name in the CREATE TABLE statement. In this case, all subsequent references to the table must be qualified with the database name.

Same for the LOAD DATA, they won't work.

On what do you base this conclusion? The CREATE TEMPORARY TABLE documentation says "The creating session can perform any operation on the table", and there is nothing to suggest that LOAD DATA LOCAL INFILE does not qualify as such, or that it requires any other privileges. (There is the possibility that /tmp might get full, though.)

@SamB Replica databases are supposed to be read only, and only replication (or an admin for administration purpose only-not data handling) can write there. CREATE TEMPORARY TABLE is a very problematic with replication, as you can see on these examples:

perhaps with a bit more care not to shadow any "real" tables.

And that is why we cannot provide create tables to existing databases, at most we would do it on a separate server and replicate it, as stated on T173511 (and temporary tables do not get replicated on ROW) or giving them to a non-replicated database, e.g. scratch, which is not problem-free.

I would suggest to either subscribe to that ticket and state your problem or rename this task explaining the problem name ("I need temporary tables"), -rather than a specific solution (GRANT)-, as we could see other ways of making either temporary tables or something else work. I honestly would prefer to see those converted to let's say, real, random nane memory tables with binlog disabled; or, if they are small, on application memory- if they are large, precreated with a trigger-like mechanism on sanitarium.

That last part is not that convoluted- if we precreate for you an "all_articles" and other lists with a cron job, we would be saving time on your reports- and not only your session will be able to use it; all other users will benefit from it! Please link the code your run on the description so it can be easily found.

Thanks @jcrespo for explaining. Meta point I see here that we keep coming back to and that any variance from has created large hurdles is Replica databases are supposed to be read only. +1

Yes, CREATE TEMPORARY TABLES can create lag on the replicas for long running queries (locking).

@jcrespo Wow. It sounds like the documentation for temporary tables needs a gigantic "Pitfalls" section describing shortcomings in the implementation(s). And the semantics seemed *perfect* for use with read-only replicas, too :-(.

Declined per @jcrespo's explanation in T179628#3788031.