Page MenuHomePhabricator

SQLite/PostgreSQL native upsert
Open, Needs TriagePublic

Description

It would be convenient, for example in T389028, to use native upsert on non-MySQL DBMSs. There are a number of difficulties which I will document here.

Insert ID

The existing emulation goes to a fair bit of trouble to set the insert ID to the ID of the updated row.

In MySQL, insertId() after a native upsert returns the ID of the updated row. In SQLite, the last insert ID is not modified by an upsert following the update path, so insertId() will give you garbage — this matches MySQL's LAST_INSERT_ID() but not mysql_insert_id(). In PostgreSQL, a native upsert increments the underlying sequence even when it follows the update path, and insertId() returns the updated sequence value.

MariaDB 10.11:

MariaDB> CREATE TABLE t1 (a INTEGER PRIMARY KEY AUTO_INCREMENT, b INTEGER, c INTEGER);
MariaDB> CREATE UNIQUE INDEX b ON t1 (b);
MariaDB> CREATE TABLE t2 (a INTEGER PRIMARY KEY AUTO_INCREMENT, b INTEGER, c INTEGER);
MariaDB> CREATE UNIQUE INDEX b ON t2 (b);

If we update table t2 with an upsert, LAST_INSERT_ID() returns the value from the insert into table t1:

MariaDB> INSERT INTO t2 (b, c) VALUES (1,1);
Query OK, 1 row affected (0.009 sec)

MariaDB> INSERT INTO t1 (b, c) VALUES (1,1);
Query OK, 1 row affected (0.010 sec)

MariaDB> INSERT INTO t1 (b, c) VALUES (2,2);
Query OK, 1 row affected (0.008 sec)

MariaDB> INSERT INTO t1 (b, c) VALUES (3,3);
Query OK, 1 row affected (0.008 sec)

MariaDB> INSERT INTO t2 (b, c) VALUES (1,1) ON DUPLICATE KEY UPDATE c=c+1;
Query OK, 2 rows affected, 1 warning (0.019 sec)

MariaDB> SELECT last_insert_id();
+------------------+
| last_insert_id() |
+------------------+
|                3 |
+------------------+
1 row in set (0.001 sec)

SQLite's matches MariaDB's in this test:

sqlite> INSERT INTO t2 (b, c) VALUES (1,1);
sqlite> INSERT INTO t1 (b, c) VALUES (1,1);
sqlite> INSERT INTO t1 (b, c) VALUES (2,2);
sqlite> INSERT INTO t1 (b, c) VALUES (3,3);
sqlite> INSERT INTO t2 (b, c) VALUES (1,1) ON CONFLICT (b) DO UPDATE SET c=c+1;
sqlite> SELECT last_insert_rowid();
3

Postgres increments the underlying sequence before attempting to insert the row and leaves gaps in the table:

postgres> INSERT INTO t2 (b, c) VALUES (1,1) ON CONFLICT (b) DO UPDATE SET c=t2.c+1;
INSERT 0 1
postgres> INSERT INTO t2 (b, c) VALUES (1,1) ON CONFLICT (b) DO UPDATE SET c=t2.c+1;
INSERT 0 1
postgres> INSERT INTO t2 (b, c) VALUES (1,1) ON CONFLICT (b) DO UPDATE SET c=t2.c+1;
INSERT 0 1
postgres> SELECT lastval();
 lastval 
---------
       3
(1 row)

postgres> INSERT INTO t2 (b, c) VALUES (4,4);
INSERT 0 1
postgres> SELECT * from t2;
 a | b | c 
---+---+---
 1 | 1 | 3
 4 | 4 | 4
(2 rows)

Maybe the simplest solution is to ensure that there is no calling code that depends on the value of insertId() after upsert(). Note that the MariaDB upsert gave a warning on my system:

MariaDB> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1592
Message: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE  on a table with more than one UNIQUE KEY is unsafe
1 row in set (0.000 sec)

So what is the use case for insertId() after upsert? If you insert with a fixed value for the ID and it conflicts, you know what it was because you specified it. If you insert without specifying an ID and it conflicts on another unique index, that's unsafe for replication.

Field name ambiguity

postgres> INSERT INTO t2 (b) VALUES (1) ON CONFLICT (b) DO UPDATE SET c=c+1;
ERROR:  column reference "c" is ambiguous
LINE 1: ... INTO t2 (b) VALUES (1) ON CONFLICT (b) DO UPDATE SET c=c+1;

The equivalent of this very simple query works in MariaDB and SQLite but fails in Postgres because the excluded and target rows are both available in the expression with the same field names. It seems like a bug. The excluded and target rows are necessarily from the same table, so you can never give an unqualified column name in the expression.

Maybe in PostgreSQL 15+ it will be possible to do something like

MERGE INTO t2 USING (SELECT 1 AS b) AS source ON b=source.b
  WHEN MATCHED THEN UPDATE b=b+1
  WHEN NOT MATCHED THEN INSERT (b) VALUES (b);

EXCLUDED table alias

In SQLite and PostgreSQL, EXCLUDED is reserved in the SET expression and can't be used to reference a table. I couldn't find anything in Code Search using it as a table name.

Event Timeline

Maybe I'm wrong but if I remember correctly these upsert have been written when the native way didn't exist or was not in supported version of PG in MW. I'd vote for use of native upsert for these reasons:

  • The support for PG and sqlite has always been best effort. I remember the time that its schema had around 300 drifts (including missing indexes). If there is some edge case that it won't work, the person using PG can patch mw.
  • I think not using native upsert were just historical (as I said), nothing to do with edge cases or missing functionalities.
  • I tend to discourage using of upsert altogether. I've seen people use upsert for general update and every upsert without insert wastes an auto_increment id causing gaps (and running out of max value faster). This is true even in MySQL/MariaDB. So it's either should be update if found and then insert if not found (~99.99% cases) and if for concurrency reasons, you can't insert, then do upsert which I'm fairly certain it won't happen in PG/Sqlite setups

I updated the task description regarding insertId() on MySQL. I was misled by a documented difference between mysql_insert_id() and LAST_INSERT_ID().

Maybe I'm wrong but if I remember correctly these upsert have been written when the native way didn't exist or was not in supported version of PG in MW.

In 2021, in a Gerrit comment on a change by Aaron which increased the complexity of the upsert emulation, I suggested using native upsert instead. Aaron said:

Aside from the version bump, there are two other problems:
1) The table name cannot be "excluded"/i
2) The "existing" (non excluded) columns must be referenced using <table>.<column> rather than just <column>, which is a breaking change to callers.

Point (2) is the subject of my pgsql-general post. That thread is going as expected.

(2) is the only real problem with native upsert. Code review shows that it will hit users immediately. For example SiteStatsUpdate, SqlBagOStuff, or third party code like the Analytics extension.

I'd vote for use of native upsert for these reasons:

  • The support for PG and sqlite has always been best effort. I remember the time that its schema had around 300 drifts (including missing indexes). If there is some edge case that it won't work, the person using PG can patch mw.

As long as PostgreSQL is voting in CI, we will need an upsert which is reasonably consistent across DBMSs. Otherwise this kind of investigation will be constantly repeated by developers trying to get their code merged. We would push the cost of PostgreSQL support back to developers who have less idea about what is going on.

There might be a few people who are using PostgreSQL on purpose (like wiki.postgresql.org), but none of them appear to contribute to maintaining it. Maintenance at the moment is driven by the need to write code that passes CI.

We know from experience that PG support will bitrot quickly if it is removed from CI.

We could always drop it. But you argued against dropping it at T315396.

  • I tend to discourage using of upsert altogether. I've seen people use upsert for general update and every upsert without insert wastes an auto_increment id causing gaps (and running out of max value faster). This is true even in MySQL/MariaDB. So it's either should be update if found and then insert if not found (~99.99% cases) and if for concurrency reasons, you can't insert, then do upsert which I'm fairly certain it won't happen in PG/Sqlite setups

Fair enough. I tried to write an explanation of why I want to use UPSERT instead of UPDATE/INSERT for T389028 but I found an error in my own reasoning and now it looks like UPDATE/INSERT is good enough, and it didn't cause the bug in the first place, and my explanation at T388743#10635266 was wrong. I really need to write a proper regression test.

EDIT: maybe a little bit wrong but not quite as wrong as that. There's some tricky details.