Let's say you have a table and a data set, and would like to add only those rows in your data set that aren't already in the table. There are hard ways, but here's an easy one.
Let's imagine you have a table like this:
CREATE TABLE foo (id INTEGER, t TEXT, PRIMARY KEY(id, t));
Let's put some data in there:
COPY foo FROM stdin DELIMITER '|';
1|I
2|love
3|the
4|smell
5|of
6|coffee
\.
Good so far. Now we have a mix of old and new information coming in.
Let's say it's:
(1,'I')
(2,'love')
(7,'in')
(8,'the')
(9,'morning')
To insert just the new rows, silently skipping the ones that already exist, do this:
INSERT INTO foo (id, t)
SELECT v.*
FROM (
    VALUES
        (1, 'I'),
        (2, 'love'),
        (7, 'in'),
        (8, 'the'),
        (9, 'morning')
) AS v (id, t)
LEFT JOIN foo f ON (
    v.id = f.id AND
    v.t = f.t
)
WHERE
    f.id IS NULL AND
    f.t IS NULL;
The LEFT JOIN plus the IS NULL conditions ensure that only rows which don't already exist in foo survive into the SELECT, and hence get inserted.
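If you run the exact same INSERT again, zero rows go in, since every value in the list now already exists in foo. A quick check (expected output assumed from a stock PostgreSQL install):

-- should list ids 1 through 9, with no duplicates of rows 1 and 2
SELECT * FROM foo ORDER BY id;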
Until next time...
I am on a project where I am importing and merging several GBs of data into MySQL. I first load the data from a CSV file, which is very fast. But then I hit a wall with the above approach, since the query executes way too slow. Which means that INSERT IGNORE was the rescue for a problem that probably shouldn't have existed in the first place.
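For reference, a minimal sketch of the MySQL INSERT IGNORE approach the commenter means; rows that would violate a unique key are silently skipped (the duplicates become warnings instead of errors):

-- MySQL only: (1,'I') and (2,'love') are skipped, the rest are inserted
INSERT IGNORE INTO foo (id, t)
VALUES (1, 'I'), (2, 'love'), (7, 'in'), (8, 'the'), (9, 'morning');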
Wouldn't it be better just to write an upsert function instead?
create function foo_upsert(wk_id int, wk_t text) returns void as $$
begin
    update foo set t = wk_t where id = wk_id;
    if not found then
        insert into foo (id, t) values (wk_id, wk_t);
    end if;
end;
$$ language plpgsql;
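You'd then call it once per incoming row, e.g.:

select foo_upsert(7, 'in');  -- no row with id 7 yet, so this inserts
select foo_upsert(1, 'x');   -- id 1 exists, so this updates its t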
This would eliminate the risk that depesz mentions, no? Because it would force sequential access...
The UPSERT function doesn't actually do what's needed here, and since it's a per-row check with an exception handler in PL/pgSQL, it's not terribly fast, to put it mildly.
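For context, the per-row exception-handler pattern being referred to looks roughly like this (a sketch along the lines of the merge_db example in the PostgreSQL docs; foo_upsert_safe is a made-up name):

CREATE FUNCTION foo_upsert_safe(wk_id INT, wk_t TEXT) RETURNS VOID AS $$
BEGIN
    LOOP
        -- try to update an existing row first
        UPDATE foo SET t = wk_t WHERE id = wk_id;
        IF FOUND THEN
            RETURN;
        END IF;
        -- not there: try to insert; if another session beat us to it,
        -- swallow the unique_violation and loop back to the UPDATE
        BEGIN
            INSERT INTO foo (id, t) VALUES (wk_id, wk_t);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;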
Both fail scenarios start from the same situation:
create table q (x int4 primary key);
insert into q select generate_series(1,5);
fail scenario 1:
insert into q (x)
select v.* from ( values (1), (7), (7) ) as v (i)
left join q on v.i = q.x where q.x is null;
i.e. a non-unique list of values to insert: both (7) rows pass the anti-join check, and the second one then violates the primary key.
fail scenario 2:
psql #1: begin;
psql #2: begin;
psql #1: insert into q (x)
select v.* from ( values (6), (7) ) as v (i)
left join q on v.i = q.x where q.x is null;
psql #2: insert into q (x)
select v.* from ( values (6), (8) ) as v (i)
left join q on v.i = q.x where q.x is null;
psql #1: commit;
At this point psql #2's insert of (6), which had been blocked waiting on psql #1's uncommitted row, fails with a duplicate key violation: neither session could see the other's row when its anti-join ran.
The problem is trying to do this when there is concurrent activity on the target table, without aborting just because a few rows out of a few million happened to change during the process. I have some ideas, but none are trivial; the only efficient solution would involve modifying the backend.
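(A side note for anyone reading this on a newer PostgreSQL: since 9.5, INSERT ... ON CONFLICT DO NOTHING covers both fail scenarios in a single statement, with no race window; it wasn't available at the time of this discussion.)

-- PostgreSQL 9.5+: duplicate values within the list and rows inserted
-- concurrently by other sessions are both skipped, not errors
INSERT INTO q (x)
VALUES (1), (7), (7)
ON CONFLICT (x) DO NOTHING;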
The first fail scenario can probably be prevented by adding a GROUP BY v.id, v.t clause after the WHERE clause (or by using SELECT DISTINCT), so the values list is deduplicated before insertion.
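Applied to fail scenario 1, that fix would look something like:

insert into q (x)
select distinct v.i
from ( values (1), (7), (7) ) as v (i)
left join q on v.i = q.x
where q.x is null;
-- the two (7) rows collapse into one, so the insert succeeds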