Let's say you have a table and a data set, and would like to add only those rows in your data set that aren't already in the table. There are hard ways, but here's an easy one.
Let's imagine you have a table like this:
CREATE TABLE foo (id INTEGER, t TEXT, PRIMARY KEY(id, t));
Let's put some data in there:
COPY foo FROM stdin DELIMITER '|';
1|I
2|love
3|the
4|smell
5|of
6|coffee
\.
Good so far. Now we have a mix of old and new information coming in.
Let's say it's:
(1,'I')
(2,'love')
(7,'in')
(8,'the')
(9,'morning')
To insert just the new rows, silently skipping the ones that already exist, do this:
INSERT INTO foo (id, t)
SELECT v.*
FROM (
    VALUES
        (1, 'I'),
        (2, 'love'),
        (7, 'in'),
        (8, 'the'),
        (9, 'morning')
) AS v (id, t)
LEFT JOIN foo f ON (
    v.id = f.id AND
    v.t = f.t
)
WHERE
    f.id IS NULL AND
    f.t IS NULL;
The LEFT JOIN plus the IS NULL conditions ensure that only rows which don't already exist in foo survive into the SELECT, and hence get inserted.
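If you run the exact same INSERT again, zero rows go in, since every value in the list now already exists in foo. A quick check (expected output assumed from a stock PostgreSQL install):

-- should list ids 1 through 9, with no duplicates of rows 1 and 2
SELECT * FROM foo ORDER BY id;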
Until next time...
I am on a project where I am importing and merging several GBs of data into MySQL. I first load the data from a CSV file, which is very fast. But then I hit a wall with the above approach, since the query executes way too slow. Which means that INSERT IGNORE was the rescue for a problem that probably shouldn't have existed in the first place.
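For reference, a minimal sketch of the MySQL INSERT IGNORE approach the commenter means; rows that would violate a unique key are silently skipped (the duplicates become warnings instead of errors):

-- MySQL only: (1,'I') and (2,'love') are skipped, the rest are inserted
INSERT IGNORE INTO foo (id, t)
VALUES (1, 'I'), (2, 'love'), (7, 'in'), (8, 'the'), (9, 'morning');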
Wouldn't it be better just to write an upsert function instead?
create function foo_upsert(wk_id int, wk_t text) returns void as $$
begin
    update foo set t = wk_t where id = wk_id;
    if not found then
        insert into foo (id, t) values (wk_id, wk_t);
    end if;
end;
$$ language plpgsql;
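You'd then call it once per incoming row, e.g.:

select foo_upsert(7, 'in');  -- no row with id 7 yet, so this inserts
select foo_upsert(1, 'x');   -- id 1 exists, so this updates its t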
This would eliminate the risk that depesz mentions, no? Because it would force sequential access...
The UPSERT function doesn't actually do what's needed here, and since it's a per-row check with an exception handler in PL/pgSQL, it's not terribly fast, to put it mildly.
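For context, the per-row exception-handler pattern being referred to looks roughly like this (a sketch along the lines of the merge_db example in the PostgreSQL docs; foo_upsert_safe is a made-up name):

CREATE FUNCTION foo_upsert_safe(wk_id INT, wk_t TEXT) RETURNS VOID AS $$
BEGIN
    LOOP
        -- try to update an existing row first
        UPDATE foo SET t = wk_t WHERE id = wk_id;
        IF FOUND THEN
            RETURN;
        END IF;
        -- not there: try to insert; if another session beat us to it,
        -- swallow the unique_violation and loop back to the UPDATE
        BEGIN
            INSERT INTO foo (id, t) VALUES (wk_id, wk_t);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;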
Both fail scenarios start from the same situation:
create table q (x int4 primary key);
insert into q select generate_series(1,5);
fail scenario 1:
insert into q (x)
select v.* from ( values (1), (7), (7) ) as v (i)
left join q on v.i = q.x where q.x is null;
i.e. a non-unique list of values to insert: both (7) rows pass the anti-join check, and the second one then violates the primary key.
fail scenario 2:
psql #1: begin;
psql #2: begin;
psql #1: insert into q (x)
select v.* from ( values (6), (7) ) as v (i)
left join q on v.i = q.x where q.x is null;
psql #2: insert into q (x)
select v.* from ( values (6), (8) ) as v (i)
left join q on v.i = q.x where q.x is null;
psql #1: commit;
At this point psql #2's insert of (6), which had been blocked waiting on psql #1's uncommitted row, fails with a duplicate key violation: neither session could see the other's row when its anti-join ran.
The problem is trying to do this when there is concurrent activity on the target table, without aborting just because a few rows out of a few million happened to change during the process. I have some ideas, but none are trivial; the only efficient solution would involve modifying the backend.
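(A side note for anyone reading this on a newer PostgreSQL: since 9.5, INSERT ... ON CONFLICT DO NOTHING covers both fail scenarios in a single statement, with no race window; it wasn't available at the time of this discussion.)

-- PostgreSQL 9.5+: duplicate values within the list and rows inserted
-- concurrently by other sessions are both skipped, not errors
INSERT INTO q (x)
VALUES (1), (7), (7)
ON CONFLICT (x) DO NOTHING;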
The first fail scenario can probably be prevented by adding a GROUP BY v.id, v.t clause after the WHERE clause (or by using SELECT DISTINCT), so the values list is deduplicated before insertion.
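Applied to fail scenario 1, that fix would look something like:

insert into q (x)
select distinct v.i
from ( values (1), (7), (7) ) as v (i)
left join q on v.i = q.x
where q.x is null;
-- the two (7) rows collapse into one, so the insert succeeds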