PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

I have two queries that are functionally identical. One performs very well, the other very poorly, and I do not see where the performance difference comes from.

Query #1:

SELECT id 
FROM subsource_position
WHERE
id NOT IN (SELECT position_id FROM subsource)

This comes back with the following plan:

                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     -> Materialize (cost=0.00..2566.50 rows=101500 width=4)
           -> Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4)

Query #2:

SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;

Plan:

                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 SetOp Except (cost=24760.35..25668.66 rows=95997 width=4)
   -> Sort (cost=24760.35..25214.50 rows=181663 width=4)
         Sort Key: "*SELECT* 1".id
         -> Append (cost=0.00..6406.26 rows=181663 width=4)
               -> Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4)
                     -> Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4)
               -> Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4)
                     -> Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)

I have a feeling I’m missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?

Additional data:

=> select count(*) from subsource;
 count
-------
 85158
(1 row)

=> select count(*) from subsource_position;
 count
-------
 93261
(1 row)

Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.

Edit 2: I’m using PostgreSQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.

Edit 3: I have an index on both these columns. I haven’t yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.

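Since you are running with the default configuration, try raising work_mem. The default allows only 1MB of work memory, and the subquery result most likely does not fit in that. In that case the planner cannot use a hashed subplan and instead rescans the materialized subquery for every row of the outer table, which is what the plain "SubPlan 1" in your first plan suggests. Try 10MB or 20MB.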
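A minimal sketch of how this could be tested in a psql session, using the table and column names from the question. The 20MB value is only a starting point rather than a tuned recommendation, and SET affects the current session only.

```sql
-- Raise work memory for this session only (the stock default is 1MB).
SET work_mem = '20MB';

-- Re-check the plan for query #1. If the subquery result now fits in
-- work_mem, the filter line should read "NOT (hashed SubPlan 1)".
EXPLAIN
SELECT id
FROM subsource_position
WHERE id NOT IN (SELECT position_id FROM subsource);

-- Return to the configured value when done experimenting.
RESET work_mem;
```

If the plan switches to a hashed subplan and the query returns quickly, the setting can be made permanent in postgresql.conf.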