Why is your dbms not using your index?
So recently I was trying to gain some knowledge about indexes using Postgres, going beyond the tutorials I've done and so I attempted to make some queries faster using b-tree indexes on fish count data.
However, before getting into any complex queries I tried out a basic one:
select * from chinook where year > '2000';
When I used explain analyze on this query it gave me:
----------------------------------------------------------------------------------------------
Seq Scan on yearly_chinook_sum_view (cost=0.00..1.15 rows=4 width=40) (actual time=12.559..12.584 rows=12 loops=1)
Filter: ((year)::text > '2000'::text)
So why is it still doing a sequential scan?
First I looked at my table to make sure I really had added a b-tree index on the table named chinook using \d chinook which confirmed that yes, I have a b-tree index on the year column.
Then I started searching for the issue and found these articles:
http://www.depesz.com/2010/09/09/why-is-my-index-not-being-used/
http://www.postgresonline.com/journal/archives/78-Why-is-my-index-not-being-used.html
There were a few possible reasons why my query wasn't using the index, but the one that seemed most likely to me was that it wasn't efficient due to table size.
Next step, how big is the Chinook table?
https://wiki.postgresql.org/wiki/Disk_Usage
So from Wiki I used this join to find my total disk usage including indexes:
SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND C.relkind <> 'i' AND nspname !~ '^pg_toast' ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20;
And for Chinook it gave me:
So this is fairly small, but I don't really feel like doing tests to see where the threshold is. What I did test were queries that would return one row versus a larger percentage of rows.
kenai_fishruns=# explain analyze select * from chinook where year = '2009';
----------------------------------------------------------------------------------------------
Index Scan using chinook_year_index on chinook (cost=0.28..9.67 rows=48 width=44) (actual time=30.929..31.151 rows=48 loops=1)
Index Cond: ((year)::text = '2009'::text)
kenai_fishruns=# explain analyze select * from chinook where year > '2009';
----------------------------------------------------------------------------------------------
Seq Scan on chinook (cost=0.00..13.45 rows=203 width=44) (actual time=0.033..1.258 rows=203 loops=1)
Filter: ((year)::text > '2009'::text)
Rows Removed by Filter: 393
So this started pointing me in the right direction, I figured it had to do with the amount of rows returned versus some error in how I indexed my tables, so I googled and stackoverflow came to the rescue again:
http://stackoverflow.com/questions/5203755/why-does-postgresql-perform-sequential-scan-on-indexed-column
Basically the answer to this post says that if a query returns approximately more than 5-10% (roughly) of the total rows, then it will use a sequential scan versus an index scan.
This makes sense with my little test between the queries returning the one row versus the many rows it would be returning for the second query. Unfortunately this stackoverflow post didn't cite where to find this info in the docs, but that's research for another day. At least now I can rest easy knowing the general why behind Postgres choosing to use a sequential scan versus an index scan.