Context:
I am working on a data warehouse that serves as the source for an operator console, so lookups need to be responsive. The target is sub-second lookups (ideally 100 ms).
The data volume is expected to grow into the millions of rows, but this can be capped, since older data is less relevant.
The tables:
Tables either contain data as an append-only log or act as an identifier marking which records are the “latest” for each resource.
First, some information on the relevant tables:
activity: 3.5 million rows, 34 columns (append-only log)
activitylatest: 1.1 million rows, 4 columns (identifies the latest activity rows)
round: 3.5 million rows, 17 columns (append-only log)
roundlatest: 0.7 million rows, 4 columns (identifies the latest round rows)
Indexes exist on all columns (except large text columns), since users may filter on any field, and “update_id” is the primary key of the data tables.
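For reference, the shape of the setup is roughly as follows (simplified names, only the columns relevant here; the real tables are the datawarehouse_game* ones appearing in the plans below):

-- Rough shape of the tables (simplified names, relevant columns only):
CREATE TABLE activity (
    update_id    bigint PRIMARY KEY,  -- "update_id" is the PK of the data tables
    round_id     bigint,
    community_id integer,
    activity_id  bigint
    -- ... ~30 further columns, each indexed individually
);
CREATE INDEX idx_activity_round_id     ON activity (round_id);
CREATE INDEX idx_activity_community_id ON activity (community_id);

CREATE TABLE activitylatest (
    update_id bigint  -- one row per “latest” activity
    -- ... 3 further columns
);
CREATE INDEX idx_activitylatest_update_id ON activitylatest (update_id);
-- round / roundlatest follow the same pattern.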
The Query
I’m trying to fetch the data in activity according to the user’s filters, which may include a single column from the round table.
select activity.update_id, ... activity.activity_id, round.round_external_ref
from activity
join round on activity.round_id = round.round_id
join roundlatest on round.update_id = roundlatest.update_id
join activitylatest on activity.update_id = activitylatest.update_id
where activity.community_id = 1
order by activity.update_id desc
limit 1001
Executing this query yields the following plan (EXPLAIN ANALYZE):
Limit  (cost=1.71..12367.37 rows=1001 width=2056) (actual time=2324.231..2341.835 rows=1001 loops=1)
  ->  Nested Loop  (cost=1.71..14219657.65 rows=1151081 width=2056) (actual time=2324.231..2341.721 rows=1001 loops=1)
        ->  Nested Loop  (cost=1.29..6270077.60 rows=5932256 width=2064) (actual time=2324.204..2333.877 rows=5158 loops=1)
              ->  Nested Loop  (cost=0.86..3654699.78 rows=1187917 width=2019) (actual time=0.024..1765.586 rows=390452 loops=1)
                    ->  Index Only Scan Backward using "IDX_datawarehouse_gameActivitylatest_update_id" on datawarehouse_gameactivitylatest  (cost=0.43..60604.50 rows=1199068 width=8) (actual time=0.015..165.819 rows=390452 loops=1)
                          Heap Fetches: 239183
                    ->  Index Scan using "PK_gameActivity_update_id" on datawarehouse_gameactivity  (cost=0.43..2.99 rows=1 width=2019) (actual time=0.002..0.003 rows=1 loops=390452)
                          Index Cond: (update_id = datawarehouse_gameactivitylatest.update_id)
                          Filter: (community_id = 1)
              ->  Index Scan using "IDX_datawarehouse_gameRound_round_id" on datawarehouse_gameround  (cost=0.43..2.15 rows=5 width=53) (actual time=0.001..0.001 rows=0 loops=390452)
                    Index Cond: (round_id = datawarehouse_gameactivity.round_id)
        ->  Index Only Scan using "IDX_datawarehouse_gameRoundlatest_update_id" on datawarehouse_gameroundlatest  (cost=0.42..1.33 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=5158)
              Index Cond: (update_id = datawarehouse_gameround.update_id)
              Heap Fetches: 1067
Planning time: 1.153 ms
Execution time: 2341.989 ms
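A side observation: the Heap Fetches: 239183 on the backward index-only scan suggests the visibility map of activitylatest is stale, so I intend to vacuum the “latest” tables before re-measuring (a sketch; whether autovacuum has been keeping up here is my assumption):

-- Refresh the visibility map so index-only scans avoid heap fetches:
VACUUM (ANALYZE) datawarehouse_gameactivitylatest;
VACUUM (ANALYZE) datawarehouse_gameroundlatest;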
The first time the query runs, it takes over 30 seconds (up to 60 seconds). Successive runs settle at roughly 5 seconds. I assume this is due to system caching; I need to get both the first and successive runs below 1 second.
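To take the cold-cache penalty off the first run, one option I’m considering is pre-warming the relevant relations after a restart, using the pg_prewarm extension that ships with 9.6 (a sketch; whether everything fits in shared_buffers/RAM is an assumption on my part):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Warm the index driving the backward scan plus the small "latest" tables:
SELECT pg_prewarm('"IDX_datawarehouse_gameActivitylatest_update_id"');
SELECT pg_prewarm('datawarehouse_gameactivitylatest');
SELECT pg_prewarm('datawarehouse_gameroundlatest');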
During my testing I noticed that simply switching the ordering from DESC to ASC improves the query to roughly 2.5 seconds, with a merge join being used. This has left me puzzled; I’m not sure of the cause, and I do need DESC.
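To narrow this down, I plan to run the DESC query with the nested-loop strategy disabled for the session, to see whether the planner can produce a backward merge join at all (purely a diagnostic; enable_nestloop is not something I would change in production):

-- Diagnostic only: steer the planner away from the nested-loop plan.
SET enable_nestloop = off;
EXPLAIN ANALYZE
select activity.update_id, activity.activity_id, round.round_external_ref
from activity
join round on activity.round_id = round.round_id
join roundlatest on round.update_id = roundlatest.update_id
join activitylatest on activity.update_id = activitylatest.update_id
where activity.community_id = 1
order by activity.update_id desc
limit 1001;
RESET enable_nestloop;

A related idea I have not tried yet: a composite index such as CREATE INDEX ON activity (community_id, update_id DESC), so that the community filter and the DESC ordering could be served by a single index.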
ASC EXPLAIN ANALYZE:
Limit  (cost=184.88..10004.46 rows=1001 width=2056) (actual time=0.064..23.341 rows=1001 loops=1)
  ->  Nested Loop  (cost=184.88..11301988.03 rows=1152097 width=2056) (actual time=0.063..23.265 rows=1001 loops=1)
        ->  Nested Loop  (cost=184.46..3332810.27 rows=5940024 width=2064) (actual time=0.054..14.535 rows=5227 loops=1)
              ->  Merge Join  (cost=184.03..710946.78 rows=1189471 width=2019) (actual time=0.046..7.915 rows=1001 loops=1)
                    Merge Cond: (datawarehouse_gameactivity.update_id = datawarehouse_gameactivitylatest.update_id)
                    ->  Index Scan using "PK_gameActivity_update_id" on datawarehouse_gameactivity  (cost=0.43..630047.01 rows=3522681 width=2019) (actual time=0.010..5.513 rows=3007 loops=1)
                          Filter: (community_id = 1)
                    ->  Index Only Scan using "IDX_datawarehouse_gameActivitylatest_update_id" on datawarehouse_gameactivitylatest  (cost=0.43..60662.53 rows=1200637 width=8) (actual time=0.007..0.204 rows=1001 loops=1)
                          Heap Fetches: 0
              ->  Index Scan using "IDX_datawarehouse_gameRound_round_id" on datawarehouse_gameround  (cost=0.43..2.15 rows=5 width=53) (actual time=0.002..0.005 rows=5 loops=1001)
                    Index Cond: (round_id = datawarehouse_gameactivity.round_id)
        ->  Index Only Scan using "IDX_datawarehouse_gameRoundlatest_update_id" on datawarehouse_gameroundlatest  (cost=0.42..1.33 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=5227)
              Index Cond: (update_id = datawarehouse_gameround.update_id)
              Heap Fetches: 88
Planning time: 1.124 ms
Execution time: 23.475 ms
(EXPLAIN ANALYZE reports 23 ms, but actually running the query takes 2.5 seconds according to pgAdmin.)
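My working assumption for that gap: EXPLAIN ANALYZE discards the result set, while pgAdmin has to receive and render 1001 rows of width ~2056 bytes. To check, I can make the server materialise the result without shipping the wide rows (a sketch wrapping the ASC query):

-- Measures server-side execution without transferring the wide rows:
SELECT count(*)
FROM (
    select activity.update_id, activity.activity_id, round.round_external_ref
    from activity
    join round on activity.round_id = round.round_id
    join roundlatest on round.update_id = roundlatest.update_id
    join activitylatest on activity.update_id = activitylatest.update_id
    where activity.community_id = 1
    order by activity.update_id asc
    limit 1001
) t;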
We are running PostgreSQL 9.6 on SSD-backed cloud VMs.