I have a table that, looking at just the relevant parts, has two columns: id is an integer, and raw_data is a text blob. At this point, the table has no constraints or indexes except for an index on id.
My goal is to deduplicate this data (by id) and dump it all to plaintext files (on Amazon S3).
Note that any rows with the same id can be assumed to be exact duplicates (so I only need one arbitrary row's data per id).
The table is on an Amazon EC2 RDS database with 2TB of space and 15GB of RAM. I can expand settings if needed, but I want this to run in a reasonable time (i.e. at most 24-48 hours, preferably faster).
The queries I'm trying to run (which are too slow) have this form:
SELECT DISTINCT ON (id) id, data FROM table OFFSET <0 through end of table> LIMIT 250000
The first few offsets run within a reasonable time, but the queries quickly become unmanageable (taking at least minutes to return) once the offset hits 10m+.
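To make the batching concrete, here is a minimal sketch of how the sequence of queries is generated. It assumes a Python driver script, which is an assumption made only for illustration; `batch_queries` and `total_rows` are hypothetical names, and the row count used is a made-up figure.

```python
from typing import Iterator

BATCH_SIZE = 250_000  # LIMIT value from the query above


def batch_queries(total_rows: int, batch_size: int = BATCH_SIZE) -> Iterator[str]:
    """Yield one paginated query per batch, stepping OFFSET by batch_size."""
    for offset in range(0, total_rows, batch_size):
        # Same query text as above; only the OFFSET changes between batches.
        yield (
            "SELECT DISTINCT ON (id) id, data FROM table "
            f"OFFSET {offset} LIMIT {batch_size}"
        )


if __name__ == "__main__":
    # total_rows is a hypothetical figure, not the real table size.
    for query in batch_queries(total_rows=1_000_000):
        print(query)
```

Each later batch asks the server to skip everything before its offset, and those skipped rows still have to be computed before they are discarded, which is consistent with the slowdown at large offsets.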
Since starting, I've created that id index, removed all other constraints and indexes (there are other columns besides the ones I described, but they are not relevant), set maintenance_work_mem to 4GB (for creating the id index), and most recently tried making the id index a clustered index. But this happened:
cluster id using idx_0;
ERROR: could not extend file "base/16390/46741.294": wrote only 4096 of 8192 bytes at block 38558630
HINT: Check free disk space.
1) Is SELECT DISTINCT ON with an OFFSET the right way to do this? Is there a more efficient query for pulling the data?
2) Is there anything else I can do to the DB/table to optimize? Would the clustered index solve my problem? Why is it taking over 1.1TB of extra space to deal with ~800GB of data?
Thanks for any advice!