I have a Sanic app using asyncpg (0.13) to work with postgresql (9.5). My server now facing a problem which sometimes postgresql seems to be freezed and my app can’t even read from db and it seems like a downtime for the app, this issue happens frequently within around 30mins.
UPDATE1: All my graph is in UTC+7, log is in UTC time.
This is how our CPU looks like when the high IO happens:
The CPU iowait gets high at the points.
When I set a cronjob to capture iotop, I see the wal writer process
is taking a lot of io and also many INSERT processes:
16:30:54 29266 be/4 postgres 0.00 K/s 5077.89 K/s 0.00 % 31.95 % postgres: wal writer process 16:30:54 29307 be/4 postgres 58.96 K/s 117.92 K/s 0.00 % 12.55 % postgres: db db 127.0.0.1(43014) INSERT 16:30:54 29327 be/4 postgres 14.74 K/s 154.77 K/s 0.00 % 8.70 % postgres: db db 127.0.0.1(43648) INSERT 16:30:54 29359 be/4 postgres 0.00 K/s 176.88 K/s 0.00 % 2.84 % postgres: db db 127.0.0.1(45220) idle 16:30:54 29298 be/4 postgres 243.21 K/s 1562.43 K/s 0.00 % 0.82 % postgres: db db 127.0.0.1(42868) INSERT 16:30:54 29345 be/4 postgres 58.96 K/s 154.77 K/s 0.00 % 0.72 % postgres: db db 127.0.0.1(45134) INSERT 16:30:54 29354 be/4 postgres 73.70 K/s 397.98 K/s 0.00 % 0.51 % postgres: db db 127.0.0.1(45178) INSERT 16:30:54 29318 be/4 postgres 0.00 K/s 316.91 K/s 0.00 % 0.10 % postgres: db db 127.0.0.1(43498) INSERT 16:30:54 775 be/3 root 0.00 K/s 18.42 K/s 0.00 % 0.09 % [jbd2/sda3-8] 16:30:54 29344 be/4 postgres 0.00 K/s 302.17 K/s 0.00 % 0.08 % postgres: db db 127.0.0.1(45118) INSERT 16:30:54 29367 be/4 postgres 0.00 K/s 117.92 K/s 0.00 % 0.07 % postgres: db db 127.0.0.1(45342) INSERT 16:30:54 29350 be/4 postgres 0.00 K/s 95.81 K/s 0.00 % 0.06 % postgres: db db 127.0.0.1(45154) INSERT 16:30:54 29348 be/4 postgres 0.00 K/s 265.32 K/s 0.00 % 0.05 % postgres: db db 127.0.0.1(45148) INSERT 16:30:54 29352 be/4 postgres 7.37 K/s 331.65 K/s 0.00 % 0.04 % postgres: db db 127.0.0.1(45164) INSERT 16:30:54 29355 be/4 postgres 0.00 K/s 213.73 K/s 0.00 % 0.04 % postgres: db db 127.0.0.1(45183) INSERT .... 16:30:55 29341 be/4 postgres 0.00 K/s 14183.25 K/s 0.00 % 64.19 % postgres: db db 127.0.0.1(45084) SELECT 16:30:55 29319 be/4 postgres 0.00 K/s 966.20 K/s 0.00 % 9.50 % postgres: db db 127.0.0.1(43506) idle 16:30:55 29298 be/4 postgres 66.38 K/s 7.38 K/s 0.00 % 0.21 % postgres: db db 127.0.0.1(42868) INSERT 16:30:55 29313 be/4 postgres 0.00 K/s 1386.61 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(43276) idle 16:30:56 29378 be/4 postgres 0.00 K/s 12833.52 K/s 0.00 % 45.54 % postgres: db db 127.0.0.1(45522) idle 16:30:56 29266 be/4 postgres 0.00 K/s 8291.20 K/s 0.00 % 39.46 % postgres: wal writer process 16:30:56 29354 be/4 postgres 138.31 K/s 14.56 K/s 0.00 % 25.38 % postgres: db db 127.0.0.1(45178) INSERT 16:30:56 29307 be/4 postgres 21.84 K/s 0.00 K/s 0.00 % 2.76 % postgres: db db 127.0.0.1(43014) INSERT 16:30:56 29345 be/4 postgres 36.40 K/s 7.28 K/s 0.00 % 0.33 % postgres: db db 127.0.0.1(45134) INSERT 16:30:56 29298 be/4 postgres 21.84 K/s 14.56 K/s 0.00 % 0.12 % postgres: db db 127.0.0.1(42868) INSERT 16:30:56 29308 be/4 postgres 7.28 K/s 14.56 K/s 0.00 % 0.08 % postgres: db db 127.0.0.1(43016) INSERT 16:30:57 29378 be/4 postgres 0.00 K/s 7622.73 K/s 0.00 % 41.67 % postgres: db db 127.0.0.1(45522) idle 16:30:57 29266 be/4 postgres 0.00 K/s 7475.71 K/s 0.00 % 39.79 % postgres: wal writer process 16:30:57 29371 be/4 postgres 0.00 K/s 7438.96 K/s 0.00 % 11.22 % postgres: db db 127.0.0.1(45422) INSERT 16:30:57 29354 be/4 postgres 246.25 K/s 117.61 K/s 0.00 % 2.71 % postgres: db db 127.0.0.1(45178) INSERT 16:30:57 29308 be/4 postgres 36.75 K/s 29.40 K/s 0.00 % 1.23 % postgres: db db 127.0.0.1(43016) INSERT 16:30:57 29345 be/4 postgres 14.70 K/s 22.05 K/s 0.00 % 0.68 % postgres: db db 127.0.0.1(45134) INSERT 16:30:57 29288 be/4 postgres 22.05 K/s 7.35 K/s 0.00 % 0.08 % postgres: db db 127.0.0.1(42840) INSERT 16:30:58 29371 be/4 postgres 0.00 K/s 3445.89 K/s 0.00 % 33.64 % postgres: db db 127.0.0.1(45422) idle 16:30:58 29345 be/4 postgres 29.71 K/s 22.28 K/s 0.00 % 27.62 % postgres: db db 127.0.0.1(45134) INSERT 16:30:58 29266 be/4 postgres 0.00 K/s 4262.81 K/s 0.00 % 27.22 % postgres: wal writer process 16:30:59 29335 be/4 postgres 0.00 K/s 11094.36 K/s 0.00 % 56.84 % postgres: db db 127.0.0.1(44674) idle 16:30:59 29266 be/4 postgres 0.00 K/s 4193.67 K/s 0.00 % 14.18 % postgres: wal writer process 16:30:59 29314 be/4 postgres 177.51 K/s 0.00 K/s 0.00 % 0.80 % postgres: db db 127.0.0.1(43390) INSERT 16:30:59 29353 be/4 postgres 147.92 K/s 44.38 K/s 0.00 % 0.47 % postgres: db db 127.0.0.1(45174) INSERT 16:30:59 29355 be/4 postgres 22.19 K/s 0.00 K/s 0.00 % 0.38 % postgres: db db 127.0.0.1(45183) INSERT 16:30:59 29352 be/4 postgres 59.17 K/s 22.19 K/s 0.00 % 0.14 % postgres: db db 127.0.0.1(45164) INSERT 16:30:59 775 be/3 root 0.00 K/s 22.19 K/s 0.00 % 0.04 % [jbd2/sda3-8] 16:31:00 29266 be/4 postgres 0.00 K/s 6199.25 K/s 0.00 % 64.44 % postgres: wal writer process 16:31:00 29294 be/4 postgres 189.18 K/s 3383.39 K/s 0.00 % 1.06 % postgres: db db 127.0.0.1(42864) INSERT 16:31:00 29316 be/4 postgres 69.12 K/s 552.98 K/s 0.00 % 0.42 % postgres: db db 127.0.0.1(43478) INSERT 16:31:00 29307 be/4 postgres 7.28 K/s 734.89 K/s 0.00 % 0.03 % postgres: db db 127.0.0.1(43014) INSERT 16:31:00 29324 be/4 postgres 0.00 K/s 254.66 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(43574) INSERT 16:31:00 29354 be/4 postgres 0.00 K/s 312.87 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(45178) INSERT 16:31:00 29373 be/4 postgres 0.00 K/s 254.66 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(45454) INSERT 16:31:00 29284 be/4 postgres 0.00 K/s 371.08 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(42806) INSERT 16:31:00 29318 be/4 postgres 0.00 K/s 232.84 K/s 0.00 % 0.00 % postgres: db db 127.0.0.1(43498) INSERT ...(skipped as many times repeat like above)
Enable the slow query logs, I see many INSERT queries are being executed with duration is very high (this is just two sample log entries, I see many in slow query log):
2018-01-15 18:33:02 UTC [2335-9] db@db LOG: duration: 35436.589 ms execute __asyncpg_stmt_53__: INSERT INTO my_table (site_id, source, language, query, data) VALUES ($ 1, $ 2, $ 3, $ 4, $ 5) ON CONFLICT DO NOTHING; 2018-01-15 18:33:02 UTC [2335-10] db@db DETAIL: parameters: $ 1 = '121', $ 2 = 'source1', $ 3 = 'en', $ 4 = 'query1', $ 5 = '["data1", "data2", "data3", "data4", "data5", "data6", "data7", "data8", "data9"]' 2018-01-15 18:33:02 UTC [3564-1] db@db LOG: duration: 35109.508 ms execute __asyncpg_stmt_10__: INSERT INTO my_table (site_id, source, language, query, data) VALUES ($ 1, $ 2, $ 3, $ 4, $ 5) ON CONFLICT DO NOTHING; 2018-01-15 18:33:02 UTC [3564-2] db@db DETAIL: parameters: $ 1 = '16', $ 2 = 'source1', $ 3 = 'en', $ 4 = 'query2', $ 5 = '["fdata1", "fdata2", "fdata3"]'
I’ve tried with some changes to wal settings but no luck.
My server configuration:
- Ubuntu 16.04 LTS
- CPU: 16 Cores
- RAM: 48 GB
- SSD: 2 x 1TB (RAID)
Current configuration of postgresql.conf:
data_directory = '/var/lib/postgresql/9.5/main' hba_file = '/etc/postgresql/9.5/main/pg_hba.conf' ident_file = '/etc/postgresql/9.5/main/pg_ident.conf' external_pid_file = '/var/run/postgresql/9.5-main.pid' listen_addresses = '*' port = 5432 max_connections = 600 unix_socket_directories = '/var/run/postgresql' ssl = true ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem' ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key' shared_buffers = 12GB # min 128kB work_mem = 1024MB # min 64kB dynamic_shared_memory_type = posix max_wal_size = 2GB min_wal_size = 1GB checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 - 1.0 effective_cache_size = 36GB logging_collector = on log_directory = 'pg_log' log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern, log_min_duration_statement = 800 log_checkpoints = on log_line_prefix = '%t [%p-%l] %q%u@%d ' log_statement = 'none' log_timezone = 'localtime' stats_temp_directory = '/var/run/postgresql/9.5-main.pg_stat_tmp' datestyle = 'iso, mdy' timezone = 'localtime' lc_messages = 'en_US' lc_monetary = 'en_US' lc_numeric = 'en_US' lc_time = 'en_US' default_text_search_config = 'pg_catalog.english'
UPDATE1: update graph of Disk IO and checkpoint log at latest time. Timezone of graph is UTC+7, in logs is UTC
Disk IO
Checkpoint logs:
2018-01-16 01:08:31 UTC [2327-167] LOG: checkpoint starting: time 2018-01-16 01:12:01 UTC [2327-168] LOG: checkpoint complete: wrote 13099 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=209.579 s, sync=0.023 s, total=209.610 s; sync files=24, longest=0.007 s, average=0.000 s; distance=94894 kB, estimate=637235 kB 2018-01-16 01:13:31 UTC [2327-169] LOG: checkpoint starting: time 2018-01-16 01:17:01 UTC [2327-170] LOG: checkpoint complete: wrote 12166 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=209.925 s, sync=0.183 s, total=210.116 s; sync files=23, longest=0.093 s, average=0.007 s; distance=87555 kB, estimate=582267 kB 2018-01-16 01:18:31 UTC [2327-171] LOG: checkpoint starting: time 2018-01-16 01:22:01 UTC [2327-172] LOG: checkpoint complete: wrote 13424 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=209.778 s, sync=0.318 s, total=210.105 s; sync files=23, longest=0.271 s, average=0.013 s; distance=97418 kB, estimate=533782 kB 2018-01-16 01:23:31 UTC [2327-173] LOG: checkpoint starting: time 2018-01-16 01:27:01 UTC [2327-174] LOG: checkpoint complete: wrote 14004 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=209.586 s, sync=0.019 s, total=209.614 s; sync files=23, longest=0.003 s, average=0.000 s; distance=100902 kB, estimate=490494 kB 2018-01-16 01:28:31 UTC [2327-175] LOG: checkpoint starting: time 2018-01-16 01:31:55 UTC [2327-176] LOG: checkpoint complete: wrote 13732 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=201.798 s, sync=1.494 s, total=204.044 s; sync files=23, longest=0.928 s, average=0.064 s; distance=99437 kB, estimate=451388 kB 2018-01-16 01:33:31 UTC [2327-177] LOG: checkpoint starting: time 2018-01-16 01:37:01 UTC [2327-178] LOG: checkpoint complete: wrote 38289 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=209.782 s, sync=0.746 s, total=210.560 s; sync files=25, longest=0.705 s, average=0.029 s; distance=777938 kB, estimate=777938 kB 2018-01-16 01:38:31 UTC [2327-179] LOG: checkpoint starting: time 2018-01-16 01:42:01 UTC [2327-180] LOG: checkpoint complete: wrote 13577 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 47 recycled; write=209.081 s, sync=0.127 s, total=209.243 s; sync files=24, longest=0.081 s, average=0.005 s; distance=99142 kB, estimate=710058 kB 2018-01-16 01:43:31 UTC [2327-181] LOG: checkpoint starting: time 2018-01-16 01:47:01 UTC [2327-182] LOG: checkpoint complete: wrote 14056 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=209.812 s, sync=0.532 s, total=210.354 s; sync files=23, longest=0.493 s, average=0.023 s; distance=102243 kB, estimate=649277 kB 2018-01-16 01:48:31 UTC [2327-183] LOG: checkpoint starting: time 2018-01-16 01:52:01 UTC [2327-184] LOG: checkpoint complete: wrote 13446 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=209.405 s, sync=0.019 s, total=209.435 s; sync files=23, longest=0.006 s, average=0.000 s; distance=97228 kB, estimate=594072 kB 2018-01-16 01:53:31 UTC [2327-185] LOG: checkpoint starting: time
Any suggestions to solve this issue?
Thanks,