We have a CSV file with 2,000 rows, a database table (table1) with about 2 million rows, and another table (table2) with 60,000 rows.
We need to query table1 based on parameters taken from the CSV file, so each CSV row should drive one SELECT query. Initially we tried to execute the following query within a for loop in the application:
SELECT table1.c1, table1.c2,
       ST_Distance_Sphere(point(csv.c2[i], csv.c3[i]), point(table1.c3, table1.c4)) * 0.2 AS length,
       table1.c5
FROM table1
WHERE table1.c1 IN (SELECT DISTINCT table2.c1
                    FROM table2
                    LEFT JOIN table1 ON table2.c1 = table1.c1
                    WHERE table2.c2 = 1 AND table2.c3 BETWEEN 1000 AND 2000)
HAVING length < csv.c4[i]
   AND table1.c5 BETWEEN date("start_date") AND date("end_date")
ORDER BY table1.c1
Here i in csv.c2[i], csv.c3[i], and csv.c4[i] is the loop index, i.e., those are the values taken from the i-th CSV row. Because of the 2,000 round trips to the MySQL server, this takes a very long time to complete: about 16 hours just to get through the queries.
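For clarity, each iteration binds one CSV row's values into the query as parameters. A minimal sketch of a single iteration using a server-side prepared statement (the @ variable names and example values are illustrative placeholders for the CSV values, not our real data):

-- One loop iteration, expressed as a prepared statement.
-- @lon, @lat, @max_length stand in for csv.c2[i], csv.c3[i], csv.c4[i].
PREPARE stmt FROM
  'SELECT table1.c1, table1.c2,
          ST_Distance_Sphere(point(?, ?), point(table1.c3, table1.c4)) * 0.2 AS length,
          table1.c5
   FROM table1
   WHERE table1.c1 IN (SELECT DISTINCT table2.c1
                       FROM table2
                       LEFT JOIN table1 ON table2.c1 = table1.c1
                       WHERE table2.c2 = 1 AND table2.c3 BETWEEN 1000 AND 2000)
   HAVING length < ?
      AND table1.c5 BETWEEN date("start_date") AND date("end_date")
   ORDER BY table1.c1';
SET @lon = 101.6, @lat = 3.1, @max_length = 5000;  -- example values from CSV row i
EXECUTE stmt USING @lon, @lat, @max_length;
DEALLOCATE PREPARE stmt;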
So, to avoid the round trips, I wrote the stored procedure below; calling it once moves the looping into the MySQL server:
DELIMITER $$

CREATE PROCEDURE sp()
BEGIN
  DECLARE arg0 VARCHAR(255);
  DECLARE arg1 FLOAT;
  DECLARE arg2 FLOAT;
  DECLARE arg3 FLOAT;
  DECLARE done INT DEFAULT FALSE;
  DECLARE cur1 CURSOR FOR SELECT * FROM csv_based_table;
  -- Without this handler, FETCH past the last row raises an error.
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

  OPEN cur1;
  read_loop: LOOP
    FETCH cur1 INTO arg0, arg1, arg2, arg3;
    IF done THEN
      LEAVE read_loop;
    END IF;

    SELECT col0, col1,
           ST_Distance_Sphere(point(arg1, arg2), point(col3, col4)) * 2 AS length,
           arg0
    FROM important_table1
    WHERE col0 IN (SELECT DISTINCT c0
                   FROM important_table2
                   LEFT JOIN important_table1 ON col0 = c0
                   WHERE c1 = 1 AND c2 BETWEEN 1000 AND 2000)
    HAVING length < arg3
       AND col5 BETWEEN date("start_date") AND date("end_date")
    ORDER BY arg0, col0;
  END LOOP;
  CLOSE cur1;
END$$

DELIMITER ;
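Here csv_based_table is just the CSV materialized as a table. A rough sketch of how it can be created, loaded, and the procedure invoked (the column names and file path are assumptions, not our real schema):

-- Hypothetical table matching the four columns the cursor fetches.
CREATE TABLE csv_based_table (
  c1 VARCHAR(255),  -- fetched into arg0
  c2 FLOAT,         -- fetched into arg1
  c3 FLOAT,         -- fetched into arg2
  c4 FLOAT          -- fetched into arg3
);

-- Assumed path; requires local_infile to be enabled on server and client.
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE csv_based_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

CALL sp();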
So, instead of reading the CSV file from the application, we load it into that database table and run the stored procedure once. The problem is that this turned out to be even slower than the original 2,000-iteration loop: in a test over 100 CSV rows, the original loop-based solution finished in 317.6 seconds, while the stored procedure alone took 321.5 seconds. Why is this happening, and how can I optimize it?