Hope you can help me out with this one, because my current approach is really slow. Is there a way to do this without loading the whole .csv into memory?
The thing is… I have files containing timeseries data with 10 columns: the first column is a datetime, the last is an integer, and the rest are floats.
I am trying to join pairs of .csv files together. The filenames are:
- Myfile_1withdata
- Myfile_1withdata1
- Myfile_2withdata
- Myfile_2withdata1
- Myfile5_1withdata
- Myfile5_1withdata1
etc…
The files with a “1” at the end are the new files containing updated data that I want to add (append) to the files without a “1” at the end, like “Myfile5_1withdata.csv”.
Files can weigh up to 500 MB, there are many of them, and the whole process takes a long time to finish… Can it be faster?
Currently I have tried to accomplish this by doing:
```python
import glob
import inspect
import os

import pandas as pd

# Directory of the running script
currentpath = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))

type_names = {'1withdata': "super", '2withdata': "extra"}
file_names = ["Myfile", "Myfile5"]

for a in file_names:
    for x in type_names.keys():
        results = pd.DataFrame([])
        # sorted() so the original file always comes before its "1" update file
        for file in sorted(glob.glob(a + '_' + x + "*")):
            namedf = pd.read_csv(file, index_col=0, skiprows=0, dtype=str,
                                 usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                                 float_precision='high')
            results = results.append(namedf)  # DataFrame with the data of all files sharing the same type_names key
        print("saving: ", a, x)
        results = results[~results.index.duplicated(keep='last')]  # Remove the duplicated row (the old file's last row with incomplete timeseries data)
        results.to_csv(a + '_' + x + '.csv')

print("DONE!")

# Cleanup: delete the update files (the ones whose names end with a digit)
files = [file for file in glob.glob(currentpath + "/*.csv") if file[-5:-4].isdigit()]
for file in files:
    os.remove(file)
```
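One idea I had: would it help to collect the frames in a list and call pd.concat once, since (as far as I understand) appending inside the loop re-copies all the accumulated data on every pass? Something like this sketch of the inner part (untested on the big files):

```python
import glob

import pandas as pd

type_names = {'1withdata': "super", '2withdata': "extra"}
file_names = ["Myfile", "Myfile5"]

for a in file_names:
    for x in type_names:
        # Read every matching file once, then concatenate in a single call
        frames = [pd.read_csv(f, index_col=0, dtype=str,
                              usecols=list(range(10)),
                              float_precision='high')
                  for f in sorted(glob.glob(a + '_' + x + "*"))]
        if not frames:
            continue  # no files for this combination
        results = pd.concat(frames)
        results = results[~results.index.duplicated(keep='last')]
        results.to_csv(a + '_' + x + '.csv')
```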
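And for the memory side, would a raw byte-level append like the sketch below be safe? It never parses the files at all, but it assumes both files share the same header and that the old file ends with a newline, and it would not remove the duplicated incomplete row like my pandas code does:

```python
import shutil

def append_update(base_path, update_path):
    # Stream the update file onto the end of the base file without parsing it
    with open(update_path, 'rb') as src, open(base_path, 'ab') as dst:
        src.readline()                # skip the update file's header row
        shutil.copyfileobj(src, dst)  # copy the rest in buffered chunks

append_update('Myfile_1withdata.csv', 'Myfile_1withdata1.csv')
```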