I’m looking to parse CSV files containing multiple tables using Python 3’s csv module.
These complex CSVs are not unlike the toy example below. My goal is to make an idiom for picking out any one table using a known header row.
Complex CSV file toy.csv:

    lists, fruits, books, forks, rope, gum
    4, 2, 3, 0, 2, 2

    Manhattan Produce Market
    id, fruit, color
    1, orange, orange
    2, apple, red

    Books
    id, book, pages
    1, Webster’s Dictionary, 1000
    2, Tony the Towtruck, 20
    3, The Twelfth Night, 144

    Rope
    id, rope, length, diameter, color
    1, hemp, 12-feet, .5, green
    2, sisal, 50-feet, .125, brown

    Kings County Candy
    id, flavor, color, big-league
    1, grape, purple, yes
    2, mega mango, yellow-orange, no
Each table is preceded by a title (except for a garbage table at the start). I save the previous row, and when I match the correct table header, I add the title as a new column.
    import csv, re

    header = []  #doesn't need to be list, but I'm thinking ahead
    table = []

    with open('toy.csv', 'r') as blob:
        reader = csv.reader(blob)
        curr = reader.__next__()
        while True:
            prev = curr
            try:
                curr = reader.__next__()
            except StopIteration:
                break
            if not ['id', ' book', ' pages'] == curr:
                continue
            else:
                header.append(prev)
                table.append(['title'] + curr)
                while True:
                    try:
                        curr = reader.__next__()
                        if curr == []:
                            break
                        else:
                            table.append(header[0] + curr)
                    except StopIteration:
                        break
The first step is to make an idiom that I can simply repeat for each table I want to extract. Later, I will combine the tables into one super-table, filling in NaNs where the table headers don’t match.
Resulting table:

    [['title', 'id', ' book', ' pages'],
     ['Books', '1', ' Webster’s Dictionary', ' 1000'],
     ['Books', '2', ' Tony the Towtruck', ' 20'],
     ['Books', '3', ' The Twelfth Night', ' 144']]
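For reference, here is a rough sketch of the repeatable idiom I have in mind, plus the later super-table step. It is only a sketch under some assumptions: the names `extract_table` and `combine_tables` are hypothetical, tables are assumed to be separated by blank lines, `None` stands in for NaN, and `skipinitialspace=True` strips the space after each comma, so the cells come out as `'book'` rather than `' book'` as in the table above. `SAMPLE` is a trimmed copy of toy.csv.

```python
import csv
import io

# Trimmed copy of toy.csv, used only for illustration.
SAMPLE = """\
lists, fruits, books, forks, rope, gum
4, 2, 3, 0, 2, 2

Books
id, book, pages
1, Webster’s Dictionary, 1000
2, Tony the Towtruck, 20
3, The Twelfth Night, 144

Rope
id, rope, length, diameter, color
1, hemp, 12-feet, .5, green
2, sisal, 50-feet, .125, brown
"""


def extract_table(lines, header):
    """Return the table whose header row equals `header`, title prepended.

    `lines` is any iterable of CSV text lines (an open file works too).
    Returns an empty list if the header is never found.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    prev = None
    for row in reader:
        if row == header:
            # The row just before the header is taken to be the title.
            title = prev if prev else ['']
            table = [['title'] + row]
            # Keep reading from the same reader until a blank line or EOF.
            for data in reader:
                if not data:
                    break
                table.append(title + data)
            return table
        prev = row
    return []


def combine_tables(tables, fill=None):
    """Stack tables into one, filling cells for missing columns with `fill`."""
    columns = []
    for table in tables:
        for name in table[0]:
            if name not in columns:
                columns.append(name)
    combined = [columns]
    for table in tables:
        head = table[0]
        for row in table[1:]:
            record = dict(zip(head, row))
            combined.append([record.get(name, fill) for name in columns])
    return combined


books = extract_table(io.StringIO(SAMPLE), ['id', 'book', 'pages'])
rope = extract_table(io.StringIO(SAMPLE),
                     ['id', 'rope', 'length', 'diameter', 'color'])
for row in combine_tables([books, rope]):
    print(row)
```

With a real file, `extract_table(open('toy.csv', newline=''), header)` should work the same way, since `csv.reader` accepts any iterable of lines.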
This is my first post on Code Review (I hope I’m asking an appropriate question). The code is based on an SE post, “How do I match the line before matched pattern using regular expression”.
Happy to hear your suggestions to make the code more compact, idiomatic, and fit for my goals.