I have a Python script that classifies messages by grouping similar-looking messages into categories. Each message carries a score, which is tallied up later in the script.
```python
import collections
import csv
import math
import re

import Levenshtein

items = []
cats = collections.OrderedDict()

# Do some preparation with the messages
punctuation_re = re.compile(r"([^\w\s.])")
multiple_spaces_re = re.compile(r"\s+")

# Load the messages from the CSV
with open('sentences.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f, fieldnames=["message", "score"])
    for row in reader:
        # Strip stopwords, punctuation and repeated whitespace
        match_desc = re.sub(r'(\s+)(a|an|and|the|is|are|or|to|on|under|in|about|at|have|had|was|were)(\s+)', '', row["message"])
        match_desc = re.sub(punctuation_re, "", match_desc)
        match_desc = re.sub(multiple_spaces_re, " ", match_desc)
        row["match_desc"] = match_desc.strip().lower()
        items.append(row)

# Match the items
for item in sorted(items, key=lambda x: len(x["match_desc"])):
    hasMatch = False
    if len(item["match_desc"]) <= 5:
        # Too short to classify reliably -- don't bother
        continue
    elif len(item["match_desc"]) > 15:
        # If the string is longer than 15 characters, use Levenshtein distance
        for k, v in reversed(cats.items()):
            # If the string length difference is greater than the maximum
            # Levenshtein distance allowed, skip the comparison
            if abs(len(item["match_desc"]) - len(k)) > math.ceil(len(item["match_desc"]) * 0.2):
                continue
            # Else compute the Levenshtein distance
            if (item["match_desc"] == k
                    or Levenshtein.distance(item["match_desc"], k) < math.ceil(len(item["match_desc"]) * 0.2)):
                cats[k].append(item)
                hasMatch = True
                break
    else:
        # Use exact match for mid-length strings
        if item["match_desc"] in cats:
            cats[item["match_desc"]].append(item)
            continue
    if not hasMatch:
        cats[item["match_desc"]] = [item]
```
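For reference, the normalisation step boils a raw message down to its `match_desc` key like so (the `normalise` helper name and the sample messages are mine, just for illustration; the regexes are the same as above):

```python
import re

punctuation_re = re.compile(r"([^\w\s.])")
multiple_spaces_re = re.compile(r"\s+")


def normalise(message):
    # Same pipeline as in the script: drop stopwords, then punctuation,
    # then collapse runs of whitespace. Note that the stopword pattern
    # consumes the surrounding whitespace too, so its neighbours are joined.
    match_desc = re.sub(r'(\s+)(a|an|and|the|is|are|or|to|on|under|in|about|at|have|had|was|were)(\s+)', '', message)
    match_desc = re.sub(punctuation_re, "", match_desc)
    match_desc = re.sub(multiple_spaces_re, " ", match_desc)
    return match_desc.strip().lower()


print(normalise("Server DOWN!! Disk full."))  # -> server down disk full.
print(normalise("disk is full"))              # -> diskfull
```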
The problem is that this script is $O(n^2)$ and struggles with large datasets, since each new item has to be compared against more and more existing categories. For example, a 70,000-message dataset with messages of varying lengths takes 20 minutes on my machine (Python 3.6.3).
Some speedups I’ve included are:
- Discarding messages that are too short
- Only running the Levenshtein distance when the strings' lengths are similar (I use python-Levenshtein, which is written in C)
- Using an `OrderedDict()` and iterating over it in reverse so that the loop is more likely to exit early (as the list of messages is sorted by length, a match is most likely among the recently inserted categories)
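The length check in the second point is safe because the Levenshtein distance between two strings can never be smaller than their length difference: each surplus character costs at least one insertion or deletion. A minimal sketch demonstrating the bound (the pure-Python `levenshtein` helper here is just for illustration; the script itself calls the C implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# The distance is bounded below by the length difference, so comparing
# lengths first is a cheap way to rule out a pair without computing it.
a, b = "connection timed out", "connection timeout"
assert abs(len(a) - len(b)) <= levenshtein(a, b)

print(levenshtein("kitten", "sitting"))  # -> 3
```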
Are there any avenues of speedups possible that I’m missing?