I’ve been experimenting with scikit-learn for the past few months and have been finding it difficult to move away from the inbuilt datasets (such as Twenty Newsgroups and Iris) and onto my own text datasets.
I have finally managed to get something working, but am keen to get my code sense-checked just in case I’m tricking myself into thinking I’m doing better than I am.
The following code is based on this Sklearn tutorial, but uses my own dataset of approximately 25,000 text files spread across 273 subdirectories in the main project folder. Each directory name serves as a descriptive label for the text files contained within it.
The objectives of the following code are as follows:
- Iterate over each subdirectory path in the main project folder to extract the name of each label (these are appended to the categories list, which is passed in to sklearn’s load_files())
- Split the dataset with train_test_split() to hold out 40% of the dataset as test data
- Transform the training and testing data to tf-idf vectors
- Train the classifier
- Evaluate the classifier’s predictions with the test data
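The label-extraction step in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the code from the post: the helper name list_labels and the three folder names are hypothetical, and a temporary folder stands in for the real project directory. Note also that load_files() loads every subfolder and uses the folder names as target_names when categories is left as None, so an explicit list may not be needed at all.

```python
import tempfile
from pathlib import Path


def list_labels(root):
    """Return the sorted names of the immediate subdirectories of root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


# Throwaway folder tree standing in for the real labelled-data folder.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("contract", "crime", "tort"):  # hypothetical labels
        (Path(tmp) / label).mkdir()
    # A stray file at the top level should not be treated as a label.
    (Path(tmp) / "notes.txt").write_text("not a label")
    labels = list_labels(tmp)

print(labels)  # ['contract', 'crime', 'tort']
```

Using Path.name avoids the string replace/strip on the full path, which would silently break if the project folder were moved.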
The code below works and I’m currently averaging around 0.7 accuracy (so obviously, there’s some improvement still needed).
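One way to sense-check an accuracy figure like this is to compare it against a trivial baseline evaluated in exactly the same way. The sketch below is an assumption-laden illustration on toy data, not a result from the real corpus: the two toy classes, the use of TfidfVectorizer in place of the CountVectorizer/TfidfTransformer pair, and 5-fold cross-validation are all choices made here for brevity.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy two-class corpus (hypothetical stand-in for the real documents).
docs = ["appeal court judge"] * 10 + ["goal match team"] * 10
labels = [0] * 10 + [1] * 10

# Baseline: always predict the most frequent class seen in training.
baseline = make_pipeline(TfidfVectorizer(),
                         DummyClassifier(strategy="most_frequent"))
# Model: same kind of linear classifier as in the post.
model = make_pipeline(TfidfVectorizer(),
                      SGDClassifier(loss="hinge", random_state=42))

# Both are scored with the same 5-fold cross-validation.
baseline_acc = cross_val_score(baseline, docs, labels, cv=5).mean()
model_acc = cross_val_score(model, docs, labels, cv=5).mean()

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

With 273 classes the most-frequent-class baseline on the real data would presumably be far below 0.7, but that depends on how unevenly the files are spread across folders, so it is worth measuring rather than assuming.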
The main areas I feel might be lacking are the way in which I’m bringing the names of the categories in and the way I’m dealing with the test/train split.
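On the test/train split, one option worth considering is a stratified split, so that each label appears in the training and test sets in roughly the same proportion. The sketch below uses toy lists in place of docs_to_train.data and docs_to_train.target; the fixed random_state is an assumption added here so the split is reproducible between runs.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for docs_to_train.data and docs_to_train.target.
docs = [f"document {i}" for i in range(20)]
targets = [0] * 10 + [1] * 10

# stratify=targets keeps the class proportions equal in both halves;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    docs, targets, test_size=0.4, stratify=targets, random_state=42)

print(Counter(y_train))  # 6 documents per class
print(Counter(y_test))   # 4 documents per class
```

Stratification requires at least two documents per label, so with 273 folders any folder holding a single file would need to be merged or dropped first.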
Some feedback from more seasoned developers would be gratefully received.
    import sklearn
    import numpy as np
    from glob import glob
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn import metrics
    from sklearn.pipeline import Pipeline

    # Get paths to labelled data
    rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")
    categories = []

    # Extract the folder paths, reduce down to the label and append to the categories list
    for i in rawFolderPaths:
        string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/', '')
        category = string1.strip('/')
        categories.append(category)

    # Load the data
    print('\nLoading the dataset...\n')
    docs_to_train = sklearn.datasets.load_files(
        "/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
        description=None, categories=categories, load_content=True,
        encoding='utf-8', shuffle=True, random_state=42)

    # Split the dataset into training and testing sets
    print('\nBuilding out hold-out test sample...\n')
    X_train, X_test, y_train, y_test = train_test_split(
        docs_to_train.data, docs_to_train.target, test_size=0.4)

    # THE TRAINING DATA
    # Transform the training data into tfidf vectors
    print('\nTransforming the training data...\n')
    count_vect = CountVectorizer(stop_words='english')
    X_train_counts = count_vect.fit_transform(raw_documents=X_train)
    tfidf_transformer = TfidfTransformer(use_idf=True)
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    # THE TEST DATA
    # Transform the test data into tfidf vectors
    print('\nTransforming the test data...\n')
    count_vect = CountVectorizer(stop_words='english')
    X_test_counts = count_vect.fit_transform(raw_documents=X_test)
    tfidf_transformer = TfidfTransformer(use_idf=True)
    X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
    print(X_test_tfidf.shape)

    docs_test = X_test

    # Construct the classifier pipeline
    text_clf = Pipeline([
        ('vect', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfTransformer(use_idf=True)),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              random_state=42, verbose=1)),
    ])

    # Fit the model to the training data
    text_clf.fit(X_train, y_train)

    # Run the test data into the model
    predicted = text_clf.predict(docs_test)

    # Calculate mean accuracy of predictions
    print(np.mean(predicted == y_test))

    # Generate labelled performance metrics
    print(metrics.classification_report(y_test, predicted,
                                        target_names=docs_to_train.target_names))
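One detail in the listing above is worth illustrating: the test data is passed through a freshly fitted CountVectorizer and TfidfTransformer, which builds a separate vocabulary from the test documents. Those manual vectors are never given to the classifier (the Pipeline does its own vectorising from the raw text), so the reported accuracy is not affected, but if the manual route were used, the usual pattern is to fit on the training data only and call transform() on the test data. A minimal sketch on made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents standing in for X_train and X_test.
train_docs = ["the court allowed the appeal", "the appeal was dismissed"]
test_docs = ["the claimant brought an appeal"]

# Fit the vocabulary and idf weights on the training data only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_docs)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Reuse the fitted objects on the test data: transform(), not fit_transform().
X_test_counts = count_vect.transform(test_docs)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Refitting on the test data instead yields a different feature space.
X_test_refit = CountVectorizer().fit_transform(test_docs)

print(X_train_tfidf.shape, X_test_tfidf.shape, X_test_refit.shape)
```

The train and test matrices share the same number of columns because they share one vocabulary, while the refitted matrix does not, which is why a classifier trained on the former could not meaningfully score the latter. Since the Pipeline already handles this internally, the manual transformation blocks could probably be removed altogether.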