I’ve been experimenting with scikit-learn for the past few months and have been finding it difficult to move away from the inbuilt datasets (such as Twenty Newsgroups and Iris) and onto my own text datasets.
I have finally managed to get something working, but am keen to get my code sense-checked just in case I’m tricking myself into thinking I’m doing better than I am.
The following code is based on this Sklearn tutorial, but uses my own dataset of approximately 25,000 text files spread across 273 subdirectories in the main project folder. Each directory name serves as a descriptive label for the text files contained within it.
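In case it helps, this is the kind of layout load_files() expects and that I’m working with: one subdirectory per category inside the container folder, with the folder names doubling as the labels. The category and file names here are placeholders for my real folder names:

!labelled_data_reportXML/
    category_a/
        doc_001.txt
        doc_002.txt
    category_b/
        doc_003.txt
    ...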
The objectives of the following code are as follows:
- Iterate over each subdirectory path in the main project folder to extract the name of each label (these are appended to the categories list, which is passed in to sklearn’s load_files() function).
- Use train_test_split() to hold out 40% of the dataset as test data.
- Transform the training and testing data to tf-idf (a single TfidfVectorizer should do the same job; see the sketch after this list).
- Train the classifier.
- Evaluate the classifier’s predictions with the test data.
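If I’ve understood the docs correctly, the CountVectorizer + TfidfTransformer pair I use below can be collapsed into a single TfidfVectorizer. A minimal sketch of that equivalence (the variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vocabulary and idf weights on the training text only,
# then reuse the same fitted vectorizer on the test text so both
# share one vocabulary.
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)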
The code below works and I’m currently averaging around 0.7 accuracy (so obviously, there’s some improvement still needed).
The main areas I feel might be lacking are the way in which I’m bringing the names of the categories in and the way I’m dealing with the test/train split.
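To make those two worries concrete, this is the kind of alternative I’ve been wondering about but haven’t properly tested; a sketch only, where os.path.basename for the labels and the stratify argument are my assumptions about the cleaner way to do it:

import os
from sklearn.model_selection import train_test_split

# Take each label from the final path component instead of
# string-replacing a hard-coded prefix.
categories = [os.path.basename(os.path.normpath(p)) for p in rawFolderPaths]

# Stratify the 60/40 split so all 273 labels appear in roughly the
# same proportions in both halves, and seed it for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    docs_to_train.data, docs_to_train.target,
    test_size=0.4, stratify=docs_to_train.target, random_state=42)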
Some feedback from more seasoned developers would be gratefully received.
The code
import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline

# Get paths to labelled data
rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

categories = []

# Extract the folder paths, reduce down to the label and append to the categories list
for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    categories.append(category)

# Load the data
print('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
                                            description=None, categories=categories,
                                            load_content=True, encoding='utf-8',
                                            shuffle=True, random_state=42)

# Split the dataset into training and testing sets
print('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

# THE TRAINING DATA

# Transform the training data into tfidf vectors
print('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# THE TEST DATA

# Transform the test data into tfidf vectors
print('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)

print(X_test_tfidf.shape)

docs_test = X_test

# Construct the classifier pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42, verbose=1)),
                     ])

# Fit the model to the training data
text_clf.fit(X_train, y_train)

# Run the test data into the model
predicted = text_clf.predict(docs_test)

# Calculate mean accuracy of predictions
print(np.mean(predicted == y_test))

# Generate labelled performance metrics
print(metrics.classification_report(y_test, predicted, target_names=docs_to_train.target_names))
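Finally, to sanity-check the ~0.7 figure against something less dependent on one particular 40% hold-out, I’ve been considering cross-validating the whole pipeline instead; a minimal sketch (cv=5 is an arbitrary choice on my part):

from sklearn.model_selection import cross_val_score

# Each fold refits the full pipeline (vectoriser included), so no
# vocabulary or idf information leaks between folds.
scores = cross_val_score(text_clf, docs_to_train.data, docs_to_train.target, cv=5)
print(scores.mean(), scores.std())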