My question is: should kNN classifier accuracy decrease if the test portion of the dataset is increased? In my opinion it should decrease significantly, because there are fewer samples to train on. However, in my example it doesn't. So I need to find out whether something is wrong in my script, or whether I misunderstand how the kNN classifier works.
I've been doing Python machine learning tutorials by Sentdex. In one of them, I create a script for a k-nearest-neighbours classifier:
https://pythonprogramming.net/testing-our-k-nearest-neighbors-machine-learning-tutorial/?completed=/coding-k-nearest-neighbors-machine-learning-tutorial/
The dataset I’ve been working with is Breast Cancer Data Set:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
Here is what the script looks like:
import numpy as np
import matplotlib.pyplot as plt
import warnings
from math import sqrt
from collections import Counter
import pandas as pd
import random

def knn_clf(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('k is set to a value less than total voting groups')
    distances = []
    for group in data:
        for features in data[group]:
            # Euclidean distance over all features of the sample
            # (not just the first two, which would ignore most columns)
            euclidian_distance = sqrt(
                sum((f - p) ** 2 for f, p in zip(features, predict))
            )
            distances.append([euclidian_distance, group])
    # sorted(distances) orders the [euclidian_distance, group] pairs
    # from shortest to longest distance; keep the k nearest
    votes = [
        distance_group_pair[1]
        for distance_group_pair in sorted(distances)[:k]
    ]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

def get_data():
    df = pd.read_csv('data/breast-cancer-wisconsin.csv')
    df.replace('?', -9999, inplace=True)
    df.drop(['id'], axis=1, inplace=True)
    full_data = df.astype(float).values.tolist()
    return full_data

def train_test_split(data, test_size=0.2):
    random.shuffle(data)
    train_set = {2: [], 4: []}
    test_set = {2: [], 4: []}
    train_data = data[:-int(test_size * len(data))]
    test_data = data[-int(test_size * len(data)):]
    for sample in train_data:
        # the last column is the class label (2 = benign, 4 = malignant)
        train_set[sample[-1]].append(sample[:-1])
    for sample in test_data:
        test_set[sample[-1]].append(sample[:-1])
    return train_set, test_set

def test_clf(train_set, test_set):
    correct = 0
    total = 0
    for group in test_set:
        for data in test_set[group]:
            vote = knn_clf(train_set, data, k=5)
            if group == vote:
                correct += 1
            total += 1
    accuracy = correct / total
    return accuracy
Then I decided to try different test-size values:
if __name__ == '__main__':
    for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
        cancer_data = get_data()
        train_set, test_set = train_test_split(cancer_data, test_size=i)
        accuracy = test_clf(train_set, test_set)
        print('test size: ', i)
        print('accuracy: ', accuracy)
        print('------------------------------------\n')
And, strangely, accuracy is always above 90%, even when the test portion of the dataset is 80%:
test size:  0.1
accuracy:  0.9565217391304348
------------------------------------
test size:  0.2
accuracy:  0.9424460431654677
------------------------------------
test size:  0.3
accuracy:  0.9521531100478469
------------------------------------
test size:  0.4
accuracy:  0.9498207885304659
------------------------------------
test size:  0.5
accuracy:  0.9111747851002865
------------------------------------
test size:  0.6
accuracy:  0.9451073985680191
------------------------------------
test size:  0.7
accuracy:  0.9263803680981595
------------------------------------
test size:  0.8
accuracy:  0.9373881932021467
------------------------------------
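To separate a real trend from sampling noise, one can average accuracy over many random splits at each test size. Here is a minimal, self-contained sketch of that idea; it uses synthetic two-class data (two Gaussian blobs standing in for the breast-cancer CSV) and scikit-learn's `KNeighborsClassifier` as a stand-in for the hand-rolled classifier, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in for the dataset: two well-separated blobs, 9 features each
X = np.vstack([rng.normal(0, 1, (500, 9)), rng.normal(3, 1, (500, 9))])
y = np.array([2] * 500 + [4] * 500)  # same class labels as the cancer data

for test_size in [0.1, 0.5, 0.8]:
    accs = []
    for seed in range(20):  # average over 20 random splits to smooth out noise
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        accs.append(clf.score(X_te, y_te))
    print('test size:', test_size, 'mean accuracy:', round(np.mean(accs), 3))
```

Averaging like this makes it easier to see whether accuracy actually drifts downward as the training portion shrinks, or merely fluctuates from split to split.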
This seems illogical. Can someone explain why the accuracy does not fall when the test portion of the dataset is increased?