I have quite a long question that stems from a feeling of desperation with this problem. To start: I am writing my thesis and it is now long overdue; the reasons are plentiful, but procrastination (as well as being abandoned in a group of two) might be a big one. Let me start by describing what I am trying to accomplish, and why.
This is a story, so to the impatient out there: please excuse me before I start, but I feel the backstory needs to be told in order to explain what I am after.
The thesis is supposed to be about classification of sensor data; so far, so good, right? The professor who gave us this idea is not great at explaining his needs, but it is for a proposed edge controller in the field of IoT, where the classification is supposed to be autonomous. The only problem is that it is one of a few projects that were supposed to cover the DAP (Decision making, Action and Prediction, yet another unneeded acronym) process this edge controller should work by, in order to automate some processes based on the incoming sensor data.
After changing lanes a few times, and after being abandoned by my group, my professor has agreed to simply allow the thesis to be about classification of data, plain and simple.
But there was no data to work from, so I had to write code to generate dummy data from a plausible scenario in order to test various classification methods. Mind you, I have NO idea how to structure such data to fit any existing classifier, so I just tried to make the data as realistic as possible. The data looks something like this:
After the subway station “opens for business”, the people counter goes up. Temperatures fluctuate, and the people counter resets every “day”:
Tue Jan 02 01:30:00 CET 2018  10 C  18 C  closed    0 people
Tue Jan 02 02:00:00 CET 2018   8 C  18 C  closed    0 people
Tue Jan 02 04:30:00 CET 2018   8 C  18 C  open     90 people
Tue Jan 02 05:00:00 CET 2018   8 C  18 C  open    200 people
After formatting this to be numerical data only, it looks like this:
1514761260000,10,18,0,0
1514761320000,10,18,0,0
1514761380000,10,18,0,0
This data covers every minute for one week, in total (24 × 60) × 7 = 10080 rows and 5 columns. I think this translates to 5 classes with 10080 samples for each class? Please correct me if I am wrong!
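To make sure I am using the right words, here is how I currently picture the shape in scikit-learn terms (a minimal sketch; the `data` array is just a placeholder standing in for my CSV):

```python
import numpy as np

# My current understanding (happy to be corrected): scikit-learn treats each
# ROW as a sample and each COLUMN as a feature, so one week of minute data is
# 10080 samples x 5 features. "Class" would be the label to predict, and my
# generated data does not have a separate label column yet.
data = np.zeros((24 * 60 * 7, 5))  # same shape as the numerical CSV

n_samples, n_features = data.shape
print(n_samples, n_features)  # 10080 5
```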
Then, to the question: I need to classify this data.
Vague, right? Well, I have read up on classification and touched upon Recurrent Neural Networks with Long Short-Term Memory, Convolutional Neural Networks, Support Vector Machines, and K-Nearest Neighbors, but I am going blinder and blinder as desperation increases.
I spent a long time with C++ and a framework called TinyRNN (https://github.com/peterrudenko/tiny-rnn), only to realise that I was trying to reinvent the wheel; when I (finally) asked for help, the author made me realise that his RNN implementation lacks a softmax layer that I apparently need…?
So I went to Python, scikit-learn and Keras as a few well-documented classification frameworks, but what to do now?
Mind you, I am a complete beginner with Python: one week's ‘training’ under the belt. I have C++ and Java experience.
I have written the code snippet below in Python to produce a training set and a testing set, in both wide-form and long-form layouts (since I have no idea what the classifiers in scikit-learn expect). I have created equal-sized 2D arrays with all the labels of the data, and now I don't really know what to do next.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load both layouts of the generated scenario data
scenarioDataWide = pd.read_csv('dataWide.csv', header=None)
scenarioDataLong = pd.read_csv('dataLong.csv', header=None)

# Wide form: one label per row, repeated across 10080 columns
targetWideList = [["time"], ["outTemp"], ["inTemp"], ["isOpen"], ["pplCounter"]]
targetWide = np.repeat(np.array(targetWideList), 10080, axis=1)

# Long form: one label per column, tiled down 10080 rows
targetLongList = ["time", "outTemp", "inTemp", "isOpen", "pplCounter"]
targetLong = np.tile(np.array(targetLongList), (10080, 1))

dataTrain, dataTest, targetTrain, targetTest = train_test_split(scenarioDataLong, targetLong)
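One framing I have been toying with (not sure it is the right one): pick a single column, e.g. `isOpen`, as the label and let the remaining columns be the features, which turns the problem into ordinary single-output classification. Here is a sketch with a random forest; the synthetic frame only mimics my CSV format, and for the real data you would swap in `pd.read_csv('dataLong.csv', header=None)`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataLong.csv: one row per minute for a week.
rng = np.random.default_rng(0)
n = 10080
minutes = np.arange(n)
is_open = ((minutes % 1440) > 270).astype(int)  # "opens" after ~04:30 each day

frame = pd.DataFrame({
    "time": 1514761260000 + minutes * 60000,      # ms timestamps, 1-minute steps
    "outTemp": 10 + rng.normal(0, 2, n),
    "inTemp": 18 + rng.normal(0, 1, n),
    "isOpen": is_open,
    "pplCounter": is_open * rng.integers(0, 300, n),
})

# One column becomes the label; the rest (minus the raw timestamp) are features.
X = frame.drop(columns=["isOpen", "time"])
y = frame["isOpen"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

This avoids the ‘multi-output, multi-class’ errors entirely, because there is exactly one target column per sample.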
How would some of you approach this problem?
Does my data need more preprocessing? What kind of preprocessing is most critical?
What kind of classifier would be up to the task of ‘separating the data streams’ and be able to correctly classify them?
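From what I have read so far, the one preprocessing step that seems non-negotiable for distance-based classifiers (k-NN, SVMs) is feature scaling: my timestamp column is around 1.5e12 while the temperatures are around 10-20, so unscaled, the timestamp would swamp every other feature. A sketch of how I understand this is done in scikit-learn, on a few made-up rows:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up rows in my column order (time, outTemp, inTemp, pplCounter),
# with isOpen pulled out as the label.
X = np.array([[1514761260000, 10, 18,  0],
              [1514761320000, 10, 18,  0],
              [1514761380000,  8, 18, 90]], dtype=float)
y = np.array([0, 0, 1])

# The Pipeline fits the scaler on the training data and automatically applies
# the same transformation to anything passed in later.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
print(model.predict(X))  # [0 0 1]
```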
I found this article testing different classifiers, but the code provided does not work with my type of data: I get errors about the data structure and the ‘multi-output, multi-class’ nature of my data.
I hope you understand the dilemma. I am at the point of “the more I read, the less I understand.”
Thank you all for being here and helping out. Please tell me if you need any more information and I will provide everything.