I am trying to implement Q-learning for an adaptive PID controller in a line-follower mobile robot that I have modeled in Simulink. This is what the control architecture looks like: [control architecture diagram].
During the training stage, the Q-learning agent tries different actions and updates the Q-table based on the outcome of each action. Obviously the model is expected to do badly in the beginning (the robot gets off the track), but it should eventually learn the PID controller gains that keep the robot on the track.
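To make the setup concrete, this is the kind of tabular update I have in mind (everything below is a simplified placeholder, not my actual state/action encoding; alpha, gamma and epsilon are the usual learning rate, discount factor and exploration rate):

```matlab
% Sketch of the tabular Q-learning step I am describing (placeholder sizes
% and values only): s = current state, a = action, r = reward, sNext = next state.
numStates  = 100;      % e.g. a discretized line-position error
numActions = 27;       % e.g. a small grid of candidate (Kp, Ki, Kd) triples
alpha = 0.1; gamma = 0.9; epsilon = 0.1;

Q = zeros(numStates, numActions);
s = 1;                               % placeholder current state

% epsilon-greedy action selection
if rand < epsilon
    a = randi(numActions);           % explore
else
    [~, a] = max(Q(s, :));           % exploit
end

% ... apply the PID gains encoded by action a, observe r and sNext ...
r = 0; sNext = 1;                    % placeholders for the observed outcome

% standard Q-learning update
Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));
```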
This is not a problem when working purely with data. For example, one may program a tic-tac-toe game in which the Q-learning algorithm tries different moves and observes the outcome of each move. Doing this iteratively (updating the Q-table at every iteration) results in the optimal policy (meaning the player never loses the game). Doing this in Simulink is different, however, because when the robot gets off the track the simulation session is terminated instead of the Q-table being updated and another attempt being made. When the session is terminated, we are back where we started, and hence the algorithm does not actually learn anything.
So my question is: is it possible to store the values of the Q-table (which is essentially a matrix) and start over, without having to terminate the simulation?
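To make the question concrete, the workaround I can picture is an episode loop in a script, where the Q-table lives outside the model and therefore survives each run. The model name 'lineFollower', the variable names and the reward logic below are placeholders, not my actual setup:

```matlab
% Hypothetical episode loop around the Simulink model. 'lineFollower' and all
% other names/values are placeholders for illustration only.
numStates = 100; numActions = 27; numEpisodes = 500;
alpha = 0.1; gamma = 0.9; epsilon = 0.1;
Q = zeros(numStates, numActions);   % kept in the script, so it is not lost
                                    % when a simulation run ends

for ep = 1:numEpisodes
    s = 1;                          % placeholder initial state
    % epsilon-greedy choice of the PID gain set for this episode
    if rand < epsilon
        a = randi(numActions);
    else
        [~, a] = max(Q(s, :));
    end
    assignin('base', 'actionIndex', a);  % expose the chosen action to the
                                         % model (placeholder mechanism)

    simOut = sim('lineFollower');   % one training episode; the model stops
                                    % itself when the robot leaves the track

    % ... compute reward r and next state sNext from the signals logged in
    %     simOut (e.g. how long the robot stayed on the track); placeholders:
    r = 0; sNext = 1;

    % Q-learning update; Q persists because it never leaves the script
    Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));
end

save('Qtable.mat', 'Q');            % keep the learned table for later sessions
```

Alternatively, I suppose the Q-table could be kept in a Data Store Memory block inside the model, or saved to a .mat file in the model's StopFcn callback and reloaded in InitFcn, but I am not sure which of these (if any) is the intended way to do it.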