Although Markov Decision Processes (MDPs) are the best-known mathematical structure for modelling an environment in reinforcement learning (RL), other models of RL environments exist that extend vanilla MDPs. This section defines these extensions and draws links between them.

The Q-learning algorithm was first introduced by (Watkins1989), and is arguably one of the most famous, most studied and most widely implemented methods in the entire field. Given an MDP, Q-learning aims to calculate the corresponding optimal action-value function, following the principle of optimality and the proof of existence...
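As an illustration of the idea, the following is a minimal sketch of tabular Q-learning with an epsilon-greedy behaviour policy. It is not taken from (Watkins1989): the function names `q_learning` and `chain_step`, the four-state chain environment, and all hyperparameter values are assumptions chosen only to keep the example small and deterministic.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.3, max_steps=100, seed=0):
    """Tabular Q-learning sketch; `step(state, action)` is a hypothetical
    environment function returning (next_state, reward, done)."""
    rng = random.Random(seed)
    # Action-value table initialised to zero.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0  # assumption: every episode starts in state 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                # Explore: pick a uniformly random action.
                action = rng.randrange(n_actions)
            else:
                # Exploit: pick a greedy action, breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def chain_step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.
    Reaching state 3 yields reward 1 and ends the episode."""
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


q_table = q_learning(n_states=4, n_actions=2, step=chain_step)
# Greedy action in each non-terminal state of the chain.
greedy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(3)]
```

In this toy chain the greedy policy read off the learned table should select the "move right" action in every non-terminal state, since that is the shortest route to the only reward. The table-based representation is a design choice that only suits small, discrete state and action spaces.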

Every RL algorithm attempts to learn an optimal policy for a given environment. So far, no single algorithm is used in every environment to find an optimal policy. The choice of algorithm depends on many factors, such as the nature of the environment, the...