diff --git a/contrib/machine-learning/reinforcement-learning.md b/contrib/machine-learning/reinforcement-learning.md
index 760d530..fcc26d6 100644
--- a/contrib/machine-learning/reinforcement-learning.md
+++ b/contrib/machine-learning/reinforcement-learning.md
@@ -113,13 +113,14 @@ Q-Learning is a model-free algorithm used in reinforcement learning to learn the
 - Choose an action using an exploration strategy (e.g., epsilon-greedy).
 - Take the action, observe the reward and the next state.
 - Update the Q-value of the current state-action pair using the Bellman equation:
-
- where:
- - is the Q-value of state and action .
- - is the observed reward.
- - is the next state.
- - is the learning rate.
- - is the discount factor.
+ $$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) $$
+
+ where:
+ - $Q(s, a)$ is the Q-value of state $s$ and action $a$.
+ - $r$ is the observed reward.
+ - $s'$ is the next state.
+ - $\alpha$ is the learning rate.
+ - $\gamma$ is the discount factor.
 3. Until convergence or a maximum number of episodes.
 
 ### SARSA
@@ -128,13 +129,13 @@ SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference alg
 #### Algorithm:
 1. Initialize Q-values arbitrarily for all state-action pairs.
 2. Repeat for each episode:
- - Initialize the environment state .
- - Choose an action using the current policy (e.g., epsilon-greedy).
+ - Initialize the environment state $s$.
+ - Choose an action $a$ using the current policy (e.g., epsilon-greedy).
  - Repeat for each timestep:
- - Take action , observe the reward and the next state .
- - Choose the next action using the current policy.
+ - Take action $a$, observe the reward $r$ and the next state $s'$.
+ - Choose the next action $a'$ using the current policy.
  - Update the Q-value of the current state-action pair using the SARSA update rule:
-
+ $$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right)$$
 3. Until convergence or a maximum number of episodes.
 
 ### REINFORCE Algorithm:
@@ -299,12 +300,3 @@ for i in range(num_rows):
 Congratulations on completing your journey through this comprehensive guide to reinforcement learning! Armed with this knowledge, you are well-equipped to dive deeper into the exciting world of RL, whether it's for gaming, robotics, finance, healthcare, or any other domain. Keep exploring, experimenting, and learning, and remember, the only limit to what you can achieve with reinforcement learning is your imagination.
 
 *Happy coding, and may your RL adventures be rewarding!*
-
-$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) $$
-
-where:
-- $Q(s, a)$ is the Q-value of state $s$ and action $a$.
-- $r$ is the observed reward.
-- $s'$ is the next state.
-- $\alpha$ is the learning rate.
-- $\gamma$ is the discount factor.
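
Note, not part of the patch: since the first two hunks reinstate the Q-Learning and SARSA update rules, a minimal tabular sketch of the difference between them may help in checking the restored formulas. Every name here (`q`, `alpha`, `gamma`, `actions`, `epsilon_greedy`) is hypothetical and invented for this illustration; none of them appear in the patched file.

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-table: q[(state, action)] -> float, zero-initialized.
q = defaultdict(float)

alpha = 0.1             # learning rate (the alpha in the formulas)
gamma = 0.99            # discount factor (the gamma in the formulas)
actions = [0, 1, 2, 3]  # assumed small discrete action space

def epsilon_greedy(state, epsilon=0.1):
    """The exploration strategy both algorithms reference."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy: bootstraps from the greedy next action, the max over a'."""
    target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstraps from the action a' the policy actually chose."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
```

The two updates differ only in the bootstrap term, which is exactly what the restored equations say: Q-Learning takes the maximum over next actions regardless of how the agent behaves, while SARSA uses the next action selected by the current (e.g., epsilon-greedy) policy.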