From da09e2332256a373a44ce980a55d7be171a96811 Mon Sep 17 00:00:00 2001
From: Ojaswi Chopra
Date: Sun, 9 Jun 2024 13:17:23 +0530
Subject: [PATCH] Minor Changes

---
 .../reinforcement-learning.md | 20 +++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/contrib/machine-learning/reinforcement-learning.md b/contrib/machine-learning/reinforcement-learning.md
index b470ed3..25dc442 100644
--- a/contrib/machine-learning/reinforcement-learning.md
+++ b/contrib/machine-learning/reinforcement-learning.md
@@ -116,7 +116,9 @@ Q-Learning is a model-free algorithm used in reinforcement learning to learn the
    - Choose an action using an exploration strategy (e.g., epsilon-greedy).
    - Take the action, observe the reward and the next state.
    - Update the Q-value of the current state-action pair using the Bellman equation:
-     \[ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \]
+     ```latex
+     Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
+     ```
      where:
        - \( Q(s, a) \) is the Q-value of state \( s \) and action \( a \).
        - \( r \) is the observed reward.
@@ -139,13 +141,17 @@ Deep Q-Networks (DQN) extend Q-learning to high-dimensional state spaces using d
    - Store the transition \( (s, a, r, s') \) in the replay memory.
    - Sample a minibatch of transitions from the replay memory.
    - Compute the target Q-value for each transition:
-     \[ y_j = \begin{cases} r_j & \text{if episode terminates at step } j+1 \\
-     r_j + \gamma \max_{a'} Q(s', a'; \theta^-) & \text{otherwise} \end{cases} \]
+     ```latex
+     y_j = \begin{cases} r_j & \text{if episode terminates at step } j+1 \\
+     r_j + \gamma \max_{a'} Q(s', a'; \theta^-) & \text{otherwise} \end{cases}
+     ```
      where:
        - \( \theta^- \) represents the parameters of the target network.
        - \( y_j \) is the target Q-value for the \( j \)th transition.
    - Update the Q-network parameters by minimizing the temporal difference loss:
-     \[ \mathcal{L}(\theta) = \frac{1}{N} \sum_{j} (y_j - Q(s_j, a_j; \theta))^2 \]
+     ```latex
+     \mathcal{L}(\theta) = \frac{1}{N} \sum_{j} (y_j - Q(s_j, a_j; \theta))^2
+     ```
 4. Until convergence or a maximum number of episodes.

 ### SARSA
@@ -160,7 +166,9 @@ SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference alg
    - Take action \( a \), observe the reward \( r \) and the next state \( s' \).
    - Choose the next action \( a' \) using the current policy.
    - Update the Q-value of the current state-action pair using the SARSA update rule:
-     \[ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right) \]
+     ```latex
+     Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right)
+     ```
 3. Until convergence or a maximum number of episodes.

 ### REINFORCE Algorithm:
@@ -321,7 +329,7 @@ for i in range(num_rows):
         print(f"State ({i}, {j}):", Q[i, j])
 ```

-### Conclusion
+## Conclusion
 Congratulations on completing your journey through this comprehensive guide to reinforcement learning! Armed with this knowledge, you are well-equipped to dive deeper into the exciting world of RL, whether it's for gaming, robotics, finance, healthcare, or any other domain. Keep exploring, experimenting, and learning, and remember, the only limit to what you can achieve with reinforcement learning is your imagination.

 *Happy coding, and may your RL adventures be rewarding!*
\ No newline at end of file
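
As a companion to the update rules reformatted in this patch, here is a minimal, hypothetical Python sketch (not part of the diff or the patched file) contrasting the tabular Q-learning and SARSA updates from the equations above. The environment sizes, hyperparameters, and helper names (`epsilon_greedy`, `q_learning_update`, `sarsa_update`) are illustrative assumptions, not code from the repository.

```python
# Sketch only: contrasts the two updates shown in the patched equations.
# Q-learning (off-policy) bootstraps from max_a' Q(s', a'); SARSA (on-policy)
# bootstraps from the action a' actually chosen by the current policy.
import numpy as np

n_states, n_actions = 6, 4            # assumed sizes for a toy discrete problem
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(Q, s):
    """Exploration strategy referenced in the patched text."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# One hypothetical transition (s, a, r, s') just to exercise both updates:
s, a, r, s_next = 0, epsilon_greedy(Q, 0), 1.0, 1
q_learning_update(Q, s, a, r, s_next)
a_next = epsilon_greedy(Q, s_next)
sarsa_update(Q, s, a, r, s_next, a_next)
print(Q[s])
```

The only difference between the two functions is the bootstrap target, which is exactly the distinction the patched equations encode: the `max` over next actions versus the Q-value of the on-policy next action.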