diff --git a/contrib/machine-learning/assets/rl-components.png b/contrib/machine-learning/assets/rl-components.png
deleted file mode 100644
index cb359cf..0000000
Binary files a/contrib/machine-learning/assets/rl-components.png and /dev/null differ
diff --git a/contrib/machine-learning/reinforcement-learning.md b/contrib/machine-learning/reinforcement-learning.md
index bab38a0..0046ae4 100644
--- a/contrib/machine-learning/reinforcement-learning.md
+++ b/contrib/machine-learning/reinforcement-learning.md
@@ -57,8 +57,6 @@ Reinforcement learning involves determining the best actions to take in various
## Key Concepts and Terminology
-
-
### Agent
An agent is a system or entity that learns to make decisions by interacting with an environment. The agent improves its performance by trial and error, receiving feedback from the environment in the form of rewards or punishments.
@@ -118,46 +116,24 @@ Q-Learning is a model-free algorithm used in reinforcement learning to learn the
- Update the Q-value of the current state-action pair using the Bellman equation:
  $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where:
- - \( Q(s, a) \) is the Q-value of state \( s \) and action \( a \).
+ - $Q(s, a)$ is the Q-value of state $s$ and action $a$.
- $r$ is the observed reward.
- $s'$ is the next state.
- $\alpha$ is the learning rate.
- $\gamma$ is the discount factor.
3. Until convergence or a maximum number of episodes.
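
A minimal tabular sketch of the update loop above, assuming a hypothetical discrete environment that exposes `reset()`, `step(action)`, and an `action_space_n` attribute (Gym-style names chosen only for illustration); the hyperparameter values are placeholders:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy exploration strategy."""
    q = defaultdict(float)                     # Q(s, a), default 0 for unseen pairs
    actions = list(range(env.action_space_n))  # assumes a small, discrete action set

    for _ in range(episodes):
        state, done = env.reset(), False       # states must be hashable to key the table
        while not done:
            # Explore with probability epsilon, otherwise act greedily w.r.t. Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

Because the target uses the maximum over next actions rather than the action actually taken, the update is off-policy.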
-### Deep Q-Networks (DQN)
-Deep Q-Networks (DQN) extend Q-learning to high-dimensional state spaces using deep neural networks to approximate the Q-function. It uses experience replay and target networks to improve stability and convergence.
-
-#### Algorithm:
-1. Initialize the Q-network with random weights.
-2. Initialize a target network with the same weights as the Q-network.
-3. Repeat for each episode:
- - Initialize the environment state \( s \).
- - Repeat for each timestep:
- - With probability \( \epsilon \), choose a random action. Otherwise, select the action with the highest Q-value according to the Q-network.
- - Take the chosen action, observe the reward \( r \) and the next state \( s' \).
- - Store the transition \( (s, a, r, s') \) in the replay memory.
- - Sample a minibatch of transitions from the replay memory.
- - Compute the target Q-value for each transition:
- \( y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-) \)
- where:
- - \( \theta^- \) represents the parameters of the target network.
- - \( y_j \) is the target Q-value for the \( j \)th transition.
- - Update the Q-network parameters by minimizing the temporal difference loss:
- \( L(\theta) = \frac{1}{N} \sum_j \left( y_j - Q(s_j, a_j; \theta) \right)^2 \)
-4. Until convergence or a maximum number of episodes.
-
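
For reference, a compact sketch of the DQN loop outlined above (epsilon-greedy acting, experience replay, and a periodically synced target network). PyTorch is an assumption here, as is the same hypothetical `reset()`/`step()` environment interface; layer sizes and hyperparameters are illustrative only:

```python
import random
from collections import deque

import torch
import torch.nn as nn

def dqn(env, state_dim, n_actions, episodes=200, gamma=0.99, epsilon=0.1,
        lr=1e-3, batch_size=32, buffer_size=10_000, target_sync=100):
    """Sketch of DQN: Q-network, replay memory, and a target network for stable targets."""
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())   # target starts as a copy of the Q-network
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)                # experience replay memory
    step = 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection from the Q-network
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state
            step += 1

            if len(replay) >= batch_size:
                # Sample a minibatch and compute targets y_j = r_j + gamma * max_a' Q_target(s'_j, a')
                batch = random.sample(replay, batch_size)
                s, a, r, s2, d = map(lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
                a = a.long()
                with torch.no_grad():
                    y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q_sa, y)    # temporal-difference loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % target_sync == 0:
                target_net.load_state_dict(q_net.state_dict())   # sync target network
    return q_net
```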
### SARSA
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference algorithm used for learning the Q-function. Unlike Q-learning, SARSA directly updates the Q-values based on the current policy.
#### Algorithm:
1. Initialize Q-values arbitrarily for all state-action pairs.
2. Repeat for each episode:
- - Initialize the environment state \( s \).
- - Choose an action \( a \) using the current policy (e.g., epsilon-greedy).
+ - Initialize the environment state $s$.
+ - Choose an action $a$ using the current policy (e.g., epsilon-greedy).
- Repeat for each timestep:
- - Take action \( a \), observe the reward \( r \) and the next state \( s' \).
- - Choose the next action \( a' \) using the current policy.
+ - Take action $a$, observe the reward $r$ and the next state $s'$.
+ - Choose the next action $a'$ using the current policy.
- Update the Q-value of the current state-action pair using the SARSA update rule:
  $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$
3. Until convergence or a maximum number of episodes.
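
A tabular SARSA sketch in the same style as the Q-learning example above (same hypothetical environment interface and placeholder hyperparameters); the key difference is that the update uses the next action actually chosen by the policy rather than the greedy maximum:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy SARSA: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    q = defaultdict(float)                     # Q(s, a), default 0 for unseen pairs
    actions = list(range(env.action_space_n))  # assumes a small, discrete action set

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(state)                 # first action from the current policy
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)   # the a' actually taken next (on-policy)
            # SARSA update uses the action chosen by the policy, not the max over actions
            q[(state, action)] += alpha * (reward + gamma * q[(next_state, next_action)]
                                           - q[(state, action)])
            state, action = next_state, next_action
    return q
```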