diff --git a/contrib/machine-learning/reinforcement-learning.md b/contrib/machine-learning/reinforcement-learning.md
index e55881a..bab38a0 100644
--- a/contrib/machine-learning/reinforcement-learning.md
+++ b/contrib/machine-learning/reinforcement-learning.md
@@ -116,13 +116,13 @@ Q-Learning is a model-free algorithm used in reinforcement learning to learn the
   - Choose an action using an exploration strategy (e.g., epsilon-greedy).
   - Take the action, observe the reward and the next state.
   - Update the Q-value of the current state-action pair using the Bellman equation:
-
+    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
     where:
-    - \( Q(s, a) \) is the Q-value of state \( s \) and action \( a \).
-    - \( r \) is the observed reward.
-    - \( s' \) is the next state.
-    - \( \alpha \) is the learning rate.
-    - \( \gamma \) is the discount factor.
+    - $Q(s, a)$ is the Q-value of state $s$ and action $a$.
+    - $r$ is the observed reward.
+    - $s'$ is the next state.
+    - $\alpha$ is the learning rate.
+    - $\gamma$ is the discount factor.
 3. Until convergence or a maximum number of episodes.
 
 ### Deep Q-Networks (DQN)
@@ -132,9 +132,9 @@ Deep Q-Networks (DQN) extend Q-learning to high-dimensional state spaces using d
 1. Initialize the Q-network with random weights.
 2. Initialize a target network with the same weights as the Q-network.
 3. Repeat for each episode:
-  - Initialize the environment state \( s \).
+  - Initialize the environment state $s$.
   - Repeat for each timestep:
-    - With probability \( \epsilon \), choose a random action. Otherwise, select the action with the highest Q-value according to the Q-network.
+    - With probability $\epsilon$, choose a random action. Otherwise, select the action with the highest Q-value according to the Q-network.
     - Take the chosen action, observe the reward \( r \) and the next state \( s' \).
     - Store the transition \( (s, a, r, s') \) in the replay memory.
     - Sample a minibatch of transitions from the replay memory.
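
For reviewers, a minimal runnable sketch of the tabular Q-learning loop that the first hunk documents: epsilon-greedy action selection followed by the Bellman update. The toy chain environment, the hyperparameter values, and every name below are illustrative assumptions, not content from reinforcement-learning.md.

```python
import random
from collections import defaultdict

# Hypothetical toy "chain" environment: states 0..4, actions 0 (left) and 1 (right).
# Reaching state 4 yields reward 1 and ends the episode.
GOAL = 4
ACTIONS = (0, 1)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate (assumed values)
Q = defaultdict(float)                 # Q[(state, action)], implicitly initialized to 0

def greedy_action(state):
    # Break ties randomly so the untrained agent does not lock onto one action.
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration strategy.
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy_action(s)
        s_next, r, done = step(s, a)
        # Bellman update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        next_best = 0.0 if done else max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * next_best - Q[(s, a)])
        s = s_next

# Print the learned value of each state (max over actions).
print({s: round(max(Q[(s, a)] for a in ACTIONS), 3) for s in range(GOAL + 1)})
```

Running the sketch should print state values that increase toward the goal state, reflecting the discount factor applied at each step.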