From b230028124bcc03f56ed8f87ebeb437a38bce21c Mon Sep 17 00:00:00 2001
From: SAM <60264918+SAM-DEV007@users.noreply.github.com>
Date: Thu, 6 Jun 2024 14:17:40 +0530
Subject: [PATCH] Update transformers.md

Mathematical equations are converted to LaTeX.
---
 contrib/machine-learning/transformers.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/contrib/machine-learning/transformers.md b/contrib/machine-learning/transformers.md
index 95ccada..5a27688 100644
--- a/contrib/machine-learning/transformers.md
+++ b/contrib/machine-learning/transformers.md
@@ -20,9 +20,9 @@ The decoder is also composed of a stack of identical layers. In addition to the
 
 ### Attention
 #### Scaled Dot-Product Attention
-The input consists of queries and keys of dimension $d_k$ , and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt d_k$ , and apply a softmax function to obtain the weights on the values.
+The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
 
-> Attention(Q, K, V) = softmax(QKT / √dk) * V
+$$Attention(Q, K, V) = softmax\left(\dfrac{QK^T}{\sqrt{d_k}}\right) \times V$$
 
 #### Multi-Head Attention
 Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$ , $d_k$ and $d_v$ dimensions, respectively.
@@ -30,11 +30,11 @@ Instead of performing a single attention function with $d_{model}$-dimensional k
 
 Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
 
-> MultiHead(Q, K, V) = Concat(head1, ..., headh) * WO
+$$MultiHead(Q, K, V) = Concat(head_1, \dots, head_h) \times W^O$$
 
 where,
 
-> headi = Attention(Q * WiQ, K * WiK, V * WiV)
+$$head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$
 
 where the projections are parameter matrices.
 
@@ -42,20 +42,22 @@
 
 It may be necessary to cut out attention links between some word-pairs. For example, the decoder for token position $t$ should not have access to token position $t+1$.
 
-> MaskedAttention(Q, K, V) = softmax(M + (QKT / √dk)) * V
+$$MaskedAttention(Q, K, V) = softmax\left(M + \dfrac{QK^T}{\sqrt{d_k}}\right) \times V$$
 
 ### Feed-Forward Network
 Each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
 
-> FFN(x) = (max(0, (x * W1) + b1) * W2) + b2
+
+$$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
 
 ### Positional Encoding
 A positional encoding is a fixed-size vector representation that encapsulates the relative positions of tokens within a target sequence: it provides the transformer model with information about where the words are in the input sequence.
 
 The sine and cosine functions of different frequencies:
-> PE(pos,2i) = sin(pos/100002i/dmodel)
-> PE(pos,2i+1) = cos(pos/100002i/dmodel)
+$$PE(pos,2i) = \sin\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$$
+
+$$PE(pos,2i+1) = \cos\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$$
 
 ## Implementation
 ### Theory
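
The scaled dot-product attention and masked attention formulas above translate almost line-for-line into code. Below is a minimal NumPy sketch for illustration only; it is not the implementation used in transformers.md, and the function names, array shapes, and the additive `-inf` mask convention are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional additive mask M.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k): similarity of each query with each key
    if mask is not None:
        scores = scores + mask           # masked positions hold -inf, so their softmax weight is 0
    weights = softmax(scores, axis=-1)   # attention weights over the keys
    return weights @ V                   # weighted sum of the values, shape (seq_q, d_v)

# Toy check: 4 positions, d_k = d_v = 8, causal mask so position t cannot see t+1, t+2, ...
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
causal_mask = np.where(np.arange(4)[None, :] > np.arange(4)[:, None], -np.inf, 0.0)
print(scaled_dot_product_attention(Q, K, V, mask=causal_mask).shape)  # (4, 8)
```

With this mask convention, MaskedAttention(Q, K, V) is simply the same call with M supplied.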
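
Multi-head attention and the position-wise feed-forward network can be sketched the same way. This continues the snippet above (it reuses `np`, `rng`, and `scaled_dot_product_attention`); the per-head weight layout `W_Q[i]` of shape `(d_model, d_k)` is an assumption, not the document's API:

```python
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, mask=None):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); concatenate the heads and project with W^O.
    heads = [
        scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i], mask=mask)
        for i in range(len(W_Q))
    ]
    return np.concatenate(heads, axis=-1) @ W_O

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2: two linear maps with a ReLU in between,
    # applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy check: h = 2 heads, d_model = 8, d_k = d_v = 4, d_ff = 16, self-attention (Q = K = V = x).
h, d_model, d_k, d_ff = 2, 8, 4, 16
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
x = rng.normal(size=(4, d_model))
y = multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O)
print(feed_forward(y, rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d_model)), np.zeros(d_model)).shape)  # (4, 8)
```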
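
The sinusoidal positional encoding is also easy to check numerically. Again a rough sketch continuing the NumPy setup above; it assumes an even `d_model` so the sine and cosine columns interleave cleanly:

```python
def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                    # (max_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) frequency index
    angles = pos / np.power(10000.0, 2 * i / d_model)    # each column is a different frequency
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return pe

print(positional_encoding(max_len=50, d_model=8).shape)  # (50, 8)
```

Because the encodings are fixed functions of position, they can be precomputed once and added to the token embeddings before the first layer.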