[Deep Learning Paper Review Series]

Understanding the meaning, behavior, and computation of each Transformer layer through examples

At first glance, the structure of the Transformer looks very complicated.

I wrote this post to take it apart piece by piece, follow the actual computation step by step, and look at the meaning and purpose of each part so that it becomes easier to understand.

Most of what follows is based on the base model of the Attention Is All You Need paper.

Let's dig through the computation of every layer, one by one, using worked examples.

Input Embedding

The Input Embedding layer performs word embedding, as in other NLP models.

Word embedding itself is not the topic here, so the process is only described briefly.

Tokenizing

First, the input sentence is split into tokens.

For example, given the sentence "I love deep learning",

it is converted to ["I", "love", "deep", "learning"].

Encoding (Index mapping)

Each token in ["I", "love", "deep", "learning"] is then mapped to a specific integer.

For example, if the vocabulary size is 10,000,

the tokens are encoded as something like [1, 2, 3, 1424] (the actual numbers will of course differ),

and one-hot encoding turns this into

$\left[ \begin{array}{c} \text{I} \\ \text{love} \\ \text{deep} \\ \text{learning} \end{array} \right] = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix}_{4 \times 10000}$

where each row is a 10,000-dimensional vector with a single 1 at the token's index.
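The three steps above (tokenizing, index mapping, one-hot encoding) can be sketched in a few lines of plain Python. This is only an illustration: the tiny `vocab` dictionary, the whitespace `tokenize` function, and the `<unk>` fallback are assumptions made for the example, not what the paper or any particular tokenizer does.

```python
# Toy vocabulary: the words and indices are made up for illustration.
vocab = {"I": 0, "love": 1, "deep": 2, "learning": 3, "<unk>": 4}

def tokenize(sentence):
    """Split on whitespace (a stand-in for a real subword tokenizer)."""
    return sentence.split()

def encode(tokens, vocab):
    """Map each token to its integer index, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def one_hot(indices, vocab_size):
    """Turn each index into a vocab_size-long row containing a single 1."""
    return [[1 if col == idx else 0 for col in range(vocab_size)]
            for idx in indices]

tokens = tokenize("I love deep learning")
ids = encode(tokens, vocab)
rows = one_hot(ids, len(vocab))
```

With a real vocabulary of 10,000 entries, `rows` would be the 4 x 10000 matrix shown above.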

Embedding

Using the one-hot encoded form above, each word is embedded into a vector of a fixed dimension ( $d_{model}=512$ in the paper).

For "I love deep learning", once one-hot encoding is done we have the $4 \times 10000$ matrix shown above.

This matrix is multiplied by the embedding matrix, which embeds each word into $d_{model}$ (512) dimensions.

Here, the embedding matrix has size (vocabulary size)x( $d_{model}$ ).

$\small \underbrace{ \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} }_{4 \times 10000} \times \underbrace{ \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1, 512} \\ w_{2,1} & w_{2,2} & \cdots & w_{2, 512} \\ \vdots & \vdots & \ddots & \vdots \\ w_{10000,1} & w_{10000,2} & \cdots & w_{10000,512} \end{bmatrix} }_{10000 \times 512 \space \text{(embedding matrix)}} = \underbrace{ \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1,512} \\ v_{21} & v_{22} & \cdots & v_{2,512} \\ v_{31} & v_{32} & \cdots & v_{3,512} \\ v_{41} & v_{42} & \cdots & v_{4,512} \end{bmatrix} }_{4 \times 512}$

The embedding matrix is itself a target of training. Because multiplying by a one-hot row simply selects one row of the embedding matrix, implementations usually use an index lookup instead of an actual matrix multiplication.

After embedding, we get a matrix of size (number of input words)x( $d_{model}$ ).

$\small \left[\begin{array}{c}\text{I} \\\text{love} \\\text{deep} \\\text{learning}\end{array}\right] =\begin{bmatrix}0.021 & -0.103 & 0.358 & 0.214 & \cdots & -0.142 \\-0.531 & 0.842 & 0.012 & -0.401 & \cdots & 0.095 \\0.182 & 0.631 & -0.274 & 0.005 & \cdots & 0.427 \\0.753 & -0.192 & 0.471 & 0.329 & \cdots & -0.503\end{bmatrix}_{4 \times 512}$
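To see why a lookup and the one-hot multiplication are interchangeable, here is a minimal NumPy sketch. The embedding matrix `W_emb` is filled with random numbers as a stand-in for trained weights, and the token indices are the illustrative ones used earlier; both are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 512

# Stand-in for a trained embedding matrix (random values here).
W_emb = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([1, 2, 3, 1424])  # illustrative indices from the text

# Route 1: explicit one-hot rows multiplied by the embedding matrix.
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
E_matmul = one_hot @ W_emb   # (4, 10000) @ (10000, 512) -> (4, 512)

# Route 2: direct row lookup, which skips the multiplication entirely.
E_lookup = W_emb[token_ids]  # (4, 512)
```

Both routes produce the same 4 x 512 matrix; the lookup just avoids multiplying by thousands of zeros.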

Positional Encoding

In general, word order within a sentence plays an important role in its meaning.

For example, in "The dog bites the man." and "The man bites the dog.", all the words (tokens) are the same and only the order of 'dog' and 'man' differs, yet the meaning is completely different.

An RNN processes sequential input in order, so it captures this information naturally.

The Transformer's Self-Attention, however, does not take the order of the input sequence into account at all, so Positional Encoding is used to give the model each token's position.

"Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence."

Periodic functions are a common choice for Positional Encoding, and several other methods exist, but the base model in the Transformer paper uses sin and cos functions as follows.

$\begin{aligned} PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i/d_{model}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i/d_{model}}\right) \end{aligned}$

Here $pos$ is the position of each token (first token → 0, second token → 1, …), and $i$ indexes pairs of dimensions. (For example, among the 512 dimensions of each word, dimensions 0 and 1 have $i=0$ , dimensions 2 and 3 have $i=1$ , and so on; the even dimensions 0, 2, … use sin and the odd dimensions 1, 3, … use cos. See the expressions below.)

Let's compute the Positional Encoding values and add them to the word embedding result in an example.

Below is the word embedding of "I love deep learning" computed above.

$E = \left[\begin{array}{cccccc} 0.021 & -0.103 & 0.358 & 0.214 & \cdots & -0.142 \\ -0.531 & 0.842 & 0.012 & -0.401 & \cdots & 0.095 \\ 0.182 & 0.631 & -0.274 & 0.005 & \cdots & 0.427 \\ 0.753 & -0.192 & 0.471 & 0.329 & \cdots & -0.503 \end{array}\right]_{4 \times 512}$

In "I love deep learning", the positional indices are [0, 1, 2, 3] in order.

So the Positional Encoding for "I" is as follows.

$PE(0) = \left[\begin{array}{cccccc} \sin(0) & \cos(0) & \sin(0) & \cos(0) & \cdots & \cos(0) \end{array}\right]_{1 \times 512}$

Likewise, the Positional Encoding for "love" is as follows.

$PE(1) = \left[\begin{array}{cccccc} \sin\left(\frac{1}{10000^{2*0/512}}\right) & \cos\left(\frac{1}{10000^{2*0/512}}\right) & \sin\left(\frac{1}{10000^{2*1/512}}\right) & \cos\left(\frac{1}{10000^{2*1/512}}\right) & \cdots \end{array}\right]_{1 \times 512}$

Computing the Positional Encoding in this way and adding it to the embedding gives the following.

$\tiny E + P = \left[\begin{array}{cccccc} 0.021 + \sin(0) & -0.103 + \cos(0) & 0.358 + \sin(0) & 0.214 + \cos(0) & \cdots & -0.142 + \cos(0) \\ -0.531 + \sin\left(\frac{1}{10000^{2*0/512}}\right) & 0.842 + \cos\left(\frac{1}{10000^{2*0/512}}\right) & 0.012 + \sin\left(\frac{1}{10000^{2*1/512}}\right) & -0.401 + \cos\left(\frac{1}{10000^{2*1/512}}\right) & \cdots & 0.095 + \cos\left(\frac{1}{10000^{2*255/512}}\right) \\ 0.182 + \sin\left(\frac{2}{10000^{2*0/512}}\right) & 0.631 + \cos\left(\frac{2}{10000^{2*0/512}}\right) & -0.274 + \sin\left(\frac{2}{10000^{2*1/512}}\right) & 0.005 + \cos\left(\frac{2}{10000^{2*1/512}}\right) & \cdots & 0.427 + \cos\left(\frac{2}{10000^{2*255/512}}\right) \\ 0.753 + \sin\left(\frac{3}{10000^{2*0/512}}\right) & -0.192 + \cos\left(\frac{3}{10000^{2*0/512}}\right) & 0.471 + \sin\left(\frac{3}{10000^{2*1/512}}\right) & 0.329 + \cos\left(\frac{3}{10000^{2*1/512}}\right) & \cdots & -0.503 + \cos\left(\frac{3}{10000^{2*255/512}}\right) \end{array}\right]_{4 \times 512}$

$\scriptsize E+P = \left[\begin{array}{cccccc} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ -0.531 + 0.8415 & 0.842 + 0.5403 & 0.012 + 0.8219 & -0.401 + 0.5697 & \cdots & 0.095 + 1.0000 \\ 0.182 + 0.9093 & 0.631 - 0.4161 & -0.274 + 0.9364 & 0.005 - 0.3509 & \cdots & 0.427 + 1.0000 \\ 0.753 + 0.1411 & -0.192 - 0.9900 & 0.471 + 0.2451 & 0.329 - 0.9695 & \cdots & -0.503 + 1.0000 \end{array}\right]_{4 \times 512}$

$\scriptsize E+P= \left[\begin{array}{cccccc} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ 0.3105 & 1.3823 & 0.8339 & 0.1687 & \cdots & 1.0950 \\ 1.0913 & 0.2149 & 0.6624 & -0.3459 & \cdots & 1.4270 \\ 0.8941 & -1.1820 & 0.7161 & -0.6405 & \cdots & 0.4970 \end{array}\right]_{4 \times 512}$

This is the result of passing the sentence "I love deep learning" through the word embedding layer ( $d_{model}=512$ ), computing the positional encoding, and summing the two.
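The sinusoidal formula is easy to check by hand with a short script. The sketch below uses only Python's `math` module and follows the $PE_{(pos, 2i)}$ / $PE_{(pos, 2i+1)}$ definition given earlier; the function names `positional_encoding` and `add_positional_encoding` are made up for this example.

```python
import math

def positional_encoding(pos, d_model=512):
    """Sinusoidal encoding for one position, following the formula above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even dimension 2i
        pe.append(math.cos(angle))  # odd dimension 2i + 1
    return pe

def add_positional_encoding(E):
    """Add PE(pos) to the embedding row at each position (E is a list of rows)."""
    return [[e + p for e, p in zip(row, positional_encoding(pos, len(row)))]
            for pos, row in enumerate(E)]

pe0 = positional_encoding(0)  # encoding for the first token
pe1 = positional_encoding(1)  # encoding for the second token
```

Note that for the highest dimensions the angle becomes tiny (the wavelength approaches 10000 · 2π), so the cos entries there stay very close to 1 for short sentences.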

Self Attention (Multi-head attention)

“Motivating our use of self-attention we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.”

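As the passage quoted above from the paper explains, the Transformer is built around Self-Attention.

The structure that performs Self-Attention is the Multi-Head Attention block built on Scaled Dot-Product Attention (Figure 2 of the paper). Single-head Attention is explained first, before Multi-head Attention.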

Single Head Attention (in Scaled Dot-Product Attention)

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

We start from the result of Input Embedding + Positional Encoding computed above.

$\scriptsize E+P= X = \left[\begin{array}{cccccc} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ 0.3105 & 1.3823 & 0.8339 & 0.1687 & \cdots & 1.0950 \\ 1.0913 & 0.2149 & 0.6624 & -0.3459 & \cdots & 1.4270 \\ 0.8941 & -1.1820 & 0.7161 & -0.6405 & \cdots & 0.4970 \end{array}\right]_{4 \times 512}$

This result ( $X$ ) is used three times: it is multiplied by three different weight matrices to produce the Q (Query), K (Key), and V (Value) matrices.

The dimensions of Q, K, and V, $d_q, d_k, d_v$ , are all 64 (in the paper's base model), so the weight matrices $W_Q, W_K, W_V$ that produce them are each (512)x(64). Multiplying $X$ by each weight is (4x512)x(512x64), which gives 4x64 Query, Key, and Value matrices.

Let's follow the actual computation.

$\scriptsize W_Q = \begin{bmatrix} 0.1 & 0.2 & \cdots & 0.4 \\ 0.3 & 0.1 & \cdots & 0.5 \\ \vdots & \vdots & \ddots & \vdots \\ 0.6 & 0.7 & \cdots & 0.8 \end{bmatrix}_{512 \times 64}, W_K = \begin{bmatrix} 0.4 & 0.2 & \cdots & 0.1 \\ 0.3 & 0.5 & \cdots & 0.3 \\ \vdots & \vdots & \ddots & \vdots \\ 0.8 & 0.6 & \cdots & 0.9 \end{bmatrix}_{512 \times 64}, W_V = \begin{bmatrix} 0.2 & 0.4 & \cdots & 0.3 \\ 0.1 & 0.3 & \cdots & 0.6 \\ \vdots & \vdots & \ddots & \vdots \\ 0.7 & 0.9 & \cdots & 0.5 \end{bmatrix}_{512 \times 64}$

The weights $W_Q, W_K, W_V$ above are each multiplied with $X$ as follows.

This gives the Query, Key, and Value matrices. (From here on, working with actual numbers would not help understanding much, so the results are written as symbols.)

$\tiny Q = X \times W_Q = \begin{bmatrix} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ 0.3105 & 1.3823 & 0.8339 & 0.1687 & \cdots & 1.0950 \\ 1.0913 & 0.2149 & 0.6624 & -0.3459 & \cdots & 1.4270 \\ 0.8941 & -1.1820 & 0.7161 & -0.6405 & \cdots & 0.4970 \end{bmatrix}_{4 \times 512} \times \begin{bmatrix} 0.1 & 0.2 & \cdots & 0.4 \\ 0.3 & 0.1 & \cdots & 0.5 \\ \vdots & \vdots & \ddots & \vdots \\ 0.6 & 0.7 & \cdots & 0.8 \end{bmatrix}_{512 \times 64} = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1,64} \\ q_{21} & q_{22} & \cdots & q_{2,64} \\ q_{31} & q_{32} & \cdots & q_{3,64} \\ q_{41} & q_{42} & \cdots & q_{4,64} \end{bmatrix}_{4 \times 64}$

$\tiny K = X \times W_K = \begin{bmatrix} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ 0.3105 & 1.3823 & 0.8339 & 0.1687 & \cdots & 1.0950 \\ 1.0913 & 0.2149 & 0.6624 & -0.3459 & \cdots & 1.4270 \\ 0.8941 & -1.1820 & 0.7161 & -0.6405 & \cdots & 0.4970 \end{bmatrix}_{4 \times 512} \times \begin{bmatrix} 0.4 & 0.2 & \cdots & 0.1 \\ 0.3 & 0.5 & \cdots & 0.3 \\ \vdots & \vdots & \ddots & \vdots \\ 0.8 & 0.6 & \cdots & 0.9 \end{bmatrix}_{512 \times 64} = \begin{bmatrix} k_{11} & k_{12} & \cdots & k_{1,64} \\ k_{21} & k_{22} & \cdots & k_{2,64} \\ k_{31} & k_{32} & \cdots & k_{3,64} \\ k_{41} & k_{42} & \cdots & k_{4,64} \end{bmatrix}_{4 \times 64}$

$\tiny V = X \times W_V = \begin{bmatrix} 0.021 & 0.897 & 0.358 & 1.214 & \cdots & 0.858 \\ 0.3105 & 1.3823 & 0.8339 & 0.1687 & \cdots & 1.0950 \\ 1.0913 & 0.2149 & 0.6624 & -0.3459 & \cdots & 1.4270 \\ 0.8941 & -1.1820 & 0.7161 & -0.6405 & \cdots & 0.4970 \end{bmatrix}_{4 \times 512} \times \begin{bmatrix} 0.2 & 0.4 & \cdots & 0.3 \\ 0.1 & 0.3 & \cdots & 0.6 \\ \vdots & \vdots & \ddots & \vdots \\ 0.7 & 0.9 & \cdots & 0.5 \end{bmatrix}_{512 \times 64} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1,64} \\ v_{21} & v_{22} & \cdots & v_{2,64} \\ v_{31} & v_{32} & \cdots & v_{3,64} \\ v_{41} & v_{42} & \cdots & v_{4,64} \end{bmatrix}_{4 \times 64}$

Now that we have the Q, K, and V matrices, we perform MatMul (matrix multiplication) on Q and K.

The result is the Attention Score.

A matrix product computes the dot product between each row vector and each column vector. A high dot product between the vectors of two words means, geometrically, that the two vectors point in similar directions in the space. (Since $a \cdot b = \|a\|\|b\|\cos\theta$ , for vectors of fixed length the dot product grows as the angle between them shrinks.) Two vectors pointing in similar directions means, in turn, that the two words are strongly related, semantically or otherwise. That is why the product of the Query and Key matrices, which takes the dot product of each word's Query vector with each word's Key vector, serves as the Attention Score.

The entries of this matrix represent the relevance between each pair of words in the sentence. (For example, $s_{12}$ is the relevance between "I" as the Query and "love" as the Key.)

$\scriptsize QK^T = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1,64} \\ q_{21} & q_{22} & \cdots & q_{2,64} \\ q_{31} & q_{32} & \cdots & q_{3,64} \\ q_{41} & q_{42} & \cdots & q_{4,64} \end{bmatrix}_{4 \times 64} \times \begin{bmatrix} k_{11} & k_{21} & k_{31} & k_{41} \\ k_{12} & k_{22} & k_{32} & k_{42} \\ \vdots & \vdots & \vdots & \vdots \\ k_{1,64} & k_{2,64} & k_{3,64} & k_{4,64} \end{bmatrix}_{64 \times 4} = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{bmatrix}_{4 \times 4}$

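Next, these values are scaled. The paper scales by dividing every value by $\sqrt{d_k}$ .

Since $d_k = 64$ , every $s$ value is divided by $8$ here.

The purpose of scaling is to keep the values fed to softmax from becoming too large, which would otherwise push softmax into regions where the gradients are extremely small during training.

Each score is a sum of 64 element-wise products, so its magnitude tends to grow with $d_k$ . If softmax is applied as is, the exponential reacts very strongly to the large values and the probability distribution can collapse onto a single entry. The scores are therefore scaled using the relevant dimension, $d_k$ .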
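The choice of $\sqrt{d_k}$ specifically can be motivated with the assumption the paper itself uses in a footnote: the components of a query vector $q$ and a key vector $k$ are independent random variables with mean 0 and variance 1. Under that assumption (which real trained vectors only approximately satisfy), a short derivation gives:

$\begin{aligned} q \cdot k &= \sum_{i=1}^{d_k} q_i k_i \\ \mathbb{E}[q \cdot k] &= \sum_{i=1}^{d_k} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\ \mathrm{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k \\ \mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1 \end{aligned}$

So the unscaled scores have a standard deviation of about $\sqrt{d_k} = 8$ , and dividing by $\sqrt{d_k}$ brings them back to roughly unit scale regardless of the head dimension.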

softmax is then applied to the scaled values.

softmax is applied row by row, so for each Query word the result expresses the relative importance of every word as a probability distribution.

$\scriptsize \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) = \text{softmax}\left(\frac{1}{\sqrt{64}} \times \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{bmatrix}\right) = \begin{bmatrix} \hat{s}_{11} & \hat{s}_{12} & \hat{s}_{13} & \hat{s}_{14} \\ \hat{s}_{21} & \hat{s}_{22} & \hat{s}_{23} & \hat{s}_{24} \\ \hat{s}_{31} & \hat{s}_{32} & \hat{s}_{33} & \hat{s}_{34} \\ \hat{s}_{41} & \hat{s}_{42} & \hat{s}_{43} & \hat{s}_{44} \end{bmatrix}_{4 \times 4}$

Finally, this is multiplied by the Value matrix.

$\scriptsize \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V = \begin{bmatrix} \hat{s}_{11} & \hat{s}_{12} & \hat{s}_{13} & \hat{s}_{14} \\ \hat{s}_{21} & \hat{s}_{22} & \hat{s}_{23} & \hat{s}_{24} \\ \hat{s}_{31} & \hat{s}_{32} & \hat{s}_{33} & \hat{s}_{34} \\ \hat{s}_{41} & \hat{s}_{42} & \hat{s}_{43} & \hat{s}_{44} \end{bmatrix}_{4 \times 4} \times \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1,64} \\ v_{21} & v_{22} & \cdots & v_{2,64} \\ v_{31} & v_{32} & \cdots & v_{3,64} \\ v_{41} & v_{42} & \cdots & v_{4,64} \end{bmatrix}_{4 \times 64} = \begin{bmatrix} o_{11} & o_{12} & \cdots & o_{1,64} \\ o_{21} & o_{22} & \cdots & o_{2,64} \\ o_{31} & o_{32} & \cdots & o_{3,64} \\ o_{41} & o_{42} & \cdots & o_{4,64} \end{bmatrix}_{4 \times 64}$

The result is the Attention Value, the final output of attention.

The Value matrix can be thought of as the "meaning", "information", and "context" of each token (word) in the sequence expressed as numbers. The matrix after softmax holds the attention strength between each pair of words, so multiplying it with the Value matrix, which carries the actual content of the words, yields the Attention Value: each output row is a weighted sum of the Value rows.
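The whole single-head computation fits in a few lines of NumPy. This is a minimal sketch, not a reference implementation: `X` and the weight matrices are random stand-ins (the 0.05 scale is arbitrary), and the helper names `softmax` and `scaled_dot_product_attention` are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) scaled attention scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 512))  # stand-in for E + P
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Here `weights` is the 4 x 4 matrix of $\hat{s}$ values and `out` is the 4 x 64 Attention Value.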

Masking

One thing was skipped above: before applying softmax, masking can be performed.

Masking means not attending to certain words. For example, in "I love deep learning", to ignore the attention from "I" to "deep" and "learning", set the scores of the words to be ignored to $-\infty$ , as in $\small [s_{11} \ s_{12} \ -\infty \ -\infty]$ .

(Masking with $-\infty$ makes the corresponding softmax output 0.)

This is typically used in the Decoder's Self-Attention to mask the words that come after the current output position.

(The Decoder's goal is to output the sentence one word at a time, so attending to later words that have not been output yet amounts to cheating. That is why the later words are masked. The matrices below show an arbitrary mask pattern for illustration; a causal Decoder mask sets every entry above the diagonal to $-\infty$ .)

$\scriptsize \text{Masked}\left(\frac{QK^T}{\sqrt{d_k}}\right) = \begin{bmatrix} s_{11} & s_{12} & -\infty & -\infty \\ s_{21} & s_{22} & s_{23} & -\infty \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & -\infty & -\infty & s_{44} \end{bmatrix}_{4 \times 4}$

$\scriptsize \text{softmax}\left(\text{Masked}\left(\frac{QK^T}{\sqrt{d_k}}\right)\right) = \text{softmax}\left( \begin{bmatrix} s_{11} & s_{12} & -\infty & -\infty \\ s_{21} & s_{22} & s_{23} & -\infty \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & -\infty & -\infty & s_{44} \end{bmatrix} \right) = \begin{bmatrix} \hat{s}_{11} & \hat{s}_{12} & 0 & 0 \\ \hat{s}_{21} & \hat{s}_{22} & \hat{s}_{23} & 0 \\ \hat{s}_{31} & \hat{s}_{32} & \hat{s}_{33} & \hat{s}_{34} \\ \hat{s}_{41} & 0 & 0 & \hat{s}_{44} \end{bmatrix}_{4 \times 4}$
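For the Decoder case specifically, the mask is usually the upper triangle of the score matrix (every position after the current one). A minimal NumPy sketch under that assumption; `causal_mask` and `masked_softmax` are hypothetical helper names and the scores are dummy values:

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix, True where key index > query index (to be hidden)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Set masked scores to -inf, then apply a row-wise softmax."""
    masked = np.where(mask, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) evaluates to exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.arange(16, dtype=float).reshape(4, 4) / 8.0  # dummy scaled scores
weights = masked_softmax(scores, causal_mask(4))
```

The first row of `weights` attends only to the first token, the second row to the first two, and so on. The sketch relies on every row keeping at least one unmasked entry (the diagonal); a row that is fully masked would produce NaN.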

Multi-head Attention

The paper does not actually implement attention with a single head as above, but with multiple heads.

$\small \begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \\ \text{where} \quad \text{head}_i &= \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \end{aligned}$

Above, we computed everything once with $d_q, d_k, d_v$ set to 64, and the resulting Query, Key, and Value matrices were each 4x64.

That computation is one Single-head Attention. Multi-head Attention repeats it once per head, each head with its own weight matrices, and concatenates the resulting Attention Values (the final output matrix of each attention computation).

Let's follow the computation directly.

The base model in the paper uses 8 heads. ( $h=8$ )

Accordingly, for each head, $d_k=d_v=d_{model}/h = 64$ . (This matches the Single-head Attention above.)

For each head, the Query, Key, and Value matrices are computed in the same way as in Single-head Attention, and Scaled Dot-Product Attention is performed exactly as before.

This yields 8 Attention Value matrices of size 4x64, which are concatenated back into a 4x512 (4x(64x8)) Attention Value matrix.

$\scriptsize \text{head}_1 = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1,64} \\ a_{21} & a_{22} & \cdots & a_{2,64} \\ a_{31} & a_{32} & \cdots & a_{3,64} \\ a_{41} & a_{42} & \cdots & a_{4,64} \end{bmatrix}_{4 \times 64}, \quad \text{head}_2 = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1,64} \\ b_{21} & b_{22} & \cdots & b_{2,64} \\ b_{31} & b_{32} & \cdots & b_{3,64} \\ b_{41} & b_{42} & \cdots & b_{4,64} \end{bmatrix}_{4 \times 64}, \ldots,\ \text{head}_8 = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1,64} \\ h_{21} & h_{22} & \cdots & h_{2,64} \\ h_{31} & h_{32} & \cdots & h_{3,64} \\ h_{41} & h_{42} & \cdots & h_{4,64} \end{bmatrix}_{4 \times 64}$

$\scriptsize \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_8) = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1,64} & b_{11} & b_{12} & \cdots & b_{1,64} & \cdots & h_{11} & h_{12} & \cdots & h_{1,64} \\ a_{21} & a_{22} & \cdots & a_{2,64} & b_{21} & b_{22} & \cdots & b_{2,64} & \cdots & h_{21} & h_{22} & \cdots & h_{2,64} \\ a_{31} & a_{32} & \cdots & a_{3,64} & b_{31} & b_{32} & \cdots & b_{3,64} & \cdots & h_{31} & h_{32} & \cdots & h_{3,64} \\ a_{41} & a_{42} & \cdots & a_{4,64} & b_{41} & b_{42} & \cdots & b_{4,64} & \cdots & h_{41} & h_{42} & \cdots & h_{4,64} \end{bmatrix}_{4 \times 512}$

The concatenated Attention Value matrix is then passed through a Fully Connected Layer to produce the final output of Multi-head Attention.

(This is just a Fully Connected Layer (Dense Layer), so there is no need to think too hard about it: it is one more linear transformation. A 512x512 weight matrix is used, so the dimension is preserved. The paper's formula has no bias term; the $b$ below corresponds to an implementation with a Dense-layer bias, where the same 512-dimensional bias vector is added to every row.)

$\scriptsize \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_8) = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1,512} \\ c_{21} & c_{22} & \cdots & c_{2,512} \\ c_{31} & c_{32} & \cdots & c_{3,512} \\ c_{41} & c_{42} & \cdots & c_{4,512} \end{bmatrix}_{4 \times 512}$

$\scriptsize\text{Output} = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_8) \times W^O + b$

$\scriptsize = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1,512} \\ c_{21} & c_{22} & \cdots & c_{2,512} \\ c_{31} & c_{32} & \cdots & c_{3,512} \\ c_{41} & c_{42} & \cdots & c_{4,512} \end{bmatrix}_{4 \times 512} \times \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,512} \\ w_{21} & w_{22} & \cdots & w_{2,512} \\ \vdots & \vdots & \ddots & \vdots \\ w_{512,1} & w_{512,2} & \cdots & w_{512,512} \end{bmatrix}_{512 \times 512} + \begin{bmatrix} b_{1} & b_{2} & \cdots & b_{512} \\ b_{1} & b_{2} & \cdots & b_{512} \\ b_{1} & b_{2} & \cdots & b_{512} \\ b_{1} & b_{2} & \cdots & b_{512} \end{bmatrix}_{4 \times 512}$

$\scriptsize = \begin{bmatrix} o_{11} & o_{12} & \cdots & o_{1,512} \\ o_{21} & o_{22} & \cdots & o_{2,512} \\ o_{31} & o_{32} & \cdots & o_{3,512} \\ o_{41} & o_{42} & \cdots & o_{4,512} \end{bmatrix}_{4 \times 512}$

This completes the Multi-head Attention layer.

The point of using Multi-head rather than Single-head Attention is that splitting attention across several heads lets each head learn different weights. As a result, each head can perform a different kind of attention. (Each head can focus on different features and patterns in the input sequence; for example, one head might focus on the grammatical relations between words and another on their semantic relations.)

In the end, the input and output of this layer have the same dimension.
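Putting the pieces together, Multi-head Attention can be sketched as a loop over heads followed by a concat and one more linear map. This NumPy sketch follows the paper's formula (no bias on $W^O$ ); the stacked weight layout `(h, d_model, d_k)`, the random values, and the function names are assumptions made for the example. Real implementations typically compute all heads in one batched operation instead of a Python loop.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention with h heads; W_Q, W_K, W_V have shape (h, d_model, d_k)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (n, d_k)
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(weights @ V)                 # (n, d_k) per head
    concat = np.concatenate(heads, axis=-1)       # (n, h * d_k)
    return concat @ W_O                           # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) * 0.05 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) * 0.05

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

The output has the same 4 x 512 shape as the input `X`, which is what allows the residual connection in the next step.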

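To briefly summarize what the Query, Key, and Value matrices each mean:

Query → the "question": the item currently in focus, the information being looked for. (e.g. the token being translated)

Key → an identifier for each (input) token, playing a role similar to a "label". It is multiplied with the Query to compute the score.

Value → the actual information and context carried by the word. The output is a weighted sum of these values, weighted by the scores computed from Query and Key.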

Add(Residual Connection) & Normalization

$\text {Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$

After Multi-head Attention, the residual connection (Add) is applied, and the result is normalized.

The residual connection mitigates the vanishing gradient problem and preserves the information from before the attention layer, while Layer Normalization keeps the scale of each layer's output stable, which helps training proceed reliably.

$\small M = \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1,512} \\ m_{21} & m_{22} & \cdots & m_{2,512} \\ m_{31} & m_{32} & \cdots & m_{3,512} \\ m_{41} & m_{42} & \cdots & m_{4,512} \end{bmatrix}_{4 \times 512}$

If the matrix M is the output of the Multi-head Attention layer,

the residual connection and normalization can be performed as below. (We simply add the input of Multi-head Attention back and normalize each row.)

$\small \text{Normalized Output} = \text{LayerNorm}(\text{Add\_Output}) =$

$\small \text{LayerNorm} \left( \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1,512} \\ x_{21} & x_{22} & \cdots & x_{2,512} \\ x_{31} & x_{32} & \cdots & x_{3,512} \\ x_{41} & x_{42} & \cdots & x_{4,512} \end{bmatrix} + \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1,512} \\ m_{21} & m_{22} & \cdots & m_{2,512} \\ m_{31} & m_{32} & \cdots & m_{3,512} \\ m_{41} & m_{42} & \cdots & m_{4,512} \end{bmatrix} \right)_{4 \times 512}$
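As a rough illustration of what "normalize each row" means, here is a NumPy sketch of Add & Norm. Layer normalization subtracts each row's mean, divides by its standard deviation, then applies a learned scale and shift; the names `gamma`, `beta`, the `eps` value, and the random inputs are assumptions for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each row (token) over its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """LayerNorm(x + Sublayer(x)): residual connection followed by layer norm."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 512))              # input to the attention sub-layer
M = rng.normal(size=(4, 512))              # stand-in for the attention output
gamma, beta = np.ones(512), np.zeros(512)  # learned parameters in a real model

out = add_and_norm(X, M, gamma, beta)
```

With `gamma` at 1 and `beta` at 0, every row of `out` has mean close to 0 and standard deviation close to 1.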

FeedForward

$\text{FFN}(x) = \text{max}(0,xW_1+b_1)W_2+b_2$

This adds non-linearity to each Encoder and Decoder layer.

As the formula shows, the input passes through Fully Connected Layer - ReLU - Fully Connected Layer in order.

The first Fully Connected Layer maps from $d_{model}=512$ to $d_{ff}=2048$ dimensions.

After ReLU, the second Fully Connected Layer reduces it back to $d_{model}=512$ dimensions, so the input and output have the same dimension.

(In the running example, this is 4x512 → 4x2048 → 4x512. Since these are simple Dense and ReLU layers, the worked matrices are omitted.)
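Although the matrices are omitted, the shapes are easy to confirm in code. A minimal NumPy sketch of the FFN formula, with random stand-in weights (the 0.02 scale and zero biases are arbitrary choices for the example):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # (n, d_ff), ReLU
    return hidden @ W2 + b2                # (n, d_model)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights are applied to each row separately, feeding a single token through `feed_forward` gives the same result as the corresponding row of the full output.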

Encoder, Decoder

As the architecture diagram in the paper shows, the Encoder and Decoder layers are not used just once; each is stacked several times, with every layer in the stack having the same structure (but its own parameters).

In the paper, $N=6$ for both the Encoder and the Decoder, so each stack consists of 6 layers.

The Decoder's Masked Multi-head Attention was covered above, so let's briefly touch on the Decoder's second Multi-head Attention, Cross Attention.

In Cross Attention, the Query comes from the Decoder itself: it is the output of the Masked Multi-head Self-Attention in the same Decoder layer. The Key and Value come from the final output of the N-layer Encoder stack.
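The only difference from Self-Attention is where Q, K, and V come from, which a small NumPy sketch makes visible. It shows a single head with random stand-in inputs and weights; `dec_x`, `enc_out`, and `cross_attention` are names invented for this example, and the sequence lengths (4 source tokens, 3 target tokens) are arbitrary.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, W_Q, W_K, W_V):
    """Single head: queries from the decoder, keys and values from the encoder."""
    Q = dec_x @ W_Q     # (n_dec, d_k)
    K = enc_out @ W_K   # (n_enc, d_k)
    V = enc_out @ W_V   # (n_enc, d_k)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n_dec, n_enc)
    return weights @ V, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(4, 512))  # encoder output for 4 source tokens
dec_x = rng.normal(size=(3, 512))    # decoder states for 3 target tokens
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))

out, weights = cross_attention(dec_x, enc_out, W_Q, W_K, W_V)
```

The attention weights are now 3 x 4 rather than square: each target position distributes its attention over the source tokens, so no causal mask is needed here.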