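# Machine Learning Textbook, 3rd Edition

# Chapter 16 - Modeling Sequential Data Using Recurrent Neural Networks (2/2)

**Use the links below to view this notebook in the Jupyter notebook viewer (nbviewer.jupyter.org) or to run it in Google Colab (colab.research.google.com).**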

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://nbviewer.org/github/rickiepark/python-machine-learning-book-3rd-edition/blob/master/ch16/ch16_part2.ipynb"><img src="https://jupyter.org/assets/share.png" width="60" />View in the Jupyter notebook viewer</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/rickiepark/python-machine-learning-book-3rd-edition/blob/master/ch16/ch16_part2.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

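### Table of contents

- Implementing RNNs for sequence modeling in TensorFlow
    - Project two: implementing a character-level language model in TensorFlow
        - Preprocessing the dataset
        - Building a character-level RNN model
        - Evaluation phase - generating new text
- Understanding language with the Transformer model
    - Understanding the self-attention mechanism
        - A basic version of self-attention
        - Self-attention mechanism with query, key, and value weights
    - Multi-head attention and the Transformer block
- Summary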

In [1]:
from IPython.display import Image

## Project two: implementing a character-level language model in TensorFlow

In [2]:
Image(url='https://git.io/JLdVE', width=700)

### Preprocessing the dataset

In [3]:
# If you are running on Colab, run the following command to download the text.
!wget https://raw.githubusercontent.com/rickiepark/python-machine-learning-book-3rd-edition/master/ch16/1268-0.txt

--2023-11-10 06:06:00--  https://raw.githubusercontent.com/rickiepark/python-machine-learning-book-3rd-edition/master/ch16/1268-0.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1171600 (1.1M) [text/plain]
Saving to: ‘1268-0.txt’


2023-11-10 06:06:00 (18.0 MB/s) - ‘1268-0.txt’ saved [1171600/1171600]



In [4]:
import numpy as np


## Read and preprocess the text
with open('1268-0.txt', 'r', encoding='UTF8') as fp:
    text = fp.read()

start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
print(start_indx, end_indx)

text = text[start_indx:end_indx]
char_set = set(text)
print('Total length:', len(text))
print('Unique characters:', len(char_set))

567 1112917
Total length: 1112350
Unique characters: 80


In [5]:
Image(url='https://git.io/JLdVz', width=700)

In [6]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32)

print('Encoded text shape: ', text_encoded.shape)

print(text[:15], '     == Encoding ==> ', text_encoded[:15])
print(text_encoded[15:21], ' == Decoding ==> ', ''.join(char_array[text_encoded[15:21]]))

Encoded text shape:  (1112350,)
THE MYSTERIOUS       == Encoding ==>  [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28]  == Decoding ==>  ISLAND


In [7]:
Image(url='https://git.io/JLdVV', width=700)

In [8]:
import tensorflow as tf


ds_text_encoded = tf.data.Dataset.from_tensor_slices(text_encoded)

for ex in ds_text_encoded.take(5):
    print('{} -> {}'.format(ex.numpy(), char_array[ex.numpy()]))

44 -> T
32 -> H
29 -> E
1 ->  
37 -> M


In [9]:
seq_length = 40
chunk_size = seq_length + 1

ds_chunks = ds_text_encoded.batch(chunk_size, drop_remainder=True)

## inspection:
for seq in ds_chunks.take(1):
    input_seq = seq[:seq_length].numpy()
    target = seq[seq_length].numpy()
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])),
          ' -> ', repr(''.join(char_array[target])))

[44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6
  6  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51]  ->  74
'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'  ->  'y'


In [10]:
Image(url='https://git.io/JLdVr', width=700)

In [11]:
## Define a function that splits each chunk into input (x) and target (y)
def split_input_target(chunk):
    input_seq = chunk[:-1]
    target_seq = chunk[1:]
    return input_seq, target_seq

ds_sequences = ds_chunks.map(split_input_target)

## Check:
for example in ds_sequences.take(2):
    print('Input (x):', repr(''.join(char_array[example[0].numpy()])))
    print('Target (y):', repr(''.join(char_array[example[1].numpy()])))
    print()

Input (x): 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target (y): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

Input (x): ' Anthony Matonak, and Trevor Carlson\n\n\n\n'
Target (y): 'Anthony Matonak, and Trevor Carlson\n\n\n\n\n'
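
The chunking and input/target split shown above can also be mirrored in plain Python, without `tf.data`. The following is a minimal sketch under stated assumptions: `toy_text`, the helper `make_chunks`, and the small `seq_length` are hypothetical and chosen only for illustration. They are not part of the notebook.

```python
def make_chunks(encoded, chunk_size):
    """Split a sequence into consecutive chunks and drop the incomplete tail
    (the same effect as batch(chunk_size, drop_remainder=True))."""
    n_chunks = len(encoded) // chunk_size
    return [encoded[i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_chunks)]


def split_input_target(chunk):
    """Input is everything but the last element; the target is the same
    chunk shifted one step ahead."""
    return chunk[:-1], chunk[1:]


toy_text = "hello world!"   # hypothetical example text
seq_length = 4               # illustrative value, not the notebook's 40

chunks = make_chunks(toy_text, seq_length + 1)
pairs = [split_input_target(c) for c in chunks]

for x, y in pairs:
    print(repr(x), '->', repr(y))
```

Each target is the input shifted by one character, so the model is asked to predict the next character at every position of the sequence, not only at the end.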



In [12]:
# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

tf.random.set_seed(1)
ds = ds_sequences.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

ds

<_BatchDataset element_spec=(TensorSpec(shape=(None, 40), dtype=tf.int32, name=None), TensorSpec(shape=(None, 40), dtype=tf.int32, name=None))>

### Building a character-level RNN model

In [13]:
def build_model(vocab_size, embedding_dim, rnn_units):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.LSTM(
            rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model


charset_size = len(char_array)
embedding_dim = 256
rnn_units = 512

tf.random.set_seed(1)

model = build_model(
    vocab_size=charset_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 256)         20480     
                                                                 
 lstm (LSTM)                 (None, None, 512)         1574912   
                                                                 
 dense (Dense)               (None, None, 80)          41040     
                                                                 
Total params: 1636432 (6.24 MB)
Trainable params: 1636432 (6.24 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
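
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard Keras layout: an embedding matrix of size `vocab_size x embedding_dim`, an LSTM with four gates that each have input weights, recurrent weights, and one bias vector, and a dense layer with a bias. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus bias
    return input_dim * output_dim + output_dim


vocab_size, embedding_dim, rnn_units = 80, 256, 512

counts = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    'lstm': lstm_params(embedding_dim, rnn_units),
    'dense': dense_params(rnn_units, vocab_size),
}
total = sum(counts.values())

print(counts)
print('total:', total)
```

The three values match the `Param #` column (20480, 1574912, 41040), and their sum matches the reported total of 1636432.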


In [14]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True
    ))

model.fit(ds, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7badc8bcbb50>

### Evaluation phase - generating new text

In [15]:
tf.random.set_seed(1)

logits = [[1.0, 1.0, 1.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())

Probabilities: [0.33333334 0.33333334 0.33333334]
array([[1, 2, 0, 1, 0, 1, 1, 2, 1, 1]])


In [16]:
tf.random.set_seed(1)

logits = [[1.0, 1.0, 3.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())

Probabilities: [0.10650698 0.10650698 0.78698605]
array([[2, 2, 0, 2, 2, 2, 2, 2, 1, 2]])
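
With only ten draws, the sampled categories follow the softmax probabilities only loosely. The sketch below illustrates the same idea with many more draws. It deliberately uses Python's standard `random` module instead of `tf.random.categorical`, so the individual samples will not match the TensorFlow output above. The seed and `n_draws` are arbitrary choices.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 1.0, 3.0]
probs = softmax(logits)

rng = random.Random(1)       # arbitrary seed for reproducibility
n_draws = 10000              # arbitrary sample size
draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)

counter = Counter(draws)
freqs = [counter[i] / n_draws for i in range(len(probs))]

print('probabilities:        ', [round(p, 4) for p in probs])
print('empirical frequencies:', freqs)
```

As the number of draws grows, the empirical frequencies approach the softmax probabilities, which is why the category with logit 3.0 dominates the samples.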


In [17]:
def sample(model, starting_str,
           len_generated_text=500,
           max_input_length=40,
           scale_factor=1.0):
    # Encode the seed string as integer indices with shape (1, seq_len)
    encoded_input = [char2int[s] for s in starting_str]
    encoded_input = tf.reshape(encoded_input, (1, -1))

    generated_str = starting_str

    model.reset_states()
    for i in range(len_generated_text):
        # Logits for every position: (1, seq_len, vocab) -> (seq_len, vocab)
        logits = model(encoded_input)
        logits = tf.squeeze(logits, 0)

        # Scale the logits, then draw one character index per position
        scaled_logits = logits * scale_factor
        new_char_indx = tf.random.categorical(
            scaled_logits, num_samples=1)

        # Keep only the sample drawn for the last position
        new_char_indx = tf.squeeze(new_char_indx)[-1].numpy()

        generated_str += str(char_array[new_char_indx])

        # Append the new index and keep at most max_input_length recent ones
        new_char_indx = tf.expand_dims([new_char_indx], 0)
        encoded_input = tf.concat(
            [encoded_input, new_char_indx],
            axis=1)
        encoded_input = encoded_input[:, -max_input_length:]

    return generated_str

tf.random.set_seed(1)
print(sample(model, starting_str='The island'))

The island had explore the island,
and fleamed necessary.

But Herbert, the reporter thought, there baromed oceans of those water from a long point.
The wind heart that Herbert dresing
the night, or comb,” answered the engineer
approved the sources destroyed, replesieg, was abandoned, which
the convicts were able to die read told them!”

“Oh! follow?”
said Pencroft, “but Neb, I
were!” cried Herbert. “We have only a fireplace,” observed Cyrus Harding.

The cord seek the projects of times, and on the report


* **Predictability vs. randomness**

In [18]:
logits = np.array([[1.0, 1.0, 3.0]])

print('Probabilities before scaling:       ', tf.math.softmax(logits).numpy()[0])

print('Probabilities after scaling by 0.5: ', tf.math.softmax(0.5*logits).numpy()[0])

print('Probabilities after scaling by 0.1: ', tf.math.softmax(0.1*logits).numpy()[0])

Probabilities before scaling:        [0.10650698 0.10650698 0.78698604]
Probabilities after scaling by 0.5:  [0.21194156 0.21194156 0.57611688]
Probabilities after scaling by 0.1:  [0.31042377 0.31042377 0.37915245]
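
One way to quantify this effect is the entropy of the scaled distribution: the closer the probabilities are to uniform, the higher the entropy and the more random the sampled text. The sketch below is an illustration in plain Python. The `entropy` helper and the chosen list of scale factors are assumptions, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [1.0, 1.0, 3.0]
scale_factors = [2.0, 1.0, 0.5, 0.1]   # from sharper to flatter

entropies = []
for alpha in scale_factors:
    probs = softmax([alpha * v for v in logits])
    entropies.append(entropy(probs))
    print('scale={:<4} probs={} entropy={:.4f}'.format(
        alpha, [round(p, 4) for p in probs], entropies[-1]))
```

A scale factor above 1 sharpens the distribution (lower entropy, more predictable text), while a factor below 1 flattens it toward uniform, whose entropy for three classes is `log(3)`.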


In [19]:
tf.random.set_seed(1)
print(sample(model, starting_str='The island',
             scale_factor=2.0))

The island had been placed in the water. His continued to five o’clock, the reporter thought that the convicts had passed the body of the water which were the passage of which the sulphate of an incident, the colonists threw the horizon, and only fell back to the exploration of the colonists. The wind from the colonists had not been torn by his former seep to the south
with the exploration of the day before. The latter protected the palisade; but the colonists were the projects of the lake, some of the co


In [20]:
tf.random.set_seed(1)
print(sample(model, starting_str='The island',
             scale_factor=0.5))

The island had egelyed holligge! By the Hap would passed, formedly never, I
rejoyp,--“To “you wild be go five enes or enohmovery pohitat;
unforwind hiry wish heat, what symuring rasila
wimmed hearor, projeeved
up,
listengr.
He

When you,” proocle, five fires, rocies ngltog.
However’s,! Then,
force izedicarents? Theyt? look in
LlEchummons
be no ofeer?”
said Pencroft.

PAcam Daybreak, rib pushem.
Sun. He hall
as oiley oursquice, obstrug,-- effer fros, ungen warse;-, obtaw
by the rooms, always somber
to puri


# Understanding language with the Transformer model

## Understanding the self-attention mechanism

### A basic version of self-attention

In [21]:
Image(url='https://git.io/JLdVo', width=700)
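
A basic form of self-attention without any learned parameters can be sketched in NumPy as follows. It assumes a common formulation: the importance weights are dot products between input vectors, normalized with softmax, and each output is the weighted sum of all inputs. The input `x` is random and the sizes `T` and `d` are hypothetical. The figure's exact notation is not reproduced here.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(1)
T, d = 5, 4                      # hypothetical sequence length and feature size
x = rng.normal(size=(T, d))      # stand-in for embedded input tokens

omega = x @ x.T                  # (T, T) pairwise dot-product scores
attn = softmax(omega, axis=1)    # each row is a distribution over positions
z = attn @ x                     # (T, d) context-aware output vectors

print('row sums:', attn.sum(axis=1))
print('output shape:', z.shape)
```

Each output vector `z[i]` mixes information from every position in the sequence, weighted by how similar that position is to position `i`.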

### Self-attention mechanism with query, key, and value weights
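
The following NumPy sketch illustrates scaled dot-product attention with separate query, key, and value projections. The projection matrices `U_q`, `U_k`, and `U_v` are randomly initialized stand-ins for weights that would normally be learned, and the sizes `T`, `d`, and `d_k` are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(1)
T, d, d_k = 5, 8, 4                  # hypothetical sizes
x = rng.normal(size=(T, d))          # stand-in for embedded input tokens

# Random stand-ins for the learned projection matrices
U_q = rng.normal(size=(d, d_k))
U_k = rng.normal(size=(d, d_k))
U_v = rng.normal(size=(d, d_k))

queries = x @ U_q                    # (T, d_k)
keys = x @ U_k                       # (T, d_k)
values = x @ U_v                     # (T, d_k)

# Scale by sqrt(d_k) so the scores do not grow with the key dimension
scores = queries @ keys.T / np.sqrt(d_k)   # (T, T)
attn = softmax(scores, axis=1)             # rows sum to 1
context = attn @ values                    # (T, d_k)

print('attention shape:', attn.shape)
print('context shape:', context.shape)
```

Unlike the parameter-free version, the projections let the model learn which aspects of a token matter when it acts as a query, as a key, or as a value.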

## Multi-head attention and the Transformer block

In [22]:
Image(url='https://git.io/JLdV6', width=700)
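
Multi-head attention runs several scaled dot-product attention operations in parallel on lower-dimensional projections and concatenates the results. The sketch below covers only that attention step. It omits the residual connections, layer normalization, and feed-forward sublayer of a full Transformer block. The function `multi_head_self_attention`, the random weights, and all sizes are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Return the projected multi-head output and per-head attention weights."""
    T, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError('d_model must be divisible by num_heads')
    d_head = d_model // num_heads

    def split_heads(m):
        # (T, d_model) -> (num_heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, T, T)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                       # (heads, T, d_head)

    # Concatenate the heads back to (T, d_model), then mix them with w_o
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ w_o, attn


rng = np.random.default_rng(1)
T, d_model, num_heads = 6, 8, 2          # hypothetical sizes
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print('output shape:', out.shape)
print('attention shape:', attn.shape)
```

Because each head works on its own slice of the projected features, different heads can attend to different positions for the same token.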