English to Katakana with Sequence to Sequence in PyTorch
This is my second article reposted on Bloggie.
In the previous article, I wrote about translating English words into Katakana using Sequence-to-Sequence learning in TensorFlow (Keras). In this article, I describe how to implement the same model in PyTorch.
Note: This example is written in Python 3.7 and PyTorch 1.1
All data and code are available on GitHub.
Data Preparation
We will use the same Japanese-English name-pairs dataset as in the previous article. More details about Japanese Katakana and the dataset can be found there.
import pandas as pd

data = pd.read_csv('../data/joined_titles.csv', header=None)
data = data.sample(frac=1, random_state=0)

data_english = [s.lower() for s in data[0]]   # ['dorogobuzh', 'gail hopkins', 'novatek', ...]
data_japanese = [s.lower() for s in data[1]]  # ['ドロゴブージ', 'ゲイル・ホプキンス', 'ノヴァテク', ...]
We will also need to apply the same data transformations:
- Build encoding dictionary (characters to IDs)
- Encode (transform) the names into sequences of IDs
- Append PADDING characters (0's) at the end to make equal-length sequences
These are already implemented as the build_characters_encoding and transform functions in katakana/encoding.py.
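The actual implementations live in the repository; the sketch below is only a rough approximation of what they do (in particular, the exact character-ID assignment, with 0 reserved for padding and 1 for the START code, is an assumption):

import numpy as np

# Rough sketch of the helpers' behavior (not the actual repository code).
CHAR_CODE_PADDING = 0   # assumed: 0 is reserved for padding
CHAR_CODE_START = 1     # assumed value of encoding.CHAR_CODE_START

def build_characters_encoding(strings):
    characters = sorted(set(c for s in strings for c in s))
    encoding = {c: i + 2 for i, c in enumerate(characters)}  # IDs 0 and 1 are reserved
    decoding = {i: c for c, i in encoding.items()}
    return encoding, decoding, len(characters) + 2

def transform(encoding, strings, vector_size):
    # Encode each string into character IDs and pad with 0's to vector_size.
    transformed = np.zeros((len(strings), vector_size), dtype='int64')
    for i, s in enumerate(strings):
        for j, c in enumerate(s[:vector_size]):
            transformed[i, j] = encoding.get(c, CHAR_CODE_PADDING)
    return transformed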
from katakana import encoding

INPUT_LENGTH = 20
OUTPUT_LENGTH = 20

# Building Dictionary
english_encoding, english_decoding, english_dict_size = encoding.build_characters_encoding(data_english)
japanese_encoding, japanese_decoding, japanese_dict_size = encoding.build_characters_encoding(data_japanese)

# Data Transformation
encoded_input = encoding.transform(english_encoding, data_english, vector_size=INPUT_LENGTH)
encoded_output = encoding.transform(japanese_encoding, data_japanese, vector_size=OUTPUT_LENGTH)

print(encoded_input[0])
# > [13 24 12 24 7 24 18 35 47 28 0 0 0 0 0 0 0 0 0 0]

print(encoded_output[0])
# > [85 50 17 65 21 58 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
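As a sanity check (an illustrative call, not in the original code), we can decode a sequence back using the same decode helper used later in "Testing the model"; how trailing padding is handled depends on its implementation in katakana/encoding.py:

# Decode the first encoded Japanese sequence back to characters.
print(encoding.decode(japanese_decoding, encoded_output[0]))
# Expected to recover the original name (e.g. ドロゴブージ), possibly followed by padding.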
Sequence-to-Sequence in PyTorch
Encoder
We implement the encoder as a PyTorch Module. The encoder consists of an embedding (Embedding) and an lstm (LSTM). The module embeds the input with the embedding layer, passes the embedded input into the LSTM, and returns the final time step of the LSTM output as the module's output.
from torch.nn import Module, Embedding, LSTM

class Encoder(Module):

    def __init__(self, input_dict_size=english_dict_size):
        super(Encoder, self).__init__()
        self.embedding = Embedding(input_dict_size, 64)
        self.lstm = LSTM(64, 64, batch_first=True)

    def forward(self, encoder_input_sequences):
        embedded = self.embedding(encoder_input_sequences)
        output, _ = self.lstm(embedded)
        return output[:, -1]
Note: we set batch_first=True to make PyTorch's LSTM take input with dimensions (batch_size, sequence_size, vector_size), similar to TensorFlow's LSTM.
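As a quick sanity check (this snippet is illustrative and not part of the original code), we can run the encoder on a couple of encoded names and look at the output shape:

import torch

# The encoder maps a (batch_size, INPUT_LENGTH) batch of character IDs
# to a (batch_size, 64) vector taken from the last LSTM time step.
encoder = Encoder()
sample_batch = torch.tensor(encoded_input[:2])   # shape: (2, 20)
print(encoder(sample_batch).shape)               # torch.Size([2, 64])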
Decoder
Here, we also implement the decoder as a PyTorch Module. The module consists of an embedding, an lstm, and a linear layer (Linear, i.e. Dense in Keras). It takes two inputs: decoder_input_sequences and encoder_output.
Similar to the encoder, the decoder embeds the input sequence and passes the embedded sequence to the LSTM. However, this time, we initialize the LSTM's state with the encoder's output. The LSTM's output is then passed into the linear layer to produce the final output.
from torch.nn import Linear

class Decoder(Module):

    def __init__(self, output_dict_size=japanese_dict_size):
        super(Decoder, self).__init__()
        self.embedding = Embedding(output_dict_size, 64)
        self.lstm = LSTM(64, 64, batch_first=True)
        self.linear = Linear(64, output_dict_size)

    def forward(self, encoder_output, decoder_input_sequences):
        # Use the encoder's output as the LSTM's initial hidden and cell state.
        # The LSTM expects states of shape (num_layers, batch_size, hidden_size).
        encoder_output = encoder_output.unsqueeze(0)
        embedded = self.embedding(decoder_input_sequences)
        output, _ = self.lstm(embedded, (encoder_output, encoder_output))
        output = self.linear(output)
        return output
Note: Unlike the TensorFlow version, we don't apply a Softmax activation to the final output. This makes it easier to apply CrossEntropyLoss (see "Training the model"). Skipping the Softmax also doesn't change the result when we use the decoder to generate the output greedily (see "Testing the model").
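To illustrate the second point, here is a small check (illustrative only, not part of the original code): the argmax over raw logits always matches the argmax over Softmax probabilities, because Softmax is monotonic along the class dimension.

import torch
from torch.nn.functional import softmax

# Illustrative check: greedy decoding picks the same characters with or without Softmax.
logits = torch.randn(1, OUTPUT_LENGTH, japanese_dict_size)
assert torch.equal(logits.argmax(dim=2), softmax(logits, dim=2).argmax(dim=2))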
Putting them together
Finally, we combine the Encoder and Decoder into a single PyTorch Module for Sequence-to-Sequence learning.
class Seq2Seq(Module):

    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, encoder_input_sequences, decoder_input_sequences):
        encoder_output = self.encoder(encoder_input_sequences)
        decoder_output = self.decoder(encoder_output, decoder_input_sequences)
        return decoder_output
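Before training, it can be useful to verify the combined model's output shape on a small batch (an illustrative check, not part of the original code; the encoded Japanese sequences simply stand in as decoder input here):

# The (untrained) model should produce one row of logits per output time step:
# (batch_size, OUTPUT_LENGTH, japanese_dict_size).
model = Seq2Seq()
output = model(torch.tensor(encoded_input[:2]), torch.tensor(encoded_output[:2]))
print(output.shape)   # torch.Size([2, 20, japanese_dict_size])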
We also need to prepare the training data: the decoder input is the expected output shifted one step to the right, with a START character at the beginning.
import numpy as np

# (Expected) Decoder Output
training_decoder_output = encoded_output

# Encoder Input
training_encoder_input = encoded_input

# Decoder Input
training_decoder_input = np.zeros_like(encoded_output)
training_decoder_input[:, 1:] = encoded_output[:, :-1]
training_decoder_input[:, 0] = encoding.CHAR_CODE_START
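Concretely, for the first training pair the decoder input is just the expected output shifted one step to the right, with encoding.CHAR_CODE_START in front:

# Illustrative check of the shift for the first training pair:
print(training_decoder_output[0])  # [85 50 17 65 21 58  0  0 ...]
print(training_decoder_input[0])   # the same sequence shifted right by one,
                                   # with encoding.CHAR_CODE_START at position 0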
Training the model
PyTorch doesn't provide an out-of-the-box training function similar to TensorFlow or Keras's Model.fit().
We need to implement a training function that will:
- Shuffle the training dataset and split the data into batches
- For each batch:
  - Initialize the optimizer by calling optimizer.zero_grad() to clear all gradients from the previous iteration
  - Run the model to generate output, and calculate the loss by comparing the expected and generated output using a criterion (in this case, CrossEntropyLoss)
  - Run loss.backward() to compute the gradients according to the loss, and optimizer.step() to let the optimizer update the model
import torch
from torch.nn import CrossEntropyLoss

def train_epoch(model, optimizer,
                batch_size=64,
                criterion=CrossEntropyLoss(),
                encoder_input=training_encoder_input,
                decoder_input=training_decoder_input,
                decoder_output=training_decoder_output):

    # Re-shuffle the training data:
    permutation = np.random.permutation(encoder_input.shape[0])
    encoder_input = encoder_input[permutation]
    decoder_input = decoder_input[permutation]
    decoder_output = decoder_output[permutation]

    epoch_loss = 0
    iteration_count = 0

    for begin_index in range(0, len(encoder_input), batch_size):
        end_index = begin_index + batch_size
        iteration_count += 1

        # Get training batch
        encoder_input_step = torch.tensor(encoder_input[begin_index:end_index])
        decoder_input_step = torch.tensor(decoder_input[begin_index:end_index])
        decoder_output_step = torch.tensor(decoder_output[begin_index:end_index])

        optimizer.zero_grad()

        # Run the model and calculate loss
        output = model(encoder_input_step, decoder_input_step)
        output = output.view(-1, output.shape[-1])
        target = decoder_output_step.view(-1)
        loss = criterion(output, target)

        # Generate gradient and let optimizer update model
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / iteration_count
I have found that we can get a reasonably good model with the default Adam optimizer after 20-30 epochs of training (around an hour on CPUs, or a few minutes on GPUs).
from torch.optim import Adam

def train(model, optimizer, n_epoch=30):
    for i in range(1, n_epoch + 1):
        print('Epoch %i' % i)
        loss = train_epoch(model, optimizer)
        print('> Training Loss', loss)

model = Seq2Seq()
optimizer = Adam(model.parameters())

train(model, optimizer, n_epoch=30)
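To get the GPU speed-up mentioned above, the model and every training batch would also have to be moved to the GPU. A minimal sketch of that change (an assumption, not part of the original code):

# A minimal sketch, assuming a CUDA-capable GPU is available:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq().to(device)
optimizer = Adam(model.parameters())

# Inside train_epoch, each batch tensor would also need to be moved, e.g.:
#   encoder_input_step = torch.tensor(encoder_input[begin_index:end_index]).to(device)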
Testing the model
Applying the trained PyTorch Sequence-to-Sequence model to write Katakana is very similar to the TensorFlow version. A more detailed explanation can be found in the previous article.
Starting with the encoder input and a decoder input containing only a single START character, we repeatedly have the model generate the next output character, add it to the decoder input, and feed the updated decoder input back into the model.
def generate_output(input_sequence):
    decoder_input = np.zeros(shape=(len(input_sequence), OUTPUT_LENGTH), dtype='int')
    decoder_input[:, 0] = encoding.CHAR_CODE_START

    encoder_input = torch.tensor(input_sequence)
    decoder_input = torch.tensor(decoder_input)

    # Run the model on CPU and generate the output one character at a time.
    model.cpu()
    for i in range(1, OUTPUT_LENGTH):
        output = model(encoder_input, decoder_input)
        output = output.argmax(dim=2)
        decoder_input[:, i] = output[:, i - 1]

    return decoder_input[:, 1:].detach().numpy()


def to_katakana(text):
    input_sequence = encoding.transform(english_encoding, [text.lower()], 20)
    output_sequence = generate_output(input_sequence)
    return encoding.decode(japanese_decoding, output_sequence[0])
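For example (an illustrative call; the exact output depends on the trained model):

print(to_katakana('James'))   # expected to print ジェームズ (see the results below)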
After testing the model with some English words, I was able to reproduce results similar to those from the TensorFlow version.
- James : ジェームズ
- John : ジョン
- Robert : ロベルト
- Computer : コンプター (correctly, コンピューター)
- Taxi : タクシ (correctly, タクシー)
Wanasit T