English to Katakana with Sequence to Sequence in PyTorch

Wanasit T

This is my second article reposted on Bloggie.

In the previous article, I wrote about translating English words into Katakana using Sequence-to-Sequence learning in TensorFlow (Keras). In this article, I describe how to implement the same model in PyTorch.

Note: This example is written in Python 3.7 and PyTorch 1.1

All data and code are available on GitHub.

Data Preparation

As in the previous article, we will use the Japanese-English name pairs dataset. More details about Japanese Katakana and the dataset can be found there.

import pandas as pd

# Load the name pairs (column 0: English, column 1: Japanese) and shuffle them reproducibly
data = pd.read_csv('../data/joined_titles.csv', header=None)
data = data.sample(frac=1, random_state=0)

data_english = [s.lower() for s in data[0]] # ['dorogobuzh', 'gail hopkins', 'novatek', ...]
data_japanese = [s.lower() for s in data[1]] # ['ドロゴブージ', 'ゲイル・ホプキンス', 'ノヴァテク', ...]

We will also need to apply the same data transformation:

  • Build encoding dictionary (characters to IDs)
  • Encode (transform) the names into sequences of IDs
  • Append PADDING characters (0's) at the end so that all sequences have the same length

These steps are already implemented as the build_characters_encoding and transform functions in katakana/encoding.py.


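# Assumes the repository root is on the Python path
from katakana import encoding

# Both sequences are padded to 20 IDs, matching the example vectors below
INPUT_LENGTH = 20
OUTPUT_LENGTH = 20

# Building Dictionary
english_encoding, english_decoding, english_dict_size = encoding.build_characters_encoding(data_english)
japanese_encoding, japanese_decoding, japanese_dict_size = encoding.build_characters_encoding(data_japanese)

# Data Transformation
encoded_input = encoding.transform(english_encoding, data_english, vector_size=INPUT_LENGTH)
encoded_output = encoding.transform(japanese_encoding, data_japanese, vector_size=OUTPUT_LENGTH)

# First encoded English name ('dorogobuzh'):
# > [13 24 12 24  7 24 18 35 47 28  0  0  0  0  0  0  0  0  0  0]
# First encoded Japanese name ('ドロゴブージ'):
# > [85 50 17 65 21 58  0  0  0  0  0  0  0  0  0  0  0  0  0  0]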
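
To make the three transformation steps concrete, here is a minimal sketch of what such helpers could look like. It is not the code from katakana/encoding.py: the function names build_encoding_sketch and transform_sketch are hypothetical, and reserving ID 0 for PADDING and ID 1 for START is an assumption made for illustration.

```python
import numpy as np

PADDING_ID = 0  # assumed: 0 is reserved for padding, as in the vectors above
START_ID = 1    # assumed: 1 is reserved for the START character

def build_encoding_sketch(names):
    """Assign an integer ID (2, 3, 4, ...) to every distinct character."""
    char_to_id = {}
    for name in names:
        for ch in name:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id) + 2
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    dict_size = len(char_to_id) + 2  # +2 for the PADDING and START IDs
    return char_to_id, id_to_char, dict_size

def transform_sketch(char_to_id, names, vector_size):
    """Encode each name as a row of IDs, right-padded with PADDING_ID."""
    encoded = np.full((len(names), vector_size), PADDING_ID, dtype=np.int64)
    for row, name in enumerate(names):
        # Names longer than vector_size are truncated
        for col, ch in enumerate(name[:vector_size]):
            encoded[row, col] = char_to_id[ch]
    return encoded

toy_encoding, toy_decoding, toy_size = build_encoding_sketch(['ab', 'ba', 'c'])
toy_encoded = transform_sketch(toy_encoding, ['ab', 'c'], vector_size=4)
print(toy_encoded.tolist())  # [[2, 3, 0, 0], [4, 0, 0, 0]]
```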

Sequence-to-Sequence in PyTorch


We implement the encoder as a PyTorch Module. The encoder consists of an embedding layer (Embedding) and an LSTM layer (LSTM). The module embeds the input, passes the embedded input into the LSTM, and returns the LSTM output at the final time step.

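import torch
from torch.nn import Module, Embedding, LSTM

class Encoder(Module):
    def __init__(self, input_dict_size=english_dict_size):
        super(Encoder, self).__init__()
        self.embedding = Embedding(input_dict_size, 64)
        self.lstm = LSTM(64, 64, batch_first=True)

    def forward(self, encoder_input_sequences):
        # (batch, sequence) -> (batch, sequence, 64)
        embedded = self.embedding(encoder_input_sequences)
        output, _ = self.lstm(embedded)
        # Keep only the output of the final time step: (batch, 64)
        return output[:, -1]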

Note: we set batch_first=True so that PyTorch's LSTM takes input with dimensions (batch_size, sequence_size, vector_size), similar to TensorFlow's LSTM.
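
The shapes involved can be checked with a small standalone sketch. The dictionary size of 50 and the batch of 8 random sequences are arbitrary values chosen for illustration; only the layer sizes (64) and batch_first=True mirror the encoder above.

```python
import torch
from torch.nn import Embedding, LSTM

embedding = Embedding(num_embeddings=50, embedding_dim=64)  # 50 is an arbitrary dictionary size
lstm = LSTM(64, 64, batch_first=True)

batch = torch.randint(0, 50, (8, 20))   # (batch_size=8, sequence_size=20)
embedded = embedding(batch)             # (8, 20, 64)
output, (h_n, c_n) = lstm(embedded)     # output: (8, 20, 64), h_n: (1, 8, 64)
last_step = output[:, -1]               # (8, 64)

print(embedded.shape, output.shape, last_step.shape)
# For a single-layer, unidirectional LSTM the last output step equals the final hidden state
print(torch.allclose(last_step, h_n[0]))  # True
```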


We also implement the decoder as a PyTorch Module. The module consists of an embedding layer, an LSTM layer, and a linear layer (Linear, the equivalent of Dense in Keras). It takes two inputs, decoder_input_sequences and encoder_output.

Similar to the encoder, the decoder embeds the input sequence and passes the embedded sequence to the LSTM. However, this time, we initialize the LSTM's state with the encoder's output. The LSTM's outputs are then passed into the linear layer to produce the final output.

from torch.nn import Module, Embedding, LSTM, Linear

class Decoder(Module):
    def __init__(self, output_dict_size=japanese_dict_size):
        super(Decoder, self).__init__()
        self.embedding = Embedding(output_dict_size, 64)
        self.lstm = LSTM(64, 64, batch_first=True)
        self.linear = Linear(64, output_dict_size)

    def forward(self, encoder_output, decoder_input_sequences):
        # The LSTM state must have shape (num_layers, batch, hidden), so add a leading dimension
        encoder_output = encoder_output.unsqueeze(0)
        embedded = self.embedding(decoder_input_sequences)
        # Use the encoder's output as both the initial hidden state and the initial cell state
        output, _ = self.lstm(embedded, (encoder_output, encoder_output))
        output = self.linear(output)
        return output
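
The unsqueeze(0) call is needed because PyTorch's LSTM expects its initial state in the shape (num_layers, batch_size, hidden_size), even when batch_first=True. A small sketch with zero tensors as stand-ins (the batch size of 8 and sequence length of 20 are arbitrary) shows the shapes:

```python
import torch
from torch.nn import LSTM

lstm = LSTM(64, 64, batch_first=True)

encoder_output = torch.zeros(8, 64)      # stand-in for the encoder output: (batch_size, hidden_size)
state = encoder_output.unsqueeze(0)      # (1, 8, 64): one layer, 8 sequences, 64 hidden units
embedded = torch.zeros(8, 20, 64)        # stand-in for the embedded decoder input

# The same tensor serves as the initial hidden state and the initial cell state
decoder_lstm_output, (h_n, c_n) = lstm(embedded, (state, state))
print(state.shape, decoder_lstm_output.shape)
```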

Note: Unlike the TensorFlow version, we don't apply a Softmax activation to the final output. This makes it easier to apply CrossEntropyLoss, which expects raw scores (see "Training the model"). Applying the Softmax also wouldn't change the result when we use the decoder to generate the output greedily (see "Testing the model"), because Softmax preserves the ordering of the scores.
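
Both claims in this note can be verified on random scores. The sketch below uses an arbitrary batch of 4 score vectors over 10 classes; it checks that the argmax is the same with and without Softmax, and that the cross-entropy computed from raw scores matches log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # raw scores, as returned by the decoder's linear layer
targets = torch.tensor([1, 0, 3, 9])     # arbitrary target class IDs

# Softmax is monotonic, so the greedy choice is unchanged
probabilities = F.softmax(logits, dim=-1)
same_argmax = torch.equal(logits.argmax(dim=-1), probabilities.argmax(dim=-1))

# cross_entropy applies log-softmax internally, so it takes raw scores directly
loss_from_logits = F.cross_entropy(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(same_argmax)                                   # True
print(torch.allclose(loss_from_logits, loss_manual)) # True
```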

Putting them together

Finally, we combine the Encoder and the Decoder into a single PyTorch Module for Sequence-to-Sequence learning.

class Seq2Seq(Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, encoder_input_sequences, decoder_input_sequences):
        encoder_output = self.encoder(encoder_input_sequences)
        decoder_output = self.decoder(encoder_output, decoder_input_sequences)
        return decoder_output

We also need to prepare the training data. The decoder input is the expected output shifted right by one position, with the START character placed at the front.

import numpy as np

# (Expected) Decoder Output
training_decoder_output = encoded_output

# Encoder Input
training_encoder_input = encoded_input

# Decoder Input: the expected output shifted right by one, starting with START
training_decoder_input = np.zeros_like(encoded_output)
training_decoder_input[:, 1:] = encoded_output[:,:-1]
training_decoder_input[:, 0] = encoding.CHAR_CODE_START
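
A toy example shows what the shift does. The START ID of 1 is an assumption for illustration (the real value is encoding.CHAR_CODE_START), and the rows are shortened to 8 IDs.

```python
import numpy as np

START_ID = 1  # assumed value, standing in for encoding.CHAR_CODE_START

expected_output = np.array([
    [85, 50, 17, 65, 21, 58, 0, 0],
    [ 5,  6,  0,  0,  0,  0, 0, 0],
])

# Shift every row right by one position and put START at the front
decoder_input = np.zeros_like(expected_output)
decoder_input[:, 1:] = expected_output[:, :-1]
decoder_input[:, 0] = START_ID

print(decoder_input.tolist())
# [[1, 85, 50, 17, 65, 21, 58, 0], [1, 5, 6, 0, 0, 0, 0, 0]]
```

At every position, the decoder therefore sees the characters that precede the one it is asked to predict.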

Training the model

PyTorch doesn't provide an out-of-the-box training function similar to Model.fit() in TensorFlow (Keras).

We need to implement a training function that:

  • Shuffles the training dataset and splits the data into batches
  • For each batch:
    • Resets the optimizer by calling optimizer.zero_grad() to clear all gradients from the previous iteration
    • Runs the model to generate the output, and calculates the loss by comparing the expected and generated output using a criterion (in this case, CrossEntropyLoss)
    • Runs loss.backward() to compute the gradients of the loss, and optimizer.step() to let the optimizer update the model

import numpy as np
import torch
from torch.nn import CrossEntropyLoss

# batch_size=64 is an assumed default; the data arguments default to the arrays prepared above
def train_epoch(model, optimizer, batch_size=64, criterion=CrossEntropyLoss(),
                encoder_input=training_encoder_input,
                decoder_input=training_decoder_input,
                decoder_output=training_decoder_output):
    # re-shuffle the training data:
    permutation = np.random.permutation(encoder_input.shape[0])
    encoder_input = encoder_input[permutation]
    decoder_input = decoder_input[permutation]
    decoder_output = decoder_output[permutation]

    epoch_loss = 0
    iteration_count = 0
    for begin_index in range(0, len(encoder_input), batch_size):
        end_index = begin_index + batch_size
        iteration_count += 1
        # Get training batch
        encoder_input_step = torch.tensor(encoder_input[begin_index:end_index])
        decoder_input_step = torch.tensor(decoder_input[begin_index:end_index])
        decoder_output_step = torch.tensor(decoder_output[begin_index:end_index])
        # Clear the gradients left over from the previous iteration
        optimizer.zero_grad()
        # Run the model and calculate loss
        output = model(encoder_input_step, decoder_input_step)
        output = output.view(-1, output.shape[-1])
        target = decoder_output_step.view(-1)
        loss = criterion(output, target)
        # Generate gradient and let optimizer update model
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / iteration_count
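
The reshaping before the loss and the zero_grad / backward / step sequence can be tried in isolation. In the sketch below, a single Linear layer is a stand-in for the Seq2Seq model and all sizes are arbitrary; it only illustrates that CrossEntropyLoss wants scores of shape (N, classes) and targets of shape (N,), and that one optimizer step changes the weights.

```python
import torch
from torch.nn import CrossEntropyLoss, Linear
from torch.optim import Adam

torch.manual_seed(0)
batch_size, seq_len, dict_size = 4, 5, 7     # arbitrary toy sizes
toy_model = Linear(3, dict_size)             # stand-in for the Seq2Seq model
toy_optimizer = Adam(toy_model.parameters())
criterion = CrossEntropyLoss()

features = torch.randn(batch_size, seq_len, 3)
target = torch.randint(0, dict_size, (batch_size, seq_len))
weights_before = toy_model.weight.detach().clone()

toy_optimizer.zero_grad()                          # clear old gradients
output = toy_model(features)                       # (4, 5, 7): one score vector per position
flat_output = output.view(-1, output.shape[-1])    # (20, 7)
flat_target = target.view(-1)                      # (20,)
loss = criterion(flat_output, flat_target)
loss.backward()                                    # compute gradients
toy_optimizer.step()                               # update the weights

weights_changed = not torch.equal(weights_before, toy_model.weight.detach())
print(flat_output.shape, flat_target.shape, weights_changed)
```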

I have found that we can get a reasonably good model with the default Adam optimizer after 20-30 epochs of training (around an hour on CPUs, or a few minutes on GPUs).

from torch.optim import Adam

def train(model, optimizer, n_epoch=30):
    for i in range(1, n_epoch + 1):
        print('Epoch %i' % i)
        loss = train_epoch(model, optimizer)
        print('> Training Loss', loss)

model = Seq2Seq()
optimizer = Adam(model.parameters())
train(model, optimizer, n_epoch=30)

Testing the model

Applying the trained PyTorch Sequence-to-Sequence model to write Katakana is very similar to the TensorFlow version. A more detailed explanation can be found in the previous article.

We start with the encoder input and a decoder input that contains only the START character. We then repeatedly have the model generate the next output character, write it into the decoder input, and run the model again with the updated decoder input.

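def generate_output(input_sequence):
    # Start with a decoder input that contains only the START character
    decoder_input = np.zeros(shape=(len(input_sequence), OUTPUT_LENGTH), dtype='int')
    decoder_input[:,0] = encoding.CHAR_CODE_START
    encoder_input = torch.tensor(input_sequence)
    decoder_input = torch.tensor(decoder_input)
    with torch.no_grad():  # no gradients are needed for inference
        for i in range(1, OUTPUT_LENGTH):
            # Greedily pick the most likely character at every position
            output = model(encoder_input, decoder_input)
            output = output.argmax(dim=2)
            # Feed the character predicted for position i-1 back in at position i
            decoder_input[:,i] = output[:,i-1]
    return decoder_input[:,1:].numpy()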

def to_katakana(text):
    input_sequence = encoding.transform(english_encoding, [text.lower()], 20)
    output_sequence = generate_output(input_sequence)
    return encoding.decode(japanese_decoding, output_sequence[0])
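
To see how the greedy loop fills the decoder input one position at a time, here is a sketch with a hypothetical stand-in for the trained model. The function toy_model, the START ID of 1, and the toy sizes are all assumptions for illustration; the stand-in simply favours "previous ID + 1" at every position.

```python
import numpy as np

START_ID, LENGTH, VOCAB = 1, 6, 5   # assumed toy values

def toy_model(decoder_input):
    """Hypothetical stand-in for the trained model.
    Returns scores of shape (batch, LENGTH, VOCAB) that favour (previous ID + 1) % VOCAB."""
    scores = np.zeros(decoder_input.shape + (VOCAB,))
    next_ids = (decoder_input + 1) % VOCAB
    rows, cols = np.indices(decoder_input.shape)
    scores[rows, cols, next_ids] = 1.0
    return scores

decoder_input = np.zeros((1, LENGTH), dtype=int)
decoder_input[:, 0] = START_ID
for i in range(1, LENGTH):
    # Re-run the model on everything generated so far and keep its greedy choice
    predicted = toy_model(decoder_input).argmax(axis=2)
    decoder_input[:, i] = predicted[:, i - 1]

print(decoder_input.tolist())  # [[1, 2, 3, 4, 0, 1]]
```

Each iteration recomputes the whole sequence but only uses the prediction at position i-1, which is the same pattern as in generate_output.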

After testing the model with some English words, I was able to reproduce results similar to those of the TensorFlow version.

  • James : ジェームズ
  • John : ジョン
  • Robert : ロベルト
  • Computer : コンプター (correctly, コンピューター)
  • Taxi : タクシ (correctly, タクシー)