1 Subword Units

Neural network models use a fixed-size vocabulary to handle natural language. When the potential vocabulary space is huge, as in a neural machine translation (NMT) task, too many words will be unknown to the model.

To address this challenge, Sennrich, Haddow, and Birch (2015) proposed breaking rare words into subword units for neural network modeling. Their first segmentation approach is inspired by the byte pair encoding compression algorithm, or BPE for short.

1.1 Byte Pair Encoding

BPE was originally a data compression algorithm that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte. By maintaining a mapping table from each new byte to the pair it replaced, we can recover the original message from the compressed representation by reversing the encoding process.
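To make the compression view concrete, here is a minimal sketch of our own (not the original algorithm's implementation) that uses single characters as stand-in "bytes" and assumes the symbols X, Y, and Z do not occur in the input:

```python
def compress(data, unused="XYZ"):
    """Iteratively replace the most frequent adjacent pair with an unused symbol."""
    table = {}
    free = iter(unused)
    while True:
        # Count adjacent pairs in the current sequence.
        pairs = {}
        for a, b in zip(data, data[1:]):
            pairs[a + b] = pairs.get(a + b, 0) + 1
        if not pairs or max(pairs.values()) < 2:
            break  # No pair repeats; nothing left to compress.
        top = max(pairs, key=pairs.get)
        sym = next(free)   # Grab the next unused symbol.
        table[sym] = top   # Remember which pair it replaced.
        data = data.replace(top, sym)
    return data, table

def decompress(data, table):
    # Undo the merges in reverse order of creation.
    for sym, pair in reversed(list(table.items())):
        data = data.replace(sym, pair)
    return data
```

For example, compress("aaabdaaabac") returns ("ZdZac", {"X": "aa", "Y": "Xa", "Z": "Yb"}), and decompress inverts it exactly.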

The same idea can be applied at the character level instead of the byte level on a given corpus, effectively constituting a subword segmentation algorithm.

The procedure for applying BPE to obtain subwords can be summarized as follows:

  1. Extract word-level vocabulary as the initial vocabulary
  2. Represent each word in the initial vocabulary at the character level; this serves as the working vocabulary
  3. [Pairing] Count all adjacent symbol pairs across every word in the working vocabulary, and add the concatenation of the most frequent pair to the subword vocabulary as a new subword
  4. [Merging] Replace every occurrence of that most frequent pair in the working vocabulary with its concatenated version
  5. Repeat steps 3-4 for a given number of iterations, depending on the desired subword vocabulary size
  6. The final vocabulary is the initial fullword vocabulary plus the subword vocabulary

Based on this procedure, pairs across word boundaries are never considered as potential subwords. The number of merge operations is the only hyperparameter of the algorithm.

Since the most frequent pairs are merged first, common words remain as single unique symbols in the vocabulary, while rare words are split into smaller units (subwords).
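As a worked example, the compact merge loop below (a sketch of our own, using the toy word frequencies from the Sennrich et al. paper, with characters spelled out space-separated) shows how the first merges pick up frequent character pairs:

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    old, new = " ".join(pair), "".join(pair)
    return {w.replace(old, new): f for w, f in vocab.items()}

# Toy corpus frequencies; each word is spelled as space-separated characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = merge_pair(best, vocab)
# merges: [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three iterations the frequent suffix "est" already exists as a subword, while rarer words such as "wider" would still be segmented into smaller pieces.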

Here is a toy implementation of BPE applied to a given corpus:

import re
from collections import defaultdict

class BPE:

  def __init__(self, sents, min_cnt=3, verbose=False):
    self.verbose = verbose
    init_vocab = defaultdict(int)
    for sent in sents:
      words = re.split(r"\W+", sent)
      for w in words:
        if w != "":
          init_vocab[w] += 1
    # Create fullword vocabulary.
    self.word_vocab = {k: v for k, v in init_vocab.items() if v >= min_cnt}
    # Insert a space between the chars of each word for later ease of merging.
    # We directly borrow the idea from https://www.aclweb.org/anthology/P16-1162.
    self.working_vocab = {" ".join(k): v for k, v in self.word_vocab.items()}
    self.subword_vocab = defaultdict(int)
    # Also build a character-level vocabulary as the base subwords.
    self.char_vocab = defaultdict(int)
    for sent in sents:
      for char in list(sent):
        self.char_vocab[char] += 1

  def _find_top_subword(self):
    subword_pairs = defaultdict(int)
    for w, cnt in self.working_vocab.items():
      subw = w.split()
      for i in range(len(subw) - 1):
        # Count bigrams.
        subword_pairs[subw[i], subw[i+1]] += cnt
    top_subw_pair = max(subword_pairs, key=subword_pairs.get)
    top_subw = "".join(top_subw_pair)
    self.subword_vocab[top_subw] = subword_pairs[top_subw_pair]
    if self.verbose:
      print("New subword added: %s" % top_subw)
    return top_subw_pair