0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

simple BPE

Last updated at Posted at 2021-07-03
import re, collections

def get_stats(vocab):
  pairs = collections.defaultdict(int) 
  for word, freq in vocab.items():
    symbols = word.split()
    for i in range(len(symbols)-1):
      pairs[symbols[i],symbols[i+1]] += freq 
  
  return pairs
  
def merge_vocab(pair, v_in):
  v_out = {}
  bigram = re.escape(' '.join(pair))
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)') 
  for word in v_in:
    w_out = p.sub(''.join(pair), word)
    v_out[w_out] = v_in[word] 
  return v_out

vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6,'w i d e s t </w>':3}
num_merges = 10
for i in range(num_merges):
  print(i)
  pairs = get_stats(vocab)
  best = max(pairs, key=pairs.get) 
  vocab = merge_vocab(best, vocab) 
  print(best)

print(vocab)

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?