Qiita Engineer Festa20242024年7月17日まで開催中！

Pythonで単一換字式暗号(アルファベット、英語)の解読をしてみた。

Last updated at 2024-06-30Posted at 2024-06-30

はじめに

この暗号、解読できますか？

自然言語処理を学んでいて思い出したのがシャーロックホームズの踊る人形であった。
ホームズは優れた状況判断能力を活かし、さらに頻度分析を用いて暗号を解読していた。
私にもPythonを用いて暗号解読できそうな気がしたのでこの記事を作成した。
単純な換字式暗号（アルファベット、英語）の解読を試みる。

（この暗号は下記サイトを利用して作った。そもそも踊る人形にでてくる暗号表記のため解読表が存在するのだが、今回は解読表がないものとして利用する。）

解読できる暗号の想定
・単純な換字式暗号文であること。（図形にはアルファベットが割り振られている）
・文章は簡単な英語であるという想定。
・単語間にちゃんとスペースがある。
・特殊な記号、特殊な固有名詞は使われていないものとする。

手順１：単語一覧の前処理

一致させる単語の一覧を準備する。
今回は単純に一通りの単語が網羅されている単語一覧が欲しかったため、使い勝手の良さそうなこちらからデータセットを拝借した。
https://github.com/kujirahand/EJDict

UTF-8の形式を選び、前処理を行っていく。

with open('ejdict-hand-utf8.txt', 'r', encoding='utf-8') as word_file:
    word_list = word_file.read()
    print(word_list)

結果

このようなデータセットが入っているため加工していく。

import pandas as pd
#扱いやすいようにデータフレーム化する。
df_word_list = pd.read_csv('ejdict-hand-utf8.txt', encoding='utf-8', delimiter='\t')

#今回日本語訳はいらないため削除する。
df_word_list = df_word_list[['ENGLISH']]

#データ型を文字列型に統一する。
df_word_list = df_word_list.astype(str)

#検索に影響しないよう、すべて大文字に変換しておく。
df_word_list.loc[:, 'ENGLISH'] = df_word_list['ENGLISH'].apply(lambda x: x.upper())

#単語ごとに※ユニット番号をつける。
def num_per_word(word):
    char_to_number = {}
    current_number = 1
    result = []
    for char in word:
        if char not in char_to_number:
            char_to_number[char] = current_number
            current_number += 1
        result.append(char_to_number[char])
    return result

_num_per_word = df_word_list['ENGLISH'].apply(num_per_word)
df_word_list['ENCODE'] = _num_per_word

df_word_list

結果

※ユニット番号
以下の条件で振り分けた番号のこと。
1.始めて出てきたアルファベットに1から順に数字を振っていく。
2.同じアルファベットが出てきたら同じ番号を振る。

手順２：文章全体の準備

手順１とはまた別に単語毎に番号を振り分ける。（目視作業）

1.始めて出てきた図形に1から順に数字を振っていく。
2.同じ図形が出てきたら同じ番号を振る。

word_1 = [1]
word_2 = [2, 3]
word_3 = [1, 4, 5, 6, 7, 6, 8, 5, 6, 9]
word_4 = [1, 4]
word_5 = [9, 2, 5, 2]
word_6 = [8, 10, 1, 6, 4, 10, 6]

words = [word_1, word_2, word_3, word_4, word_5, word_6]
sentence = '  '.join(' '.join(map(str, sublist)) for sublist in words)
sentence

結果

'1  2 3  1 4 5 6 7 6 8 5 6 9  1 4  9 2 5 2  8 10 1 6 4 10 6'

3.スペースごとに番号をリストに入れ各単語を想定した変数を作る。
4.最終的に完成させたい文章の箱を作る

手順３：単語毎の準備

手順２とは別に、出現単語毎にユニット番号を振る。（目視作業）

unit_1 = [1]
unit_2 = [1, 2]
unit_3 = [1, 2, 3, 4, 5, 4, 6, 3, 4, 7]
unit_4 = [1, 2]
unit_5 = [1, 2, 3, 2]
unit_6 = [1, 2, 3, 4, 5, 2, 4]

単語一覧と出現単語のユニットが一致するものを抽出する関数を用意する。

def narrow_down (unit_num):
    matched_unit = df_word_list[df_word_list['ENCODE'].apply(lambda x: x == unit_num)]
    return matched_unit

narrow_down(unit_num)

このような結果が得られる

unit毎の解釈

def print_candidate(num, unit_num):
    result = narrow_down(unit_num).shape[0]
    return print(f'word_{num}の候補数は、 {result} 個')

word_nums = [1, 2, 3, 4, 5, 6]
units = [unit_1, unit_2, unit_3, unit_4, unit_5, unit_6]
for i, unit_i in zip(word_nums, units):
    print_candidate(i, unit_i)

結果

word_1の候補数は、 31 個
word_2の候補数は、 307 個
word_3の候補数は、 3 個
word_4の候補数は、 307 個
word_5の候補数は、 176 個
word_6の候補数は、 11 個

候補数を降順にすると、
word_3 < word_6 < word_1 ……となる。
候補数の少ない順にwordに当てはめた数字をさらに絞り込んでいけそうだ。

手順４：暗号解読

候補数の少ないword_3を確認してみる。

print(word_3)
narrow_down(unit_3)['ENGLISH']

結果

[1, 4, 5, 6, 7, 6, 8, 5, 6, 9]
1651     ANTIBIOTIC
21111    INTERESTED
28096    OPTIMISTIC
Name: ENGLISH, dtype: object

この結果をリストに入れておく。

candidate_word_3_1 = list(narrow_down(unit_3).loc[1651, 'ENGLISH'])
candidate_word_3_2 = list(narrow_down(unit_3).loc[21111, 'ENGLISH'])
candidate_word_3_3 = list(narrow_down(unit_3).loc[28096, 'ENGLISH'])

print(candidate_word_3_1)
print(candidate_word_3_2)
print(candidate_word_3_3)

結果

['A', 'N', 'T', 'I', 'B', 'I', 'O', 'T', 'I', 'C']
['I', 'N', 'T', 'E', 'R', 'E', 'S', 'T', 'E', 'D']
['O', 'P', 'T', 'I', 'M', 'I', 'S', 'T', 'I', 'C']

[1, 4, 5, 6, 7, 6, 8, 5, 6, 9]
の単語が
候補１：['A', 'N', 'T', 'I', 'B', 'I', 'O', 'T', 'I', 'C']
候補２：['I', 'N', 'T', 'E', 'R', 'E', 'S', 'T', 'E', 'D']
候補３：['O', 'P', 'T', 'I', 'M', 'I', 'S', 'T', 'I', 'C']
という３つの候補に絞れた。

この候補３つをしらみつぶしていく。

まずは候補１のANTIBIOTICから。

候補１　ANTIBIOTIC

仮の辞書_alpha_wordに分かったアルファベットを入れていく。

dict_candidate_word_3_1 = dict(zip(word_3, candidate_word_3_1))
print(dict_candidate_word_3_1)

結果

{1: 'A', 4: 'N', 5: 'T', 6: 'I', 7: 'B', 8: 'O', 9: 'C'}

_alpha_dict = {1: '未1', 2: '未2', 3: '未3', 4: '未4', 5: '未5', 6: '未6', 7: '未7', 8: '未8', 9: '未9', 10: '未10'}
updates = dict_candidate_word_3

def update_alpha_dict(alpha_dict, updates):
    for key, value in updates.items():
        if key in alpha_dict:
            alpha_dict[key] = value
    return alpha_dict

_alpha_dict = update_alpha_dict(_alpha_dict, updates)
print(_alpha_dict)

結果

{1: 'A', 2: '未2', 3: '未3', 4: 'N', 5: 'T', 6: 'I', 7: 'B', 8: 'O', 9: 'C', 10: '未10'}

次に候補数が２番目に少ないword_6を確認していく。

print(word_6)
print(narrow_down(unit_6)['ENGLISH'])

結果

[8, 10, 1, 6, 4, 10, 6]
2268     ASPERSE
8926     COPYBOY
11925    DOGTROT
15926    FOXTROT
20970    INSTANT
21175    INTERNE
21820    JOGTROT
22854    LEARNER
24828    MEASLES
35363    SCIENCE
39637    SURTOUT
Name: ENGLISH, dtype: object

候補は１６個。
だが、_alpha_dictに入れた数字とアルファベットの対応から絞り込めそうだ。

replaced_words = [_alpha_dict[num] if num in _alpha_dict else num for num in word_6]
print(replaced_words)

['O', '未10', 'A', 'I', 'N', '未10', 'I']

絞り込まれたword_6から正規表現を用いてマッチングする。

import re

pattern = r'^O.*A.*I.*N.*I$' 
matched_rows = narrow_down(unit_6)['ENGLISH'][narrow_down(unit_6)['ENGLISH'].str.match(pattern, flags=re.IGNORECASE)]
print(matched_rows)

Series([], Name: ENGLISH, dtype: object)

候補がないようだ。
固有名詞である可能性もあるが、今回一旦候補１は除外して考えてみる。

候補２　INTERESTED

暗号解読の最初に戻って候補２のINTERESTEDで再考していく。
以下は同様の手順となるため省略する。

print(candidate_word_3_2)

dict_candidate_word_3_2 = dict(zip(word_3, candidate_word_3_2))
print(dict_candidate_word_3_2)

updates = dict_candidate_word_3_2

_alpha_dict = update_alpha_dict(_alpha_dict, updates)
print(_alpha_dict)

replaced_words = [_alpha_dict[num] if num in _alpha_dict else num for num in word_6]
print(replaced_words)

結果

['I', 'N', 'T', 'E', 'R', 'E', 'S', 'T', 'E', 'D']
{1: 'I', 4: 'N', 5: 'T', 6: 'E', 7: 'R', 8: 'S', 9: 'D'}
{1: 'I', 2: '未2', 3: '未3', 4: 'N', 5: 'T', 6: 'E', 7: 'R', 8: 'S', 9: 'D', 10: '未10'}
['S', '未10', 'I', 'E', 'N', '未10', 'E']

pattern = r'^S.*I.*E.*N.*E$' 
matched_rows = narrow_down(unit_6)[narrow_down(unit_6)['ENGLISH'].str.match(pattern, flags=re.IGNORECASE)]
print(matched_rows)

結果

35363  SCIENCE  [1, 2, 3, 4, 5, 2, 4]

今度は先ほどと違い、候補が１つ上がった。
INTERESTED SCIENCE
文章として違和感がないように思うため、このまま進めてみよう。

def replace_numbers(s, mapping):
    words = s.split()
    result = []
    for word in words:
        if word.isdigit() and int(word) in mapping:
            result.append(mapping[int(word)])
        else:
            result.append(word)
    return ' '.join(result)

output_sentence = replace_numbers(sentence, _alpha_dict)
print(output_sentence)

結果

I 未2 未3 I N T E R E S T E D I N D 未2 T 未2 S C I E N C E

pattern = r'D.T'
matched_rows = narrow_down(unit_5)[narrow_down(unit_5)['ENGLISH'].str.match(pattern, flags=re.IGNORECASE)]
print(matched_rows)

結果

10198    DATA  [1, 2, 3, 2]

candidate_word_5 = list(narrow_down(unit_5).loc[10198, 'ENGLISH'])
print(candidate_word_5)

dict_candidate_word_5 = dict(zip(word_5, candidate_word_5))
print(dict_candidate_word_5)

updates = dict_candidate_word_5

_alpha_dict = update_alpha_dict(_alpha_dict, updates)
print(_alpha_dict)

結果

['D', 'A', 'T', 'A']
{9: 'D', 2: 'A', 5: 'T'}
{1: 'I', 2: 'A', 3: '未3', 4: 'N', 5: 'T', 6: 'E', 7: 'R', 8: 'S', 9: 'D', 10: 'C'}

output_sentence = replace_numbers(sentence, _alpha_dict)
print(output_sentence)

結果

I A 未3 I N T E R E S T E D I N D A T A S C I E N C E

最後の１単語となった。
ここまで来たら予測はつくが、一応BERTを使ってみよう。

pip install transformers

from transformers import pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
text = "I [MASK] INTERESTED IN DATA SCIENCE"
results = fill_mask(text)

results_list =[]
for result in results:
    results_list.append(result['token_str'].upper())

results_list

結果

['WAS', 'AM', 'BECAME', 'GOT', 'GET']

matched_word = matched_rows[matched_rows['ENGLISH'].isin(results_list)]
matched_word

結果

1212	AM	[1, 2]
1213	AM	[1, 2]

matched_word = list(narrow_down(unit_2).loc[1212, 'ENGLISH'])
print(matched_word)

dict_mached_word = dict(zip(word_2, matched_word))
print(dict_mached_word)

updates = dict_mached_word

_alpha_dict = update_alpha_dict(_alpha_dict, updates)
print(_alpha_dict)

alpha_dict = _alpha_dict

結果

['A', 'M']
{2: 'A', 3: 'M'}
{1: 'I', 2: 'A', 3: 'M', 4: 'N', 5: 'T', 6: 'E', 7: 'R', 8: 'S', 9: 'D', 10: 'C'}

未知のアルファベットがすべて埋まった。
それでは、暗号文を解読してみよう。

def replace_numbers(s, mapping):
    def replacer(match):
        num = int(match.group())
        return mapping.get(num, match.group())
    
    pattern = re.compile(r'\b\d+\b')
    return pattern.sub(replacer, s)

output_sentence = replace_numbers(sentence, alpha_dict)
print(output_sentence)

結果

I  A M  I N T E R E S T E D  I N  D A T A  S C I E N C E

上手くいったようだ。

課題とまとめ

今回行った暗号はこの方法で解けたが、課題が多く残った。

固有名詞への対応
踊る人形ではエルシーやエルリッジなどの固有名詞が登場する。
ホームズはその暗号が書かれた状況を予測して解読をしていた。
今回の手順で行った解読方法を使うならば、土地や名前などの固有名詞を集めておくとなお解読の幅が広がるだろう。

単語間のアルファベットの重複が少ない場合

こちらは第三者に作ってもらった暗号である。
今回の手順でやってみると、

sentence_1 = "1 2 3 4  3 4  5  6 7 8"

word_1 = [1, 2, 3, 4]
word_2 = [3, 4]
word_3 = [5]
word_4 = [6, 7, 8]

print(sentence)

def print_candidate(num, unit_num):
    result = narrow_down(unit_num).shape[0]
    return print(f'word_{num}の候補数は、 {result} 個')

word_nums = [1, 2, 3, 4]
units = [unit_1, unit_2, unit_3, unit_4]
for i, unit_i in zip(word_nums, units):
    print_candidate(i, unit_i)

結果

word_1の候補数は、 2083 個
word_2の候補数は、 307 個
word_3の候補数は、 31 個
word_4の候補数は、 899 個

と、単語間の重複が少なく絞り込みが困難であった。
(word_3は１文字であるためこの総当たり方法では如何ともしがたい)
さらなる工夫を必要とする。

完全自動化へ
例えば最初のユニット番号の振り分けやINTERESTとSCIENCEの組み合わせがそれらしいと考えたのは私自身である。
これを自動化して最終結果まで一気通貫で解読できるようにしていきたい。
画像処理や自然言語処理あたりで出来そうではあるので更新したい。

冗長になりましたが、ここまでお読みくださりありがとうございました。
コードの書き方や暗号解読に関することなど、ご意見アドバイスありましたらよろしくお願いします。

参考URL

この記事は以下のサイトを参考に執筆した。

古代の暗号をPythonで実装する
こちらのサイトで単一換字暗号というワードを知った。
ホームズも行った頻度分析の方法が記されている。
上手く組み合わせれば課題の解決に繋がるかもしれない。

シャーロック・ホームズ・トピアー踊る人形デモ
暗号作成用に利用。
こちらに暗号解読表があるが、これがないものとして扱った。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up