Warning on junk input like "ああああ"

Python

Posted at 2025-07-23

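When you ask for free-text input, such as a survey answer, some people type throwaway junk like "ああああ" (the Japanese equivalent of mashing "aaaa"). This post builds logic that warns on that kind of input.

The full source code is given at the end.

Logic

The idea is a score that gets larger when the text is short or when the same character is repeated a lot.

For each distinct character, compute (number of occurrences of that character / total number of characters) ** 2, then sum these values over all distinct characters. That sum is the score.

If the score is at or above some threshold, the plan is to show a warning like "Please enter a proper answer" (although a survey that scolds you like that is not exactly pleasant).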
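
Written as a formula, the score looks roughly like this. The notation is an assumption for illustration: $n_c$ is the number of times character $c$ appears, $N$ is the total number of characters, and $k$ is the number of distinct characters.

```latex
S = \sum_{c} \left( \frac{n_c}{N} \right)^{2},
\qquad
\frac{1}{k} \le S \le 1
```

The upper bound 1 is reached when the whole text is one repeated character, and the lower bound $1/k$ when all $k$ distinct characters appear equally often. Since $k \le N$, even a text with no repeats at all scores $1/N$, which is why short texts score higher.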

Let's experiment with the following code.

def sample_code():
    # ABABAC
    char_dict1 = {
        "A": 3,
        "B": 2,
        "C": 1
    }
    # AAAAAAAAAABBBCCC
    char_dict2 = {
        "A": 10,
        "B": 3,
        "C": 3
    }

    # Swap the variable here to check the score for each sample
    char_dict = char_dict1

    count_vector = char_dict.values()
    total = sum(count_vector)
    print(f"total={total}")
    score = 0
    for cnt in count_vector:
        sub_score = (cnt / total) ** 2
        print(f"sub_score={sub_score}")
        score += sub_score
    print(f"score={score}")

Results

Result for char_dict1

total=6
sub_score=0.25
sub_score=0.1111111111111111
sub_score=0.027777777777777776
score=0.3888888888888889

Result for char_dict2

total=16
sub_score=0.390625
sub_score=0.03515625
sub_score=0.03515625
score=0.4609375

As expected, the more the same character is repeated, the higher the score.

Source code

Function code

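# junk_text_validator/junk_text_validator.py
from typing import Dict, List


def text_to_char_dict(text: str) -> Dict[str, int]:
    """Count how many times each character appears in text."""
    char_dict: Dict[str, int] = {}
    for c in text:
        char_dict[c] = char_dict.get(c, 0) + 1
    return char_dict


def char_dict_to_counter_vector(text: str, char_dict: Dict[str, int]) -> List[int]:
    """Return the per-character counts in first-appearance order.

    text is unused; the parameter is kept so existing callers keep working.
    """
    return list(char_dict.values())


def text_to_vector(text: str, char_dict: Dict[str, int]) -> List[int]:
    """Thin wrapper around char_dict_to_counter_vector."""
    return char_dict_to_counter_vector(text, char_dict)


def calc_char_concentration(counts: List[int]) -> float:
    """Sum of squared character frequencies (1.0 for one repeated character)."""
    total = sum(counts)
    if total == 0:
        # Empty input: avoid dividing by zero.
        return 0.0
    y = 0.0
    for cnt in counts:
        y += (cnt / total) ** 2
    return y


def calculate_junk_score(counts: List[int]) -> float:
    """Junk score; currently just the character concentration."""
    return calc_char_concentration(counts)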

Test code

from junk_text_validator import junk_text_validator

ok_sample_sentences = [
    "吾輩は猫である。名前はまだ無い。",
    "国破れて山河あり。",
    "柿くへば鐘が鳴るなり法隆寺",
    "使い勝手が良かった"
]

ng_sample_sentences = [
    "ああああああああああ",
    "111aaabbb",
    "asdfghjkl",
    "なにもないです。",
    "いいえはいええはいはい"
]


def test_ok_samples():
    # pytest -s -vv junk_text_validator_test.py::test_ok_samples
    print("\n\n")
    for sample_text in ok_sample_sentences:
        char_dict = junk_text_validator.text_to_char_dict(sample_text)
        index_vector = junk_text_validator.text_to_vector(sample_text, char_dict)
        entropy = junk_text_validator.calculate_junk_score(index_vector)
        print(f"entropy: {entropy}, text: {sample_text}, index_vector: {index_vector}")


def test_ng_samples():
    # pytest -s -vv junk_text_validator_test.py::test_ng_samples
    print("\n\n")
    for sample_text in ng_sample_sentences:
        char_dict = junk_text_validator.text_to_char_dict(sample_text)
        index_vector = junk_text_validator.text_to_vector(sample_text, char_dict)
        entropy = junk_text_validator.calculate_junk_score(index_vector)
        print(f"entropy: {entropy}, text: {sample_text}, index_vector: {index_vector}")

Test results (expected OK)

entropy: 0.078125, text: 吾輩は猫である。名前はまだ無い。, index_vector: [1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]
entropy: 0.1111111111111111, text: 国破れて山河あり。, index_vector: [1, 1, 1, 1, 1, 1, 1, 1, 1]
entropy: 0.07692307692307694, text: 柿くへば鐘が鳴るなり法隆寺, index_vector: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
entropy: 0.1111111111111111, text: 使い勝手が良かった, index_vector: [1, 1, 1, 1, 1, 1, 1, 1, 1]

Test results (expected NG)

entropy: 1.0, text: ああああああああああ, index_vector: [10]
entropy: 0.3333333333333333, text: 111aaabbb, index_vector: [3, 3, 3]
entropy: 0.1111111111111111, text: asdfghjkl, index_vector: [1, 1, 1, 1, 1, 1, 1, 1, 1]
entropy: 0.15625, text: なにもないです。, index_vector: [2, 1, 1, 1, 1, 1, 1]
entropy: 0.3553719008264462, text: いいえはいええはいはい, index_vector: [5, 3, 3]

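Judging from these results, a rule like "warn when the score is 0.3 or higher" looks mostly fine: it flags none of the OK samples and catches three of the five NG samples. The two it misses, "asdfghjkl" (0.111) and "なにもないです。" (0.156), have almost no repeated characters, so this score alone cannot tell them apart from normal text.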
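
As a rough illustration of that rule, here is a minimal sketch that wraps the score in a yes/no check. The names `junk_score`, `needs_warning`, and `JUNK_THRESHOLD` are made up for this example, and the 0.3 cutoff is only the value suggested by the samples above, so it should be tuned against real input.

```python
from collections import Counter

# Assumed cutoff, taken from the "0.3 or higher" idea above; tune it for real data.
JUNK_THRESHOLD = 0.3


def junk_score(text: str) -> float:
    """Sum of squared character frequencies (0.0 for empty text)."""
    total = len(text)
    if total == 0:
        return 0.0
    score = 0.0
    for cnt in Counter(text).values():
        score += (cnt / total) ** 2
    return score


def needs_warning(text: str, threshold: float = JUNK_THRESHOLD) -> bool:
    """True when the text is repetitive enough to warn about."""
    return junk_score(text) >= threshold


if __name__ == "__main__":
    for sample in ["ああああああああああ", "111aaabbb", "asdfghjkl", "使い勝手が良かった"]:
        print(sample, needs_warning(sample))
```

Empty input scores 0.0 here and is not flagged, on the assumption that a separate required-field check handles it.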

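What if you want to take this more seriously?

Look at n-gram diversity

Use n-grams, the technique familiar from search engines, and build a formula that returns NG when there is a lot of repetition and OK when the text is mostly unique.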
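
One possible way to turn that idea into a number is the ratio of distinct character n-grams to all n-grams in the text. This is only a sketch under my own assumptions: the function name `ngram_diversity`, the default `n = 2`, and the choice of a simple ratio are not from the original logic.

```python
def ngram_diversity(text: str, n: int = 2) -> float:
    """Ratio of distinct character n-grams to total n-grams (0.0 to 1.0)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of n characters across the text.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        # Text shorter than n characters: there is nothing to compare.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    for sample in ["ああああああああああ", "いいえはいええはいはい", "使い勝手が良かった"]:
        print(sample, ngram_diversity(sample))
```

Low values mean heavy repetition, so NG would be "diversity below some cutoff". Note that keyboard mashing such as "asdfghjkl" has all-distinct bigrams and still gets the maximum diversity, so this measure does not catch it either.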

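Dictionary-based word segmentation

Use a lightweight word segmentation library such as janome and treat the input as NG when it contains one or fewer words other than particles. That criterion might give acceptable judgments for most patterns.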
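
A minimal sketch of that criterion might look like the following. The helper names `tokenize_with_janome` and `is_junk_by_words` are hypothetical, and ignoring "記号" (symbols and punctuation) in addition to "助詞" (particles) is an extra assumption on top of the rule above. The decision function takes plain (surface, part of speech) pairs so it can be tried without janome installed.

```python
from typing import Iterable, List, Tuple

# Assumption: top-level part-of-speech labels that do not count as words.
# "助詞" (particle) follows the rule above; "記号" (symbol) is an extra guess.
IGNORED_POS = ("助詞", "記号")


def tokenize_with_janome(text: str) -> List[Tuple[str, str]]:
    """Return (surface, top-level part of speech) pairs using janome."""
    from janome.tokenizer import Tokenizer  # needs `pip install janome`

    tokenizer = Tokenizer()
    return [
        (token.surface, token.part_of_speech.split(",")[0])
        for token in tokenizer.tokenize(text)
    ]


def is_junk_by_words(tokens: Iterable[Tuple[str, str]], min_words: int = 2) -> bool:
    """NG when fewer than min_words tokens remain after dropping ignored POS."""
    words = [surface for surface, pos in tokens if pos not in IGNORED_POS]
    return len(words) < min_words


if __name__ == "__main__":
    # Hand-written tokens for illustration, not actual janome output.
    tokens = [("使い勝手", "名詞"), ("が", "助詞"), ("良かっ", "形容詞"), ("た", "助動詞")]
    print(is_junk_by_words(tokens))
```

With janome installed, the call would be `is_junk_by_words(tokenize_with_janome(text))`. Building a `Tokenizer` loads its dictionary, so real code would probably create it once and reuse it.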

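It needs human-style judgment, so a perfect answer is hard

Suppose some meme makes it popular to praise things with words like "goooooooooood!!!!!!". This logic would judge that as sloppy input, so the user gets warned even though they are giving praise, which is an unpleasant experience.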
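
The false positive is easy to reproduce. The sketch below recomputes the same sum-of-squares score in a hypothetical helper named `junk_score` and compares it against the 0.3 cutoff discussed earlier, which is itself only a suggested value.

```python
from collections import Counter


def junk_score(text: str) -> float:
    """Sum of squared character frequencies (0.0 for empty text)."""
    total = len(text)
    if total == 0:
        return 0.0
    return sum((cnt / total) ** 2 for cnt in Counter(text).values())


if __name__ == "__main__":
    praise = "goooooooooood!!!!!!"
    # The repeated "o" and "!" dominate the text, so the score lands above 0.3.
    print(junk_score(praise) >= 0.3)
```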

In the end this is humans judging what humans do, so there is probably no single right answer, but the approach in this post should at least block input like "ああああああああああああ".

That's all.
