
Scraping error memo: "surrogates not allowed" - removing variation selectors (IVS)

Last updated at 2023-02-09, posted at 2022-09-14

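Background

I upload scraped data to BigQuery (BQ).
To do that, I build newline-delimited JSON.
While doing so, an error occurred because some unwanted characters had slipped into the data.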
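For reference, building the newline-delimited JSON could look roughly like the sketch below. The helper name `to_ndjson` and the sample rows are made up for illustration and are not taken from the actual scraper.

```python
import json

def to_ndjson(records):
    # One JSON object per line: the newline-delimited layout used for the upload.
    # ensure_ascii=False keeps Japanese text as-is instead of \uXXXX escapes.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# Hypothetical scraped rows
rows = [{"id": 1, "name": "原"}, {"id": 2, "name": "葛"}]
ndjson = to_ndjson(rows)
```

With `ensure_ascii=False` the text is written as real characters, so any stray surrogate in the data only surfaces later, when the string is encoded as UTF-8.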

Target string
...\udb40\udd00\u539f...

Error message
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 40-41: surrogates not allowed

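\udb40 and \udd00 turned out to be the two halves of a UTF-16 surrogate pair. Together they encode U+E0100, a variation selector: a code point used to distinguish subtle glyph variants of a character.
The message says "surrogates not allowed", so it looks like getting rid of them should be enough.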
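To see why the encoder complains, it may help to reproduce the error and recombine the two values by hand. This is only a sketch: `combine_surrogates` is a helper name introduced here, and the arithmetic is the standard UTF-16 surrogate pair formula.

```python
def combine_surrogates(high: str, low: str) -> int:
    # Standard UTF-16 formula: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
    return 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)

# The pair from the target string maps to U+E0100 (VARIATION SELECTOR-17).
code_point = combine_surrogates("\udb40", "\udd00")

# The two halves stored as separate code points, as in the scraped text.
broken = "\udb40\udd00\u539f"
try:
    broken.encode("utf-8")
    error = None
except UnicodeEncodeError as exc:
    error = exc.reason
```

UTF-8 has no way to represent a surrogate half on its own, which is why the string fails to encode even though the pair, taken together, names a valid code point.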

https://sheilart.github.io/Azure-Alphant/surrogates.html
This page looks usable, so let's turn it into a list... 2048 entries... that seems like a lot.

console.js

const elms = Array.from(document.querySelectorAll('article'));
elms.map((elm)=>{
    return elm.querySelector('p:nth-child(6)').textContent;
})

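This looks like an edge case, so I don't need all of that. As a stopgap, I'll just borrow the checks this person uses:
"I made a class for detecting IVS selector characters"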

\uDB40
\uDD00
\uDDEF
\uFE00
\uFE0F
\u180B
\u180D
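The values above read like range endpoints: U+180B to U+180D, U+FE00 to U+FE0F, and the low surrogates \uDD00 to \uDDEF following \uDB40, which correspond to U+E0100 to U+E01EF. That reading is my interpretation, and `is_variation_selector` is an illustrative name, but under that assumption a check could be sketched as:

```python
# (start, end) pairs, inclusive.
# Assumption: the listed values are the endpoints of these ranges.
VARIATION_SELECTOR_RANGES = [
    (0x180B, 0x180D),    # Mongolian free variation selectors
    (0xFE00, 0xFE0F),    # variation selectors 1-16
    (0xE0100, 0xE01EF),  # variation selectors 17-256 (used by IVS)
]

def is_variation_selector(ch: str) -> bool:
    # True when the single character falls inside any of the ranges above.
    cp = ord(ch)
    return any(start <= cp <= end for start, end in VARIATION_SELECTOR_RANGES)
```

Note that this works on real code points, so a surrogate half such as \uDB40 on its own is not matched.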

This function returns the text with those characters removed.

text_cleansing.py

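import re

# \udb40 is the high surrogate; \udd00-\uddef are the low surrogates (U+E0100-U+E01EF).
# \ufe00-\ufe0f and \u180b-\u180d are matched as ranges, not just their endpoints.
_SELECTORS = re.compile('[\udb40\udd00-\uddef\ufe00-\ufe0f\u180b-\u180d]')

def remove_surrogates(text: str) -> str:
    return _SELECTORS.sub('', text)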
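The regex approach targets surrogate halves stored as separate code points. If the same data might also contain the properly combined code points, one option is to recombine the pairs first and then filter by code point. The sketch below assumes the listed values are range endpoints; `repair_surrogates` and `strip_variation_selectors` are names made up for this example.

```python
import re

# Variation selector ranges as real code points (the range reading is an assumption).
_VS = re.compile("[\u180b-\u180d\ufe00-\ufe0f\U000e0100-\U000e01ef]")

def repair_surrogates(text: str) -> str:
    # Round-trip through UTF-16 so a high/low pair becomes a single code point.
    # "surrogatepass" lets unpaired halves through instead of raising.
    data = text.encode("utf-16-le", "surrogatepass")
    return data.decode("utf-16-le", "surrogatepass")

def strip_variation_selectors(text: str) -> str:
    return _VS.sub("", repair_surrogates(text))

cleaned = strip_variation_selectors("\u845b\udb40\udd00\u539f\ufe0f")
```

An unpaired half would survive this and still fail to encode, so whether that case needs extra handling depends on the data.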

Later on, since this looked tedious to maintain, I switched to the approach below.

import unicodedata

def to_unicode(text: str) -> str:
    return unicodedata.normalize("NFKC", text)
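As a quick check of what NFKC actually changes: it folds compatibility characters such as full-width letters and digits, but as far as I can tell it leaves variation selectors in place, so it may be worth confirming on your own data that normalization alone is enough. The sample strings here are made up.

```python
import unicodedata

def to_unicode(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

# Full-width "ABC123" (U+FF21.. and U+FF11..) is folded to plain ASCII.
folded = to_unicode("\uff21\uff22\uff23\uff11\uff12\uff13")

# A kanji followed by U+E0100: the selector is still there after normalization.
with_ivs = to_unicode("\u845b\U000e0100")
```

If the selectors do need to go, normalization can be combined with an explicit removal step rather than replacing it.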