MarkItDown fails with "TypeError with message: safe_cmyk()"

Last updated at 2025-04-18. Posted at 2025-04-18.

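MarkItDown converts many kinds of files to Markdown, which makes it very handy for preparing documents for RAG. While migrating our conversion process to Lambda, however, converting some PDFs to Markdown failed with an error.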

What happened

Error converting to markdown: File conversion failed after 1 attempts:
- PdfConverter threw TypeError with message: safe_cmyk() missing 3 required positional arguments: 'm', 'y', and 'k'

It does not happen every time, but the error above occurred for some PDFs.
One example is the instruction manual available from
https://panasonic.jp/dish/products/NP-TZ300.html

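When I asked ChatGPT, the answer was roughly:

The problem is probably an image or something similar inside the PDF,
so converting from CMYK to RGB might fix it. That is hard to do on Lambda, though, so it may be better to do it locally.

I nearly gave up at that point, but leaving it like this was not an option, so I investigated the cause.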

Investigating the cause

Tracing through the code, the error appeared to be raised during the conversion step.

    def _convert(
        self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
    ) -> DocumentConverterResult:
        res: Union[None, DocumentConverterResult] = None

        # Keep track of which converters throw exceptions
        failed_attempts: List[FailedConversionAttempt] = []

        # Create a copy of the page_converters list, sorted by priority.
        # We do this with each call to _convert because the priority of converters may change between calls.
        # The sort is guaranteed to be stable, so converters with the same priority will remain in the same order.
        sorted_registrations = sorted(self._converters, key=lambda x: x.priority)

        # Remember the initial stream position so that we can return to it
        cur_pos = file_stream.tell()

        for stream_info in stream_info_guesses + [StreamInfo()]:
            for converter_registration in sorted_registrations:
                converter = converter_registration.converter
                # Sanity check -- make sure the cur_pos is still the same
                assert (
                    cur_pos == file_stream.tell()
                ), f"File stream position should NOT change between guess iterations"

                _kwargs = {k: v for k, v in kwargs.items()}

                # ... (omitted) ...
                # Attempt the conversion
                if _accepts:
                    try:
                        res = converter.convert(file_stream, stream_info, **_kwargs)
                    except Exception:
                        failed_attempts.append(
                            FailedConversionAttempt(
                                converter=converter, exc_info=sys.exc_info()
                            )
                        )
                    finally:
                        file_stream.seek(cur_pos)

                if res is not None:
                    # Normalize the content
                    res.text_content = "\n".join(
                        [line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
                    )
                    res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)
                    return res

        # If we got this far without success, report any exceptions
        if len(failed_attempts) > 0:
            raise FileConversionException(attempts=failed_attempts)

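Normally, a PDF is converted by PdfConverter and processing ends there.
Because of the error, however, the conversion never returned a result, and in the end a FileConversionException was raised,
so the conversion could not complete.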
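To make that control flow easier to follow, here is a minimal, self-contained sketch of the same pattern: try each converter in turn, record any exception, and raise a single aggregate error only after every candidate has failed. `ConversionFailed`, `convert_with_fallback`, and the two toy converters are hypothetical names for illustration, not MarkItDown's actual API.

```python
from typing import Callable, List, Optional, Tuple

Converter = Callable[[bytes], Optional[str]]


class ConversionFailed(Exception):
    """Hypothetical aggregate error; stands in for FileConversionException."""

    def __init__(self, attempts: List[Tuple[str, BaseException]]):
        self.attempts = attempts
        details = "\n".join(
            f" - {name} threw {type(exc).__name__} with message: {exc}"
            for name, exc in attempts
        )
        super().__init__(
            f"File conversion failed after {len(attempts)} attempts:\n{details}"
        )


def convert_with_fallback(
    data: bytes, converters: List[Tuple[str, Converter]]
) -> str:
    attempts: List[Tuple[str, BaseException]] = []
    for name, convert in converters:
        try:
            result = convert(data)
        except Exception as exc:
            # Remember the failure and move on to the next converter.
            attempts.append((name, exc))
            continue
        if result is not None:
            return result
    # Nothing succeeded: report every recorded failure at once.
    raise ConversionFailed(attempts)


def broken_pdf_converter(data: bytes) -> Optional[str]:
    # Toy converter that always fails, mimicking the error in this article.
    raise TypeError(
        "safe_cmyk() missing 3 required positional arguments: 'm', 'y', and 'k'"
    )


def plain_text_converter(data: bytes) -> Optional[str]:
    # Toy converter that always succeeds.
    return data.decode("utf-8")


try:
    convert_with_fallback(b"%PDF-1.7", [("PdfConverter", broken_pdf_converter)])
except ConversionFailed as failure:
    print(failure)
```

With only the failing converter registered, the single recorded TypeError is what surfaces inside the aggregate exception, which is why the message names PdfConverter rather than the code that actually raised.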

Digging deeper: pdfminer looks suspicious

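I searched MarkItDown for safe_cmyk to find out what it even was, but it was not there.
Guessing that it came from the library used for PDF conversion, I looked further and found this: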

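import sys

# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
    import pdfminer
    import pdfminer.high_level
except ImportError:
    # Preserve the error and stack trace so they can be reported later
    _dependency_exc_info = sys.exc_info()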

This showed that MarkItDown uses pdfminer.six, so I went to look at that repository.

There it was.

The code had apparently been updated two days earlier,
and safe_cmyk seems to have been added for safe type conversion.

However, when I ran the conversion on a failing PDF and checked the arguments actually passed to safe_cmyk, there was only
[0.19]
which is what caused the error.
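The error message can be reproduced without any PDF at all. The sketch below uses a hypothetical stand-in with the same name and the four parameters implied by the error message (`c`, `m`, `y`, `k`). It is not pdfminer.six's actual implementation, and unpacking the operand list at the call site is an assumption about how the function is invoked.

```python
def safe_cmyk(c, m, y, k):
    """Stand-in only: convert four CMYK components to floats."""
    return (float(c), float(m), float(y), float(k))


def call_with_operands(operands):
    # Assumed call pattern: operands are unpacked into positional arguments.
    try:
        return safe_cmyk(*operands)
    except TypeError as exc:
        return f"TypeError: {exc}"


ok = call_with_operands([0.1, 0.2, 0.3, 0.4])  # four components
bad = call_with_operands([0.19])               # a single component, as observed
print(ok)
print(bad)
```

A list with one element fills only the first parameter, so Python reports the remaining three as missing. That matches the idea that the affected PDFs supply fewer colour components than a CMYK operator expects.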

Temporary workaround

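So far this only happens with the latest version of pdfminer.six, so I pinned the versions in

requirements.txt

pdfminer.six==20250327
markitdown[all]==0.1.1

which rolls pdfminer.six back to the previous version. After running pip install -r requirements.txt,
the PDFs that had been failing converted without problems.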
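Since a deployment package can silently pick up a newer release if the pin is ever dropped, it may be worth checking the installed version at startup. This is a minimal sketch, assuming the distribution name `pdfminer.six` and the version pinned in requirements.txt; `installed_version` and `matches_pin` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional

PINNED_PDFMINER = "20250327"  # the version pinned in requirements.txt


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """True only when the package is installed at exactly the pinned version."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    version = installed_version("pdfminer.six")
    if not matches_pin(version, PINNED_PDFMINER):
        print(f"warning: pdfminer.six is {version}, expected {PINNED_PDFMINER}")
```

Keeping the comparison in a small pure function makes it easy to test without the package installed.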

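This only affects some PDF files,
but I am leaving this here as a note in case others run into the same problem.

Re-saving an affected PDF under a new name from Preview on macOS also made the error go away, so the cause may lie in how the PDF was produced.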
