
[Wagtail] Getting a list of image URLs from RichTextField HTML

Posted at 2022-11-14

Conclusion

I wanted to collect the src attribute of every img tag in the RichTextField HTML stored in the DB, using as few queries as possible.

libs.py

from typing import Dict, List, Optional

from wagtail.core.rich_text.rewriters import FIND_EMBED_TAG, extract_attrs
from wagtail.images.formats import get_image_format
from wagtail.images.models import Rendition

def get_embed_tag_attrs(html: str) -> List[dict]:
    """Partially excerpted from wagtail.core.rich_text.rewriters.EmbedRewriter."""
    match = FIND_EMBED_TAG.findall(html)
    attrs = [extract_attrs(s) for s in match]
    return attrs

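def get_image_url_map(image_ids: Optional[List[int]] = None) -> Dict[tuple, str]:
    """Gather all DB access into a single function.

    Notes
    -----
    - Fetch Rendition records directly, without following the relation from Image records.
    - Read each instance's url attribute rather than the file column of the Rendition record, to obtain a URL that includes the FQDN.
    """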
    renditions = (
        Rendition.objects.all()
        if image_ids is None
        else Rendition.objects.filter(image_id__in=image_ids)
    )
    return {(r.image_id, r.filter_spec): r.url for r in renditions}

def get_image_urls(html: str) -> List[str]:
    """
    Warnings
    --------
    - Does not handle the case where the relevant records do not exist.

    Returns
    -------
    Ex. ["https://xxx.com/media/images/foo.jpg"]
    """
    embed_tag_attrs = get_embed_tag_attrs(html=html)

    image_ids = [int(a["id"]) for a in embed_tag_attrs]
    image_url_map = get_image_url_map(image_ids=image_ids)

    image_urls = []
    for attr in embed_tag_attrs:
        image_format = get_image_format(attr["format"])
        rendition_key = (int(attr["id"]), image_format.filter_spec)
        image_urls.append(image_url_map[rendition_key])

    return image_urls
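
The warning in the docstring above (missing records are not handled) could be addressed along the following lines. This is only a sketch under stated assumptions, not part of the original libs.py: resolve_image_urls, format_filter_specs and the sample data are hypothetical names introduced here, and the format-to-filter-spec mapping is passed in as a plain dict so that the example runs without Wagtail.

```python
from typing import Dict, List, Tuple


def resolve_image_urls(
    embed_tag_attrs: List[Dict[str, str]],
    format_filter_specs: Dict[str, str],
    image_url_map: Dict[Tuple[int, str], str],
) -> List[str]:
    """Like the lookup loop above, but skip anything that cannot be resolved."""
    urls = []
    for attrs in embed_tag_attrs:
        if attrs.get("embedtype") != "image":
            continue  # other embed types need not carry id/format
        filter_spec = format_filter_specs.get(attrs.get("format", ""))
        image_id = attrs.get("id", "")
        if filter_spec is None or not image_id.isdecimal():
            continue  # unknown format name or malformed id
        url = image_url_map.get((int(image_id), filter_spec))
        if url is not None:  # rendition row missing: skip instead of KeyError
            urls.append(url)
    return urls


# Hypothetical sample data for illustration only.
sample_attrs = [
    {"embedtype": "image", "format": "fullwidth", "id": "1", "alt": "foo"},
    {"embedtype": "media", "url": "https://example.com/video"},
    {"embedtype": "image", "format": "fullwidth", "id": "2", "alt": "no rendition"},
]
sample_specs = {"fullwidth": "width-800"}  # assumed sample value
sample_map = {(1, "width-800"): "https://xxx.com/media/images/foo.width-800.jpg"}
resolved = resolve_image_urls(sample_attrs, sample_specs, sample_map)
```

Whether to skip unresolved images silently, log them, or substitute a placeholder URL is a design choice that depends on how the viewer should behave.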

Background

Wagtail is a Django-based CMS.

I run a web media site built with Wagtail as a personal project, and needed this functionality when building a viewer that shows every image contained in a given article.

Details

wagtail = 2.14.2

As a premise, an img tag in a RichTextField is stored as an embed tag.
Several other tags are likewise converted by Wagtail and stored in the DB in an internal representation.

<!-- For an img tag -->
<embed alt="foo" embedtype="image" format="fullwidth" id="1"/>
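
For illustration, the attributes of such an embed tag can be pulled out with two regular expressions. The patterns below are assumptions modelled on what Wagtail's rewriters module does, not a copy of it, and parse_embed_attrs is a hypothetical helper name; the sketch uses only the standard library.

```python
import re
from typing import Dict, List

# Assumed patterns, modelled on (not copied from) Wagtail's rewriters module.
EMBED_TAG = re.compile(r"<embed(\b[^>]*)/>")
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_embed_attrs(html: str) -> List[Dict[str, str]]:
    """Return one attribute dict per self-closing <embed .../> tag in html."""
    return [
        dict(ATTR.findall(attr_string))
        for attr_string in EMBED_TAG.findall(html)
    ]


html = '<p>intro</p><embed alt="foo" embedtype="image" format="fullwidth" id="1"/>'
attrs = parse_embed_attrs(html)
```

Here attrs holds a single dict with the keys alt, embedtype, format and id, all as strings, which is why the id is converted with int() before being used as a lookup key.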

To convert this internal representation into proper HTML, the following function can be used.

from wagtail.core.rich_text import expand_db_html

expand_db_html(html)

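However, for each img tag, expand_db_html internally issues two queries: one to fetch the Image record and one to fetch the Rendition record.
The number of queries therefore grows in proportion to the number of images in an article, which degrades performance, so this needed to be addressed.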

wagtail.images.rich_text.__init__.py

class ImageEmbedHandler(EmbedHandler):
    identifier = 'image'

    @staticmethod
    def get_model():
        return get_image_model()

    @classmethod
    def expand_db_attributes(cls, attrs):
        """
        Given a dict of attributes from the <embed> tag, return the real HTML
        representation for use on the front-end.
        """
        try:
            image = cls.get_instance(attrs)  # fetches the Image record
        except ObjectDoesNotExist:
            return '<img alt="">'

        image_format = get_image_format(attrs['format'])
        return image_format.image_to_html(image, attrs.get('alt', ''))  # fetches the Rendition record
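
The difference in query count can be illustrated without Wagtail. The sketch below is an assumption-laden toy: it uses sqlite3 with two simplified stand-in tables (not Wagtail's real schema), and CountingDB, make_db, urls_one_by_one and urls_batched are hypothetical names. It compares two lookups per image against a single IN query followed by in-memory lookups.

```python
import sqlite3
from typing import Dict, List, Tuple


class CountingDB:
    """Thin wrapper that counts how many SELECT statements are issued."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_images: int) -> CountingDB:
    # Simplified stand-in tables; NOT Wagtail's real schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE rendition (image_id INTEGER, filter_spec TEXT, file TEXT)")
    for i in range(1, n_images + 1):
        conn.execute("INSERT INTO image VALUES (?, ?)", (i, f"img{i}"))
        conn.execute(
            "INSERT INTO rendition VALUES (?, ?, ?)",
            (i, "width-800", f"images/img{i}.width-800.jpg"),
        )
    return CountingDB(conn)


def urls_one_by_one(db: CountingDB, image_ids: List[int]) -> List[str]:
    """Two queries per image: one for the image row, one for its rendition."""
    urls = []
    for image_id in image_ids:
        (found_id,) = db.query("SELECT id FROM image WHERE id = ?", (image_id,))[0]
        (file,) = db.query(
            "SELECT file FROM rendition WHERE image_id = ? AND filter_spec = ?",
            (found_id, "width-800"),
        )[0]
        urls.append(file)
    return urls


def urls_batched(db: CountingDB, image_ids: List[int]) -> List[str]:
    """One query in total, then lookups keyed by (image_id, filter_spec)."""
    placeholders = ", ".join("?" for _ in image_ids)
    rows = db.query(
        "SELECT image_id, filter_spec, file FROM rendition "
        f"WHERE image_id IN ({placeholders})",
        tuple(image_ids),
    )
    url_map: Dict[Tuple[int, str], str] = {(i, spec): f for i, spec, f in rows}
    return [url_map[(image_id, "width-800")] for image_id in image_ids]


ids = [1, 2, 3, 4, 5]
slow_db, fast_db = make_db(5), make_db(5)
slow_urls = urls_one_by_one(slow_db, ids)
fast_urls = urls_batched(fast_db, ids)
```

Both functions return the same list, but slow_db.queries grows with the number of images while fast_db.queries stays at one, which is the property get_image_url_map relies on.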

If you know a better approach, I would love to hear it in the comments!
