Loading documents from an RSS feed with LangChain's RSSFeedLoader failed with the error shown below.
Source (rss/test.py)
from langchain_community.document_loaders import RSSFeedLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.inmemory import InMemoryVectorStore
urls = ["https://rss.itmedia.co.jp/rss/2.0/aiplus.xml"]
loader = RSSFeedLoader(urls=urls)
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=400,
    chunk_overlap=0,
    length_function=len,
)
index = VectorstoreIndexCreator(
    vectorstore_cls=InMemoryVectorStore,
    embedding=OpenAIEmbeddings(),
    text_splitter=text_splitter,
).from_loaders([loader])
Error message
Error processing entry https://www.itmedia.co.jp/business/articles/2406/25/news053.html, exception: newspaper package not found, please install it with `pip install newspaper3k`
The same error still appears after installing newspaper3k.
% pip install newspaper3k
% python rss/test.py
Error processing entry https://www.itmedia.co.jp/news/articles/2406/25/news101.html, exception: newspaper package not found, please install it with `pip install newspaper3k`
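The "package not found" message can be misleading here. The loader appears to report any ImportError raised while importing newspaper this way, even when newspaper3k itself is installed and one of its dependencies is what fails. As far as I know, recent lxml releases (5.2 and later) moved lxml.html.clean into the separate lxml_html_clean project, which would explain the behavior. A minimal sketch for surfacing the underlying error is to import the modules directly. The helper name `diagnose_import` is hypothetical and not part of LangChain.

```python
import importlib
import traceback
from typing import Optional


def diagnose_import(module_name: str) -> Optional[str]:
    """Try to import a module.

    Returns None on success, or the full traceback text if the import
    fails with ImportError (hypothetical helper for illustration).
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return traceback.format_exc()
    return None


if __name__ == "__main__":
    # Check the package named in the error and the lxml submodule it is assumed to need.
    for name in ("newspaper", "lxml.html.clean"):
        result = diagnose_import(name)
        print(f"{name}: {'OK' if result is None else 'FAILED'}")
        if result is not None:
            print(result)
```

If the traceback for newspaper points at lxml.html.clean rather than at newspaper itself, the missing piece is the lxml extra, not newspaper3k.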
Installing lxml[html_clean] resolved the error.
Install lxml[html_clean]
% pip install "lxml[html_clean]"
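Error resolved (the remaining output lines are chunk-size warnings from CharacterTextSplitter, not errors)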
% python rss/test.py
Created a chunk of size 850, which is longer than the specified 400
Created a chunk of size 593, which is longer than the specified 400