
Python Advent Calendar 2024

Day 15

Trying Out MarkItDown, Microsoft's Tool for Converting Diverse File Formats to Markdown

Overview

I tried out MarkItDown, which Microsoft has recently released, and I am sharing the code and results here. MarkItDown appears to support converting the file formats listed below. Since preparing files myself would have been tedious, I convert the files used in the library's own tests instead. The code is run on Google Colab.

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)

Source: markitdown/README.md at main · microsoft/markitdown

What is MarkItDown?

MarkItDown is a Python library developed by Microsoft, a utility for converting a variety of file formats to Markdown. This makes it useful for purposes such as indexing and text analysis.

Files can be converted to Markdown with just a few lines of simple code like the following.

from markitdown import MarkItDown

markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)

Source: markitdown/README.md at main · microsoft/markitdown

It also appears to be possible to use a large language model (LLM) to generate descriptions of images.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(mlm_client=client, mlm_model="gpt-4")
result = md.convert("example.jpg")
print(result.text_content)

Source: markitdown/README.md at main · microsoft/markitdown

Test Code and Results

Setup

Clone the markitdown repository and move into its test-files directory.

!git clone https://github.com/microsoft/markitdown.git
import os

# Change into the cloned repo's test-file directory.
# (A `!cd ...` shell command would not persist in Colab, since each `!`
# command runs in its own subshell, so os.chdir is used instead.)
os.chdir('/content/markitdown/tests/test_files')
print(os.getcwd())

# List the files in the current directory
for filename in os.listdir(os.getcwd()):
    print(filename)


Install markitdown and create an instance.

!pip install markitdown -q
from markitdown import MarkItDown

markitdown = MarkItDown()


1. PowerPoint (.pptx)

Convert the following PowerPoint file from the test directory.


Source: markitdown/tests/test_files/test.pptx at main · microsoft/markitdown

Below are the conversion code and its output. Since the output is long, it is shown as a collapsible section.

pptx_file = "test.pptx"

output = markitdown.convert(pptx_file)

print(output.text_content)
Output:
<!-- Slide number: 1 -->
# AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu , Gagan Bansal , Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Awadallah, Ryen W. White, Doug Burger, Chi Wang

<!-- Slide number: 2 -->
# 2cdda5c8-e50e-4db4-b5f0-9722a649f455
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and 04191ea8-5c73-4215-a1d3-1cfb43aaaf12 can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic framework for building diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

![The first page of the AutoGen ArXiv paper.  44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a](Picture4.jpg)

<!-- Slide number: 3 -->
# A table to test parsing:

| ColA | ColB | ColC | ColD | ColE | ColF |
| --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 4 | 5 | 6 |
| 7 | 8 | 9 | 1b92870d-e3b5-4e65-8153-919f4ff45592 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 |

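In my runs I only printed the result, but the converted text can just as easily be written to disk. Below is a minimal sketch; the `save_markdown` helper is my own, not part of MarkItDown:

```python
from pathlib import Path

def save_markdown(text: str, source: str) -> str:
    """Write converted Markdown next to the source file (test.pptx -> test.md)."""
    out = Path(source).with_suffix(".md")
    out.write_text(text, encoding="utf-8")
    return str(out)
```

For example, `save_markdown(output.text_content, pptx_file)` would produce `test.md` alongside the original file.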

2. Word (.docx)

Convert the following Word file from the test directory.


Source: markitdown/tests/test_files/test.docx at main · microsoft/markitdown

Below are the conversion code and its output.

docx_file = "test.docx"

output = markitdown.convert(docx_file)

print(output.text_content)
Output:
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu , Gagan Bansal , Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Awadallah, Ryen W. White, Doug Burger, Chi Wang

# Abstract

AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic framework for building diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

# Introduction

Large language models (LLMs) are becoming a crucial building block in developing powerful agents that utilize LLMs for reasoning, tool usage, and adapting to new observations (Yao et al., 2022; Xi et al., 2023; Wang et al., 2023b) in many real-world tasks. Given the expanding tasks that could benefit from LLMs and the growing task complexity, an intuitive approach to scale up the power of agents is to use multiple agents that cooperate. Prior work suggests that multiple agents can help encourage divergent thinking (Liang et al., 2023), improve factuality and reasoning (Du et al., 2023), and provide validation (Wu et al., 2023).

## d666f1f7-46cb-42bd-9a39-9a39cf2a509f

In light of the intuition and early evidence of promise, it is intriguing to ask the following question: how can we facilitate the development of LLM applications that could span a broad spectrum of domains and complexities based on the multi-agent approach? Our insight is to use multi-agent conversations to achieve it. There are at least three reasons confirming its general feasibility and utility thanks to recent advances in LLMs: First, because chat optimized LLMs (e.g., GPT-4) show the ability to incorporate feedback, LLM agents can cooperate through conversations with each other or human(s), e.g., a dialog where agents provide and seek reasoning, observations, critiques, and validation. Second, because a single LLM can exhibit a broad range of capabilities (especially when configured with the correct prompt and inference settings), conversations between differently configured agents can help combine these broad LLM capabilities in a modular and complementary manner. Third, LLMs have demonstrated ability to solve complex tasks when the tasks are broken into simpler subtasks. Here is a random UUID in the middle of the paragraph! 314b0a30-5b04-470b-b9f7-eed2c2bec74a Multi-agent conversations can enable this partitioning and integration in an intuitive manner. How can we leverage the above insights and support different applications with the common requirement of coordinating multiple agents, potentially backed by LLMs, humans, or tools exhibiting different capacities? We desire a multi-agent conversation framework with generic abstraction and effective implementation that has the flexibility to satisfy different application needs. Achieving this requires addressing two critical questions: (1) How can we design individual agents that are capable, reusable, customizable, and effective in multi-agent collaboration? (2) How can we develop a straightforward, unified interface that can accommodate a wide range of agent conversation patterns? 
In practice, applications of varying complexities may need distinct sets of agents with specific capabilities, and may require different conversation patterns, such as single- or multi-turn dialogs, different human involvement modes, and static vs. dynamic conversation. Moreover, developers may prefer the flexibility to program agent interactions in natural language or code. Failing to adequately address these two questions would limit the framework’s scope of applicability and generality.

Here is a random table for .docx parsing test purposes:

| 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- |
| 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 49e168b7-d2ae-407f-a055-2167576f39a1 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 |


3. Excel (.xlsx)

Convert the following Excel file from the test directory.


Source: markitdown/tests/test_files/test.xlsx at main · microsoft/markitdown

Below are the conversion code and its output.

xlsx_file = "test.xlsx"

output = markitdown.convert(xlsx_file)

print(output.text_content)

Output:
## Sheet1
| Alpha | Beta | Gamma | Delta |
| --- | --- | --- | --- |
| 89 | 82 | 100 | 12 |
| 76 | 89 | 33 | 42 |
| 60 | 84 | 19 | 19 |
| 7 | 69 | 10 | 17 |
| 87 | 89 | 86 | 54 |
| 23 | 4 | 89 | 25 |
| 70 | 84 | 62 | 59 |
| 83 | 37 | 43 | 21 |
| 71 | 15 | 88 | 32 |
| 20 | 62 | 20 | 67 |
| 67 | 18 | 15 | 48 |
| 42 | 5 | 15 | 67 |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 |
| 49 | 93 | 6 | 38 |
| 82 | 28 | 1 | 39 |
| 95 | 55 | 18 | 82 |
| 50 | 46 | 98 | 86 |
| 31 | 46 | 47 | 82 |
| 40 | 65 | 19 | 31 |
| 95 | 65 | 29 | 62 |
| 68 | 57 | 34 | 54 |
| 96 | 66 | 63 | 14 |
| 87 | 93 | 95 | 80 |

## 09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

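The Markdown tables above are also easy to consume programmatically. Here is a small sketch of a parser for the pipe-delimited tables MarkItDown emits; `parse_md_table` is my own helper, not part of the library:

```python
def parse_md_table(md: str) -> list[list[str]]:
    """Parse pipe-delimited Markdown table lines into lists of cell strings."""
    rows = []
    for line in md.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) == {"-"} for c in cells):
            continue  # skip the header separator row (| --- | --- |)
        rows.append(cells)
    return rows
```

Applied to the Excel output, this yields the header row followed by the data rows, ready for further processing.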

4. HTML

Convert the following HTML file from the test directory.

Source: markitdown/README.md at main · microsoft/markitdown

Below are the conversion code and its output.

html_file = "test_blog.html"

output = markitdown.convert(html_file)

print(output.text_content)

Output:
In this experiment, when n > 1, we find the answer with highest votes among all the responses and then select it as the final answer to compare with the ground truth. For example, if n = 5 and 3 of the responses contain a final answer 301 while 2 of the responses contain a final answer 159, we choose 301 as the final answer. This can help with resolving potential errors due to randomness. We use the average accuracy and average inference cost as the metric to evaluate the performance over a dataset. The inference cost of a particular instance is measured by the price per 1K tokens and the number of tokens consumed.

## Experiment Results[​](#experiment-results "Direct link to Experiment Results")

The first figure in this blog post shows the average accuracy and average inference cost of each configuration on the level 2 Algebra test set.

Surprisingly, the tuned gpt-3.5-turbo model is selected as a better model and it vastly outperforms untuned gpt-4 in accuracy (92% vs. 70%) with equal or 2.5 times higher inference budget.
The same observation can be obtained on the level 3 Algebra test set.

![level 3 algebra](/autogen/assets/images/level3algebra-94e87a683ac8832ac7ae6f41f30131a4.png)

However, the selected model changes on level 4 Algebra.

![level 4 algebra](/autogen/assets/images/level4algebra-492beb22490df30d6cc258f061912dcd.png)

This time gpt-4 is selected as the best model. The tuned gpt-4 achieves much higher accuracy (56% vs. 44%) and lower cost than the untuned gpt-4.
On level 5 the result is similar.

![level 5 algebra](/autogen/assets/images/level5algebra-8fba701551334296d08580b4b489fe56.png)

We can see that AutoGen has found different optimal model and inference parameters for each subset of a particular level, which shows that these parameters matter in cost-sensitive LLM applications and need to be carefully tuned or adapted.

An example notebook to run these experiments can be found at: <https://github.com/microsoft/FLAML/blob/v1.2.1/notebook/autogen_chatgpt.ipynb>. The experiments were run when AutoGen was a subpackage in FLAML.

## Analysis and Discussion[​](#analysis-and-discussion "Direct link to Analysis and Discussion")

While gpt-3.5-turbo demonstrates competitive accuracy with voted answers in relatively easy algebra problems under the same inference budget, gpt-4 is a better choice for the most difficult problems. In general, through parameter tuning and model selection, we can identify the opportunity to save the expensive model for more challenging tasks, and improve the overall effectiveness of a budget-constrained system.

There are many other alternative ways of solving math problems, which we have not covered in this blog post. When there are choices beyond the inference parameters, they can be generally tuned via [`flaml.tune`](https://microsoft.github.io/FLAML/docs/Use-Cases/Tune-User-Defined-Function).

The need for model selection, parameter tuning and cost saving is not specific to the math problems. The [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT) project is an example where high cost can easily prevent a generic complex task to be accomplished as it needs many LLM inference calls.

## For Further Reading[​](#for-further-reading "Direct link to For Further Reading")

* [Research paper about the tuning technique](https://arxiv.org/abs/2303.04673)
* [Documentation about inference tuning](/autogen/docs/Use-Cases/enhanced_inference)

*Do you have any experience to share about LLM applications? Do you like to see more support or research of LLM optimization or automation? Please join our [Discord](https://discord.gg/pAbnFJrkgZ) server for discussion.*

**Tags:**

* [LLM](/autogen/blog/tags/llm)
* [GPT](/autogen/blog/tags/gpt)
* [research](/autogen/blog/tags/research)
[Newer PostAchieve More, Pay Less - Use GPT-4 Smartly](/autogen/blog/2023/05/18/GPT-adaptive-humaneval)

* [Experiment Setup](#experiment-setup)
* [Experiment Results](#experiment-results)
* [Analysis and Discussion](#analysis-and-discussion)
* [For Further Reading](#for-further-reading)
Community

* [Discord](https://discord.gg/pAbnFJrkgZ)
* [Twitter](https://twitter.com/pyautogen)
Copyright © 2024 AutoGen Authors | [Privacy and Cookies](https://go.microsoft.com/fwlink/?LinkId=521839)


5. Images (EXIF metadata)

Convert the following image file from the test directory.


Source: markitdown/tests/test_files/test.jpg at main · microsoft/markitdown

First, install exiftool.

!sudo apt-get update -qq
!sudo apt-get install -y -qq libimage-exiftool-perl


Below are the conversion code and its output.

image_file = "test.jpg"

result = markitdown.convert(image_file)

print(result.text_content)
ImageSize: 1615x1967
Title: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Description: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left) AutoGen agents are conversable, customizable, and can be based on LLMs, tools, humans, or even a combination of them. (Top-middle) Agents can converse to solve tasks. (Right) They can form a chat, potentially with humans in the loop. (Bottom-middle) The framework supports flexible conversation patterns
Author: AutoGen Authors
DateTimeOriginal: 2024:03:14 22:10:00
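The output is a flat list of `Key: Value` lines, so it can be turned into a dict for downstream use with a few lines. The `parse_exif_text` helper below is my own, not part of MarkItDown:

```python
def parse_exif_text(text: str) -> dict[str, str]:
    """Split 'Key: Value' lines, like the EXIF output above, into a dict."""
    fields = {}
    for line in text.splitlines():
        # partition on ': ' so values containing ':' (e.g. timestamps) survive
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```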


Summary

MarkItDown, provided by Microsoft, is a powerful Python library that can convert a wide variety of file formats to Markdown. In this article, I put it to the test using the library's own test files.

  1. For PowerPoint (.pptx) and Word (.docx) files, text, tables, and slide contents were output as Markdown with high fidelity.
  2. For Excel (.xlsx) files, sheet contents were extracted as Markdown tables, representing the data structure in a visually clear way.
  3. For HTML files, the article content and image links were converted to Markdown appropriately, which is useful for summarizing and analyzing web pages.
  4. For image files, EXIF metadata was output in Markdown form.

Because MarkItDown handles diverse formats with minimal code, it is very useful for text analysis, indexing, and automating report generation. Integration with large language models (LLMs) also enables more advanced processing.

Unifying content in Markdown makes information easier to reuse and share, and promises more efficient data analysis and content management. As MarkItDown gains features and integrations with other tools, it should find use in an even wider range of scenarios.
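As a closing sketch, here is one way a batch conversion over a directory might be planned. `plan_outputs` and the `SUPPORTED` set are my own illustration, covering only the extensions verified above:

```python
from pathlib import PurePath

# Extensions verified in this article (a subset of what the README lists).
SUPPORTED = {".pptx", ".docx", ".xlsx", ".html", ".jpg"}

def plan_outputs(filenames: list[str]) -> dict[str, str]:
    """Map each supported input file to the .md file it would produce."""
    return {
        name: str(PurePath(name).with_suffix(".md"))
        for name in filenames
        if PurePath(name).suffix.lower() in SUPPORTED
    }
```

Each planned input could then be passed to `markitdown.convert` and the resulting `text_content` written to the corresponding output path.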
