Building a Situational English Conversation Practice App with crewAI, Llama 3, DALL-E, and Gemini Pro Vision

Last updated at 2024-05-07, posted at 2024-04-26

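Overview

This is the third article in a three-part series.
It combines the pieces prepared in the previous two articles into a finished situational English conversation practice app.
The previous articles are:
- Build a free English conversation practice app with Llama 3 for unlimited role-play
- Are three LLMs better than one⁉️ Combining Gemini, DALL-E, and Llama with crewAI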

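Designing the tasks for crewAI

To generate the setting for an English conversation, I broke the work down into the following tasks for the AI agents.

- Create an avatar image related to a given keyword
- Derive the avatar's persona from the generated image
- From the persona, write the first question that opens the conversation with the user

crewAI runs these tasks, and the resulting situation settings drive an app in which the user talks with an AI English conversation avatar powered by Llama 3.

Let's build each part in turn.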
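As a rough picture of how the three tasks feed into one another, here is a minimal sketch in plain Python. The `Situation` dataclass, the `build_situation` function, and the three stand-in lambdas are hypothetical names used only for illustration; in the article itself, each step is carried out by a crewAI agent rather than by a plain function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Situation:
    """Everything the chat app needs to start a conversation (hypothetical container)."""
    location: str
    image_path: str
    profile: str
    first_question: str


def build_situation(
    location: str,
    make_image: Callable[[str], str],
    make_profile: Callable[[str], str],
    make_question: Callable[[str], str],
) -> Situation:
    """Run the three steps in order, feeding each result into the next."""
    image_path = make_image(location)        # step 1: keyword -> avatar image file
    profile = make_profile(image_path)       # step 2: image -> persona
    first_question = make_question(profile)  # step 3: persona -> opening question
    return Situation(location, image_path, profile, first_question)


# Stand-in steps so the sketch runs without any API key.
demo = build_situation(
    "Beach",
    make_image=lambda loc: f"avatar-{loc.lower()}.png",
    make_profile=lambda path: f"A person pictured in {path}",
    make_question=lambda profile: f"{profile}. What brings you here?",
)
print(demo)
```

Each step only needs the output of the step before it, which is why a sequential process is a natural fit for these tasks.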

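Having DALL-E generate the conversation avatar image

As the first task, we have DALL-E create an AI avatar image for the conversation scenario.

The prompt for generating the image is also derived from a keyword we supply.

The keyword names a place, since the place is the main factor that determines the conversation scenario.

For example, given "New York", the agent comes up with a prompt such as "a young Asian woman standing in an office" for the avatar who will be the conversation partner, and DALL-E generates the image from that prompt.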
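To make the shape of such a prompt concrete, the sketch below assembles one from the five attributes the agent is asked to include (age, region, gender, behaviour, scene). The helper `build_avatar_prompt` is hypothetical and only illustrates the structure; in the article the LLM agent chooses the wording itself.

```python
def build_avatar_prompt(age: str, region: str, gender: str,
                        behaviour: str, scene: str) -> str:
    """Compose an image prompt from the five attributes (illustrative helper)."""
    # Pick the article that matches the first letter of the age word.
    article = "An" if age[0].lower() in "aeiou" else "A"
    return f"{article} {age} {region} {gender} person {behaviour} at {scene}."


example_prompt = build_avatar_prompt("young", "asian", "female", "standing", "an office")
print(example_prompt)  # A young asian female person standing at an office.
```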

# This program requires the packages below:
# pip install 'crewai[tools]'

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
from langchain.tools import BaseTool, StructuredTool, tool
from langchain.pydantic_v1 import BaseModel, Field
import openai

client = openai.OpenAI()

class ImageInput(BaseModel):
    prompt: str = Field(description='''
        The prompt for the image generation like
        "A young asian female person standing at an office".''')

def generate_image(prompt:str) -> str:
    '''Generate an image for the prompt, and return the filename of the image.'''

    print("Generating image ... for the prompt: ", prompt)

    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1
    )

    image_url = response.data[0].url
    print(image_url)

    # download the image
    import requests
    import shutil
    from datetime import datetime

    response = requests.get(image_url, stream=True)
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    filename = f'output-{timestamp}.png'

    with open(filename, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)

    print(f"Image saved to {filename}")
    return filename

image_generation = StructuredTool.from_function(
    func=generate_image,
    name="image_generation",
    description='''generate an image for the prompt 
        and return the filename of the image.''',
    args_schema=ImageInput,
    return_direct=True,
)

gpt4 = ChatOpenAI(model="gpt-4")

# Ask the place to go.
location = input('''Where do you want to go today? 
   (like Hawaii, Tokyo, or Silicon Valley, etc.) ''')

avator_maker_agent = Agent(
    role = 'Avator Maker',
    goal = '''Create a prompt to generate an image of a person who 
        talks with the user in English.
        The prompt must include the age of the person like "young" or "old",
        the region of the person like "asian", "african", "european", and "indian", etc.,
        the gender of the person like "female" or "male", etc.,
        the behaviour of the person like "standing", "smiling", "angry", "sad", and "surprised", etc.,
        and the scene of the image like "an office", "a beach", "a city", "a forest" and "a mountain", etc.
        For example, "A young asian female person standing at an office.".
        And the prompt should be appropriate for the location provided by the user.
        The agent must generate the image for the prompt.''',
    backstory = 'You are an avator maker who creates an image of a person who talks with the user in English.',
    allow_delegation = False,
    verbose = True,
    llm = gpt4,
    tools = [image_generation],
    )


image_generation_task = Task (
    description = f'Create an image of a person at {location}.',
    expected_output = 'A filename of the image.',
    agent = avator_maker_agent,
    human_input = False,
    )

crew = Crew(
  agents = [avator_maker_agent],
  tasks = [image_generation_task],
  process = 'sequential',
  verbose = 2
  )

result = crew.kickoff()
print('####################')
print(result)

Example runs

Here are some results from running the program above.

Example 1:

Location given: Silicon Valley
Prompt chosen by GPT-4: "A young asian male person standing and smiling in Silicon Valley."
Image generated by DALL-E:

Example 2:

Location given: New York
Prompt chosen by GPT-4: "A young European male standing and smiling in a city like NY."
Image generated by DALL-E:

In both cases the generated avatar image suits the specified location.

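Having Gemini Pro Vision write the avatar's backstory

An avatar image alone does not give us enough material to start a conversation. So we have a model come up with a persona profile from the avatar image.

For this, the article uses Google Gemini Pro Vision, a multimodal model that accepts images and video as input.

Getting an API key from Google AI Studio

Using Gemini Pro Vision requires an API key, which can be created in Google AI Studio.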
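One common way to avoid pasting the key into the source file is to read it from an environment variable. The sketch below assumes the key has been exported under the name `GOOGLE_API_KEY`; both that variable name and the helper `load_api_key` are assumptions made for illustration, not something the article's code requires.

```python
import os
from typing import Mapping, Optional


def load_api_key(name: str = "GOOGLE_API_KEY",
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key stored under `name`, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# Demonstration with an explicit mapping, so no real key is needed here.
print(load_api_key(env={"GOOGLE_API_KEY": "dummy-key"}))

# In the real program the result would be passed on, for example:
# genai.configure(api_key=load_api_key())
```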

Defining the crewAI tasks and agents

# This program requires the packages below:
# pip install 'crewai[tools]'
# pip install -q -U google-generativeai
# pip install Pillow
#
# https://ai.google.dev/gemini-api/docs/get-started/python?hl=ja

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
from langchain.tools import BaseTool, StructuredTool, tool
from langchain.pydantic_v1 import BaseModel, Field
import google.generativeai as genai
from PIL import Image

GOOGLE_API_KEY = 'your key'

genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro-vision')

def analyze_image(filepath:str, prompt:str) -> str:
    '''Analyze the image at the filepath, and return the analysis
       of the image.'''

    print("Analyzing the image ... : ", filepath)

    img = Image.open(filepath)

    response = model.generate_content([prompt, img])
    response.resolve()

    analyzed_text = response.text

    print(analyzed_text)

    return analyzed_text

class ProfileCreationInput(BaseModel):
    filepath: str = Field(description='''The file path of the image
        to be analyzed''')

def create_profile(filepath:str) -> str:
    '''Analyze an image of the filepath, and return the profile of 
       the person in the image.'''

    prompt = '''Create a background story for the person in the picture
        to introduce that person to the user.

        It should include the name of the person, the job, the gender, 
        the age, the nation, etc. based on the situation of the picture.
        '''
    return analyze_image(filepath, prompt)

profile_creation = StructuredTool.from_function(
    func=create_profile,
    name="profile_creation",
    description='''Analyzing an image of the filepath, and return 
        the profile of the person in the image.''',
    args_schema=ProfileCreationInput,
    return_direct=True,
)

gpt4 = ChatOpenAI(model="gpt-4")

filepath = 'avator.png'

scenario_writer_agent = Agent(
    role = 'Scenario Writer',
    goal = '''Create an attractive story of the person in the picture 
        who talks with the user in English,
        based on the specification of the task.''',
    backstory = '''You are a creative scenario writer who creates 
        an attractive story of a person from the picture to start
        the conversation with the user.''',
    allow_delegation = False,
    verbose = True,
    llm = gpt4,
    )

profile_creation_task = Task (
    description = f'''
        Create a profile of the person in the picture to introduce 
        that person to the user.
        the picture is saved on the filepath: '{filepath}'.
        ''',
    expected_output = 'the profile of the person',
    agent = scenario_writer_agent,
    tools = [profile_creation],
    human_input = False,
    )

question_creation_task = Task (
    description = f'''
        Create a question from the person in the picture to start 
        the conversation with the user,
        based on the situation of the picture and the profile of 
        the person generated by the previous task.
        ''',
    expected_output = '''the self introduction of the person in 
        the picture, and the first question''',
    agent = scenario_writer_agent,
    human_input = False,
    )

crew = Crew(
  agents = [scenario_writer_agent],
  tasks = [profile_creation_task, question_creation_task],
  process = 'sequential',
  verbose = 2
  )

result = crew.kickoff()

print('####################')
print('The Avator Profile:')
print(profile_creation_task.output.raw_output)

print('####################')
print('The First Question:')
print(question_creation_task.output.raw_output)

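Results

Now let's generate a profile and a first question from the avatar image below, which was created with the method described earlier.

First, here is the profile that Gemini Pro Vision came up with after analyzing the avatar image.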

####################
The Avator Profile:
My name is Jessica Tan. I am 30 years old and I am the CEO 
of a successful tech company. I am originally from Singapore, 
but I have lived in the United States for most of my life. 
I am a hard worker and I am always looking for new challenges. 
I am passionate about technology and I believe that 
it has the power to change the world. I am also a strong 
advocate for diversity and inclusion in the tech industry. 
I believe that everyone should have the opportunity 
to succeed, regardless of their background.

I am a role model for many young women who are interested in 
pursuing careers in technology. I am proof that it is possible 
to achieve your dreams if you work hard and never give up. 
I am also a strong advocate for diversity and inclusion 
in the tech industry. I believe that everyone should have 
the opportunity to succeed, regardless of their background.

Next is the first question, written by GPT-4 from the profile above, to open the conversation with the user.

####################
The First Question:
Hi there, I'm Jessica Tan, a 30-year-old tech CEO 
who's originally from Singapore but spent most of 
my life in the United States. I have a passion for 
technology and I believe in its power to bring about 
significant change in the world. As an advocate for 
diversity and inclusion in the tech industry, 
I strive to create opportunities for everyone to succeed, 
regardless of their background. I'm proud to be a role model 
for young women interested in technology. 
I stand as proof that with hard work and perseverance, 
achieving your dreams is possible. Now, I'd love to know 
your thoughts. How do you think we can further promote 
diversity and inclusion in the tech industry?

That is quite a lot of information, and more than enough to start a conversation.

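Finally: integrating the agents and tasks and adding a chat-style UI

Finally, let's integrate the agents and tasks built so far.

The program is split into two parts.

The first is the situation generator. It combines GPT-4, DALL-E, and Gemini Pro Vision as described in this article, and saves the conversation avatar's image, persona, first question, and so on to a situation settings file.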

# This program requires the following packages:
# $ pip install 'crewai[tools]'
# $ pip install google-generativeai
# $ pip install Pillow
# $ pip install pyyaml

from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI
from langchain.tools import StructuredTool
from langchain.pydantic_v1 import BaseModel, Field
import openai
import google.generativeai as genai
from PIL import Image
import yaml

client = openai.OpenAI()
gpt4 = ChatOpenAI(model="gpt-4")

GOOGLE_API_KEY = 'your key'

genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro-vision')

#--------------------------------------------

# Ask the place to go.
location = input('''Where do you want to go today? 
    (like Hawaii, Tokyo, or Silicon Valley, etc.) ''')

#--------------------------------------------

class ImageInput(BaseModel):
    prompt: str = Field(description='''The prompt for the image 
        generation like "A young asian female person standing at an office".''')

def generate_image(prompt:str) -> str:
    '''Generate an image for the prompt, and return the filename of the image.'''

    print("Generating image ... for the prompt: ", prompt)

    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1
    )

    image_url = response.data[0].url
    print(image_url)

    # download the image
    import requests
    import shutil
    from datetime import datetime

    response = requests.get(image_url, stream=True)
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    filename = f'output-{timestamp}.png'

    with open(filename, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)

    print(f"Image saved to {filename}")
    return filename

image_generation = StructuredTool.from_function(
    func=generate_image,
    name="image_generation",
    description='''generate an image for the prompt and return the 
        filename of the image.''',
    args_schema=ImageInput,
    return_direct=True,
)

#--------------------------------------------

def analyze_image(filepath:str, prompt:str) -> str:
    '''Analyze the image at the filepath, and return the analysis of the image.'''

    print("Analyzing the image ... : ", filepath)

    img = Image.open(filepath)

    response = model.generate_content([prompt, img])
    response.resolve()

    analyzed_text = response.text

    print(analyzed_text)

    return analyzed_text

class ProfileCreationInput(BaseModel):
    filepath: str = Field(description='''The file path of the image 
        to be analyzed''')

def create_profile(filepath:str) -> str:
    '''Analyze an image of the filepath, and return the profile of 
       the person in the image.'''

    prompt = '''Create a background story for the person in the picture
        to introduce that person to the user.

        It should include the name of the person, the job, the gender, 
        the age, the nation, etc. based on the situation of the picture.
        '''
    return analyze_image(filepath, prompt)

profile_creation = StructuredTool.from_function(
    func=create_profile,
    name="profile_creation",
    description='''Analyzing an image of the filepath, 
        and return the profile of the person in the image.''',
    args_schema=ProfileCreationInput,
    return_direct=True,
)

#--------------------------------------------

avator_maker_agent = Agent(
    role = 'Avator Maker',
    goal = '''Create a prompt to generate an image of a person who 
        talks with the user in English.
        The prompt must include the age of the person like "young" or "old",
        the region of the person like "asian", "african", "european", and "indian", etc.,
        the gender of the person like "female" or "male", etc.,
        the behaviour of the person like "standing", "smiling", "angry", "sad", and "surprised", etc.,
        and the scene of the image like "an office", "a beach", "a city", "a forest" and "a mountain", etc.
        For example, "A young asian female person standing at an office.".
        And the prompt should be appropriate for the location provided by the user.
        The agent must generate the image for the prompt.''',
    backstory = '''You are an avator maker who creates an image of a 
        person who talks with the user in English.''',
    allow_delegation = False,
    verbose = True,
    llm = gpt4,
    tools = [image_generation],
    )

image_generation_task = Task (
    description = f'Create an image of a person at {location}.',
    expected_output = 'A filename of the image.',
    agent = avator_maker_agent,
    human_input = False,
    )

crew = Crew(
  agents = [avator_maker_agent],
  tasks = [image_generation_task],
  process = 'sequential',
  verbose = 2
  )

filepath = crew.kickoff()
print('####################')
print(filepath)

#--------------------------------------------

scenario_writer_agent = Agent(
    role = 'Scenario Writer',
    goal = '''Create an attractive story of the person in the picture 
        who talks with the user in English,
        based on the specification of the task.''',
    backstory = '''You are a creative scenario writer who creates 
        an attractive story of a person from the picture to start 
        the conversation with the user.''',
    allow_delegation = False,
    verbose = True,
    llm = gpt4,
    )

profile_creation_task = Task (
    description = f'''
        Create a profile of the person in the picture to introduce 
        that person to the user.
        the picture is saved on the filepath: '{filepath}'.
        ''',
    expected_output = 'the profile of the person',
    agent = scenario_writer_agent,
    tools = [profile_creation],
    human_input = False,
    )

question_creation_task = Task (
    description = f'''
        Create a question from the person in the picture to start 
        the conversation with the user,
        based on the situation of the picture and the profile of 
        the person generated by the previous task.
        ''',
    expected_output = '''the self introduction of the person in 
        the picture, and the first question''',
    agent = scenario_writer_agent,
    human_input = False,
    )

crew = Crew(
  agents = [scenario_writer_agent],
  tasks = [profile_creation_task, question_creation_task],
  process = 'sequential',
  verbose = 2
  )

result = crew.kickoff()

avator_profile = profile_creation_task.output.raw_output
first_question = question_creation_task.output.raw_output

print('####################')
print('The Avator Profile:')
print(avator_profile)

print('####################')
print('The First Question:')
print(first_question)

#--------------------------------------------

# Save the situation (location, filepath, avator_profile, first_question) to the file

situation = {
    "location": location,
    "filepath": filepath,
    "avator_profile": avator_profile,
    "first_question": first_question
}

with open('situation.yaml', 'w') as file:
    yaml.safe_dump(situation, file)
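Because the chat program depends entirely on this file, it can be worth checking that every field was actually filled in before using it. The following is a minimal sketch of such a check on the dictionary itself; the helper `missing_situation_keys` is hypothetical, and the sample values are made up for the demonstration.

```python
REQUIRED_KEYS = ("location", "filepath", "avator_profile", "first_question")


def missing_situation_keys(situation: dict) -> list:
    """Return the required keys that are absent or empty, in a fixed order."""
    return [k for k in REQUIRED_KEYS if not str(situation.get(k) or "").strip()]


# Made-up sample: the profile is empty and the first question is missing.
sample = {
    "location": "Beach",
    "filepath": "output-20240426-120000.png",
    "avator_profile": "",
}
print(missing_situation_keys(sample))  # ['avator_profile', 'first_question']
```

An empty result means the situation is complete; anything else names the fields that one of the agents failed to produce.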

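The second program loads the situation settings file and launches a chat UI for the AI English conversation avatar according to those settings.

It is almost the same as what the first article in this series covered: running Llama 3 locally with Ollama and adding a chat-style UI with Streamlit.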

# This program requires the following packages:
# $ pip install ollama
# $ pip install streamlit
# $ pip install pyyaml
# To run this program, type the command below:
# $ streamlit run situation-chat.py

import ollama
import streamlit as st
import yaml

with open("situation.yaml", "r") as file:
    situation = yaml.safe_load(file)
    location = situation["location"]
    filepath = situation["filepath"]
    avator_profile = situation["avator_profile"]
    first_question = situation["first_question"]

#--------------------------------------------

st.title("Your Personal English Coach")

# Add a header image
header_image = filepath
st.image(header_image, caption='Situation', use_column_width=True)

if "messages" not in st.session_state:
    # Seed the history with the persona (system) and the opening question (assistant).
    st.session_state["messages"] = [
        {"role": "system", "content": avator_profile},
        {"role": "assistant", "content": first_question},
    ]

### Write Message History
for msg in st.session_state.messages:
    if msg["role"] == "system":
        continue  # the persona is for the model only, so it is not displayed
    if msg["role"] == "user":
        st.chat_message(msg["role"], avatar="🧑‍💻").write(msg["content"])
    else:
        st.chat_message(msg["role"], avatar="😃").write(msg["content"])

## Generator for Streaming Tokens
def generate_response():
    response = ollama.chat(model='llama3', stream=True, messages=st.session_state.messages)
    for partial_resp in response:
        token = partial_resp["message"]["content"]
        st.session_state["full_message"] += token
        yield token

if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.chat_message("user", avatar="🧑‍💻").write(prompt)
    st.session_state["full_message"] = ""
    st.chat_message("assistant", avatar="😃").write_stream(generate_response)
    st.session_state.messages.append({"role": "assistant", "content": st.session_state["full_message"]})
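The conversation history handed to the chat model is a list of `{"role", "content"}` dictionaries. As a small illustration of how the generated persona and opening question could seed that list, here is a sketch in plain Python. The helper `initial_messages` and the sample strings are hypothetical; treating the profile as a `system` message is one reasonable way to keep the avatar in character.

```python
def initial_messages(avatar_profile: str, first_question: str) -> list:
    """Build the starting history: persona as system, opening line as assistant."""
    return [
        {"role": "system", "content": avatar_profile},
        {"role": "assistant", "content": first_question},
    ]


history = initial_messages(
    "You are Sarah, a model from California.",
    "Hi there! What's the most adventurous thing you've ever done?",
)
# Each user turn is appended before the model is asked for its next reply.
history.append({"role": "user", "content": "I once went scuba diving."})
print([m["role"] for m in history])  # ['system', 'assistant', 'user']
```

In a chat UI the `system` entry would normally be hidden from the user and only the `assistant` and `user` turns rendered.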

Done!

Below is an example run with the location set to "Beach".

Here is the persona the AI agents generated.

Hi, I am Sarah. I am 25 years old. I am a model. 
I am from California, USA. I love to travel and 
I have been to many countries. I am always up for
an adventure and I love to meet new people. 
I am a very outgoing and friendly person and I love to have fun. 
I am always looking for new things to do and 
I am always up for a challenge. 
I am a very determined person and I never give up on my dreams. 
I am always looking for ways to improve myself and 
I am always striving to be the best that I can be.

And here is the first question the AI agents generated.

"Hi there, I'm Sarah, a model from sunny California. 
I love traveling and meeting new people - 
it's one of the ways I challenge myself to keep growing. 
I'm curious, what's the most adventurous thing 
you've ever done or the most amazing place you've ever been? 
Maybe I can add it to my bucket list!"

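Impressive! This is practical enough for practicing situational English conversation.

Summary

So, what did you think?

The initial situation generation uses GPT-4, DALL-E, and Gemini Pro Vision, but after that you can enjoy as much situational English conversation as you like, for free, with an AI avatar powered by Llama 3.

The article turned out somewhat long for a blog post, but the final programs come to fewer than 300 lines in total. Isn't that a little startling?

What kind of AI application would you like to build?

Bonus

The program built in this article creates an English conversation scenario from nothing more than a place name, so it is quite fun to play with. Here are a few example runs; please try running and modifying it yourself.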

Beach:

Norway:
