Integrating Azure API Management with Azure OpenAI

Posted at 2025-07-14

I tried importing an Azure AI Foundry API into Azure API Management. For the most part, I followed the content below.

Steps

1. Prerequisites

An Azure API Management resource has been created at the Developer pricing tier.
An Azure AI Foundry resource has been created, with the gpt-4.1-nano model deployed.

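2. Authentication for Azure API Management

2.1. Create a managed identity

Create a system-assigned managed identity for Azure API Management. All it takes is opening Security -> Managed identities in the menu and switching it to On on the System assigned tab.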

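2.2. Add a role assignment

On the Azure AI Foundry resource (not the Azure AI Foundry project), add a "Cognitive Services Contributor" role assignment for the managed identity created in the previous step.
Note: I have not confirmed this, but at some point a "Cognitive Services OpenAI User" role assignment had been added without my doing it. The import process in the later step may automatically add "Cognitive Services OpenAI User", which is the minimum role required.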

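3. Add the API

3.1. Add the API

In Azure API Management, go to APIs -> APIs in the menu and add Azure AI Foundry. Select Azure AI Foundry even if you intend to use Azure OpenAI Service. On the next screen, select the target Azure AI Service (screenshot omitted).

Configure API screen. Base Path is the path segment appended to the API Management URL when calling the API. It is not the path on the Azure AI Service side.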
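To make the role of Base Path concrete, here is a minimal sketch of how the request URL is probably assembled. It assumes the URL layout that the openai library's AzureOpenAI client normally uses (/openai/deployments/<deployment>/chat/completions). The host example-apim.azure-api.net and the helper name build_chat_url are hypothetical.

```python
from urllib.parse import urlencode


def build_chat_url(apim_host: str, base_path: str, deployment: str, api_version: str) -> str:
    """Compose the chat-completions URL exposed through API Management.

    Assumption: the gateway keeps the Azure OpenAI style path
    (/openai/deployments/<deployment>/chat/completions) after the Base Path.
    """
    base = f"https://{apim_host}/{base_path.strip('/')}"
    path = f"/openai/deployments/{deployment}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


if __name__ == "__main__":
    # Hypothetical host; "my-foundry-api" stands for the Base Path value.
    print(build_chat_url("example-apim.azure-api.net", "my-foundry-api",
                         "gpt-4.1-nano", "2024-12-01-preview"))
```

The Base Path only appears on the API Management side of the URL; the backend path on the Azure AI Service is resolved by the gateway.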

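Manage token consumption screen. This is where you configure token limits.

Tokens per minute (TPM): the number of tokens allowed per minute.
Token quota and Token quota period: the number of tokens allowed per chosen period (Hourly, Daily, Weekly, Monthly, Yearly).
Limit by: the scope the token limit applies to (Subscription / IP address).
Track token usage: when turned ON, you can choose dimensions, which are added as custom dimensions usable in Application Insights.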
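The interplay between the per-minute limit and the longer-period quota can be sketched with a toy model. This is only an illustration under my own assumptions (fixed windows, token counts known up front, the hypothetical class name TokenLimiter); it does not claim to reproduce how API Management counts tokens internally.

```python
import time
from collections import defaultdict
from typing import Optional


class TokenLimiter:
    """Toy model of a per-key TPM limit combined with a token quota."""

    def __init__(self, tokens_per_minute: int, token_quota: int, quota_period_seconds: int):
        self.tokens_per_minute = tokens_per_minute
        self.token_quota = token_quota
        self.quota_period_seconds = quota_period_seconds
        # counter key -> [window start time, tokens used in that window]
        self._minute = defaultdict(lambda: [0.0, 0])
        self._quota = defaultdict(lambda: [0.0, 0])

    @staticmethod
    def _window(store, key, length, now):
        # Start a fresh window once the previous one has expired.
        if now - store[key][0] >= length:
            store[key] = [now, 0]
        return store[key]

    def check(self, key: str, tokens: int, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        minute = self._window(self._minute, key, 60, now)
        quota = self._window(self._quota, key, self.quota_period_seconds, now)
        if quota[1] + tokens > self.token_quota:
            return "quota_exceeded"
        if minute[1] + tokens > self.tokens_per_minute:
            return "rate_limited"
        minute[1] += tokens
        quota[1] += tokens
        return "ok"


if __name__ == "__main__":
    # Values taken from the settings used in this post: 500 TPM, 1000 tokens per hour.
    limiter = TokenLimiter(tokens_per_minute=500, token_quota=1000, quota_period_seconds=3600)
    print(limiter.check("subscription-a", 400, now=0))
    print(limiter.check("subscription-a", 200, now=10))
```

The key argument plays the role of Limit by: using a subscription ID or an IP address as the key gives each caller its own counters.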

In Application Insights, the data is captured like this.

I finished without configuring Apply semantic caching or Set up AI content safety.
You can review the policies by clicking base (or another entry) under Inbound processing.

policies

<policies>
    <inbound>
        <base />
        <set-backend-service id="apim-generated-policy" backend-id="test-ai-foundry-ai-endpoint" />
        <llm-emit-token-metric>
            <dimension name="API ID" />
            <dimension name="User ID" />
            <dimension name="Client IP address" value="@(context.Request.IpAddress)" />
            <dimension name="Gateway ID" />
            <dimension name="Location" />
            <dimension name="Operation ID" />
            <dimension name="Product ID" />
        </llm-emit-token-metric>
        <llm-token-limit remaining-quota-tokens-header-name="remaining-tokens" remaining-tokens-header-name="remaining-tokens" tokens-per-minute="500" token-quota="1000" token-quota-period="Hourly" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" tokens-consumed-header-name="consumed-tokens" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
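If you export the policy, a quick way to double-check the generated limits is to parse the XML. The sketch below uses only the standard library and a trimmed copy of the policy above; the function name summarize_policy is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the generated policy shown above.
POLICY = """<policies>
    <inbound>
        <base />
        <llm-emit-token-metric>
            <dimension name="API ID" />
            <dimension name="Client IP address" value="@(context.Request.IpAddress)" />
        </llm-emit-token-metric>
        <llm-token-limit tokens-per-minute="500" token-quota="1000" token-quota-period="Hourly" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" />
    </inbound>
</policies>"""


def summarize_policy(xml_text: str) -> dict:
    """Extract the token-limit settings and metric dimensions from a policy."""
    root = ET.fromstring(xml_text)
    limit = root.find("./inbound/llm-token-limit")
    dimensions = root.findall("./inbound/llm-emit-token-metric/dimension")
    return {
        "tokens_per_minute": int(limit.get("tokens-per-minute")),
        "token_quota": int(limit.get("token-quota")),
        "token_quota_period": limit.get("token-quota-period"),
        "counter_key": limit.get("counter-key"),
        "dimensions": [d.get("name") for d in dimensions],
    }


if __name__ == "__main__":
    print(summarize_policy(POLICY))
```

This works here because the generated policy happens to be well-formed XML. Policies whose expressions contain characters such as < or nested double quotes may not parse with a strict XML parser.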

3.2. Test

Test on the Test tab. Select POST Creates a completion for the chat message, select the deployment-id, and click Send.

The result is displayed.

response body

{
    "choices": [{
        "content_filter_results": {
            "hate": {
                "filtered": false,
                "severity": "safe"
            },
            "protected_material_code": {
                "detected": false,
                "filtered": false
            },
            "protected_material_text": {
                "detected": false,
                "filtered": false
            },
            "self_harm": {
                "filtered": false,
                "severity": "safe"
            },
            "sexual": {
                "filtered": false,
                "severity": "safe"
            },
            "violence": {
                "filtered": false,
                "severity": "safe"
            }
        },
        "finish_reason": "stop",
        "index": 0,
        "logprobs": null,
        "message": {
            "annotations": [],
            "content": "I'm doing well, thank you! How can I assist you today?",
            "refusal": null,
            "role": "assistant"
        }
    }],
    "created": 1752413682,
    "id": "chatcmpl-Bsr8kzBMYY1jT6TKJog4vReTFONjH",
    "model": "gpt-4.1-nano-2025-04-14",
    "object": "chat.completion",
    "prompt_filter_results": [{
        "prompt_index": 0,
        "content_filter_results": {
            "hate": {
                "filtered": false,
                "severity": "safe"
            },
            "jailbreak": {
                "detected": false,
                "filtered": false
            },
            "self_harm": {
                "filtered": false,
                "severity": "safe"
            },
            "sexual": {
                "filtered": false,
                "severity": "safe"
            },
            "violence": {
                "filtered": false,
                "severity": "safe"
            }
        }
    }],
    "system_fingerprint": "fp_68472df8fd",
    "usage": {
        "completion_tokens": 15,
        "completion_tokens_details": {
            "accepted_prediction_tokens": 0,
            "audio_tokens": 0,
            "reasoning_tokens": 0,
            "rejected_prediction_tokens": 0
        },
        "prompt_tokens": 20,
        "prompt_tokens_details": {
            "audio_tokens": 0,
            "cached_tokens": 0
        },
        "total_tokens": 35
    }
}
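For reference, the fields that matter most for token management can be pulled out of this body with a few lines of Python. The sketch below uses a trimmed copy of the response above; the function name summarize_response is hypothetical.

```python
import json

# Trimmed copy of the response body shown above.
RESPONSE_BODY = json.loads("""
{
    "choices": [{
        "content_filter_results": {
            "hate": {"filtered": false, "severity": "safe"},
            "violence": {"filtered": false, "severity": "safe"}
        },
        "finish_reason": "stop",
        "message": {
            "content": "I'm doing well, thank you! How can I assist you today?",
            "role": "assistant"
        }
    }],
    "model": "gpt-4.1-nano-2025-04-14",
    "usage": {"completion_tokens": 15, "prompt_tokens": 20, "total_tokens": 35}
}
""")


def summarize_response(body: dict) -> dict:
    """Collect the answer, token usage, and any filtered categories."""
    choice = body["choices"][0]
    filtered = sorted(
        name
        for name, result in choice["content_filter_results"].items()
        if result.get("filtered")
    )
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": body["usage"]["total_tokens"],
        "filtered_categories": filtered,
    }


if __name__ == "__main__":
    print(summarize_response(RESPONSE_BODY))
```

The total_tokens value is the figure to compare against the token quota configured in the policy.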

Clicking the Trace button shows what is happening internally. The following are the log entries for adding the custom metric and the entries related to rate-limit and quota.

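4. Calling from Python

4.1. Python program

I wrote a sample Python program and called the model through API Management.
Note that forcing all calls to go through API Management requires additional measures, such as disabling key-based access to the AI Foundry model or restricting access at the network level (in the current state, the AI Foundry model can still be called directly).
I used Python 3.11 and version 1.64.0 of the openai library.
To get the key, open APIs -> Subscriptions in the API Management menu, select the relevant row, and copy the primary key via "Show/hide keys".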

import pprint

from openai import AzureOpenAI

subscription_key = "<API Management subscription key>"

client = AzureOpenAI(
    api_version="2024-12-01-preview",
    # my-foundry-api is the Base Path value configured in API Management
    azure_endpoint="https://<API Management host>/my-foundry-api",
    api_key=subscription_key,
)

deployment = "gpt-4.1-nano"

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "君の名は？",
        }
    ],
    max_completion_tokens=800,
    temperature=1.0,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    model=deployment
)

print(response.choices[0].message.content)
pprint.pprint(vars(response))
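The llm-token-limit policy above names two response headers, consumed-tokens and remaining-tokens. I have not verified that the gateway returns them on every response, but if it does, they could be read roughly as follows. The helper names token_headers and chat_with_headers are hypothetical; with_raw_response is the openai library's way of exposing the HTTP response alongside the parsed result.

```python
from typing import Mapping, Optional


def token_headers(headers: Mapping[str, str]) -> dict:
    """Pick out the token counters named in the llm-token-limit policy."""

    def as_int(name: str) -> Optional[int]:
        value = headers.get(name)
        return int(value) if value is not None and value.isdigit() else None

    return {
        "consumed": as_int("consumed-tokens"),
        "remaining": as_int("remaining-tokens"),
    }


def chat_with_headers(client, deployment: str, prompt: str):
    """Send one chat request and return (answer text, token counters).

    `client` is assumed to be an AzureOpenAI instance configured like the
    one in the program above.
    """
    raw = client.chat.completions.with_raw_response.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    completion = raw.parse()  # the usual ChatCompletion object
    return completion.choices[0].message.content, token_headers(raw.headers)


if __name__ == "__main__":
    # Offline demonstration with made-up header values.
    print(token_headers({"consumed-tokens": "41", "remaining-tokens": "459"}))
```

Note that the generated policy assigns the same header name, remaining-tokens, to both the remaining per-minute tokens and the remaining quota tokens, so it is unclear which of the two values that header would carry.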

Output

私はChatGPTです。あなたのお手伝いをします。何か質問やリクエストがあれば教えてください！
{'_request_id': '3108a544-e12a-42f0-8b18-912d51a26418',
 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='私はChatGPTです。何かお手伝いできることはありますか？', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'protected_material_code': {'detected': False, 'filtered': False}, 'protected_material_text': {'detected': False, 'filtered': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})],
 'created': 1752400552,
 'id': 'chatcmpl-BsniynShU3mxrKKEadKgGlhwkyHR6',
 'model': 'gpt-4.1-nano-2025-04-14',
 'object': 'chat.completion',
 'service_tier': None,
 'system_fingerprint': 'fp_68472df8fd',
 'usage': CompletionUsage(completion_tokens=19, prompt_tokens=22, total_tokens=41, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))}

When the token quota is exceeded, the following message is returned.

PermissionDeniedError: Error code: 403 - 
{'statusCode': 403, 
'message': 'Token quota is exceeded. Try again in 5 minutes and 44 seconds.'}

When tokens per minute (TPM) was exceeded, no error was raised; instead, the response came back after a wait of about one minute.
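For the quota error, a client could turn the wait hint in the message into seconds before deciding whether to retry. The sketch below parses the message format shown above; support for an "hours" unit and the function name retry_after_seconds are my own assumptions. As for the TPM case, the one-minute wait may come from the openai library's built-in retry of 429 responses, though I have no confirmation of that from this test.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def retry_after_seconds(message: str) -> Optional[int]:
    """Convert a hint such as 'Try again in 5 minutes and 44 seconds.' to seconds."""
    parts = re.findall(r"(\d+)\s+(hour|minute|second)s?", message)
    if not parts:
        return None  # no recognizable wait hint in the message
    return sum(int(amount) * _UNIT_SECONDS[unit] for amount, unit in parts)


if __name__ == "__main__":
    message = "Token quota is exceeded. Try again in 5 minutes and 44 seconds."
    print(retry_after_seconds(message))
```

With the openai library, the message text would typically be taken from the caught PermissionDeniedError before being passed to a helper like this.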

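Other notes

Things I was curious about but have not verified

In practice, I expect there will be cases where a token quota should be set per model (for example, a smaller quota for more expensive models).
I asked an AI and it suggested the policies below. I tried them but got an error, so they remain unverified; I would like to investigate and update this post when I have time.

policies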

<policies>
  <inbound>
    <base />

    <!-- 1) Store the model name in a variable -->
    <set-variable name="modelName"
      value="@(context.Request.Matches['/foundry-models/(?<m>[^/]+)/infer'].Groups['m'].Value)" />

    <!-- 2) Branch the token limit per model -->
    <choose>
      <!-- gpt-4o: 10,000 tokens per month -->
      <when condition="@(context.Variables.GetValueOrDefault<string>('modelName') == 'gpt-4o')">
        <llm-token-limit
          counter-key="@(context.Subscription.Id + '-gpt-4o')"
          tokens-per-minute="60000"
          token-quota="10000" token-quota-period="Monthly"
          estimate-prompt-tokens="false"
          remaining-tokens-variable-name="remainingTokens" />
      </when>

      <!-- gpt-4o-mini: 20,000 tokens per month -->
      <when condition="@(context.Variables.GetValueOrDefault<string>('modelName') == 'gpt-4o-mini')">
        <llm-token-limit
          counter-key="@(context.Subscription.Id + '-gpt-4o-mini')"
          tokens-per-minute="60000"
          token-quota="20000" token-quota-period="Monthly"
          estimate-prompt-tokens="false"
          remaining-tokens-variable-name="remainingTokens" />
      </when>

      <!-- Other models: no limit, or a separate setting -->
      <otherwise>
        <!-- Put a limit policy for other models here, or do nothing -->
      </otherwise>
    </choose>

    <!-- 3) Add rate limits or an overall quota afterwards if needed -->
    <rate-limit-by-key
      calls="60" renewal-period="60"
      counter-key="@(context.Subscription.Id + '-' + context.Variables.GetValueOrDefault<string>('modelName'))" />
    <quota-by-key
      calls="100000" renewal-period="2592000"
      counter-key="@(context.Subscription.Id + '-' + context.Variables.GetValueOrDefault<string>('modelName'))" />

  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
</policies>
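Independently of the policy syntax, the per-model idea itself can be sketched in Python: derive the deployment name from the request path, then build a counter key and look up a quota. One thing worth noting is that the suggested policy matches a /foundry-models/.../infer path, whereas a client like the one in section 4 would, assuming the openai library's usual URL layout, request /openai/deployments/<deployment>/chat/completions. The quota numbers below mirror the suggested policy; all function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed path layout: .../openai/deployments/<deployment>/...
DEPLOYMENT_PATTERN = re.compile(r"/openai/deployments/(?P<deployment>[^/]+)/")

# Monthly token quotas per model, mirroring the suggested policy.
MONTHLY_TOKEN_QUOTA = {"gpt-4o": 10_000, "gpt-4o-mini": 20_000}


def deployment_from_path(path: str) -> Optional[str]:
    """Return the deployment name embedded in the request path, if any."""
    match = DEPLOYMENT_PATTERN.search(path)
    return match.group("deployment") if match else None


def counter_key_and_quota(subscription_id: str, path: str) -> Optional[Tuple[str, int]]:
    """Build a per-subscription, per-model counter key and its quota.

    Returns None when the model has no dedicated quota.
    """
    deployment = deployment_from_path(path)
    quota = MONTHLY_TOKEN_QUOTA.get(deployment)
    if quota is None:
        return None
    return f"{subscription_id}-{deployment}", quota


if __name__ == "__main__":
    path = "/my-foundry-api/openai/deployments/gpt-4o/chat/completions"
    print(counter_key_and_quota("subscription-a", path))
```

Combining the subscription ID and the model name in the key keeps each model's consumption separate for each caller, which is what the counter-key expressions in the suggested policy appear to aim for.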

