foundry-hosted-agents-troubleshoot
Troubleshoot Foundry Hosted Agent errors and issues. Use when users encounter errors, failures, problems, or unexpected behavior with hosted agents. Triggers include agent failed, agent unhealthy, AcrPullUnauthorized, 403 error, AuthenticationError, connection refused, logs, debug agent, agent not working, deployment failed. USE FOR: agent failed, agent unhealthy, AcrPullUnauthorized, 403 error, AuthenticationError, connection refused, debug agent, agent not working, deployment failed, check logs, fix agent error. DO NOT USE FOR: creating new agents (use foundry-hosted-agents-create), deploying agents (use foundry-hosted-agents-deploy), normal testing (use foundry-hosted-agents-test), learning basics (use foundry-hosted-agents-quickstart). INVOKES: run_in_terminal for az cognitiveservices agent logs/status commands. FOR SINGLE OPERATIONS: run az cognitiveservices agent logs show directly for quick log checks.
Troubleshoot Foundry Hosted Agents
Use this skill when users are experiencing errors or issues with hosted agents.
For creating agents, see the foundry-hosted-agents-create skill.
For testing agents, see the foundry-hosted-agents-test skill.
For deploying agents, see the foundry-hosted-agents-deploy skill.
WHEN USER REPORTS AN ERROR - START HERE:
Step 1: Check Agent Status
az cognitiveservices agent status \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--agent-version 1
Step 2: Check Agent Logs
az cognitiveservices agent logs show \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--agent-version 1
Step 3: Match Error to Solution Below
WHEN USER SEES: "Azure AI project endpoint is required"
Cause
agent.yaml is using the wrong environment variable name.
Solution
In agent.yaml, use ${AZURE_AI_PROJECT_ENDPOINT} (the azd variable), NOT ${PROJECT_ENDPOINT}:
environment_variables:
  - name: PROJECT_ENDPOINT
    value: ${AZURE_AI_PROJECT_ENDPOINT}  # ✓ Correct
    # value: ${PROJECT_ENDPOINT}  # ✗ Wrong
WHEN USER SEES: "PROJECT_ENDPOINT environment variable is required"
Cause
When using az cognitiveservices agent create, environment variables were not passed.
Solution
Add --env flag with required variables:
az cognitiveservices agent create \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--source . \
--registry <acr-name> \
--env PROJECT_ENDPOINT=https://<account>.services.ai.azure.com/api/projects/<project> MODEL_DEPLOYMENT_NAME=gpt-4.1 \
--show-logs
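A startup check can surface this error before the agent is deployed. The following is a minimal sketch, not part of the Foundry tooling: the helper names `missing_env_vars` and `looks_like_project_endpoint` are hypothetical, and the URL check assumes only the endpoint shape shown in the command above.

```python
from urllib.parse import urlparse

# Variables this guide passes with --env
REQUIRED_ENV_VARS = ("PROJECT_ENDPOINT", "MODEL_DEPLOYMENT_NAME")


def missing_env_vars(env):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in REQUIRED_ENV_VARS if not (env.get(name) or "").strip()]


def looks_like_project_endpoint(url):
    """Check the URL shape used in this guide (an assumption, not an official rule):
    https://<account>.services.ai.azure.com/api/projects/<project>
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    return (
        parsed.scheme == "https"
        and parsed.netloc.endswith(".services.ai.azure.com")
        and len(parts) == 3
        and parts[:2] == ["api", "projects"]
    )
```

Calling `missing_env_vars(dict(os.environ))` at the top of main.py would report which variables still need to be passed with --env.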
WHEN USER SEES: "AcrPullUnauthorized" or Container Pull Errors
Cause
The project's managed identity doesn't have permission to pull from the container registry.
Solution
Grant AcrPull role:
# Get project managed identity
PROJECT_IDENTITY=$(az cognitiveservices account project show \
--name <foundry-account> \
--resource-group <resource-group> \
--project-name <project-name> \
--query identity.principalId -o tsv)
# Get ACR resource ID
ACR_ID=$(az acr show --name <acr-name> --resource-group <resource-group> --query id -o tsv)
# Grant AcrPull
az role assignment create \
--assignee $PROJECT_IDENTITY \
--role "AcrPull" \
--scope $ACR_ID
WHEN USER SEES: 403 Error, "Model access denied", or Authorization Errors
Cause
The project's managed identity doesn't have the Azure AI User role on the Foundry account.
Solution
Grant Azure AI User role:
# Get project managed identity
PROJECT_IDENTITY=$(az cognitiveservices account project show \
--name <foundry-account> \
--resource-group <resource-group> \
--project-name <project-name> \
--query identity.principalId -o tsv)
# Get Foundry account resource ID
FOUNDRY_ID=$(az cognitiveservices account show \
--name <foundry-account> \
--resource-group <resource-group> \
--query id -o tsv)
# Grant Azure AI User
az role assignment create \
--assignee $PROJECT_IDENTITY \
--role "Azure AI User" \
--scope $FOUNDRY_ID
WHEN USER SEES: "AuthenticationError" During Local Testing
Cause
The user is not logged in to the Azure CLI.
Solution
az login
az account show # Verify you're logged in
If using a specific subscription:
az account set --subscription <subscription-id>
WHEN USER SEES: Agent Status "Failed" or "Unhealthy"
Diagnosis
Check the logs for the specific error:
az cognitiveservices agent logs show \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--agent-version 1
Common Causes
| Log Message | Cause | Solution |
|---|---|---|
| PROJECT_ENDPOINT is required | Missing env var | Redeploy with --env flag |
| Model not found | Wrong model name | Check MODEL_DEPLOYMENT_NAME matches deployed model |
| Import error | Missing dependency | Add to requirements.txt and redeploy |
| Connection refused | Agent crashed on startup | Check main.py for errors |
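When the log output is long, the table lookup can be automated. This is a rough sketch that does a case-insensitive substring match against the four messages listed above; the function name `match_known_errors` is hypothetical, and real log lines may be worded differently.

```python
# Each entry: (log fragment, cause, suggested fix), copied from the table above.
KNOWN_ERRORS = [
    ("PROJECT_ENDPOINT is required", "Missing env var", "Redeploy with --env flag"),
    ("Model not found", "Wrong model name", "Check MODEL_DEPLOYMENT_NAME matches deployed model"),
    ("Import error", "Missing dependency", "Add to requirements.txt and redeploy"),
    ("Connection refused", "Agent crashed on startup", "Check main.py for errors"),
]


def match_known_errors(log_text):
    """Return (cause, fix) pairs for every known fragment found in log_text."""
    lowered = log_text.lower()
    return [
        (cause, fix)
        for fragment, cause, fix in KNOWN_ERRORS
        if fragment.lower() in lowered
    ]
```

The saved output of the logs command can be passed in as a single string; an empty result means the error is not one of the common causes and the logs need to be read manually.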
Restart Agent
az cognitiveservices agent stop \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--agent-version 1
az cognitiveservices agent start \
--account-name <account> \
--project-name <project> \
--name <agent-name> \
--agent-version 1
WHEN USER SEES: "Connection refused" or Port 8088 Issues (Local)
Cause
Agent is not running, or port is blocked/in use.
Solution
Check if port is in use:
lsof -i:8088
Kill existing process:
lsof -ti:8088 | xargs kill -9
Restart agent:
python main.py
# Or for azd projects:
python src/<agent-name>/main.py
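On systems without lsof (for example Windows), the same port check can be approximated from Python. This is a minimal sketch using only the standard library; the helper name `port_in_use` is hypothetical, and it only detects a process that is accepting TCP connections on the port.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

`port_in_use(8088)` returning True before the agent starts suggests another process already holds the port; returning False while the agent is supposedly running suggests it crashed or never started.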
WHEN USER SEES: "Invalid connection string" for App Insights
Cause
Application Insights connection string is not set or invalid.
Impact
This is usually NOT a critical error. The agent will work without App Insights, but you lose valuable observability.
Solution
Step 1: Check if project has AppInsights connection (auto-injection)
# If this returns a result, connection string should be auto-injected
az rest --method GET \
--url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<account>/projects/<project>/connections?api-version=2025-06-01" \
--query "value[?properties.category=='AppInsights'].name" -o tsv
If AppInsights connection exists: The connection string should be auto-injected. Try redeploying the agent.
If NO AppInsights connection: Continue to find and connect Application Insights.
Step 2: Find Application Insights resources
# Check resource group first
az resource list --resource-type "Microsoft.Insights/components" \
--resource-group <resource-group> \
--query "[].{name:name, id:id}" -o table
# If not found, search entire subscription
az resource list --resource-type "Microsoft.Insights/components" \
--query "[].{name:name, resourceGroup:resourceGroup, id:id}" -o table
Step 3a: If App Insights exists - Create project connection (RECOMMENDED)
# Set variables
SUBSCRIPTION_ID="<subscription-id>"
RESOURCE_GROUP="<resource-group>"
ACCOUNT_NAME="<foundry-account>"
PROJECT_NAME="<project>"
APPINSIGHTS_NAME="<app-insights-name>"
CONNECTION_NAME="${APPINSIGHTS_NAME}-connection"
# Get App Insights resource ID and connection string
APPINSIGHTS_ID=$(az monitor app-insights component show \
--app $APPINSIGHTS_NAME \
--resource-group $RESOURCE_GROUP \
--query id -o tsv)
CONN_STRING=$(az monitor app-insights component show \
--app $APPINSIGHTS_NAME \
--resource-group $RESOURCE_GROUP \
--query connectionString -o tsv)
# Create JSON body file (avoids shell escaping issues)
cat > /tmp/appinsights-connection.json << EOF
{
"properties": {
"authType": "ApiKey",
"category": "AppInsights",
"credentials": {
"key": "${CONN_STRING}"
},
"group": "ServicesAndApps",
"isDefault": true,
"metadata": {
"ApiType": "Azure",
"ResourceId": "${APPINSIGHTS_ID}"
},
"target": "${APPINSIGHTS_ID}"
}
}
EOF
# Create the connection
az rest --method PUT \
--url "https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.CognitiveServices/accounts/${ACCOUNT_NAME}/projects/${PROJECT_NAME}/connections/${CONNECTION_NAME}?api-version=2025-06-01" \
--body @/tmp/appinsights-connection.json
# Redeploy agent (connection string will be auto-injected)
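The heredoc above pastes the connection string straight into the JSON text, which can break if the value ever contains quotes or backslashes. As an alternative sketch, the same body can be generated with a JSON serializer so escaping is handled automatically. The function names are hypothetical; the field names and values are copied from the body above.

```python
import json


def build_appinsights_connection_body(connection_string, resource_id):
    """Build the request body shown above as a Python dict."""
    return {
        "properties": {
            "authType": "ApiKey",
            "category": "AppInsights",
            "credentials": {"key": connection_string},
            "group": "ServicesAndApps",
            "isDefault": True,
            "metadata": {"ApiType": "Azure", "ResourceId": resource_id},
            "target": resource_id,
        }
    }


def write_connection_body(path, connection_string, resource_id):
    """Write the body to a file that az rest can read with --body @<path>."""
    body = build_appinsights_connection_body(connection_string, resource_id)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(body, fh, indent=2)
```

The resulting file can be passed to the same az rest PUT call in place of /tmp/appinsights-connection.json.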
Step 3b: If NO App Insights exists - Create one first
az monitor app-insights component create \
--app <app-insights-name> \
--location <location> \
--resource-group <resource-group> \
--kind web \
--application-type web
# Then create the connection (Step 3a)
Step 4: Verify observability is working
Check startup logs for: Observability setup completed with provided exporters
WHEN USER SEES: Remote Test Not Working (No Response)
Cause
Usually one of:
- Wrong API being used
- Missing extra_body parameter
- Wrong agent name
Solution
Use the correct API pattern for deployed agents:
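import os

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Read the endpoint from the environment (same variable the agent uses)
PROJECT_ENDPOINT = os.environ["PROJECT_ENDPOINT"]

project_client = AIProjectClient(
    endpoint=PROJECT_ENDPOINT,
    credential=DefaultAzureCredential(),
)
# Must use get_openai_client()
openai_client = project_client.get_openai_client()
# Create a conversation so the request below has an ID to reference
conversation = openai_client.conversations.create()
# Must include extra_body
response = openai_client.responses.create(
    conversation=conversation.id,
    extra_body={"agent": {"name": "<agent-name>", "type": "agent_reference"}},  # Required!
    input="Hello!",
    store=True,
)
print(response.output_text)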
Common mistakes:
- Using AgentsClient instead of AIProjectClient.get_openai_client()
- Forgetting extra_body={"agent": {...}}
- Agent name doesn't match the agent.yaml name field
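Because a missing or malformed extra_body is one of the more common mistakes, it may help to build the value in one place. A small sketch, with a hypothetical helper name `agent_reference`; the dictionary shape is the one used in the example above.

```python
def agent_reference(agent_name):
    """Return the extra_body value that routes a request to a deployed agent."""
    name = agent_name.strip()
    if not name:
        raise ValueError("agent name must not be empty")
    return {"agent": {"name": name, "type": "agent_reference"}}
```

It would then be passed as `extra_body=agent_reference("<agent-name>")` in the responses.create call.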
WHEN USER ASKS TO VERIFY ROLE ASSIGNMENTS:
Check All Role Assignments for Project Identity
# Get project managed identity
PROJECT_IDENTITY=$(az cognitiveservices account project show \
--name <foundry-account> \
--resource-group <resource-group> \
--project-name <project-name> \
--query identity.principalId -o tsv)
# List all roles
az role assignment list \
--assignee $PROJECT_IDENTITY \
--query "[].{Role:roleDefinitionName, Scope:scope}" \
-o table
Expected Roles
| Role | Scope |
|---|---|
| AcrPull | Container Registry |
| Azure AI User | Foundry Account |
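The comparison against the expected roles can also be scripted. This sketch assumes the list command above is run with `-o json` instead of `-o table`, so that the `--query` projection yields objects with `Role` and `Scope` keys; the function name `missing_roles` is hypothetical.

```python
import json

# Roles this guide expects on the project managed identity
EXPECTED_ROLES = ("AcrPull", "Azure AI User")


def missing_roles(assignments_json):
    """Return the expected roles that do not appear in the assignment list."""
    rows = json.loads(assignments_json)
    granted = {row.get("Role") for row in rows}
    return [role for role in EXPECTED_ROLES if role not in granted]
```

Note that this checks role names only; it does not confirm that each role is assigned at the scope listed in the table.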
COMPLETE TROUBLESHOOTING CHECKLIST:
For Local Testing Issues
- Azure CLI logged in: az account show
- .env file exists with PROJECT_ENDPOINT and MODEL_DEPLOYMENT_NAME
- Virtual environment activated: source .venv/bin/activate
- Dependencies installed: pip install -r requirements.txt
- No other process on port 8088: lsof -i:8088
- Agent started successfully: python main.py
For Deployment Issues
- ACR connected to Foundry project
- AcrPull role granted to project identity
- Azure AI User role granted to project identity
- --env includes PROJECT_ENDPOINT and MODEL_DEPLOYMENT_NAME
- Model deployment exists and name matches
- Dockerfile and requirements.txt are correct
- (Optional) APPLICATIONINSIGHTS_CONNECTION_STRING included for observability
For Remote Testing Issues
- Agent status is "Running": az cognitiveservices agent status ...
- Using AIProjectClient.get_openai_client() (not AgentsClient)
- Including extra_body={"agent": {...}}
- Agent name matches agent.yaml exactly
- Azure CLI logged in: az login
For Observability Issues
- Application Insights exists: az resource list --resource-type "Microsoft.Insights/components" --resource-group <rg>
- Agent deployed with APPLICATIONINSIGHTS_CONNECTION_STRING
- Startup logs show: Observability setup completed with provided exporters
- Telemetry appearing: az monitor app-insights query --app <name> --analytics-query 'traces | take 5'
WHEN USER ASKS TO DIAGNOSE WITH APPLICATION INSIGHTS:
Query Agent Request Logs
az monitor app-insights query \
--app <app-insights-name> \
--resource-group <resource-group> \
--analytics-query 'traces | where timestamp > ago(30m) | where message has "CreateResponse" or message has "Error" or message has "Exception" | project timestamp, message, severityLevel | order by timestamp desc | take 30' \
-o json
Query for Errors Only
az monitor app-insights query \
--app <app-insights-name> \
--resource-group <resource-group> \
--analytics-query 'traces | where timestamp > ago(1h) | where severityLevel >= 3 | project timestamp, message | order by timestamp desc | take 50' \
-o json
Query Model Call Performance
az monitor app-insights query \
--app <app-insights-name> \
--resource-group <resource-group> \
--analytics-query 'dependencies | where timestamp > ago(1h) | where name has "chat" | summarize avgDuration=avg(duration), count=count() by name' \
-o json
Query Failed Dependencies
az monitor app-insights query \
--app <app-insights-name> \
--resource-group <resource-group> \
--analytics-query 'dependencies | where timestamp > ago(1h) | where success == false | project timestamp, name, duration, resultCode | order by timestamp desc' \
-o json
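The JSON returned by these queries is easier to scan once each row is paired with its column names. The sketch below assumes the output has the shape `{"tables": [{"columns": [{"name": ...}], "rows": [[...]]}]}`; this is an assumption about the query response, so check it against real output before relying on it. The function name `rows_as_dicts` is hypothetical.

```python
import json


def rows_as_dicts(query_output_json, table_index=0):
    """Convert one result table into a list of {column name: value} dicts.

    Assumes the payload shape described above (tables -> columns/rows).
    """
    payload = json.loads(query_output_json)
    table = payload["tables"][table_index]
    names = [col["name"] for col in table["columns"]]
    return [dict(zip(names, row)) for row in table["rows"]]
```

For the failed-dependencies query, each resulting dict would carry the projected timestamp, name, duration, and resultCode fields.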