Commit b3e96b8

Python: Use AI Foundry evaluators for self-reflection (microsoft#2250)
* First working version
* Simplify the implementations
* Remove unused env var
* Update Python syntax
* Address feedback
* Fix a typo
* Update names per review suggestions
* Add citation for self-reflection
* Move to independent folder
* Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/README.md
* Update from Parquet to JSONL and hide the default environment variables
* Per review feedback, remove the option of using `run_self_reflection_batch` as a library; keep it as sample code only
* Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py

Co-authored-by: Eduard van Valkenburg <[email protected]>
1 parent 92df9e1 commit b3e96b8

File tree

5 files changed

+490
-0
lines changed


python/samples/README.md

Lines changed: 1 addition & 0 deletions
@@ -185,6 +185,7 @@ This directory contains samples demonstrating the capabilities of Microsoft Agen
 | File | Description |
 |------|-------------|
 | [`getting_started/evaluation/azure_ai_foundry/red_team_agent_sample.py`](./getting_started/evaluation/azure_ai_foundry/red_team_agent_sample.py) | Red team agent evaluation sample for Azure AI Foundry |
+| [`getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py`](./getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py) | LLM self-reflection with AI Foundry graders example |
 
 ## MCP (Model Context Protocol)

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+AZURE_OPENAI_ENDPOINT="..."
+AZURE_OPENAI_API_KEY="..."
Lines changed: 75 additions & 0 deletions
# Self-Reflection Evaluation Sample

This sample demonstrates the self-reflection pattern using Agent Framework and Azure AI Foundry's Groundedness Evaluator. For details, see [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (NeurIPS 2023).

## Overview

**What it demonstrates:**
- Iterative self-reflection loop that automatically improves responses based on groundedness evaluation
- Batch processing of prompts from Parquet files with progress tracking
- Using `AzureOpenAIChatClient` with Azure CLI authentication
- Comprehensive summary statistics and detailed result tracking
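The `AzureOpenAIChatClient` bullet above can be sketched as a minimal setup fragment. This is an assumption-laden sketch, not the sample's actual code: it assumes the Agent Framework package exposes the client under `agent_framework.azure`, that the client reads `AZURE_OPENAI_ENDPOINT` and the deployment name from the environment, and the agent name and instructions here are purely illustrative.

```python
# Minimal setup sketch (assumptions: module path agent_framework.azure,
# endpoint/deployment picked up from environment variables).
from agent_framework.azure import AzureOpenAIChatClient
from azure.identity import AzureCliCredential  # reuses the `az login` session

# With Azure CLI auth no API key is needed.
client = AzureOpenAIChatClient(credential=AzureCliCredential())
agent = client.create_agent(
    name="self-reflection-agent",  # illustrative name, not from the sample
    instructions="Answer using only the provided context.",  # illustrative
)
```

This fragment requires live Azure resources and `az login`, so it is configuration rather than something to run standalone.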

## Prerequisites

### Azure Resources
- **Azure OpenAI**: Deploy models (default: gpt-4.1 for both agent and judge)
- **Azure CLI**: Run `az login` to authenticate

### Python Environment
```bash
pip install agent-framework-core azure-ai-evaluation pandas --pre
```

### Environment Variables
```bash
# .env file
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key  # Optional with Azure CLI
```
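As a sketch of how a script might consume this configuration, the helper below (hypothetical, not the sample's actual code) treats the endpoint as required and the API key as optional, matching the comment above:

```python
import os

def load_config() -> dict:
    """Read Azure OpenAI settings from the environment (hypothetical helper).

    AZURE_OPENAI_API_KEY may be absent when Azure CLI auth is used instead.
    """
    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    if not endpoint:
        raise RuntimeError("AZURE_OPENAI_ENDPOINT is required")
    return {
        "endpoint": endpoint,
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),  # None -> fall back to CLI auth
    }

# Example:
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
config = load_config()
print(config["endpoint"])
```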

## Running the Sample

```bash
# Basic usage
python self_reflection.py

# With options
python self_reflection.py --input my_prompts.parquet \
  --output results.parquet \
  --max-reflections 5 \
  -n 10
```

**CLI Options:**
- `--input`, `-i`: Input parquet file
- `--output`, `-o`: Output parquet file
- `--agent-model`, `-m`: Agent model name (default: gpt-4.1)
- `--judge-model`, `-e`: Evaluator model name (default: gpt-4.1)
- `--max-reflections`: Max iterations (default: 3)
- `--limit`, `-n`: Process only first N prompts
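The option list above maps onto a standard `argparse` setup. A minimal sketch — the flag names, defaults, and short aliases come from the list, while the default input/output filenames and help strings are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(description="Self-reflection batch evaluation")
    parser.add_argument("--input", "-i", default="prompts.parquet",  # assumed default
                        help="Input parquet file")
    parser.add_argument("--output", "-o", default="results.parquet",  # assumed default
                        help="Output parquet file")
    parser.add_argument("--agent-model", "-m", default="gpt-4.1", help="Agent model name")
    parser.add_argument("--judge-model", "-e", default="gpt-4.1", help="Evaluator model name")
    parser.add_argument("--max-reflections", type=int, default=3, help="Max iterations")
    parser.add_argument("--limit", "-n", type=int, default=None,
                        help="Process only first N prompts")
    return parser

args = build_parser().parse_args(["--max-reflections", "5", "-n", "10"])
print(args.max_reflections, args.limit)  # 5 10
```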

## Understanding Results

The agent iteratively improves responses:
1. Generate initial response
2. Evaluate groundedness (1-5 scale)
3. If score < 5, provide feedback and retry
4. Stop at max iterations or perfect score (5/5)
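The four steps above can be sketched as a loop. Here `generate` and `score_groundedness` are stand-ins for the agent call and the Groundedness Evaluator (stubbed so the sketch is self-contained and runnable); the feedback wording is illustrative:

```python
from typing import Callable

def self_reflect(
    prompt: str,
    generate: Callable[[str], str],            # stand-in for the agent call
    score_groundedness: Callable[[str], int],  # stand-in for the evaluator (1-5)
    max_reflections: int = 3,
) -> tuple[str, int]:
    """Regenerate until a perfect groundedness score or the iteration cap."""
    best_response, best_score = "", 0
    feedback = ""
    for _ in range(max_reflections):
        response = generate(prompt + feedback)
        score = score_groundedness(response)
        if score > best_score:                 # track the best attempt so far
            best_response, best_score = response, score
        if score == 5:                         # perfect groundedness: stop early
            break
        feedback = f"\nPrevious answer scored {score}/5 on groundedness; revise it."
    return best_response, best_score

# Stubbed example: the fake evaluator scores 3 on the first try, 5 on the retry.
scores = iter([3, 5])
answer, score = self_reflect(
    "Summarize the context.",
    generate=lambda p: f"answer ({len(p)} chars of input)",
    score_groundedness=lambda r: next(scores),
)
print(score)  # 5
```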

**Example output:**

```
[1/31] Processing prompt 0...
Self-reflection iteration 1/3...
Groundedness score: 3/5
Self-reflection iteration 2/3...
Groundedness score: 5/5
✓ Perfect groundedness score achieved!
✓ Completed with score: 5/5 (best at iteration 2/3)
```

## Related Resources

- [Reflexion Paper](https://arxiv.org/abs/2303.11366)
- [Azure AI Evaluation SDK](https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk)
- [Agent Framework](https://github.com/microsoft/agent-framework)
