SKYLENAGE-GameCodeGym (V-GameGym) is an open-source benchmark for evaluating the ability of Large Language Models (LLMs) to generate functional, playable, and visually rich games built on the Pygame library.
The framework provides a complete pipeline for automatic game generation, execution, evaluation, and gameplay recording, bridging the gap between code generation accuracy and real-world game development workflows.
- Automatic Game Generation: Convert natural language requirements into runnable Pygame code with LLMs.
- Comprehensive Game Evaluation: Built-in scoring metrics for functionality, playability, and execution.
- Visual Recording: Automated screenshots and gameplay videos during execution.
- Testset Management: Includes a curated dataset with 2,219 game samples across 100 clusters.
- Parallel Processing: Multiprocessing support for efficient large-scale evaluation.
```
V-GameGym-opensource/
├── game_evaluator.py          # Main evaluation script
├── generate_pygame_codes.py   # Game generation utilities
├── screenshot_recorder.py     # Screenshot and video recording
├── config/
│   └── config.json            # LLM client configuration
├── gamegym_testset/
│   ├── gamegym_testset.jsonl  # Test cases dataset
│   └── files/                 # Generated game files and media
└── V_GameGym.pdf              # Research paper
```
- Python 3.10+
- Pygame
- OpenAI API access or compatible LLM endpoint
```bash
pip install pygame numpy pillow openai tqdm jsonlines
```

Edit `config/config.json` to configure your LLM API:
```json
{
  "client_config": {
    "api_key": "your-api-key",
    "base_url": "your-llm-endpoint",
    "timeout": 7200,
    "max_retries": 10
  },
  "chat_config": {
    "model": "your-model-name",
    "temperature": 0.7,
    "max_tokens": 8192
  }
}
```
Generate games from natural-language requirements:

```bash
python generate_pygame_codes.py --config config/config.json --input requirements.jsonl --output generated_games.jsonl
```

Evaluate generated games, optionally recording screenshots and videos:

```bash
python game_evaluator.py --input games.jsonl --output results.jsonl --record-screenshots --generate-videos
```

Record gameplay for a single game file:

```bash
python screenshot_recorder.py --game-file game.py --duration 10 --fps 5
```

The project includes a comprehensive testset (`gamegym_testset/gamegym_testset.jsonl`) with diverse game examples:
- Puzzle Games: Sliding puzzle, Tetris-style games
- Action Games: Frogger-like crossing games, dodge games
- Sports Games: Pong-style paddle games
- Arcade Games: Various classic arcade game implementations
Each test case includes:
- Game requirements description
- Generated Python code
- Execution results and metadata
- Screenshots and gameplay videos
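The exact record schema is defined by the dataset itself; a quick way to inspect it is to iterate over the JSONL with the `jsonlines` package from the install step:

```python
import jsonlines

# Print the field names of the first few test cases to see the record schema.
with jsonlines.open("gamegym_testset/gamegym_testset.jsonl") as reader:
    for i, case in enumerate(reader):
        print(sorted(case.keys()))
        if i >= 2:
            break
```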
`generate_pygame_codes.py` interfaces with LLMs for code generation and includes batch processing, error handling, and retries.
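As a rough, non-authoritative sketch of what one generation pass might look like, assuming an OpenAI-compatible endpoint and illustrative record fields (`requirement`, `code`, `error`); the actual prompting, batching, and retry logic are defined in `generate_pygame_codes.py`:

```python
import json
import jsonlines
from openai import OpenAI

def generate_games(requirements_path: str, output_path: str,
                   config_path: str = "config/config.json") -> None:
    """Hypothetical batch loop: one LLM call per requirement, errors kept per item."""
    with open(config_path) as f:
        cfg = json.load(f)
    # max_retries in client_config makes the client retry failed requests itself.
    client = OpenAI(**cfg["client_config"])
    with jsonlines.open(requirements_path) as reader, \
         jsonlines.open(output_path, mode="w") as writer:
        for item in reader:  # field names below are illustrative, not the real schema
            try:
                resp = client.chat.completions.create(
                    messages=[{"role": "user", "content": item["requirement"]}],
                    **cfg["chat_config"],
                )
                item["code"] = resp.choices[0].message.content
            except Exception as exc:
                item["error"] = str(exc)  # record the failure and keep the batch going
            writer.write(item)
```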
`screenshot_recorder.py` captures screenshots during game execution and converts the image sequences into gameplay videos.
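A rough illustration of the two steps, capturing frames from the Pygame display and stitching them into an animation. Pillow writes a GIF here purely for simplicity; the real recorder produces videos, and the helper names and paths are illustrative:

```python
import glob
import os
import pygame
from PIL import Image

def capture_frame(frame_index: int, out_dir: str = "frames") -> None:
    """Save the current display surface as a numbered PNG (call once per sampled frame)."""
    os.makedirs(out_dir, exist_ok=True)
    pygame.image.save(pygame.display.get_surface(),
                      os.path.join(out_dir, f"frame_{frame_index:05d}.png"))

def frames_to_gif(out_dir: str = "frames", fps: int = 5) -> None:
    """Stitch the captured frames into an animated GIF (stand-in for video output)."""
    frames = [Image.open(p) for p in sorted(glob.glob(os.path.join(out_dir, "*.png")))]
    if frames:
        frames[0].save("gameplay.gif", save_all=True, append_images=frames[1:],
                       duration=int(1000 / fps), loop=0)
```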
`game_evaluator.py` runs each game in an isolated environment and records errors, screenshots, and evaluation metrics.
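Conceptually, each game runs in its own process with a timeout so that crashes or hangs cannot take down the evaluator; a minimal standard-library sketch of that isolation (the real evaluator also drives screenshot capture and scoring):

```python
import subprocess
import sys

def run_game(game_file: str, timeout_s: int = 30) -> dict:
    """Execute a generated game in a separate process and capture the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, game_file],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"returncode": proc.returncode, "stderr": proc.stderr[-2000:]}
    except subprocess.TimeoutExpired:
        # A long-running game hitting the time limit is not necessarily a failure.
        return {"returncode": None, "stderr": "timed out"}
```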
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is released under the Apache License 2.0. See the LICENSE file for details.
If you use V-GameGym in your research, please cite:
```bibtex
@misc{zhang2025vgamegymvisualgamegeneration,
title = {V-GameGym: Visual Game Generation for Code Large Language Models},
author = {Wei Zhang and Jack Yang and Renshuai Tao and Lingzheng Chai and Shawn Guo and Jiajun Wu and Xiaoming Chen and Ganqu Cui and Ning Ding and Xander Xu and Hu Wei and Bowen Zhou},
year = {2025},
eprint = {2509.20136},
archivePrefix = {arXiv},
primaryClass = {cs.SE},
url = {https://arxiv.org/abs/2509.20136}
}
```

- Thanks to the Pygame community for the excellent framework
- OpenAI and other LLM providers for enabling automated code generation
- All contributors and researchers advancing automated programming
🔗 Official Website: Skylenage Benchmark Platform
📧 Contact Us: [email protected]