# Oracle Text-to-SQL

An AI-powered tool that converts natural language queries into Oracle SQL statements, with advanced features for schema understanding and query optimization.
## Features

- Natural language to SQL conversion using Claude Sonnet
- Intelligent schema understanding and relationship mapping
- Support for Oracle-specific features:
  - Window functions
  - Analytic functions
  - Advanced analytics (MODEL clause, pattern matching)
- Query optimization and performance suggestions
- Detailed query explanations
- Flexible table filtering and schema management
## Requirements

- Python 3.8+
- Oracle Database
- Anthropic API key
- Oracle Client libraries
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/oracle-text-to-sql.git
  cd oracle-text-to-sql
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your configuration:
  ```env
  # Oracle Database
  ORACLE_USER=your_username
  ORACLE_PASSWORD=your_password
  ORACLE_HOST=your_host
  ORACLE_PORT=1521
  ORACLE_SERVICE=your_service

  # Anthropic
  ANTHROPIC_API_KEY=your_api_key

  # Model Configuration
  CLAUDE_MODEL=claude-3-sonnet-20240229
  ```

## Usage

- Initialize the `TextToSQLGenerator`:
  ```python
  import os

  import oracledb
  from dotenv import load_dotenv  # assumes python-dotenv is installed

  from schema_manager import SchemaManager
  from sql_generator import TextToSQLGenerator

  load_dotenv()  # read the .env file created above

  # Create database connection
  connection = oracledb.connect(
      user=os.getenv('ORACLE_USER'),
      password=os.getenv('ORACLE_PASSWORD'),
      host=os.getenv('ORACLE_HOST'),
      port=int(os.getenv('ORACLE_PORT', '1521')),
      service_name=os.getenv('ORACLE_SERVICE'),
  )

  # Initialize with specific tables
  schema_manager = SchemaManager(connection, use_specific_tables=True)

  # Create generator
  generator = TextToSQLGenerator(connection)
  ```

- Generate SQL from natural language:
  ```python
  query = "Show me the top 5 employees by salary in each department"
  result = generator.generate_sql(query)
  print(result['sql'])
  print(result['explanation'])
  ```

## Project Structure

- `sql_generator.py`: Main SQL generation logic
- `schema_manager.py`: Database schema management
- `llm_manager.py`: LLM integration and query analysis
- `api.py`: FastAPI endpoints
- `main.py`: Application entry point
## Contributing

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Anthropic for the Claude API
- Oracle for database features
- FastAPI for the web framework
## Module Overview

- `main.py`: Entry point. Loads the environment, starts Uvicorn, and serves the FastAPI app from `api.py`.
- `api.py`: REST API. Endpoints to generate SQL, validate/execute queries, get optimizations, fetch/export the schema, and retrieve learning data. Creates a single `TextToSQLGenerator` instance.
- `sql_generator.py`: Core orchestration. Connects to Oracle, builds prompts, calls the LLM for SQL, validates via EXPLAIN PLAN, optionally executes queries, suggests optimizations, and exposes helpers (hints, stored procedures, templates).
- `schema_manager.py`: Reads the schema from Oracle. Loads tables, columns, PKs/FKs, relationships, and stats; supports filtering by table list or name pattern; formats the schema for the LLM; exports the schema to JSON.
- `schema_ai.py`: AI understanding of the schema. Uses embeddings and classification to map business terms to tables/columns and identify relationships; used to find relevant tables/entities for a query.
- `llm_manager.py`: Interface to Claude Sonnet. Generates SQL from prompts, analyzes queries (JSON output), and provides SQL optimization suggestions; includes prompt templates (CTEs, window/analytic functions, advanced Oracle features).
- `learning_manager.py`: Learning and adaptation. Tracks query history, finds similar queries via embeddings, logs uncertain cases, incorporates feedback, extracts patterns, and generates query templates; includes a basic optimization helper.
If you only want to fetch the schema from your database:

- Run `main.py`, then call `GET /schema` or `GET /schema/export`.
Generate SQL from the command line with `sql_cli.py`:

```bash
python sql_cli.py --query "Top 10 customers by total sales last year" --tables OBJ_BP OBJ_KEY --pattern bp --execute
```

Export the schema with `schema_manager.py`:

- Pattern + specific tables:

  ```bash
  python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --export schema_export.json
  ```

- With optional stats:

  ```bash
  python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --stats
  ```

- All tables in the database:

  ```bash
  python schema_manager.py --export schema_export.json
  ```

- All tables with optional stats:

  ```bash
  python schema_manager.py --stats
  ```
When you run:

```bash
python schema_manager.py --export schema_export.json
```

the CLI creates `SchemaManager(connection)` with no `table_filter` and `use_specific_tables=False`. `SchemaManager.__init__` calls `initialize_schema()`, which queries all user tables:

```sql
SELECT table_name FROM user_tables WHERE table_name NOT LIKE 'BIN$%'
```

Because there is no `table_filter`, this loads every table owned by the connected user. For the export, `save_schema_to_file()` calls `get_complete_schema()`, which independently fetches all tables again via:

```sql
SELECT table_name, comments FROM user_tab_comments WHERE table_type = 'TABLE'
```

and then pulls columns (`user_tab_columns`) and PKs/FKs (`all_constraints`/`all_cons_columns`) for each table. The export therefore always includes all tables unless you explicitly filter. To limit the export:

- By pattern and specific list:

  ```bash
  python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --export schema_export.json
  ```

- By pattern only:

  ```bash
  python schema_manager.py --pattern bp --export schema_export.json
  ```
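As an illustration of the filtering logic described above, here is a minimal sketch of how the table query could be narrowed by the `--tables` and `--pattern` flags. The function name and bind-variable scheme are assumptions for illustration, not the actual `schema_manager.py` code:

```python
def build_table_query(tables=None, pattern=None):
    """Build the user_tables query, optionally restricted to specific
    tables and/or a name pattern (mirrors the --tables/--pattern flags)."""
    sql = "SELECT table_name FROM user_tables WHERE table_name NOT LIKE 'BIN$%'"
    binds = {}
    if tables:
        # One bind variable per requested table; Oracle identifiers are upper-case.
        placeholders = ", ".join(f":t{i}" for i in range(len(tables)))
        sql += f" AND table_name IN ({placeholders})"
        binds.update({f"t{i}": t.upper() for i, t in enumerate(tables)})
    if pattern:
        # Case-insensitive substring match on the table name.
        sql += " AND LOWER(table_name) LIKE :pat"
        binds["pat"] = f"%{pattern.lower()}%"
    return sql, binds
```

With no arguments this reproduces the unfiltered query, which is why the default export contains every table.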
## schema_ai.py

Purpose:

- Builds an AI understanding of your database schema so natural language queries can be mapped to the right tables and columns.
- Provides semantic search over schema entities and extracts business-like terms.

Dependencies:

- Vertex AI Text Embeddings (`textembedding-gecko@latest`) for semantic vectors.
- A BERT-based classifier and tokenizer (via HuggingFace) for NLP tasks.
- Logging and environment configuration via `.env`.
Key structures:

- `SchemaEntity`: represents a table or column with a description, embedding, and related entities.
- Internal maps:
  - `entities`: name → `SchemaEntity`
  - `entity_clusters`: cluster_id → [entity names]
  - `business_term_mappings`: discovered domain terms → metadata
Main methods:

- `analyze_schema(schema_json)`:
  - Iterates over the exported JSON tables.
  - Creates entities for each table and its columns.
  - Generates embeddings for descriptions.
  - Links columns back to their table as related entities.
- `cluster_entities(threshold)`:
  - Groups similar entities by cosine similarity over embeddings.
- `generate_business_terms()`:
  - Extracts common terms from clusters to propose business terminology and related tables/columns.
- `understand_query(query)`:
  - Embeds the query and computes cosine similarity with all entity embeddings.
  - Returns the top-matched entities and extracted key terms.
- `extract_meaningful_terms(text)`:
  - Simple NLP cleaning plus stop-word removal to keep meaningful tokens.
- `save_analysis(path)`:
  - Persists entities, clusters, and generated business terms to JSON.
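The relevance scoring in `understand_query` reduces to cosine similarity between the query embedding and each entity embedding. A minimal, dependency-free sketch; the `entities` mapping of name → vector is an assumed shape, not the module's actual structure:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_entities(query_embedding, entities, top_k=5):
    """Score every schema entity against the query embedding and
    return the top_k matches, highest similarity first."""
    scored = [(name, cosine(query_embedding, emb)) for name, emb in entities.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

In practice the vectors come from the embedding model, so two entities with similar descriptions ("salary", "compensation") end up close together even when their names share no tokens.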
How it is used:

- During startup (API or CLI), the schema JSON is analyzed to build a semantic map.
- For a user query, `understand_query` returns relevant entities, which are used to:
  - Pick relevant tables for prompts
  - Provide schema context to the LLM for SQL generation
Possible extensions:

- A method to load entities back from a saved analysis to avoid recomputation.
- A filter to restrict relevance to a specific table subset at runtime.
## llm_manager.py

Purpose:

- Abstraction over the LLM (Claude Sonnet) used for:
  - Generating Oracle SQL from a prompt
  - Analyzing a natural-language query into structured JSON (intent, entities, window/analytic functions, CTEs, complexity)
  - Providing SQL optimization suggestions

Key responsibilities:

- Model setup and configuration (API keys via environment variables).
- A safe call wrapper: retries, basic rate limiting, and error logging.
- Prompt templates optimized for Oracle:
  - Encourages ANSI joins, CTEs, window/analytic functions, advanced features (MODEL, MATCH_RECOGNIZE, PIVOT/UNPIVOT), and Oracle-specific syntax.
Main methods:

- `generate_sql(prompt: str) -> str`
  - Sends a carefully engineered prompt to Claude Sonnet and returns only the SQL text.
- `analyze_query(query: str) -> Dict[str, Any]`
  - Produces a structured JSON analysis (intent, entities, conditions, aggregations, window functions, analytic functions, advanced features, complexity/performance hints).
- `optimize_sql(sql: str) -> Dict[str, Any]`
  - Asks the LLM for optimization guidance (indexes, CTE refactors, pushdowns, window function tuning).
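The safe call wrapper mentioned above can be approximated by a generic retry helper with exponential backoff. This is a sketch only; the function name, defaults, and backoff schedule are assumptions, not the module's actual implementation:

```python
import logging
import time

def call_with_retries(fn, *args, max_retries=3, base_delay=1.0, **kwargs):
    """Call an LLM function, retrying transient failures with exponential
    backoff and logging each failure. Re-raises once retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logging.warning("LLM call failed (attempt %d/%d): %s",
                            attempt + 1, max_retries, exc)
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... between attempts (scaled by base_delay).
            time.sleep(base_delay * (2 ** attempt))
```

A production wrapper would typically narrow the `except` clause to rate-limit and timeout errors rather than retrying on every exception.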
How it is used:

- `sql_generator.py`: Builds a prompt from schema/context and calls `llm_manager.generate_sql()` to get the SQL.
- `local_cli.py`: Directly builds a prompt from the exported JSON plus `SchemaAI` and calls `generate_sql()` for offline generation.
- Optionally, `analyze_query()` can be used before generation to enforce structure and better constraints.
Possible extensions:

- A strict "SQL-only" generation guardrail (regex post-filtering).
- A "SQL + explanation" mode.
- Temperature/top_p controls via CLI/API flags.
## learning_manager.py

Purpose:

- Learns from usage and feedback to improve future SQL generation.
Key capabilities:
- Similar query search: stores past queries with embeddings; finds nearest matches for new queries.
- Pattern recognition: extracts generalized templates from successful queries (e.g., common joins/filters) to guide prompt building.
- Feedback incorporation: records user feedback (correct/incorrect, corrected SQL) and updates patterns/templates accordingly.
- Uncertain queries tracking: logs low-confidence or error-prone cases for review.
- Domain terms evolution: updates internal domain terminology stats (usage counts, confidence) from observed queries and feedback.
- Optimization hints: analyzes execution outcomes/metadata to suggest improvements (e.g., indexing opportunities, join order, pushdowns).
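One plausible shape for the domain-terms evolution above is a running per-term confidence updated on each feedback event. This sketch assumes a simple dict layout and incremental-mean update rule; it is illustrative, not the module's actual code:

```python
def update_term_stats(stats, terms, correct):
    """Update per-term usage counts and a running confidence score
    based on whether the generated SQL was judged correct."""
    for term in terms:
        entry = stats.setdefault(term, {"count": 0, "confidence": 0.5})
        entry["count"] += 1
        # Incremental mean: move confidence toward 1.0 on success,
        # toward 0.0 on failure, weighted by how often the term was seen.
        target = 1.0 if correct else 0.0
        entry["confidence"] += (target - entry["confidence"]) / entry["count"]
    return stats
```

Terms that consistently lead to correct SQL converge toward high confidence, so they can be trusted more when mapping business vocabulary to tables and columns.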
Typical flow:
- After generation/execution, you call analyze_feedback(...) with the outcome.
- It updates:
- Successful pattern store
- Query templates
- Domain term stats
- History/embeddings for similarity search
- Next queries benefit from:
- Better similar examples
- Stronger templates
- More accurate term mapping
- Optimization suggestions
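The pattern-extraction step in this flow can be as simple as stripping literals out of a successful query to form a reusable template. A hypothetical sketch; the placeholder names are illustrative:

```python
import re

def generalize_query(sql):
    """Turn a concrete SQL statement into a reusable template by
    replacing string and numeric literals with named placeholders."""
    template = re.sub(r"'[^']*'", ":str_param", sql)               # string literals
    template = re.sub(r"\b\d+(\.\d+)?\b", ":num_param", template)  # numeric literals
    return template
```

Two structurally identical queries that differ only in their literals then map to the same template, which makes similarity matching and prompt enrichment much cheaper than comparing raw SQL.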
Outputs used by others:

- `sql_generator.py` fetches similar queries/templates to enrich prompts.
- API endpoints expose uncertain queries, templates, and domain terms for monitoring.