vamsikan/oracle-text-to-sql
Oracle Text-to-SQL Generator

An AI-powered tool that converts natural language queries into Oracle SQL statements, with advanced features for schema understanding and query optimization.

Features

  • Natural language to SQL conversion using Claude Sonnet
  • Intelligent schema understanding and relationship mapping
  • Support for Oracle-specific features:
    • Window functions
    • Analytic functions
    • Advanced analytics (MODEL clause, pattern matching)
  • Query optimization and performance suggestions
  • Detailed query explanations
  • Flexible table filtering and schema management

Prerequisites

  • Python 3.8+
  • Oracle Database
  • Anthropic API key
  • Oracle Client libraries

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/oracle-text-to-sql.git
cd oracle-text-to-sql
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Create a .env file with your configuration:
# Oracle Database
ORACLE_USER=your_username
ORACLE_PASSWORD=your_password
ORACLE_HOST=your_host
ORACLE_PORT=1521
ORACLE_SERVICE=your_service

# Anthropic
ANTHROPIC_API_KEY=your_api_key

# Model Configuration
CLAUDE_MODEL=claude-3-sonnet-20240229
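To sanity-check this configuration at startup, a minimal .env loader can be sketched in pure Python (a simplified stand-in for the python-dotenv package; the required-key list is an assumption based on the file above):

```python
def load_env_file(path=".env"):
    """Minimal .env parser (sketch; the python-dotenv package does this robustly)."""
    values = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, val = line.partition("=")
            values[key.strip()] = val.strip()
    return values

def check_required(env, keys=("ORACLE_USER", "ORACLE_PASSWORD", "ANTHROPIC_API_KEY")):
    """Return the names of any required settings that are missing or empty."""
    return [k for k in keys if not env.get(k)]
```

Failing fast on missing settings is usually preferable to a cryptic connection error later.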

Usage

  1. Initialize the TextToSQLGenerator:
import os

import oracledb

from schema_manager import SchemaManager
from sql_generator import TextToSQLGenerator

# Create database connection
connection = oracledb.connect(
    user=os.getenv('ORACLE_USER'),
    password=os.getenv('ORACLE_PASSWORD'),
    host=os.getenv('ORACLE_HOST'),
    port=int(os.getenv('ORACLE_PORT', '1521')),
    service_name=os.getenv('ORACLE_SERVICE')
)

# Initialize with specific tables
schema_manager = SchemaManager(connection, use_specific_tables=True)

# Create generator
generator = TextToSQLGenerator(connection)
  2. Generate SQL from natural language:
query = "Show me the top 5 employees by salary in each department"
result = generator.generate_sql(query)
print(result['sql'])
print(result['explanation'])

Project Structure

  • sql_generator.py: Main SQL generation logic
  • schema_manager.py: Database schema management
  • llm_manager.py: LLM integration and query analysis
  • api.py: FastAPI endpoints
  • main.py: Application entry point

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Anthropic for Claude API
  • Oracle for database features
  • FastAPI for the web framework

Project files overview

  • main.py: Entry point. Loads environment, starts Uvicorn, serves FastAPI app from api.py.
  • api.py: REST API. Endpoints to generate SQL, validate/execute, get optimizations, fetch/export schema, and retrieve learning data. Creates a single TextToSQLGenerator instance.
  • sql_generator.py: Core orchestration. Connects to Oracle, builds prompts, calls LLM for SQL, validates via EXPLAIN PLAN, optionally executes queries, suggests optimizations, and exposes helpers (hints, stored procedures, templates).
  • schema_manager.py: Reads schema from Oracle. Loads tables, columns, PK/FK, relationships, stats; supports filtering by tables/pattern; formats schema for LLM; exports schema to JSON.
  • schema_ai.py: AI understanding of schema. Embeddings/classification to map business terms to tables/columns and identify relationships; used to find relevant tables/entities for a query.
  • llm_manager.py: Interface to Claude Sonnet. Generates SQL from prompts, analyzes queries (JSON output), and provides SQL optimization suggestions; includes prompt templates (CTEs, window/analytic functions, advanced Oracle features).
  • learning_manager.py: Learning and adaptation. Tracks query history, similar queries via embeddings, uncertain cases, feedback incorporation, pattern extraction, and query template generation; basic optimization helper.

If you only plan to fetch the schema from your database:

  • Run main.py, then call GET /schema or GET /schema/export.

CLI usage (sql_cli.py):

python sql_cli.py --query "Top 10 customers by total sales last year" --tables OBJ_BP OBJ_KEY --pattern bp --execute

Schema export (schema_manager.py):

  • Pattern + specific tables: python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --export schema_export.json
  • With optional stats: python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --stats
  • All tables in the DB: python schema_manager.py --export schema_export.json
  • All tables with optional stats: python schema_manager.py --stats

When you run python schema_manager.py --export schema_export.json, the CLI creates SchemaManager(connection) with no table_filter and use_specific_tables=False. SchemaManager.__init__ calls initialize_schema(), which queries all user tables:

SELECT table_name FROM user_tables WHERE table_name NOT LIKE 'BIN$%'

Because there is no table_filter, it loads every table owned by the connected user. For the export, save_schema_to_file() calls get_complete_schema(), which independently fetches all tables again via:

SELECT table_name, comments FROM user_tab_comments WHERE table_type = 'TABLE'

It then pulls columns (user_tab_columns) plus primary and foreign keys (all_constraints/all_cons_columns) for each table. So the export always includes all tables unless you explicitly filter. To limit the export, run:

  • By pattern and specific list: python schema_manager.py --tables OBJ_BP OBJ_KEY --pattern bp --export schema_export.json
  • By pattern only: python schema_manager.py --pattern bp --export schema_export.json
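To illustrate the filtering semantics described above, here is a hypothetical client-side sketch (not the project's actual code) of combining an explicit table list with a name pattern while skipping recycle-bin objects:

```python
def filter_tables(all_tables, include=None, pattern=None):
    """Sketch of the selection rules: explicit list OR name pattern; no filter keeps everything."""
    keep = []
    for name in all_tables:
        if name.startswith("BIN$"):  # recycle-bin objects, mirroring the NOT LIKE 'BIN$%' clause
            continue
        if include is None and pattern is None:
            keep.append(name)                          # no filter: every user table
        elif include and name in include:
            keep.append(name)                          # explicitly requested
        elif pattern and pattern.upper() in name.upper():
            keep.append(name)                          # matches the name pattern
    return keep
```

With --tables OBJ_BP OBJ_KEY --pattern bp, a table is kept if it is listed explicitly or its name contains "bp" (case-insensitive), which matches the CLI behavior described above.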

Purpose of schema_ai.py

  • Builds an AI-understanding of your database schema so natural language queries can be mapped to the right tables/columns.
  • Provides semantic search over schema entities and extracts business-like terms.

What it loads/initializes

  • Vertex AI Text Embeddings (textembedding-gecko@latest) for semantic vectors.
  • BERT-based classifier (via HuggingFace) and tokenizer for NLP tasks.
  • Logging and environment via .env.

Core data structures

  • SchemaEntity: represents a table or column with description, embedding, and related entities.
  • Internal maps:
    • entities: name → SchemaEntity
    • entity_clusters: cluster_id → [entity names]
    • business_term_mappings: discovered domain terms → metadata
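A rough sketch of how the SchemaEntity structure and internal maps described above might look (field names are illustrative; the actual class may differ):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SchemaEntity:
    """A table or column with its description, embedding, and related entities (illustrative sketch)."""
    name: str
    entity_type: str                                   # "table" or "column"
    description: str = ""
    embedding: Optional[List[float]] = None
    related_entities: List[str] = field(default_factory=list)

# Internal maps, as described above
entities: Dict[str, SchemaEntity] = {}                 # name -> SchemaEntity
entity_clusters: Dict[int, List[str]] = {}             # cluster_id -> [entity names]
business_term_mappings: Dict[str, dict] = {}           # domain term -> metadata
```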

Key methods and flow

  • analyze_schema(schema_json):
    • Iterates exported JSON tables.
    • Creates entities for each table and its columns.
    • Generates embeddings for descriptions.
    • Links columns back to their table as related entities.
  • cluster_entities(threshold):
    • Cosine-similarity over embeddings to group similar entities.
  • generate_business_terms():
    • From clusters, extracts common terms to propose business terminology and related tables/columns.
  • understand_query(query):
    • Embeds the query and computes cosine similarity with all entity embeddings.
    • Returns top-matched entities and extracted key terms.
  • extract_meaningful_terms(text):
    • Simple NLP cleaning + stop-word removal to keep meaningful tokens.
  • save_analysis(path):
    • Persists entities, clusters, and generated business terms to JSON.
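The similarity machinery behind cluster_entities and understand_query reduces to cosine similarity over embedding vectors. A self-contained sketch of that core step (toy 2-D vectors here, not real Vertex AI embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_matches(query_vec, entity_vecs, k=3):
    """Rank entity names by similarity to the query embedding, highest first."""
    ranked = sorted(entity_vecs, key=lambda name: cosine(query_vec, entity_vecs[name]), reverse=True)
    return ranked[:k]
```

understand_query embeds the incoming question, then uses exactly this kind of ranking to pick the most relevant tables and columns for the prompt.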

How it’s used

  • During startup (API or CLI), schema JSON is analyzed to build a semantic map.
  • For a user query, understand_query returns relevant entities which are used to:
    • Pick relevant tables for prompts
    • Provide schema context to the LLM for SQL generation

Possible future improvements:

  • A method to load entities back from a saved analysis to avoid recomputation.
  • A filter to restrict relevance to a specific table subset at runtime.

Purpose of llm_manager.py

  • Abstraction over the LLM (Claude Sonnet) used for:
    • Generating Oracle SQL from a prompt
    • Analyzing a natural-language query into a structured JSON (intent, entities, window/analytic functions, CTEs, complexity)
    • Providing SQL optimization suggestions

What it initializes/handles

  • Model setup and configuration (API keys via env).
  • Safe call wrapper: retries, basic rate limiting, error logging.
  • Prompt templates optimized for Oracle:
    • Encourages ANSI joins, CTEs, window/analytic functions, advanced features (MODEL, MATCH_RECOGNIZE, PIVOT/UNPIVOT), Oracle-specific syntax.

Key methods

  • generate_sql(prompt: str) -> str
    • Sends a carefully engineered prompt to Claude Sonnet and returns only the SQL text.
  • analyze_query(query: str) -> Dict[str, Any]
    • Produces a structured JSON analysis (intent, entities, conditions, aggregations, window functions, analytic functions, advanced features, complexity/perf hints).
  • optimize_sql(sql: str) -> Dict[str, Any]
    • Asks the LLM for optimization guidance (indexes, CTE refactors, pushdowns, window function tuning).
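Because generate_sql should return only SQL text, the raw model reply usually needs post-filtering. A hedged sketch of one possible approach (regex extraction of a fenced block; illustrative, not the project's actual guardrail):

```python
import re

def extract_sql(reply: str) -> str:
    """Pull the SQL out of a model reply: prefer a fenced ```sql block, else return the text as-is."""
    match = re.search(r"```(?:sql)?\s*(.*?)```", reply, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return reply.strip()
```

This tolerates replies that wrap the query in prose or markdown fences while passing bare SQL through unchanged.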

How it’s used in the app

  • sql_generator.py: Builds a prompt from schema/context and calls llm_manager.generate_sql() to get the SQL.
  • local_cli.py: Directly builds a prompt from the exported JSON + SchemaAI and calls generate_sql() for offline generation.
  • Optionally, analyze_query() can be used before generation to enforce structure and better constraints.

Possible future improvements:

  • A strict "SQL-only" generation guardrail (regex post-filtering).
  • A "SQL + explanation" mode.
  • Temperature/top_p controls via CLI/API flags.

What learning_manager.py does

  • Learns from usage and feedback to improve future SQL generation.

Key capabilities:

  • Similar query search: stores past queries with embeddings; finds nearest matches for new queries.
  • Pattern recognition: extracts generalized templates from successful queries (e.g., common joins/filters) to guide prompt building.
  • Feedback incorporation: records user feedback (correct/incorrect, corrected SQL) and updates patterns/templates accordingly.
  • Uncertain queries tracking: logs low-confidence or error-prone cases for review.
  • Domain terms evolution: updates internal domain terminology stats (usage counts, confidence) from observed queries and feedback.
  • Optimization hints: analyzes execution outcomes/metadata to suggest improvements (e.g., indexing opportunities, join order, pushdowns).
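Pattern recognition of the kind described above can be approximated by normalizing literals in successful queries into placeholders, so recurring query shapes collapse into one reusable template. A simplified sketch (the project's actual template logic may differ):

```python
import re

def generalize_sql(sql: str) -> str:
    """Replace string and numeric literals with placeholders to form a reusable template."""
    template = re.sub(r"'[^']*'", ":str", sql)                # string literals -> :str
    template = re.sub(r"\b\d+(\.\d+)?\b", ":num", template)   # numeric literals -> :num
    return template

def update_template_counts(counts: dict, sql: str) -> dict:
    """Track how often each generalized template has succeeded."""
    tpl = generalize_sql(sql)
    counts[tpl] = counts.get(tpl, 0) + 1
    return counts
```

Templates with high success counts are good candidates to feed back into prompts as few-shot examples.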

Typical flow:

  1. After generation/execution, you call analyze_feedback(...) with the outcome.
  2. It updates:
    • Successful pattern store
    • Query templates
    • Domain term stats
    • History/embeddings for similarity search
  3. Next queries benefit from:
    • Better similar examples
    • Stronger templates
    • More accurate term mapping
    • Optimization suggestions

Outputs used by others:

  • sql_generator.py fetches similar queries/templates to enrich prompts.
  • API endpoints expose uncertain queries, templates, and domain terms for monitoring.
