Automated ETL pipeline for extracting data from Microsoft Dynamics 365 Dataverse and constructing knowledge graphs in Neo4j. Streamlines enterprise data integration with async processing, OAuth 2.0 authentication, and relationship mapping

awong789/dynamics365-knowledge-graph

D365 Knowledge Graph Construction

Transform your Dynamics 365 CRM into a knowledge graph for advanced analytics and relationship discovery

🎯 Problem

Because of how query languages work, LLMs are limited in their ability to extract meaning and relationships from traditional record-based systems. Knowledge graphs have been reported to improve accuracy and reduce hallucination on multi-part questions that require connecting information across systems.

https://neo4j.com/blog/genai/knowledge-graph-llm-multi-hop-reasoning/

This automated ETL pipeline transforms D365 data into a Neo4j knowledge graph, enabling graph analytics, relationship discovery, and AI-powered insights that are difficult to obtain with traditional relational queries.

Preview functionality

🎯 Project Goals

This project builds a comprehensive knowledge graph from Microsoft Dynamics 365 data using Neo4j Aura DB. The system extracts, transforms, and loads D365 entities and their relationships into a graph database, enabling advanced analytics, relationship discovery, and intelligent querying capabilities.

Primary Objectives

  • Data Integration: Extract data from Dynamics 365 Dataverse
  • Graph Transformation: Convert relational D365 data into a property graph model
  • Relationship Preservation: Maintain all existing D365 relationships in the knowledge graph
  • Meaning Extraction from Text: Extract relationships from the text of notes, emails, and descriptions, and link them to the relevant entities
  • Real-time Synchronization: Keep the knowledge graph updated with D365 changes
  • Scalable Architecture: Build a robust, maintainable, and scalable ETL pipeline

🚀 Core Functionality

1. Data Extraction ✅ COMPLETE

  • Async D365 Client: High-performance async client with MSAL authentication
  • Rate Limiting: Automatic rate limiting (6000 requests/minute) with backoff
  • Multi-Entity Extractors: 11 specialized extractors, one for each supported D365 entity
  • File-Based Storage: Parquet compression with JSON fallback
  • Pagination Handling: Microsoft-recommended @odata.nextLink approach
  • Batch Processing: Configurable batch sizes for optimal performance
  • Quality Scoring: Entity-specific validation and data quality metrics
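
The @odata.nextLink pagination listed above can be sketched as follows. This is a minimal illustration, not the project's extractor: `iter_records` and the `fetch_page` callable are hypothetical names, and `fetch_page` stands in for whatever authenticated HTTP GET the real client performs.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical type: a callable that performs one GET and returns the parsed JSON body.
FetchPage = Callable[[str], Dict[str, Any]]


def iter_records(first_url: str, fetch_page: FetchPage) -> Iterator[Dict[str, Any]]:
    """Yield every record, following @odata.nextLink until it is absent."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        # Dataverse returns the rows of a collection under the "value" key.
        yield from page.get("value", [])
        # The server supplies the next URL verbatim; use it as-is rather than rebuilding it.
        url = page.get("@odata.nextLink")
```

Injecting `fetch_page` keeps the paging loop independent of authentication and retry logic, so it can be exercised with a plain dictionary of canned pages.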

2. Graph Transformation ✅ COMPLETE

  • Schema Mapping: Automated D365 entity to Neo4j node mapping
  • Field Transformation: 50+ business rules for data standardization
  • Relationship Building: Intelligent relationship discovery and mapping
  • Data Validation: Comprehensive validation with quality scoring
  • Business Rules Engine: Entity-specific transformations and formatting
  • Multi-Entity Support: Unified transformation pipeline for all 11 entities
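
As a rough illustration of schema mapping and field transformation, the sketch below converts one D365 account record into a node description. The field map, the target property names, and the `normalize` rule are assumptions made for this example; the project's actual mappings and its 50+ business rules are not shown in this README.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping for illustration: D365 logical field name -> Neo4j property name.
ACCOUNT_FIELD_MAP: Dict[str, str] = {
    "accountid": "id",
    "name": "name",
    "emailaddress1": "email",
    "address1_city": "city",
}


def normalize(value: Any) -> Optional[Any]:
    """Example standardization rule: trim strings and turn empty strings into None."""
    if isinstance(value, str):
        value = value.strip()
        return value or None
    return value


def to_node(record: Dict[str, Any], field_map: Dict[str, str], label: str) -> Dict[str, Any]:
    """Convert one D365 record into a {label, properties} node description."""
    props: Dict[str, Any] = {}
    for source_field, target_prop in field_map.items():
        value = normalize(record.get(source_field))
        if value is not None:  # Neo4j does not store null property values
            props[target_prop] = value
    return {"label": label, "properties": props}
```

Fields that are not in the map are dropped, and missing or empty values never become properties, which keeps the resulting nodes compact.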

3. Neo4j Loading ✅ COMPLETE

  • Connection Management: Async Neo4j driver with connection pooling
  • Batch Loading: UNWIND-based batch operations for optimal performance
  • Node Creation: MERGE operations for idempotency (prevents duplicates)
  • Relationship Loading: Polymorphic relationship handling
  • File Support: Direct loading from Parquet, JSON, JSONL files
  • Index Management: Automatic constraint and index creation
  • Data Validation: Pre-load validation with quality checks
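
The UNWIND-based batching with MERGE described above might look like the following sketch. The helper names, the `id` merge key, and the `SET n += row` property update are assumptions for illustration. The example is synchronous for brevity, whereas the project states that it uses the async driver; `session` is any object exposing a neo4j-style `run(query, **parameters)` method.

```python
from typing import Any, Dict, Iterable, Iterator, List


def chunked(rows: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Split rows into lists of at most `size` items."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def merge_nodes_query(label: str, key: str = "id") -> str:
    """Build an idempotent upsert: one MERGE per row, all rows sent in one statement."""
    # Labels cannot be passed as Cypher parameters, so accept only plain identifiers.
    if not label.isidentifier() or not key.isidentifier():
        raise ValueError("label and key must be plain identifiers")
    return (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        f"SET n += row"
    )


def load_nodes(session: Any, label: str, rows: Iterable[Dict[str, Any]], batch_size: int = 500) -> int:
    """Send rows in batches and return how many rows were submitted."""
    query = merge_nodes_query(label)
    total = 0
    for batch in chunked(rows, batch_size):
        session.run(query, rows=batch)
        total += len(batch)
    return total
```

Because MERGE matches on the key before creating a node, re-running the same batch updates existing nodes instead of duplicating them.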

4. Supported D365 Entities (11 Total)

  • Core Business Entities:
    • Account (Companies/Organizations)
    • Contact (Individual Persons)
    • Lead (Sales Prospects)
  • Sales Transaction Entities:
    • Opportunity (Sales Deals)
    • Order (Sales Orders)
    • Invoice (Financial Invoices)
  • Communication Activity Entities:
    • Email (Email Communications)
    • PhoneCall (Phone Call Activities)
    • Appointment (Scheduled Meetings)
    • ActivityParty (Activity Participants - Senders/Recipients) ✨ NEW
  • Content Entities:
    • Note/Annotation (Notes, Attachments, Comments)
  • Extensibility: Framework supports custom D365 entities

5. Knowledge Graph Operations

  • Batch loading of historical data
  • Real-time incremental updates
  • Data quality validation
  • Relationship integrity checks
  • Graph optimization and indexing
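
One way to picture a relationship integrity check is to separate edges whose endpoints both exist from edges that point at a missing node before anything is loaded. The edge tuple shape and the `split_dangling` helper below are hypothetical and only illustrate the idea; the project's own checks may work differently.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical edge shape for illustration: (source_id, relationship_type, target_id).
Edge = Tuple[str, str, str]


def split_dangling(edges: Iterable[Edge], node_ids: Set[str]) -> Tuple[List[Edge], List[Edge]]:
    """Return (valid, dangling): dangling edges reference at least one unknown node."""
    valid: List[Edge] = []
    dangling: List[Edge] = []
    for edge in edges:
        source_id, _rel_type, target_id = edge
        if source_id in node_ids and target_id in node_ids:
            valid.append(edge)
        else:
            dangling.append(edge)
    return valid, dangling
```

Dangling edges can then be logged or retried after the missing entity has been extracted, rather than silently dropped.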

📁 Project Structure

D365KGConstruct/
│
├── README.md                 # Project overview and documentation
├── requirements.txt          # Python dependencies
├── .env.example             # Environment variables template
├── config.yaml              # Configuration settings
│
├── documents/               # Documentation folder
│   ├── architecture.md      # System architecture and design
│   ├── api-reference.md     # API documentation
│   └── deployment-guide.md  # Deployment instructions
│
├── src/                     # Source code
│   ├── __init__.py
│   ├── extractors/          # D365 data extraction modules
│   ├── transformers/        # Data transformation logic
│   ├── loaders/            # Neo4j loading modules
│   ├── models/             # Data models and schemas
│   ├── utils/              # Utility functions
│   └── orchestration/      # Pipeline orchestration
│
├── tasks/                   # Manual task tracking
│   └── TODO.md             # Task list and progress tracking
│
├── ai-dev-tasks/           # AI-assisted development tasks
│   ├── prompts/            # AI prompts for code generation
│   └── generated/          # AI-generated code artifacts
│
├── tests/                  # Test suite
│   ├── unit/              # Unit tests
│   ├── integration/       # Integration tests
│   └── fixtures/          # Test data and mocks
│
├── scripts/               # Utility scripts
│   ├── setup_neo4j.py    # Neo4j initialization script
│   ├── validate_graph.py # Graph validation utilities
│   └── run_etl.py       # Main ETL execution script
│
└── docker/               # Docker configuration
    ├── Dockerfile       # Container definition
    └── docker-compose.yml # Multi-container setup

🛠️ Technology Stack

  • Graph Database: Neo4j Aura DB
  • D365 Integration: Dataverse Web API / Azure SDK
  • Data Processing: Pandas, NumPy
  • Graph Driver: py2neo / neo4j-python-driver
  • Containerization: Docker
  • CI/CD: GitHub Actions / Azure DevOps

🚦 Quick Start

  1. Clone the Repository

    git clone <repository-url>
    cd D365KGConstruct
  2. Set Up Environment

    Windows Users (Recommended):

    # Option 1: Quick setup (use if starting fresh)
    reset_venv.bat
    venv\Scripts\activate
    pip install -r requirements.txt
    
    # Option 2: Manual setup
    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt

    If Virtual Environment is Broken:

    # This fixes "Unable to create process" errors
    reset_venv.bat
    venv\Scripts\activate
    pip install -r requirements-minimal.txt

    Linux/Mac Users:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

    Minimal Installation (Core Features Only):

    pip install -r requirements-minimal.txt
    python -m src.cli --help
    # Note: System works with minimal packages using JSON storage

    For Annotation Enrichment (Optional):

    # After activating venv
    install_enrichment.bat   # Windows
    # Or manually:
    pip install neo4j-graphrag openai beautifulsoup4 html2text
  3. Configure Credentials

    cp .env.example .env
    # Edit .env with your D365 and Neo4j credentials
  4. Verify Connectivity

    # Test all connections (D365 + Neo4j)
    python -m src.cli test-connection
    
    # Test specific connections
    python -m src.cli test-connection --service neo4j    # Neo4j AuraDB only
    python -m src.cli test-connection --service d365     # D365 Dataverse only
  5. Initialize Neo4j Schema

    python -m src.cli init
  6. Extract Data from D365 to Files

    # Multi-Entity Extraction (All 11 entities in one run)
    python -m src.cli extract --mode=full                # Extract ALL entities
    python -m src.cli extract --mode=incremental         # Incremental multi-entity
    
    # Single Entity Extraction
    python -m src.cli extract --entity=account --mode=full       # Accounts only
    python -m src.cli extract --entity=contact --mode=full       # Contacts only
    python -m src.cli extract --entity=lead --mode=full          # Leads only
    python -m src.cli extract --entity=opportunity --mode=full   # Opportunities only
    python -m src.cli extract --entity=salesorder --mode=full    # Orders only
    python -m src.cli extract --entity=invoice --mode=full       # Invoices only
    python -m src.cli extract --entity=email --mode=full         # Emails only
    python -m src.cli extract --entity=phonecall --mode=full     # Phone calls only
    python -m src.cli extract --entity=appointment --mode=full   # Appointments only
    python -m src.cli extract --entity=activityparty --mode=full # Activity participants only
    python -m src.cli extract --entity=annotation --mode=full    # Notes/attachments only
    
    # Custom output directory
    python -m src.cli extract --output-dir=custom/path --mode=full
    
    # Output structure for multi-entity extraction:
    # /output/extract/{run_id}/
    # ├── account/batch_001.parquet
    # ├── contact/batch_001.parquet
    # ├── lead/batch_001.parquet
    # ├── opportunity/batch_001.parquet
    # ├── salesorder/batch_001.parquet
    # ├── invoice/batch_001.parquet
    # ├── email/batch_001.parquet
    # ├── phonecall/batch_001.parquet
    # ├── appointment/batch_001.parquet
    # ├── activityparty/batch_001.parquet  # Links activities to contacts
    # └── annotation/batch_001.parquet
  7. Load Extracted Data to Neo4j

    # Load all entities from extraction directory
    python -m src.cli load --source output/extract/{run_id}
    
    # Load specific entity only
    python -m src.cli load --source output/extract/{run_id} --entity=account
    
    # Load annotations WITH entity extraction from text (using LLMs)
    python -m src.cli load --source output/extract/{run_id} --entity=annotation --enrich
    
    # Test annotation enrichment with sample (cost-effective)
    python -m src.cli load --source output/extract/{run_id} --entity=annotation --enrich --enrich-sample=10
    
    # Clear existing graph before loading (careful!)
    python -m src.cli load --source output/extract/{run_id} --clear-first
    
    # Custom batch size for loading
    python -m src.cli load --source output/extract/{run_id} --batch-size=500
    
    # Create indexes and constraints only (no data loading)
    python -m src.cli load --source output/extract/{run_id} --indexes-only
    
    # Supported file formats: Parquet, JSON, JSONL
    # The loader automatically detects and processes all formats
  8. Run Complete ETL Pipeline

    # Full pipeline: Extract → Transform → Load
    python -m src.cli run --mode=full         # Complete ETL (initial load)
    python -m src.cli run --mode=incremental  # Incremental ETL pipeline
    python -m src.cli run --dry-run           # Dry run without making changes
    
    # Pipeline with options
    python -m src.cli run --mode=full --clear-graph      # Clear graph before loading
    python -m src.cli run --mode=full --entity=account   # Single entity pipeline
    
    # Extract annotations AND enrich with LLM entity extraction
    python -m src.cli run --mode=full --entity=annotation --enrich
    
    # Test annotation enrichment with sample (10 annotations)
    python -m src.cli run --mode=full --entity=annotation --enrich --enrich-sample=10
    
    # Skip specific phases
    python -m src.cli run --skip-extract      # Use existing extracted files
    python -m src.cli run --skip-transform    # Skip transformation phase
    python -m src.cli run --skip-load         # Extract and transform only
  9. Activity Text Enrichment with LLM ✨ NEW

    # The --enrich flag enables LLM-based entity extraction from activity text
    # Works on: Annotations (notetext), PhoneCalls (description), Appointments (description)
    # Extracts: Entities (Person, Company, Product, etc.) and relationships (WORKS_FOR, FOUNDED_BY, etc.)
    
    # Enrich ALL activities (annotations + phonecalls + appointments)
    python -m src.cli load --source output/extract/{run_id} --enrich
    
    # Enrich specific activity type only
    python -m src.cli load --source output/extract/{run_id} --entity=phonecall --enrich
    python -m src.cli load --source output/extract/{run_id} --entity=appointment --enrich
    python -m src.cli load --source output/extract/{run_id} --entity=annotation --enrich
    
    # Test with sample before full processing (cost-effective, 10 records per entity type)
    python -m src.cli load --source output/extract/{run_id} --enrich --enrich-sample=10
    
    # Full ETL with enrichment in one command
    python -m src.cli run --mode=full --enrich
    
    # Requirements:
    # - Set OPENAI_API_KEY environment variable
    # - Install: pip install neo4j-graphrag openai beautifulsoup4 html2text
    # - Configure: config/annotation_kg_schema.yaml (optional)
    
    # Output (written directly to Neo4j by SimpleKGPipeline):
    # - :Company, :Person, :Product, :Technology, :Location nodes (LLM-extracted)
    # - Relationships: WORKS_FOR, FOUNDED_BY, PARTNERED_WITH, etc.
    # - Super-labels automatically applied: :BusinessEntity, :PersonEntity
    
    # Example: Enrich phone call that mentions "Ivan Komashinsky purchased My Course (ABC111)"
    # Creates: Person node "Ivan Komashinsky", Product node "My Course (ABC111)"
    # Relationship: (Ivan)-[:PURCHASED]->(My Course)
  10. Apply Ontology Super-Labels ✨ NEW

    # Apply BusinessEntity and PersonEntity super-labels to existing nodes
    # This enables querying across entity type synonyms (Company/Account, Person/Contact)
    python -m src.cli apply-ontology
    
    # What it does:
    # - Adds :BusinessEntity label to both :Account (D365) and :Company (LLM) nodes
    # - Adds :PersonEntity label to both :Contact (D365) and :Person (LLM) nodes
    # - Sets source property to track data origin (D365 vs LLM)
    
    # Example queries after applying ontology:
    # Find all business entities (both D365 Accounts and LLM-extracted Companies)
    MATCH (n:BusinessEntity {name: "Coho Winery"})
    OPTIONAL MATCH (n)-[r]-(related) RETURN n, r, related
    
    # Find all person entities (both D365 Contacts and LLM-extracted Persons)
    MATCH (n:PersonEntity) RETURN n.name, n.source LIMIT 10
    
    # See docs/ontology_and_entity_resolution.md for full documentation
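
To make the super-label step more concrete, the sketch below generates Cypher statements matching the label pairs described for `apply-ontology`. It is an assumption about how such statements could be written, not the command's actual implementation; the `'D365'` and `'LLM'` source values and the use of `coalesce` to keep an existing `source` are illustrative choices.

```python
from typing import Dict, List, Tuple

# Mapping taken from the description of apply-ontology: label -> (super-label, data origin).
SUPER_LABELS: Dict[str, Tuple[str, str]] = {
    "Account": ("BusinessEntity", "D365"),
    "Company": ("BusinessEntity", "LLM"),
    "Contact": ("PersonEntity", "D365"),
    "Person": ("PersonEntity", "LLM"),
}


def super_label_statements(mapping: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return one Cypher statement per label that adds the super-label and a source value."""
    statements: List[str] = []
    for label, (super_label, origin) in mapping.items():
        statements.append(
            f"MATCH (n:{label}) "
            # coalesce keeps an already-recorded source instead of overwriting it
            f"SET n:{super_label}, n.source = coalesce(n.source, '{origin}')"
        )
    return statements
```

Adding a label that a node already has is a no-op in Neo4j, so statements of this form can be re-run safely after each load.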

QUICK RUN

python -m src.cli init
python -m src.cli extract --mode=full   
python -m src.cli load --source .\output\extract\full_2025-12-18_180701\ --enrich --clear-first

📈 Use Cases

  1. 360-Degree Customer View: Visualize all customer interactions and relationships
  2. Sales Intelligence: Discover hidden patterns in sales data
  3. Relationship Analytics: Analyze complex business relationships
  4. Impact Analysis: Understand cascading effects of business changes
  5. Recommendation Engine: Build AI-powered recommendations based on graph patterns
  6. Fraud Detection: Identify suspicious patterns through graph algorithms
  7. Master Data Management: Maintain a single source of truth for entity relationships

🤝 Contributing

Please refer to CONTRIBUTING.md for guidelines on how to contribute to this project.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

For questions, issues, or suggestions:

  • Create an issue in the GitHub repository
  • Contact the development team at [team-email]
  • Refer to the documentation for detailed guides

🏁 Project Status

Current Version: 0.3.0 (Beta) - Phase 3 Complete

✅ Phase 1: Foundation (100% Complete)

  • Project structure setup
  • Architecture design
  • Neo4j Aura DB connectivity
  • D365 OAuth authentication setup

✅ Phase 2: Core Development (100% Complete)

  • Data Extraction Module
    • Async D365 client with MSAL authentication
    • Multi-entity extraction (11 D365 entities including ActivityParty)
    • File-based storage (Parquet/JSON)
    • Pagination handling (@odata.nextLink)
    • Rate limiting and error handling
    • Entity name pluralization with English grammar rules
  • Transformation Module
    • Schema mapping (D365 → Neo4j)
    • Field transformation and validation
    • Relationship building
    • Business rules engine (50+ rules)
    • Data quality scoring

✅ Phase 3: Loading Module (100% Complete)

  • Neo4j connection manager with async driver
  • Node creation logic with MERGE operations
  • Relationship creation logic (polymorphic support)
  • Batch processing with UNWIND queries
  • Index and constraint management
  • File-based loading (Parquet/JSON/JSONL)
  • Data validation and quality checks
  • Complete CLI integration

📋 Upcoming Phases

  • Pipeline orchestration (Airflow/Prefect)
  • Enhanced incremental sync mechanism
  • Testing suite completion
  • Performance optimization
  • Production deployment
  • Relationship loading enhancement
  • Advanced graph analytics

Overall Progress: 90% Complete

🔧 Recent Updates

December 19, 2025 - Multi-Entity Enrichment & Ontology

  • Extended: LLM enrichment now supports PhoneCall and Appointment entities (not just Annotations)
  • Added: extract_from_activity() method for generic activity text enrichment
  • Enhanced: Single --enrich flag automatically processes all activity types
  • Added: apply-ontology CLI command for semantic entity alignment
  • Feature: Automatic super-label application (BusinessEntity, PersonEntity)
  • Fixed: Critical bug in super-label application (wrong query execution mode)
  • Enhanced: execute_write_query now supports returning data with return_data parameter
  • Result: Extract entities from phone call descriptions, appointment notes, and annotations
  • Result: Query across entity type synonyms (Company/Account, Person/Contact)
  • Documentation: Complete ontology guide in docs/ontology_and_entity_resolution.md

December 11, 2025 - ActivityParty Extraction Fix

  • Fixed: Entity name pluralization bug (activityparty → activityparties)
  • Fixed: Invalid field configuration for ActivityParty entity
  • Fixed: Null handling in ActivityParty transformation
  • Added: 11th entity support (ActivityParty) for activity-to-participant relationships
  • Result: Successfully extracts sender/recipient/organizer relationships for emails, calls, and appointments

Last Updated: December 19, 2025
