Document Parsing API & SDKs

Product Page | Docs | Demos | API | Blog | Search | Support | Temp License

GroupDocs.Parser is a document parsing and data extraction API. It extracts text, metadata, barcodes, structured fields, images, tables, and document entities from PDFs, Office files, emails, eBooks, and archives. It is built for search indexing, compliance, data capture, and content ingestion workflows.

📰 Latest Parser News & Updates

  • See the latest release notes on NuGet and Maven Central for parser engine improvements, faster template-based extraction, and better table detection.
  • Updated sample apps show invoice data extraction, email parsing, and PDF text extraction scenarios.
  • New how-tos on templated parsing and container file processing in the documentation.

📂 Supported Platforms & Repository Groups

🌐 .NET Document Parsing (C#, ASP.NET, WinForms)

High-performance APIs for document parsing on .NET Framework and .NET Core.

  • GroupDocs.Parser-for-.NET: Core C# API for text, metadata, tables, and template-based extraction.
  • Samples & Demos: Explore runnable examples in the repository to parse PDFs, DOCX, XLSX, PPTX, MSG/EML, EPUB, ZIP, and more.
// Quick .NET Parsing Example
using (var parser = new GroupDocs.Parser.Parser("invoice.pdf"))
{
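    // Extract plain text from the document.
    // GetText() returns null when text extraction isn't supported for the format.
    using (var reader = parser.GetText())
    {
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }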
}

☕ Java Document Parsing (Maven, Spring)

Native Java library for text, metadata, and structured data extraction.

// Quick Java Parsing Example
// try-with-resources closes both the parser and the reader
try (com.groupdocs.parser.Parser parser = new com.groupdocs.parser.Parser("contract.docx");
     java.io.Reader reader = parser.getText()) {
    // getText() returns null when text extraction isn't supported for the format
    if (reader != null) {
        char[] buffer = new char[2048];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            System.out.print(new String(buffer, 0, read));
        }
    }
}

🐍 Python Document Parsing (Python via .NET)

Cross-platform Python bindings for text, metadata, and structured data extraction.

# Quick Python Parsing Example
from groupdocs.parser import Parser

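with Parser("sample.pdf") as parser:
    # get_text() returns a reader, or None when text extraction isn't supported
    reader = parser.get_text()
    print(reader.read_to_end() if reader else "Text extraction isn't supported")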

🧠 Business Use-Cases

  • Invoice & receipt data extraction: pull totals, dates, vendors, and line items via templates.
  • Email & attachment parsing: extract headers, bodies, attachments, and metadata from MSG/EML.
  • Contract analysis: capture clauses, signatures, and key fields from DOCX/PDF.
  • PDF table extraction: pull line items and financial tables from PDFs (see table extraction sample).
  • Content migration: normalize mixed file types into structured outputs.
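To make the invoice use case concrete, here is a minimal sketch of post-processing text that has already been extracted from a document. It uses only Python's standard library and plain regular expressions; it is not the GroupDocs.Parser template engine. The function name `extract_invoice_fields`, the three patterns, and the field labels ("Vendor", "Invoice Date", "Total Due") are all hypothetical and would need adapting to real invoices.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns; real invoice layouts vary, so treat these as placeholders.
VENDOR_RE = re.compile(r"^Vendor\s*[:\-]\s*(.+)$", re.IGNORECASE | re.MULTILINE)
DATE_RE = re.compile(r"(?:Invoice\s+)?Date\s*[:\-]?\s*(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE)
# \b keeps "Subtotal" from matching as "Total".
TOTAL_RE = re.compile(r"\bTotal(?:\s+Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def extract_invoice_fields(text: str) -> dict:
    """Pull vendor, date, and total out of extracted invoice text.

    Fields that cannot be found (or parsed) are left as None.
    """
    fields = {"vendor": None, "date": None, "total": None}

    vendor = VENDOR_RE.search(text)
    if vendor:
        fields["vendor"] = vendor.group(1).strip()

    found_date = DATE_RE.search(text)
    if found_date:
        year, month, day = (int(part) for part in found_date.groups())
        try:
            fields["date"] = date(year, month, day)
        except ValueError:
            pass  # e.g. month 13; leave the field empty rather than guess

    total = TOTAL_RE.search(text)
    if total:
        # Decimal avoids float rounding on currency amounts.
        fields["total"] = Decimal(total.group(1).replace(",", ""))

    return fields
```

The input would typically be the string read from the text reader in the parsing examples above. Regexes like these are brittle across vendors, which is the gap that template-based extraction is meant to close.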

✅ API Key Features & Benefits

  • High-fidelity text extraction for PDF, DOC/DOCX, XLS/XLSX, PPT/PPTX, HTML, RTF, TXT, EPUB.
  • Template-based extraction to capture labeled fields, tables, and repeating blocks reliably.
  • Table recognition with cell-by-cell extraction for spreadsheets and tabular PDFs.
  • Metadata parsing (built-in and custom) for compliance and governance.
  • Container support for ZIP, OST/PST, MSG/EML, and attachments within archived files.
  • Image & embedded object extraction for logos, signatures, and inline graphics.
  • Page-level & area-limited parsing to target specific regions for faster processing.
  • Performance & scaling tuned for server-side, multi-document workloads.
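The container-support idea above can be illustrated with a short sketch of how an archive might be walked before its entries are handed to a parser. It relies on Python's standard `zipfile` module rather than the GroupDocs.Parser container API, which is not shown here. The `PARSEABLE` extension set, the `iter_parseable_entries` helper, and the `!/` separator for nested paths are assumptions made for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical allow-list of extensions worth handing to a document parser.
PARSEABLE = {".pdf", ".docx", ".xlsx", ".pptx", ".msg", ".eml", ".epub"}


def iter_parseable_entries(archive, max_depth=2, _prefix=""):
    """Yield (path, content_bytes) for parseable files inside a ZIP archive.

    `archive` is a filesystem path or a binary file-like object. Nested ZIP
    files are opened in memory, up to `max_depth` levels deep.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = _prefix + info.filename
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix == ".zip" and max_depth > 0:
                # Recurse into the nested archive without touching the disk.
                nested = io.BytesIO(zf.read(info))
                yield from iter_parseable_entries(nested, max_depth - 1, path + "!/")
            elif suffix in PARSEABLE:
                yield path, zf.read(info)
```

Each yielded byte string can be wrapped in `io.BytesIO` or written to a temporary file and passed to whichever parser handles that format. The depth limit is a simple guard against deeply nested or maliciously crafted archives.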

🆘 Technical Support & Resources

🏷️ Tags

groupdocs-parser document-parser pdf-parser text-extraction data-extraction metadata-parser email-parser invoice-parsing table-extraction template-based-parsing content-ingestion document-ai search-indexing enterprise-parsing