1© Cloudera, Inc. All rights reserved.
Bringing Trust and Visibility to Apache Hadoop
Mark Donsky, Product Management, Cloudera
Chang She, Software Engineering, Cloudera
The benefits of Hadoop...
One place for unlimited data
• All types
• More sources
• Faster, larger ingestion
Unified, multi-framework data access
• More users
• More tools
• Faster changes
…Cause trust, visibility, and governance challenges
Business Users
• How do I find what’s relevant?
• Can I trust what I find?
• How can I explore data on my own?
Information Security
• Who’s accessing what data?
• What are they doing with the data?
• Is sensitive data governed and protected?
• Can I meet compliance needs?
Database Admins
• How is data being used today?
• How can I optimize for future workloads?
• How can I take advantage of Hadoop risk-free and fast?
Building blocks of governance in Hadoop
• Audit Logs
• Lineage
• Data Policies
• Technical Metadata
• Business Metadata
Metadata
Enterprise metadata
The foundation for governance
Metadata enables you to put context and meaning to data
Operational
• Job Run-Time Stats
• Report Run Information
• Hardware Usage
• Scheduler Stats
Technical
• Database Schema
• File Definition
• ETL Job Design
• BI Report Definition
• Data Model
Business
• Business Glossary
• Enterprise Taxonomy
• Ontology
Data Lineage
Impact Analysis
Topology Understanding
Data Governance
Compliance Audits
Enterprise metadata
The foundation for governance
Metadata enables you to put context and meaning to data to
answer the important questions
Business Technical Operational
Unified Metadata Repository
What data or information exists?
Where is data being used?
What is the data’s business definition?
Who is responsible for the data?
How is it inter-related to other data?
Who is using the data?
Why do we need this data?
Can we trust this data?
When was this data last updated?
Who are the high-value customers?
How do we define that?
How is high value calculated?
Where is customer data stored and used?
Is the data reliable and accurate?
Technical metadata – what’s available?
Hive
Query Text
Table name
Column name
Data Type
Owner
Partitions
Pig
Script name
Owner
Creation date
Last modified date
HDFS
Permissions
Owner
Group
Creation date
Last modified date
MR/YARN
JobID
Mapper Class
Reducer Class
Inputs
Outputs
Technical metadata – where can I find it?
Component    Metadata
HDFS         fsimage (ls -lRa /)
Hive         Hive Metastore Server (database metadata tables)
MapReduce    JobTracker
YARN         Job History Server
Oozie        Oozie Server
Pig          JobTracker, Job History Server
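As a rough illustration of the HDFS row, a recursive listing can be turned into structured records. The sketch below assumes the eight-field layout printed by a recursive HDFS listing (permissions, replication, owner, group, size, date, time, path). The sample lines and the `parse_ls_line` helper are made up for illustration and are not part of the talk.

```python
from typing import NamedTuple, Optional


class HdfsEntry(NamedTuple):
    permissions: str
    owner: str
    group: str
    size: int
    modified: str
    path: str


def parse_ls_line(line: str) -> Optional[HdfsEntry]:
    """Parse one line of a recursive HDFS listing; return None for non-entry lines."""
    # Split on whitespace at most 7 times so paths containing spaces stay intact.
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0][0] not in "-dl":
        return None
    perms, _replication, owner, group, size, date, time, path = parts
    return HdfsEntry(perms, owner, group, int(size), f"{date} {time}", path)


# Hypothetical sample output, not taken from a real cluster.
SAMPLE = """\
Found 2 items
drwxr-xr-x   - hdfs     supergroup          0 2014-04-26 20:34 /user
-rw-r--r--   3 training supergroup       1024 2014-04-25 09:12 /user/training/sales data.csv
"""

entries = [e for e in map(parse_ls_line, SAMPLE.splitlines()) if e is not None]
```

Each record then carries the permissions, owner, group, and modification time that the earlier slide lists as HDFS technical metadata.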
Technical metadata – Hive metastore
Collection of structured tables containing technical
metadata about Hive databases, tables, views, and columns
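To make "structured tables" concrete, here is a minimal sketch that joins a few metastore-style tables to describe one Hive table. It runs against an in-memory SQLite database. The table and column names (`DBS`, `TBLS`, `SDS`, `COLUMNS_V2`) are a simplified stand-in and should be treated as assumptions, since the real metastore schema is much larger and varies by version. The sample columns are invented.

```python
import sqlite3

# Simplified stand-in for a few metastore tables (assumed names and columns;
# the real schema has many more tables and fields).
SCHEMA = """
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, SD_ID INTEGER,
                   TBL_NAME TEXT, OWNER TEXT);
CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT,
                         INTEGER_IDX INTEGER);
"""

DESCRIBE_SQL = """
SELECT t.OWNER, s.LOCATION, c.COLUMN_NAME, c.TYPE_NAME
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE d.NAME = ? AND t.TBL_NAME = ?
ORDER BY c.INTEGER_IDX
"""


def describe_table(conn, db_name, table_name):
    """Return (owner, location, column, type) rows for one table."""
    return conn.execute(DESCRIBE_SQL, (db_name, table_name)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical contents for the salesdata table.
conn.execute("INSERT INTO DBS VALUES (1, 'default')")
conn.execute("INSERT INTO SDS VALUES (10, 100, '/user/hive/warehouse/salesdata')")
conn.execute("INSERT INTO TBLS VALUES (5, 1, 10, 'salesdata', 'admin')")
conn.executemany("INSERT INTO COLUMNS_V2 VALUES (?, ?, ?, ?)",
                 [(100, "id", "int", 0), (100, "amount", "double", 1)])

rows = describe_table(conn, "default", "salesdata")
```

The same join pattern is how a governance tool could recover a table's owner, storage location, and column types without running a Hive query.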
Technical metadata – HCatalog
• HCatalog uses the Hive Metastore to provide a management layer
• Abstracts the file location and storage format
• Makes formats available to Pig, Hive, MapReduce, etc.
• Also accessible via REST API
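For the REST access mentioned in the last bullet, the sketch below only builds a request URL and does not contact a server. It assumes the WebHCat (Templeton) conventions of a `/templeton/v1` base path, a `ddl/database/<db>/table/<table>` resource, a `user.name` query parameter, and port 50111. The host name is a placeholder, and the details should be checked against the deployed version.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host: str, database: str, table: str, user: str,
                      port: int = 50111) -> str:
    """Build a URL that asks the HCatalog REST interface to describe one table."""
    # Assumed resource layout: /templeton/v1/ddl/database/<db>/table/<table>
    path = (f"/templeton/v1/ddl/database/{quote(database, safe='')}"
            f"/table/{quote(table, safe='')}")
    return f"http://{host}:{port}{path}?{urlencode({'user.name': user})}"


# Placeholder host; database and table names reuse the salesdata example.
url = webhcat_table_url("webhcat.example.com", "default", "salesdata", "admin")
```

A GET on such a URL would be expected to return the table description as JSON, which is what makes the same metadata reachable from tools outside Pig, Hive, and MapReduce.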
Business metadata – can we do this in Hadoop?
• Custom metadata is vital for trust and visibility
• Find all files associated with a particular clinical trial
• Locate all statements for high-profile customers
• Where is my sensitive data?
• Where is the protected health information?
• No - Hadoop doesn’t support business metadata
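To show what business metadata would need to provide, here is a toy in-memory catalog that attaches tags and key/value properties to paths and answers the questions above. It is a sketch of the idea only, not how any Hadoop component or Cloudera Navigator works. The class name, paths, and property names are all hypothetical.

```python
from collections import defaultdict


class BusinessCatalog:
    """Toy store that attaches tags and key/value properties to paths."""

    def __init__(self):
        self._tags = defaultdict(set)
        self._props = defaultdict(dict)

    def annotate(self, path, tags=(), **props):
        self._tags[path].update(tags)
        self._props[path].update(props)

    def find(self, tag=None, **props):
        """Return sorted paths carrying the tag (if given) and all given properties."""
        hits = []
        for path in set(self._tags) | set(self._props):
            if tag is not None and tag not in self._tags.get(path, ()):
                continue
            stored = self._props.get(path, {})
            if all(stored.get(key) == value for key, value in props.items()):
                hits.append(path)
        return sorted(hits)


catalog = BusinessCatalog()
# Hypothetical files and annotations.
catalog.annotate("/data/trials/ct-042/results.csv",
                 tags={"phi", "sensitive"}, clinical_trial="CT-042")
catalog.annotate("/data/trials/ct-042/sites.csv", clinical_trial="CT-042")
catalog.annotate("/data/statements/acme.pdf",
                 tags={"sensitive"}, customer_tier="high-profile")

trial_files = catalog.find(clinical_trial="CT-042")
sensitive_files = catalog.find(tag="sensitive")
phi_files = catalog.find(tag="phi")
```

The point of the sketch is that these lookups depend on annotations supplied by people or classifiers, which is exactly the information the technical metadata stores do not hold.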
Hadoop Auditing
Hadoop audit logs – what do they look like?
• Logs all file system access requests
• Impala, HBase and other components use a similar format
• Implemented in log4j at the INFO level
HDFS Audit Log
{ "allowed": true,
  "serviceName": "HDFS-1",
  "username": "training",
  "src": "/user",
  "eventTime": 1398544478141,
  "ipAddress": "10.20.187.39",
  "operation": "getfileinfo",
  "dest": null,
  "permissions": null,
  "impersonator": null,
  "delegationTokenId": null
}
Hive Audit Log
{ "serviceName": "HIVE-1",
  "username": "admin",
  "impersonator": null,
  "ipAddress": "10.20.187.39",
  "operation": "QUERY",
  "eventTime": 1398402718797,
  "operationText": "select count(*) from salesdata",
  "allowed": true,
  "databaseName": "default",
  "tableName": "salesdata",
  "resourcePath": "/user/hive/warehouse/salesdata",
  "objectType": "TABLE"
}
HDFS Property: log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit
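Because the events are JSON, they are straightforward to load for analysis. The sketch below parses the two sample events from this slide and attaches a readable timestamp. It assumes `eventTime` is milliseconds since the Unix epoch, which the 13-digit values suggest but the slide does not state. `parse_event` is an illustrative helper name.

```python
import json
from datetime import datetime, timezone

HDFS_EVENT = """
{ "allowed": true, "serviceName": "HDFS-1", "username": "training",
  "src": "/user", "eventTime": 1398544478141, "ipAddress": "10.20.187.39",
  "operation": "getfileinfo", "dest": null, "permissions": null,
  "impersonator": null, "delegationTokenId": null }
"""

HIVE_EVENT = """
{ "serviceName": "HIVE-1", "username": "admin", "impersonator": null,
  "ipAddress": "10.20.187.39", "operation": "QUERY",
  "eventTime": 1398402718797,
  "operationText": "select count(*) from salesdata", "allowed": true,
  "databaseName": "default", "tableName": "salesdata",
  "resourcePath": "/user/hive/warehouse/salesdata", "objectType": "TABLE" }
"""


def parse_event(raw: str) -> dict:
    """Load one audit event and add a timezone-aware 'when' field."""
    event = json.loads(raw)
    # Assumption: eventTime is milliseconds since the Unix epoch.
    event["when"] = datetime.fromtimestamp(event["eventTime"] / 1000,
                                           tz=timezone.utc)
    return event


hdfs_event = parse_event(HDFS_EVENT)
hive_event = parse_event(HIVE_EVENT)
```

Note that the two services share fields such as `username`, `ipAddress`, `operation`, and `allowed`, while resource fields differ (`src` for HDFS, `tableName` and `resourcePath` for Hive).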
Hadoop audit logs – where can I find them?
Component            Default Location (CDH)
HDFS Audit Logs      /var/log/hadoop-hdfs/audit
Hive Audit Logs      /var/log/hive/audit
Impala Audit Logs    /var/log/impalad/audit
HBase Audit Logs     /var/log/hbase/audit
• Log files are automatically rotated when a size limit is reached
• Location and size limit are configurable
Hadoop audit logs – limitations
• Consolidation
• Persistence
• Filtering
• Integration
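Two of these gaps, consolidation and filtering, can be sketched in a few lines. The example assumes each service's log has already been parsed into dictionaries ordered by `eventTime`. The abbreviated events and the helper names `consolidate` and `select` are hypothetical.

```python
import heapq
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]


def consolidate(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge per-service streams (each already time-ordered) into one timeline."""
    return heapq.merge(*streams, key=lambda event: event["eventTime"])


def select(events: Iterable[Event],
           predicate: Callable[[Event], bool]) -> List[Event]:
    """Keep only the events that satisfy the predicate."""
    return [event for event in events if predicate(event)]


# Hypothetical, abbreviated events; real ones carry more fields.
hdfs_log = [
    {"serviceName": "HDFS-1", "username": "training", "eventTime": 100, "allowed": True},
    {"serviceName": "HDFS-1", "username": "admin", "eventTime": 300, "allowed": False},
]
hive_log = [
    {"serviceName": "HIVE-1", "username": "admin", "eventTime": 200, "allowed": True},
]

timeline = list(consolidate(hdfs_log, hive_log))
denied = select(timeline, lambda event: not event["allowed"])
by_admin = select(timeline, lambda event: event["username"] == "admin")
```

Persistence and integration are not addressed here: the merged timeline lives only in memory, and rotated log files would still need to be collected before they are overwritten.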
Lineage
Lineage – how to track lineage
• You can’t do this easily: without a tool like Cloudera Navigator, you need to track lineage manually
• But…lineage is embedded in Hadoop technical metadata
• Job configurations provide inputs/outputs
• Hive metastore provides location of HDFS directory where data resides
• Hive/Impala queries can be interpreted to provide fine-grained column-level lineage between query inputs and outputs
• Some relationships (e.g., directory–file) are implicit
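The idea that job inputs and outputs already encode lineage can be sketched as a small graph walk. The job records below are invented, and the approach is a generic illustration rather than a description of how Cloudera Navigator derives lineage.

```python
from collections import defaultdict

# Hypothetical job records: each lists the datasets it reads and writes.
JOBS = [
    {"job": "ingest_sales",
     "inputs": ["/landing/sales"],
     "outputs": ["/user/hive/warehouse/salesdata"]},
    {"job": "daily_rollup",
     "inputs": ["/user/hive/warehouse/salesdata"],
     "outputs": ["/reports/daily"]},
]


def build_upstream(jobs):
    """Map each output dataset to the set of datasets it was derived from."""
    upstream = defaultdict(set)
    for job in jobs:
        for output in job["outputs"]:
            upstream[output].update(job["inputs"])
    return upstream


def trace(dataset, upstream):
    """Return every dataset that feeds `dataset`, directly or transitively."""
    seen, stack = set(), [dataset]
    while stack:
        for parent in upstream.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


upstream = build_upstream(JOBS)
sources = trace("/reports/daily", upstream)
```

Reversing the edges gives impact analysis (what is affected downstream if a dataset changes). Column-level lineage would additionally require parsing the query text, which this sketch does not attempt.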
Data Policies
Data policies – Hadoop limitations
• Information is of limited use unless it is actionable
• There is a treasure trove of actionable information in the metadata that the various Hadoop services emit
• Archival of unused data
• Encryption of sensitive data
• Remediation of incorrect permissions
• Triggers should be configurable based on user-defined criteria
• Hadoop does not offer a sufficient policy engine or action framework
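A policy engine of the kind described pairs user-defined criteria with actions. The sketch below evaluates three such policies, mirroring the archival, encryption, and permission examples above, against metadata records. The field names, thresholds, and action labels are all assumptions made for illustration, and the actions are only reported, not executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Entity = Dict[str, object]


@dataclass
class Policy:
    name: str
    matches: Callable[[Entity], bool]  # user-defined trigger criteria
    action: str                        # label of the action to take


def evaluate(entities: List[Entity],
             policies: List[Policy]) -> List[Tuple[str, str, str]]:
    """Return (path, policy name, action) for every policy an entity triggers."""
    return [(entity["path"], policy.name, policy.action)
            for entity in entities
            for policy in policies
            if policy.matches(entity)]


# Hypothetical policies and thresholds.
POLICIES = [
    Policy("unused-data", lambda e: e["days_since_access"] > 365, "archive"),
    Policy("sensitive-unencrypted",
           lambda e: e["sensitive"] and not e["encrypted"], "encrypt"),
    Policy("world-writable", lambda e: e["permissions"] == "rwxrwxrwx",
           "fix-permissions"),
]

# Hypothetical metadata records.
ENTITIES = [
    {"path": "/data/old/logs", "days_since_access": 400, "sensitive": False,
     "encrypted": False, "permissions": "rwxr-xr-x"},
    {"path": "/data/phi/records", "days_since_access": 2, "sensitive": True,
     "encrypted": False, "permissions": "rwxrwxrwx"},
]

actions = evaluate(ENTITIES, POLICIES)
```

Keeping the trigger as a plain predicate is one way to make the criteria configurable. A production engine would also need scheduling, an audit trail of the actions taken, and connectors that actually perform them.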
Building blocks of trust and visibility in Hadoop
• Audit Logs
• Lineage
• Data Policies
• Technical Metadata
• Business Metadata
Cloudera Navigator
Overview & Demo
Cloudera Navigator
The only integrated data management and governance platform for Hadoop
Governance & Foundational Layer
• Business Metadata
• Technical Metadata
• Lineage
• Policies
• Audit Logs
Self-Service Discovery & Analytics
Data Scientists & BI Users
Effortlessly find and trust the data that matters most
• Search
• Data definitions
• Analytics
• Profiling
Usage-Driven Model Optimization
Hadoop Administrators & DBAs
Configure Hadoop to boost user productivity
• Migration
• Optimization
• Reporting
• Model maintenance
Compliance-Ready Governance & Protection
Information Security
Track, understand and protect access to sensitive data
• Auditing
• Lineage
• Encryption
• Key management
Active Data Management & Information Lifecycle Management
Data Stewards & Curators
Maximize cluster performance at Hadoop scale with ease
• Classification
• Stewardship
• Backup
• Retention
Trust and visibility is an ecosystem
Data
Systems
Enterprise Data Hub
Security and Administration
Unlimited Storage
Process Discover Model Serve
System Integration
Infrastructure
More than 1,600 partners ensure compatibility with existing investments, lower skill barriers, and help maximize value from your data.
Operational
Tools
Applications
Learn more!
Please stop by our booth at P13
• See a demo of Cloudera Enterprise, including our governance solution, which nearly 200 customers have used in production for over two years!
• Find out what makes Cloudera Enterprise the only PCI-certified Hadoop distribution
• Learn about our 1600+ partner ecosystem
Thank You!
@markdonsky
@changhiskhan


Editor's Notes

  • #25 Cloudera partners more broadly and deeply across the Hadoop ecosystem than any other vendor. With over 1200 partners and counting, our partnerships offer:
    • Compatibility with your existing tools and skills
      • 160+ certified on Cloudera 5, including all 12 of the 12 Gartner Business Intelligence Magic Quadrant leaders
    • Flexible deployment options
      • On-premises
      • Public, private, or hybrid cloud
      • Appliances and engineered systems
    • Partnerships you can trust
      • Deep engineering relationships
      • Comprehensive certification program