Big Data Consulting
doing hadoop, securely
Rob Gibbon
■ Architect @Big Industries Belgium
■ Focus on designing, deploying & integrating web scale solutions with Hadoop
■ Deliveries for clients in telco, financial services & media
Hadoop was built to survive data tsunamis
■ a response to challenges that enterprise vendors were unable to address
■ focused on data volumes and cost reduction
■ initially, the solution had some serious holes
Confidentiality, Integrity, Availability
■ early prereleases couldn’t really meet any of these three fundamental infosec objectives
■ basic controls weren’t there
the early days
■ Multiple SPoF
■ No authentication
■ Easily spoofed authorisation
■ No encryption of data at rest or in transit
■ No accounting
enter the hadoop vendors
■ Vendors like Cloudera focus on making Apache Hadoop “enterprise ready”
■ Includes building robust infosec controls into Hadoop core
■ Multilayer security is now available for Hadoop
running a cluster in non-secure mode
■ malicious|mistaken user:
■ recursively delete all the data please
■ by the way, I’m the system superuser
■ hadoop:
■ oh ok then
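To make the point concrete, here is a minimal sketch (assuming a hypothetical non-secure cluster at hdfs://nn.example.com:8020) of how identity is simply asserted, never proven, under Hadoop’s default “simple” authentication:

```java
// Under simple authentication the client *asserts* an identity and the
// cluster believes it. The NameNode address below is hypothetical.
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SpoofDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn.example.com:8020");

    // Claim to be the HDFS superuser: no password, no ticket, no proof.
    UserGroupInformation hdfs = UserGroupInformation.createRemoteUser("hdfs");
    hdfs.doAs((PrivilegedExceptionAction<Void>) () -> {
      FileSystem fs = FileSystem.get(conf);
      fs.delete(new Path("/"), true); // "recursively delete all the data please"
      return null;
    });
  }
}
```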
bad things happen with slack controls in place
average cost of a data breach = $3.8m
running a secure cluster
■ Kerberos is one of the primary security controls you can use
■ Btw, what’s wrong with this kerberos principal?
■ hdfs@BIGINDUSTRIES.BE
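For contrast with the spoofing sketch above, a minimal sketch of a kerberised HDFS client; the keytab path and NameNode address are illustrative:

```java
// With Kerberos enabled, the client must present a real credential (here,
// a keytab) before Hadoop will act on its behalf. Paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberisedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn.cluster1.bigindustries.be:8020");
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate with a keytab instead of asserting a name.
    UserGroupInformation.loginUserFromKeytab(
        "hdfs/node1.cluster1.bigindustries.be@BIGINDUSTRIES.BE",
        "/etc/security/keytabs/hdfs.keytab");

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/"))); // runs as the keytab principal
  }
}
```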
kerberos continued
■ Kerberos uses a three-part principal: primary/instance@REALM. The bare hdfs@BIGINDUSTRIES.BE above is missing its instance, so every host and cluster would share one identity
■ hdfs/node1.cluster1.bigindustries.be@BIGINDUSTRIES.BE
■ hdfs/node1.cluster2.bigindustries.be@BIGINDUSTRIES.BE
■ Best to use explicit mappings from kerberos principals to local users (see the sketch below)
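A sketch of such an explicit mapping using hadoop.security.auth_to_local rules. These normally live in core-site.xml; they are set programmatically here for brevity, and the regex is illustrative:

```java
// Explicit principal-to-local-user mapping via auth_to_local rules,
// using the BIGINDUSTRIES.BE realm from the slides.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.HadoopKerberosName;

public class AuthToLocal {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.auth_to_local",
        // Map only cluster1's hdfs service principals to the local hdfs user;
        // cluster2's principals deliberately fall through unmatched.
        "RULE:[2:$1/$2@$0](hdfs/.*\\.cluster1\\.bigindustries\\.be@BIGINDUSTRIES\\.BE)s/.*/hdfs/\n"
        // DEFAULT maps plain user@BIGINDUSTRIES.BE principals to their first component.
        + "DEFAULT");

    // Verify the mapping resolves as intended.
    HadoopKerberosName.setConfiguration(conf);
    System.out.println(new HadoopKerberosName(
        "hdfs/node1.cluster1.bigindustries.be@BIGINDUSTRIES.BE").getShortName()); // hdfs
  }
}
```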
hive / impala
■ HiveServer doesn’t support Kerberos => use HiveServer2
■ Best to use Sentry to enforce role-based access controls from SQL
■ Users can upload and execute arbitrary [possibly hostile] UDFs => enable Sentry
■ Older versions of Metastore don’t enforce permissions on the grant_* and revoke_* APIs => stay up to date
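A hedged sketch of querying a kerberised HiveServer2 over JDBC. Host names and the service principal are illustrative, and the caller is assumed to already hold a Kerberos ticket (via kinit or a keytab login as above):

```java
// JDBC against a kerberised HiveServer2: the principal in the URL is
// HiveServer2's own service principal. Hostnames are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SecureHiveQuery {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:hive2://hive.cluster1.bigindustries.be:10000/default;"
        + "principal=hive/hive.cluster1.bigindustries.be@BIGINDUSTRIES.BE";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}
```

Sentry then enforces the role-based grants server-side; no extra client code is needed.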
availability
■ Most core components now support HA
■ HDFS
■ YARN
■ Hive
■ HBase
disaster recovery
■ HDFS and HBase offer point-in-time snapshots
■ => consistency!
■ Vendor-tethered solutions for site-to-site replication are available
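A minimal sketch of taking a consistent, point-in-time snapshot before a risky change. The path and snapshot name are illustrative, and the allowSnapshot step is an HDFS admin operation:

```java
// Point-in-time HDFS snapshot: cheap, consistent, and readable afterwards
// under <dir>/.snapshot/<name>. Paths and names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotBeforeChange {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/data/warehouse");

    // Admin step, equivalent to: hdfs dfsadmin -allowSnapshot /data/warehouse
    ((DistributedFileSystem) fs).allowSnapshot(dir);

    Path snap = fs.createSnapshot(dir, "pre-migration");
    System.out.println("snapshot at " + snap);
  }
}
```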
encryption at rest
■ HDFS encryption zones
■ transparent to existing applications
■ minimal performance overhead on Intel architecture
■ key management is externalised
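A hedged sketch of creating such a zone, assuming a key named pii-key (illustrative) already exists in the external key management service and the caller has HDFS admin rights:

```java
// HDFS encryption zone: everything written under the zone directory is
// encrypted transparently; the key itself stays in the external KMS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class CreateEncryptionZone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path zone = new Path("/data/pii");

    FileSystem.get(conf).mkdirs(zone); // the zone directory must exist and be empty

    HdfsAdmin admin = new HdfsAdmin(FileSystem.getDefaultUri(conf), conf);
    admin.createEncryptionZone(zone, "pii-key"); // key name as registered in the KMS
  }
}
```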
wire encryption
■ SSL encryption is now available for most Hadoop services
■ Note that AES-256 for SSL and for Kerberos preauth requires extra JCE policy files on the cluster
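A quick, cluster-independent sanity check for those policy files; plain JDK, no Hadoop dependencies:

```java
// Without the JCE unlimited strength policy files, AES is capped at
// 128-bit keys and AES-256 for SSL/Kerberos will fail at runtime.
import javax.crypto.Cipher;

public class JcePolicyCheck {
  public static void main(String[] args) throws Exception {
    int max = Cipher.getMaxAllowedKeyLength("AES");
    System.out.println("Max AES key length: " + max);
    if (max < 256) {
      System.out.println("Install the JCE policy files on every node"
          + " before enabling AES-256.");
    }
  }
}
```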
accounting
■ Vendor-tethered solutions are available for auditing
■ Navigator for Cloudera clusters
■ Ranger for Hortonworks clusters
tokenization
■ The process of substituting a sensitive data element with a non-sensitive equivalent
■ 3rd-party vendor solutions are available that integrate well with Hadoop
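A toy illustration of the idea; real products use hardened external token vaults, format-preserving tokens and strict access controls, none of which this sketch attempts:

```java
// Toy tokenizer: swap each sensitive value for a random token and keep
// the real value only in a vault. Illustrative only, not a product design.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ToyTokenizer {
  // In reality the vault is a separate, tightly controlled service.
  private final Map<String, String> vault = new ConcurrentHashMap<>();

  public String tokenize(String sensitive) {
    String token = "tok-" + UUID.randomUUID();
    vault.put(token, sensitive);
    return token; // safe to store and analyse in Hadoop
  }

  public String detokenize(String token) {
    return vault.get(token); // gated by the vault's own access controls
  }

  public static void main(String[] args) {
    ToyTokenizer t = new ToyTokenizer();
    String token = t.tokenize("4111 1111 1111 1111");
    System.out.println(token);               // non-sensitive equivalent
    System.out.println(t.detokenize(token)); // original, vault-side only
  }
}
```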
some places where there’s still some work to do
■ Setting up hadoop security controls is complex and time-consuming
■ Not much support for SELinux around here
■ No general, coherent, policy-based framework for controlling resource access demands
■ Apache Knox is a starting point
■ => network and host resource access?
Integration
■ Integrating hadoop into an organisation’s services environment needs careful planning
■ Hadoop can conflict with established governance policies
■ system accounts & privileges
■ remote access
■ firewall flows
■ domains and trust
■ etc.
layered security in hadoop-core
■ Authentication: Kerberos
■ Authorisation: Local unix group or LDAP mappings
■ Authorisation: Sentry RBAC for hive/impala
■ Encryption: HDFS encryption
■ Encryption: SSL encryption for most services
■ Availability: Active/Passive failover for HDFS, YARN, HBase
■ Integrity: HDFS block replication & CRC checksum
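On the integrity point, a small sketch of putting those block checksums to work, for example verifying a replicated copy against its source (paths illustrative; checksums only compare meaningfully when block size and checksum type match):

```java
// Compare HDFS file checksums (MD5-of-MD5-of-CRC32 under the hood) to
// verify a copy matches its source. Paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileChecksum a = fs.getFileChecksum(new Path("/data/in/events.avro"));
    FileChecksum b = fs.getFileChecksum(new Path("/backup/in/events.avro"));
    System.out.println(a != null && a.equals(b) ? "match" : "MISMATCH");
  }
}
```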
but what about poodle/heartbleed/shellshock/whatever...
■ underlines the need for a mature information security governance strategy & architecture
defence-in-depth
■ A layered security architecture for Hadoop clusters is doable
■ e.g. MasterCard’s Cloudera Hadoop cluster achieved PCI compliance in 2014 http://goo.gl/FP5DUt
thanks for listening
be.linkedin.com/in/robertgibbon
www.bigindustries.be
