MBtech/HiBench

HiBench Suite

The bigdata micro benchmark suite

This is a modified version of the HiBench benchmark for Vanir. These are the default configurations used for the paper.


OVERVIEW

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO. It also contains several streaming workloads for Spark Streaming, Flink, Storm and Gearpump.

Getting Started

Workloads

HiBench contains 19 workloads in total, divided into six categories: micro, ml (machine learning), sql, graph, websearch and streaming.

Batch Pipelines:

A price predictor (pp) benchmark is part of the workloads used for Vanir. It consists of a pipeline of three batch jobs: ETL, Update and ML.

The ETL stage reads new data that has been loaded to S3. This data contains housing records with the price of each house and the values of its features. ETL enriches the data with location features from a static dataset and then stores the enriched data in S3.

The Update stage loads the historical data and the transformed data from the ETL stage and merges them into a dataset in S3 that can be processed by ML.

The ML stage loads the housing price data from S3, splits it into training and test sets, and then trains a random forest model on it before performing cross-validation for model evaluation.
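The three stages run one after another, so the pipeline runtime can be measured by timing each stage and summing the results. The following is a minimal sketch of such a timing harness, not the harness used in this repository. The function `run_pipeline` and the placeholder stage functions are hypothetical; real stages would submit the corresponding batch jobs.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_pipeline(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run the stages in order; return per-stage runtimes and their sum."""
    runtimes: Dict[str, float] = {}
    for name, job in stages:
        start = time.perf_counter()
        job()  # blocks until the stage has finished
        runtimes[name] = time.perf_counter() - start
    # Total pipeline runtime is the sum of the stage runtimes.
    runtimes["total"] = sum(runtimes.values())
    return runtimes


# Hypothetical placeholders standing in for the real ETL, Update and ML jobs.
def etl() -> None:
    time.sleep(0.01)


def update() -> None:
    time.sleep(0.01)


def ml() -> None:
    time.sleep(0.01)


if __name__ == "__main__":
    result = run_pipeline([("ETL", etl), ("Update", update), ("ML", ml)])
    for stage, seconds in result.items():
        print(f"{stage}: {seconds:.3f}s")
```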

The total runtime of the pipeline is the sum of the runtimes of the three stages.

Micro Benchmarks:

  1. Sort (sort)

    This workload sorts its text input data, which is generated using RandomTextWriter.

  2. WordCount (wordcount)

    This workload counts the occurrences of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

  3. TeraSort (terasort)

    TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.

  4. Sleep (sleep)

    This workload sleeps for a number of seconds in each task to test the framework scheduler.

  5. enhanced DFSIO (dfsioe)

    Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark has no corresponding Spark implementation.
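To make the three DFSIO metrics concrete, here is a rough sketch of how they could be computed from per-task results. The definitions are assumptions chosen for illustration and are not taken from the dfsioe source: the average I/O rate is the mean of the per-task rates, the average task throughput is total data divided by total task time, and the aggregated throughput is approximated as total data divided by wall-clock time. `TaskResult` and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    megabytes: float  # data read or written by one map task
    seconds: float    # time that task spent on I/O


def average_io_rate(tasks: List[TaskResult]) -> float:
    """Mean of the per-task I/O rates, in MB/s (assumed definition)."""
    return sum(t.megabytes / t.seconds for t in tasks) / len(tasks)


def average_task_throughput(tasks: List[TaskResult]) -> float:
    """Total data divided by total task time, in MB/s (assumed definition)."""
    return sum(t.megabytes for t in tasks) / sum(t.seconds for t in tasks)


def aggregated_throughput(tasks: List[TaskResult], wall_clock_seconds: float) -> float:
    """Total data divided by elapsed wall-clock time (a simplification)."""
    return sum(t.megabytes for t in tasks) / wall_clock_seconds


if __name__ == "__main__":
    results = [TaskResult(100.0, 10.0), TaskResult(100.0, 20.0)]
    print(average_io_rate(results))
    print(average_task_throughput(results))
    print(aggregated_throughput(results, wall_clock_seconds=20.0))
```

The first two values differ whenever tasks run at different speeds, which is why both are reported.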

Machine Learning:

  1. Bayesian Classification (Bayes)

    Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. This workload is implemented in spark.mllib and uses automatically generated documents whose words follow the Zipfian distribution. The dict used for text generation is the default Linux file /usr/share/dict/linux.words.

  2. K-means clustering (Kmeans)

    This workload tests K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) in spark.mllib. The input data set is generated by GenKMeansDataset based on the uniform distribution and the Gaussian distribution.

  3. Logistic Regression (LR)

    Logistic Regression (LR) is a popular method to predict a categorical response. This workload is implemented in spark.mllib with the LBFGS optimizer, and the input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three kinds of data: categorical, continuous and binary.

  4. Alternating Least Squares (ALS)

    The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering. This workload is implemented in spark.mllib and the input data set is generated by RatingDataGenerator for a product recommendation system.

  5. Gradient Boosting Trees (GBT)

    Gradient-boosted trees (GBT) is a popular regression method using ensembles of decision trees. This workload is implemented in spark.mllib and the input data set is generated by GradientBoostingTreeDataGenerator.

  6. Linear Regression (Linear)

    Linear Regression (Linear) is a workload implemented in spark.mllib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.

  7. Latent Dirichlet Allocation (LDA)

    Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. This workload is implemented in spark.mllib and the input data set is generated by LDADataGenerator.

  8. Principal Components Analysis (PCA)

    Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. PCA is used widely in dimensionality reduction. This workload is implemented in spark.mllib. The input data set is generated by PCADataGenerator.

  9. Random Forest (RF)

    Random forests (RF) are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib and the input data set is generated by RandomForestDataGenerator.

  10. Support Vector Machine (SVM)

    Support Vector Machine (SVM) is a standard method for large-scale classification tasks. This workload is implemented in spark.mllib and the input data set is generated by SVMDataGenerator.

  11. Singular Value Decomposition (SVD)

    Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib and its input data set is generated by SVDDataGenerator.
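Several workloads use generated text whose words follow the Zipfian distribution, for example the Bayes input documents. As an illustration only (this is not HiBench's data generator), the sketch below draws words from a vocabulary so that the word at rank r is chosen with probability proportional to 1/r^s. The function names, the tiny vocabulary and the default exponent s = 1.0 are assumptions.

```python
import random
from typing import List


def zipf_weights(vocab_size: int, exponent: float = 1.0) -> List[float]:
    """Unnormalized Zipfian weights: rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def generate_document(vocab: List[str], length: int, rng: random.Random,
                      exponent: float = 1.0) -> List[str]:
    """Sample `length` words; earlier vocabulary entries are more frequent."""
    weights = zipf_weights(len(vocab), exponent)
    return rng.choices(vocab, weights=weights, k=length)


if __name__ == "__main__":
    # A real run would load the vocabulary from a dictionary file instead.
    vocab = ["alpha", "beta", "gamma", "delta"]
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print(" ".join(generate_document(vocab, 12, rng)))
```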

SQL:

  1. Scan (scan), Join (join), Aggregate (aggregation)

    These workloads are based on the SIGMOD 09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. They contain Hive queries (Aggregation and Join) performing the typical OLAP queries described in the paper. Their input is also automatically generated Web data with hyperlinks following the Zipfian distribution.

Websearch Benchmarks:

  1. PageRank (pagerank)

    This workload benchmarks the PageRank algorithm implemented in the Spark-MLLib/Hadoop (a search engine ranking benchmark included in pegasus 2.0) examples. The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.

  2. Nutch indexing (nutchindexing)

    Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system in Nutch, a popular open source (Apache project) search engine. The workload uses the automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dict used to generate the Web page texts is the default linux dict file.
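For readers unfamiliar with PageRank, the following is a minimal single-machine sketch of the textbook power-iteration form of the algorithm. It is not the Spark or Hadoop implementation that the workload runs. The damping factor 0.85, the iteration count and the handling of pages without outgoing links are conventional choices assumed here.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Compute PageRank by power iteration over an adjacency list."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(f"{page}: {rank:.4f}")
```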

Graph Benchmark:

  1. NWeight (nweight)

    NWeight is an iterative graph-parallel algorithm implemented with Spark GraphX and Pregel. The algorithm computes associations between two vertices that are n hops away.
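The description above does not define how an association is scored. Purely as an illustration, the sketch below assumes that the association between two vertices is the sum, over all n-hop paths between them, of the product of the edge weights along each path. That scoring rule and the function name are assumptions, and the code runs on a single machine rather than on GraphX.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]  # source -> {destination: edge weight}


def n_hop_associations(edges: Graph, n: int) -> Graph:
    """Return, per source, the summed path weight to each vertex n hops away."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # After k rounds, current[src][dst] covers all k-hop paths from src to dst.
    current: Graph = {src: dict(neighbors) for src, neighbors in edges.items()}
    for _ in range(n - 1):
        extended: Graph = {}
        for src, reachable in current.items():
            acc: Dict[str, float] = {}
            for mid, weight_so_far in reachable.items():
                # Extend each known path by one more edge.
                for dst, edge_weight in edges.get(mid, {}).items():
                    acc[dst] = acc.get(dst, 0.0) + weight_so_far * edge_weight
            extended[src] = acc
        current = extended
    return current


if __name__ == "__main__":
    graph = {"a": {"b": 0.5}, "b": {"c": 0.4}, "c": {}}
    print(n_hop_associations(graph, 2))
```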

Streaming Benchmarks:

  1. Identity (identity)

    This workload reads input data from Kafka and then writes the result to Kafka immediately. There is no complex business logic involved.

  2. Repartition (repartition)

    This workload reads input data from Kafka and changes the level of parallelism by creating more or fewer partitions. It tests the efficiency of data shuffle in the streaming frameworks.

  3. Stateful Wordcount (wordcount)

    This workload counts words cumulatively received from Kafka every few seconds. This tests the stateful operator performance and Checkpoint/Acker cost in the streaming frameworks.

  4. Fixwindow (fixwindow)

    This workload performs a window-based aggregation. It tests the performance of window operations in the streaming frameworks.
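A fixed (tumbling) window groups events into consecutive, non-overlapping time intervals and aggregates each interval separately. The sketch below illustrates the idea on an in-memory list of events. It is not the HiBench fixwindow code: the event shape (timestamp, key), the count aggregation and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fixed_window_counts(events: Iterable[Tuple[float, str]],
                        window_seconds: float) -> Dict[Tuple[int, str], int]:
    """Count events per (window index, key) using non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Window k covers [k * window_seconds, (k + 1) * window_seconds).
        window_index = int(timestamp // window_seconds)
        counts[(window_index, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    stream = [(0.5, "a"), (1.2, "a"), (9.9, "b"), (10.0, "a")]
    print(fixed_window_counts(stream, window_seconds=10.0))
```

A streaming framework computes the same result incrementally and emits each window once it closes, which is the part this benchmark stresses.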

Supported Hadoop/Spark/Flink/Storm/Gearpump releases:

  • Hadoop: Apache Hadoop 2.x, CDH5, HDP
  • Spark: Spark 1.6.x, Spark 2.0.x, Spark 2.1.x, Spark 2.2.x
  • Flink: 1.0.3
  • Storm: 1.0.1
  • Gearpump: 0.8.1
  • Kafka: 0.8.2.2
