This is a local Apache Spark cluster with an Apache Cassandra database that can be built quickly and easily using Docker Compose. The focus is on the integration of Elite Dangerous (EDDN) data, which is loaded directly into Cassandra. This makes it possible to run PySpark tests and analyses with Spark directly after initialization.
To use the cluster, the following must be installed:
- Python 3
- Docker
- Docker Compose
Docker is needed for the individual components, each of which runs in its own container. Docker Compose starts all containers together, and Python 3 is used to load the Elite Dangerous data.
It is recommended to install the latest version of Python 3.
Before installing the latest version, check whether a Python 3 version is already installed.
To check this, run:
$ python3 --version
If a version of Python 3 is installed, you can upgrade it to the latest version:
$ sudo apt-get upgrade python3
If you want to install the latest version of Python 3, run:
$ sudo apt-get install python3
After the installation, check whether pip (the Python package manager) was installed along with your Python installation:
$ pip3 -V
If pip isn't installed on your machine, run the following command to install it:
$ sudo apt install python3-pip
On Windows, download the executable installer from https://www.python.org/downloads/windows/ .
Afterwards, execute the installer and follow the instructions.
It is also possible to install Python with Anaconda or via configuration in PowerShell.
On macOS, before installing Python make sure that Xcode and Homebrew are installed on your computer.
If that is not the case, run this in the terminal to install Xcode:
$ xcode-select --install
and this to install Homebrew:
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Now check whether a Python version is already installed:
$ python3 --version
If a version is installed, you can upgrade it to the latest version:
$ brew update
$ brew upgrade python3
To install Python 3, run this command in your terminal:
$ brew install python3
After the installation, check whether pip (the Python package manager) was installed along with your Python installation:
$ pip3 -V
If pip isn't installed on your machine, run the following commands to install it:
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
$ python3 get-pip.py
For detailed information, take a look at the Docker documentation, the first link in the References chapter.
Make sure that no outdated Docker version is installed:
$ sudo apt-get remove docker docker-engine docker.io containerd runc
Update the apt package index:
$ sudo apt-get update
Then install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common
Now add Docker's official GPG key:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Afterwards, check that the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88 was added. Use the last 8 characters of the fingerprint for searching:
$ sudo apt-key fingerprint 0EBFCD88
Finally, use the following command to set up the stable repository:
$ sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
Update the apt package index:
$ sudo apt-get update
Install the latest version of Docker:
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
On Windows, if you haven't already downloaded the installer (Docker Desktop Installer.exe), you can get it from download.docker.com.
- Double-click Docker Desktop Installer.exe to run the installer.
- Follow the install wizard to accept the license, authorize the installer, and proceed with the installation.
- Click Finish on the setup dialog to complete and launch Docker.
To install Docker Desktop for Mac, download Docker.dmg from Docker Hub.
You have to sign up on Docker Hub to download Docker for Mac.
- Double-click Docker.dmg to open the installer, then drag Moby the whale to the Applications folder.
- Double-click Docker.app in the Applications folder to start Docker.
To install Docker Compose with curl, run this command to download the current stable release of Docker Compose:
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Apply executable permissions to the binary:
$ sudo chmod +x /usr/local/bin/docker-compose
Alternatively, you can use pip to install Docker Compose:
$ sudo pip install docker-compose
Test whether the installation was successful:
$ docker-compose --version
The desktop versions of Docker for Windows and Mac already include Docker Compose, so no separate installation is needed there.
The execution is divided into single shell scripts. Their functionality and benefits are explained in the project chapter. There is also a shell script with which all steps can be executed in the correct order. The project is designed for Linux systems, but can be ported to other operating systems by adapting the shell scripts.
The following describes all steps that need to be run, in the given order, to make the project/cluster work.
The data used comes from the game Elite Dangerous (EDDN) and is provided by the API of the website eddb.io.
With the Python script EDDNClient.py the data is read from the API and written in JSON format into a .log file (Logs_JSON_EDDN_yyyy-mm-dd).
The number of downloaded datasets/rows is defined by the argument -d, --datasets.
Afterwards the .log file is transformed into CSV format with the script transform_to_csv.py to make it suitable for Cassandra.
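The transformation step can be sketched in plain Python. Since the actual field layout handled by transform_to_csv.py is not shown here, the column names below are purely illustrative:

```python
import csv
import json

def transform_log_to_csv(log_path, csv_path, columns):
    """Read a JSON-lines .log file and write the chosen fields as CSV."""
    with open(log_path, encoding="utf8") as src, \
         open(csv_path, "w", newline="", encoding="utf8") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)  # header row, useful for the Cassandra COPY import
        for line in src:
            if not line.strip():
                continue                      # skip empty lines in the log
            record = json.loads(line)
            # Missing fields become empty CSV cells
            writer.writerow(record.get(col, "") for col in columns)

# Illustrative usage with made-up EDDN-like field names:
# transform_log_to_csv("Logs_JSON_EDDN_2019-06-01.log", "eddn_data.csv",
#                      ["timestamp", "systemName", "stationName"])
```

Writing row by row keeps memory usage flat even for large log files, which is why a streaming approach like this is preferable to loading the whole log at once.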
For the execution use the shell script download_and_transform_data.sh with the necessary argument:
$ bash download_and_transform_data.sh -d <number of datasets that should be downloaded>
or
$ bash download_and_transform_data.sh --datasets=<number of datasets that should be downloaded>
The cluster can be created after the database has been set up.
To do this, use the shell script run_docker_compose.sh. The script expects an argument to specify the number of Spark nodes/slaves. Then Docker Compose is used to build the cluster with the scaled number of Spark nodes.
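Conceptually, the script only has to hand the node count to Docker Compose's `--scale` option. A minimal sketch in Python, assuming the worker service is named spark-worker (the real service name must match the one in the project's docker-compose.yml):

```python
def compose_scale_cmd(nodes: int) -> list:
    """Build the Docker Compose call that starts the cluster with `nodes` workers.

    The service name 'spark-worker' is an assumption for illustration; it has
    to match the service defined in the project's docker-compose.yml.
    """
    if nodes < 1:
        raise ValueError("at least one Spark worker is required")
    return ["docker-compose", "up", "-d", "--scale", f"spark-worker={nodes}"]

# compose_scale_cmd(3)
# → ['docker-compose', 'up', '-d', '--scale', 'spark-worker=3']
```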
$ bash run_docker_compose.sh -n <number of nodes/workers that will be created>
or
$ bash run_docker_compose.sh --nodes=<number of nodes/workers that will be created>
After the cluster is initialized, the EDDN data can be loaded into the database. Execute the script:
$ bash load_data_into_cassandra.sh
This script calls the Cassandra file copy_data.cql, which creates the keyspace and the table and loads the data from the CSV file.
The provided PySpark script eddb_data.py is just an example. It will select the whole table and write the result into a CSV file in the folder ./compose_cluster/export_data.
Use this shell script to run PySpark:
$ bash exec_pyspark_scripts.sh
In this script you can also insert your own PySpark scripts or replace the existing one to execute them.
Note: A pandas function is used to create the CSV. However, this is only useful for small amounts of data, because it loads all data into RAM before writing. As a result, RAM fills up if the amount of data is too large. To avoid this, you can comment out the following line in the PySpark script eddb_data.py:
# df_data.toPandas().to_csv('/tmp/check_cass_data.csv', header=True, encoding='utf8')
To avoid that each step has to be executed separately, a shell script with two arguments can be used:
$ bash run_all.sh -d <number of datasets that should be downloaded> -n <number of nodes/workers that will be created>
or
$ bash run_all.sh --datasets=<number of datasets that should be downloaded> --nodes=<number of nodes/workers that will be created>
This script executes all steps in the correct order.
Note: During the first execution of the script it is possible that the data is not copied and the PySpark script is not executed correctly. This happens because of the initialization time of the Docker containers. If this problem occurs, it can be fixed by executing the shell script again. Since the script recognizes that the cluster already exists, only the missing steps will be performed on a rerun. However, the number of datasets to load must be specified on every run, so it is recommended to set this number to a low value when the cluster is executed again. Once the cluster has been built up, there will be no recurring complications when running the script again.
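The rerun behaviour described above is a common "skip completed steps" pattern. A hypothetical sketch using marker files (the real scripts may instead detect existing containers or files):

```python
import os

def run_step(name, action, marker_dir="."):
    """Run `action` only if no marker file from a previous run exists.

    `name` and the `.done_` marker convention are illustrative, not part of
    the project's actual scripts.
    """
    marker = os.path.join(marker_dir, f".done_{name}")
    if os.path.exists(marker):
        return False               # step already completed in an earlier run
    action()
    open(marker, "w").close()      # record success so a rerun skips this step
    return True
```

On a second invocation each already-completed step is a no-op, which is why rerunning the whole script after a timing failure only performs the missing steps.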
To remove the cluster with all Docker containers and images, use the script:
$ bash remove_docker_container_images.sh
To delete all created data files, run the script:
$ bash clean_folders_from_files.sh
These are the references that were used for the creation of the README.