This Scala Spark Streaming program reads streaming CSV price data from files, transforms them into DataFrames, and performs live analytics using SQL. Here’s a step-by-step explanation:
- Imports Spark SQL and streaming libraries.
- Creates a Spark Streaming context (
ssc) that processes new data every 5 seconds. - Sets up a stream to watch the
/tmp/amazon/directory for new text files (each file assumed to contain price data).
- Defines a Scala case class
Priceswith two fields:priceandtime, representing a price value and its timestamp (both as strings).
- For each RDD (batch) of lines:
- Splits each line by commas (assumes format
"price,time"). - Converts the split arrays to
Pricescase class objects. - Converts the collection of
Pricesinto a Spark SQL DataFrame. - Registers the DataFrame as a temporary table named
"prices". - Shows the contents of the DataFrame with
.show().
- Splits each line by commas (assumes format
- Executes a SQL query (
select count(*) from prices) to count how many price entries exist in the current batch. - Prints the count to the console.
- Also prints all incoming lines directly with
lines.print(). - Starts the streaming context to begin processing and continues until stopped.
- Purpose: Performs real-time streaming ingestion, transformation, and SQL-based analysis of Amazon-style (price, time) CSV files dropped in a local directory.
- Features: Live monitoring, immediate conversion of each batch of records into a DataFrame, on-the-fly SQL, and streaming batch output.
- Use Case: Useful for ingesting live price or log feeds into Spark for ETL, visualization, or monitoring dashboards.
This workflow demonstrates how to go from live file data to structured SQL analysis in Spark Streaming, using minimal code and strong type safety.