This document discusses visualizing data in R using various packages and techniques. It introduces ggplot2, a popular package for data visualization that implements Wilkinson's Grammar of Graphics. ggplot2 can serve as a replacement for base graphics in R and contains defaults for displaying common scales online and in print. The document then covers basic visualizations like histograms, bar charts, box plots, and scatter plots that can be created in R, as well as more advanced visualizations. It also provides examples of code for creating simple time series charts, bar charts, and histograms in R.
Overview of data visualization's importance, R's capabilities, and ggplot2's popularity.
Introduction to ggplot2, its creation, and its role in implementing Grammar of Graphics.
Basic and advanced visualization techniques like histograms, bar/line charts, heat maps, and 3D graphs.
Explanation of histograms, their purpose, and how to adjust data bins for better analysis.
Usage of line and bar charts for trend analysis over time and comparisons between categories.
Description of box plots and their statistical significance in showing data spread.
Methods for reading data into R using read.table() or read.csv() functions.
Various functions in R for generating time series, bar charts, and histograms, and their respective usages.
Details on ggplot function usage, aesthetic mappings, and methods to create visualizations.
Data visualization tools and techniques offer executives and other
knowledge workers new approaches to dramatically improve their ability
to grasp information hidden in their data.
Data visualization is a general term that describes any effort to help
people understand the significance of data by placing it in a visual
context. Patterns, trends and correlations that might go undetected in
text-based data can be exposed and recognized more easily with data
visualization software.
It isn't just the huge range of statistical analyses afforded by R
that attracts data people to the language. R has also developed a
rich ecosystem of charts, plots and visualizations over the years.
ggplot2 is a data visualization package for the statistical programming language R.
Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland
Wilkinson's Grammar of Graphics, a general scheme for data visualization which
breaks up graphs into semantic components such as scales and layers.
ggplot2 can serve as a replacement for the base graphics in R and contains a
number of defaults for web and print display of common scales.
Since 2005, ggplot2 has grown in use to become one of the most popular R
packages. It is licensed under GNU GPL v2.
1. Histogram
A histogram is a plot that breaks the data into bins (or
breaks) and shows the frequency distribution across these bins.
You can also change the breaks and see the effect this has on
how readable the visualization is.
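To make the effect of changing the breaks concrete, here is a minimal sketch using base R's hist(). It assumes the built-in AirPassengers series as sample data and picks bin widths of 100 and 25 purely for illustration; neither choice comes from the original slides.

```r
# Sample data (an assumption for this sketch): the built-in AirPassengers
# monthly series, converted to a plain numeric vector.
x <- as.numeric(AirPassengers)

# Default binning: hist() chooses the break points itself.
h_default <- hist(x, main = "Default breaks", xlab = "Passengers (thousands)")

# Coarser bins: explicit break points every 100 units.
h_coarse <- hist(x, breaks = seq(100, 700, by = 100),
                 main = "Bin width 100", xlab = "Passengers (thousands)")

# Finer bins: explicit break points every 25 units.
h_fine <- hist(x, breaks = seq(100, 700, by = 25),
               main = "Bin width 25", xlab = "Passengers (thousands)")

# Whatever the binning, the bin counts add up to the number of observations.
total_ok <- sum(h_coarse$counts) == length(x)
```

Comparing the three plots side by side shows the trade-off: wide bins hide detail, while narrow bins can make the shape of the distribution look noisy.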
2. Bar/Line Chart
Line Chart
Below is the line chart showing the increase in air
passengers over a given time period. Line charts are
commonly preferred when we want to analyse a trend
over a period of time. A line plot is also suitable
when we need to compare relative changes in
quantities across some variable (like time).
Below is the code:
plot(AirPassengers,type="l") #Simple Line Plot
Bar Chart
Bar plots are suitable for comparing cumulative totals
across several groups. Stacked bar plots additionally break
each bar down by category. Here’s the code:
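A minimal sketch of both variants with base R's barplot() follows. It assumes the built-in VADeaths matrix as sample data, since the data set used on the original slide is not named; the titles and labels are likewise illustrative.

```r
# VADeaths is a built-in 5 x 4 matrix: rows are age groups, columns are
# population groups. It is sample data chosen for this sketch.
totals <- colSums(VADeaths)

# Simple bar plot: one bar per group, bar height = column total.
# barplot() returns the x positions of the bar midpoints.
mids <- barplot(totals, main = "Total death rate by group",
                ylab = "Sum of rates per 1000", cex.names = 0.7)

# Stacked bar plot: passing the whole matrix stacks the rows (age groups)
# inside each column's bar; legend.text = TRUE labels them by row name.
barplot(VADeaths, legend.text = TRUE, main = "Death rate by age group",
        ylab = "Rate per 1000", cex.names = 0.7)
```

Passing beside = TRUE to the second call would draw the age groups next to each other instead of stacking them.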
3. Box Plot (including group-by option)
A box plot shows five summary statistics: the minimum, the 25th percentile, the
median, the 75th percentile and the maximum. It is thus useful for visualizing the spread
of the data and drawing inferences accordingly. Here’s the basic code:
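A minimal sketch with base R's boxplot(), including the group-by form, is shown here. It assumes the built-in mtcars data set as sample data, which the original slides do not specify. Note that by default boxplot() stops the whiskers at 1.5 times the interquartile range and draws more extreme points individually; range = 0 extends the whiskers to the minimum and maximum.

```r
# Sample data (an assumption for this sketch): fuel economy from mtcars.
mpg <- mtcars$mpg

# The five numbers themselves: minimum, lower hinge, median,
# upper hinge, maximum.
five <- fivenum(mpg)

# A single box plot; range = 0 makes the whiskers reach the extremes.
boxplot(mpg, range = 0, ylab = "Miles per gallon", main = "All cars")

# Group-by option: the formula mpg ~ cyl draws one box per cylinder count.
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Cylinders", ylab = "Miles per gallon",
             main = "Fuel economy by cylinder count")

# b$stats holds the five plotted values, one column per group.
b$stats
```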
Ingest Data
When reading data into R, we generally use
the read.table() or read.csv() function. These read a
delimited text file and return its contents as a data frame.
The result is typically stored in a variable, for example
bugData. Notice that R conventionally uses the <- operator
for assignment instead of the = used in most other languages.
There are certain parameters that we can pass in
to read.table().
Among the most often used of these parameters
are: sep, header, row.names, and col.names.
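The parameters above can be sketched as follows. The variable name bugData comes from the text; the file contents, the column names, and the use of a temporary file are hypothetical and serve only to keep the example self-contained.

```r
# Write a small sample file (hypothetical contents) to a temporary path.
path <- tempfile(fileext = ".csv")
writeLines(c("id,severity,count",
             "a,high,3",
             "b,low,7"), path)

# read.csv() defaults to sep = "," and header = TRUE.
bugData <- read.csv(path)

# read.table() with the same settings spelled out; row.names = 1 uses the
# first column as row names instead of keeping it as a data column.
bugData2 <- read.table(path, sep = ",", header = TRUE, row.names = 1)

# Skip the header line and supply our own column names via col.names.
bugData3 <- read.table(path, sep = ",", skip = 1,
                       col.names = c("key", "level", "n"))

str(bugData)
```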
R provides the plot() function, which can be used to create time
series charts. We can either pass in a complete data structure
(if a plot method exists for its class), or we can pass in
vectors to serve as the x- and y-axes of the chart.
?plot
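Both calling styles can be sketched as below. The example assumes the built-in AirPassengers series as sample data; the yearly aggregation is only a convenient way to obtain two plain vectors and is not taken from the original slides.

```r
# Style 1: pass a complete data structure. AirPassengers has class "ts",
# so plot() dispatches to the time series method and labels the time axis.
plot(AirPassengers, ylab = "Passengers (thousands)",
     main = "Monthly air passengers")

# Style 2: pass explicit x and y vectors.
years  <- 1949:1960
annual <- as.numeric(aggregate(AirPassengers, nfrequency = 1, FUN = sum))

# type = "b" draws both points and connecting lines.
plot(years, annual, type = "b",
     xlab = "Year", ylab = "Passengers (thousands)",
     main = "Annual air passengers")
```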
R provides the hist() function to create histograms.
The hist() function accepts a numeric vector of values.
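Because hist() expects a numeric vector, a column has to be extracted from a data frame before plotting. A minimal sketch, assuming the built-in mtcars data set as sample data:

```r
# Extract one numeric column; mtcars is sample data chosen for this sketch.
mpg <- mtcars$mpg

# hist() draws the plot and also returns the binning it used.
h <- hist(mpg, main = "Fuel economy", xlab = "Miles per gallon")
h$breaks  # bin boundaries
h$counts  # number of observations per bin

# Passing the whole data frame is an error, since it is not a numeric vector.
res <- tryCatch(hist(mtcars), error = function(e) conditionMessage(e))
```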
Usage
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
Arguments
data
Default dataset to use for plot. If not already a data.frame, will be converted to one
by fortify. If not specified, must be supplied in each layer added to the plot.
mapping
Default list of aesthetic mappings to use for plot. If not specified, must be supplied in
each layer added to the plot.
environment
If a variable defined in the aesthetic mapping is not found in the data, ggplot will
look for it in this environment. It defaults to using the environment in which ggplot() is
called.
ggplot() is used to construct the initial plot object, and is
almost always followed by + to add components to the plot. There
are three common ways to invoke ggplot:
ggplot(df, aes(x, y, <other aesthetics>))
ggplot(df)
ggplot()
The first method is recommended if all layers use the same data
and the same set of aesthetics, although this method can also be
used to add a layer using data from another data frame. See the
first example below.
The second method specifies the default data frame to use for the
plot, but no aesthetics are defined up front. This is useful when one
data frame is used predominantly as layers are added, but the
aesthetics may vary from one layer to another.
The third method initializes a skeleton ggplot object which is
fleshed out as layers are added. This method is useful when
multiple data frames are used to produce different layers, as is
often the case in complex graphics.
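The three invocation styles can be sketched as follows. The example assumes the ggplot2 package is installed and uses the built-in mtcars data set; the choice of geoms, the heavy subset, and the colours are illustrative only.

```r
library(ggplot2)

# Method 1: data and aesthetics up front; every layer inherits both.
p1 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# Method 2: default data only; each layer declares its own aesthetics.
p2 <- ggplot(mtcars) +
  geom_point(aes(wt, mpg)) +
  geom_rug(aes(x = wt))

# Method 3: empty skeleton; each layer brings its own data frame.
# 'heavy' is a hypothetical second data frame derived for this sketch.
heavy <- subset(mtcars, wt > 4)
p3 <- ggplot() +
  geom_point(data = mtcars, aes(wt, mpg)) +
  geom_point(data = heavy, aes(wt, mpg), colour = "red", size = 3)

# Printing a plot object renders it.
print(p1)
```

Method 3 is the most verbose, but it keeps each layer's data explicit, which helps once a plot combines several sources.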