Pandas
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and
manipulating data.
• The name "Pandas" refers to both "Panel Data" and
"Python Data Analysis". The library was created by Wes McKinney in
2008.
• Pandas allows us to analyze big data and draw conclusions
based on statistical theories.
• Pandas can clean messy data sets, and make them readable
and relevant.
• Relevant data is very important in data science.
Pandas Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Create a Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
• Create a Series with your own labels by passing the index argument:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
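Once a Series has labels, its values can be read by label or by position. The following is a minimal sketch of that idea; the `calories` Series and its day labels are made up for illustration and do not come from the slides.

```python
import pandas as pd

values = [1, 7, 2]
labeled = pd.Series(values, index=["x", "y", "z"])

# Access by label, using the custom index
y_value = labeled["y"]

# Access by integer position, regardless of the labels
first_value = labeled.iloc[0]

# A dict also works as input: its keys become the labels
# (hypothetical data, for illustration only)
calories = pd.Series({"day1": 420, "day2": 380, "day3": 390})

print(y_value, first_value)
print(calories["day2"])
```

Passing a dict is a convenient alternative when the labels and values already belong together.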
Data Cleaning
• Data cleaning means fixing bad data in your
data set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
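One way to handle each kind of bad data is sketched below on a small, invented workout log. The column names, the values, and the 120-minute cap are assumptions chosen for illustration; real data needs its own rules.

```python
import pandas as pd

# Hypothetical workout log with one problem of each kind
df = pd.DataFrame({
    "Duration": ["60", "450", "45", "30", "30"],    # numbers stored as text
    "Calories": [409.1, 479.0, None, 200.0, 200.0],  # one empty cell
})

cleaned = df.copy()

# Data in wrong format: convert the text column to numbers
cleaned["Duration"] = pd.to_numeric(cleaned["Duration"], errors="coerce")

# Wrong data: treat anything above 120 minutes as a typo and cap it
# (the threshold is an assumption for this example)
cleaned.loc[cleaned["Duration"] > 120, "Duration"] = 120

# Empty cells: drop rows that have no Calories value
cleaned = cleaned.dropna(subset=["Calories"])

# Duplicates: keep only the first copy of identical rows
cleaned = cleaned.drop_duplicates()

print(cleaned)
```

Dropping rows is the simplest option for empty cells; filling them with `fillna()` is a common alternative when the rows are too valuable to lose.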
Data Correlations
• df.corr() calculates the correlation between each pair of columns in the data frame.
• Perfect Correlation:
"Duration" and "Duration" have a correlation of 1.000000,
which makes sense: each column always has a perfect relationship with
itself.
• Good Correlation:
"Duration" and "Calories" have a correlation of 0.922721, which is a very
good correlation. We can predict that the longer you work out, the
more calories you burn.
• Bad Correlation:
"Duration" and "Maxpulse" have a correlation of 0.009403, which is a very
bad correlation. We cannot predict the max pulse just by
looking at the duration of the workout.
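The three cases can be reproduced on a small data frame. This is a sketch with invented values, not the data set behind the numbers quoted above, so the exact correlations will differ.

```python
import pandas as pd

# Hypothetical workout data, invented for illustration
df = pd.DataFrame({
    "Duration": [30, 45, 60, 60, 90, 120],
    "Calories": [200, 280, 410, 390, 600, 800],
    "Maxpulse": [130, 125, 140, 128, 135, 127],
})

corr = df.corr()

# Each column against itself: always a perfect correlation
duration_self = corr.loc["Duration", "Duration"]

# Calories rise almost in step with Duration: close to 1
duration_calories = corr.loc["Duration", "Calories"]

# Maxpulse shows no clear pattern against Duration: close to 0
duration_maxpulse = corr.loc["Duration", "Maxpulse"]

print(corr)
```

The result of `corr()` is itself a data frame, so individual values can be picked out with `loc` as shown.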
Pandas - Plotting
• Plot every column of the data frame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
• Scatter plot of "Duration" against "Calories":
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind='scatter', x='Duration', y='Calories')
plt.show()
• Scatter plot of "Duration" against "Maxpulse":
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind='scatter', x='Duration', y='Maxpulse')
plt.show()
• Histogram of "Duration":
df["Duration"].plot(kind='hist')
plt.show()
Data Indexing and Selection
• Retrieve a single column with the indexing operator:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving columns by indexing operator
first = data["Age"]
print(first)
• Retrieve multiple columns with the indexing operator:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]
print(first)
• Retrieve single rows by label with loc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
• Retrieve multiple rows by label with loc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)
• Retrieve two rows and three columns with loc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving two rows and three columns by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"],
                 ["Team", "Number", "Position"]]
print(first)
• Retrieve all rows and some columns with loc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)
• Retrieve a single row by position with iloc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving rows by iloc method (position 3 is the fourth row)
row2 = data.iloc[3]
print(row2)
• Retrieve multiple rows by position with iloc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple rows by iloc method
row2 = data.iloc[[3, 5, 7]]
print(row2)
• Retrieve two rows and two columns with iloc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving two rows and two columns by iloc method
row2 = data.iloc[[3, 4], [1, 2]]
print(row2)
• Retrieve all rows and some columns with iloc:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving all rows and some columns by iloc method
row2 = data.iloc[:, [1, 2]]
print(row2)
• The ix indexer accepted both labels and positions. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc for labels:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving row by label (formerly data.ix["Avery Bradley"])
first = data.loc["Avery Bradley"]
print(first)
• Use iloc for positions:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving row by position (formerly data.ix[1])
first = data.iloc[1]
print(first)
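The snippets above depend on nba.csv. To try the same selections without that file, here is a self-contained sketch on a small roster; the names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical roster standing in for nba.csv (all values invented)
data = pd.DataFrame(
    {
        "Team": ["Red", "Red", "Blue", "Blue"],
        "Number": [4, 11, 7, 23],
        "Position": ["PG", "SG", "C", "PF"],
        "Age": [25, 22, 29, 27],
    },
    index=pd.Index(["Ann", "Bob", "Cy", "Dee"], name="Name"),
)

# Indexing operator: one column comes back as a Series
age = data["Age"]

# loc selects by label, iloc by integer position; both pick the same row here
by_label = data.loc["Bob"]
by_position = data.iloc[1]

# Rows and columns together: labels with loc, positions with iloc
subset = data.loc[["Ann", "Cy"], ["Team", "Number"]]
same_subset = data.iloc[[0, 2], [0, 1]]

print(subset)
```

Because loc and iloc never mix labels and positions, it is always clear which kind of lookup a line of code performs.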
Function             Description
DataFrame.head()     Return the top n rows of a data frame.
DataFrame.tail()     Return the bottom n rows of a data frame.
DataFrame.at[]       Access a single value for a row/column label pair.
DataFrame.iat[]      Access a single value for a row/column pair by integer position.
DataFrame.iloc[]     Purely integer-location based indexing for selection by position.
DataFrame.lookup()   Label-based "fancy indexing" function for DataFrame (deprecated in pandas 1.2 and removed in pandas 2.0).
DataFrame.pop()      Return item and drop from frame.
DataFrame.xs()       Return a cross-section (row(s) or column(s)) from the DataFrame.
DataFrame.get()      Get item from object for given key (for example, a DataFrame column).
DataFrame.isin()     Return a boolean DataFrame showing whether each element in the DataFrame is contained in values.
DataFrame.where()    Return an object of the same shape as self whose entries are from self where cond is True and otherwise are from other.
DataFrame.mask()     Return an object of the same shape as self whose entries are from self where cond is False and otherwise are from other.
DataFrame.query()    Query the columns of a frame with a boolean expression.
DataFrame.insert()   Insert a column into the DataFrame at the specified location.
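Several of these functions are easiest to understand side by side. The sketch below applies them to a tiny data frame; the names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame(
    {"Age": [25, 22, 29], "Salary": [50, 40, 70]},
    index=["Ann", "Bob", "Cy"],
)

# at / iat: one value, by label pair or by integer position pair
single_by_label = df.at["Bob", "Salary"]
single_by_position = df.iat[1, 1]

# isin: boolean frame marking entries found in the given values
in_values = df.isin([22, 70])

# where keeps entries where the condition is True (others become NaN);
# mask does the opposite
kept = df.where(df > 24)
hidden = df.mask(df > 24)

# query: filter rows with a boolean expression on column names
older = df.query("Age > 24")

# insert adds a column at a position; pop removes a column and returns it
df.insert(0, "Team", ["Red", "Red", "Blue"])
salary = df.pop("Salary")

print(older)
print(list(df.columns))
```

Note that `insert()` and `pop()` change the data frame in place, while the other calls shown here return new objects.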
Hierarchical indexing, also known as multi-indexing, means setting more than one column as
the index.
import pandas as pd
df = pd.read_csv('homelessness.csv')
print(df.head())
col = df.columns
print(col)
# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])
# sort_index() returns a new data frame, so assign the result
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))
# select all rows for two regions from the outermost index level
df_ind3_region = df_ind3.loc[['Pacific', 'Mountain']]
print(df_ind3_region.head(10))
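To experiment with a hierarchical index without homelessness.csv, the following sketch builds a small stand-in data frame. The column names match the example above, but every value is invented, and only two index levels are used to keep it short.

```python
import pandas as pd

# Hypothetical stand-in for homelessness.csv (all values invented)
df = pd.DataFrame({
    "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
    "state": ["Oregon", "Utah", "Alaska", "Idaho"],
    "individuals": [110, 20, 14, 12],
})

# Two columns become the index levels; sorting makes lookups predictable
indexed = df.set_index(["region", "state"]).sort_index()

# One outer label: all rows of that region, indexed by the remaining level
pacific = indexed.loc["Pacific"]

# A full (region, state) key: a single row, returned as a Series
utah = indexed.loc[("Mountain", "Utah")]

# A list of outer labels: rows for several regions at once
both = indexed.loc[["Pacific", "Mountain"]]

print(indexed)
```

Selecting with a tuple addresses every level at once, while a single label or a list of labels addresses only the outermost level.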
• sum(): Compute sum of column values
• min(): Compute min of column values
• max(): Compute max of column values
• mean(): Compute mean of column
• size(): Compute column sizes
• describe(): Generate descriptive statistics
• first(): Compute first of group values
• last(): Compute last of group values
• count(): Compute count of column values
• std(): Standard deviation of column
• var(): Compute variance of column
• sem(): Standard error of the mean of column
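These functions are typically applied to a column or to groups of rows. Here is a minimal sketch using groupby on invented data; the team names and salaries are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "Team": ["Red", "Red", "Blue", "Blue", "Blue"],
    "Salary": [50, 70, 40, 60, 80],
})

# Group the Salary column by team, then aggregate each group
by_team = df.groupby("Team")["Salary"]
totals = by_team.sum()
averages = by_team.mean()
sizes = by_team.size()

# agg() runs several aggregation functions in one call
summary = by_team.agg(["min", "max", "count", "std"])

# The same functions also work on a whole column without grouping
overall = df["Salary"].describe()

print(summary)
```

Each aggregation returns one value per group, so `totals`, `averages`, and `sizes` are indexed by team name.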