Foreword
Code snippets and excerpts from the DataCamp tutorial, in Python 3.
Import the Data
Useful read_csv() arguments:

sep: the delimiter to use.
delimiter: an alternative name for sep.
names: column names to use.
index_col: column to use as the row labels.

Other readers:

read_table(): general delimited files.
read_excel(): Excel files.
read_fwf(): fixed-width formatted data.
read_clipboard(): data copied to the clipboard.
read_sql(): SQL query.

See the input-output documentation.
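A quick sketch of how these arguments combine; the file name and columns here are hypothetical, not from the tutorial:

import pandas as pd

# Hypothetical semicolon-separated file without a header row
df = pd.read_csv("people.csv",
                 sep=";",                        # the delimiter to use
                 names=["id", "name", "score"],  # column names to use
                 index_col="id")                 # column to use as the row labels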
%pylab inline
import numpy as np
import pandas as pd
Populating the interactive namespace from numpy and matplotlib
digits
# Load in the data with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra",
                     header=None)
digits.head()
   0  1   2   3   4   5  6  7  8  9 ...  55 56 57  58  59  60 61 62 63 64
0  0  1   6  15  12   1  0  0  0  7 ...   0  0  0   6  14   7  1  0  0  0
1  0  0  10  16   6   0  0  0  0  7 ...   0  0  0  10  16  15  3  0  0  0
2  0  0   8  15  16  13  0  0  0  1 ...   0  0  0   9  14   0  0  0  0  7
3  0  0   0   3  11  16  0  0  0  0 ...   0  0  0   0   1  15  2  0  0  4
4  0  0   5  14   4   0  0  0  0  0 ...   0  0  0   4  12  14  7  0  0  6

5 rows × 65 columns
Find out about the dataset.
iris
Another classical dataset.
iris = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data")
iris.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']
iris.head()

Note that read_csv() consumes the first data row as a header here, which is why this DataFrame ends up with 149 rows (and only 49 Iris-setosa in the class counts later on); passing header=None together with names=... would keep all 150 rows.
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width        Class
0           4.9          3.0           1.4          0.2  Iris-setosa
1           4.7          3.2           1.3          0.2  Iris-setosa
2           4.6          3.1           1.5          0.2  Iris-setosa
3           5.0          3.6           1.4          0.2  Iris-setosa
4           5.4          3.9           1.7          0.4  Iris-setosa
Basic Description of the Data
Describing the Data
iris.dtypes

Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Class            object
dtype: object
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count == total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))

print_categories(iris)
Sepal_Length : Numerical
Sepal_Width : Numerical
Petal_Length : Numerical
Petal_Width : Numerical
Class : Categorical
digits.describe()

            0            1            2  ...           63           64
count  3823.0  3823.000000  3823.000000  ...  3823.000000  3823.000000
mean      0.0     0.301334     5.481821  ...     0.202197     4.497253
std       0.0     0.866986     4.631601  ...     1.150694     2.869831
min       0.0     0.000000     0.000000  ...     0.000000     0.000000
25%       0.0     0.000000     1.000000  ...     0.000000     2.000000
50%       0.0     0.000000     5.000000  ...     0.000000     4.000000
75%       0.0     0.000000     9.000000  ...     0.000000     7.000000
max       0.0     8.000000    16.000000  ...    16.000000     9.000000

8 rows × 65 columns
iris.describe()

       Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
count    149.000000   149.000000    149.000000   149.000000
mean       5.848322     3.051007      3.774497     1.205369
std        0.828594     0.433499      1.759651     0.761292
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.400000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
iris[["Sepal_Length", "Sepal_Width"]].describe()

       Sepal_Length  Sepal_Width
count    149.000000   149.000000
mean       5.848322     3.051007
std        0.828594     0.433499
min        4.300000     2.000000
25%        5.100000     2.800000
50%        5.800000     3.000000
75%        6.400000     3.300000
max        7.900000     4.400000
length = len(digits)
print(length)

count = digits[2].count()
print(count)

number_of_missing_values = length - count
pct_of_missing_values = float(number_of_missing_values / length)
pct_of_missing_values = "{0:.1f}%".format(pct_of_missing_values * 100)
print(pct_of_missing_values)

3823
3823
0.0%
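The same check can be written once for every column at a time; a minimal sketch, not part of the original tutorial:

# Percentage of missing values in each column of `digits`
missing_pct = digits.isnull().mean() * 100
print(missing_pct.max())  # 0.0 here: this dataset has no missing values at all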
print("Minimum value: ", iris["Sepal_Length"].min())
print("Maximum value: ", iris["Sepal_Length"].max())

Minimum value:  4.3
Maximum value:  7.9

print(iris["Sepal_Length"].mode())
print(iris["Sepal_Length"].mean())
print(iris["Sepal_Length"].median())
print(iris["Sepal_Length"].std())
print(iris["Sepal_Length"].quantile([.25, .5, .75]))

0.25    5.1
0.50    5.8
0.75    6.4
Name: Sepal_Length, dtype: float64
import seaborn as sns
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
sns.distplot(iris["Sepal_Length"].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x7f87b39b0320>
iris[["Sepal_Length", "Sepal_Width"]].corr()

              Sepal_Length  Sepal_Width
Sepal_Length      1.000000    -0.103784
Sepal_Width      -0.103784     1.000000
import pandas_profiling

# Print a full report
pandas_profiling.ProfileReport(iris)
Dataset info: 5 variables, 149 observations, 0.0% missing values, 5.9 KiB total in memory (40.5 B per record).
Variable types: 3 numeric, 1 categorical, 0 date, 0 text (unique), 1 rejected.
Warnings: Petal_Width is highly correlated with Petal_Length (ρ = 0.96231) and is rejected; the dataset has 3 duplicate rows.

Class: categorical, 3 distinct values; Iris-virginica (50), Iris-versicolor (50), Iris-setosa (49).
Petal_Length: 43 distinct values; mean 3.7745, std 1.7597, min 1, Q1 1.6, median 4.4, Q3 5.1, max 6.9; skewness -0.28946, kurtosis -1.385.
Sepal_Length: 35 distinct values; mean 5.8483, std 0.82859, min 4.3, Q1 5.1, median 5.8, Q3 6.4, max 7.9; skewness 0.3031, kurtosis -0.55356.
Sepal_Width: 23 distinct values; mean 3.051, std 0.4335, min 2, Q1 2.8, median 3, Q3 3.3, max 4.4; skewness 0.3501, kurtosis 0.31865.
Petal_Width: rejected because of its high correlation with Petal_Length, so the report hides its detailed statistics.
The report closes with a sample of the first five rows of the DataFrame.
# Print a full report
pandas_profiling.ProfileReport(digits)
>>> Full (lengthy) report here!!! <<<
First and Last DataFrame Rows
# Inspect the first 5 rows of `digits`
first = digits.head(5)
print(first)

# Inspect the last 3 rows
last = digits.tail(3)
print(last)
   0  1   2   3   4   5  6  7  8  9 ...  55 56 57  58  59  60 61 62  \
0  0  1   6  15  12   1  0  0  0  7 ...   0  0  0   6  14   7  1  0
1  0  0  10  16   6   0  0  0  0  7 ...   0  0  0  10  16  15  3  0
2  0  0   8  15  16  13  0  0  0  1 ...   0  0  0   9  14   0  0  0
3  0  0   0   3  11  16  0  0  0  0 ...   0  0  0   0   1  15  2  0
4  0  0   5  14   4   0  0  0  0  0 ...   0  0  0   4  12  14  7  0

   63  64
0   0   0
1   0   0
2   0   7
3   0   4
4   0   6

[5 rows x 65 columns]
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 \
3820 0 0 3 15 0 0 0 0 0 0 ... 0 0 0 4 14 16 9
3821 0 0 6 16 2 0 0 0 0 0 ... 0 0 0 5 16 16 16
3822 0 0 2 15 16 13 1 0 0 0 ... 0 0 0 4 14 1 0
62 63 64
3820 0 0 6
3821 5 0 6
3822 0 0 7
[3 rows x 65 columns]
Sample the Data
# Take a sample of 5
digits.sample(5)
      0  1   2   3   4   5   6   7  8  9 ...  55 56 57  58  59  60  61  62 63 64
1249  0  0  14  14  13  15   5   0  0  0 ...   0  0  0  12  16  10   2   0  0  5
3702  0  0   0   9  16  12   2   0  0  0 ...   0  0  0   0   9  14   2   0  0  0
1605  0  0   7  16  13   2   0   0  0  2 ...   0  0  0   5  14  11   1   0  0  0
1890  0  0   3  15  15   5   0   0  0  0 ...   2  0  0   3  15  16  16  13  1  9
1295  0  0   7  15  13   3   0   0  0  0 ...   0  0  0   9  13  12   3   0  0  0

5 rows × 65 columns
# Import `sample` from `random`
from random import sample

# Create a random index
randomIndex = np.array(sample(range(len(digits)), 5))
print(randomIndex)

# Get 5 random rows (`.ix` is gone from recent pandas; `.iloc` takes positions)
digitsSample = digits.iloc[randomIndex]

# Print the sample
print(digitsSample)
[ 846  569  315 2932 2328]
      0  1   2   3   4   5   6   7  8  9 ...  55 56 57  58  59  60  61  62 63 64
846   0  5  14  15   9   1   0   0  0  7 ...   0  0  4  12  16  12  10   4  0  2
569   0  1   7  12  12   0   0   0  0  3 ...   0  0  0  10  16  13   7   0  0  3
315   0  1   6  13  13   4   0   0  0  9 ...   0  0  0   4  14  16   9   2  0  2
2932  0  0   4  12  10   1   0   0  0  0 ...   0  0  0   4  12  11   3   0  0  0
2328  0  0   4  15  16  16  16  15  0  0 ...   0  0  0   5  15   3   0   0  0  7

[5 rows x 65 columns]
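Both approaches return different rows on every run. When reproducibility matters, sample() accepts a random_state seed; a one-line sketch, not from the original tutorial:

# Fix the seed so the same 5 rows come back every time
digits.sample(5, random_state=42)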
Queries
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width        Class
0           4.9          3.0           1.4          0.2  Iris-setosa
1           4.7          3.2           1.3          0.2  Iris-setosa
# Petal length greater than sepal length?
iris.query('Petal_Length > Sepal_Length')

   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width  Class

0 rows × 5 columns
# reverse
iris.query('Sepal_Length > Petal_Length')
     Sepal_Length  Sepal_Width  Petal_Length  Petal_Width           Class
0             4.9          3.0           1.4          0.2     Iris-setosa
1             4.7          3.2           1.3          0.2     Iris-setosa
2             4.6          3.1           1.5          0.2     Iris-setosa
3             5.0          3.6           1.4          0.2     Iris-setosa
4             5.4          3.9           1.7          0.4     Iris-setosa
..            ...          ...           ...          ...             ...
144           6.7          3.0           5.2          2.3  Iris-virginica
145           6.3          2.5           5.0          1.9  Iris-virginica
146           6.5          3.0           5.2          2.0  Iris-virginica
147           6.2          3.4           5.4          2.3  Iris-virginica
148           5.9          3.0           5.1          1.8  Iris-virginica

149 rows × 5 columns
# alternatively
iris[iris.Sepal_Length > iris.Petal_Length]
(The output is identical to the query() result above: the same 149 rows × 5 columns.)
The Challenges of Data
Missing Values
# Identify missing values
pd.isnull(digits)
          0      1      2  ...     62     63     64
0     False  False  False  ...  False  False  False
1     False  False  False  ...  False  False  False
2     False  False  False  ...  False  False  False
...     ...    ...    ...  ...    ...    ...    ...
3820  False  False  False  ...  False  False  False
3821  False  False  False  ...  False  False  False
3822  False  False  False  ...  False  False  False

3823 rows × 65 columns
Delete
# Drop rows with missing values
df.dropna(axis=0)

# Drop columns with missing values
df.dropna(axis=1)
Impute
Imputation options: the mean, the median, another variable, or a model-based estimate (regression, ANOVA, logit, k-NN); a k-NN sketch follows the interpolation example below.
# Import NumPy
import numpy as np

# Calculate the mean of the DataFrame variable Salary
mean = np.mean(df.Salary)

# Replace missing values with the mean
df['Salary'] = df.Salary.fillna(mean)

# Or propagate the last valid observation forward instead
df['Salary'] = df.Salary.fillna(method='ffill')

Use method='ffill' for forward fill and method='bfill' for backward fill.
# Fill the DataFrame by interpolating between known values
df.interpolate()

# Non-linear interpolation (requires SciPy to be installed)
df.interpolate(method='cubic')

Other method values include 'cubic' and 'polynomial'; the limit and limit_direction parameters control how many consecutive NaNs are filled, and in which direction.
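The k-NN option mentioned above can be sketched with scikit-learn's KNNImputer (available from scikit-learn 0.22 onward); the Salary/Age frame here is hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame with one missing salary
df = pd.DataFrame({"Salary": [30000.0, np.nan, 45000.0, 50000.0],
                   "Age": [25, 30, 35, 40]})

# Each missing value is filled in from its 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)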
Outliers
Delete them (data entry or processing errors), transform them (assign weights, take the natural log to reduce variation), or impute them (replace the extreme values with the median, mean, or mode); see the sketch below.
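A minimal sketch of the detect-then-impute route, using a z-score cutoff of 3; the threshold is a common rule of thumb, not something the tutorial prescribes:

import numpy as np

# Flag values more than 3 standard deviations from the column mean
col = iris["Sepal_Length"]
z = (col - col.mean()) / col.std()
print(iris[np.abs(z) > 3])

# One possible treatment: replace the extremes with the median
iris.loc[np.abs(z) > 3, "Sepal_Length"] = col.median()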
The Data’s Features
Feature Engineering
Increase the predictive power of learning algorithms by creating features from raw data that will help the learning process.
Encode categorical variables into numerical ones
# Factorize the values
labels, levels = pd.factorize(iris.Class)

# Save the encoded variables in `iris.Class`
iris.Class = labels

# Print out the first rows
iris.Class.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64
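When the categories carry no order, one-hot encoding is a common alternative to the integer codes above; a short sketch with pd.get_dummies() (run it before the factorization, since it expects the original string labels):

# One 0/1 indicator column per class label
dummies = pd.get_dummies(iris.Class, prefix="Class")
print(dummies.head())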
Bin continuous variables in groups
# Define the bins (extend one step past the maximum so the top ages are not dropped)
mybins = range(0, df.age.max() + 10, 10)

# Cut the data from the DataFrame with the help of the bins
df['age_bucket'] = pd.cut(df.age, bins=mybins)

# Count the number of values per bucket
df['age_bucket'].value_counts()
Scale features
Center the data around 0.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
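A quick way to verify the transform: after standardization every column should have mean ≈ 0 and standard deviation ≈ 1.

# Sanity-check the standardization
print(rescaledX.mean(axis=0))  # ~0 for every column
print(rescaledX.std(axis=0))   # ~1 for every column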
Feature Selection
Select the key subset of original data features in an attempt to reduce the dimensionality of the training problem.
PCA combines similar (correlated) attributes and creates new ones that are considered superior to the original attributes of the dataset.
Feature selection doesn’t combine attributes: it evaluates each attribute’s quality and predictive power and selects the best subset.
To find important features, calculate how much better or worse a model does when we leave one variable out of the equation.
# Import `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# Isolate data, class labels and column values
X = iris.iloc[:, 0:4]
Y = iris.iloc[:, -1]
names = iris.columns.values

# Build the model
rfc = RandomForestClassifier()

# Fit the model
rfc.fit(X, Y)

# Print the results
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rfc.feature_importances_), names), reverse=True))
Features sorted by their score:
[(0.4899, 'Petal_Length'), (0.2752, 'Petal_Width'), (0.2185, 'Sepal_Length'), (0.016400000000000001, 'Sepal_Width')]
The best feature set is one that includes the petal length and petal width data.
# Isolate feature importances
importance = rfc.feature_importances_

# Sort the feature importances
sorted_importances = np.argsort(importance)

# Insert padding
padding = np.arange(len(names) - 1) + 0.5

# Plot the data
plt.barh(padding, importance[sorted_importances], align='center')

# Customize the plot
plt.yticks(padding, names[sorted_importances])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")

# Show the plot
plt.show()
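The random forest above scores features by impurity-based importance. The leave-one-variable-out idea mentioned earlier can be sketched with cross-validation scores; this variant is an illustration, not part of the original tutorial:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Baseline accuracy with all four features
full_score = cross_val_score(RandomForestClassifier(), X, Y, cv=5).mean()

# Drop in accuracy when each feature is left out in turn
for column in X.columns:
    score = cross_val_score(RandomForestClassifier(), X.drop(columns=column), Y, cv=5).mean()
    print(column, round(full_score - score, 4))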
Patterns In the Data
Visualization of the data; static with Matplotlib or Seaborn, interactive with Bokeh or Plotly.
Correlation Identification with PCA from scikit-learn
Matplotlib
Dimensionality Reduction techniques, such as Principal Component Analysis (PCA). From ‘many’ to two ‘principal components’.
# Import `PCA` from `sklearn.decomposition`
from sklearn.decomposition import PCA

# Build the model
pca = PCA(n_components=2)

# Reduce the data, output is an ndarray
reduced_data = pca.fit_transform(digits)

# Inspect the shape of `reduced_data`
reduced_data.shape

# Print out the reduced data
print(reduced_data)
[[ 12.65674168 -4.63610357]
[ 16.82906354 -12.96575346]
[-19.08072301 10.58293767]
...,
[ 23.90693984 6.06265415]
[ 29.1798759 -3.06847144]
[-25.23132536 11.60863909]]
reduced_data = pd.DataFrame(reduced_data)

import matplotlib.pyplot as plt

plt.scatter(reduced_data[0], reduced_data[1])
plt.show()
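To judge how much information the two-dimensional picture keeps, inspect the fitted model's explained_variance_ratio_ attribute (a standard scikit-learn PCA attribute):

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)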
Bokeh
Bokeh renders interactive plots that can be embedded in a webpage, for example.
from bokeh.charts import Scatter, output_file, show

# Construct the scatter plot
p = Scatter(iris, x='Petal_Length', y='Petal_Width', color="Class",
            title="Petal Length vs Petal Width",
            xlabel="Petal Length", ylabel="Petal Width")

# Output the file
output_file('scatter.html')

# Show the scatter plot
show(p)
The output (shown as a GIF in the original tutorial) is an interactive scatter plot written to scatter.html.
Correlation Identification with Pandas
The Pearson correlation assumes that the variables are normally distributed, that the relationship between them is linear, and that the data are evenly spread about the regression line.
The Spearman correlation, on the other hand, applies to two ordinal variables, or to two variables that are related but not linearly. It is computed from the squared differences between the ranks: ρ = 1 − 6 Σ dᵢ² / (n(n² − 1)), where dᵢ is the rank difference for observation i.
The Kendall Tau correlation is a coefficient that represents the degree of concordance between two columns of ranked data: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. Like Spearman’s coefficient, it measures the degree of association between two variables.
Spearman’s coefficient will usually be larger in magnitude than Kendall’s Tau, but this is not always the case: Spearman’s coefficient comes out smaller when some observations deviate hugely in rank, because it is very sensitive to such discrepancies, and that sensitivity can come in handy.
The two last correlation measures require ranking the data.
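A small sketch that checks the rank-difference formula against pandas on one pair of columns; the formula is exact only when there are no tied ranks, so expect a close but not identical value:

import numpy as np

# Spearman's rho from rank differences: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
x = iris["Sepal_Length"].rank()
y = iris["Sepal_Width"].rank()
d = x - y
n = len(iris)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
print(rho)
print(iris["Sepal_Length"].corr(iris["Sepal_Width"], method="spearman"))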
# Pearson correlation
iris.corr()

              Sepal_Length  Sepal_Width  Petal_Length  Petal_Width     Class
Sepal_Length      1.000000    -0.103784      0.871283     0.816971  0.781219
Sepal_Width      -0.103784     1.000000     -0.415218    -0.350733 -0.414532
Petal_Length      0.871283    -0.415218      1.000000     0.962314  0.948519
Petal_Width       0.816971    -0.350733      0.962314     1.000000  0.956014
Class             0.781219    -0.414532      0.948519     0.956014  1.000000
iris2 = iris.rank()

# Kendall Tau correlation
iris2.corr('kendall')

              Sepal_Length  Sepal_Width  Petal_Length  Petal_Width     Class
Sepal_Length      1.000000    -0.067636      0.718290     0.654197  0.669163
Sepal_Width      -0.067636     1.000000     -0.175665    -0.140207 -0.327228
Petal_Length      0.718290    -0.175665      1.000000     0.803041  0.822578
Petal_Width       0.654197    -0.140207      0.803041     1.000000  0.837934
Class             0.669163    -0.327228      0.822578     0.837934  1.000000
# Spearman Rank correlation
iris2.corr('spearman')

              Sepal_Length  Sepal_Width  Petal_Length  Petal_Width     Class
Sepal_Length      1.000000    -0.152136      0.881759     0.833586  0.796546
Sepal_Width      -0.152136     1.000000     -0.294020    -0.267686 -0.426319
Petal_Length      0.881759    -0.294020      1.000000     0.936188  0.935220
Petal_Width       0.833586    -0.267686      0.936188     1.000000  0.937409
Class             0.796546    -0.426319      0.935220     0.937409  1.000000