
Foreword

Code snippets and excerpts from the tutorial. Python 3. From DataCamp.


Import the Data

  • sep, the field delimiter to use.
  • delimiter, an alias for sep.
  • names, column names to use.
  • index_col, column to use as the row labels.

  • read_table(), general delimited files.

  • read_excel(), Excel files.
  • read_fwf(), Fixed-Width Formatted data.
  • read_clipboard(), data copied to the clipboard.
  • read_sql(), SQL query.

Input-output documentation.
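
A minimal sketch of these arguments in action (the file name `data.csv` and its layout are assumptions):

import pandas as pd

# Hypothetical semicolon-separated file without a header row
df = pd.read_csv("data.csv",
                 sep=";",                         # field delimiter
                 header=None,                     # no header row in the file
                 names=["id", "value", "label"],  # column names to use
                 index_col="id")                  # column to use as row labels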

%pylab inline
import numpy as np
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

digits

# Load in the data with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra",
                     header=None)

digits.head()
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 64
0 0 1 6 15 12 1 0 0 0 7 ... 0 0 0 6 14 7 1 0 0 0
1 0 0 10 16 6 0 0 0 0 7 ... 0 0 0 10 16 15 3 0 0 0
2 0 0 8 15 16 13 0 0 0 1 ... 0 0 0 9 14 0 0 0 0 7
3 0 0 0 3 11 16 0 0 0 0 ... 0 0 0 0 1 15 2 0 0 4
4 0 0 5 14 4 0 0 0 0 0 ... 0 0 0 4 12 14 7 0 0 6

5 rows × 65 columns

Find out about the dataset on its UCI repository page (optical recognition of handwritten digits).

iris

Another classic dataset.

# Note: iris.data ships without a header row, so the first record is
# consumed as column names here, which is why only 149 of the 150 rows
# remain below; pass header=None to keep them all.
iris = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data")

iris.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

iris.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa

Basic Description of the Data

Describing The Data

iris.dtypes
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Class            object
dtype: object
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count==total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))
print_categories(iris)
Sepal_Length :  Numerical
Sepal_Width :  Numerical
Petal_Length :  Numerical
Petal_Width :  Numerical
Class :  Categorical
digits.describe()
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 64
count 3823.0 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 ... 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000 3823.000000
mean 0.0 0.301334 5.481821 11.805912 11.451478 5.505362 1.387392 0.142297 0.002093 1.960502 ... 0.148313 0.000262 0.283024 5.855872 11.942977 11.461156 6.700497 2.105676 0.202197 4.497253
std 0.0 0.866986 4.631601 4.259811 4.537556 5.613060 3.371444 1.051598 0.088572 3.052353 ... 0.767761 0.016173 0.928046 4.980012 4.334508 4.991934 5.775815 4.028266 1.150694 2.869831
min 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.0 0.000000 1.000000 10.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 10.000000 9.000000 0.000000 0.000000 0.000000 2.000000
50% 0.0 0.000000 5.000000 13.000000 13.000000 4.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 5.000000 13.000000 13.000000 6.000000 0.000000 0.000000 4.000000
75% 0.0 0.000000 9.000000 15.000000 15.000000 10.000000 0.000000 0.000000 0.000000 3.000000 ... 0.000000 0.000000 0.000000 10.000000 15.000000 16.000000 12.000000 2.000000 0.000000 7.000000
max 0.0 8.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 5.000000 15.000000 ... 12.000000 1.000000 10.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 9.000000

8 rows × 65 columns

iris.describe()
Sepal_Length Sepal_Width Petal_Length Petal_Width
count 149.000000 149.000000 149.000000 149.000000
mean 5.848322 3.051007 3.774497 1.205369
std 0.828594 0.433499 1.759651 0.761292
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.400000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
iris[["Sepal_Length", "Sepal_Width"]].describe()
Sepal_Length Sepal_Width
count 149.000000 149.000000
mean 5.848322 3.051007
std 0.828594 0.433499
min 4.300000 2.000000
25% 5.100000 2.800000
50% 5.800000 3.000000
75% 6.400000 3.300000
max 7.900000 4.400000
length = len(digits)
print(length)
3823
count = digits[2].count()
print(count)
3823
number_of_missing_values = length - count
pct_of_missing_values = float(number_of_missing_values / length)
pct_of_missing_values = "{0:.1f}%".format(pct_of_missing_values*100)
print(pct_of_missing_values)
0.0%
print("Minimum value: ", iris["Sepal_Length"].min())
print("Maximum value: ", iris["Sepal_Length"].max())
Minimum value:  4.3
Maximum value:  7.9
print(iris["Sepal_Length"].mode())
0    5.0
dtype: float64
print(iris["Sepal_Length"].mean())
5.848322147651008
print(iris["Sepal_Length"].median())
5.8
print(iris["Sepal_Length"].std())
0.8285940572656172
print(iris["Sepal_Length"].quantile([.25, .5, .75]))
0.25    5.1
0.50    5.8
0.75    6.4
Name: Sepal_Length, dtype: float64
import seaborn as sns
sns.set(color_codes=True)

sns.set_palette(sns.color_palette("muted"))

# Histogram plus kernel density estimate
# (distplot is deprecated in seaborn >= 0.11; use histplot or displot there)
sns.distplot(iris["Sepal_Length"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x7f87b39b0320>

iris[["Sepal_Length", "Sepal_Width"]].corr()
Sepal_Length Sepal_Width
Sepal_Length 1.000000 -0.103784
Sepal_Width -0.103784 1.000000
import pandas_profiling 

# Print a full report
pandas_profiling.ProfileReport(iris)

Overview

Dataset info

Number of variables 5
Number of observations 149
Total Missing (%) 0.0%
Total size in memory 5.9 KiB
Average record size in memory 40.5 B

Variables types

Numeric 3
Categorical 1
Date 0
Text (Unique) 0
Rejected 1

Warnings

  • Petal_Width is highly correlated with Petal_Length (ρ = 0.96231) Rejected
  • Dataset has 3 duplicate rows Warning

Variables

Class
Categorical

Distinct count 3
Unique (%) 2.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)
Iris-virginica 50 33.6%
Iris-versicolor 50 33.6%
Iris-setosa 49 32.9%

Petal_Length
Numeric

Distinct count 43
Unique (%) 28.9%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.7745
Minimum 1
Maximum 6.9
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1.3
Q1 1.6
Median 4.4
Q3 5.1
95-th percentile 6.1
Maximum 6.9
Range 5.9
Interquartile range 3.5

Descriptive statistics

Standard deviation 1.7597
Coef of variation 0.46619
Kurtosis -1.385
Mean 3.7745
MAD 1.5526
Skewness -0.28946
Sum 562.4
Variance 3.0964
Memory size 1.2 KiB
Value Count Frequency (%)
1.5 14 9.4%
1.4 11 7.4%
5.1 8 5.4%
4.5 8 5.4%
1.3 7 4.7%
1.6 7 4.7%
5.6 6 4.0%
4.0 5 3.4%
4.9 5 3.4%
4.7 5 3.4%
Other values (33) 73 49.0%

Minimum 5 values

Value Count Frequency (%)
1.0 1 0.7%
1.1 1 0.7%
1.2 2 1.3%
1.3 7 4.7%
1.4 11 7.4%

Maximum 5 values

Value Count Frequency (%)
6.3 1 0.7%
6.4 1 0.7%
6.6 1 0.7%
6.7 2 1.3%
6.9 1 0.7%

Petal_Width
Highly correlated

This variable is highly correlated with Petal_Length and should be ignored for analysis

Correlation 0.96231

Sepal_Length
Numeric

Distinct count 35
Unique (%) 23.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.8483
Minimum 4.3
Maximum 7.9
Zeros (%) 0.0%

Quantile statistics

Minimum 4.3
5-th percentile 4.6
Q1 5.1
Median 5.8
Q3 6.4
95-th percentile 7.26
Maximum 7.9
Range 3.6
Interquartile range 1.3

Descriptive statistics

Standard deviation 0.82859
Coef of variation 0.14168
Kurtosis -0.55356
Mean 5.8483
MAD 0.68748
Skewness 0.3031
Sum 871.4
Variance 0.68657
Memory size 1.2 KiB
Value Count Frequency (%)
5.0 10 6.7%
6.3 9 6.0%
5.1 8 5.4%
6.7 8 5.4%
5.7 8 5.4%
5.5 7 4.7%
5.8 7 4.7%
6.4 7 4.7%
6.0 6 4.0%
4.9 6 4.0%
Other values (25) 73 49.0%

Minimum 5 values

Value Count Frequency (%)
4.3 1 0.7%
4.4 3 2.0%
4.5 1 0.7%
4.6 4 2.7%
4.7 2 1.3%

Maximum 5 values

Value Count Frequency (%)
7.3 1 0.7%
7.4 1 0.7%
7.6 1 0.7%
7.7 4 2.7%
7.9 1 0.7%

Sepal_Width
Numeric

Distinct count 23
Unique (%) 15.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.051
Minimum 2
Maximum 4.4
Zeros (%) 0.0%

Quantile statistics

Minimum 2
5-th percentile 2.34
Q1 2.8
Median 3
Q3 3.3
95-th percentile 3.8
Maximum 4.4
Range 2.4
Interquartile range 0.5

Descriptive statistics

Standard deviation 0.4335
Coef of variation 0.14208
Kurtosis 0.31865
Mean 3.051
MAD 0.33199
Skewness 0.3501
Sum 454.6
Variance 0.18792
Memory size 1.2 KiB
Value Count Frequency (%)
3.0 26 17.4%
2.8 14 9.4%
3.2 13 8.7%
3.4 12 8.1%
3.1 12 8.1%
2.9 10 6.7%
2.7 9 6.0%
2.5 8 5.4%
3.8 6 4.0%
3.3 6 4.0%
Other values (13) 33 22.1%

Minimum 5 values

Value Count Frequency (%)
2.0 1 0.7%
2.2 3 2.0%
2.3 4 2.7%
2.4 3 2.0%
2.5 8 5.4%

Maximum 5 values

Value Count Frequency (%)
3.9 2 1.3%
4.0 1 0.7%
4.1 1 0.7%
4.2 1 0.7%
4.4 1 0.7%

Sample

Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
# Print a full report
pandas_profiling.ProfileReport(digits)

>>> Full (lengthy) report here!!! <<<

First and Last DataFrame Rows

# Inspect the first 5 rows of `digits`
first = digits.head(5)

# Inspect the last 3 rows
last = digits.tail(3)
print(first)
   0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  62  \
0   0   1   6  15  12   1   0   0   0   7 ...   0   0   0   6  14   7   1   0   
1   0   0  10  16   6   0   0   0   0   7 ...   0   0   0  10  16  15   3   0   
2   0   0   8  15  16  13   0   0   0   1 ...   0   0   0   9  14   0   0   0   
3   0   0   0   3  11  16   0   0   0   0 ...   0   0   0   0   1  15   2   0   
4   0   0   5  14   4   0   0   0   0   0 ...   0   0   0   4  12  14   7   0

   63  64  
0   0   0  
1   0   0  
2   0   7  
3   0   4  
4   0   6

[5 rows x 65 columns]
print(last)
      0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  \
3820   0   0   3  15   0   0   0   0   0   0 ...   0   0   0   4  14  16   9   
3821   0   0   6  16   2   0   0   0   0   0 ...   0   0   0   5  16  16  16   
3822   0   0   2  15  16  13   1   0   0   0 ...   0   0   0   4  14   1   0

      62  63  64  
3820   0   0   6  
3821   5   0   6  
3822   0   0   7

[3 rows x 65 columns]

Sample the Data

# Take a sample of 5
digits.sample(5)
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 64
1249 0 0 14 14 13 15 5 0 0 0 ... 0 0 0 12 16 10 2 0 0 5
3702 0 0 0 9 16 12 2 0 0 0 ... 0 0 0 0 9 14 2 0 0 0
1605 0 0 7 16 13 2 0 0 0 2 ... 0 0 0 5 14 11 1 0 0 0
1890 0 0 3 15 15 5 0 0 0 0 ... 2 0 0 3 15 16 16 13 1 9
1295 0 0 7 15 13 3 0 0 0 0 ... 0 0 0 9 13 12 3 0 0 0

5 rows × 65 columns

# import `sample` from `random`
from random import sample

# Create a random index
randomIndex = np.array(sample(range(len(digits)), 5))

print(randomIndex)
[ 846  569  315 2932 2328]
# Get 5 random rows (`.ix` was removed from modern pandas; with the
# default integer index, `.iloc` is the drop-in replacement)
digitsSample = digits.iloc[randomIndex]

# Print the sample
print(digitsSample)
      0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  \
846    0   5  14  15   9   1   0   0   0   7 ...   0   0   4  12  16  12  10   
569    0   1   7  12  12   0   0   0   0   3 ...   0   0   0  10  16  13   7   
315    0   1   6  13  13   4   0   0   0   9 ...   0   0   0   4  14  16   9   
2932   0   0   4  12  10   1   0   0   0   0 ...   0   0   0   4  12  11   3   
2328   0   0   4  15  16  16  16  15   0   0 ...   0   0   0   5  15   3   0

      62  63  64  
846    4   0   2  
569    0   0   3  
315    2   0   2  
2932   0   0   0  
2328   0   0   7

[5 rows x 65 columns]

Queries

iris.head(2)
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
# Petal length greater than sepal length? (returns an empty frame:
# no flower in this dataset has a petal longer than its sepal)
iris.query('Petal_Length > Sepal_Length')
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
# reverse
iris.query('Sepal_Length > Petal_Length')
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 1.6 0.2 Iris-setosa
11 4.8 3.0 1.4 0.1 Iris-setosa
12 4.3 3.0 1.1 0.1 Iris-setosa
13 5.8 4.0 1.2 0.2 Iris-setosa
14 5.7 4.4 1.5 0.4 Iris-setosa
15 5.4 3.9 1.3 0.4 Iris-setosa
16 5.1 3.5 1.4 0.3 Iris-setosa
17 5.7 3.8 1.7 0.3 Iris-setosa
18 5.1 3.8 1.5 0.3 Iris-setosa
19 5.4 3.4 1.7 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
21 4.6 3.6 1.0 0.2 Iris-setosa
22 5.1 3.3 1.7 0.5 Iris-setosa
23 4.8 3.4 1.9 0.2 Iris-setosa
24 5.0 3.0 1.6 0.2 Iris-setosa
25 5.0 3.4 1.6 0.4 Iris-setosa
26 5.2 3.5 1.5 0.2 Iris-setosa
27 5.2 3.4 1.4 0.2 Iris-setosa
28 4.7 3.2 1.6 0.2 Iris-setosa
29 4.8 3.1 1.6 0.2 Iris-setosa
... ... ... ... ... ...
119 6.9 3.2 5.7 2.3 Iris-virginica
120 5.6 2.8 4.9 2.0 Iris-virginica
121 7.7 2.8 6.7 2.0 Iris-virginica
122 6.3 2.7 4.9 1.8 Iris-virginica
123 6.7 3.3 5.7 2.1 Iris-virginica
124 7.2 3.2 6.0 1.8 Iris-virginica
125 6.2 2.8 4.8 1.8 Iris-virginica
126 6.1 3.0 4.9 1.8 Iris-virginica
127 6.4 2.8 5.6 2.1 Iris-virginica
128 7.2 3.0 5.8 1.6 Iris-virginica
129 7.4 2.8 6.1 1.9 Iris-virginica
130 7.9 3.8 6.4 2.0 Iris-virginica
131 6.4 2.8 5.6 2.2 Iris-virginica
132 6.3 2.8 5.1 1.5 Iris-virginica
133 6.1 2.6 5.6 1.4 Iris-virginica
134 7.7 3.0 6.1 2.3 Iris-virginica
135 6.3 3.4 5.6 2.4 Iris-virginica
136 6.4 3.1 5.5 1.8 Iris-virginica
137 6.0 3.0 4.8 1.8 Iris-virginica
138 6.9 3.1 5.4 2.1 Iris-virginica
139 6.7 3.1 5.6 2.4 Iris-virginica
140 6.9 3.1 5.1 2.3 Iris-virginica
141 5.8 2.7 5.1 1.9 Iris-virginica
142 6.8 3.2 5.9 2.3 Iris-virginica
143 6.7 3.3 5.7 2.5 Iris-virginica
144 6.7 3.0 5.2 2.3 Iris-virginica
145 6.3 2.5 5.0 1.9 Iris-virginica
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica

149 rows × 5 columns

# alternatively
iris[iris.Sepal_Length > iris.Petal_Length]
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 1.6 0.2 Iris-setosa
11 4.8 3.0 1.4 0.1 Iris-setosa
12 4.3 3.0 1.1 0.1 Iris-setosa
13 5.8 4.0 1.2 0.2 Iris-setosa
14 5.7 4.4 1.5 0.4 Iris-setosa
15 5.4 3.9 1.3 0.4 Iris-setosa
16 5.1 3.5 1.4 0.3 Iris-setosa
17 5.7 3.8 1.7 0.3 Iris-setosa
18 5.1 3.8 1.5 0.3 Iris-setosa
19 5.4 3.4 1.7 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
21 4.6 3.6 1.0 0.2 Iris-setosa
22 5.1 3.3 1.7 0.5 Iris-setosa
23 4.8 3.4 1.9 0.2 Iris-setosa
24 5.0 3.0 1.6 0.2 Iris-setosa
25 5.0 3.4 1.6 0.4 Iris-setosa
26 5.2 3.5 1.5 0.2 Iris-setosa
27 5.2 3.4 1.4 0.2 Iris-setosa
28 4.7 3.2 1.6 0.2 Iris-setosa
29 4.8 3.1 1.6 0.2 Iris-setosa
... ... ... ... ... ...
119 6.9 3.2 5.7 2.3 Iris-virginica
120 5.6 2.8 4.9 2.0 Iris-virginica
121 7.7 2.8 6.7 2.0 Iris-virginica
122 6.3 2.7 4.9 1.8 Iris-virginica
123 6.7 3.3 5.7 2.1 Iris-virginica
124 7.2 3.2 6.0 1.8 Iris-virginica
125 6.2 2.8 4.8 1.8 Iris-virginica
126 6.1 3.0 4.9 1.8 Iris-virginica
127 6.4 2.8 5.6 2.1 Iris-virginica
128 7.2 3.0 5.8 1.6 Iris-virginica
129 7.4 2.8 6.1 1.9 Iris-virginica
130 7.9 3.8 6.4 2.0 Iris-virginica
131 6.4 2.8 5.6 2.2 Iris-virginica
132 6.3 2.8 5.1 1.5 Iris-virginica
133 6.1 2.6 5.6 1.4 Iris-virginica
134 7.7 3.0 6.1 2.3 Iris-virginica
135 6.3 3.4 5.6 2.4 Iris-virginica
136 6.4 3.1 5.5 1.8 Iris-virginica
137 6.0 3.0 4.8 1.8 Iris-virginica
138 6.9 3.1 5.4 2.1 Iris-virginica
139 6.7 3.1 5.6 2.4 Iris-virginica
140 6.9 3.1 5.1 2.3 Iris-virginica
141 5.8 2.7 5.1 1.9 Iris-virginica
142 6.8 3.2 5.9 2.3 Iris-virginica
143 6.7 3.3 5.7 2.5 Iris-virginica
144 6.7 3.0 5.2 2.3 Iris-virginica
145 6.3 2.5 5.0 1.9 Iris-virginica
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica

149 rows × 5 columns

The Challenges of Data

Missing Values

# Identify missing values
pd.isnull(digits)
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 64
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
5 False False False False False False False False False False ... False False False False False False False False False False
6 False False False False False False False False False False ... False False False False False False False False False False
7 False False False False False False False False False False ... False False False False False False False False False False
8 False False False False False False False False False False ... False False False False False False False False False False
9 False False False False False False False False False False ... False False False False False False False False False False
10 False False False False False False False False False False ... False False False False False False False False False False
11 False False False False False False False False False False ... False False False False False False False False False False
12 False False False False False False False False False False ... False False False False False False False False False False
13 False False False False False False False False False False ... False False False False False False False False False False
14 False False False False False False False False False False ... False False False False False False False False False False
15 False False False False False False False False False False ... False False False False False False False False False False
16 False False False False False False False False False False ... False False False False False False False False False False
17 False False False False False False False False False False ... False False False False False False False False False False
18 False False False False False False False False False False ... False False False False False False False False False False
19 False False False False False False False False False False ... False False False False False False False False False False
20 False False False False False False False False False False ... False False False False False False False False False False
21 False False False False False False False False False False ... False False False False False False False False False False
22 False False False False False False False False False False ... False False False False False False False False False False
23 False False False False False False False False False False ... False False False False False False False False False False
24 False False False False False False False False False False ... False False False False False False False False False False
25 False False False False False False False False False False ... False False False False False False False False False False
26 False False False False False False False False False False ... False False False False False False False False False False
27 False False False False False False False False False False ... False False False False False False False False False False
28 False False False False False False False False False False ... False False False False False False False False False False
29 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3793 False False False False False False False False False False ... False False False False False False False False False False
3794 False False False False False False False False False False ... False False False False False False False False False False
3795 False False False False False False False False False False ... False False False False False False False False False False
3796 False False False False False False False False False False ... False False False False False False False False False False
3797 False False False False False False False False False False ... False False False False False False False False False False
3798 False False False False False False False False False False ... False False False False False False False False False False
3799 False False False False False False False False False False ... False False False False False False False False False False
3800 False False False False False False False False False False ... False False False False False False False False False False
3801 False False False False False False False False False False ... False False False False False False False False False False
3802 False False False False False False False False False False ... False False False False False False False False False False
3803 False False False False False False False False False False ... False False False False False False False False False False
3804 False False False False False False False False False False ... False False False False False False False False False False
3805 False False False False False False False False False False ... False False False False False False False False False False
3806 False False False False False False False False False False ... False False False False False False False False False False
3807 False False False False False False False False False False ... False False False False False False False False False False
3808 False False False False False False False False False False ... False False False False False False False False False False
3809 False False False False False False False False False False ... False False False False False False False False False False
3810 False False False False False False False False False False ... False False False False False False False False False False
3811 False False False False False False False False False False ... False False False False False False False False False False
3812 False False False False False False False False False False ... False False False False False False False False False False
3813 False False False False False False False False False False ... False False False False False False False False False False
3814 False False False False False False False False False False ... False False False False False False False False False False
3815 False False False False False False False False False False ... False False False False False False False False False False
3816 False False False False False False False False False False ... False False False False False False False False False False
3817 False False False False False False False False False False ... False False False False False False False False False False
3818 False False False False False False False False False False ... False False False False False False False False False False
3819 False False False False False False False False False False ... False False False False False False False False False False
3820 False False False False False False False False False False ... False False False False False False False False False False
3821 False False False False False False False False False False ... False False False False False False False False False False
3822 False False False False False False False False False False ... False False False False False False False False False False

3823 rows × 65 columns
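
Rather than scanning the full Boolean frame, the same check can be done in aggregate (a small convenience, not in the original tutorial):

# Missing values per column, then over the whole frame
print(digits.isnull().sum().head())
print(digits.isnull().sum().sum())   # 0: no missing cells at all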

Delete

# Drop rows with missing values (`df` stands in for any DataFrame)
df.dropna(axis=0)

# Drop columns with missing values 
df.dropna(axis=1)
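
`dropna()` also takes `thresh` and `subset` arguments; a small sketch on the same generic `df`:

# Keep only rows that have at least 60 non-missing values
df.dropna(axis=0, thresh=60)

# Only look at `Salary` when deciding which rows to drop
df.dropna(subset=["Salary"])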

Impute

Imputation options: the mean, the median, another variable, or an estimate from regression, ANOVA, logit, or k-NN (see the sketch at the end of this subsection).

# Import NumPy
import numpy as np

# Calculate the mean of the DataFrame variable Salary
mean = np.mean(df.Salary)

# Replace missing values with the mean
# (assign back to the column; `df = df.Salary.fillna(...)` would replace
# the whole DataFrame with a single Series)
df.Salary = df.Salary.fillna(mean)

# Or fill from neighbouring rows instead of with a fixed value
df.Salary = df.Salary.fillna(method='ffill')
  • ffill and bfill for forward and backward fill.
# SciPy must be installed: pandas calls into it for methods like 'cubic'
from scipy import interpolate

# Fill the DataFrame by interpolating between the known values
df.interpolate()

df.interpolate(method='cubic')
  • cubic, polynomial.
  • limit and limit_direction.
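
The k-NN imputation mentioned above can be sketched with scikit-learn's `KNNImputer` (available from scikit-learn 0.22 on); the toy frame is an assumption:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df_num = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                       "b": [2.0, np.nan, 6.0, 8.0]})

# Each missing cell is filled from the mean of its k nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df_num), columns=df_num.columns)
print(filled)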

Outliers

Delete them (when they stem from data-entry or processing errors), transform them (assign weights, or take the natural log to reduce variation), or impute them (replace extreme values with the median, mean, or mode), as sketched below.
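
One common detection rule is Tukey's 1.5 × IQR fence; a minimal sketch on a copy of `iris` (the column choice and the median replacement are assumptions):

# Flag values outside the 1.5 * IQR fences around the quartiles
tmp = iris.copy()
q1, q3 = tmp["Sepal_Width"].quantile([.25, .75])
iqr = q3 - q1
outliers = (tmp["Sepal_Width"] < q1 - 1.5 * iqr) | (tmp["Sepal_Width"] > q3 + 1.5 * iqr)

# One option: replace the flagged values with the median
tmp.loc[outliers, "Sepal_Width"] = tmp["Sepal_Width"].median()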

The Data’s Features

Feature Engineering

Increase the predictive power of learning algorithms by creating features from raw data that will help the learning process.

Encode categorical variables into numerical ones

# Factorize the values (`labels`: integer codes, `levels`: the original classes)
labels, levels = pd.factorize(iris.Class)

# Save the encoded variables in `iris.Class`
iris.Class = labels

# Print out the first rows
iris.Class.head()
0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64
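
When the categories carry no natural order, one-hot encoding is a common alternative to integer codes; a sketch (on the original string labels this would yield columns like `Class_Iris-setosa`):

# One indicator column per class
dummies = pd.get_dummies(iris.Class, prefix="Class")
dummies.head()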

Bin continuous variables in groups

# Define the bins (extend one step past the max so the oldest ages get a bin)
mybins = range(0, int(df.age.max()) + 10, 10)

# Cut the data from the DataFrame with the help of the bins
df['age_bucket'] = pd.cut(df.age, bins=mybins)

# Count the number of values per bucket
df['age_bucket'].value_counts()
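
`pd.qcut()` is the quantile-based sibling: it picks the bin edges itself so that each bucket holds roughly the same number of rows:

# Four equally populated age buckets
df['age_quartile'] = pd.qcut(df.age, q=4)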

Scale features

Center the data around 0 and scale it to unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)

rescaledX = scaler.transform(X)
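
To make the snippet above runnable on this dataset, `X` can be taken from the four numeric iris columns (a sketch, not part of the original tutorial):

from sklearn.preprocessing import StandardScaler

X = iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']]

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# Each column now has mean ~0 and unit variance
print(rescaledX.mean(axis=0).round(6))
print(rescaledX.std(axis=0).round(6))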

Feature Selection

Select the key subset of original data features in an attempt to reduce the dimensionality of the training problem.

PCA combines similar (correlated) attributes and creates new ones that are considered superior to the original attributes of the dataset.

Feature selection doesn’t combine attributes: it evaluates the quality and predictive power and selects the best set.

To find important features, calculate how much better or worse a model does when we leave one variable out of the equation.

# Import `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# Isolate Data, class labels and column values
X = iris.iloc[:,0:4]
Y = iris.iloc[:,-1]
names = iris.columns.values

# Build the model
rfc = RandomForestClassifier()

# Fit the model
rfc.fit(X, Y)

# Print the results
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rfc.feature_importances_), names), reverse=True))
Features sorted by their score:
[(0.4899, 'Petal_Length'), (0.2752, 'Petal_Width'), (0.2185, 'Sepal_Length'), (0.016400000000000001, 'Sepal_Width')]

The best feature set is one that includes the petal length and petal width data.
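
Closer to the "leave one variable out" idea described above is permutation importance (scikit-learn 0.22+), which shuffles one column at a time and measures how much the score drops; a hedged sketch reusing the fitted `rfc`:

from sklearn.inspection import permutation_importance

result = permutation_importance(rfc, X, Y, n_repeats=10, random_state=0)

# Features whose shuffling hurts the score most come first
for i in result.importances_mean.argsort()[::-1]:
    print(names[i], round(result.importances_mean[i], 4))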

# Isolate feature importances 
importance = rfc.feature_importances_

# Sort the feature importances 
sorted_importances = np.argsort(importance)

# Insert padding
padding = np.arange(len(names)-1) + 0.5

# Plot the data
plt.barh(padding, importance[sorted_importances], align='center')

# Customize the plot
plt.yticks(padding, names[sorted_importances])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")

# Show the plot
plt.show()

Patterns In the Data

Visualization of the data; static with Matplotlib or Seaborn, interactive with Bokeh or Plotly.

Correlation Identification with PCA from scikit-learn

Matplotlib

Dimensionality Reduction techniques, such as Principal Component Analysis (PCA). From ‘many’ to two ‘principal components’.

# Import `PCA` from `sklearn.decomposition`
from sklearn.decomposition import PCA

# Build the model
pca = PCA(n_components=2)

# Reduce the data, output is ndarray
reduced_data = pca.fit_transform(digits)

# Inspect shape of the `reduced_data`
reduced_data.shape

# print out the reduced data
print(reduced_data)
[[ 12.65674168  -4.63610357]
 [ 16.82906354 -12.96575346]
 [-19.08072301  10.58293767]
 ..., 
 [ 23.90693984   6.06265415]
 [ 29.1798759   -3.06847144]
 [-25.23132536  11.60863909]]
reduced_data = pd.DataFrame(reduced_data)
import matplotlib.pyplot as plt

plt.scatter(reduced_data[0], reduced_data[1])

plt.show()
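
Note that the PCA above was fit on all 65 columns, label included. A variant that reduces only the 64 pixel columns and colours each point by its digit (an illustrative sketch, not part of the original):

# Fit PCA on the pixels only and colour points by digit label
pixels = digits.iloc[:, :64]
labels = digits[64]

reduced = PCA(n_components=2).fit_transform(pixels)

plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap='tab10', s=5)
plt.colorbar(label='digit')
plt.show()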

Bokeh

Interactive plots that can be embedded in a web page, for example.

# `bokeh.charts` was removed in later Bokeh releases; see the
# `bokeh.plotting` sketch after the GIF below
from bokeh.charts import Scatter, output_file, show

# Construct the scatter plot
p = Scatter(iris, x='Petal_Length', y='Petal_Width', color="Class", title="Petal Length vs Petal Width",
            xlabel="Petal Length", ylabel="Petal Width")

# Output the file 
output_file('scatter.html')

# Show the scatter plot
show(p)

The GIF output:

[Animated GIF of the interactive Bokeh scatter plot]


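Since `bokeh.charts` was removed from Bokeh, the same plot can be sketched with `bokeh.plotting` (the manual class-to-colour mapping is an assumption; it relies on `Class` having been factorized to 0/1/2 earlier):

from bokeh.plotting import figure, output_file, show

# Map each factorized class code to a colour by hand
colors = [('red', 'green', 'blue')[c] for c in iris.Class]

p = figure(title="Petal Length vs Petal Width",
           x_axis_label="Petal Length", y_axis_label="Petal Width")
p.circle(iris.Petal_Length, iris.Petal_Width, color=colors, size=6)

output_file('scatter.html')
show(p)
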
Correlation Identification with Pandas

The Pearson correlation assumes that the variables are normally distributed, that the relationship between them is linear, and that the data is normally distributed about the regression line.

The Spearman correlation, on the other hand, applies to two ordinal variables, or to two variables that are related monotonically rather than linearly. It is computed as ρ = 1 − 6 Σ dᵢ² / (n(n² − 1)), where dᵢ is the difference between the ranks of the i-th pair of observations and n is the number of pairs.

The Kendall Tau correlation represents the degree of concordance between two columns of ranked data: τ = (number of concordant pairs − number of discordant pairs) / (total number of pairs), with n(n − 1)/2 pairs in total.

Spearman's coefficient is usually larger in magnitude than Kendall's Tau, but not always: Spearman's coefficient is very sensitive to large rank deviations among the observations, which pull it down, and this sensitivity might come in handy in some cases!

The last two correlation measures operate on ranked data.
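
The same coefficients can be cross-checked pair by pair with SciPy (a sketch, not in the original tutorial):

from scipy.stats import spearmanr, kendalltau

rho, _ = spearmanr(iris.Sepal_Length, iris.Petal_Length)
tau, _ = kendalltau(iris.Sepal_Length, iris.Petal_Length)
print(round(rho, 6), round(tau, 6))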

# Pearson correlation
iris.corr()
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
Sepal_Length 1.000000 -0.103784 0.871283 0.816971 0.781219
Sepal_Width -0.103784 1.000000 -0.415218 -0.350733 -0.414532
Petal_Length 0.871283 -0.415218 1.000000 0.962314 0.948519
Petal_Width 0.816971 -0.350733 0.962314 1.000000 0.956014
Class 0.781219 -0.414532 0.948519 0.956014 1.000000
iris2 = iris.rank()
# Kendall Tau correlation (pandas ranks the data internally, so ranking
# first is not strictly required, but it makes the input explicit)
iris2.corr('kendall')
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
Sepal_Length 1.000000 -0.067636 0.718290 0.654197 0.669163
Sepal_Width -0.067636 1.000000 -0.175665 -0.140207 -0.327228
Petal_Length 0.718290 -0.175665 1.000000 0.803041 0.822578
Petal_Width 0.654197 -0.140207 0.803041 1.000000 0.837934
Class 0.669163 -0.327228 0.822578 0.837934 1.000000
# Spearman Rank correlation
iris2.corr('spearman')
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
Sepal_Length 1.000000 -0.152136 0.881759 0.833586 0.796546
Sepal_Width -0.152136 1.000000 -0.294020 -0.267686 -0.426319
Petal_Length 0.881759 -0.294020 1.000000 0.936188 0.935220
Petal_Width 0.833586 -0.267686 0.936188 1.000000 0.937409
Class 0.796546 -0.426319 0.935220 0.937409 1.000000