Scaling, Centering, Noise with kNN, Linear Regression, Logit

Foreword

Code snippets and excerpts from the tutorial. Python 3. From DataCamp.

Load and explore the Wine dataset¶

We use the wine quality dataset related to red and white vinho verde wine samples, from the north of Portugal.

# import the modules
%pylab inline
import pandas as pd
import matplotlib.pyplot as plt

# set the style
plt.style.use('ggplot')

1	`Populating the interactive namespace from numpy and matplotlib`

# import the data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ' , sep = ';')
df.head(3)

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5

# drop target variable
# only keep the values; the DataFrame becomes a simple array (matrix)
# index (axis=0 / ‘index’) or columns (axis=1 / ‘columns’).
X = df.drop('quality' , axis=1).values

# print the array
print(X)

[[  7.4     0.7     0.    ...,   3.51    0.56    9.4  ]
 [  7.8     0.88    0.    ...,   3.2     0.68    9.8  ]
 [  7.8     0.76    0.04  ...,   3.26    0.65    9.8  ]
 ..., 
 [  6.3     0.51    0.13  ...,   3.42    0.75   11.   ]
 [  5.9     0.645   0.12  ...,   3.57    0.71   10.2  ]
 [  6.      0.31    0.47  ...,   3.39    0.66   11.   ]]

The last column is gone from the array. Make it a list instead (or a single-row array).

y1 = df['quality'].values

# print the single-row array
print(y1)

1	`[5 5 5 ..., 6 5 6]`

# row, col of the DataFrame
df.shape

1	`(1599, 12)`

# plot all the columns or variables
pd.DataFrame.hist(df, figsize = [15,15]);

plt.show()

Notice the range of each variable; some are wider.

Any algorithm, such as k-NN, which cares about the distance between data points. This motivates scaling our data.

Let us turn it into a two-category variable consisting of ‘good’ (rating > 5) & ‘bad’ (rating <= 5) qualities.

print(y1)

1	`[5 5 5 ..., 6 5 6]`

# is the rating <= 5 ?
y = y1 <= 5
print(y)

1	`[ True True True ..., False True False]`

True is worth 1 and False is worth 0.

# plot two histograms
# the original target variable
# and the aggregated target variable
plt.figure(figsize=(20,5));

# left plot
plt.subplot(1, 2, 1 );
plt.hist(y1);
plt.xlabel('original target value')
plt.ylabel('count')

# right plot
plt.subplot(1, 2, 2);
plt.hist(y)
plt.xlabel('aggregated target value')
plt.show()

Again, on the right histogram, True = 1 and False = 0.

k-Nearest Neighbours¶

Measure performance¶

Accuracy is the default scoring method for both

k-Nearest Neighbours and
logistic regression.

$\text{Accuracy}=\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$

Accuracy is commonly defined for binary classification problems in terms of true positives & false negatives. It can also be defined in terms of a confusion matrix.

Other measures of model performance are derived from the confusion matrix: precision (true positives divided by the number of true & false positives) and recall (number of true positives divided by the number of true positives plus the number of false negatives).

The F1-score is the harmonic mean of the precision and the recall.

Train-test split and performance in practice¶

The rule of thumb is to use approximately

80% of the data for training (train set) and
20% for testing (test set).

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42)

# the k-NN model
from sklearn import neighbors, linear_model

knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
knn_model_1 = knn.fit(X_train, y_train)

print('k-NN score for test set: %f' % knn_model_1.score(X_test, y_test))
print('k-NN score for training set: %f' % knn_model_1.score(X_train, y_train))

1 2	`k-NN score for test set: 0.612500 k-NN score for training set: 0.774042`

The accuracy, more specifically the test accuracy, is not great.

Let us print out all the other performance measures for the test set.

from sklearn.metrics import classification_report

y_true, y_pred = y_test, knn_model_1.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320

Other performance measures for the train set.

y_true, y_pred = y_train, knn_model_1.predict(X_train)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.80      0.76      0.78       676
       True       0.75      0.79      0.77       603

avg / total       0.78      0.77      0.77      1279

These underperformances might come from the spread in the variables. The range of each variable is different; some are wider.

Preprocessing: scaling and centering the data¶

Preprocessing happens before running any model, such as a regression (predicting a continuous variable) or a classification (predicting a discrete variable) using one or another model (k-NN, logistic, decision tree, random forests etc.).

For numerical variables, it is common to either normalize or standardize the data.

Normalization: scaling a dataset so that its minimum is 0 and its maximum 1.

$x_{normalized} = \frac{x-x_{min}}{x_{max}-x_{min}}$

Stardardization: centering the data around 0 and to scale with respect to the standard deviation.

$x_{standardized} = \frac{x-\mu}{\sigma}$

where $\mu$ and $\sigma$ are the mean and standard deviation of the dataset.

There are other transformatoions, such as the log transformation or the Box-Cox transformation, to make the data look more Gaussian or a normally distributed.

k-NN: scaling in practice¶

Scale the data¶

print(X)

[[  7.4     0.7     0.    ...,   3.51    0.56    9.4  ]
 [  7.8     0.88    0.    ...,   3.2     0.68    9.8  ]
 [  7.8     0.76    0.04  ...,   3.26    0.65    9.8  ]
 ..., 
 [  6.3     0.51    0.13  ...,   3.42    0.75   11.   ]
 [  5.9     0.645   0.12  ...,   3.57    0.71   10.2  ]
 [  6.      0.31    0.47  ...,   3.39    0.66   11.   ]]

from sklearn.preprocessing import scale

# minimum is 0 and its maximum 1
Xs = scale(X)
print(Xs)

[[-0.52835961  0.96187667 -1.39147228 ...,  1.28864292 -0.57920652
  -0.96024611]
 [-0.29854743  1.96744245 -1.39147228 ..., -0.7199333   0.1289504
  -0.58477711]
 [-0.29854743  1.29706527 -1.18607043 ..., -0.33117661 -0.04808883
  -0.58477711]
 ..., 
 [-1.1603431  -0.09955388 -0.72391627 ...,  0.70550789  0.54204194
   0.54162988]
 [-1.39015528  0.65462046 -0.77526673 ...,  1.6773996   0.30598963
  -0.20930812]
 [-1.33270223 -1.21684919  1.02199944 ...,  0.51112954  0.01092425
   0.54162988]]

Run the k-NN¶

from sklearn.cross_validation import train_test_split

# split
# 80% of the data for training (train set)
# 20% for testing (test set)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs,
                                                      y,
                                                      test_size=0.2,
                                                      random_state=42)

# Run
knn_model_2 = knn.fit(Xs_train, y_train)

Measure the performance¶

print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))
print('k-NN score for training set: %f' % knn_model_2.score(Xs_train, y_train))

1 2	`k-NN score for test set: 0.712500 k-NN score for training set: 0.814699`

y_true, y_pred = y_test, knn_model_2.predict(Xs_test)

# Test set
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.72      0.79      0.75       179
       True       0.70      0.62      0.65       141

avg / total       0.71      0.71      0.71       320

y_true, y_pred = y_train, knn_model_2.predict(Xs_train)

# Train set
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.80      0.86      0.83       676
       True       0.83      0.77      0.80       603

avg / total       0.82      0.81      0.81      1279

Normalization-scaling improves the performance compare to the previous classification_report.

k-NN Recap¶

Without scaling¶

# Set sc = False 
# Do not scale the features 
sc = False
# Set the number of k in k-NN
nk = 5

# Load data 
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ' , sep = ';') 
# Drop target variable 
X = df.drop('quality' , 1).values

# Scale, if desired 
if sc == True: 
  X = scale(X) 

# Target value 
y1 = df['quality'].values # original target variable 
# New target variable: is the rating <= 5?
y = y1 <= 5 

# Split (80/20) the data into a test set and a train set
# X_train, X_test, y_train, y_test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42) 

# Train the k-NN model
knn = neighbors.KNeighborsClassifier(n_neighbors = nk)
knn_model = knn.fit(X_train, y_train)

# Print performance on the test set 
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
y_true, y_pred = y_test, knn_model.predict(X_test) 
print(classification_report(y_true, y_pred))

k-NN accuracy for test set: 0.612500
             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320

With scaling¶

# Set sc = True 
# to scale the features 
sc = True
# Set the number of k in k-NN
nk = 5

# Load data 
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ' , sep = ';') 
# Drop target variable 
X = df.drop('quality' , 1).values

# Scale, if desired 
if sc == True: 
  X = scale(X) 

# Target value 
y1 = df['quality'].values # original target variable 
# New target variable: is the rating <= 5?
y = y1 <= 5 

# Split (80/20) the data into a test set and a train set
# X_train, X_test, y_train, y_test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42) 

# Train the k-NN model
knn = neighbors.KNeighborsClassifier(n_neighbors = nk)
knn_model = knn.fit(X_train, y_train)

# Print performance on the test set 
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
y_true, y_pred = y_test, knn_model.predict(X_test) 
print(classification_report(y_true, y_pred))

k-NN accuracy for test set: 0.712500
             precision    recall  f1-score   support

      False       0.72      0.79      0.75       179
       True       0.70      0.62      0.65       141

avg / total       0.71      0.71      0.71       320

Linear regression¶

Before addressing an alternative to k-NN, the logistic regression or Logit, let us briefly review the linear regresion with a different dataset.

# Import necessary packages
%pylab inline
import pandas as pd
import matplotlib.pyplot as plt

# set the style
plt.style.use('ggplot')

# Import nmore packages
from sklearn import datasets
from sklearn import linear_model
import numpy as np

1	`Populating the interactive namespace from numpy and matplotlib`

# Load the data
# The data is part of the scikit-learn module
boston = datasets.load_boston()
yb = boston.target.reshape(-1, 1)
Xb = boston['data'][:,5].reshape(-1, 1)

print(yb[:10])

[[ 24. ]
 [ 21.6]
 [ 34.7]
 [ 33.4]
 [ 36.2]
 [ 28.7]
 [ 22.9]
 [ 27.1]
 [ 16.5]
 [ 18.9]]

print(Xb[:10])

[[ 6.575]
 [ 6.421]
 [ 7.185]
 [ 6.998]
 [ 7.147]
 [ 6.43 ]
 [ 6.012]
 [ 6.172]
 [ 5.631]
 [ 6.004]]

# Plot data
plt.scatter(Xb,yb)
plt.ylabel('value of house /1000 ($)')
plt.xlabel('number of rooms')

1	`<matplotlib.text.Text at 0x7f3681ae90b8>`

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit( Xb, yb)

1	`LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)`

# Plot outputs
plt.scatter(Xb, yb,  color='black')
plt.plot(Xb, regr.predict(Xb), color='blue',
         linewidth=3)
plt.show()

Logistic regression (Logit)¶

With random numbers¶

# Synthesize data
X1 = np.random.normal(size=150)
y1 = (X1 > 0).astype(np.float)
X1[X1 > 0] *= 4
X1 += .3 * np.random.normal(size=150)
X1 = X1.reshape(-1, 1)

# Run the classifier
clf = linear_model.LogisticRegression()
clf.fit(X1, y1)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

X1[:10]

array([[-0.74466839],
       [ 0.47335714],
       [-1.94951938],
       [ 0.12078443],
       [-1.62121705],
       [-2.23684396],
       [ 7.66984914],
       [-0.31941781],
       [-1.07205326],
       [ 0.85413978]])

# Order X1
X1_ordered = sorted(X1, reverse=False)

X1_ordered[:10]

[array([-3.29826361]),
 array([-2.76292445]),
 array([-2.23684396]),
 array([-1.96629089]),
 array([-1.94951938]),
 array([-1.87501025]),
 array([-1.83321548]),
 array([-1.73611093]),
 array([-1.62121705]),
 array([-1.61885181])]

# Plot the result
plt.scatter(X1.ravel(), y1, color='black', zorder=20 , alpha = 0.5)
plt.plot(X1_ordered, clf.predict_proba(X1_ordered)[:,1], color='blue' , linewidth = 3)
plt.ylabel('target variable')
plt.xlabel('predictor variable')
plt.show()

With the Wine dataset¶

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ' , sep = ';')

df.head(3)

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5

# Drop target variable
X = df.drop('quality' , 1).values

# Print the array
print(X)

[[  7.4     0.7     0.    ...,   3.51    0.56    9.4  ]
 [  7.8     0.88    0.    ...,   3.2     0.68    9.8  ]
 [  7.8     0.76    0.04  ...,   3.26    0.65    9.8  ]
 ..., 
 [  6.3     0.51    0.13  ...,   3.42    0.75   11.   ]
 [  5.9     0.645   0.12  ...,   3.57    0.71   10.2  ]
 [  6.      0.31    0.47  ...,   3.39    0.66   11.   ]]

The last column is gone.

y1 = df['quality'].values

# Print the single-row array
print(y1)

1	`[5 5 5 ..., 6 5 6]`

df.shape

1	`(1599, 12)`

# plot the other columns or variables
pd.DataFrame.hist(df, figsize = [15,15]);

plt.show() # facultative in Jypyter

Let us turn it into a two-category variable consisting of ‘good’ (rating > 5) & ‘bad’ (rating <= 5) qualities.

# is the rating <= 5 ?
y = y1 <= 5
print(y)

1	`[ True True True ..., False True False]`

from sklearn.cross_validation import train_test_split

# split
# 80% of the data for training (train set)
# 20% for testing (test set)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

from sklearn import linear_model

# Initial logistic regression model
lr = linear_model.LogisticRegression()

# Fit the model
lr = lr.fit(X_train, y_train)
y_true, y_pred = y_train, lr.predict(X_train)

# Evaluate the train set
print('Logistic Regression score for train set: %f' % lr.score(X_train, y_train))

1	`Logistic Regression score for train set: 0.752932`

print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.77      0.75      0.76       676
       True       0.73      0.75      0.74       603

avg / total       0.75      0.75      0.75      1279

from sklearn.metrics import classification_report

# Use the test set
y_true, y_pred = y_test, lr.predict(X_test)

# Evaluate the test set
print('Logistic Regression score for test set: %f' % lr.score(X_test, y_test))

1	`Logistic Regression score for test set: 0.740625`

print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.78      0.74      0.76       179
       True       0.69      0.74      0.71       141

avg / total       0.74      0.74      0.74       320

Note: the logistic regression performs better than k-NN without scaling.

Scale the data¶

print(X)

[[  7.4     0.7     0.    ...,   3.51    0.56    9.4  ]
 [  7.8     0.88    0.    ...,   3.2     0.68    9.8  ]
 [  7.8     0.76    0.04  ...,   3.26    0.65    9.8  ]
 ..., 
 [  6.3     0.51    0.13  ...,   3.42    0.75   11.   ]
 [  5.9     0.645   0.12  ...,   3.57    0.71   10.2  ]
 [  6.      0.31    0.47  ...,   3.39    0.66   11.   ]]

from sklearn.preprocessing import scale

Xs = scale(X)
print(Xs)

[[-0.52835961  0.96187667 -1.39147228 ...,  1.28864292 -0.57920652
  -0.96024611]
 [-0.29854743  1.96744245 -1.39147228 ..., -0.7199333   0.1289504
  -0.58477711]
 [-0.29854743  1.29706527 -1.18607043 ..., -0.33117661 -0.04808883
  -0.58477711]
 ..., 
 [-1.1603431  -0.09955388 -0.72391627 ...,  0.70550789  0.54204194
   0.54162988]
 [-1.39015528  0.65462046 -0.77526673 ...,  1.6773996   0.30598963
  -0.20930812]
 [-1.33270223 -1.21684919  1.02199944 ...,  0.51112954  0.01092425
   0.54162988]]

Run the Logit and measure the performance¶

from sklearn.cross_validation import train_test_split

# Split 80/20
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs,
                                                      y,
                                                      test_size=0.2,
                                                      random_state=42)

# Run the logistic regression model
lr_2 = lr.fit(Xs_train, y_train)

# Fit the model
y_true, y_pred = y_train, lr_2.predict(Xs_train)

# Evaluate the train set
print('Logistic Regression score for train set: %f' % lr_2.score(Xs_train, y_train))

1	`Logistic Regression score for train set: 0.752150`

print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.77      0.76      0.76       676
       True       0.73      0.75      0.74       603

avg / total       0.75      0.75      0.75      1279

# Use the test set
y_true, y_pred = y_test, lr_2.predict(Xs_test)

# Evaluate the test set
print('Logistic Regression score for test set: %f' % lr_2.score(Xs_test, y_test))

1	`Logistic Regression score for test set: 0.740625`

print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.79      0.74      0.76       179
       True       0.69      0.74      0.72       141

avg / total       0.74      0.74      0.74       320

This is very interesting! The performance of logistic regression did not improve with data scaling.

Predictor variables with large ranges that do not effect the target variable, a regression algorithm will make the corresponding coefficients small so that they do not effect predictions so much.

Logit Recap¶

Without scaling¶

# Set sc = False
# do not scale the features 
sc = False 

# Load the data 
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ' , sep = ';') 
X = df.drop('quality' , 1).values # drop target variable 

# Scale, if desired 
if sc == True: 
  X = scale(X) 

# Target value 
y1 = df['quality'].values # original target variable 
y = y1 <= 5  # new target variable: is the rating <= 5? 

# Split (80/20) the data into a test set and a train
# X_train, X_test, y_train, y_test
train_test_split(X, y, test_size=0.2, random_state=42) 

# Train logistic regression model 
lr = linear_model.LogisticRegression() 
lr = lr.fit(X_train, y_train) 

# Print performance on the test set
print('Logistic Regression score for training set: %f' % lr.score(X_train, y_train)) 
y_true, y_pred = y_test, lr.predict(X_test) 
print(classification_report(y_true, y_pred))

Logistic Regression score for training set: 0.752932
             precision    recall  f1-score   support

      False       0.78      0.74      0.76       179
       True       0.69      0.74      0.71       141

avg / total       0.74      0.74      0.74       320

Noise and scaling¶

The noisier the symthesized data, the more important scaling will be.

Measurements can be in meters and and miles, with small or large ranges. If we scale the data, they end up being the same.

scikit-learn’s make_blobs function to generate 2000 data points that are in 4 clusters (each data point has 2 predictor variables and 1 target variable).

%pylab inline

1	`Populating the interactive namespace from numpy and matplotlib`

# Generate some clustered data (blobs!)
import numpy as np
from sklearn.datasets.samples_generator import make_blobs

n_samples=2000
X, y = make_blobs(n_samples, centers=4, n_features=2, random_state=0)

print(X)

[[-0.46530384  1.73299482]
 [-0.33963733  3.84220272]
 [ 2.25309569  0.99541446]
 ..., 
 [ 1.03616476  4.09126428]
 [-0.5901088   3.68821314]
 [ 2.30405277  4.20250584]]

print(y)

1	`[2 0 1 ..., 0 2 0]`

Plotting the synthesized data¶

Each axis is a predictor variable and the colour is a key to the target variable

All possible target variables are equally represented. In this case (or even if they are approximately equally represented), we say that the class y is balanced.

import matplotlib.pyplot as plt

plt.style.use('ggplot')

plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.scatter(X[:,0] , X[:,1],  c = y, alpha = 0.7);
plt.subplot(1, 2, 2);
plt.hist(y)

plt.show()

Plot histograms of the features.

import pandas as pd

# Convert to a DataFrame
df = pd.DataFrame(X)

# Plot it
pd.DataFrame.hist(df, figsize=(20,5))

1 2	`array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f366d3dbba8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f366d30ca58>]], dtype=object)`

Split into test & train sets, and plot both sets (train set > test set; 80/20).

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.title('training set')
plt.scatter(X_train[:,0] , X_train[:,1],  c = y_train, alpha = 0.7);
plt.subplot(1, 2, 2);
plt.scatter(X_test[:,0] , X_test[:,1],  c = y_test, alpha = 0.7);
plt.title('test set')

plt.show()

k-Nearest Neighbours¶

Let’s instantiate a k-Nearest Neighbours classifier and train it on our train set.

from sklearn import neighbors, linear_model

knn = neighbors.KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)

Fit the knn_model to the test set and compute the accuracy.

knn_model.score(X_test, y_test)

1	`0.93500000000000005`

print('k-NN score for test set: %f' % knn_model.score(X_test, y_test))

1	`k-NN score for test set: 0.935000`

Check out a variety of other metrics.

from sklearn.metrics import classification_report

y_true, y_pred = y_test, knn_model.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       0.87      0.90      0.88       106
          1       0.98      0.93      0.95       102
          2       0.90      0.92      0.91       100
          3       1.00      1.00      1.00        92

avg / total       0.94      0.94      0.94       400

Re-fit knn_model to the train set and compute the accuracy.

print('k-NN score for train set: %f' % knn_model.score(X_train, y_train))

1	`k-NN score for train set: 0.941875`

from sklearn.metrics import classification_report

y_true, y_pred = y_train, knn_model.predict(X_train)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       0.88      0.90      0.89       394
          1       0.97      0.96      0.96       398
          2       0.94      0.93      0.93       400
          3       0.99      0.98      0.98       408

avg / total       0.94      0.94      0.94      1600

Scale the data, run the k-NN, and measure the performance¶

print(X)

[[-0.46530384  1.73299482]
 [-0.33963733  3.84220272]
 [ 2.25309569  0.99541446]
 ..., 
 [ 1.03616476  4.09126428]
 [-0.5901088   3.68821314]
 [ 2.30405277  4.20250584]]

from sklearn.preprocessing import scale

Xs = scale(X)
print(Xs)

[[-0.26508542 -0.82638395]
 [-0.19594894 -0.0519305 ]
 [ 1.23046484 -1.09720678]
 ..., 
 [ 0.5609601   0.03951927]
 [-0.33374791 -0.10847199]
 [ 1.25849931  0.08036466]]

from sklearn.cross_validation import train_test_split

Xs_train, Xs_test, y_train, y_test = train_test_split(Xs,
                                                      y,
                                                      test_size=0.2,
                                                      random_state=42)

plt.figure(figsize=(20,5));

plt.subplot(1, 2, 1 );
plt.scatter(Xs_train[:,0] , Xs_train[:,1],  c = y_train, alpha = 0.7);
plt.title('scaled training set')

plt.subplot(1, 2, 2);
plt.scatter(Xs_test[:,0] , Xs_test[:,1],  c = y_test, alpha = 0.7);
plt.title('scaled test set')

plt.show()

knn_model_s = knn.fit(Xs_train, y_train)

print('k-NN score for test set: %f' % knn_model_s.score(Xs_test, y_test))

1	`k-NN score for test set: 0.935000`

It doesn’t perform any better with scaling.

This is most likely because both features were already around the same range.

Add noise to the signal¶

Adding a third variable of Gaussian noise with mean 0 and variable standard deviation $\sigma$ . We call $\sigma$ the strength of the noise and we see that the stronger the noise, the worse the performance of k-Nearest Neighbours.

# Strength of noise term
ns = 10**(3)

# Add noise column to predictor variables
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis = 1)

print(Xn)

[[ -4.65303843e-01   1.73299482e+00  -9.41949646e+01]
 [ -3.39637332e-01   3.84220272e+00  -1.00446506e+03]
 [  2.25309569e+00   9.95414462e-01   2.95697211e+02]
 ..., 
 [  1.03616476e+00   4.09126428e+00  -1.16020635e+02]
 [ -5.90108797e-01   3.68821314e+00   5.60244701e+02]
 [  2.30405277e+00   4.20250584e+00  -8.97600798e+02]]

Plot the 3D data.

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111, projection='3d' , alpha = 0.5)
ax.scatter(Xn[:,0], Xn[:,1], Xn[:,2], c = y)

1	`<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f366d409cf8>`

Run the k-NN and measure the performance¶

# Split into train-test sets
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn,
                                                      y, 
                                                      test_size=0.2, 
                                                      random_state=42)

# Run the model
knn = neighbors.KNeighborsClassifier()
knn_model = knn.fit(Xn_train, y_train)

# Evaluate
print('k-NN score for test set: %f' % knn_model.score(Xn_test, y_test))

1	`k-NN score for test set: 0.337500`

Horrible!

Scale the data, add noise, run the k-NN, and measure the performance¶

# Scale
Xns = scale(Xn)

print(Xns)

[[-0.26508542 -0.82638395 -0.07164275]
 [-0.19594894 -0.0519305  -0.98584539]
 [ 1.23046484 -1.09720678  0.31993383]
 ..., 
 [ 0.5609601   0.03951927 -0.09356271]
 [-0.33374791 -0.10847199  0.58562421]
 [ 1.25849931  0.08036466 -0.87851945]]

# Apply noise
s = int(.2*n_samples)
Xns_train = Xns[s:]
y_train = y[s:]
Xns_test = Xns[:s]
y_test = y[:s]

# Run the model
knn = neighbors.KNeighborsClassifier()
knn_models = knn.fit(Xns_train, y_train)

# Evaluate
print('k-NN score for test set: %f' % knn_models.score(Xns_test, y_test))

1	`k-NN score for test set: 0.917500`

After scaling the data, the model performs nearly as well as were there no noise introduced.

Noise strength vs. accuracy (and the need for scaling)¶

How the noise strength can effect model accuracy?

Create a function to split the data and run the model.

Use the function in a loop.

def accu( X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)

    knn = neighbors.KNeighborsClassifier()
    knn_model = knn.fit(X_train, y_train)

    return(knn_model.score(X_test, y_test))

# Set the variables
noise = [10**i for i in np.arange(0,6)]
A1 = np.zeros(len(noise))
A2 = np.zeros(len(noise))
count = 0

print(noise)

1	`[1, 10, 100, 1000, 10000, 100000]`

print(A1)
print(A2)

1 2	`[ 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0.]`

# Run the loop
for ns in noise:
    newcol = np.transpose([ns*np.random.randn(n_samples)])
    Xn = np.concatenate((X, newcol), axis = 1)
    Xns = scale(Xn)
    A1[count] = accu( Xn, y)
    A2[count] = accu( Xns, y)
    count += 1

# Plot the results
plt.scatter( noise, A1 )
plt.plot( noise, A1, label = 'unscaled', linewidth = 2)
plt.scatter( noise, A2 , c = 'r')
plt.plot( noise, A2 , label = 'scaled', linewidth = 2)
plt.xscale('log')
plt.xlabel('Noise strength')
plt.ylabel('Accuracy')
plt.legend(loc=3);

print(A1)
print(A2)

1 2	`[ 0.9225 0.9175 0.8025 0.3275 0.22 0.2525] [ 0.91 0.9175 0.9325 0.9075 0.9325 0.92 ]`

The more noise there is in the nuisance variable, the more important it is to scale the data for the k-NN model.

More noise, more scaling.

Logit (Repeat the k-NN procedure)¶

# Change the exponent of 10 to alter the amount of noise
ns = 10**(3) # Strength of noise term

# Set sc = True if we want to scale the features
sc = True

# Import packages
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn import neighbors, linear_model
from sklearn.preprocessing import scale
from sklearn.datasets.samples_generator import make_blobs

# Generate some data
n_samples=2000
X, y = make_blobs(n_samples, 
                  centers=4, 
                  n_features=2,
                  random_state=0)

# Add noise column to predictor variables
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis = 1)

# Scale if desired
if sc == True:
    Xn = scale(Xn)

# Train model and test after splitting
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
lr = linear_model.LogisticRegression()
lr_model = lr.fit(Xn_train, y_train)
print('logistic regression score for test set: %f' % lr_model.score(Xn_test, y_test))

1	`logistic regression score for test set: 0.942500`