We use the wine quality dataset related to red and white vinho verde wine samples, from the north of Portugal.
# import the modules
%pylab inline
import pandas as pd
import matplotlib.pyplot as plt

# set the style
plt.style.use('ggplot')
Populating the interactive namespace from numpy and matplotlib
# import the data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
df.head(3)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
# drop target variable
# only keep the values; the DataFrame becomes a simple array (matrix)
# index (axis=0 / ‘index’) or columns (axis=1 / ‘columns’).
X = df.drop('quality', axis=1).values

# print the array
print(X)
The last column (quality) is no longer part of the array. We keep it separately as a one-dimensional array (a single row of values).
y1 = df['quality'].values

# print the single-row array
print(y1)
[5 5 5 ..., 6 5 6]
# row, col of the DataFrame
df.shape
(1599, 12)
# plot all the columns or variables
pd.DataFrame.hist(df, figsize=[15, 15]);
plt.show()
Notice the range of each variable; some are wider.
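Any algorithm that relies on the distance between data points, such as k-NN, is sensitive to these differences in range: a variable with a wide range contributes more to the distance than a variable with a narrow one. This motivates scaling our data.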
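To make the point concrete, here is a minimal sketch with made-up numbers (they are not taken from the wine data). The first feature is assumed to span roughly 3 to 4, the second roughly 0 to 300. The point `q` differs from `p` by the whole range of the narrow feature, while `r` differs by only a small part of the wide one, yet the Euclidean distance ranks `r` as much further away.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical samples: (narrow-range feature, wide-range feature)
p = (3.0, 30.0)
q = (4.0, 32.0)  # differs by the full range of the narrow feature
r = (3.0, 40.0)  # same narrow feature, small relative change in the wide one

d_pq = euclidean(p, q)
d_pr = euclidean(p, r)
print('distance p-q: %f' % d_pq)
print('distance p-r: %f' % d_pr)
```

Without scaling, a distance-based model would treat `q` as the closer neighbour of `p`, simply because of the units involved.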
Let us turn the target variable into a two-category variable consisting of ‘good’ (rating > 5) and ‘bad’ (rating <= 5) qualities.
print(y1)
[5 5 5 ..., 6 5 6]
# is the rating <= 5 ?
y = y1 <= 5
print(y)
[ True True True ..., False True False]
True is worth 1 and False is worth 0.
# plot two histograms
# the original target variable
# and the aggregated target variable
plt.figure(figsize=(20, 5));

# left plot
plt.subplot(1, 2, 1);
plt.hist(y1);
plt.xlabel('original target value')
plt.ylabel('count')

# right plot
plt.subplot(1, 2, 2);
plt.hist(y)
plt.xlabel('aggregated target value')
plt.show()
Again, on the right histogram, True = 1 and False = 0.
\text{Accuracy}=\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
Accuracy is commonly defined for binary classification problems in terms of true positives and true negatives: it is the share of all predictions that fall into either of these two groups. It can also be read off a confusion matrix.
Other measures of model performance are derived from the confusion matrix: precision (true positives divided by the number of true & false positives) and recall (number of true positives divided by the number of true positives plus the number of false negatives).
The F1-score is the harmonic mean of the precision and the recall.
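The definitions above can be sketched in a few lines of plain Python. The function name `classification_metrics` and the two label lists are hypothetical and only serve as an illustration; in the rest of this text the same numbers come from scikit-learn's `classification_report`.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Hypothetical labels: 3 true positives, 3 true negatives,
# 1 false positive and 1 false negative
y_true = [True, True, True, False, False, False, True, False]
y_pred = [True, False, True, False, True, False, True, False]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```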
Preprocessing happens before running any model, such as a regression (predicting a continuous variable) or a classification (predicting a discrete variable) using one or another model (k-NN, logistic, decision tree, random forests etc.).
For numerical variables, it is common to either normalize or standardize the data.
Normalization: scaling a dataset so that its minimum is 0 and its maximum 1.
Standardization: centering the data around 0 and scaling it with respect to the standard deviation.
x_{standardized} = \frac{x-\mu}{\sigma}
where \mu and \sigma are the mean and standard deviation of the dataset.
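Both transformations can be written out by hand. The sketch below uses a short, made-up list of alcohol values (not the real column), and the helper names `normalize` and `standardize` are illustrative. The population standard deviation is used, which is also what scikit-learn's `scale` does.

```python
import statistics

def normalize(values):
    """Rescale so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical alcohol percentages
alcohol = [9.4, 9.8, 9.8, 11.0, 12.5]

normalized = normalize(alcohol)
standardized = standardize(alcohol)
print(normalized)
print(standardized)
```

After normalization the values lie in [0, 1]; after standardization they have mean 0 and standard deviation 1.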
There are other transformations, such as the log transformation or the Box-Cox transformation, which make the data look more Gaussian (normally distributed).
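As a small illustration of the log transformation (which is the Box-Cox transformation with parameter 0), the sketch below measures the skewness of a made-up, right-skewed sample before and after taking logarithms. The data and the helper name `skewness` are assumptions for the example only.

```python
import math
import statistics

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

# Hypothetical right-skewed sample: one large value pulls the tail to the right
raw = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print('skewness before: %f' % skew_raw)
print('skewness after:  %f' % skew_logged)
```

The transformed sample is still skewed, but less so, which is the sense in which such transformations bring the data closer to a Gaussian shape.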
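from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# scale the features (assumed definition of Xs; the cell that created it is not shown)
Xs = scale(X)

# split
# 80% of the data for training (train set)
# 20% for testing (test set)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)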
from sklearn import neighbors
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Set sc = False
# Do not scale the features
sc = False

# Set the number of k in k-NN
nk = 5

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

# Drop target variable
X = df.drop('quality', axis=1).values

# Scale, if desired
if sc:
    X = scale(X)

# Target value
y1 = df['quality'].values  # original target variable

# New target variable: is the rating <= 5?
y = y1 <= 5

# Split (80/20) the data into a test set and a train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the k-NN model
knn = neighbors.KNeighborsClassifier(n_neighbors=nk)
knn_model = knn.fit(X_train, y_train)

# Print performance on the test set
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
y_true, y_pred = y_test, knn_model.predict(X_test)
print(classification_report(y_true, y_pred))
# Set sc = True
# to scale the features
sc = True

# Set the number of k in k-NN
nk = 5

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

# Drop target variable
X = df.drop('quality', axis=1).values

# Scale, if desired
if sc:
    X = scale(X)

# Target value
y1 = df['quality'].values  # original target variable

# New target variable: is the rating <= 5?
y = y1 <= 5

# Split (80/20) the data into a test set and a train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the k-NN model
knn = neighbors.KNeighborsClassifier(n_neighbors=nk)
knn_model = knn.fit(X_train, y_train)

# Print performance on the test set
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
y_true, y_pred = y_test, knn_model.predict(X_test)
print(classification_report(y_true, y_pred))
Before addressing an alternative to k-NN, logistic regression (or Logit), let us briefly review linear regression with a different dataset.
# Import necessary packages
%pylab inline
import pandas as pd
import matplotlib.pyplot as plt

# set the style
plt.style.use('ggplot')

# Import more packages
from sklearn import datasets
from sklearn import linear_model
import numpy as np
Populating the interactive namespace from numpy and matplotlib
# Load the data
# The data is part of the scikit-learn module
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
boston = datasets.load_boston()
yb = boston.target.reshape(-1, 1)
Xb = boston['data'][:, 5].reshape(-1, 1)
print(yb[:10])
y1 = df['quality'].values

# Print the single-row array
print(y1)
[5 5 5 ..., 6 5 6]
df.shape
(1599, 12)
# plot the other columns or variables
pd.DataFrame.hist(df, figsize=[15, 15]);
plt.show()  # optional in Jupyter
Let us turn the target variable into a two-category variable consisting of ‘good’ (rating > 5) and ‘bad’ (rating <= 5) qualities.
# is the rating <= 5 ?
y = y1 <= 5
print(y)
[ True True True ..., False True False]
from sklearn.model_selection import train_test_split

# split
# 80% of the data for training (train set)
# 20% for testing (test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the model
lr = linear_model.LogisticRegression()
lr = lr.fit(X_train, y_train)
y_true, y_pred = y_train, lr.predict(X_train)

# Evaluate the train set
print('Logistic Regression score for train set: %f' % lr.score(X_train, y_train))
from sklearn.metrics import classification_report

# Use the test set
y_true, y_pred = y_test, lr.predict(X_test)

# Evaluate the test set
print('Logistic Regression score for test set: %f' % lr.score(X_test, y_test))
# Run the logistic regression model
lr_2 = lr.fit(Xs_train, y_train)

# Predict on the train set
y_true, y_pred = y_train, lr_2.predict(Xs_train)

# Evaluate the train set
print('Logistic Regression score for train set: %f' % lr_2.score(Xs_train, y_train))

# Use the test set
y_true, y_pred = y_test, lr_2.predict(Xs_test)

# Evaluate the test set
print('Logistic Regression score for test set: %f' % lr_2.score(Xs_test, y_test))
This is very interesting! The performance of logistic regression did not improve with data scaling.
If predictor variables with large ranges do not affect the target variable, a regression algorithm will make the corresponding coefficients small so that they do not affect the predictions much.
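The mechanism can be illustrated with a hand-written logistic model. In this sketch the weights, the bias and the sample are invented, and `predict_proba` is a hypothetical helper, not scikit-learn's method. Multiplying a feature by a constant while dividing its coefficient by the same constant leaves the predicted probability unchanged, which is why a regression can absorb a change of units in its coefficients, whereas k-NN cannot.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for one sample under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical fitted model and sample
w = [0.8, -0.002]
b = 0.1
x = [1.5, 120.0]

# Express the second feature in units that are 1000 times smaller
c = 1000.0
x_rescaled = [x[0], x[1] * c]
w_rescaled = [w[0], w[1] / c]

p_original = predict_proba(x, w, b)
p_rescaled = predict_proba(x_rescaled, w_rescaled, b)
print(p_original, p_rescaled)
```

Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so in practice the fitted model is only approximately invariant to such a rescaling.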
# Set sc = False
# do not scale the features
sc = False

# Load the data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable

# Scale, if desired
if sc:
    X = scale(X)

# Target value
y1 = df['quality'].values  # original target variable
y = y1 <= 5  # new target variable: is the rating <= 5?

# Split (80/20) the data into a test set and a train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train logistic regression model
lr = linear_model.LogisticRegression()
lr = lr.fit(X_train, y_train)

# Print the score on the training set, then the report for the test set
print('Logistic Regression score for training set: %f' % lr.score(X_train, y_train))
y_true, y_pred = y_test, lr.predict(X_test)
print(classification_report(y_true, y_pred))
The noisier the synthesized data, the more important scaling will be.
Measurements can be in meters or in miles, with small or large ranges. If we scale the data, both versions end up being the same.
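A quick way to check this claim is to standardize the same distances expressed in two different units. The distances below are made up, and `standardize` is an illustrative helper.

```python
import math
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical distances, first in meters, then converted to miles
meters = [1000.0, 2500.0, 4000.0, 8000.0]
miles = [m / 1609.344 for m in meters]

z_meters = standardize(meters)
z_miles = standardize(miles)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_meters, z_miles))
print(same)
```

The conversion factor cancels out in the ratio of deviation to standard deviation, so the standardized values agree up to floating-point error.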
We use scikit-learn’s make_blobs function to generate 2000 data points that are in 4 clusters (each data point has 2 predictor variables and 1 target variable).
%pylab inline
Populating the interactive namespace from numpy and matplotlib
# Generate some clustered data (blobs!)
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 2000
X, y = make_blobs(n_samples=n_samples, centers=4, n_features=2, random_state=0)
print(X)
In a scatter plot of the data, each axis is a predictor variable and the colour is a key to the target variable.
All possible target variables are equally represented. In this case (or even if they are approximately equally represented), we say that the class y is balanced.
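Class balance is easy to verify by counting labels. The sketch below uses a hand-made label list with the same shape as the blob labels (2000 points, 4 classes of 500 each); with the real data one would pass `y` to `Counter` instead.

```python
from collections import Counter

# Hypothetical labels standing in for the make_blobs target: 4 classes, 500 points each
y_labels = [0, 1, 2, 3] * 500

counts = Counter(y_labels)
proportions = {label: n / len(y_labels) for label, n in sorted(counts.items())}
print(proportions)
```

When every proportion is close to 1 divided by the number of classes, accuracy is a reasonable summary of performance; with imbalanced classes it can be misleading.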
import pandas as pd

# Convert to a DataFrame
df = pd.DataFrame(X)

# Plot it
pd.DataFrame.hist(df, figsize=(20, 5))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f366d3dbba8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f366d30ca58>]], dtype=object)
Split into test & train sets, and plot both sets (train set > test set; 80/20).
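For readers who want to see what an 80/20 split does under the hood, here is a dependency-free sketch. The function `split_80_20` and the toy data are hypothetical; it is a stand-in for scikit-learn's `train_test_split` and does not reproduce the same shuffle for a given seed.

```python
import random

def split_80_20(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then hold out `test_size` of them as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: 10 two-feature points with alternating labels
X_toy = [(float(i), float(i) * 2.0) for i in range(10)]
y_toy = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = split_80_20(X_toy, y_toy)
print(len(X_train), len(X_test))
```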
plt.figure(figsize=(20, 5));

plt.subplot(1, 2, 1);
plt.scatter(Xs_train[:, 0], Xs_train[:, 1], c=y_train, alpha=0.7);
plt.title('scaled training set')

plt.subplot(1, 2, 2);
plt.scatter(Xs_test[:, 0], Xs_test[:, 1], c=y_test, alpha=0.7);
plt.title('scaled test set')
plt.show()
knn_model_s = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_s.score(Xs_test, y_test))
k-NN score for test set: 0.935000
It doesn’t perform any better with scaling.
This is most likely because both features were already around the same range.
We now add a third variable of Gaussian noise with mean 0 and an adjustable standard deviation \sigma. We call \sigma the strength of the noise, and we will see that the stronger the noise, the worse the performance of k-Nearest Neighbours.
# Strength of noise term
ns = 10**(3)

# Add noise column to predictor variables
newcol = np.transpose([ns * np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis=1)
print(Xn)
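from sklearn import neighbors
from sklearn.preprocessing import scale

# Scale the noisy data (assumed definition of Xns, matching the loop used later)
Xns = scale(Xn)

# Split: the first 20% of the rows for testing, the rest for training
s = int(.2 * n_samples)
Xns_train = Xns[s:]
y_train = y[s:]
Xns_test = Xns[:s]
y_test = y[:s]

# Run the model
knn = neighbors.KNeighborsClassifier()
knn_models = knn.fit(Xns_train, y_train)

# Evaluate
print('k-NN score for test set: %f' % knn_models.score(Xns_test, y_test))
k-NN score for test set: 0.917500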
After scaling the data, the model performs nearly as well as it did before any noise was introduced.
Noise strength vs. accuracy (and the need for scaling)
How does the noise strength affect model accuracy?
Create a function to split the data and run the model.
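The body of that function is not shown here, so what follows is only a sketch of what such a helper might look like. It is named `accu` to match the call used in the loop, but the implementation is an assumption: a dependency-free k-NN with 5 neighbours (the default of scikit-learn's `KNeighborsClassifier`) that holds out the first 20% of the rows for testing. With scikit-learn available, the same function would more naturally wrap `KNeighborsClassifier` and its `score` method.

```python
import heapq
from collections import Counter

def accu(X, y, k=5, test_fraction=0.2):
    """Hold out the first `test_fraction` of the rows, classify them with
    k-NN fitted on the remaining rows, and return the test accuracy."""
    rows = [tuple(float(v) for v in row) for row in X]
    labels = list(y)
    s = int(test_fraction * len(rows))
    X_test, y_test = rows[:s], labels[:s]
    X_train, y_train = rows[s:], labels[s:]

    correct = 0
    for point, truth in zip(X_test, y_test):
        # Squared Euclidean distance to every training point, paired with its label
        scored = (
            (sum((a - b) ** 2 for a, b in zip(point, other)), label)
            for other, label in zip(X_train, y_train)
        )
        nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbours
        votes = Counter(label for _, label in nearest)
        if votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / len(X_test)

# Hypothetical toy data: two well-separated groups, repeated 10 times
X_demo = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)] * 10
y_demo = [0, 0, 1, 1] * 10

acc = accu(X_demo, y_demo, k=3)
print('accuracy: %f' % acc)
```

The pure-Python distance loop is slow compared with scikit-learn, but it accepts NumPy arrays as well as lists, since both iterate row by row.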
# Set the variables
noise = [10**i for i in np.arange(0, 6)]
A1 = np.zeros(len(noise))
A2 = np.zeros(len(noise))
count = 0
print(noise)
[1, 10, 100, 1000, 10000, 100000]
print(A1)
print(A2)
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
# Run the loop
for ns in noise:
    newcol = np.transpose([ns * np.random.randn(n_samples)])
    Xn = np.concatenate((X, newcol), axis=1)
    Xns = scale(Xn)
    A1[count] = accu(Xn, y)
    A2[count] = accu(Xns, y)
    count += 1
# Plot the results
plt.scatter(noise, A1)
plt.plot(noise, A1, label='unscaled', linewidth=2)
plt.scatter(noise, A2, c='r')
plt.plot(noise, A2, label='scaled', linewidth=2)
plt.xscale('log')
plt.xlabel('Noise strength')
plt.ylabel('Accuracy')
plt.legend(loc=3);
# Generate some data
n_samples = 2000
X, y = make_blobs(n_samples=n_samples, centers=4, n_features=2, random_state=0)

# Add noise column to predictor variables
newcol = np.transpose([ns * np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis=1)

# Scale if desired
if sc:
    Xn = scale(Xn)

# Train model and test after splitting
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
lr = linear_model.LogisticRegression()
lr_model = lr.fit(Xn_train, y_train)
print('logistic regression score for test set: %f' % lr_model.score(Xn_test, y_test))