Foreword 
Code snippets and excerpts from the tutorial. Python 3. From DataCamp.
 
Import the Data 
sep, delimiter. 
delimiter, delimiter. 
names, column names to use. 
index_col, column to use as the row labels.
 
read_table(), general delimited files.
 
read_excel(), Excel files. 
read_fwf(), Fixed-Width Formatted data. 
read_clipboard, data copied to the clipboard. 
read_sql(), SQL query. 
 
Input-output documentation .
% pylab  inline 
import  numpy  as  np 
import  pandas  as  pd 
 
Populating the interactive namespace from numpy and matplotlib
 
digits 
# Load in the data with `read_csv()` 
digits  =  pd . read_csv ( "http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra" , 
                     header = None ) 
digits . head () 
 
  
    
       
      0 
      1 
      2 
      3 
      4 
      5 
      6 
      7 
      8 
      9 
      ... 
      55 
      56 
      57 
      58 
      59 
      60 
      61 
      62 
      63 
      64 
     
   
  
    
      0 
      0 
      1 
      6 
      15 
      12 
      1 
      0 
      0 
      0 
      7 
      ... 
      0 
      0 
      0 
      6 
      14 
      7 
      1 
      0 
      0 
      0 
     
    
      1 
      0 
      0 
      10 
      16 
      6 
      0 
      0 
      0 
      0 
      7 
      ... 
      0 
      0 
      0 
      10 
      16 
      15 
      3 
      0 
      0 
      0 
     
    
      2 
      0 
      0 
      8 
      15 
      16 
      13 
      0 
      0 
      0 
      1 
      ... 
      0 
      0 
      0 
      9 
      14 
      0 
      0 
      0 
      0 
      7 
     
    
      3 
      0 
      0 
      0 
      3 
      11 
      16 
      0 
      0 
      0 
      0 
      ... 
      0 
      0 
      0 
      0 
      1 
      15 
      2 
      0 
      0 
      4 
     
    
      4 
      0 
      0 
      5 
      14 
      4 
      0 
      0 
      0 
      0 
      0 
      ... 
      0 
      0 
      0 
      4 
      12 
      14 
      7 
      0 
      0 
      6 
     
   
5 rows × 65 columns
 
Find out about the dataset .
iris 
Another classical dataset.
iris  =  pd . read_csv ( "http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data" ) 
iris . columns  =  [ 'Sepal_Length' ,  'Sepal_Width' ,  'Petal_Length' ,  'Petal_Width' ,  'Class' ] 
iris . head () 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      0 
      4.9 
      3.0 
      1.4 
      0.2 
      Iris-setosa 
     
    
      1 
      4.7 
      3.2 
      1.3 
      0.2 
      Iris-setosa 
     
    
      2 
      4.6 
      3.1 
      1.5 
      0.2 
      Iris-setosa 
     
    
      3 
      5.0 
      3.6 
      1.4 
      0.2 
      Iris-setosa 
     
    
      4 
      5.4 
      3.9 
      1.7 
      0.4 
      Iris-setosa 
     
   
 
Basic Description of the Data 
Describing The Data 
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Class            object
dtype: object
 
def  get_var_category ( series ): 
    unique_count  =  series . nunique ( dropna = False ) 
    total_count  =  len ( series ) 
    if  pd . api . types . is_numeric_dtype ( series ): 
        return  'Numerical' 
    elif  pd . api . types . is_datetime64_dtype ( series ): 
        return  'Date' 
    elif  unique_count == total_count : 
        return  'Text (Unique)' 
    else : 
        return  'Categorical' 
def  print_categories ( df ): 
    for  column_name  in  df . columns : 
        print ( column_name ,  ": " ,  get_var_category ( df [ column_name ])) 
 
Sepal_Length  :   Numerical 
Sepal_Width  :   Numerical 
Petal_Length  :   Numerical 
Petal_Width  :   Numerical 
Class  :   Categorical 
 
  
    
       
      0 
      1 
      2 
      3 
      4 
      5 
      6 
      7 
      8 
      9 
      ... 
      55 
      56 
      57 
      58 
      59 
      60 
      61 
      62 
      63 
      64 
     
   
  
    
      count 
      3823.0 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      ... 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
      3823.000000 
     
    
      mean 
      0.0 
      0.301334 
      5.481821 
      11.805912 
      11.451478 
      5.505362 
      1.387392 
      0.142297 
      0.002093 
      1.960502 
      ... 
      0.148313 
      0.000262 
      0.283024 
      5.855872 
      11.942977 
      11.461156 
      6.700497 
      2.105676 
      0.202197 
      4.497253 
     
    
      std 
      0.0 
      0.866986 
      4.631601 
      4.259811 
      4.537556 
      5.613060 
      3.371444 
      1.051598 
      0.088572 
      3.052353 
      ... 
      0.767761 
      0.016173 
      0.928046 
      4.980012 
      4.334508 
      4.991934 
      5.775815 
      4.028266 
      1.150694 
      2.869831 
     
    
      min 
      0.0 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      ... 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
     
    
      25% 
      0.0 
      0.000000 
      1.000000 
      10.000000 
      9.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      ... 
      0.000000 
      0.000000 
      0.000000 
      1.000000 
      10.000000 
      9.000000 
      0.000000 
      0.000000 
      0.000000 
      2.000000 
     
    
      50% 
      0.0 
      0.000000 
      5.000000 
      13.000000 
      13.000000 
      4.000000 
      0.000000 
      0.000000 
      0.000000 
      0.000000 
      ... 
      0.000000 
      0.000000 
      0.000000 
      5.000000 
      13.000000 
      13.000000 
      6.000000 
      0.000000 
      0.000000 
      4.000000 
     
    
      75% 
      0.0 
      0.000000 
      9.000000 
      15.000000 
      15.000000 
      10.000000 
      0.000000 
      0.000000 
      0.000000 
      3.000000 
      ... 
      0.000000 
      0.000000 
      0.000000 
      10.000000 
      15.000000 
      16.000000 
      12.000000 
      2.000000 
      0.000000 
      7.000000 
     
    
      max 
      0.0 
      8.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      5.000000 
      15.000000 
      ... 
      12.000000 
      1.000000 
      10.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      16.000000 
      9.000000 
     
   
8 rows × 65 columns
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
     
   
  
    
      count 
      149.000000 
      149.000000 
      149.000000 
      149.000000 
     
    
      mean 
      5.848322 
      3.051007 
      3.774497 
      1.205369 
     
    
      std 
      0.828594 
      0.433499 
      1.759651 
      0.761292 
     
    
      min 
      4.300000 
      2.000000 
      1.000000 
      0.100000 
     
    
      25% 
      5.100000 
      2.800000 
      1.600000 
      0.300000 
     
    
      50% 
      5.800000 
      3.000000 
      4.400000 
      1.300000 
     
    
      75% 
      6.400000 
      3.300000 
      5.100000 
      1.800000 
     
    
      max 
      7.900000 
      4.400000 
      6.900000 
      2.500000 
     
   
 
iris [[ "Sepal_Length" ,  "Sepal_Width" ]] . describe () 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
     
   
  
    
      count 
      149.000000 
      149.000000 
     
    
      mean 
      5.848322 
      3.051007 
     
    
      std 
      0.828594 
      0.433499 
     
    
      min 
      4.300000 
      2.000000 
     
    
      25% 
      5.100000 
      2.800000 
     
    
      50% 
      5.800000 
      3.000000 
     
    
      75% 
      6.400000 
      3.300000 
     
    
      max 
      7.900000 
      4.400000 
     
   
 
length  =  len ( digits ) 
print ( length ) 
 
count  =  digits [ 2 ] . count () 
print ( count ) 
 
number_of_missing_values  =  length  -  count 
pct_of_missing_values  =  float ( number_of_missing_values  /  length ) 
pct_of_missing_values  =  " {0:.1f} %" . format ( pct_of_missing_values * 100 ) 
print ( pct_of_missing_values ) 
 
print ( "Minimum value: " ,  iris [ "Sepal_Length" ] . min ()) 
print ( "Maximum value: " ,  iris [ "Sepal_Length" ] . max ()) 
 
Minimum value:  4.3
Maximum value:  7.9
 
print ( iris [ "Sepal_Length" ] . mode ()) 
 
print ( iris [ "Sepal_Length" ] . mean ()) 
 
print ( iris [ "Sepal_Length" ] . median ()) 
 
print ( iris [ "Sepal_Length" ] . std ()) 
 
print ( iris [ "Sepal_Length" ] . quantile ([ .25 ,  .5 ,  .75 ])) 
 
0.25     5.1 
0.50     5.8 
0.75     6.4 
Name :  Sepal_Length ,  dtype :  float64 
 
import  seaborn  as  sns 
sns . set ( color_codes = True ) 
sns . set_palette ( sns . color_palette ( "muted" )) 
sns . distplot ( iris [ "Sepal_Length" ] . dropna ()) 
 
<matplotlib.axes._subplots.AxesSubplot at 0x7f87b39b0320>
 
iris [[ "Sepal_Length" ,  "Sepal_Width" ]] . corr () 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
     
   
  
    
      Sepal_Length 
      1.000000 
      -0.103784 
     
    
      Sepal_Width 
      -0.103784 
      1.000000 
     
   
 
import  pandas_profiling  
# Print a full report 
pandas_profiling . ProfileReport ( iris ) 
 
    
    
    
        Dataset info
        
            
            
                Number of variables 
                5  
             
            
                Number of observations 
                149  
             
            
                Total Missing (%) 
                0.0%  
             
            
                Total size in memory 
                5.9 KiB  
             
            
                Average record size in memory 
                40.5 B  
             
             
        
     
    
        Variables types
        
            
            
                Numeric 
                3  
             
            
                Categorical 
                1  
             
            
                Date 
                0  
             
            
                Text (Unique) 
                0  
             
            
                Rejected 
                1  
             
             
        
     
    
        Warnings
        Petal_Width is highly correlated with Petal_Length (ρ = 0.96231) Rejected Dataset has 3 duplicate rows Warning    
     
 
    
    
    
    
        
            Distinct count 
            3 
         
        
            Unique (%) 
            2.0% 
         
        
            Missing (%) 
            0.0% 
         
        
            Missing (n) 
            0 
         
    
 
    
        
    Iris-virginica 
    
        
            50
        
     
 
    Iris-versicolor 
    
        
            50
        
     
 
    Iris-setosa 
    
        
            49
        
     
 
    
 
 
    
    
        
            
                
                    Distinct count 
                    43 
                 
                
                    Unique (%) 
                    28.9% 
                 
                
                    Missing (%) 
                    0.0% 
                 
                
                    Missing (n) 
                    0 
                 
                
                    Infinite (%) 
                    0.0% 
                 
                
                    Infinite (n) 
                    0 
                 
            
         
        
            
                
                    Mean 
                    3.7745 
                 
                
                    Minimum 
                    1 
                 
                
                    Maximum 
                    6.9 
                 
                
                    Zeros (%) 
                    0.0% 
                 
            
         
     
 
    
    
        
            
                Quantile statistics
                
                    
                        Minimum 
                        1 
                     
                    
                        5-th percentile 
                        1.3 
                     
                    
                        Q1 
                        1.6 
                     
                    
                        Median 
                        4.4 
                     
                    
                        Q3 
                        5.1 
                     
                    
                        95-th percentile 
                        6.1 
                     
                    
                        Maximum 
                        6.9 
                     
                    
                        Range 
                        5.9 
                     
                    
                        Interquartile range 
                        3.5 
                     
                
             
            
                Descriptive statistics
                
                    
                        Standard deviation 
                        1.7597 
                     
                    
                        Coef of variation 
                        0.46619 
                     
                    
                        Kurtosis 
                        -1.385 
                     
                    
                        Mean 
                        3.7745 
                     
                    
                        MAD 
                        1.5526 
                     
                    
                        Skewness 
                        -0.28946 
                     
                    
                        Sum 
                        562.4 
                     
                    
                        Variance 
                        3.0964 
                     
                    
                        Memory size 
                        1.2 KiB 
                     
                
             
         
        
        
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        1.5 
        14 
        9.4% 
        
             
         
 
        1.4 
        11 
        7.4% 
        
             
         
 
        5.1 
        8 
        5.4% 
        
             
         
 
        4.5 
        8 
        5.4% 
        
             
         
 
        1.3 
        7 
        4.7% 
        
             
         
 
        1.6 
        7 
        4.7% 
        
             
         
 
        5.6 
        6 
        4.0% 
        
             
         
 
        4.0 
        5 
        3.4% 
        
             
         
 
        4.9 
        5 
        3.4% 
        
             
         
 
        4.7 
        5 
        3.4% 
        
             
         
 
        Other values (33) 
        73 
        49.0% 
        
             
         
 
         
        
            Minimum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        1.0 
        1 
        0.7% 
        
             
         
 
        1.1 
        1 
        0.7% 
        
             
         
 
        1.2 
        2 
        1.3% 
        
             
         
 
        1.3 
        7 
        4.7% 
        
             
         
 
        1.4 
        11 
        7.4% 
        
             
         
 
            Maximum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        6.3 
        1 
        0.7% 
        
             
         
 
        6.4 
        1 
        0.7% 
        
             
         
 
        6.6 
        1 
        0.7% 
        
             
         
 
        6.7 
        2 
        1.3% 
        
             
         
 
        6.9 
        1 
        0.7% 
        
             
         
 
         
     
 
 
    
        Petal_Width 
            Highly correlated 
        
     
    This variable is highly correlated with Petal_Length and should be ignored for analysis 
 
 
    
    
        
            
                
                    Distinct count 
                    35 
                 
                
                    Unique (%) 
                    23.5% 
                 
                
                    Missing (%) 
                    0.0% 
                 
                
                    Missing (n) 
                    0 
                 
                
                    Infinite (%) 
                    0.0% 
                 
                
                    Infinite (n) 
                    0 
                 
            
         
        
            
                
                    Mean 
                    5.8483 
                 
                
                    Minimum 
                    4.3 
                 
                
                    Maximum 
                    7.9 
                 
                
                    Zeros (%) 
                    0.0% 
                 
            
         
     
 
    
    
        
            
                Quantile statistics
                
                    
                        Minimum 
                        4.3 
                     
                    
                        5-th percentile 
                        4.6 
                     
                    
                        Q1 
                        5.1 
                     
                    
                        Median 
                        5.8 
                     
                    
                        Q3 
                        6.4 
                     
                    
                        95-th percentile 
                        7.26 
                     
                    
                        Maximum 
                        7.9 
                     
                    
                        Range 
                        3.6 
                     
                    
                        Interquartile range 
                        1.3 
                     
                
             
            
                Descriptive statistics
                
                    
                        Standard deviation 
                        0.82859 
                     
                    
                        Coef of variation 
                        0.14168 
                     
                    
                        Kurtosis 
                        -0.55356 
                     
                    
                        Mean 
                        5.8483 
                     
                    
                        MAD 
                        0.68748 
                     
                    
                        Skewness 
                        0.3031 
                     
                    
                        Sum 
                        871.4 
                     
                    
                        Variance 
                        0.68657 
                     
                    
                        Memory size 
                        1.2 KiB 
                     
                
             
         
        
        
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        5.0 
        10 
        6.7% 
        
             
         
 
        6.3 
        9 
        6.0% 
        
             
         
 
        5.1 
        8 
        5.4% 
        
             
         
 
        6.7 
        8 
        5.4% 
        
             
         
 
        5.7 
        8 
        5.4% 
        
             
         
 
        5.5 
        7 
        4.7% 
        
             
         
 
        5.8 
        7 
        4.7% 
        
             
         
 
        6.4 
        7 
        4.7% 
        
             
         
 
        6.0 
        6 
        4.0% 
        
             
         
 
        4.9 
        6 
        4.0% 
        
             
         
 
        Other values (25) 
        73 
        49.0% 
        
             
         
 
         
        
            Minimum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        4.3 
        1 
        0.7% 
        
             
         
 
        4.4 
        3 
        2.0% 
        
             
         
 
        4.5 
        1 
        0.7% 
        
             
         
 
        4.6 
        4 
        2.7% 
        
             
         
 
        4.7 
        2 
        1.3% 
        
             
         
 
            Maximum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        7.3 
        1 
        0.7% 
        
             
         
 
        7.4 
        1 
        0.7% 
        
             
         
 
        7.6 
        1 
        0.7% 
        
             
         
 
        7.7 
        4 
        2.7% 
        
             
         
 
        7.9 
        1 
        0.7% 
        
             
         
 
         
     
 
 
    
    
        
            
                
                    Distinct count 
                    23 
                 
                
                    Unique (%) 
                    15.4% 
                 
                
                    Missing (%) 
                    0.0% 
                 
                
                    Missing (n) 
                    0 
                 
                
                    Infinite (%) 
                    0.0% 
                 
                
                    Infinite (n) 
                    0 
                 
            
         
        
            
                
                    Mean 
                    3.051 
                 
                
                    Minimum 
                    2 
                 
                
                    Maximum 
                    4.4 
                 
                
                    Zeros (%) 
                    0.0% 
                 
            
         
     
 
    
    
        
            
                Quantile statistics
                
                    
                        Minimum 
                        2 
                     
                    
                        5-th percentile 
                        2.34 
                     
                    
                        Q1 
                        2.8 
                     
                    
                        Median 
                        3 
                     
                    
                        Q3 
                        3.3 
                     
                    
                        95-th percentile 
                        3.8 
                     
                    
                        Maximum 
                        4.4 
                     
                    
                        Range 
                        2.4 
                     
                    
                        Interquartile range 
                        0.5 
                     
                
             
            
                Descriptive statistics
                
                    
                        Standard deviation 
                        0.4335 
                     
                    
                        Coef of variation 
                        0.14208 
                     
                    
                        Kurtosis 
                        0.31865 
                     
                    
                        Mean 
                        3.051 
                     
                    
                        MAD 
                        0.33199 
                     
                    
                        Skewness 
                        0.3501 
                     
                    
                        Sum 
                        454.6 
                     
                    
                        Variance 
                        0.18792 
                     
                    
                        Memory size 
                        1.2 KiB 
                     
                
             
         
        
        
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        3.0 
        26 
        17.4% 
        
             
         
 
        2.8 
        14 
        9.4% 
        
             
         
 
        3.2 
        13 
        8.7% 
        
             
         
 
        3.4 
        12 
        8.1% 
        
             
         
 
        3.1 
        12 
        8.1% 
        
             
         
 
        2.9 
        10 
        6.7% 
        
             
         
 
        2.7 
        9 
        6.0% 
        
             
         
 
        2.5 
        8 
        5.4% 
        
             
         
 
        3.8 
        6 
        4.0% 
        
             
         
 
        3.3 
        6 
        4.0% 
        
             
         
 
        Other values (13) 
        33 
        22.1% 
        
             
         
 
         
        
            Minimum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        2.0 
        1 
        0.7% 
        
             
         
 
        2.2 
        3 
        2.0% 
        
             
         
 
        2.3 
        4 
        2.7% 
        
             
         
 
        2.4 
        3 
        2.0% 
        
             
         
 
        2.5 
        8 
        5.4% 
        
             
         
 
            Maximum 5 values
    
    
        Value 
        Count 
        Frequency (%) 
          
     
     
    
        3.9 
        2 
        1.3% 
        
             
         
 
        4.0 
        1 
        0.7% 
        
             
         
 
        4.1 
        1 
        0.7% 
        
             
         
 
        4.2 
        1 
        0.7% 
        
             
         
 
        4.4 
        1 
        0.7% 
        
             
         
 
         
     
 
 
    
    
    
        
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      0 
      4.9 
      3.0 
      1.4 
      0.2 
      Iris-setosa 
     
    
      1 
      4.7 
      3.2 
      1.3 
      0.2 
      Iris-setosa 
     
    
      2 
      4.6 
      3.1 
      1.5 
      0.2 
      Iris-setosa 
     
    
      3 
      5.0 
      3.6 
      1.4 
      0.2 
      Iris-setosa 
     
    
      4 
      5.4 
      3.9 
      1.7 
      0.4 
      Iris-setosa 
     
   
     
 
 
# Print a full report 
pandas_profiling . ProfileReport ( digits ) 
 
>>> Full (lengthy) report here!!! <<< 
First and Last DataFrame Rows 
# Inspect the first 5 rows of `digits` 
first  =  digits . head ( 5 ) 
# Inspect the last 3 rows 
last  =  digits . tail ( 3 ) 
 
 1 
 2 
 3 
 4 
 5 
 6 
 7 
 8 
 9 
10 
11 
12 
13 
14 
15    0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  62  \
0   0   1   6  15  12   1   0   0   0   7 ...   0   0   0   6  14   7   1   0   
1   0   0  10  16   6   0   0   0   0   7 ...   0   0   0  10  16  15   3   0   
2   0   0   8  15  16  13   0   0   0   1 ...   0   0   0   9  14   0   0   0   
3   0   0   0   3  11  16   0   0   0   0 ...   0   0   0   0   1  15   2   0   
4   0   0   5  14   4   0   0   0   0   0 ...   0   0   0   4  12  14   7   0
   63  64  
0   0   0  
1   0   0  
2   0   7  
3   0   4  
4   0   6
[5 rows x 65 columns]
 
      0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  \
3820   0   0   3  15   0   0   0   0   0   0 ...   0   0   0   4  14  16   9   
3821   0   0   6  16   2   0   0   0   0   0 ...   0   0   0   5  16  16  16   
3822   0   0   2  15  16  13   1   0   0   0 ...   0   0   0   4  14   1   0
      62  63  64  
3820   0   0   6  
3821   5   0   6  
3822   0   0   7
[3 rows x 65 columns]
 
Sample the Data 
# Take a sample of 5 
digits . sample ( 5 ) 
 
  
    
       
      0 
      1 
      2 
      3 
      4 
      5 
      6 
      7 
      8 
      9 
      ... 
      55 
      56 
      57 
      58 
      59 
      60 
      61 
      62 
      63 
      64 
     
   
  
    
      1249 
      0 
      0 
      14 
      14 
      13 
      15 
      5 
      0 
      0 
      0 
      ... 
      0 
      0 
      0 
      12 
      16 
      10 
      2 
      0 
      0 
      5 
     
    
      3702 
      0 
      0 
      0 
      9 
      16 
      12 
      2 
      0 
      0 
      0 
      ... 
      0 
      0 
      0 
      0 
      9 
      14 
      2 
      0 
      0 
      0 
     
    
      1605 
      0 
      0 
      7 
      16 
      13 
      2 
      0 
      0 
      0 
      2 
      ... 
      0 
      0 
      0 
      5 
      14 
      11 
      1 
      0 
      0 
      0 
     
    
      1890 
      0 
      0 
      3 
      15 
      15 
      5 
      0 
      0 
      0 
      0 
      ... 
      2 
      0 
      0 
      3 
      15 
      16 
      16 
      13 
      1 
      9 
     
    
      1295 
      0 
      0 
      7 
      15 
      13 
      3 
      0 
      0 
      0 
      0 
      ... 
      0 
      0 
      0 
      9 
      13 
      12 
      3 
      0 
      0 
      0 
     
   
5 rows × 65 columns
 
# import `sample` from `random` 
from  random  import  sample 
# Create a random index 
randomIndex  =  np . array ( sample ( range ( len ( digits )),  5 )) 
print ( randomIndex ) 
 
# Get 5 random rows 
digitsSample  =  digits . ix [ randomIndex ] 
# Print the sample 
print ( digitsSample ) 
 
 1 
 2 
 3 
 4 
 5 
 6 
 7 
 8 
 9 
10 
11 
12 
13 
14 
15       0   1   2   3   4   5   6   7   8   9  ...  55  56  57  58  59  60  61  \
846    0   5  14  15   9   1   0   0   0   7 ...   0   0   4  12  16  12  10   
569    0   1   7  12  12   0   0   0   0   3 ...   0   0   0  10  16  13   7   
315    0   1   6  13  13   4   0   0   0   9 ...   0   0   0   4  14  16   9   
2932   0   0   4  12  10   1   0   0   0   0 ...   0   0   0   4  12  11   3   
2328   0   0   4  15  16  16  16  15   0   0 ...   0   0   0   5  15   3   0
      62  63  64  
846    4   0   2  
569    0   0   3  
315    2   0   2  
2932   0   0   0  
2328   0   0   7
[5 rows x 65 columns]
 
Queries 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      0 
      4.9 
      3.0 
      1.4 
      0.2 
      Iris-setosa 
     
    
      1 
      4.7 
      3.2 
      1.3 
      0.2 
      Iris-setosa 
     
   
 
# Petal length greater than sepal length? 
iris . query ( 'Petal_Length > Sepal_Length' ) 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
   
 
# reverse 
iris . query ( 'Sepal_Length > Petal_Length' ) 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      0 
      4.9 
      3.0 
      1.4 
      0.2 
      Iris-setosa 
     
    
      1 
      4.7 
      3.2 
      1.3 
      0.2 
      Iris-setosa 
     
    
      2 
      4.6 
      3.1 
      1.5 
      0.2 
      Iris-setosa 
     
    
      3 
      5.0 
      3.6 
      1.4 
      0.2 
      Iris-setosa 
     
    
      4 
      5.4 
      3.9 
      1.7 
      0.4 
      Iris-setosa 
     
    
      5 
      4.6 
      3.4 
      1.4 
      0.3 
      Iris-setosa 
     
    
      6 
      5.0 
      3.4 
      1.5 
      0.2 
      Iris-setosa 
     
    
      7 
      4.4 
      2.9 
      1.4 
      0.2 
      Iris-setosa 
     
    
      8 
      4.9 
      3.1 
      1.5 
      0.1 
      Iris-setosa 
     
    
      9 
      5.4 
      3.7 
      1.5 
      0.2 
      Iris-setosa 
     
    
      10 
      4.8 
      3.4 
      1.6 
      0.2 
      Iris-setosa 
     
    
      11 
      4.8 
      3.0 
      1.4 
      0.1 
      Iris-setosa 
     
    
      12 
      4.3 
      3.0 
      1.1 
      0.1 
      Iris-setosa 
     
    
      13 
      5.8 
      4.0 
      1.2 
      0.2 
      Iris-setosa 
     
    
      14 
      5.7 
      4.4 
      1.5 
      0.4 
      Iris-setosa 
     
    
      15 
      5.4 
      3.9 
      1.3 
      0.4 
      Iris-setosa 
     
    
      16 
      5.1 
      3.5 
      1.4 
      0.3 
      Iris-setosa 
     
    
      17 
      5.7 
      3.8 
      1.7 
      0.3 
      Iris-setosa 
     
    
      18 
      5.1 
      3.8 
      1.5 
      0.3 
      Iris-setosa 
     
    
      19 
      5.4 
      3.4 
      1.7 
      0.2 
      Iris-setosa 
     
    
      20 
      5.1 
      3.7 
      1.5 
      0.4 
      Iris-setosa 
     
    
      21 
      4.6 
      3.6 
      1.0 
      0.2 
      Iris-setosa 
     
    
      22 
      5.1 
      3.3 
      1.7 
      0.5 
      Iris-setosa 
     
    
      23 
      4.8 
      3.4 
      1.9 
      0.2 
      Iris-setosa 
     
    
      24 
      5.0 
      3.0 
      1.6 
      0.2 
      Iris-setosa 
     
    
      25 
      5.0 
      3.4 
      1.6 
      0.4 
      Iris-setosa 
     
    
      26 
      5.2 
      3.5 
      1.5 
      0.2 
      Iris-setosa 
     
    
      27 
      5.2 
      3.4 
      1.4 
      0.2 
      Iris-setosa 
     
    
      28 
      4.7 
      3.2 
      1.6 
      0.2 
      Iris-setosa 
     
    
      29 
      4.8 
      3.1 
      1.6 
      0.2 
      Iris-setosa 
     
    
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
     
    
      119 
      6.9 
      3.2 
      5.7 
      2.3 
      Iris-virginica 
     
    
      120 
      5.6 
      2.8 
      4.9 
      2.0 
      Iris-virginica 
     
    
      121 
      7.7 
      2.8 
      6.7 
      2.0 
      Iris-virginica 
     
    
      122 
      6.3 
      2.7 
      4.9 
      1.8 
      Iris-virginica 
     
    
      123 
      6.7 
      3.3 
      5.7 
      2.1 
      Iris-virginica 
     
    
      124 
      7.2 
      3.2 
      6.0 
      1.8 
      Iris-virginica 
     
    
      125 
      6.2 
      2.8 
      4.8 
      1.8 
      Iris-virginica 
     
    
      126 
      6.1 
      3.0 
      4.9 
      1.8 
      Iris-virginica 
     
    
      127 
      6.4 
      2.8 
      5.6 
      2.1 
      Iris-virginica 
     
    
      128 
      7.2 
      3.0 
      5.8 
      1.6 
      Iris-virginica 
     
    
      129 
      7.4 
      2.8 
      6.1 
      1.9 
      Iris-virginica 
     
    
      130 
      7.9 
      3.8 
      6.4 
      2.0 
      Iris-virginica 
     
    
      131 
      6.4 
      2.8 
      5.6 
      2.2 
      Iris-virginica 
     
    
      132 
      6.3 
      2.8 
      5.1 
      1.5 
      Iris-virginica 
     
    
      133 
      6.1 
      2.6 
      5.6 
      1.4 
      Iris-virginica 
     
    
      134 
      7.7 
      3.0 
      6.1 
      2.3 
      Iris-virginica 
     
    
      135 
      6.3 
      3.4 
      5.6 
      2.4 
      Iris-virginica 
     
    
      136 
      6.4 
      3.1 
      5.5 
      1.8 
      Iris-virginica 
     
    
      137 
      6.0 
      3.0 
      4.8 
      1.8 
      Iris-virginica 
     
    
      138 
      6.9 
      3.1 
      5.4 
      2.1 
      Iris-virginica 
     
    
      139 
      6.7 
      3.1 
      5.6 
      2.4 
      Iris-virginica 
     
    
      140 
      6.9 
      3.1 
      5.1 
      2.3 
      Iris-virginica 
     
    
      141 
      5.8 
      2.7 
      5.1 
      1.9 
      Iris-virginica 
     
    
      142 
      6.8 
      3.2 
      5.9 
      2.3 
      Iris-virginica 
     
    
      143 
      6.7 
      3.3 
      5.7 
      2.5 
      Iris-virginica 
     
    
      144 
      6.7 
      3.0 
      5.2 
      2.3 
      Iris-virginica 
     
    
      145 
      6.3 
      2.5 
      5.0 
      1.9 
      Iris-virginica 
     
    
      146 
      6.5 
      3.0 
      5.2 
      2.0 
      Iris-virginica 
     
    
      147 
      6.2 
      3.4 
      5.4 
      2.3 
      Iris-virginica 
     
    
      148 
      5.9 
      3.0 
      5.1 
      1.8 
      Iris-virginica 
     
   
149 rows × 5 columns
 
# alternatively 
iris [ iris . Sepal_Length  >  iris . Petal_Length ] 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      0 
      4.9 
      3.0 
      1.4 
      0.2 
      Iris-setosa 
     
    
      1 
      4.7 
      3.2 
      1.3 
      0.2 
      Iris-setosa 
     
    
      2 
      4.6 
      3.1 
      1.5 
      0.2 
      Iris-setosa 
     
    
      3 
      5.0 
      3.6 
      1.4 
      0.2 
      Iris-setosa 
     
    
      4 
      5.4 
      3.9 
      1.7 
      0.4 
      Iris-setosa 
     
    
      5 
      4.6 
      3.4 
      1.4 
      0.3 
      Iris-setosa 
     
    
      6 
      5.0 
      3.4 
      1.5 
      0.2 
      Iris-setosa 
     
    
      7 
      4.4 
      2.9 
      1.4 
      0.2 
      Iris-setosa 
     
    
      8 
      4.9 
      3.1 
      1.5 
      0.1 
      Iris-setosa 
     
    
      9 
      5.4 
      3.7 
      1.5 
      0.2 
      Iris-setosa 
     
    
      10 
      4.8 
      3.4 
      1.6 
      0.2 
      Iris-setosa 
     
    
      11 
      4.8 
      3.0 
      1.4 
      0.1 
      Iris-setosa 
     
    
      12 
      4.3 
      3.0 
      1.1 
      0.1 
      Iris-setosa 
     
    
      13 
      5.8 
      4.0 
      1.2 
      0.2 
      Iris-setosa 
     
    
      14 
      5.7 
      4.4 
      1.5 
      0.4 
      Iris-setosa 
     
    
      15 
      5.4 
      3.9 
      1.3 
      0.4 
      Iris-setosa 
     
    
      16 
      5.1 
      3.5 
      1.4 
      0.3 
      Iris-setosa 
     
    
      17 
      5.7 
      3.8 
      1.7 
      0.3 
      Iris-setosa 
     
    
      18 
      5.1 
      3.8 
      1.5 
      0.3 
      Iris-setosa 
     
    
      19 
      5.4 
      3.4 
      1.7 
      0.2 
      Iris-setosa 
     
    
      20 
      5.1 
      3.7 
      1.5 
      0.4 
      Iris-setosa 
     
    
      21 
      4.6 
      3.6 
      1.0 
      0.2 
      Iris-setosa 
     
    
      22 
      5.1 
      3.3 
      1.7 
      0.5 
      Iris-setosa 
     
    
      23 
      4.8 
      3.4 
      1.9 
      0.2 
      Iris-setosa 
     
    
      24 
      5.0 
      3.0 
      1.6 
      0.2 
      Iris-setosa 
     
    
      25 
      5.0 
      3.4 
      1.6 
      0.4 
      Iris-setosa 
     
    
      26 
      5.2 
      3.5 
      1.5 
      0.2 
      Iris-setosa 
     
    
      27 
      5.2 
      3.4 
      1.4 
      0.2 
      Iris-setosa 
     
    
      28 
      4.7 
      3.2 
      1.6 
      0.2 
      Iris-setosa 
     
    
      29 
      4.8 
      3.1 
      1.6 
      0.2 
      Iris-setosa 
     
    
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
     
    
      119 
      6.9 
      3.2 
      5.7 
      2.3 
      Iris-virginica 
     
    
      120 
      5.6 
      2.8 
      4.9 
      2.0 
      Iris-virginica 
     
    
      121 
      7.7 
      2.8 
      6.7 
      2.0 
      Iris-virginica 
     
    
      122 
      6.3 
      2.7 
      4.9 
      1.8 
      Iris-virginica 
     
    
      123 
      6.7 
      3.3 
      5.7 
      2.1 
      Iris-virginica 
     
    
      124 
      7.2 
      3.2 
      6.0 
      1.8 
      Iris-virginica 
     
    
      125 
      6.2 
      2.8 
      4.8 
      1.8 
      Iris-virginica 
     
    
      126 
      6.1 
      3.0 
      4.9 
      1.8 
      Iris-virginica 
     
    
      127 
      6.4 
      2.8 
      5.6 
      2.1 
      Iris-virginica 
     
    
      128 
      7.2 
      3.0 
      5.8 
      1.6 
      Iris-virginica 
     
    
      129 
      7.4 
      2.8 
      6.1 
      1.9 
      Iris-virginica 
     
    
      130 
      7.9 
      3.8 
      6.4 
      2.0 
      Iris-virginica 
     
    
      131 
      6.4 
      2.8 
      5.6 
      2.2 
      Iris-virginica 
     
    
      132 
      6.3 
      2.8 
      5.1 
      1.5 
      Iris-virginica 
     
    
      133 
      6.1 
      2.6 
      5.6 
      1.4 
      Iris-virginica 
     
    
      134 
      7.7 
      3.0 
      6.1 
      2.3 
      Iris-virginica 
     
    
      135 
      6.3 
      3.4 
      5.6 
      2.4 
      Iris-virginica 
     
    
      136 
      6.4 
      3.1 
      5.5 
      1.8 
      Iris-virginica 
     
    
      137 
      6.0 
      3.0 
      4.8 
      1.8 
      Iris-virginica 
     
    
      138 
      6.9 
      3.1 
      5.4 
      2.1 
      Iris-virginica 
     
    
      139 
      6.7 
      3.1 
      5.6 
      2.4 
      Iris-virginica 
     
    
      140 
      6.9 
      3.1 
      5.1 
      2.3 
      Iris-virginica 
     
    
      141 
      5.8 
      2.7 
      5.1 
      1.9 
      Iris-virginica 
     
    
      142 
      6.8 
      3.2 
      5.9 
      2.3 
      Iris-virginica 
     
    
      143 
      6.7 
      3.3 
      5.7 
      2.5 
      Iris-virginica 
     
    
      144 
      6.7 
      3.0 
      5.2 
      2.3 
      Iris-virginica 
     
    
      145 
      6.3 
      2.5 
      5.0 
      1.9 
      Iris-virginica 
     
    
      146 
      6.5 
      3.0 
      5.2 
      2.0 
      Iris-virginica 
     
    
      147 
      6.2 
      3.4 
      5.4 
      2.3 
      Iris-virginica 
     
    
      148 
      5.9 
      3.0 
      5.1 
      1.8 
      Iris-virginica 
     
   
149 rows × 5 columns
 
The Challenges of Data 
Missing Values 
# Identifiy missing values 
pd . isnull ( digits ) 
 
  
    
       
      0 
      1 
      2 
      3 
      4 
      5 
      6 
      7 
      8 
      9 
      ... 
      55 
      56 
      57 
      58 
      59 
      60 
      61 
      62 
      63 
      64 
     
   
  
    
      0 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      1 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      2 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      4 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      5 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      6 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      7 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      8 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      9 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      10 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      11 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      12 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      13 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      14 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      15 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      16 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      17 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      18 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      19 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      20 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      21 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      22 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      23 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      24 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      25 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      26 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      27 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      28 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      29 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
      ... 
     
    
      3793 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3794 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3795 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3796 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3797 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3798 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3799 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3800 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3801 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3802 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3803 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3804 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3805 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3806 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3807 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3808 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3809 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3810 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3811 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3812 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3813 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3814 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3815 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3816 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3817 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3818 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3819 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3820 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3821 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
    
      3822 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      ... 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
      False 
     
   
3823 rows × 65 columns
 
Delete 
# Drop rows with missing values 
df . dropna ( axis = 0 ) 
# Drop columns with missing values  
df . dropna ( axis = 1 ) 
 
Impute 
Imputation: mean, median, another variable, estimate with regression ANOVA, logit, k-NN.
# Import NumPy 
import  numpy  as  np 
# Calculate the mean of the DataFrame variable Salary 
mean  =  np . mean ( df . Salary ) 
# Replace missing values with the mean 
df  =  df . Salary . fillna ( mean ) 
df  =  df . Salary . fillna ( mean ,  method = ffill ) 
 
ffill and bfill for forward and backward fill. 
 
from  scipy  import  interpolate 
# Fill the DataFrame 
df . interpolate () 
df . interpolate ( method = cubic ) 
 
cubic, polynomial. 
limit and limit_direction. 
 
Outliers 
Delete (data entry, processing errors), transform (assign weights, natural log to reduce variation) or impute them (replace extreme values with median, mean or mode values).
The Data’s Features 
Feature Engineering 
Increase the predictive power of learning algorithms by creating features from raw data that will help the learning process.
Encode categorical variables into numerical ones 
# Factorize the values  
labels , levels  =  pd . factorize ( iris . Class ) 
# Save the encoded variables in `iris.Class` 
iris . Class  =  labels 
# Print out the first rows 
iris . Class . head () 
 
0     0 
1     0 
2     0 
3     0 
4     0 
Name :  Class ,  dtype :  int 64 
 
Bin continuous variables in groups 
# Define the bins 
mybins  =  range ( 0 ,  df . age . max (),  10 ) 
# Cut the data from the DataFrame with the help of the bins 
df [ 'age_bucket' ]  =  pd . cut ( df . age ,  bins = mybins ) 
# Count the number of values per bucket 
df [ 'age_bucket' ] . value_counts () 
 
Scale features 
Center the data around 0.
from  sklearn.preprocessing  import  StandardScaler 
scaler  =  StandardScaler () . fit ( X ) 
rescaledX  =  scaler . transform ( X ) 
 
Feature Selection 
Select the key subset of original data features in an attempt to reduce the dimensionality of the training problem.
PCA combines similar (correlated) attributes and creates new ones that are considered superior to the original attributes of the dataset. 
Feature selection doesn’t combine attributes: it evaluates the quality and predictive power and selects the best set.
To find important features, calculate how much better or worse a model does when we leave one variable out of the equation.
# Import `RandomForestClassifier` 
from  sklearn.ensemble  import  RandomForestClassifier 
# Isolate Data, class labels and column values 
X  =  iris . iloc [:, 0 : 4 ] 
Y  =  iris . iloc [:, - 1 ] 
names  =  iris . columns . values 
# Build the model 
rfc  =  RandomForestClassifier () 
# Fit the model 
rfc . fit ( X ,  Y ) 
# Print the results 
print ( "Features sorted by their score:" ) 
print ( sorted ( zip ( map ( lambda  x :  round ( x ,  4 ),  rfc . feature_importances_ ),  names ),  reverse = True )) 
 
Features sorted by their score:
[(0.4899, 'Petal_Length'), (0.2752, 'Petal_Width'), (0.2185, 'Sepal_Length'), (0.016400000000000001, 'Sepal_Width')]
 
The best feature set is one that includes the petal length and petal width data.
# Isolate feature importances  
importance  =  rfc . feature_importances_ 
# Sort the feature importances  
sorted_importances  =  np . argsort ( importance ) 
# Insert padding 
padding  =  np . arange ( len ( names ) - 1 )  +  0.5 
# Plot the data 
plt . barh ( padding ,  importance [ sorted_importances ],  align = 'center' ) 
# Customize the plot 
plt . yticks ( padding ,  names [ sorted_importances ]) 
plt . xlabel ( "Relative Importance" ) 
plt . title ( "Variable Importance" ) 
# Show the plot 
plt . show () 
 
Patterns In the Data 
Visualization of the data; static with Matplotlib or Seaborn, interactive with Bokeh or Plotly.
Correlation Identification with PCA from scikit-learn 
Matplotlib 
Dimensionality Reduction techniques, such as Principal Component Analysis (PCA). From ‘many’ to two ‘principal components’.
# Import `PCA` from `sklearn.decomposition` 
from  sklearn.decomposition  import  PCA 
# Build the model 
pca  =  PCA ( n_components = 2 ) 
# Reduce the data, output is ndarray 
reduced_data  =  pca . fit_transform ( digits ) 
# Inspect shape of the `reduced_data` 
reduced_data . shape 
# print out the reduced data 
print ( reduced_data ) 
 
[[ 12.65674168  -4.63610357] 
 [ 16.82906354 -12.96575346] 
 [-19.08072301  10.58293767] 
 ...,  
 [ 23.90693984   6.06265415] 
 [ 29.1798759   -3.06847144] 
 [-25.23132536  11.60863909]] 
 
reduced_data  =  pd . DataFrame ( reduced_data ) 
 
import  matplotlib.pyplot  as  plt 
plt . scatter ( reduced_data [ 0 ],  reduced_data [ 1 ]) 
plt . show () 
 
Bokeh 
To be implemented in a webpage for example.
from  bokeh.charts  import  Scatter ,  output_file ,  show 
# Construct the scatter plot 
p  =  Scatter ( iris ,  x = 'Petal_Length' ,  y = 'Petal_Width' ,  color = "Class" ,  title = "Petal Length vs Petal Width" , 
            xlabel = "Sepal Length" ,  ylabel = "Sepal Width" ) 
# Output the file  
output_file ( 'scatter.html' ) 
# Show the scatter plot 
show ( p ) 
 
The GIF output:
 
Correlation Identification with Pandas 
The Pearson correlation assumes that the variables are normally distributed, that there is a straight line relationship between each of the variables and that the data is normally distributed about the regression line. 
The Spearman correlation, on the other hand, assumes that we have two ordinal variables or two variables that are related in some way, but not linearly. The Spearman coefficient is the sum of deviation squared by n times n minus 1.
The Kendall Tau correlation is a coefficient that represents the degree of concordance between two columns of ranked data. We can use the Spearman correlation to measure the degree of association between two variables. The Kendal Tau coefficient is calculated by the number of concordant pairs minus the number of discordant pairs divided by the total number of pairs.
Spearman’s coefficient will usually be larger than the Kendall’s Tau coefficient, but this is not always the case: we’ll get a smaller Spearman’s coefficient when the deviations are huge among the observations of the data. The Spearman correlation is very sensitive to this and this might come in handy in some cases!
The two last correlation measures require ranking the data.
# Pearson correlation 
iris . corr () 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      Sepal_Length 
      1.000000 
      -0.103784 
      0.871283 
      0.816971 
      0.781219 
     
    
      Sepal_Width 
      -0.103784 
      1.000000 
      -0.415218 
      -0.350733 
      -0.414532 
     
    
      Petal_Length 
      0.871283 
      -0.415218 
      1.000000 
      0.962314 
      0.948519 
     
    
      Petal_Width 
      0.816971 
      -0.350733 
      0.962314 
      1.000000 
      0.956014 
     
    
      Class 
      0.781219 
      -0.414532 
      0.948519 
      0.956014 
      1.000000 
     
   
 
iris2  =  iris . rank () 
# Kendall Tau correlation 
iris2 . corr ( 'kendall' ) 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      Sepal_Length 
      1.000000 
      -0.067636 
      0.718290 
      0.654197 
      0.669163 
     
    
      Sepal_Width 
      -0.067636 
      1.000000 
      -0.175665 
      -0.140207 
      -0.327228 
     
    
      Petal_Length 
      0.718290 
      -0.175665 
      1.000000 
      0.803041 
      0.822578 
     
    
      Petal_Width 
      0.654197 
      -0.140207 
      0.803041 
      1.000000 
      0.837934 
     
    
      Class 
      0.669163 
      -0.327228 
      0.822578 
      0.837934 
      1.000000 
     
   
 
# Spearman Rank correlation 
iris2 . corr ( 'spearman' ) 
 
  
    
       
      Sepal_Length 
      Sepal_Width 
      Petal_Length 
      Petal_Width 
      Class 
     
   
  
    
      Sepal_Length 
      1.000000 
      -0.152136 
      0.881759 
      0.833586 
      0.796546 
     
    
      Sepal_Width 
      -0.152136 
      1.000000 
      -0.294020 
      -0.267686 
      -0.426319 
     
    
      Petal_Length 
      0.881759 
      -0.294020 
      1.000000 
      0.936188 
      0.935220 
     
    
      Petal_Width 
      0.833586 
      -0.267686 
      0.936188 
      1.000000 
      0.937409 
     
    
      Class 
      0.796546 
      -0.426319 
      0.935220 
      0.937409 
      1.000000