Time Series Analysis

Foreword

Code snippets and excerpts from the tutorial. Python 3. From DataCamp.


Approach

Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over time while learning about trends and seasonality in time series data.

  • Source the data
  • Wrangle the data
  • Exploratory Data Analysis
  • Trends and seasonality in time series data
    • Identifying Trends
    • Seasonal patterns
      • First Order Differencing
      • Periodicity and Autocorrelation

Importing Packages and the Data

The data are from Google Trends.

# Import the packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Switch to the Seaborn defaults
sns.set()
# Import the data
df = pd.read_csv('multiTimeline.csv', skiprows=1)
# Check out the first rows
df.head(3)
     Month  diet: (Worldwide)  gym: (Worldwide)  finance: (Worldwide)
0  2004-01                100                31                    48
1  2004-02                 75                26                    49
2  2004-03                 67                24                    47
# Check out the data types, number of rows and more
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 4 columns):
Month                   168 non-null object
diet: (Worldwide)       168 non-null int64
gym: (Worldwide)        168 non-null int64
finance: (Worldwide)    168 non-null int64
dtypes: int64(3), object(1)
memory usage: 5.3+ KB

Wrangle the Data

Rename the columns of the DataFrame df so that the names contain no whitespace.

df.columns = ['month', 'diet', 'gym', 'finance']
df.head()
     month  diet  gym  finance
0  2004-01   100   31       48
1  2004-02    75   26       49
2  2004-03    67   24       47
3  2004-04    70   22       48
4  2004-05    72   22       43

Convert the month column from the object (string) data type to a datetime data type.

df.month = pd.to_datetime(df.month)

Make it the index of the DataFrame. Passing inplace=True modifies df directly, replacing its original index with the month column, instead of returning a new DataFrame.

df.set_index('month', inplace=True)
df.head()
            diet  gym  finance
month
2004-01-01   100   31       48
2004-02-01    75   26       49
2004-03-01    67   24       47
2004-04-01    70   22       48
2004-05-01    72   22       43
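
The renaming, parsing, and indexing steps above can also be sketched in one pass. This is a minimal sketch that inlines the first three rows shown earlier instead of reading the real multiTimeline.csv; the names raw and df_demo are illustrative only, and the explicit format string is an assumption about how the 'YYYY-MM' values should be parsed.

```python
import io
import pandas as pd

# First three rows from the output shown earlier, inlined so the sketch runs without the CSV file
raw = io.StringIO(
    "Month,diet: (Worldwide),gym: (Worldwide),finance: (Worldwide)\n"
    "2004-01,100,31,48\n"
    "2004-02,75,26,49\n"
    "2004-03,67,24,47\n"
)

df_demo = pd.read_csv(raw)
df_demo.columns = ['month', 'diet', 'gym', 'finance']

# An explicit format removes any guessing about how 'YYYY-MM' strings are parsed
df_demo['month'] = pd.to_datetime(df_demo['month'], format='%Y-%m')

# set_index returns a new DataFrame here, as an alternative to inplace=True
df_demo = df_demo.set_index('month')
print(df_demo)
```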

Exploratory Data Analysis (EDA)

Use a built-in pandas visualization method .plot() to plot the data. Specify the label on the x-axis.

df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);

A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the peak.
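
To make the scale concrete, here is a minimal sketch of this kind of peak-relative normalization. The raw counts are made up for illustration, and the rounding step is an assumption; the exact procedure Google uses is not described in this tutorial.

```python
import pandas as pd

# Hypothetical raw search volumes (made-up numbers, for illustration only)
raw_counts = pd.Series([1200, 900, 600, 5])

# Rescale so that the peak maps to 100 and every other value is relative to it
scaled = (100 * raw_counts / raw_counts.max()).round().astype(int)
print(scaled.tolist())  # [100, 75, 50, 0]
```

The last value is below 1% of the peak, so it rounds down to 0.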

Plot one column.

df[['diet']].plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);

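Take a rolling average to smooth out the seasonality and see the trend. A 12-month window averages over one full yearly cycle. Note that rolling(12) uses a trailing window by default (each point and the 11 before it); pass center=True to average over points before and after each point.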
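
A minimal sketch of how the window behaves, using a synthetic series (a linear trend plus a 12-month cycle) rather than the Google Trends data. The names series, trailing, and centered are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month seasonal cycle
idx = pd.date_range('2004-01-01', periods=48, freq='MS')
t = np.arange(48)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Default: trailing window (the current month and the 11 before it)
trailing = series.rolling(12).mean()

# center=True: the window straddles each point instead
centered = series.rolling(12, center=True).mean()

# The first 11 trailing values are NaN because the window is not yet full
print(trailing.isna().sum())  # 11
```

Because the window covers exactly one seasonal cycle, the sine component averages out and only the (lagged) linear trend remains.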

# Extract one column with double brackets
# so that the result stays a DataFrame rather than a Series
diet = df[['diet']]

# Chain: rolling, avg, plot
diet.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)

# Label the x-axis
plt.xlabel('Year', fontsize=20);

# Another column
gym = df[['gym']]

# Chain
gym.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)

# Label the x-axis
plt.xlabel('Year', fontsize=20);

# Create a new df with both
df_rm = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)

# Chain
df_rm.plot(figsize=(20,10), linewidth=5, fontsize=20)

# Label the x-axis
plt.xlabel('Year', fontsize=20);

Seasonal Patterns in Time Series

We can remove the trend from the time series by subtracting the rolling mean from the original signal. This leaves mainly the seasonality and brings the data closer to a stationary time series (one whose statistical properties, such as mean and variance, don't change over time). Many time series forecasting methods assume that the time series is approximately stationary.
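
Subtracting the rolling mean can be sketched as follows on synthetic data (a linear trend plus a 12-month cycle, not the Google Trends series). All names are illustrative. With the default trailing window, the rolling mean lags the trend, so a constant offset remains after subtraction, but the upward drift is gone.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month seasonal cycle
t = np.arange(60)
idx = pd.date_range('2004-01-01', periods=60, freq='MS')
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Subtract the 12-month rolling mean; drop the leading NaNs from the unfilled window
detrended = (series - series.rolling(12).mean()).dropna()

# Compare yearly averages at the start and the end of each series
drift_before = series.iloc[-12:].mean() - series.iloc[:12].mean()
drift_after = detrended.iloc[-12:].mean() - detrended.iloc[:12].mean()
print(drift_before, drift_after)
```

The original series drifts upward between the first and last year, while the detrended series has essentially the same yearly average at both ends.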

Another way to remove the trend is called “differencing”.

First-order differencing

First-order differencing takes the difference between each data point and the one before it.

Second-order differencing applies differencing twice: it takes the difference of the first-order differences, which involves each data point and the two that precede it.
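
A minimal sketch of both orders of differencing on a small made-up series (not the Google Trends data). It also checks the closed form y[t] - 2*y[t-1] + y[t-2] for the second-order difference.

```python
import pandas as pd

# Small made-up series with quadratic growth, for illustration only
y = pd.Series([1, 4, 9, 16, 25, 36])

first = y.diff()          # y[t] - y[t-1]
second = y.diff().diff()  # difference of the first differences

# Equivalent closed form: y[t] - 2*y[t-1] + y[t-2]
closed_form = y - 2 * y.shift(1) + y.shift(2)

print(first.tolist())   # [nan, 3.0, 5.0, 7.0, 9.0, 11.0]
print(second.tolist())  # [nan, nan, 2.0, 2.0, 2.0, 2.0]
```

First-order differencing removes a linear trend; here the growth is quadratic, so it takes the second difference to reach a constant.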

More on differencing.

# Differencing
diet.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)

# Plot
plt.xlabel('Year', fontsize=20);

Periodicity and Autocorrelation

A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.

Another way of thinking about this is that the time series is correlated with itself shifted by 12 months. That means that, if we took the time series and moved it 12 months backwards or forwards, it would map onto itself in some way. This is called autocorrelation.
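
The idea of a series being correlated with a shifted copy of itself can be sketched with Series.autocorr. The data below is synthetic (a noisy 12-step sine wave), and the names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic series that repeats every 12 steps, plus a little noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(120))

lag_12 = series.autocorr(lag=12)  # one full period: the series lines up with itself
lag_6 = series.autocorr(lag=6)    # half a period: peaks line up with troughs

# autocorr is the Pearson correlation between the series and its shifted copy
manual = series.corr(series.shift(12))

print(round(lag_12, 2), round(lag_6, 2))
```

The lag-12 value is strongly positive and the lag-6 value strongly negative, which is the signature of a 12-step period.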

A Word about Correlation

from sklearn import datasets

iris = datasets.load_iris()

df_iris = pd.DataFrame(data = np.c_[iris['data'], iris['target']],
                       columns = iris['feature_names'] + ['target'])
df_iris.head(3)
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
           fit_reg=False, data=df_iris)

sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
           fit_reg=True, data=df_iris);

sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
           fit_reg=False, data=df_iris, hue='target')

df_iris.corr()
                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)     target
sepal length (cm)           1.000000         -0.109369           0.871754          0.817954   0.782561
sepal width (cm)           -0.109369          1.000000          -0.420516         -0.356544  -0.419446
petal length (cm)           0.871754         -0.420516           1.000000          0.962757   0.949043
petal width (cm)            0.817954         -0.356544           0.962757          1.000000   0.956464
target                      0.782561         -0.419446           0.949043          0.956464   1.000000
df_iris.groupby(['target']).corr()
                          petal length (cm)  petal width (cm)  sepal length (cm)  sepal width (cm)
target
0.0    petal length (cm)           1.000000          0.306308           0.263874          0.176695
       petal width (cm)            0.306308          1.000000           0.279092          0.279973
       sepal length (cm)           0.263874          0.279092           1.000000          0.746780
       sepal width (cm)            0.176695          0.279973           0.746780          1.000000
1.0    petal length (cm)           1.000000          0.786668           0.754049          0.560522
       petal width (cm)            0.786668          1.000000           0.546461          0.663999
       sepal length (cm)           0.754049          0.546461           1.000000          0.525911
       sepal width (cm)            0.560522          0.663999           0.525911          1.000000
2.0    petal length (cm)           1.000000          0.322108           0.864225          0.401045
       petal width (cm)            0.322108          1.000000           0.281108          0.537728
       sepal length (cm)           0.864225          0.281108           1.000000          0.457228
       sepal width (cm)            0.401045          0.537728           0.457228          1.000000
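
In the tables above, sepal length and sepal width are slightly negatively correlated overall, yet positively correlated within each target class. A minimal sketch with made-up data shows how grouping can flip the sign of a correlation. The DataFrame df_demo and its columns are hypothetical, not part of the iris data.

```python
import pandas as pd

# Two made-up groups: within each group y rises with x,
# but group B has larger x and smaller y than group A
df_demo = pd.DataFrame({
    'group': ['A'] * 10 + ['B'] * 10,
    'x': list(range(20)),
    'y': list(range(20, 30)) + list(range(0, 10)),
})

# Correlation over all rows, ignoring the groups
overall = df_demo['x'].corr(df_demo['y'])

# Correlation computed separately inside each group
within = {name: g['x'].corr(g['y']) for name, g in df_demo.groupby('group')}

print(round(overall, 2))
print({name: round(value, 2) for name, value in within.items()})
```

The overall coefficient is negative while both within-group coefficients are positive, so a single correlation number can hide structure in the data.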

Periodicity and Autocorrelation (continued)

The time series again.

df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)

df.corr()
              diet        gym    finance
diet      1.000000  -0.100764  -0.034639
gym      -0.100764   1.000000  -0.284279
finance  -0.034639  -0.284279   1.000000

diet and gym are weakly negatively correlated (about -0.10). However, looking at the time series, it seems that their seasonal components are positively correlated while their trends are negatively correlated. The correlation coefficient captures both effects at once.

# first-order differences
df.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)

diet and gym are strongly positively correlated (about 0.76) once we remove the trend.

df.diff().corr()
             diet       gym   finance
diet     1.000000  0.758707  0.373828
gym      0.758707  1.000000  0.301111
finance  0.373828  0.301111  1.000000
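
The same effect can be reproduced on synthetic data. In this minimal sketch, two made-up series share a seasonal component but have opposite trends; the columns a and b are hypothetical stand-ins, not the diet and gym data.

```python
import numpy as np
import pandas as pd

t = np.arange(120)
seasonal = 10 * np.sin(2 * np.pi * t / 12)

df_demo = pd.DataFrame({
    'a': seasonal + 0.5 * t,  # shared seasonality, upward trend
    'b': seasonal - 0.5 * t,  # shared seasonality, downward trend
})

# Correlation of the raw series is dominated by the opposing trends
raw_corr = df_demo['a'].corr(df_demo['b'])

# First-order differencing removes the linear trends, leaving the shared seasonality
diff_corr = df_demo['a'].diff().corr(df_demo['b'].diff())

print(round(raw_corr, 2), round(diff_corr, 2))
```

The raw correlation is negative, while the correlation of the differenced series is close to 1.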

Autocorrelation

Autocorrelation of the ‘diet’ series: on the x-axis, we have the lag and on the y-axis, we have how correlated the time series is with itself at that lag.

For example, if a time series repeated itself every two days, we would expect to see a spike in the autocorrelation function at a lag of 2 days.

pd.plotting.autocorrelation_plot(diet)

There is a large peak in correlation at a lag of 12 months. There is another peak at 24 months and another at 36, but the correlation weakens as the lag grows.

The horizontal lines in the above plot indicate the statistical significance of the correlation: they mark the 95% (solid) and 99% (dashed) confidence bands, so autocorrelations that fall outside them are unlikely to be due to chance.
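
A rough sketch of how such significance bands can be computed by hand, assuming the usual white-noise approximation z / sqrt(n) with z = 1.96 for 95% and z = 2.58 for 99%. The series is synthetic, and Series.autocorr is used here for simplicity, so the values may differ slightly from the estimator behind the plot.

```python
import numpy as np
import pandas as pd

# Synthetic monthly-like series with a 12-step period (168 points, as in the data above)
rng = np.random.default_rng(0)
t = np.arange(168)
series = pd.Series(np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(168))

n = len(series)
band_95 = 1.96 / np.sqrt(n)  # assumed white-noise approximation
band_99 = 2.58 / np.sqrt(n)

# Autocorrelation at lags 1 to 36
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 37)})

# Lags whose autocorrelation falls outside the 99% band
significant = acf[acf.abs() > band_99]
print(significant.index.tolist())
```

Lags near multiples of 12 come out strongly positive and well outside the bands, while lags near 6, 18, and 30 come out strongly negative.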

Forecasts, ARIMA…

A natural next step is to use ARIMA modeling to forecast what these search trends will look like over the coming years.