Time Series Analysis
Foreword
Code snippets and excerpts from the tutorial. Python 3. From DataCamp.
Approach
Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over time while learning about trends and seasonality in time series data.
- Source the data
- Wrangle the data
- Exploratory Data Analysis
- Trends and seasonality in time series data
- Identifying Trends
- Seasonal patterns
- First Order Differencing
- Periodicity and Autocorrelation
Importing Packages and the Data
The data are from Google Trends.
# Import the packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Switch to the Seaborn defaults
sns.set()
# Import the data
df = pd.read_csv('multiTimeline.csv', skiprows=1)
# Check out the first rows
df.head(3)
| | Month | diet: (Worldwide) | gym: (Worldwide) | finance: (Worldwide) |
|---|---|---|---|---|
| 0 | 2004-01 | 100 | 31 | 48 |
| 1 | 2004-02 | 75 | 26 | 49 |
| 2 | 2004-03 | 67 | 24 | 47 |
# Check out the data types, number of rows and more
df.info()
Wrangle the Data
Rename the columns of the DataFrame df so that they contain no whitespace.
df.columns = ['month', 'diet', 'gym', 'finance']
df.head()
| | month | diet | gym | finance |
|---|---|---|---|---|
| 0 | 2004-01 | 100 | 31 | 48 |
| 1 | 2004-02 | 75 | 26 | 49 |
| 2 | 2004-03 | 67 | 24 | 47 |
| 3 | 2004-04 | 70 | 22 | 48 |
| 4 | 2004-05 | 72 | 22 | 43 |
Turn the month column into a DateTime data type (instead of object).
df.month = pd.to_datetime(df.month)
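As a quick check of what this conversion does, the sketch below applies pd.to_datetime to a few 'YYYY-MM' strings shaped like the month column. The sample values and the names months and parsed are made up for illustration, and the explicit format argument is an optional addition that the tutorial does not use.

```python
import pandas as pd

# Hypothetical sample shaped like the 'month' column (plain strings)
months = pd.Series(['2004-01', '2004-02', '2004-03'])

# Parse the strings; each month maps to a timestamp on the first day of that month
parsed = pd.to_datetime(months, format='%Y-%m')

# The result is a datetime column rather than a column of strings
print(pd.api.types.is_datetime64_any_dtype(parsed))  # True
print(parsed.iloc[0])                                # 2004-01-01 00:00:00
```

This is why the index shows dates such as 2004-01-01 once the column is converted.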
Make it the index of the DataFrame. Include the inplace argument when setting the index of the DataFrame df so that we alter the original index and set it to the month column.
df.set_index('month', inplace=True)
df.head()
| month | diet | gym | finance |
|---|---|---|---|
| 2004-01-01 | 100 | 31 | 48 |
| 2004-02-01 | 75 | 26 | 49 |
| 2004-03-01 | 67 | 24 | 47 |
| 2004-04-01 | 70 | 22 | 48 |
| 2004-05-01 | 72 | 22 | 43 |
Exploratory Data Analysis (EDA)
Use the built-in pandas visualization method .plot() to plot the data. Specify the label on the x-axis.
df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
A value of 100 marks the peak popularity for the term, and a value of 50 means that the term was half as popular as at the peak. Likewise, a score of 0 means the term was less than 1% as popular as the peak.
Plot one column.
df[['diet']].plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
Trends and Seasonality in Time Series
Identifying Trends in Time Series
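Take a rolling average to smooth out the seasonality and reveal the trend, for example over a 12-month window. Note that rolling(12) in pandas averages each point with the 11 points before it by default, rather than with the points before and after it.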
# Extract one column,
# but do not create a Series with df['diet']
diet = df[['diet']]
# Chain: rolling, avg, plot
diet.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
# Label the x-axis
plt.xlabel('Year', fontsize=20)
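To make the window behaviour concrete, here is a minimal sketch on a synthetic monthly series (a linear trend plus a 12-month sine wave). The data and the names s, trailing and centered are invented for illustration, and center=True is an optional argument that the tutorial does not use.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: linear trend + 12-month seasonal cycle (illustrative only)
idx = pd.date_range('2004-01-01', periods=48, freq='MS')
t = np.arange(48)
s = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Default: trailing window (the current point and the 11 before it)
trailing = s.rolling(12).mean()

# Optional: window centered on each point
centered = s.rolling(12, center=True).mean()

# The first 11 trailing values are NaN because the window is not yet full
print(trailing.isna().sum())  # 11

# Averaging over a full seasonal cycle cancels the sine term,
# so successive rolling means rise by the trend slope (0.5 per month)
steps = trailing.dropna().diff().dropna()
print(np.allclose(steps, 0.5))  # True
```

A window equal to the seasonal period is what removes the seasonality here; a shorter window would leave part of the cycle in the smoothed curve.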
# Another column
gym = df[['gym']]
# Chain
gym.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
# Label the x-axis
plt.xlabel('Year', fontsize=20)
# Create a new df with both
df_rm = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)
# Chain
df_rm.plot(figsize=(20,10), linewidth=5, fontsize=20)
# Label the x-axis
plt.xlabel('Year', fontsize=20)
Seasonal Patterns in Time Series
We can remove the trend from the time series by subtracting the rolling mean from the original signal. This leaves mainly the seasonality and moves the data toward a stationary time series, that is, one whose statistical properties (such as mean and variance) don’t change over time. Many time series forecasting methods assume that the time series is approximately stationary.
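The subtraction can be sketched on synthetic data as follows. The series and the names s, trend and detrended are made up for illustration; the 12-month window mirrors the rolling mean used earlier.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: upward trend + 12-month seasonal cycle (illustrative only)
t = np.arange(60)
s = pd.Series(2.0 * t + 10 * np.sin(2 * np.pi * t / 12))

# Estimate the trend with a 12-month rolling mean, then subtract it
trend = s.rolling(12).mean()
detrended = (s - trend).dropna()

# Compare the mean of the first and last 12 months of each series
raw_shift = s.iloc[-12:].mean() - s.iloc[:12].mean()
detrended_shift = detrended.iloc[-12:].mean() - detrended.iloc[:12].mean()

print(round(raw_shift, 2))        # large: the raw mean drifts with the trend
print(round(detrended_shift, 2))  # close to zero: the mean is stable
```

Because the default window is trailing, the detrended values carry a constant offset, but their mean no longer changes over time.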
Another way to remove the trend is called “differencing”.
First-order differencing
First-order differencing takes the difference between each data point and the one before it.
Second-order differencing applies the same operation again to the first differences, so each value depends on a data point and the two that precede it: y(t) - 2*y(t-1) + y(t-2).
# Differencing
diet.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
# Label the x-axis
plt.xlabel('Year', fontsize=20);
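For a small worked example of first- and second-order differencing, consider the sketch below. The series y is made up (squares of 0 to 5), chosen so that the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Small made-up series: y(t) = t^2, so the trend is quadratic
y = pd.Series([0, 1, 4, 9, 16, 25], dtype=float)

# First-order difference: y(t) - y(t-1)
d1 = y.diff()

# Second-order difference: the difference of the first differences,
# equivalent to y(t) - 2*y(t-1) + y(t-2)
d2 = y.diff().diff()
manual = y - 2 * y.shift(1) + y.shift(2)

print(d1.tolist())        # [nan, 1.0, 3.0, 5.0, 7.0, 9.0]
print(d2.tolist())        # [nan, nan, 2.0, 2.0, 2.0, 2.0]
print(d2.equals(manual))  # True
```

One round of differencing turns the quadratic trend into a linear one, and the second round leaves a constant.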
Periodicity and Autocorrelation
A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Yet another way of thinking about this is that the time series is correlated with itself shifted by 12 months. That means that, if we took the time series and moved it 12 months backwards or forwards, it would map onto itself in some way: autocorrelation.
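The idea of a series being correlated with a shifted copy of itself can be sketched with Series.autocorr. The data below are synthetic (a 12-month sine wave plus random noise), and the names s and manual are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle plus noise (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(120)
s = pd.Series(10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, size=120))

# Correlation of the series with a copy of itself shifted by `lag` months
for lag in (6, 12):
    print(lag, round(s.autocorr(lag=lag), 2))

# The same quantity by hand: correlate the series with its shifted copy
manual = s.corr(s.shift(12))
print(np.isclose(manual, s.autocorr(lag=12)))  # True
```

With a 12-month period, the lag-12 value comes out strongly positive, while the lag-6 value (half a cycle) comes out strongly negative.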
A Word about Correlation
from sklearn import datasets
iris = datasets.load_iris()
df_iris = pd.DataFrame(data = np.c_[iris['data'], iris['target']],
columns = iris['feature_names'] + ['target'])
df_iris.head(3)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
fit_reg=False, data=df_iris)
sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
fit_reg=True, data=df_iris);
sns.lmplot(x='sepal length (cm)', y='sepal width (cm)',
fit_reg=False, data=df_iris, hue='target')
df_iris.corr()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| sepal length (cm) | 1.000000 | -0.109369 | 0.871754 | 0.817954 | 0.782561 |
| sepal width (cm) | -0.109369 | 1.000000 | -0.420516 | -0.356544 | -0.419446 |
| petal length (cm) | 0.871754 | -0.420516 | 1.000000 | 0.962757 | 0.949043 |
| petal width (cm) | 0.817954 | -0.356544 | 0.962757 | 1.000000 | 0.956464 |
| target | 0.782561 | -0.419446 | 0.949043 | 0.956464 | 1.000000 |
df_iris.groupby(['target']).corr()
| target | | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
|---|---|---|---|---|---|
| 0.0 | petal length (cm) | 1.000000 | 0.306308 | 0.263874 | 0.176695 |
| | petal width (cm) | 0.306308 | 1.000000 | 0.279092 | 0.279973 |
| | sepal length (cm) | 0.263874 | 0.279092 | 1.000000 | 0.746780 |
| | sepal width (cm) | 0.176695 | 0.279973 | 0.746780 | 1.000000 |
| 1.0 | petal length (cm) | 1.000000 | 0.786668 | 0.754049 | 0.560522 |
| | petal width (cm) | 0.786668 | 1.000000 | 0.546461 | 0.663999 |
| | sepal length (cm) | 0.754049 | 0.546461 | 1.000000 | 0.525911 |
| | sepal width (cm) | 0.560522 | 0.663999 | 0.525911 | 1.000000 |
| 2.0 | petal length (cm) | 1.000000 | 0.322108 | 0.864225 | 0.401045 |
| | petal width (cm) | 0.322108 | 1.000000 | 0.281108 | 0.537728 |
| | sepal length (cm) | 0.864225 | 0.281108 | 1.000000 | 0.457228 |
| | sepal width (cm) | 0.401045 | 0.537728 | 0.457228 | 1.000000 |
Periodicity and Autocorrelation (continued)
The time series again.
df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)
df.corr()
| | diet | gym | finance |
|---|---|---|---|
| diet | 1.000000 | -0.100764 | -0.034639 |
| gym | -0.100764 | 1.000000 | -0.284279 |
| finance | -0.034639 | -0.284279 | 1.000000 |
diet and gym are weakly negatively correlated (-0.10). However, looking at the time series, it seems that their seasonal components are positively correlated and their trends negatively correlated. The correlation coefficient captures both effects at once.
# first-order differences
df.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)
diet and gym are strongly positively correlated (0.76) once we remove the trend.
df.diff().corr()
| | diet | gym | finance |
|---|---|---|---|
| diet | 1.000000 | 0.758707 | 0.373828 |
| gym | 0.758707 | 1.000000 | 0.301111 |
| finance | 0.373828 | 0.301111 | 1.000000 |
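The way a trend can mask a shared seasonal pattern can be reproduced on synthetic data. In the sketch below, the two columns a and b are made up: they have opposite linear trends and an identical 12-month cycle. It is an illustration of the effect, not a model of the Google Trends data.

```python
import numpy as np
import pandas as pd

# Two made-up monthly series: opposite linear trends, identical seasonal cycle
t = np.arange(120)
seasonal = 5 * np.sin(2 * np.pi * t / 12)
toy = pd.DataFrame({
    'a': 0.5 * t + seasonal,    # rising trend
    'b': -0.5 * t + seasonal,   # falling trend
})

# On the raw series the opposing trends dominate the correlation
raw_corr = toy.corr().loc['a', 'b']

# After first-order differencing the trends become constants,
# so only the shared seasonality is left
diff_corr = toy.diff().corr().loc['a', 'b']

print(round(raw_corr, 2), round(diff_corr, 2))
```

The raw correlation comes out strongly negative and the differenced correlation close to 1, the same kind of sign flip seen between df.corr() and df.diff().corr().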
Autocorrelation
Autocorrelation of the ‘diet’ series: on the x-axis, we have the lag and on the y-axis, we have how correlated the time series is with itself at that lag.
This means that if the original time series repeats itself every two days, we would expect to see a spike in the autocorrelation function at a lag of 2 days.
pd.plotting.autocorrelation_plot(diet)
The first large peak in the correlation occurs at a lag of 12 months. There is another peak at 24 months, where the series is also correlated with itself, and another at 36 months, but the correlation weakens as the lag grows.
The horizontal lines in the plot above indicate the statistical significance of the correlation: pandas draws 95% (solid) and 99% (dashed) confidence bands, and values outside the bands are significantly different from zero.
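The significance check can also be done numerically. The sketch below assumes the usual white-noise approximation for the 95% band, plus or minus 1.96 / sqrt(n), and runs on a synthetic series; the names band_95, acf and significant are illustrative. Series.autocorr uses a slightly different estimator from the one behind the plot, so the numbers will not match the plot exactly.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle plus noise (illustrative only)
rng = np.random.default_rng(1)
t = np.arange(168)
s = pd.Series(10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, size=168))

# Approximate 95% band for the autocorrelation of white noise: +/- 1.96 / sqrt(n)
band_95 = 1.96 / np.sqrt(len(s))

# Autocorrelation at lags 1..36, and the lags that fall outside the band
acf = pd.Series({lag: s.autocorr(lag=lag) for lag in range(1, 37)})
significant = acf[acf.abs() > band_95]

print(round(band_95, 3))
print(acf.loc[[12, 24, 36]].round(2))
print(12 in significant.index)  # True
```

Lags whose autocorrelation stays inside the band are consistent with no correlation at that lag; the seasonal lags fall well outside it.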
Forecasts, ARIMA…
Use ARIMA modeling to make some time series forecasts as to what these search trends will look like over the coming years.