Geospatial Data
Foreword
Notes and code snippets. Python 3. From DataCamp.
Geospatial Data in Python¶
Install all the packages.
Some packages are pre-requisites to the others: to install GeoPandas, it requires Shapely, and to install Shapely, RTree, GDAL, and Fiona should be installed first.
- RTree: a ctypes Python wrapper of libspatialindex that provides a number of advanced spatial indexing features;
- GDAL: translator library for raster and vector geospatial data formats;
- Fiona: Fiona reads and writes spatial data files;
- Shapely: Geometric objects, predicates, and operations;
- GeoPandas: extends the datatypes used by pandas to allow spatial operations on geometric types;
- PySAL: a library of spatial analysis functions written in Python intended to support the development of high-level applications;
- Missingno: Missing data visualization module for Python.
The dataset¶
We take hurricane Florence‘s trajectory for plotting points on a map of the US States.
# Loading the packages
import geopandas
import numpy as np
import pandas as pd
from shapely.geometry import Point
import missingno as msn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Let’s look at the geospatial file from the US States (a GeoJSON file).
# Loading the file
country = geopandas.read_file("data/gz_2010_us_040_00_5m.json")
country.head(3)
GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry | |
---|---|---|---|---|---|---|
0 | 0400000US01 | 01 | Alabama | 50645.326 | (POLYGON ((-88.124658 30.28364, -88.0868119999... | |
1 | 0400000US02 | 02 | Alaska | 570640.950 | (POLYGON ((-166.10574 53.988606, -166.075283 5... | |
2 | 0400000US04 | 04 | Arizona | 113594.084 | POLYGON ((-112.538593 37.000674, -112.534545 3... |
It is GeoDataFrame
, which has all the regular characteristics of a Pandas DataFrame
.
# Printing
type(country)
1 |
|
The column (geometry
) containing the coordinates is a GeoSeries
.
# Printing
type(country.geometry)
1 |
|
Each value in the GeoSeries
is a Shapely Object: a point, a segment, a polygon (and a multipolygon).
Each object can represent something: a point for a building, a segment for a street, a polygon for acity, and multipolygon for a country with multiple borders inside.
For more information about each Geometric object, consult this article.
# Printing
type(country.geometry[0])
1 |
|
Similar to a Pandas DataFrame
, a GeoDataFrame
can be plotted.
# Plotting the multipolygon
country.plot();
As we may see, the US map is relatively small compared to the frame.
It’s because the information includes Alaska, Hawaii, and Puerto Rico.
For this tutorial purpose, we can exclude Alaska and Hawaii as the hurricane did not go anywhere near those two states.
We can also add the figure size and colour to customize the plot
# Plotting
# Excluding Alaska and Hawaii with a conditional selection
country[country['NAME'].isin(['Alaska','Hawaii']) == False].plot(figsize=(30,20),
color='#3B3C6E');
# Loading the hurricane data
florence = pd.read_csv('data/florence.csv')
florence.head(3)
AdvisoryNumber | Date | Lat | Long | Wind | Pres | Movement | Type | Name | Received | Forecaster | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 08/30/2018 11:00 | 12.9 | 18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | 08/30/2018 10:45 | Avila |
1 | 1A | 08/30/2018 14:00 | 12.9 | 19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | 08/30/2018 13:36 | Avila |
2 | 2 | 08/30/2018 17:00 | 12.9 | 19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six | 08/30/2018 16:36 | Avila |
Exploratory Data Analysis¶
The first thing to do is EDA:
- Check the information, data type;
- Find out about missing values;
- Explore the descriptive statistics.
# Printing
florence.info()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Checking missing values using the missingno package.
There is only one missing value in column Forecaster
which we don’t need.
So we can ignore it.
# The package alias is msn (missingno)
# Printing
msn.bar(florence,
color='darkolivegreen');
# Descriptive statistics (numerical columns, Series, fields or features only)
# Printing
florence.describe()
Lat | Long | Wind | Pres | |
---|---|---|---|---|
count | 105.000000 | 105.000000 | 105.000000 | 105.000000 |
mean | 25.931429 | 56.938095 | 74.428571 | 981.571429 |
std | 7.975917 | 20.878865 | 36.560765 | 22.780667 |
min | 12.900000 | 18.400000 | 25.000000 | 939.000000 |
25% | 18.900000 | 41.000000 | 40.000000 | 956.000000 |
50% | 25.100000 | 60.000000 | 70.000000 | 989.000000 |
75% | 33.600000 | 76.400000 | 105.000000 | 1002.000000 |
max | 42.600000 | 82.900000 | 140.000000 | 1008.000000 |
We only need the date
, the coordinates: lat
(latitude) and long
(longitude), Wind
speed, Pres
sure, and Name
.
Movement
and Type
are optional, but the rest can be dropped.
# Droping all unused features
florence = florence.drop(['AdvisoryNumber', 'Forecaster', 'Received'],
axis=1)
# Printing
florence.head(3)
Date | Lat | Long | Wind | Pres | Movement | Type | Name | |
---|---|---|---|---|---|---|---|---|
0 | 08/30/2018 11:00 | 12.9 | 18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six |
1 | 08/30/2018 14:00 | 12.9 | 19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six |
2 | 08/30/2018 17:00 | 12.9 | 19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six |
It is important to check out the longitude and latitude.
Here the longitude is west.
Add -
to the Long
values to correctly plot the data.
The US are west of the Greenwich meridian (0 degree).
Everything east of the 0 degree is positive, everything west is negative.
The same goes for the northern/southern hemispheres.
# Modifying
florence['Long'] = 0 - florence['Long']
# Printing
florence.head(3)
Date | Lat | Long | Wind | Pres | Movement | Type | Name | |
---|---|---|---|---|---|---|---|---|
0 | 08/30/2018 11:00 | 12.9 | -18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six |
1 | 08/30/2018 14:00 | 12.9 | -19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six |
2 | 08/30/2018 17:00 | 12.9 | -19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six |
Let’s combine the latitude and longitude to create coordinates, which will subsequently be turned into a GeoSeries
for visualization purpose.
# Combining
florence['coordinates'] = florence[['Long', 'Lat']].values.tolist()
# Printing
florence.head(3)
Date | Lat | Long | Wind | Pres | Movement | Type | Name | coordinates | |
---|---|---|---|---|---|---|---|---|---|
0 | 08/30/2018 11:00 | 12.9 | -18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | [-18.4, 12.9] |
1 | 08/30/2018 14:00 | 12.9 | -19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | [-19.0, 12.9] |
2 | 08/30/2018 17:00 | 12.9 | -19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six | [-19.4, 12.9] |
# Changing the coordinates into GeoPoint
florence['coordinates'] = florence['coordinates'].apply(Point)
# Printing
florence.head(3)
Date | Lat | Long | Wind | Pres | Movement | Type | Name | coordinates | |
---|---|---|---|---|---|---|---|---|---|
0 | 08/30/2018 11:00 | 12.9 | -18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-18.4 12.9) |
1 | 08/30/2018 14:00 | 12.9 | -19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-19 12.9) |
2 | 08/30/2018 17:00 | 12.9 | -19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-19.4 12.9) |
Checking the type of the florence
and column coordinates
: it is a DataFrame
and a Series
.
# Printing
type(florence)
1 |
|
# Printing
type(florence['coordinates'])
1 |
|
Convert the DataFrame
into a GeoDataFrame
and check the types once again: it is a GeoDataFrame
and GeoSeries
.
# Converting
florence = geopandas.GeoDataFrame(florence,
geometry='coordinates')
# Printing
florence.head(3)
Date | Lat | Long | Wind | Pres | Movement | Type | Name | coordinates | |
---|---|---|---|---|---|---|---|---|---|
0 | 08/30/2018 11:00 | 12.9 | -18.4 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-18.4 12.9) |
1 | 08/30/2018 14:00 | 12.9 | -19.0 | 30 | 1007 | W at 12 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-19 12.9) |
2 | 08/30/2018 17:00 | 12.9 | -19.4 | 30 | 1007 | W at 9 MPH (280 deg) | Potential Tropical Cyclone | Six | POINT (-19.4 12.9) |
# Printing
type(florence)
1 |
|
# Printing
type(florence['coordinates'])
1 |
|
Notice that even though it’s now a GeoDataFrame
and a GeoSeries
, we can still filter the rows.
# Filtering by the Name column
# (a hurricane is of category 6 and lower values are tropical storms)
florence[florence['Name']=='Six']
# Agregating by the Name column
florence.groupby('Name').Type.count()
1 2 3 4 5 6 |
|
Let’s find the average wind speed.
# Printing
print("The average wind speed of Hurricane Florence is {} mph, and it can go up to {} mph maximum".\
format(round(florence.Wind.mean(),4),
florence.Wind.max()))
1 |
|
So the average wind speed of hurricane Florence is 74.43 miles per hour (119.78 km per hour) and the maximum is 140 miles per hour (225.308 km per hour).
To imagine how scary this wind speed is, the Beaufort Wind Scale, developed by U.K Royal Navy, shows the appearance of wind effects on the water and on land.
With the speed of 48 to 55 miles per hours, it can already break and uproot trees, and cause “considerable structural damage”.
Visualization¶
A GeoDataFrame
also has a plot
method.
# Plotting
florence.plot(figsize=(20,10));
Because this GeoDataFrame
only have coordinates information (location) of some points in time, we can only plot the position on a blank map.
Let’s plot the hurricane position on a US map to see where Florence transited and how strong the wind was at that time.
We use a US States map (data we loaded at the beginning) as the background and we overlay Florence’s positions on top.
# Plotting the US map
fig, ax = plt.subplots(1, figsize=(30,20))
base = country[country['NAME'].isin(['Alaska','Hawaii']) == False].plot(ax=ax,
color='#3B3C6E')
# Plotting the hurricane position
florence.plot(ax=base,
color='darkred',
marker="*",
markersize=20);
# Improving
fig, ax = plt.subplots(1, figsize=(20,20))
base = country[country['NAME'].isin(['Alaska','Hawaii']) == False].plot(ax=ax, color='#3B3C6E')
florence.plot(ax=base,
column='Wind',
marker="<",
markersize=30,
cmap='Oranges',
label="Wind speed(mph)")
_ = ax.axis('off')
plt.legend(labels=["Wind Speed (mph)\n\ndarker = stronger"],
fontsize=20,
facecolor='orangered',
markerscale=2,
loc='best',
scatterpoints=5,
fancybox=True,
edgecolor='#3B3C6E')
ax.set_title("Hurricane Florence",
fontsize=30)
# Saving
plt.savefig('Hurricane_footage.png',bbox_inches='tight');
So the hurricane was strongest when it is offshore, near the east coast.
As it approached the land, the hurricane started losing its strength, but with wind speeds ranging 60 to 77 miles per hour can still make horrible damages.
For more, consult Python Geospatial Development Essentials.