Los Angeles Best Practices

Best practices for data science and analysis projects at the City of Los Angeles

Working with Geospatial Data: Basics

Place matters. That’s why data analysis often includes a geospatial or geographic component. Before we wrangle with our data, let’s go over the basics and make sure we’re properly set up.

Below are short demos for getting started:

Getting Started

# Import Python packages
import pandas as pd
import geopandas as gpd
import boto3

Import and Export Data in Python

Local files

We import a tabular dataframe my_csv.csv and a geodataframe my_geojson.geojson or my_shapefile.shp.

df = pd.read_csv('../folder/my_csv.csv')

# GeoJSON
gdf = gpd.read_file('../folder/my_geojson.geojson')
gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson' )


# Shapefile (collection of files: .shx, .shp, .prj, .dbf, etc)
# The collection files must be put into a folder before importing
gdf = gpd.read_file('../folder/my_shapefile/')
gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' )

S3

Data can also be stored in an Amazon S3 as a bucket storage. To access data in S3, you’ll have to have AWS access credentials stored at ~/.aws/credentials per the documentation.

To read in our dataframe (df) and geodataframe (gdf) from S3:

df = pd.read_csv('s3://bucket-name/my_csv.csv')
gdf = gpd.read_file('s3://bucket-name/my_geojson.geojson')
gdf = gpd.read_file('zip+s3://bucket-name/my-shapefile.zip')


# To write a file to S3, first save the gdf locally
s3 = boto3.client('s3')

gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson')
s3.upload_file('../folder/my_geojson.geojson', 
    'bucket-name', 's3_filename.geojson') 

Additional general information about various file types can be found in the Data Management section.

Setting and Projecting Coordinate Reference System

A coordinate reference system (CRS) tells geopandas how to plot the coordinates on the Earth. Starting with a shapefile usually means that the CRS is already set. In that case, we are interested in re-projecting the gdf to a different CRS. The CRS is chosen specific to a region (i.e., USA, Southern Califonia, New York, etc) or for its map units (i.e., decimal degrees, US feet, meters, etc). Map units that are US feet or meters are easier to work when it comes to defining distances (100 ft buffer, etc).

In Python, there are 2 related concepts:

  1. Setting the CRS <–> corresponds to geographic coordinate system in ArcGIS
  2. Re-projecting the CRS <–> corresponds to datum transformation and projected coordinated system in ArcGIS

The ArcGIS equivalent of this is in 3 related concepts:

  1. geographic coordinate system
  2. datum transformation
  3. projected coordinate system

The geographic coordinate system is the coordinate system of the latitude and longitude points. Common ones are WGS84, NAD83, and NAD27.

Datum transformation is needed when the geographic coordinate systems of two layers do not match. A datum transformation is needed to convert NAD1983 into WGS84.

The projected coordinate system projects the coordinates onto the map. ArcGIS projects “on the fly”, and applies the first layer’s projection to all subsequent layers. The projection does not change the coordinates from WGS84, but displays the points from a 3D sphere onto a 2D map. The projection determines how the Earth’s sphere is unfolded and flattened.

In ArcGIS, layers must have the same geographic coordinate system and projected coordinate system before spatial analysis can occur. Since ArcGIS allows you to choose the map units (i.e., feet, miles, meters) for proximity analysis, projections are chosen primarily for the region to be mapped.

In Python, the geometry column holds information about the geographic coordinate system and its projection. All gdfs must be set to the same CRS before performing any spatial operations between them. Changing geometry from WGS84 to CA State Plane is a datum transformation (WGS84 to NAD83) and projection to CA State Plane Zone 5.

# Check to see what the CRS is
gdf.crs

# If there is a CRS set, you can change the projection
# Here, change to CA State Plane (units = US feet)
gdf = gdf.to_crs({'init':'epsg:2229'})

# If the CRS is not set, set it first, then re-project to a different EPSG. 
# Here, set to WGS84
gdf.crs = {'init' :'epsg:4326'}

Sometimes, the gdf does not have a CRS set, will need to be manually set. This might occur if you create the geometry column from latitude and longitude points. More on this in the intermediate tutorial:

There are lots of CRS available. The most common ones used in southern California are:

EPSG Name Map Units
4326 WGS84 decimal degrees
2229 CA State Plane Zone 5 US feet
3310 CA Albers meters


« Previous     Next »