# Working with Geospatial Data: Intro

Place matters. That’s why data analysis often includes a geospatial or geographic component. Data analysts are called upon to merge tabular and geospatial data, count the number of points within given boundaries, and create a map illustrating the results.

Below are short demos of common techniques to help get you started with exploring your geospatial data.

- Merge tabular and geospatial data
- Attach geographic characteristics to all points or lines that fall within a boundary (spatial join and dissolve)
- Aggregate and calculate summary statistics
- Buffers

## Getting Started

```
# Import Python packages
import pandas as pd
import geopandas as gpd
```

## Merge Tabular and Geospatial Data

We might have two files: Council District boundaries (geospatial) and population values (tabular). Through visual inspection, we know that `CD`

and `District`

are columns that help us make this match.

`df`

: population by council district

CD | Council_Member | Population |
---|---|---|

1 | Leslie Knope | 1,500 |

2 | Jeremy Jamm | 2,000 |

3 | Douglass Howser | 2,250 |

`gdf`

: council district boundaries

District | Geometry |
---|---|

1 | polygon |

2 | polygon |

3 | polygon |

We could merge these two dfs using the District and CD columns. If our left df is the gdf, then our merged df will also be a gdf.

```
merge = pd.merge(gdf, df, left_on = 'District', right_on = 'CD')
merge
```

District | Geometry | CD | Council_Member | Population |
---|---|---|---|---|

1 | polygon | 1 | Leslie Knope | 1,500 |

2 | polygon | 2 | Jeremy Jamm | 2,000 |

3 | polygon | 3 | Douglass Howser | 2,250 |

## Attach Geographic Characteristics to All Points or Lines That Fall Within a Boundary

Sometimes with a point shapefile (list of lat/lon points), we want to count how many points fall within the boundary. Unlike the previous example, these points aren’t attached with Council District information, so we need to generate that ourselves.

The ArcGIS equivalent of this is a **spatial join** between the point and polygon shapefiles, then **dissolving** to calculate summary statistics.

```
locations = gpd.read_file('../folder/paunch_burger_locations.geojson')
gdf = gpd.read_file('../folder/council_boundaries.geojson')
# Make sure both our gdfs are projected to the same coordinate reference system
# (EPSG:4326 = WGS84)
locations = locations.to_crs({'init':'epsg:4326'})
gdf = gdf.to_crs({'init':'epsg:4326'})
```

`locations`

lists the Paunch Burgers locations and their annual sales.

Store | City | Sales_millions | Geometry |
---|---|---|---|

1 | Pawnee | $5 | (x1,y1) |

2 | Pawnee | $2.5 | (x2, y2) |

3 | Pawnee | $2.5 | (x3, y3) |

4 | Eagleton | $2 | (x4, y4) |

5 | Pawnee | $4 | (x5, y5) |

6 | Pawnee | $6 | (x6, y6) |

7 | Indianapolis | $7 | (x7, y7) |

`gdf`

is the Council District boundaries.

District | Geometry |
---|---|

1 | polygon |

2 | polygon |

3 | polygon |

A spatial join finds the Council District the location falls within and attaches that information.

```
join = gpd.sjoin(locations, gdf, how = 'inner', op = 'intersects')
# how = 'inner' means that we only want to keep observations that matched,
# i.e locations that were within the council district boundaries.
# op = 'intersects' means that we are joining based on whether or not the location
# intersects with the council district.
```

The `join`

gdf looks like this. We lost Stores 4 (Eagleton) and 7 (Indianapolis) because they were outside of Pawnee City Council boundaries.

Store | City | Sales_millions | Geometry_x | District | Geometry_y |
---|---|---|---|---|---|

1 | Pawnee | $5 | (x1,y1) | 1 | polygon |

2 | Pawnee | $2.5 | (x2, y2) | 2 | polygon |

3 | Pawnee | $2.5 | (x3, y3) | 3 | polygon |

5 | Pawnee | $4 | (x5, y5) | 1 | polygon |

6 | Pawnee | $6 | (x6, y6) | 2 | polygon |

## Aggregate and Calculate Summary Statistics

We want to count the number of Paunch Burger locations and their total sales within each District.

```
summary = join.pivot_table(index = ['District', 'Geometry_y],
values = ['Store', 'Sales_millions'],
aggfunc = {'Store': 'count', 'Sales_millions': 'sum'}).reset_index()
OR
summary = join.groupby(['District', 'Geometry_y']).agg({'Store': 'count',
'Sales_millions': 'sum'}).reset_index()
summary.rename(column = {'Geometry_y': 'Geometry'}, inplace = True)
summary
```

District | Store | Sales_millions | Geometry |
---|---|---|---|

1 | 2 | $9 | polygon |

2 | 2 | $8.5 | polygon |

3 | 1 | $2.5 | polygon |

By keeping the `Geometry`

column, we’re able to export this as a GeoJSON or shapefile to map in ArcGIS or QGIS.

```
summary.to_file(driver = 'GeoJSON',
filename = '../folder/pawnee_sales_by_district.geojson')
summary.to_file(driver = 'ESRI Shapefile',
filename = '../folder/pawnee_sales_by_district.shp')
```

## Buffers

Buffers are areas of a certain distance around a given point, line, or polygon. Buffers are used to determine * proximity *. A 5 mile buffer around a point would be a circle of 5 mile radius centered at the point. This ESRI page shows how buffers for points, lines, and polygons look.

Some examples of questions that buffers help answer are:

- How many stores are within 1 mile of my house?
- Which streets are within 5 miles of the mall?
- Which census tracts or neighborhoods are within a half mile of the rail station?

Small buffers can also be used to determine whether 2 points are located in the same place. A shopping mall or the park might sit on a large property. If points are geocoded to various areas of the mall/park, they would show up as 2 distinct locations, when in reality, we consider them the same location.

We start with two point shapefiles: `locations`

(Paunch Burger locations) and `homes`

(home addresses for my 2 friends). The goal is to find out how many Paunch Burgers are located within a 2 miles of my friends.

`locations`

: Paunch Burger locations

Store | City | Sales_millions | Geometry |
---|---|---|---|

1 | Pawnee | $5 | (x1,y1) |

2 | Pawnee | $2.5 | (x2, y2) |

3 | Pawnee | $2.5 | (x3, y3) |

4 | Eagleton | $2 | (x4, y4) |

5 | Pawnee | $4 | (x5, y5) |

6 | Pawnee | $6 | (x6, y6) |

7 | Indianapolis | $7 | (x7, y7) |

`homes`

: friends’ addresses

Name | Geometry |
---|---|

Leslie Knope | (x8, y8) |

Ann Perkins | (x9, y9) |

First, prepare our point gdf and change it to the right projection. Pawnee is in Indiana, so we’ll use EPSG:2965. In southern California, use EPSG:2229.

```
# Use NAD83/Indiana East projection (units are in feet)
homes = homes.to_crs({'init':'epsg:2229'})
locations = locations.to_crs({'init':'epsg:2229'})
```

Next, draw a 2 mile buffer around `homes`

.

```
# Make a copy of the homes gdf
homes_buffer = homes.copy()
# Overwrite the existing geometry and change it from point to polygon
miles_to_feet = 5280
two_miles = 2 * miles_to_feet
homes_buffer['geometry'] = homes.geometry.buffer(two_miles)
```

**Select Points Within a Buffer**

Do a spatial join between `locations`

and `homes_buffer`

. Repeat the process of spatial join and aggregation in Python as illustrated in the previous section (spatial join and dissolve in ArcGIS).

```
sjoin = gpd.sjoin(locations, homes_buffer, how = 'inner', op = 'intersects')
sjoin
```

`sjoin`

looks like this.

- Geometry_x is the point geometry from our left df
`locations`

. - Geometry_y is the polygon geometry from our right df
`homes_buffer`

.

Store | Geometry_x | Name | Geometry_y |
---|---|---|---|

1 | (x1,y1) | Leslie Knope | polygon |

3 | (x3, y3) | Ann Perkins | polygon |

5 | (x5, y5) | Leslie Knope | polygon |

6 | (x6, y6) | Leslie Knope | polygon |

Count the number of Paunch Burger locations for each friend.

```
count = sjoin.pivot_table(index = 'Name',
values = 'Store', aggfunc = 'count').reset_index()
OR
count = sjoin.groupby('Name').agg({'Store':'count'}).reset_index()
```

The final `count`

:

Name | Store |
---|---|

Leslie Knope | 3 |

Ann Perkins | 1 |