Los Angeles Best Practices

Best practices for data science and analysis projects at the City of Los Angeles

Data Analysis: Intermediate

After polishing off the intro tutorial, you’re ready to devour some more techniques to simplify your life as a data analyst.

Getting Started

import numpy as np
import pandas as pd
import geopandas as gpd

Create a New Column Using a Dictionary to Map the Values

Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our df, we want a new column that shows the state.

df: person and birthplace

Person Birthplace
Leslie Knope Eagleton, Indiana
Tom Haverford South Carolina
Ann Perkins Michigan
Ben Wyatt Partridge, Minnesota

Write a Function

Quick refresher on functions

# Create a function called state_abbrev.
def state_abbrev(row):
    # The find function returns the index of where 'Indiana' is found in
    #the column. If it cannot find it, it returns -1.
    if row.Birthplace.find('Indiana') != -1:
        return 'IN'
    elif row.Birthplace.find('South Carolina') != -1:
        return 'SC'
    # For an exact match, we would write it this way.
    elif row.Birthplace == 'Michigan':
        return 'MI'
    elif row.Birthplace.find('Minnesota') != -1:
        return 'MI'

# Apply this function and create the State column.
df['State'] = df.apply(state_abbrev, axis = 1)

Use a Dictionary to Map the Values

But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation.

state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC',
                'Michigan': 'MI', 'Partridge, Minnesota': 'MN'}

df['State'] = df.Birthplace.map(state_abbrev1)

But, if wanted to avoid writing out all the possible combinations, we would first extract the state portion of the city-state text. Then we could map the state’s full name with its abbreviation.

# The split function splitd at the comma and expand the columns. 
# Everything is stored in a new df called 'fullname'.
fullname = df['Birthplace'].str.split(",", expand = True) 

# Add the City column into our df by extracting the first column (0) from fullname. 
df['City'] = fullname[0]

# Add the State column by extracting the second column (1) from fullname.
df['State_full'] = fullname[1]


# Tom Haverford's birthplace is South Carolina. We don't have city information.
# So, the City column would be incorrectly filled in with South Carolina, and
# the State would say None.
# Fix these so the Nones actually display the state information correctly.

df['State_full'] = df.apply(lambda row: row.City if row.State == None else 
                    row.State_full, axis = 1)

# Now, use a dictionary to map the values.
state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC',
                'Michigan': 'MI', 'Minnesota': 'MN'}

df['State'] = df.Birthplace.map(state_abbrev2)

All 3 methods would give us this df:

Person Birthplace State
Leslie Knope Eagleton, Indiana IN
Tom Haverford South Carolina SC
Ann Perkins Michigan MI
Ben Wyatt Partridge, Minnesota MN

Loop over Columns with a Dictionary

If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop.

columns = ['colA', 'colB', 'colC']

for c in columns:
    # Fill in missing values for all columns with zeros
    df[c] = df[c].fillna(0)
    # Multiply all columns by 0.5
    df[c] = df[c] * 0.5

Loop over Dataframes with a Dictionary

It’s easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses are located in each Council District.

This type of question is perfect for a loop. Each df is spatially joined council_district, followed by some aggregation.

business: list of Pawnee stores

Business longitude latitude Sales_millions geometry
Paunch Burger x1 y1 5 Point(x1, y1)
Sweetums x2 y2 30 Point(x2, y2)
Jurassic Fork x3 y3 2 Point(x3, y3)
Gryzzl x4 y4 40 Point(x4, y4)

tom: list of Tom Haverford businesses

Business longitude latitude Sales_millions geometry
Tom’s Bistro x1 y1 30 Point(x1, y1)
Entertainment 720 x2 y2 1 Point(x2, y2)
Rent-A-Swag x3 y3 4 Point(x3, y3)
# Save our existing dfs into a dictionary. The business df is named 
# 'pawnee"; the tom df is named 'tom'. 
dfs = {'pawnee': business, 'tom': tom}

# Create an empty dictionary called summary_dfs to hold the results
summary_dfs = {}

# Loop over key-value pairs 
## Keys: pawnee, tom (names given to dataframes)
## Values: business, tom (dataframes)

for key, value in dfs.items():
    # Use f string to define a variable join_df (result of our spatial join)
    ## join_{key} would be join_pawnee or join_tom in the loop
    join_df = "join_{key}"
    # Spatial join
    join_df = gpd.sjoin(value, council_district, how = 'inner', op = 'intersects')
    # Calculate summary stats with groupby, agg, then save it into summary_dfs, 
    # naming it 'pawnee' or 'tom'.
    summary_dfs[key] = join.groupby('ID').agg(
        {'Business': 'count', 'Sales_millions': 'sum'})

Now, our summary_dfs dictionary contains 2 items, which are the 2 dataframes with everything aggregated.

# To view the contents of this dictionary
for key, value in summary_dfs.items():
    display(key)
    display(value)

# To access the df
summary_dfs["pawnee"]
summary_dfs["tom"]

join_tom: result of spatial join between tom and council_district

Business longitude latitude Sales_millions geometry ID
Tom’s Bistro x1 y1 30 Point(x1, y1) 1
Entertainment 720 x2 y2 1 Point(x2, y2) 3
Rent-A-Swag x3 y3 4 Point(x3, y3) 3

summary_dfs["tom"]: result of the counting number of Tom’s businesses by CD

ID Business Sales_millions
1 1 30
3 2 5


« Previous     Next »