Sunday, January 1, 2017

AI-Driven Investing with Lending Club - Part 4: Third Party Data Integration


This is an extension of the series I wrote on my experience implementing a machine learning engine for selecting loans for investment from Lending Club.  In this article, I'll discuss the modifications I made to support integration of data outside of the Lending Club data set using the ZIP code field on loans.  The results were less than satisfying.  I believe this was mostly due to the three-digit limitation Lending Club puts on their ZIP code fields.

Data Source Selection

The ZIP code field on Lending Club's loan data opens up the possibility of linking a loan's origination location to a wide array of external data sets.  Government data sets are generally available at no cost.  Proprietary ones, from vendors such as CoreLogic, carry a price tag.  Below are some of the sets I looked at:

  • Housing.  Zillow publishes numerous free data sets.  I focused on their foreclosure data, a CSV file covering 1998 to the present.
  • Income.  The US Census Bureau maintains US household income data.  While they provide a wide range of income-related data points, I focused on median household income.  This data is structured as one Excel file per year.
  • Unemployment.  The Bureau of Labor Statistics maintains unemployment data.  I used their County/State set.  It's structured similarly to the Census Bureau set: one Excel file per year.

Data Wrangling

With prospective data sets chosen, the real work begins.  Lending Club's loan data set only provides the first 3 digits of a loan's origination ZIP code.  Naturally, none of the external data sets I looked at organized their data on 3-digit ZIPs.

  • Zillow.  Below is a snippet of their data headers.  RegionName corresponds to a 5-digit ZIP:

    RegionID  RegionName  City      State
    61639     10025       New York  NY

  • Census Bureau.  Some of their data headers below.  Like most government data, they use FIPS codes instead of ZIP codes for locations.  A county FIPS code consists of five digits, concatenating the 2-digit state code and the 3-digit county code.

    State FIPS Code  County FIPS Code  Postal Code  Name
    01               000               AL           Alabama

  • Bureau of Labor Statistics.  Similarly to the Census Bureau, they use FIPS codes for the location identifier.  Some headers below:

    State Code  County Code  County Name/State Abbreviation
    01          001          Autauga County, AL
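
Since spreadsheet readers often coerce these codes to integers and drop the leading zeros, the 5-digit FIPS has to be rebuilt carefully from its state and county parts.  A minimal sketch (the helper name is mine, not from any of the sets above):

```python
def make_fips(state_code, county_code):
    """Concatenate a 2-digit state code and a 3-digit county code into a
    5-digit FIPS, restoring leading zeros lost to numeric coercion."""
    return str(state_code).zfill(2) + str(county_code).zfill(3)

# Autauga County, AL: state 01, county 001
print(make_fips(1, 1))    # -> 01001
print(make_fips(36, 61))  # -> 36061
```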

Converting FIPS to ZIP

To use the government data, I had to create a translation from FIPS to ZIP codes.  That requires yet another data set.  HUD publishes a crosswalk between FIPS codes and ZIPs.  Below is the header and a row from that set:

COUNTY  ZIP    RES_RATIO    BUS_RATIO    OTH_RATIO    TOT_RATIO
01001   36051  0.031420007  0.014141414  0.010940919  0.029940933

FIPS to ZIP Lookup Table

Below is what I did to transform the FIPS/ZIP crosswalk into a lookup table based on 3-digit ZIPs.  All code is in Python 3.
frame = pd.read_excel(filename, usecols=[0,1,2], converters={0 : lambda x: str(x), 1 : lambda x: str(x)})
frame.columns = ['fips','zip','ratio']
frame = frame.loc[frame.groupby(['fips']).apply(lambda x: x['ratio'].idxmax())]
frame.drop('ratio', axis=1, inplace=True)
frame.set_index(['fips'], inplace=True)
frame['zip'] = frame['zip'].map(lambda x: str(x).zfill(5)[0:3])
Line 1:  The HUD crosswalk is in Excel format.  Read columns 0-2 of that file.  Column 0: FIPS code.  Column 1: ZIP code.  Column 2: percentage of that FIPS area's residential addresses represented by that ZIP code.  Convert the FIPS and ZIP values to strings.
Line 2:  Rename those 3 columns.
Line 3:  Thus begin the compromises that must be made to accommodate Lending Club's policy of only showing 3 digits of ZIP code.  Here, I'm grouping all the 5-digit ZIPs within a FIPS and then selecting the one that represents the largest percentage of residential addresses in that FIPS area.
Line 4:  Drop the 'ratio' column, as it's unneeded now.
Line 5:  Set the FIPS column as the index.
Line 6:  Add leading zeros, if necessary, and chop off the last 2 digits to yield a 3-digit ZIP code.

Below are a few rows of the resulting frame.

fips  zip
1001  360
1003  365
1005  360
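
The idxmax selection above can be illustrated on a toy crosswalk (synthetic rows, not the real HUD file):

```python
import pandas as pd

# Toy crosswalk: one FIPS county mapping to several 5-digit ZIPs,
# each with the share of the county's residential addresses it covers.
xwalk = pd.DataFrame({
    'fips':  ['01001', '01001', '01003'],
    'zip':   ['36051', '36003', '36507'],
    'ratio': [0.41,     0.27,    0.88],
})

# Keep only the ZIP with the largest residential share per FIPS...
best = xwalk.loc[xwalk.groupby('fips')['ratio'].idxmax()]
best = best.drop(columns='ratio').set_index('fips')

# ...then truncate to Lending Club's 3-digit form.
best['zip'] = best['zip'].str.zfill(5).str[:3]

print(best['zip'].to_dict())  # {'01001': '360', '01003': '365'}
```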

Example 3rd Party Data Extraction:  BLS Unemployment Data Set

With the FIPS/ZIP crosswalk in hand, I'm now in a position to manipulate the government data sets.  Below is what I did with the BLS unemployment data to create a lookup table keyed on 3-digit ZIP code and year.
for filename in os.listdir(data_dir):
    if filename.startswith('laucnty'):
        temp = pd.read_excel(os.path.join(data_dir,filename), skiprows=6, usecols=[1,2,4,9], \
                            converters={1 : lambda x: str(x), 2 : lambda x: str(x)}, header=None)
        temp.columns = ['stFIPS', 'coFIPS', 'year', 'rate']
        temp['rate'] = temp['rate'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
        temp['FIPS'] = temp['stFIPS'] + temp['coFIPS']
        year = int(temp['year'].iloc[0])
        temp.drop(['stFIPS', 'coFIPS', 'year'], axis=1, inplace=True)
        temp.rename(columns={'rate' : str(year)}, inplace=True)
        temp = temp[['FIPS', str(year)]]
        temp.dropna(inplace=True)
        temp.set_index(['FIPS'], inplace=True)
        frame = pd.concat([frame, temp], join='inner', axis=1)

xwalk = getCrosswalk(CROSSWALK_DIR)
frame = pd.concat([frame, xwalk], join='inner', axis=1)
frame = frame.groupby(['zip']).mean().round(4).reset_index()
frame = frame.apply(pd.to_numeric)
frame.set_index(['zip'], inplace=True)
frame = frame.reindex(range(0,1000))
frame = frame.fillna(frame.mean())
frame.index = frame.index.map(lambda x: str(x).zfill(3))
Line 1:  Iterate through the BLS files.  There's one file per year.
Line 2:  The BLS files start with that prefix.
Line 3:  These are Excel files.  Skip the 6 lines of whitespace/header.  Only use the State FIPS, County FIPS, Year, and Unemployment Rate columns.  Convert the FIPS columns to strings to maintain leading zeros.
Line 5:  Name the columns accordingly.
Line 6:  Force the rate column to float values.
Line 7:  Create a new 'FIPS' column by concatenating the State and County FIPS codes.
Line 8:  Read the first row's value under the year column and convert it to an int.
Line 9:  Drop the unnecessary columns.
Lines 10-11:  Rename the rate column as a year.  Reset the frame to just 2 columns: FIPS and year.
Line 12:  Drop all NaN rows.
Line 13:  Set the index to the FIPS column.
Line 14:  Concat the resulting temp frame, representing 1 year's worth of unemployment data, to a master frame by joining on the FIPS code.
Line 16:  Create a FIPS/ZIP crosswalk frame as was discussed above.
Line 17:  Concatenate the 3-digit ZIPs of that crosswalk to the master frame by joining on the FIPS code.
Line 18:  Another compromise for 3-digit ZIPs.  Group the unemployment data on the resulting 3-digit ZIPs and set each group to its average.
Line 19:  Set all values to numeric - in particular, I want the ZIP to be turned into an int.
Line 20:  Set the ZIP as the frame index.
Line 21:  Fill in rows for the entire 3-digit range 0 - 999.  Rows (ZIPs) that were not present before are filled with NaNs.  Not every 3-digit number is a valid ZIP code; however, this step ensures all valid ZIPs that were missing from the BLS data set are now represented.
Line 22:  Down the year columns, fill the NaNs created in Line 21 with that column's (year's) average.  This is a compromise for missing data along a ZIP code.
Line 23:  Cast the 3-digit ZIPs back into strings and add leading zeros where necessary.

A few rows from the resulting frame below:

zip  2007    2008    2009     2010     2011    2012    2013    2014    2015
009  8.6625  9.35    12.2125  11.8875  11.45   9.9375  9.2625  9.225   8.325
010  4.8     5.6     7.85     8.6      7.7     7.15    7.45    6.4     5.6
011  4.8546  5.8157  9.089    9.4514   8.8045  8.0227  7.5131  6.3701  5.629
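
The reindex/fill steps near the end of that pipeline can be isolated in a small sketch (synthetic rates, not real BLS values):

```python
import pandas as pd

# Synthetic: unemployment rates for two 3-digit ZIPs over two years.
rates = pd.DataFrame({'2014': [6.0, 4.0], '2015': [5.0, 3.0]},
                     index=[9, 10])  # ZIPs are still ints at this point

# Fill in the full 0-999 range; ZIPs absent from the data become NaN...
rates = rates.reindex(range(0, 1000))

# ...replace each NaN with that year's column average...
rates = rates.fillna(rates.mean())

# ...and restore the zero-padded 3-digit string form for the index.
rates.index = rates.index.map(lambda z: str(z).zfill(3))

print(rates.loc['009', '2014'])  # 6.0 (observed)
print(rates.loc['500', '2015'])  # 4.0 (imputed: mean of 5.0 and 3.0)
```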

Integrate with the Lending Club Historical Data

In Part 1 of this series, I demonstrated how to pull the Lending Club data into a Pandas frame.  Below, I integrate the unemployment data into that frame using a nifty function in Pandas called lookup().
unFrame = getExternalData()
frame['unemp'] = unFrame.lookup(frame.zip_code, frame.issue_d)
Line 1:  Get the unemployment data frame created in the step above.
Line 2:  Create a new column in the Lending Club data frame by collating each loan's ZIP code and issue date (converted to a 4-digit year value) against the unemployment frame.
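
lookup() pairs each row label with a column label and pulls the matching cell.  It has since been deprecated and removed from newer pandas releases, so here is a sketch of the same row/column pairing with Index.get_indexer (synthetic frames; the values are illustrative, not from the real data):

```python
import pandas as pd

# Unemployment lookup table: 3-digit ZIP index, year columns.
unemp = pd.DataFrame({'2014': [6.4, 9.2], '2015': [5.6, 8.3]},
                     index=['010', '009'])

# Loans with a 3-digit ZIP and an issue year.
loans = pd.DataFrame({'zip_code': ['009', '010', '009'],
                      'year':     ['2015', '2014', '2014']})

# Translate each loan's ZIP/year into row/column positions, then index
# the underlying array - equivalent to the old lookup() call.
rows = unemp.index.get_indexer(loans['zip_code'])
cols = unemp.columns.get_indexer(loans['year'])
loans['unemp'] = unemp.to_numpy()[rows, cols]

print(loans['unemp'].tolist())  # [8.3, 6.4, 9.2]
```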

Analyzing Results

The question that comes to mind now: Is this external data meaningful, particularly in terms of predicting a loan's outcome?  In a univariate analysis, I have an independent variable (unemployment rate) that is continuous and a dependent variable (loan status) that is categorical - in fact, dichotomous (two values: 0 = default, 1 = paid).  This scenario doesn't lend itself to quantitative methods of correlation analysis such as Pearson, Spearman, regression, etc.  Analysis by graphing does provide some insight, however.
frame[['unemp', 'loan_status']].boxplot(by='loan_status')
plt.show()
Line 1:  Create a boxplot of the unemployment rate in the 3-digit ZIP of a loan vs. its eventual disposition: 0 = default, 1 = paid.
Line 2:  Display the graph.

Figure 1

As can be seen in the graph, the results are very uninspiring.  With medians that are roughly the same, there's really no differentiation of a loan's disposition based on unemployment rate.  The huge number of outliers is also troubling.  I saw similar results with foreclosure rates and median household incomes.
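
The comparison the boxplot draws can also be read off numerically by grouping on loan status; a small sketch with synthetic values (not the actual Lending Club results):

```python
import pandas as pd

# Synthetic loans: unemployment rate at origination vs. final status
# (0 = default, 1 = paid) - illustrative values only.
frame = pd.DataFrame({
    'unemp':       [5.2, 7.8, 6.1, 9.4, 5.9, 6.3, 8.1, 5.5],
    'loan_status': [1,   0,   1,   0,   1,   1,   0,   1],
})

# Quartiles per outcome - the same statistics the boxplot visualizes.
stats = frame.groupby('loan_status')['unemp'].describe()[['25%', '50%', '75%']]
print(stats)
```

If the two groups' medians and quartile boxes overlap heavily, as they did in my results, the feature offers little separation between outcomes.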

Nevertheless, I do believe a factor such as unemployment rate in an area has a correlation to Lending Club loan disposition.  My opinion is that this correlation is lost with those last 2 digits of ZIP code that Lending Club masks in their data sets.
