Saturday, May 14, 2016

AI-Driven Investing with Lending Club - Part 1: Data Wrangling


Summary

This article is part one in a three-part series covering my experiences with building a machine-learning engine for investing with Lending Club.  Lending Club is a peer-to-peer loan service.  I previously published an article that discusses how to use Lending Club's REST API for automated investing.  That code utilized user-defined criteria to select loans and then execute orders for those selected loans.  This series of articles takes that automation to the next level and uses machine learning techniques to add intelligence to the loan selection process.

All code for this series was developed in Python and utilizes the excellent data science and machine learning libraries available in that ecosystem (pandas, scikit-learn, etc.).  

Admittedly, I went into this machine learning exercise as a newbie (and can still be considered as such :).  I think this article series is probably most appropriate for folks with a similar background in AI.  

Part 1:  Data Wrangling

Goal

The nature of Lending Club investing is typical of most investment schemes in that risk and reward go hand in hand.  Loans that have been assessed as higher risk offer a higher reward.  

Risk in this domain is the possibility that a loan recipient does not pay off the loan, i.e., a default.  A default results in the investor(s) losing all of their remaining principal and the future interest payments for the defaulted loan.  Reward comes in the form of the interest rate assigned to a loan.  In short, loans assessed as riskier pay higher interest rates.  Lending Club uses its own techniques to assess borrower creditworthiness, expressed as loan grades/subgrades, and then assigns interest rates to prospective loans accordingly.

My goal, and I would suppose the goal of any investment scheme, was to develop a system for maximizing rewards.  I suspect there are multiple ways to achieve this goal, but the particular approach I used is below:  
  1. Use machine learning techniques to classify prospective loans into two classes:  those that are predicted to be paid in full and those that are predicted to default.
  2. Select for investment those loans with the highest interest rates among the loans predicted not to default (a rough sketch of this selection step follows the list).
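
To make step 2 concrete, here is a minimal sketch of that selection logic - not the actual code from Part 3, just an illustration.  It assumes a dataframe of prospective loans that already carries a hypothetical 'predicted_paid' column produced by the classifier, plus Lending Club's 'int_rate' column as a numeric value:
import pandas as pd

def select_loans(prospects, max_picks=10):
    # Keep only loans the model predicts will be paid in full (hypothetical
    # 'predicted_paid' column: 1 = paid, 0 = default), then take the ones
    # with the highest interest rates first.
    predicted_paid = prospects[prospects['predicted_paid'] == 1]
    return predicted_paid.sort_values('int_rate', ascending=False).head(max_picks)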

Data Analysis

Conveniently, Lending Club provides loan statistics for download from their site here.  Additionally, they provide a data dictionary for the statistics.  All data is provided in the form of .csv files, covering 2007 onward to the current quarter.  There is a large number of data features (115) provided for each loan. 
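
To get a quick feel for the files before doing anything serious, a few lines of pandas are enough to inspect one of them (the filename below is just an example - use whichever LoanStats csv you downloaded):
import pandas as pd

# Example only - substitute the name of a LoanStats csv you downloaded
sample = pd.read_csv('../data/LoanStats3a.csv', header=1, low_memory=False)
print(sample.shape)              # (number of loans, ~115 columns)
print(sample.columns.tolist())   # the feature names described in the data dictionary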

Pandas provides some really nice tools for manipulating data sets.  Below is a code snippet for reading the Lending Club set into a pandas dataframe:
import os
import pandas as pd
def load():
    data_dir = '../data'
    frame = pd.DataFrame()

    try:
        frame = pd.read_csv(os.path.join(data_dir,'combinedLoans.csv'),index_col=0,low_memory=False)
    except IOError:
        for filename in os.listdir(data_dir):
            if filename.startswith('LoanStats'):
                temp = pd.read_csv(os.path.join(data_dir,filename),header=1,low_memory=False)
                frame = frame.append(temp)
        frame.to_csv(os.path.join(data_dir,'combinedLoans.csv'))

    return frame
Line 8:  Try to load a previously-built file that combines all the Lending Club csv's.
Line 10:  If that combined file does not exist, loop through the directory containing the csv's.
Lines 11-13:  Read in each csv file and append its contents to the data frame.
Line 14:  Dump the resulting frame to a csv file.
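
Usage is then a one-liner; a quick check of the shape confirms everything loaded (row and column counts will vary with how many files you downloaded):
frame = load()
print(frame.shape)   # all loans from 2007 onward, with the ~115 columns per loan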

Now with the full data set in memory, you can run any number of queries on the data.  Since I'm most interested in understanding the indicators for a default, it makes sense to look at comparisons of paid vs defaulted loans.  Below is a simple function for pulling out various data points and displaying them in graphs.
def loan_status(frame):
    temp = copy.copy(frame)
    temp['loan_status'] = temp['loan_status'].map(status_cleaner)
    temp['term'] = temp['term'].map(term_cleaner)
    temp['home_ownership'] = temp['home_ownership'].map(home_cleaner)
    temp = temp.dropna(subset=['loan_status', 'term', 'home_ownership'])
    
    #Total Defaults vs Paid
    counts = pd.DataFrame(temp['loan_status'].value_counts())
    counts.rename(columns={'loan_status':'Count'}, inplace=True)
    counts['%'] = (100*counts['Count'] / counts['Count'].sum()).round(2)
    print counts
    print ''
    temp['loan_status'].value_counts().plot(kind='pie', title='Default vs Paid', colors=['g', 'r'])
Line 2:  Make a copy of the input data frame.
Line 3:  Apply a function to the entire loan_status column.  It is this column that contains the current status of the loan.  Many loans are still active.  I'm only interested in loans that have completed - either paid in full or defaulted.  The status_cleaner function, shown below, accomplishes this filtering.
Lines 4-5:  Apply similar cleaning functions to the term and home_ownership columns.
Line 6:  Drop all rows with nulls in any of these columns.
Line 9:  Aggregate counts of defaulted and paid loans into a new data frame.
Line 10:  Rename the column to 'Count'.
Line 11:  Add another column with each count's percentage of the total.
Line 14:  Display a pie chart of the aggregated data.
def status_cleaner(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 'Default'
        if 'fully paid' in str.lower(x):
            return 'Paid'
    return np.nan

def term_cleaner(x): 
    if isinstance(x, str):
        if '36' in x:
            return '36'
        elif '60' in x:
            return '60'
        else:
            return np.nan
    return np.nan

def home_cleaner(x):
    if x == 'RENT':
        return 'Rent'
    elif x == 'OWN':
        return 'Own'
    elif x == 'MORTGAGE':
        return 'Mortgage'
    elif x == 'OTHER':
        return 'Other'
    else:
        return np.nan 
Lines 1-29:  Three simple functions that are applied to their associated columns in the data frame.  For this exercise, their primary purpose is to normalize values and to put nulls in any field that has bad data or a still-active loan status.  Those nulls then get swept away with the dropna() mentioned previously.

Below is the output of this snippet:
          Count      %
Paid     246652  81.94
Default   54352  18.06
So, after filtering out all of the loans that are either still active (not yet paid or defaulted) and/or have data errors in the three sample fields I selected, we can see there are roughly 300K loans in this data set that came to term.  About 82% of the time loans were paid in full, leaving an overall default rate of roughly 18%.  Thus paid loans outnumber defaults roughly 4 to 1.  That's somewhat comforting.  If it were the other way around, I doubt many people would entrust their money to Lending Club.  However, this sort of class imbalance does cause issues with machine learning algorithms.  I will have a detailed discussion of the implications and fixes for class imbalance in Part 2.
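
To see why the imbalance matters, consider a do-nothing baseline:  a classifier that always predicts 'Paid' is already right about 82% of the time, so raw accuracy is a misleading yardstick here.  A minimal sketch using scikit-learn's DummyClassifier (an illustration only, not the Part 2 code):
from sklearn.dummy import DummyClassifier
import numpy as np

# Labels built from the counts above: 1 = Paid, 0 = Default
y = np.array([1] * 246652 + [0] * 54352)
X = np.zeros((len(y), 1))   # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))   # ~0.82 - same as the majority-class share

Back to the analysis - the next snippet, which continues the loan_status function above, aggregates the default rate by Lending Club's assigned grade: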
    #default rate vs LC Grade
    grp = temp.groupby(['grade', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='line', title='Default Rate vs LC Grade', color='k')
Line 2:  Using the pandas groupby function, aggregate loan counts by grade and status.
Line 3:  Create a new column in the data frame that is the sum of the Default and Paid counts.
Line 4:  Create another new column with the default percentage for each grade.
Line 7:  Display a line plot with Default % as the y-axis and Lending Club's assigned loan grade as the x-axis.

Below is the output.  As one would expect, as the Lending Club grade worsens, the default rate goes up - roughly 6 percentage points per grade.

loan_status  Default   Paid  Total  Default %
grade                                        
A               2980  46377  49357       6.04
B              11011  78604  89615      12.29
C              15185  63088  78273      19.40
D              12777  35786  48563      26.31
E               7766  15687  23453      33.11
F               3596   5663   9259      38.84
G               1037   1447   2484      41.75
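
As a quick check on that 'roughly 6 points per grade' observation, the grade-to-grade jump can be computed directly from the grp frame built above:
print(grp['Default %'].diff())          # jump in default rate from each grade to the next
print(grp['Default %'].diff().mean())   # averages out to just under 6 percentage points
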
Below is the code and output for two more examples of comparing default rate against a loan statistic.
    #default rate vs term
    grp = temp.groupby(['term', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='bar', title='Default Rate vs Loan Term', color=['r'])
    
    #default rate vs home ownership
    grp = temp.groupby(['home_ownership', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='barh', title='Default Rate vs Home Ownership', color=['r'])
    plt.show()
Lines 1-7:  Similar to the previous example, I create an aggregation on loan term vs default rate.
Lines 9-16:  Home ownership status vs default rate.

Output below:
loan_status  Default    Paid   Total  Default %
term                                           
36             34730  198922  233652      14.86
60             19622   47730   67352      29.13
loan_status     Default    Paid   Total  Default %
home_ownership                                    
Mortgage          23826  125005  148831      16.01
Other                38     141     179      21.23
Own                4922   21560   26482      18.59
Rent              25566   99946  125512      20.37
It looks like there may be some correlation between loan term and defaults.  Loans with a 60-month term are roughly twice as likely to default as those with a 36-month term.  On the other hand, it's unclear whether there is any relationship between home ownership and defaults. 

Wrangling

After analyzing the data, the next step is to determine which portions of it you want to use and to transform those portions into something that can be digested by a machine learning algorithm.  This exercise nets out to selecting and transforming the features (inputs) and labels (outputs).  

As far as features go, not all the columns in the Lending Club data set make sense as features.  An example is the 'desc' field.  That field is just free-form text with little structure.  It's unusable as an input.  

For the label field - that one is clear cut.  The 'loan_status' field is what I'll be using as the class label.

I'll be using the scikit-learn machine learning library.  That library expects all features and labels to be formatted as NumPy arrays of floats, so any fields that are strings must be transformed into floats.
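
As a minimal sketch of what that means in practice - assuming a hypothetical 'clean' dataframe in which every remaining column is already numeric and 'loan_status' holds the 0/1 label described below - the conversion is just:
import numpy as np

# 'clean' is a hypothetical fully-transformed frame; every column is numeric
labels = clean['loan_status'].values.astype(np.float64)                  # y: 1.0 = paid, 0.0 = default
features = clean.drop('loan_status', axis=1).values.astype(np.float64)   # X: everything else
print(features.shape, labels.shape)   # the two arrays scikit-learn expects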

Below is a portion of the code I implemented for reading in and transforming the 'loan_status' field in the Lending Club data set:
def loan_filter(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 0.
        if 'fully paid' in str.lower(x):
            return 1.
    return np.nan

try:           
    frame = pd.read_csv(os.path.join(self.data_dir,FILTERED_FILE),index_col=0,low_memory=False)
except IOError:
    frame = pd.DataFrame()
    for filename in os.listdir(self.data_dir):
        if filename.startswith('LoanStats'):
            temp = pd.read_csv(os.path.join(self.data_dir,filename),usecols=dat_cols,header=1,low_memory=False)              
            temp['loan_status'] = temp['loan_status'].map(filters.loan_filter)
Lines 1-7:  Similar to the mapping functions mentioned earlier, I'm using this function to filter out erroneous data and narrow the field to only those loans that are paid off or defaulted.  The difference in this map is that I'm replacing the string in the field with a float:  0 signifies a default, 1 means paid.
Line 16:  Apply the map function to the 'loan_status' column.

Although machine learning algorithms won't be covered until Part 2, below is an example of a data transformation that is necessary if you plan to use any regression-type algorithm - e.g., a logistic regression classifier.  Regression algorithms assume numeric values are comparable in magnitude.  Example:  a value of '2' is treated as twice as large as a value of '1'.  Hence, simply assigning an arbitrary set of numbers to a categorical feature won't work if a regression classifier is to be used (though it works fine for decision tree-based classifiers).  The solution is to map the categories to dummy variables.  Some examples are below.

The 'home_ownership' field is a categorical field with a value of OWN, RENT, MORTGAGE, or OTHER.  Pandas conveniently has a built-in function for creating dummy variables.
temp['home_ownership'] = temp['home_ownership'].map(home_cleaner)
temp.dropna(subset=['home_ownership'], inplace=True)
dummies = pd.get_dummies(temp['home_ownership'], drop_first=True)
del temp['home_ownership']
temp = pd.concat([temp,dummies], axis=1)
Line 1:  Normalize the values to 'Own', 'Rent', 'Mortgage', or 'Other', or NaN if the value is erroneous.
Line 2:  Drop all the rows that were set to null by the mapping function in Line 1.
Line 3:  Invoke the pandas get_dummies function to transform the home_ownership column into three new columns corresponding to its values.  Example:  an 'Own' column is created that has a value of 1 or 0 depending on whether the home_ownership value was 'Own' or something else.  The 'drop_first' parameter causes only three columns to be added instead of four.  This is a fix for something called the dummy variable trap.
Line 4:  Drop the 'home_ownership' column from the frame.
Line 5:  Add the new dummy columns to the frame.
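
To make the effect concrete, here is what get_dummies produces on a toy series with those four values.  Note that drop_first drops the alphabetically-first category ('Mortgage'), which becomes the implicit baseline:
import pandas as pd

toy = pd.Series(['Rent', 'Own', 'Mortgage', 'Other'])
print(pd.get_dummies(toy, drop_first=True))
#    Other  Own  Rent
# 0      0    0     1    <- 'Rent'
# 1      0    1     0    <- 'Own'
# 2      0    0     0    <- 'Mortgage' (the dropped baseline)
# 3      1    0     0    <- 'Other'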

Here's another, more sophisticated use of dummy variables.  In this case, I'm transforming the earliest credit line feature, which is a date, into a set of bins representing the decades from the 1960s through the 2010s.  Each decade name becomes a dummy variable indicating whether the earliest credit line was established in that decade.
temp['earliest_cr_line'] = temp['earliest_cr_line'].map(filters.digit_filter)
temp.dropna(subset=['earliest_cr_line'], inplace=True)
bins = [0., 1970., 1980., 1990., 2000., 2010., 2020.]
group_names = ['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
categories= pd.cut(temp['earliest_cr_line'], bins, labels=group_names)
dummies = pd.get_dummies(categories, drop_first=True)
del temp['earliest_cr_line']
temp = pd.concat([temp,dummies], axis=1)
Line 1:  Same normalization/error cleaning step.
Line 2:  Remove nulls.
Line 3:  Define the intervals for the bins.
Line 4:  Define the category names for the bins/intervals.
Line 5:  Using the built-in pandas cut function, transform the column into categories.
Line 6:  Transform the categories into dummy variables.
Line 7:  Drop the original credit line column.
Line 8:  Add the new dummy columns to the frame.
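
The digit_filter function lives in my filters module and isn't shown in this post.  Purely as an illustration (this is not the actual implementation), a version that pulls a four-digit year out of a date string such as 'Aug-2003' could look like this:
import re
import numpy as np

def digit_filter(x):
    # Hypothetical stand-in for filters.digit_filter: extract a four-digit year
    # from strings like 'Aug-2003'; anything unparseable becomes NaN and gets
    # dropped by the dropna() call above.
    if isinstance(x, str):
        match = re.search(r'(19|20)\d{2}', x)
        if match:
            return float(match.group())
    return np.nan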

This was just a portion of the data transformation code I had to write for this Lending Club data set.  I noticed there were quite a few missing/erroneous values in the set.  All of that has to be cleaned up before the data is fit to feed into an algorithm.
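
As a final aside, a quick way to see where most of that cleanup effort goes is to count the nulls per column of the combined frame built by load() earlier:
# Missing values per column, worst offenders first
missing = frame.isnull().sum().sort_values(ascending=False)
print(missing.head(15))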


Copyright ©1993-2024 Joey E Whelan, All rights reserved.