Summary
This article is part one in a three-part series covering my experiences with building a machine-learning engine for investing with Lending Club.  Lending Club is a peer-to-peer loan service.  I previously published an article that discusses how to use Lending Club's REST API for automated investing.  That code utilized user-defined criteria to select loans and then execute orders for those selected loans.  This series of articles takes that automation to the next level and uses machine learning techniques to add intelligence to the loan selection process.
All code for this series was developed in Python and utilizes the excellent data science and machine learning libraries available in Python (pandas, scikit-learn, etc ).  
Admittedly, I went into this machine learning exercise as a newbie (and can still be considered as such :).  I think this article series is probably most appropriate for folks with a similar background in AI.  
Part 1:  Data Wrangling
Goal
The nature of Lending Club investing is typical of most investment schemes in that there is an inverse relationship between risk and reward.  Those loans that have been assessed to have a higher risk provide a higher reward.  
Risk in this domain is the possibility that a loan recipient does not pay off the loan, i.e., a default.  A default results in the investor(s) losing all of their remaining principle and the future interest payments for the defaulted loan.  Reward is in the form of interest rates assigned to a loan.  Net, assessed riskier loans pay higher interest rates.  Lending Club uses their own techniques to assess lendee creditworthiness in the form of loan grades/subgrades and then subsequently assign interest rates to prospective loans.
My goal, and I would suppose the goal of any investment scheme, was to develop a system for maximizing rewards.  I suspect there are multiple ways to achieve this goal, but the particular approach I used below:  
- Use machine learning techniques to classify prospective loans into two classes: those that are predicted to be paid in full and those that predicted for default.
- Select for investment those loans with the highest interest rates that are predicted not to result in default.
Data Analysis
Conveniently, Lending Club provides loan statistics for download from their site here.  Additionally, they provide a data dictionary for the statistics.  All data is provided in the form of .csv files.  Data is provided from 2007 onward to the current quarter.  There is a large volume (115 items) of data features provided for each loan. 
Pandas provides some really nice tools for manipulating data sets.  Below is a code snippet for reading the Lending Club set into a pandas dataframe:
import pandas as pd
def load():
    data_dir = '../data'
    frame = pd.DataFrame()
   
    try:
        frame = pd.read_csv(os.path.join(data_dir,'combinedLoans.csv'),index_col=0,low_memory=False)
    except IOError:
        for filename in os.listdir(data_dir):
            if filename.startswith('LoanStats'):
                temp = pd.read_csv(os.path.join(data_dir,filename),header=1,low_memory=False)
                frame = frame.append(temp)              
        frame.to_csv(os.path.join(data_dir,'combinedLoans.csv'))
      
    return frame
Line 7:  Try to load a previously-built file that combined all the Lending Club csv's.
Line 9: If that combined file does not exist, loop through the directory containing the csv's
Lines 10-12: Read in each csv file and append its contents to a data frame.
Line 13: Dump the resulting frame to a csv file.
Line 9: If that combined file does not exist, loop through the directory containing the csv's
Lines 10-12: Read in each csv file and append its contents to a data frame.
Line 13: Dump the resulting frame to a csv file.
Now with the full data set in memory, you can run any number of queries on the data. Since I'm most interested in understanding the indicators for a default, it makes sense to look at comparisons of paid vs defaulted loans. Below is a simple function for pulling out various data points and displaying them in graphs.
def loan_status(frame):
    temp = copy.copy(frame)
    temp['loan_status'] = temp['loan_status'].map(status_cleaner)
    temp['term'] = temp['term'].map(term_cleaner)
    temp['home_ownership'] = temp['home_ownership'].map(home_cleaner)
    temp = temp.dropna(subset=['loan_status', 'term', 'home_ownership'])
    
    #Total Defaults vs Paid
    counts = pd.DataFrame(temp['loan_status'].value_counts())
    counts.rename(columns={'loan_status':'Count'}, inplace=True)
    counts['%'] = (100*counts['Count'] / counts['Count'].sum()).round(2)
    print counts
    print ''
    temp['loan_status'].value_counts().plot(kind='pie', title='Default vs Paid', colors=['g', 'r'])
Line 2:  Make a copy of the input data frame.Line 3: Apply a function to the entire loan_status column. It is this column that contains the current status of the loan. Many loans are still active. I'm only interested in loans that complete - either paid in full or defaulted. The status_cleaner function, to be shown below, will accomplish this filtering.
Lines 4-5: Additional columns that are filtered.
Line 6: Drop all items in these columns that are nulls.
Line 9: Aggregate counts on defaulted and paid loans.
Line 10: Add a column with these counts
Line 11: Add another column that represents the percentage of the counts.
Line 14: Display a pie chart of the aggregated data.
def status_cleaner(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 'Default'
        if 'fully paid' in str.lower(x):
            return 'Paid'
    return np.nan
def term_cleaner(x): 
    if isinstance(x, str):
        if '36' in x:
            return '36'
        elif '60' in x:
            return '60'
        else:
            return np.nan
    return np.nan
def home_cleaner(x):
    if x == 'RENT':
        return 'Rent'
    elif x == 'OWN':
        return 'Own'
    elif x == 'MORTGAGE':
        return 'Mortgage'
    elif x == 'OTHER':
        return 'Other'
    else:
        return np.nan 
Lines 1-29:  Three simple functions that are applied to their associated columns in the data frame.  For this exercise, their primary function is to put nulls in any field that has bad data in it.  Those nulls then get swept away with the dropna() mentioned previously.Below is the output of this snippet:
Count % Paid 246652 81.94 Default 54352 18.06So, after filtering out all of the loans that are either still active (not paid or defaulted) and/or have data errors in the 3 sample fields I selected- we can see there are roughly 300K loans in this data set that came to term. ~80% of the time loans were paid in full. That leaves a ~20% default rate, overall. Thus paid loans out-number defaults roughly 4 to 1. That's somewhat comforting. If it were the other way around, I suspect it's doubtful that many would entrust their money with Lending Club. However, this sort of imbalance in classes does cause issues with machine learning algorithms. I will have a detailed discussion of the implications and fixes for class imbalances in Part 2.
    #default rate vs LC Grade
    grp = temp.groupby(['grade', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='line', title='Default Rate vs LC Grade', color='k')
Line 2:  Using the pandas groupby function, aggregate loans grade and status.Line 3: Create a new column in the data frame that is the sum of Defaults and Paid counts.
Line 4: Create another new column that represents the percentage of defaults vs paid loans.
Line 7: Display a line plot with Default % as the y-axis and Lending Club's assigned loan grade as the x-axis.
Below is the output. As one would expect, as the Lending Club grading criterion goes down, the default rate goes up - roughly at 6% per grade.
loan_status Default Paid Total Default % grade A 2980 46377 49357 6.04 B 11011 78604 89615 12.29 C 15185 63088 78273 19.40 D 12777 35786 48563 26.31 E 7766 15687 23453 33.11 F 3596 5663 9259 38.84 G 1037 1447 2484 41.75Below is the code and output for two more examples of comparing default rate against a loan statistic.
    #default rate vs term
    grp = temp.groupby(['term', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='bar', title='Default Rate vs Loan Term', color=['r'])
    
    #default rate vs home ownership
    grp = temp.groupby(['home_ownership', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='barh', title='Default Rate vs Home Ownership', color=['r'])
    plt.show()
Lines 1-7:  Similar to the previous example, I create an aggregation on loan term vs default rate.Lines 9-16: Home ownership status vs default rate.
Output below:
loan_status Default Paid Total Default % term 36 34730 198922 233652 14.86 60 19622 47730 67352 29.13
loan_status Default Paid Total Default % home_ownership Mortgage 23826 125005 148831 16.01 Other 38 141 179 21.23 Own 4922 21560 26482 18.59 Rent 25566 99946 125512 20.37
It looks like there may be some correlation between loan term and defaults.  Loans with a 60 month term are twice as likely to default.  On the other hand, it's unclear that there is any relationship between home ownership and defaults. 
Wrangling
After analyzing the data, the next step is to determine what portions of it you want to use and transforming those portions into something can be digested by a machine learning algorithm.  This exercise nets out to selecting + transforming the features (inputs) and labels (outputs).  
As far as features, not all the columns in the Lending Club data set make sense as a feature.  An example is the 'desc' field.  That field is just free form text with little structure.  It's unusable as an input.  
For the label field - that one is clear cut.  The 'loan_status' field is what I'll be using as a classifier.
I'll be  using the scikit-learn machine learning library.  That library expects all features and labels to be formatted as NumPy arrays of floats.  So, any fields that are strings must be transformed into floats.
Below is a portion of the code I implemented for reading in and transforming the 'loan_status' field in the Lending Club data set:
def loan_filter(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 0.
        if 'fully paid' in str.lower(x):
            return 1.
    return np.nan
try:           
    frame = pd.read_csv(os.path.join(self.data_dir,FILTERED_FILE),index_col=0,low_memory=False)
except IOError:
    frame = pd.DataFrame()
    for filename in os.listdir(self.data_dir):
        if filename.startswith('LoanStats'):
            temp = pd.read_csv(os.path.join(self.data_dir,filename),usecols=dat_cols,header=1,low_memory=False)              
            temp['loan_status'] = temp['loan_status'].map(filters.loan_filter)
Lines 1-7:  Similar to the mapping functions mentioned earlier, I'm using this function to filter out all erroneous data and narrow the field to only those loans that are paid off or defaulted.  The difference in this map is that I'm replacing the string in the field with a float.  0 signifies a default, 1 means paid.
Line 16: Apply the map function to the 'loan_status' column.
Line 16: Apply the map function to the 'loan_status' column.
Although machine learning algorithms won't be covered till Part 2, below is an example of data transform that is necessary if you plan to use any regression-type algorithm - e.g, logistic regression classifier. Regression algorithms assume values are comparable. Example: a value of '2' is twice as large as a value of '1'. Hence, simply assigning a set of numbers to a categorical feature won't work if a regression classifier is to be used (though, works fine for any decision tree-based classifiers). The solution is to map the categories to dummy variables. Some examples below.
The 'home_ownership' is a category-type field that has a value of OWN, RENT, MORTGAGE, or OTHER. Pandas conveniently has a built-in function for creating dummy variables.
temp['home_ownership'] = temp['home_ownership'].map(home_cleaner) temp.dropna(subset=['home_ownership'], inplace=True) dummies = pd.get_dummies(temp['home_ownership'], drop_first=True) del temp['home_ownership'] temp = pd.concat([temp,dummies], axis=1)
Line 1:  Normalize the values to 'own', 'rent', 'mortgage, 'other', or nan if it is an erroneous value.
Line 2: Drop all the values that were set to nulls with mapping function in Line 1.
Line 3: Invoke the pandas dummy function to transform the home_ownership column into 3 new columns corresponding to its values. Example: an 'own' column is created that has values of 1 or 0 depending on whether the home_ownership value was 'own' or something else. The 'drop_first' parameter causes only 3 columns to be added vs four. This is a fix for something called the dummy variable trap.
Line 4: Drop the 'home_ownership' column from the frame.
Line 5: Add the new dummy columns to the frame.
Here's another more sophisticated example of use of dummy variables. In this case, I'm transforming the earliest credit line feature, which is a date, into a set of bins representing a decade from 1960 thru 2010. The decade name becomes a dummy variable indicating if the earliest credit was established in that decade.
Line 2: Drop all the values that were set to nulls with mapping function in Line 1.
Line 3: Invoke the pandas dummy function to transform the home_ownership column into 3 new columns corresponding to its values. Example: an 'own' column is created that has values of 1 or 0 depending on whether the home_ownership value was 'own' or something else. The 'drop_first' parameter causes only 3 columns to be added vs four. This is a fix for something called the dummy variable trap.
Line 4: Drop the 'home_ownership' column from the frame.
Line 5: Add the new dummy columns to the frame.
Here's another more sophisticated example of use of dummy variables. In this case, I'm transforming the earliest credit line feature, which is a date, into a set of bins representing a decade from 1960 thru 2010. The decade name becomes a dummy variable indicating if the earliest credit was established in that decade.
temp['earliest_cr_line'] = temp['earliest_cr_line'].map(filters.digit_filter) temp.dropna(subset=['earliest_cr_line'], inplace=True) bins = [0., 1970., 1980., 1990., 2000., 2010., 2020.] group_names = ['1960s', '1970s', '1980s', '1990s', '2000s', '2010s'] categories= pd.cut(temp['earliest_cr_line'], bins, labels=group_names) dummies = pd.get_dummies(categories, drop_first=True) del temp['earliest_cr_line'] temp = pd.concat([temp,dummies], axis=1)Line 1: Same normalization/error cleaning step.
Line 2: Remove nulls.
Line 3: Define the intervals for the bins.
Line 4: Define the category names for the bins/intervals.
Line 5: Using the built-in pandas cut function, transform the column into categories.
Line 6: Transform the categories to dummy variables
Line 7: Drop the original credit line column
Line 8: Add the new dummy columns to the frame.
This was just a portion of the data transform code I had to write for this Lending Club data set. I noticed there were quite a few missing/erroneous values in the set. All of that has to be cleaned up before the data is fit to be put into an algorithm.
Copyright ©1993-2024 Joey E Whelan, All rights reserved.



