Monday, May 23, 2016

Google Prediction API with Python


Summary

This post concerns usage of Google's cloud machine learning engine - Prediction API.  I'm going to demonstrate the basic functionality of their engine with Python.  I'll use the Lending Club data set that I described here for training and live/active loans for predictions.

Data Wrangling

The Prediction API is looking for feature training data in a particular format.  Specifically, it requires a csv file with the first value/column 1 being the label/output.  Labels must all be either double-quoted strings or numerical values.  If the labels are strings - then the prediction engine creates a classification model.  If the labels are numerical, then a regression model is created. Subsequent columns (features) in the csv can be either quoted strings or numerical.  There can be no header row and no index column.

I was able to re-use the wrangling code I discussed here with only a few minor modifications.  Those modifications were: move the label column to Column 1, map all the labels to the following strings:  "DEFAULT" and "PAID".  I left all the feature columns as floats.
def loan_toString_filter(x):
    if x == 0:
        return '"DEFAULT"'
    else:
        return '"PAID"'

samp['loan_status'] = samp['loan_status'].map(filters.loan_toString_filter)
samp.to_csv('../data/samp.csv', header=False, index=False, quoting=csv.QUOTE_NONE)
Lines 1-5:  Simple function for mapping '0's (defaults) and '1's (paid loans) to strings.  Note the double quotes inside of single.
Line 7:  Application of the map.
Line 8:  That 'quoting' parameter combined with embedded quotes in the input seems to work as far as yielding a quoted string in the resulting csv file.  I tried other methods (eq, csv.QUOTE_NONNUMERIC) that simply didn't work.  I think this may be a bug in Pandas.

Project Creation, API Key, Data Upload

Create a project in the Console and note the ID of that project (you'll use it later).  Create and download an OAuth key file using the API Manager interface in Console.  Additionally, you need to upload the data set you create in the wrangling step to Google Cloud.  Create a bucket and upload your .csv file into that via the Storage interface in Console.


Google API Authentication

Google requires OAuth 2.0 authentication for access to their API's, in general.  Fortunately, they provide a simple Python interface to package up that handshake.
from oauth2client.service_account import ServiceAccountCredentials
from httplib2 import Http
from apiclient.discovery import build

scopes = ['https://www.googleapis.com/auth/prediction', 'https://www.googleapis.com/auth/devstorage.full_control', \
          'https://www.googleapis.com/auth/cloud-platform']
credentials = ServiceAccountCredentials.from_json_keyfile_name('path/to/your/keyfile.json', scopes=scopes)
http_auth = credentials.authorize(Http())
service = build('prediction', 'v1.6', http=http_auth)
papi = service.trainedmodels()    
Lines 5-6:  Specify the scopes you want to authorize.  Different API calls require different minimum scopes.
Lines 7-8:  Create a credentials object using the key file you created in API Manager.
Lines 9-10:  Create a service object with the credentials.

Prediction Model Creation

At this point, you've completed all the housekeeping necessary to start making calls to the Prediction API.  Below is an API call to create a new prediction model.
def create_model(project, mod, storage):
    response = papi.insert(project=project, body={'storageDataLocation': storage, \
                                                         'id':mod}).execute()
    print json.dumps(response, sort_keys=True, indent=4)

create_model(project='loaner-1316', mod='1000sample', storage='loanerbucket/1000samp.csv')
Lines 1-4:  Simple function wrapper to the Prediction API call for instantiating a model and printing its results.
Line 6: Executing the function.

Pretty simple - because the underlying machine learning algorithm is essentially a black box.  There's no way to tweak it other than deciding on whether you want a classification or regression model.
Executing the 'insert' command starts the fitting of the model, which takes quite a while depending on size of your data set.  You can poll the status of the fitting using a 'get' command.  While the fitting is in progress, you'll get a 'trainingStatus' of 'RUNNING'.  Upon completion, the status will change to 'DONE'.
def get_model(project, mod):
    response = papi.get(project=project, id=mod).execute()
    print json.dumps(response, sort_keys=True, indent=4)

get_model(project='loaner-1316', mod='1000sample')
{
    "created": "2016-05-19T23:08:16.886Z", 
    "id": "1000sample", 
    "kind": "prediction#training", 
    "modelInfo": {
        "classificationAccuracy": "0.83", 
        "modelType": "classification", 
        "numberInstances": "1000", 
        "numberLabels": "2"
    }, 
    "selfLink": "https://www.googleapis.com/prediction/v1.6/projects/loaner-1316/trainedmodels/1000sample", 
    "trainingComplete": "2016-05-19T23:24:33.399Z", 
    "trainingStatus": "DONE"
}
Above is what the results look like for a model that has completed fitting/training.  Note the 83% accuracy score which sounds somewhat encouraging; however, it's not real.  As discussed here - the Lending Club data set has a fairly significant class imbalance.  The 'analyze' API call shows a clearer picture of what is going on.
def analyze_model(project, mod):
    response = papi.analyze(project=project, id=mod).execute()
    print json.dumps(response, sort_keys=True, indent=4)
"outputFeature": {
            "text": [
                {
                    "count": "171", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "829", 
                    "value": "PAID"
                }
            ]
        }

"id": "1000sample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "0.00", 
                "PAID": "16.83"
            }, 
            "PAID": {
                "DEFAULT": "0.00", 
                "PAID": "83.17"
            }
        }, 
Above is a snippet of the 'analyze' results.  This particular sample was just a random selection of 1000 entries from the 200K set.  The imbalance is still roughly 4:1 and thus the resulting confusion matrix is a mess.  Now we can see that the 'accuracy' here degenerated to simply the frequency of the majority class (paid loans).

Since we have no control of Google's underlying algorithm, the only way to influence performance here is to manipulate the training set.  Below are results of a perfectly balanced data set created via down-sampling.
 "outputFeature": {
            "text": [
                {
                    "count": "743", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "743", 
                    "value": "PAID"
                }
            ]
        }
    }, 
    "id": "downsample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "49.00", 
                "PAID": "31.50"
            }, 
            "PAID": {
                "DEFAULT": "26.00", 
                "PAID": "42.50"
            }
        }, 
So, simple down-sampling made a noticeable improvement.  Recall on Defaults is now ~61%.

Below is another set of 'analyze' results, this time with a data set created via SMOTE.
 
 "outputFeature": {
            "text": [
                {
                    "count": "2000", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "1000", 
                    "value": "PAID"
                }
            ]
        }

"id": "smotesample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "177.25", 
                "PAID": "18.75"
            }, 
            "PAID": {
                "DEFAULT": "70.75", 
                "PAID": "33.25"
            }
        }, 
For this data set, I fabricated a 2:1 imbalance towards the minority class (Defaults).  Recall on Defaults is now up to 90%.

Executing a Prediction

After tweaking data sets to get to some reasonable performance out of a model, you can run predictions against the model.
def predict(project, mod):
    authKey = 'yourLendingClubAPIKey'
    loanListURL = 'https://api.lendingclub.com/api/investor/v1/loans/listing'
    header = {'Authorization' : authKey, 'Content-Type': 'application/json'}
    payload = {'showAll' : 'false'}
    resp = requests.get(loanListURL, headers=header, params=payload)
    samp = cleaner.clean(resp.json()['loans'])
    
    loanid = samp.index[0]
    val = samp.iloc[0].tolist()
    body = {'input' : {'csvInstance': val}}
    response = papi.predict(project=project, id=mod, body=body).execute()
    print 'loanid', int(loanid)
    print json.dumps(response, sort_keys=True, indent=4)    

predict(project='loaner-1316', mod='smotesample')
Lines 2-4:  Parameter set-up for a Lending Club API call.
Line 6:  API call to Lending Club fetching the currently available loans.  My discussion on accessing Lending Club's REST interface here.
Line 7:  'Wrangling' step to get the Lending Club loan data in a format acceptable to this prediction model.  This was discussed in more detail here.
Lines 9-10:  Pull a single loan out of the list returned.
Line 12:  API call to Google for a prediction on this single loan.

Results of this prediction call below:
 

{
    "id": "smotesample", 
    "kind": "prediction#output", 
    "outputLabel": "DEFAULT", 
    "outputMulti": [
        {
            "label": "DEFAULT", 
            "score": "0.703670"
        }, 
        {
            "label": "PAID", 
            "score": "0.296330"
        }
    ], 
    "selfLink": "https://www.googleapis.com/prediction/v1.6/projects/loaner-1316/trainedmodels/smotesample/predict"
}

Sunday, May 15, 2016

AI-Driven Investing with Lending Club - Part 3: API Integration


Summary

This is the final post in the series discussing my experiences with applying machine learning to Lending Club's data set.  This article will discuss how I integrated the techniques developed in the previous posts to make real-time investments utilizing the Lending Club REST API.  This post leverages my previous discussion on accessing that API. 


Right Hand Meet Left.  Please.

After all the data wrangling necessary to get the Lending Club historical set suitable for machine learning, one would hope the lessons/code there would directly apply to the real-time loan list that is accessed via API.  Unfortunately, that's not the case - at all.

I was disappointed in all the inconsistencies I saw between the historical and API data schemas.  My opinion only, but this is a clear indication of communication issues within the Lending Club Engineering organization.  The reporting and API folks simply aren't on the same page.

Schema Issue - Inconsistent Field Names

For the historical data - the apparent 'standard' is underscore separated words .  For the API, it's CamelCase words.  Unfortunately, underscores vs capitalization is not the end to the inconsistencies.  Example 1:  the amount of a loan is called 'loan_amnt' in the historical data set.  It's called 'loanAmount' in the API.  Example 2:  'verification_status' in the historical data set is 'isIncV' in the API.  Net, careful comparison of each and every field name must be done.  Then, these field names need to be normalized.  

I chose to normalize the API schema to the historical set's naming conventions.  Below is a code snippet of how to get these field/columns names matched up:
api = list(api_cols)
dat = list(loader.dat_cols)
map_cols = dict(zip(api, dat))
frame = frame[api_cols]
frame.rename(columns=map_cols, inplace=True)
Lines 1-2:  I had to carefully create lists with the API and historical field/column names I'm using for my algorithm.  Those lists need to match up, exactly.  Extra painful.
Line 3:  Create a dictionary with the mapping between API and historical names.
Line 4:  For a frame loaded with API-read loan data, filter it to the columns you want to use.
Line 5:  Perform the mapping between API and historical naming conventions.

Schema Issue - Inconsistent Field Data Types and/or Values

Examples:

  • Loan Verification status.  Historical Values: 'Verified', 'Source Verified', 'Not Verified'.  API Values: 'VERIFIED', 'SOURCE_VERIFIED', 'VERIFIED'
  • Loan Term.  Historical Values (string):  '36 months', '60 months'.  API Values (integer): 36, 60
  • Employment Length:  Historical Values (string):  varying strings representing a period months.  Examples:  '10+ years', '3 years', '< 1 year'.  API Values (integer):  X, where X represents a number of months between 0 and 120.
And trust me, there are more.

Net, I had to write an entire  'cleaner' function specifically for the API data set to normalize it to the historical.

Modifying Existing API Code

After writing the 'cleaner' function, integrating it to my existing Python app for accessing the Lending Club REST API was straightforward.  I simply needed to modify the existing loan fetching code to use the 'cleaner' to create a filtered data set and call my machine learning classifier on that loan list.  Snippets below:
#snippet of __get_loans function
resp = requests.get(self.loanListURL, headers=self.header, params=payload)
resp.raise_for_status()
loans = cleaner.clean(self.alg, resp.json()['loans'])
logger.info('{} loans were found'.format(len(loans)))
logger.debug('Exiting __getLoans()')
return loans  
Line 4: Invoke my custom 'cleaner' function.  Pass it a scikit classifier to apply on the fetched loans.

# snippet of cleaner function
frame['prediction'] = alg.predict(features=frame.values)
frame = frame[(frame['prediction']==1)].sort_values(by='intRate',ascending=False)  
return [int(x) for x in frame.index.tolist()]
Line 2:  Run the scikit prediction algorithm on the fetched loans and add it as a column to the data frame.  The values will be 1 for predicted paid loans and 0 for predicted defaults.
Line 3:  Do a conditional selection on predicted paid loans and sort them by descending interest rate.
Line 4:  Return the sorted Loan ID's.

Finally, I modified the main code block to utilize the Python apscheduler to execute the code at the 4 times Lending Club posts new loans (I'm in the Mountain Timezone).  I was using UNIX cron prior.
def job(alg):
    try:
        lc = LendingClub(ConfigData(CONFIG_FILENAME), alg)
        while lc.hasCash() and lc.hasLoans():
            lc.buy()
    except:
        logger.exception('')


if __name__ == '__main__':
    ldr = loader.Loader('../data')
    frame = ldr.load()
    alg = MlAlg(clf=clf)              
    features_train, labels_train  = ldr.get_trainset(frame, label_col='loan_status')
    alg.fit(features_train, labels_train)
    sched = BlockingScheduler()
    sched.add_job(lambda: job(alg), 'cron', hour='7,11,15,19', minute=0, second=0)
    sched.start()


AI-Driven Investing with Lending Club - Part 2: Algorithm Analysis & Class Balancing


Summary

This is Part 2 in a three-part series on my foray into machine learning.  Part 1 covered data analysis and transformation necessary for machine learning.  This article will cover my experiences with algorithm testing and one particular pitfall with the Lending Club data - class imbalance.

Part 2:  Algorithm Analysis & Class Balancing
I used the scikit-learn machine learning library.  It's a really nice library and well-documented.  I found the simplest way to get all these libraries (scikit, pandas, etc) functional in a development environment was to use the Anaconda Python distribution.  Everything just works.

As far as a starting point, the first step to using the scikit algorithms is to separate the data into 'feature' and 'label'  training and test sets.  The scikit algorithms expect inputs as numpy arrays of floats.  Scikit has a built-in split function for separating features/labels into training and test sets, but I ended up writing my own simple functions for this so I could get finer-grain access to the split data to resolve an issue described later in this article.
def get_testset(self, frame, label_col, test_size):
    self.logger.debug('frame.shape: {}, label_col: {}, test_size: {}'.format(frame.shape, label_col, test_size))
    test_frame = frame.sample(n=int(len(frame.index)*test_size), replace=False, random_state=2016) 
    labels_test = test_frame[label_col].values
    features_test = test_frame.drop(label_col, axis=1).values
    new_frame = frame.drop(test_frame.index.tolist())
    self.logger.debug('Test Set Size: {}'.format(features_test.shape))
        
    return features_test, labels_test, new_frame
    
def get_trainset(self, frame, label_col):
    self.logger.debug('frame.shape: {}, label_col: {}'.format(frame.shape, label_col))
    labels_train = frame[label_col].values
    features_train = frame.drop(label_col, axis=1).values    
    self.logger.debug('Training Set Size: {}'.format(features_train.shape))

    return features_train, labels_train
Line 3:  Using the pandas random sample function, extract out a portion of the frame for a test set.
Line 4:  Extract the test labels (in this case, this is the 'loan_status' column).
Line 5:  Extract the test features (removing the test labels)
Line 6:  Remove the test from the frame and save it to a new frame.
Line 9:  Return the test features, labels, and new frame with the test set removed.
Line 13:  Extract the training labels.
Line 14:  Extract the training features.

Initial Algorithm Testing

So after the data is in acceptable format for sci-kit, invoking an algorithm on the data is really simple.  It amounts to defining a classifier, fitting the classifier on the training set, and then testing it on test set.  Below is some sample code using a couple different algorithms:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
def init_test():
    loader = Loader('../data')
    frame = loader.load()
    features_test, labels_test, frame = loader.get_testset(frame, label_col='loan_status', test_size=.1)
    features_train, labels_train  = loader.get_trainset(frame, label_col='loan_status')
    
    clf1 = RandomForestClassifier()
    clf1.fit(features_train, labels_train)
    score1 = clf1.score(features_test, labels_test)
    print 'RandomForest', score1
    
    clf2 = AdaBoostClassifier()
    clf2.fit(features_train, labels_train)
    score2 = clf2.score(features_test, labels_test)
    print 'AdaBoost', score2
init_test()
Lines 1-2:  Load a pandas data frame with the Lending Club data set.
Lines 3-4:  Separate the frame into training and test numpy arrays.
Lines 9-12:  Invoke a RandomForest classifier on the training set and then score its accuracy against the test set.
Lines 14-17:  Same thing, but with an AdaBoost classifier.

Below is the output:
RandomForest 0.793147812828
AdaBoost 0.818402744687
At first blush, these results appear to be really encouraging.  With absolutely no algorithm parameter tuning yet, we're seeing roughly 80% accuracy.  Alas, the results are deceiving.

The scoring method used here is where the deceit begins.  Scikit's score method produces an 'accuracy' output that is simply the ratio of correct predictions divided by total predictions.  That sounds reasonable, but turns out to be a terrible scoring method for this data set.  The real picture becomes clearer after reviewing a confusion matrix.  Below is code and output for confusion matrices on these two algorithms.
from sklearn.metrics import confusion_matrix
predictions1 = clf1.predict(features_test)
matrix1 = confusion_matrix(labels_test, predictions1)
print 'RandomForest Confusion Matrix'
print matrix1
    
predictions2 = clf2.predict(features_test)
matrix2 = confusion_matrix(labels_test, predictions2)
print 'AdaBoost Confusion Matrix'
print matrix2
RandomForest Confusion Matrix
[[  546  3295]
 [ 1046 16099]]
AdaBoost Confusion Matrix
[[  193  3648]
 [  163 16982]]

Now we can begin to see where the issues lie.  The correct predictions lie on the main diagonal.  The False Positive (predicting a defaulted loan is paid) figures (3295 and 3648, respectively) is where the pain comes for the investor.
Further details can be viewed with the scikit classification report.  Code and output below.
from sklearn.metrics import classification_report
report1 = classification_report(labels_test, predictions1, target_names=['Default', 'Paid'])
print 'Random Forest Classification Report'
print report1
    
report2 = classification_report(labels_test, predictions2, target_names=['Default', 'Paid'])
print 'AdaBoost Classification Report'
print report2
Random Forest Classification Report
             precision    recall  f1-score   support

    Default       0.34      0.14      0.20      3841
       Paid       0.83      0.94      0.88     17145

avg / total       0.74      0.79      0.76     20986

AdaBoost Classification Report
             precision    recall  f1-score   support

    Default       0.54      0.05      0.09      3841
       Paid       0.82      0.99      0.90     17145

avg / total       0.77      0.82      0.75     20986

The numbers in this report are a directly computed from the confusion matrix.  Precision is calculated down the columns, Recall across the rows, and the F1-score is the weighted average of precision and recall.  'Support' is total number of samples in the respective class.  In this test set, there were 3841 defaulted and 17,145 paid loans.  Out of both of these algorithm samples, the best was a 14% recognition rate of a defaulted loan.  That's horrendous!  Clearly, the current performance of machine learning is unacceptable for making investment decisions.  Yet, why are we seeing a accuracy scores of ~80%?

Class Imbalance

The hint at the answer to this question was seen in Part 1 and further exposed in the 'support' column of these classification reports.  There's roughly a 4 to 1 imbalance between defaulted and paid loans.  This imbalance ends up skewing the machine learning models towards the majority class - in this case paid loans.  The algorithm 'accuracy' actually mimics the frequency of the majority class.  Recall from Part 1 that we saw that approximately 81% of the sample set were paid loans.  The accuracy of these algorithms is also roughly 80%.  That means - the prediction power of these models is currently no better than chance.  In other words, you could randomly pick loans and achieve the same level of performance as selection via machine learning.  Does that mean machine learning is inappropriate for this use case?

Corrections for Class Imbalance

Fortunately, there are some remedies for this problem.  I group the fixes into three classes:  Algorithm,  Data, and Thresholds.  Corrections that can be implemented directly to the machine learning algorithm include parameter tuning and class weighting.  Data-centric corrections include modifications to the sampling proportions with the actual data (up/down sampling) or creation of additional data points in the minority class via artificial means.  A third method involves directly manipulating the probability threshold of the minority class.

It's important to note that most of these techniques are of the variety 'rob Peter to pay Paul'.  With the exception of making improvements to algorithm and parameter selection - which can lead to true overall performance improvements  - the other techniques are sacrificing the prediction accuracy of the majority class to improve that of the minority.

Algorithm Correction:  Parameter Tuning

Scikit provides a nice tool to automate parameter tuning process.  Their GridSearch class automates a combinatorial sampling of parameter options and performs a k-fold validation on each to determine the best parameter set.  Below is an example with AdaBoost and RandomForest.
def grid_test():
    loader = Loader('../data')
    frame = loader.load()
    features_test, labels_test, frame = loader.get_testset(frame, label_col='loan_status', test_size=.1)
    features_train, labels_train  = loader.get_trainset(frame, label_col='loan_status')
    
    rf_param_grid = {'n_estimators':[25,35,45],'max_features':[4,5,6],'min_samples_split':[5, 6, 7]}
    
    clf1 = GridSearchCV(RandomForestClassifier(n_jobs=-1), rf_param_grid, n_jobs=-1)
    clf1.fit(features_train, labels_train)
    print 'Random Forest Best Params'
    print clf1.best_params_
    
    ab_param_grid = {'n_estimators' : [10, 40, 50, 70, 80]}
    clf2 = GridSearchCV(AdaBoostClassifier(), ab_param_grid, n_jobs=-1)
    clf2.fit(features_train, labels_train)
    print 'AdaBoost Best Params'
    print clf2.best_params_
Random Forest Best Params
{'max_features': 4, 'min_samples_split': 5, 'n_estimators': 45}

AdaBoost Best Params
{'n_estimators': 10}

Running a classification report with those parameters yields the following:
Random Forest Classification Report
             precision    recall  f1-score   support

    Default       0.55      0.02      0.04      3841
       Paid       0.82      1.00      0.90     17145

avg / total       0.77      0.82      0.74     20986

AdaBoost Classification Report
             precision    recall  f1-score   support

    Default       0.51      0.02      0.04      3841
       Paid       0.82      1.00      0.90     17145

avg / total       0.76      0.82      0.74     20986

Not a stunning improvement.  In fact, you could say performance was degraded with the parameter tuning.  It appears parameters were tuned to maximize performance of the majority class, which isn't particularly helpful in this scenario.  However, this tuning technique can be applied in combination with some of the other balancing corrections discussed below.

Algorithm Correction:  Class Weighting

Some of the scikit classifier implementations allow direct manipulation of the weights assigned to classes.  We know upfront that there is a 4:1 imbalance in this data set.  You can configure that weight specifically in the 'class_weight' parameter or simply pass 'balanced' as the value and the weighting is calculated automatically.  Below is an example of before and after with the DecisionTree classifier and some selected parameter tuning.
clf1 = tree.DecisionTreeClassifier(max_depth=6,min_samples_split=150, min_samples_leaf=7, \
                                               criterion='gini')
clf1.fit(features_train, labels_train)
score1 = clf1.score(features_test, labels_test)
print 'Decision Tree', score1
    
clf2 = tree.DecisionTreeClassifier(max_depth=6,min_samples_split=150, min_samples_leaf=7, \
                                               criterion='gini',class_weight='balanced')
clf2.fit(features_train, labels_train)
Decision Tree Classification Report
             precision    recall  f1-score   support

    Default       0.47      0.03      0.05      3841
       Paid       0.82      0.99      0.90     17145

avg / total       0.76      0.82      0.74     20986

Decision Tree(Balanced Class Weights) Classification Report
             precision    recall  f1-score   support

    Default       0.26      0.66      0.38      3841
       Paid       0.88      0.59      0.71     17145

avg / total       0.77      0.60      0.65     20986
For this particular data and parameter set, the improvement on prediction of the minority class (defaulted loans) is fairly dramatic.

Data Correction:  Sampling

Another angle for correcting errors due to class imbalance is to eliminate the balance in the data itself.  That can be done in either direction:  increase the number of minority class samples by replicating samples of that class or decrease the number of majority samples by taking only a subset of that class.  Pandas has a 'sample' function that makes up/down sampling trivial.  Below is a snippet of how to do that. For up-sampling, the 'n' parameter would be set to a value equal to (or greater than) the number of samples in the majority class.  Similarly, for down-sampling that parameter would be set to something equal to (or less than) the minority class size.
# Up-sampling the minority class
synframe = min_frame.sample(n=num_samples, replace=True)
# Down-sampling the majority class
synframe = maj_frame.sample(n=num_samples, replace=False)
Below are the results applying down-sampling with DecisionTree.  The majority class is down-sampled to be exactly equal in size with the minority.  Once again, a fairly good improvement with predicting the defaulted loan class.
Decision Tree(Down-Sampling) Classification Report
             precision    recall  f1-score   support

    Default       0.27      0.62      0.38      3841
       Paid       0.88      0.63      0.73     17145

avg / total       0.77      0.63      0.67     20986

Data Correction:  Synthetic Sampling

A different technique to sampling involves creating net new samples of the minority class.  SMOTE is such a technique.  In a nutshell, SMOTE creates new minority samples by taking the nearest neighbors of a given minority sample and creating new points on the lines connecting those neighbors to the sample.  Below is an implementation of SMOTE I wrote.  I took some liberties with the parameters of the published SMOTE algorithm to enable a consistent interface with the other data balancing techniques I was testing (such as up/down sampling).
import random
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from math import ceil
import numpy as np
from balancer import Balancer
import logging
from time import time

class Smote(Balancer):
     
    def __init__(self, label_col, min_class, mult='balanced', k=5):       
        super(Smote, self).__init__(label_col, min_class) 
        self.logger = logging.getLogger()
        self.mult = mult
        self.k = k
    
    def balance(self, frame):
        ti = time()
        maj_frame = frame[frame[self.label_col] != self.min_class]
        min_frame = frame[frame[self.label_col] == self.min_class]
        # these artificial sample size reductions added for demo purposes only (slow)
        maj_frame = maj_frame.sample(n=2000, replace=False, random_state=2016)
        min_frame = min_frame.sample(n=1000, replace=False, random_state=2016)
        
        if self.mult == 'balanced':
            self.mult = int(ceil(len(maj_frame.index)/len(min_frame.index)))
        else:
            self.mult = int(ceil(self.mult))
            
        synframe = pd.concat([maj_frame, min_frame], axis=0)
        clf = NearestNeighbors(n_neighbors=self.k, n_jobs=-1)
        clf.fit(min_frame.drop(self.label_col, axis=1).values)
        random.seed(2016)
        
        for i in range(0, len(min_frame.index)):
            point = min_frame.drop(self.label_col, axis=1).iloc[i]
            ind = clf.kneighbors(np.array(point).reshape(1,-1), return_distance=False)
            for _ in range(1,self.mult):
                ri = random.randint(1, self.k-1)
                neighbor = min_frame.drop(self.label_col, axis=1).iloc[ind[0,ri]]
                synpoint = point + random.random() * (neighbor - point)
                synpoint = synpoint.append(pd.Series([self.min_class], index=[self.label_col]))
                synframe = synframe.append(synpoint, ignore_index=True)
        
        self.logger.debug('time: {:0.4f} #min_class: {}, #maj_class: {}'.format(time() - ti, \
                    len(synframe[synframe[self.label_col] == self.min_class].index), \
                    len(synframe[synframe[self.label_col] != self.min_class].index)))    
        return synframe.sample(frac=1, random_state=2016).reset_index(drop=True)
The problem (at least with my implementation) with SMOTE is that it is unbearably slow.  Hence, I have some artificial down-sampling lines above to reduce the sample input sets.  To produce 2000 synthetic samples, the code above takes 127 seconds on my machine.

To speed up the creation of synthetic samples, I re-wrote my SMOTE implementation to support multi-processing.  Refactored code below:
import random
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from math import ceil
import numpy as np
from balancer import Balancer
import logging
from multiprocessing import Queue
from multiprocessing import JoinableQueue
from multiprocessing import Process
from multiprocessing import cpu_count
from time import time

class Smote(Balancer):
     
    def __init__(self, label_col, min_class, min_size, maj_size, mult='balanced', k=5):       
        super(Smote, self).__init__(label_col, min_class) 
        self.logger = logging.getLogger()
        self.logger.debug('label_col: {}, min_class: {}, min_size: {}, maj_size: {}, mult: {}, k: {}'.format(\
                    label_col, min_class, min_size, maj_size, mult, k))
        self.mult = mult
        self.k = k
        self.min_size = min_size
        self.maj_size = maj_size
    
    def balance(self, frame):
        ti = time()
        maj_frame = frame[frame[self.label_col] != self.min_class]
        min_frame = frame[frame[self.label_col] == self.min_class]
        
        maj_frame = maj_frame.sample(n=self.maj_size, replace=False, random_state=2016)
        min_frame = min_frame.sample(n=self.min_size, replace=False, random_state=2016)
          
        if self.mult == 'balanced':
            self.mult = int(ceil(len(maj_frame.index)/len(min_frame.index)))
        else:
            self.mult = int(ceil(self.mult))
            
        synframe = pd.concat([maj_frame, min_frame], axis=0)
        clf = NearestNeighbors(n_neighbors=self.k, n_jobs=-1)
        clf.fit(min_frame.drop(self.label_col, axis=1).values)
        random.seed(2016)
        
        workq = JoinableQueue()
        doneq = Queue()
        workers = []
        for _ in range(0, cpu_count()):
            worker = Process(target=do_work, args=(workq, doneq, clf, min_frame, self.k, self.mult, self.label_col, self.min_class))
            worker.start()
            workers.append(worker)
                  
        for i in range(0, len(min_frame.index)):
            point = min_frame.drop(self.label_col, axis=1).iloc[i]
            workq.put(point)
            
        for _ in range(0, len(workers)):
            workq.put('STOP')    
        
        workq.join()
        
        while not doneq.empty():
            synpoint = doneq.get(block=False)
            synframe = synframe.append(synpoint, ignore_index=True)
                
        self.logger.debug('time: {:0.4f} #min_size: {}, #maj_size: {}'.format(time() - ti, \
                    len(synframe[synframe[self.label_col] == self.min_class].index), \
                    len(synframe[synframe[self.label_col] != self.min_class].index)))    
        return synframe.sample(frac=1, random_state=2016).reset_index(drop=True)
        
         
def do_work(workq, doneq, clf, min_frame, k, mult, label_col, min_class):
    while True:
        point = workq.get()
        if isinstance(point,str):
            workq.task_done()
            break
        
        ind = clf.kneighbors(np.array(point).reshape(1,-1), return_distance=False)
        for _ in range(1,mult):
            ri = random.randint(1, k-1)
            neighbor = min_frame.drop(label_col, axis=1).iloc[ind[0,ri]]
            synpoint = point + random.random() * (neighbor - point)
            synpoint = synpoint.append(pd.Series([min_class], index=[label_col]))
            doneq.put(synpoint)
        workq.task_done()   

The time to create 2000 synthetic samples with this multi-processing version of SMOTE was reduced to 22 seconds with an eight-core processor.

Below is the classification report of the Decision Tree classifier with the same parameters I've used throughout.  Once again, some healthy improvements with predicting defaults.
Decision Tree(SMOTE) Classification Report
             precision    recall  f1-score   support

    Default       0.29      0.40      0.34      3841
       Paid       0.85      0.78      0.82     17145

avg / total       0.75      0.71      0.73     20986

Threshold Manipulation

The final technique I'll be discussing for addressing class imbalance is rigging prediction probabilities to favor the minority class.  For a two-class label set such as this - by default - the classification algorithm will assign the class to whichever has a .5 or greater prediction probability.  We can artificially make the algorithm more sensitive to the minority class by lowering its probability threshold.  Example:  Instead .5 being the threshold to predict a loan as a default, make the threshold .3.   

To implement this, I created a wrapper class (MlAlg) for classifiers where I overrode the prediction function.
class MlAlg(object):
    
    def predict(self, features):
        ti = time()
       
        def custom_predict(probability, prediction):
            if probability >= self.thresh:
                return self.min_class
            else:
                return prediction

        vfunc = np.vectorize(custom_predict, otypes=[np.int])
        prediction = vfunc(self.clf.predict_proba(features)[:, self.min_pos], self.clf.predict(features))
        self.logger.debug('prediction time: {:0.4f}'.format(time()-ti))
        return prediction
Lines 6-10:  A function implements the new prediction threshold.  If the probability is greater than or equal to the new/manipulated threshold - return the minority class as the prediction.
Lines 12-13:  Using numpy's vectorize function, evaluate the classifier's returned probabilities for the minority class and modify the ultimate prediction with the 'custom_predict' function.

Below are the results of Decision Tree with a modified probability threshold of .25 for defaulted loans.  This did yield an improvement in defaulted loan prediction performance as compared to the baseline (.5).
Decision Tree(.25 Threshold) Classification Report
             precision    recall  f1-score   support

    Default       0.33      0.37      0.34      3841
       Paid       0.85      0.83      0.84     17145

avg / total       0.76      0.75      0.75     20986

Conclusion

There's a considerable amount of trial/error testing required to find the right algorithm/parameter fit for a given data set.  If that data set has significant imbalances in its classes, even further trial/error is required to find a good fit between algorithm and data balancing techniques.

Saturday, May 14, 2016

AI-Driven Investing with Lending Club - Part 1: Data Wrangling


Summary

This article is part one in a three-part series covering my experiences with building a machine-learning engine for investing with Lending Club.  Lending Club is a peer-to-peer loan service.  I previously published an article that discusses how to use Lending Club's REST API for automated investing.  That code utilized user-defined criteria to select loans and then execute orders for those selected loans.  This series of articles takes that automation to the next level and uses machine learning techniques to add intelligence to the loan selection process.

All code for this series was developed in Python and utilizes the excellent data science and machine learning libraries available in Python (pandasscikit-learn, etc ).  

Admittedly, I went into this machine learning exercise as a newbie (and can still be considered as such :).  I think this article series is probably most appropriate for folks with a similar background in AI.  

Part 1:  Data Wrangling

Goal

The nature of Lending Club investing is typical of most investment schemes in that there is an inverse relationship between risk and reward.  Those loans that have been assessed to have a higher risk provide a higher reward.  

Risk in this domain is the possibility that a loan recipient does not pay off the loan, i.e., a default.  A default results in the investor(s) losing all of their remaining principle and the future interest payments for the defaulted loan.  Reward is in the form of interest rates assigned to a loan.  Net, assessed riskier loans pay higher interest rates.  Lending Club uses their own techniques to assess lendee creditworthiness in the form of loan grades/subgrades and then subsequently assign interest rates to prospective loans.

My goal, and I would suppose the goal of any investment scheme, was to develop a system for maximizing rewards.  I suspect there are multiple ways to achieve this goal, but the particular approach I used below:  
  1. Use machine learning techniques to classify prospective loans into two classes:  those that are predicted to be paid in full and those that predicted for default.
  2. Select for investment those loans with the highest interest rates that are predicted not to result in default.

Data Analysis

Conveniently, Lending Club provides loan statistics for download from their site here.  Additionally, they provide a data dictionary for the statistics.  All data is provided in the form of .csv files.  Data is provided from 2007 onward to the current quarter.  There is a large volume (115 items) of data features provided for each loan. 

Pandas provides some really nice tools for manipulating data sets.  Below is a code snippet for reading the Lending Club set into a pandas dataframe:
import pandas as pd
def load():
    data_dir = '../data'
    frame = pd.DataFrame()
   
    try:
        frame = pd.read_csv(os.path.join(data_dir,'combinedLoans.csv'),index_col=0,low_memory=False)
    except IOError:
        for filename in os.listdir(data_dir):
            if filename.startswith('LoanStats'):
                temp = pd.read_csv(os.path.join(data_dir,filename),header=1,low_memory=False)
                frame = frame.append(temp)              
        frame.to_csv(os.path.join(data_dir,'combinedLoans.csv'))
      
    return frame
Line 7:  Try to load a previously-built file that combined all the Lending Club csv's.
Line 9:  If that combined file does not exist, loop through the directory containing the csv's
Lines 10-12:  Read in each csv file and append its contents to a data frame.
Line 13:  Dump the resulting frame to a csv file.

Now with the full data set in memory, you can run any number of queries on the data.  Since I'm most interested in understanding the indicators for a default, it makes sense to look at comparisons of paid vs defaulted loans.  Below is a simple function for pulling out various data points and displaying them in graphs.
def loan_status(frame):
    temp = copy.copy(frame)
    temp['loan_status'] = temp['loan_status'].map(status_cleaner)
    temp['term'] = temp['term'].map(term_cleaner)
    temp['home_ownership'] = temp['home_ownership'].map(home_cleaner)
    temp = temp.dropna(subset=['loan_status', 'term', 'home_ownership'])
    
    #Total Defaults vs Paid
    counts = pd.DataFrame(temp['loan_status'].value_counts())
    counts.rename(columns={'loan_status':'Count'}, inplace=True)
    counts['%'] = (100*counts['Count'] / counts['Count'].sum()).round(2)
    print counts
    print ''
    temp['loan_status'].value_counts().plot(kind='pie', title='Default vs Paid', colors=['g', 'r'])
Line 2:  Make a copy of the input data frame.
Line 3:  Apply a function to the entire loan_status column.  It is this column that contains the current status of the loan.  Many loans are still active.  I'm only interested in loans that complete - either paid in full or defaulted.  The status_cleaner function, to be shown below, will accomplish this filtering.
Lines 4-5:  Additional columns that are filtered.
Line 6:  Drop all items in these columns that are nulls.
Line 9:  Aggregate counts on defaulted and paid loans.
Line 10:  Add a column with these counts
Line 11:  Add another column that represents the percentage of the counts.
Line 14:  Display a pie chart of the aggregated data.
def status_cleaner(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 'Default'
        if 'fully paid' in str.lower(x):
            return 'Paid'
    return np.nan

def term_cleaner(x): 
    if isinstance(x, str):
        if '36' in x:
            return '36'
        elif '60' in x:
            return '60'
        else:
            return np.nan
    return np.nan

def home_cleaner(x):
    if x == 'RENT':
        return 'Rent'
    elif x == 'OWN':
        return 'Own'
    elif x == 'MORTGAGE':
        return 'Mortgage'
    elif x == 'OTHER':
        return 'Other'
    else:
        return np.nan 
Lines 1-29:  Three simple functions that are applied to their associated columns in the data frame.  For this exercise, their primary function is to put nulls in any field that has bad data in it.  Those nulls then get swept away with the dropna() mentioned previously.

Below is the output of this snippet:
          Count      %
Paid     246652  81.94
Default   54352  18.06
So, after filtering out all of the loans that are either still active (not paid or defaulted) and/or have data errors in the 3 sample fields I selected- we can see there are roughly 300K loans in this data set that came to term.  ~80% of the time loans were paid in full.   That leaves a ~20% default rate, overall.  Thus paid loans out-number defaults roughly 4 to 1.  That's somewhat comforting.  If it were the other way around, I suspect it's doubtful that many would entrust their money with Lending Club.  However, this sort of imbalance in classes does cause issues with machine learning algorithms.  I will have a detailed discussion of the implications and fixes for class imbalances in Part 2.
    #default rate vs LC Grade
    grp = temp.groupby(['grade', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='line', title='Default Rate vs LC Grade', color='k')
Line 2:  Using the pandas groupby function, aggregate loans grade and status.
Line 3:  Create a new column in the data frame that is the sum of Defaults and Paid counts.
Line 4:  Create another new column that represents the percentage of defaults vs paid loans.
Line 7:  Display a line plot with Default % as the y-axis and Lending Club's assigned loan grade as the x-axis.

Below is the output.  As one would expect, as the Lending Club grading criterion goes down, the default rate goes up - roughly at 6% per grade.

loan_status  Default   Paid  Total  Default %
grade                                        
A               2980  46377  49357       6.04
B              11011  78604  89615      12.29
C              15185  63088  78273      19.40
D              12777  35786  48563      26.31
E               7766  15687  23453      33.11
F               3596   5663   9259      38.84
G               1037   1447   2484      41.75
Below is the code and output for two more examples of comparing default rate against a loan statistic.
    #default rate vs term
    grp = temp.groupby(['term', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='bar', title='Default Rate vs Loan Term', color=['r'])
    
    #default rate vs home ownership
    grp = temp.groupby(['home_ownership', 'loan_status']).size().unstack(level=-1)
    grp['Total'] = grp['Default'] + grp['Paid']
    grp['Default %'] = (100*grp['Default']/grp['Total']).round(2)
    print grp
    print ''
    grp.plot(y='Default %', kind='barh', title='Default Rate vs Home Ownership', color=['r'])
    plt.show()
Lines 1-7:  Similar to the previous example, I create an aggregation on loan term vs default rate.
Lines 9-16:  Home ownership status vs default rate.

Output below:
loan_status  Default    Paid   Total  Default %
term                                           
36             34730  198922  233652      14.86
60             19622   47730   67352      29.13
loan_status     Default    Paid   Total  Default %
home_ownership                                    
Mortgage          23826  125005  148831      16.01
Other                38     141     179      21.23
Own                4922   21560   26482      18.59
Rent              25566   99946  125512      20.37
It looks like there may be some correlation between loan term and defaults.  Loans with a 60 month term are twice as likely to default.  On the other hand, it's unclear that there is any relationship between home ownership and defaults. 

Wrangling

After analyzing the data, the next step is to determine what portions of it you want to use and transforming those portions into something can be digested by a machine learning algorithm.  This exercise nets out to selecting + transforming the features (inputs) and labels (outputs).  

As far as features, not all the columns in the Lending Club data set make sense as a feature.  An example is the 'desc' field.  That field is just free form text with little structure.  It's unusable as an input.  

For the label field - that one is clear cut.  The 'loan_status' field is what I'll be using as a classifier.

I'll be  using the scikit-learn machine learning library.  That library expects all features and labels to be formatted as NumPy arrays of floats.  So, any fields that are strings must be transformed into floats.

Below is a portion of the code I implemented for reading in and transforming the 'loan_status' field in the Lending Club data set:
def loan_filter(x):
    if isinstance(x, str):
        if 'charged off' in str.lower(x):
            return 0.
        if 'fully paid' in str.lower(x):
            return 1.
    return np.nan

try:           
    frame = pd.read_csv(os.path.join(self.data_dir,FILTERED_FILE),index_col=0,low_memory=False)
except IOError:
    frame = pd.DataFrame()
    for filename in os.listdir(self.data_dir):
        if filename.startswith('LoanStats'):
            temp = pd.read_csv(os.path.join(self.data_dir,filename),usecols=dat_cols,header=1,low_memory=False)              
            temp['loan_status'] = temp['loan_status'].map(filters.loan_filter)
Lines 1-7:  Similar to the mapping functions mentioned earlier, I'm using this function to filter out all erroneous data and narrow the field to only those loans that are paid off or defaulted.  The difference in this map is that I'm replacing the string in the field with a float.  0 signifies a default, 1 means paid.
Line 16:  Apply the map function to the 'loan_status' column.

Although machine learning algorithms won't be covered till Part 2, below is an example of data transform that is necessary if you plan to use any regression-type algorithm - e.g, logistic regression classifier.  Regression algorithms assume values are comparable.  Example:  a value of '2' is twice as large as a value of '1'.  Hence, simply assigning a set of numbers to a categorical feature won't work if a regression classifier is to be used (though, works fine for any decision tree-based classifiers).  The solution is to map the categories to dummy variables.  Some examples below.

The 'home_ownership' is a category-type field that has a value of OWN, RENT, MORTGAGE, or OTHER.  Pandas conveniently has a built-in function for creating dummy variables.
temp['home_ownership'] = temp['home_ownership'].map(home_cleaner)
temp.dropna(subset=['home_ownership'], inplace=True)
dummies = pd.get_dummies(temp['home_ownership'], drop_first=True)
del temp['home_ownership']
temp = pd.concat([temp,dummies], axis=1)
Line 1:  Normalize the values to 'own', 'rent', 'mortgage, 'other', or nan if it is an erroneous value.
Line 2:  Drop all the values that were set to nulls with mapping function in Line 1.
Line 3:  Invoke the pandas dummy function to transform the home_ownership column into 3 new columns corresponding to its values.  Example:  an 'own' column is created that has values of 1 or 0 depending on whether the home_ownership value was 'own' or something else.  The 'drop_first' parameter causes only 3 columns to be added vs four.  This is a fix for something called the dummy variable trap.
Line 4:  Drop the 'home_ownership' column from the frame.
Line 5:  Add the new dummy columns to the frame.

Here's another more sophisticated example of use of dummy variables.  In this case, I'm transforming the earliest credit line feature, which is a date, into a set of bins representing a decade from 1960 thru 2010.  The decade name becomes a dummy variable indicating if the earliest credit was established in that decade.
temp['earliest_cr_line'] = temp['earliest_cr_line'].map(filters.digit_filter)
temp.dropna(subset=['earliest_cr_line'], inplace=True)
bins = [0., 1970., 1980., 1990., 2000., 2010., 2020.]
group_names = ['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
categories= pd.cut(temp['earliest_cr_line'], bins, labels=group_names)
dummies = pd.get_dummies(categories, drop_first=True)
del temp['earliest_cr_line']
temp = pd.concat([temp,dummies], axis=1)
Line 1:  Same normalization/error cleaning step.
Line 2:  Remove nulls.
Line 3:  Define the intervals for the bins.
Line 4:  Define the category names for the bins/intervals.
Line 5:  Using the built-in pandas cut function, transform the column into categories.
Line 6:  Transform the categories to dummy variables
Line 7:  Drop the original credit line column
Line 8:  Add the new dummy columns to the frame.

This was just a portion of the data transform code I had to write for this Lending Club data set.  I noticed there were quite a few missing/erroneous values in the set.  All of that has to be cleaned up before the data is fit to be put into an algorithm.