Summary
This is Part 2 in a multi-part series on my foray into machine learning. Part 1 covered the data analysis and transformation necessary for machine learning. This article covers my experiences with algorithm testing and one particular pitfall with the Lending Club data - class imbalance.
Part 2: Algorithm Analysis & Class Balancing
Part 3: API Integration
Part 4: Third Party Data Integration
Part 5: Ensembling
Training/Test Data Split
I used the scikit-learn machine learning library. It's a really nice library and well-documented. I found the simplest way to get all these libraries (scikit-learn, pandas, etc.) functional in a development environment was to use the Anaconda Python distribution. Everything just works.
As a starting point, the first step in using the scikit algorithms is to separate the data into 'feature' and 'label' training and test sets. The scikit algorithms expect inputs as numpy arrays of floats. Scikit has a built-in split function for separating features/labels into training and test sets, but I ended up writing my own simple functions so I could get finer-grained access to the split data to resolve an issue described later in this article (a quick sketch of the built-in splitter follows the walkthrough below for comparison).
def get_testset(self, frame, label_col, test_size):
    self.logger.debug('frame.shape: {}, label_col: {}, test_size: {}'.format(frame.shape, label_col, test_size))
    test_frame = frame.sample(n=int(len(frame.index)*test_size), replace=False, random_state=2016)
    labels_test = test_frame[label_col].values
    features_test = test_frame.drop(label_col, axis=1).values
    new_frame = frame.drop(test_frame.index.tolist())

    self.logger.debug('Test Set Size: {}'.format(features_test.shape))
    return features_test, labels_test, new_frame

def get_trainset(self, frame, label_col):
    self.logger.debug('frame.shape: {}, label_col: {}'.format(frame.shape, label_col))
    labels_train = frame[label_col].values
    features_train = frame.drop(label_col, axis=1).values
    self.logger.debug('Training Set Size: {}'.format(features_train.shape))
    return features_train, labels_train
Line 3: Using the pandas random sample function, extract out a portion of the frame for a test set.
Line 4: Extract the test labels (in this case, this is the 'loan_status' column).
Line 5: Extract the test features (removing the test labels).
Line 6: Remove the test set from the frame and save the result to a new frame.
Line 9: Return the test features, labels, and new frame with the test set removed.
Line 13: Extract the training labels.
Line 14: Extract the training features.
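For comparison, scikit's built-in splitter covers the common case in one call. A minimal sketch, assuming the features and labels have already been separated into X and y (names mine) and a scikit-learn version where the splitter lives in model_selection:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older releases

# X: the feature columns, y: the 'loan_status' column
features_train, features_test, labels_train, labels_test = \
    train_test_split(X, y, test_size=0.1, random_state=2016)

The one thing it doesn't hand back is the remaining frame itself, which is the finer-grained access I wanted above.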
Initial Algorithm Testing
After the data is in an acceptable format for scikit, invoking an algorithm on the data is really simple. It amounts to defining a classifier, fitting the classifier on the training set, and then testing it on the test set. Below is some sample code using a couple of different algorithms:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

def init_test():
    loader = Loader('../data')
    frame = loader.load()
    features_test, labels_test, frame = loader.get_testset(frame, label_col='loan_status', test_size=.1)
    features_train, labels_train = loader.get_trainset(frame, label_col='loan_status')

    clf1 = RandomForestClassifier()
    clf1.fit(features_train, labels_train)
    score1 = clf1.score(features_test, labels_test)
    print 'RandomForest', score1

    clf2 = AdaBoostClassifier()
    clf2.fit(features_train, labels_train)
    score2 = clf2.score(features_test, labels_test)
    print 'AdaBoost', score2

init_test()
Lines 5-6: Load a pandas data frame with the Lending Club data set.
Lines 7-8: Separate the frame into training and test numpy arrays.
Lines 10-13: Invoke a RandomForest classifier on the training set and then score its accuracy against the test set.
Lines 15-18: Same thing, but with an AdaBoost classifier.
Below is the output:
RandomForest 0.793147812828
AdaBoost 0.818402744687

At first blush, these results appear to be really encouraging. With absolutely no algorithm parameter tuning yet, we're seeing roughly 80% accuracy. Alas, the results are deceiving.
The scoring method used here is where the deceit begins. Scikit's score method produces an 'accuracy' figure that is simply the ratio of correct predictions to total predictions. That sounds reasonable, but it turns out to be a terrible metric for this data set. The real picture becomes clearer after reviewing a confusion matrix. Below are the code and output for confusion matrices on these two algorithms.
from sklearn.metrics import confusion_matrix

predictions1 = clf1.predict(features_test)
matrix1 = confusion_matrix(labels_test, predictions1)
print 'RandomForest Confusion Matrix'
print matrix1

predictions2 = clf2.predict(features_test)
matrix2 = confusion_matrix(labels_test, predictions2)
print 'AdaBoost Confusion Matrix'
print matrix2
RandomForest Confusion Matrix
[[  546  3295]
 [ 1046 16099]]
AdaBoost Confusion Matrix
[[  193  3648]
 [  163 16982]]
Now we can begin to see where the issues lie. The correct predictions lie on the main diagonal. The false positives - defaulted loans predicted as paid (3295 and 3648, respectively) - are where the pain comes for the investor.
Further details can be viewed with the scikit classification report. Code and output below.
from sklearn.metrics import classification_report

report1 = classification_report(labels_test, predictions1, target_names=['Default', 'Paid'])
print 'Random Forest Classification Report'
print report1

report2 = classification_report(labels_test, predictions2, target_names=['Default', 'Paid'])
print 'AdaBoost Classification Report'
print report2
Random Forest Classification Report
             precision    recall  f1-score   support

    Default       0.34      0.14      0.20      3841
       Paid       0.83      0.94      0.88     17145

avg / total       0.74      0.79      0.76     20986

AdaBoost Classification Report
             precision    recall  f1-score   support

    Default       0.54      0.05      0.09      3841
       Paid       0.82      0.99      0.90     17145

avg / total       0.77      0.82      0.75     20986
The numbers in this report are directly computed from the confusion matrix. Precision is calculated down the columns, recall across the rows, and the F1-score is the harmonic mean of precision and recall. 'Support' is the total number of samples in the respective class. In this test set, there were 3,841 defaulted and 17,145 paid loans. Across both of these algorithms, the best result was a 14% recognition rate (recall) for defaulted loans. That's horrendous! Clearly, performance like this is unacceptable for making investment decisions. Yet, why are we seeing accuracy scores of ~80%?
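To make the arithmetic concrete, here is how the RandomForest numbers in the report fall out of its confusion matrix above (rows are actual Default/Paid, columns are predicted Default/Paid):

# RandomForest confusion matrix values from above
#                    predicted Default   predicted Paid
# actual Default            546               3295
# actual Paid              1046              16099

default_precision = 546.0 / (546 + 1046)    # 0.34 - down the predicted-Default column
default_recall = 546.0 / (546 + 3295)       # 0.14 - across the actual-Default row
default_f1 = 2 * default_precision * default_recall / (default_precision + default_recall)  # 0.20
accuracy = (546.0 + 16099) / 20986          # 0.79 - the 'encouraging' score from earlier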
Class Imbalance
The hint at the answer to this question was seen in Part 1 and is further exposed in the 'support' column of these classification reports. There's roughly a 4-to-1 imbalance between paid and defaulted loans. This imbalance ends up skewing the machine learning models towards the majority class - in this case, paid loans. The algorithms' 'accuracy' actually mimics the frequency of the majority class. Recall from Part 1 that approximately 81% of the sample set were paid loans. The accuracy of these algorithms is also roughly 80%. That means the prediction power of these models is currently no better than chance. In other words, you could randomly pick loans and achieve the same level of performance as selection via machine learning. Does that mean machine learning is inappropriate for this use case?
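A quick way to sanity-check that claim is to score a 'classifier' that always predicts the majority class. Scikit's DummyClassifier does exactly that; a minimal sketch, using the same train/test arrays as earlier:

from sklearn.dummy import DummyClassifier

# Predict the most frequent class ('Paid') for every loan - no learning at all
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features_train, labels_train)
print 'Majority-class baseline', baseline.score(features_test, labels_test)
# ~0.82 on this test set (17145 paid / 20986 total) - right alongside the 'real' models above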
Corrections for Class Imbalance
Fortunately, there are some remedies for this problem. I group the fixes into three classes: Algorithm, Data, and Thresholds. Corrections that can be applied directly to the machine learning algorithm include parameter tuning and class weighting. Data-centric corrections include modifications to the sampling proportions of the actual data (up/down sampling) or creation of additional data points in the minority class by artificial means. A third method involves directly manipulating the probability threshold of the minority class.
It's important to note that most of these techniques are of the variety 'rob Peter to pay Paul'. With the exception of making improvements to algorithm and parameter selection - which can lead to true overall performance improvements - the other techniques are sacrificing the prediction accuracy of the majority class to improve that of the minority.
Algorithm Correction: Parameter Tuning
Scikit provides a nice tool to automate the parameter tuning process. Its GridSearchCV class automates a combinatorial sampling of parameter options and performs a k-fold cross-validation on each combination to determine the best parameter set. Below is an example with RandomForest and AdaBoost.
from sklearn.model_selection import GridSearchCV  # lives in sklearn.grid_search on releases older than 0.18

def grid_test():
    loader = Loader('../data')
    frame = loader.load()
    features_test, labels_test, frame = loader.get_testset(frame, label_col='loan_status', test_size=.1)
    features_train, labels_train = loader.get_trainset(frame, label_col='loan_status')

    rf_param_grid = {'n_estimators': [25, 35, 45], 'max_features': [4, 5, 6], 'min_samples_split': [5, 6, 7]}
    clf1 = GridSearchCV(RandomForestClassifier(n_jobs=-1), rf_param_grid, n_jobs=-1)
    clf1.fit(features_train, labels_train)
    print 'Random Forest Best Params'
    print clf1.best_params_

    ab_param_grid = {'n_estimators': [10, 40, 50, 70, 80]}
    clf2 = GridSearchCV(AdaBoostClassifier(), ab_param_grid, n_jobs=-1)
    clf2.fit(features_train, labels_train)
    print 'AdaBoost Best Params'
    print clf2.best_params_
Random Forest Best Params
{'max_features': 4, 'min_samples_split': 5, 'n_estimators': 45}
AdaBoost Best Params
{'n_estimators': 10}
Running a classification report with those parameters yields the following:
Random Forest Classification Report
             precision    recall  f1-score   support

    Default       0.55      0.02      0.04      3841
       Paid       0.82      1.00      0.90     17145

avg / total       0.77      0.82      0.74     20986

AdaBoost Classification Report
             precision    recall  f1-score   support

    Default       0.51      0.02      0.04      3841
       Paid       0.82      1.00      0.90     17145

avg / total       0.76      0.82      0.74     20986

Not a stunning improvement. In fact, you could say performance was degraded by the parameter tuning. It appears the parameters were tuned to maximize performance on the majority class, which isn't particularly helpful in this scenario. However, this tuning technique can be applied in combination with some of the other balancing corrections discussed below.
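One way to point the search at the minority class rather than raw accuracy is to let the grid try class weights and to score folds with an imbalance-aware metric. A sketch, assuming a scikit-learn version whose RandomForestClassifier accepts class_weight and whose built-in scorers include 'f1_weighted':

rf_param_grid = {'n_estimators': [25, 35, 45],
                 'max_features': [4, 5, 6],
                 'min_samples_split': [5, 6, 7],
                 'class_weight': [None, 'balanced']}  # let the search decide whether to re-weight

clf1 = GridSearchCV(RandomForestClassifier(n_jobs=-1), rf_param_grid,
                    scoring='f1_weighted',            # optimize something other than raw accuracy
                    n_jobs=-1)
clf1.fit(features_train, labels_train)
print 'Random Forest Best Params'
print clf1.best_params_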
Algorithm Correction: Class Weighting
Some of the scikit classifier implementations allow direct manipulation of the weights assigned to classes. We know upfront that there is roughly a 4:1 imbalance in this data set. You can configure that weight explicitly via the 'class_weight' parameter, or simply pass 'balanced' as the value and the weighting is calculated automatically. Below is an example of before and after with the DecisionTree classifier and some selected parameter tuning.
from sklearn import tree

clf1 = tree.DecisionTreeClassifier(max_depth=6, min_samples_split=150, min_samples_leaf=7, \
                                   criterion='gini')
clf1.fit(features_train, labels_train)
score1 = clf1.score(features_test, labels_test)
print 'Decision Tree', score1

clf2 = tree.DecisionTreeClassifier(max_depth=6, min_samples_split=150, min_samples_leaf=7, \
                                   criterion='gini', class_weight='balanced')
clf2.fit(features_train, labels_train)
Decision Tree Classification Report
             precision    recall  f1-score   support

    Default       0.47      0.03      0.05      3841
       Paid       0.82      0.99      0.90     17145

avg / total       0.76      0.82      0.74     20986

Decision Tree (Balanced Class Weights) Classification Report
             precision    recall  f1-score   support

    Default       0.26      0.66      0.38      3841
       Paid       0.88      0.59      0.71     17145

avg / total       0.77      0.60      0.65     20986
For this particular data and parameter set, the improvement on prediction of the minority class (defaulted loans) is fairly dramatic.
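For reference, 'balanced' simply derives the weights from the class frequencies it sees at fit time - n_samples / (n_classes * class_count) - so with a roughly 4.5:1 split the defaulted class ends up weighted several times heavier than the paid class. Using the test-set counts above purely as an illustration:

# 'balanced' weighting formula, illustrated with the test-set counts from earlier
n_samples, n_classes = 20986, 2
default_weight = n_samples / (n_classes * 3841.0)   # ~2.73
paid_weight = n_samples / (n_classes * 17145.0)     # ~0.61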
Data Correction: Sampling
Another angle for correcting errors due to class imbalance is to eliminate the imbalance in the data itself. That can be done in either direction: increase the number of minority-class samples by replicating samples of that class, or decrease the number of majority-class samples by taking only a subset of that class. Pandas has a 'sample' function that makes up/down sampling trivial. Below is a snippet of how to do that. For up-sampling, the 'n' parameter would be set to a value equal to (or greater than) the number of samples in the majority class. Similarly, for down-sampling, that parameter would be set to something equal to (or less than) the minority class size.
# Up-sampling the minority class
synframe = min_frame.sample(n=num_samples, replace=True)

# Down-sampling the majority class
synframe = maj_frame.sample(n=num_samples, replace=False)

Below are the results of applying down-sampling with DecisionTree. The majority class is down-sampled to be exactly equal in size to the minority. Once again, a fairly good improvement in predicting the defaulted loan class.
Decision Tree (Down-Sampling) Classification Report
             precision    recall  f1-score   support

    Default       0.27      0.62      0.38      3841
       Paid       0.88      0.63      0.73     17145

avg / total       0.77      0.63      0.67     20986
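Pulled together into a small helper, a down-sampling pass might look like the sketch below. The frame layout mirrors the earlier code, but the function itself is illustrative rather than the exact Balancer implementation used for the report above (min_class is whatever label value Part 1 assigned to defaulted loans):

import pandas as pd

def downsample(frame, label_col, min_class, random_state=2016):
    # Split the frame into minority (defaulted) and majority (paid) rows
    min_frame = frame[frame[label_col] == min_class]
    maj_frame = frame[frame[label_col] != min_class]

    # Keep only as many majority rows as there are minority rows
    maj_frame = maj_frame.sample(n=len(min_frame.index), replace=False, random_state=random_state)

    # Recombine and shuffle so the classes are interleaved
    balanced = pd.concat([min_frame, maj_frame], axis=0)
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)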
Data Correction: Synthetic Sampling
A different approach to sampling involves creating net-new samples of the minority class. SMOTE (Synthetic Minority Over-sampling Technique) is one such technique. In a nutshell, SMOTE creates new minority samples by taking the nearest neighbors of a given minority sample and creating new points on the lines connecting those neighbors to the sample. Below is an implementation of SMOTE I wrote. I took some liberties with the parameters of the published SMOTE algorithm to enable a consistent interface with the other data balancing techniques I was testing (such as up/down sampling).
import random
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from math import ceil
import numpy as np
from balancer import Balancer
import logging
from time import time

class Smote(Balancer):

    def __init__(self, label_col, min_class, mult='balanced', k=5):
        super(Smote, self).__init__(label_col, min_class)
        self.logger = logging.getLogger()
        self.mult = mult
        self.k = k

    def balance(self, frame):
        ti = time()
        maj_frame = frame[frame[self.label_col] != self.min_class]
        min_frame = frame[frame[self.label_col] == self.min_class]

        # these artificial sample size reductions added for demo purposes only (slow)
        maj_frame = maj_frame.sample(n=2000, replace=False, random_state=2016)
        min_frame = min_frame.sample(n=1000, replace=False, random_state=2016)

        if self.mult == 'balanced':
            self.mult = int(ceil(len(maj_frame.index)/len(min_frame.index)))
        else:
            self.mult = int(ceil(self.mult))

        synframe = pd.concat([maj_frame, min_frame], axis=0)

        # fit nearest neighbors on the minority class only
        clf = NearestNeighbors(n_neighbors=self.k, n_jobs=-1)
        clf.fit(min_frame.drop(self.label_col, axis=1).values)

        random.seed(2016)
        for i in range(0, len(min_frame.index)):
            point = min_frame.drop(self.label_col, axis=1).iloc[i]
            ind = clf.kneighbors(np.array(point).reshape(1, -1), return_distance=False)
            for _ in range(1, self.mult):
                # interpolate between the sample and a randomly chosen neighbor
                ri = random.randint(1, self.k-1)
                neighbor = min_frame.drop(self.label_col, axis=1).iloc[ind[0, ri]]
                synpoint = point + random.random() * (neighbor - point)
                synpoint = synpoint.append(pd.Series([self.min_class], index=[self.label_col]))
                synframe = synframe.append(synpoint, ignore_index=True)

        self.logger.debug('time: {:0.4f} #min_class: {}, #maj_class: {}'.format(time() - ti, \
            len(synframe[synframe[self.label_col] == self.min_class].index), \
            len(synframe[synframe[self.label_col] != self.min_class].index)))

        return synframe.sample(frac=1, random_state=2016).reset_index(drop=True)
The problem with SMOTE (at least with my implementation) is that it is unbearably slow. Hence the artificial down-sampling lines above to reduce the input sample sets. To produce 2000 synthetic samples, the code above takes 127 seconds on my machine.
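As an aside, the third-party imbalanced-learn package ships an optimized SMOTE if rolling your own isn't appealing; rough usage is sketched below (the resampling method is named fit_resample in recent releases and fit_sample in older ones, so treat the exact call as version-dependent):

from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

sm = SMOTE(random_state=2016)
features_balanced, labels_balanced = sm.fit_resample(features_train, labels_train)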
To speed up the creation of synthetic samples, I re-wrote my SMOTE implementation to support multi-processing. Refactored code below:
import random
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from math import ceil
import numpy as np
from balancer import Balancer
import logging
from multiprocessing import Queue
from multiprocessing import JoinableQueue
from multiprocessing import Process
from multiprocessing import cpu_count
from time import time

class Smote(Balancer):

    def __init__(self, label_col, min_class, min_size, maj_size, mult='balanced', k=5):
        super(Smote, self).__init__(label_col, min_class)
        self.logger = logging.getLogger()
        self.logger.debug('label_col: {}, min_class: {}, min_size: {}, maj_size: {}, mult: {}, k: {}'.format(\
            label_col, min_class, min_size, maj_size, mult, k))
        self.mult = mult
        self.k = k
        self.min_size = min_size
        self.maj_size = maj_size

    def balance(self, frame):
        ti = time()
        maj_frame = frame[frame[self.label_col] != self.min_class]
        min_frame = frame[frame[self.label_col] == self.min_class]
        maj_frame = maj_frame.sample(n=self.maj_size, replace=False, random_state=2016)
        min_frame = min_frame.sample(n=self.min_size, replace=False, random_state=2016)

        if self.mult == 'balanced':
            self.mult = int(ceil(len(maj_frame.index)/len(min_frame.index)))
        else:
            self.mult = int(ceil(self.mult))

        synframe = pd.concat([maj_frame, min_frame], axis=0)

        clf = NearestNeighbors(n_neighbors=self.k, n_jobs=-1)
        clf.fit(min_frame.drop(self.label_col, axis=1).values)

        random.seed(2016)

        # one worker process per core pulls minority points off the work queue
        workq = JoinableQueue()
        doneq = Queue()
        workers = []
        for _ in range(0, cpu_count()):
            worker = Process(target=do_work, args=(workq, doneq, clf, min_frame, self.k, self.mult, self.label_col, self.min_class))
            worker.start()
            workers.append(worker)

        for i in range(0, len(min_frame.index)):
            point = min_frame.drop(self.label_col, axis=1).iloc[i]
            workq.put(point)

        for _ in range(0, len(workers)):
            workq.put('STOP')

        workq.join()

        # collect the synthetic points produced by the workers
        while not doneq.empty():
            synpoint = doneq.get(block=False)
            synframe = synframe.append(synpoint, ignore_index=True)

        self.logger.debug('time: {:0.4f} #min_size: {}, #maj_size: {}'.format(time() - ti, \
            len(synframe[synframe[self.label_col] == self.min_class].index), \
            len(synframe[synframe[self.label_col] != self.min_class].index)))

        return synframe.sample(frac=1, random_state=2016).reset_index(drop=True)

def do_work(workq, doneq, clf, min_frame, k, mult, label_col, min_class):
    while True:
        point = workq.get()
        if isinstance(point, str):
            # 'STOP' sentinel - this worker is done
            workq.task_done()
            break
        ind = clf.kneighbors(np.array(point).reshape(1, -1), return_distance=False)
        for _ in range(1, mult):
            ri = random.randint(1, k-1)
            neighbor = min_frame.drop(label_col, axis=1).iloc[ind[0, ri]]
            synpoint = point + random.random() * (neighbor - point)
            synpoint = synpoint.append(pd.Series([min_class], index=[label_col]))
            doneq.put(synpoint)
        workq.task_done()
The time to create 2000 synthetic samples with this multi-processing version of SMOTE was reduced to 22 seconds with an eight-core processor.
Below is the classification report of the Decision Tree classifier with the same parameters I've used throughout. Once again, some healthy improvements with predicting defaults.
Decision Tree (SMOTE) Classification Report
             precision    recall  f1-score   support

    Default       0.29      0.40      0.34      3841
       Paid       0.85      0.78      0.82     17145

avg / total       0.75      0.71      0.73     20986
Threshold Manipulation
The final technique I'll discuss for addressing class imbalance is rigging the prediction probabilities to favor the minority class. For a two-class label set such as this, the classifier will - by default - predict whichever class has a probability of .5 or greater. We can artificially make the algorithm more sensitive to the minority class by lowering its probability threshold. For example: instead of .5 being the threshold to predict a loan as a default, make the threshold .3.
To implement this, I created a wrapper class (MlAlg) for classifiers where I overrode the prediction function.
class MlAlg(object):

    def predict(self, features):
        ti = time()

        def custom_predict(probability, prediction):
            if probability >= self.thresh:
                return self.min_class
            else:
                return prediction

        vfunc = np.vectorize(custom_predict, otypes=[np.int])
        prediction = vfunc(self.clf.predict_proba(features)[:, self.min_pos], self.clf.predict(features))

        self.logger.debug('prediction time: {:0.4f}'.format(time()-ti))
        return prediction
Lines 6-10: A function implements the new prediction threshold. If the probability is greater than or equal to the new/manipulated threshold - return the minority class as the prediction.
Lines 12-13: Using numpy's vectorize function, evaluate the classifier's returned probabilities for the minority class and modify the ultimate prediction with the 'custom_predict' function.
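The same effect can be had without a wrapper by thresholding the probabilities directly with numpy. A minimal sketch - the minority column index and the label values (default_label, paid_label) are placeholders that depend on how the classes were encoded back in Part 1:

import numpy as np

# Column of predict_proba corresponding to the minority ('Default') class
default_col = list(clf.classes_).index(default_label)

probs = clf.predict_proba(features_test)[:, default_col]
predictions = np.where(probs >= 0.25, default_label, paid_label)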
Below are the results of Decision Tree with a modified probability threshold of .25 for defaulted loans. This did yield an improvement in defaulted loan prediction performance as compared to the baseline (.5).
Decision Tree (.25 Threshold) Classification Report
             precision    recall  f1-score   support

    Default       0.33      0.37      0.34      3841
       Paid       0.85      0.83      0.84     17145

avg / total       0.76      0.75      0.75     20986
Conclusion
There's a considerable amount of trial-and-error testing required to find the right algorithm/parameter fit for a given data set. If that data set has significant class imbalances, even more trial and error is required to find a good fit between the algorithm and the data balancing techniques.