Tech Tips: AI-Driven Investing with Lending Club

Summary

This is the fifth installment in the series I've written on my experience implementing a machine learning engine for selecting loans for investment from Lending Club. In this article, I'll discuss and apply a couple of techniques for combining the results of multiple classifiers.

Part 1: Data Wrangling
Part 2: Algorithm Analysis & Class Balancing
Part 3: API Integration
Part 4: Third Party Data Integration
Part 5: Ensembling

Explanation of Ensembling

There's quite a bit of literature out there on ensembling as it has been a popular (and winning) technique in Kaggle competitions. In a nutshell, ensembling attempts to combine the strengths of multiple classifiers to produce a classifier that is better than its parts.

Voting is the simplest ensemble, and scikit-learn provides an out-of-box implementation of it: VotingClassifier. In 'hard' voting, the majority prediction from multiple classifiers is taken to be the final prediction. In 'soft' voting, the predicted class probabilities from the classifiers are averaged (optionally with weights), and the class with the highest average becomes the final prediction.
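
To make the difference concrete, here is a minimal sketch in plain Python. The three probabilities are made-up numbers for a single sample, not output from my classifiers. They are chosen so that the two voting modes disagree.

```python
# Probability of class 1 from three hypothetical classifiers for one sample.
probs_class1 = [0.45, 0.45, 0.90]

# Hard voting: each classifier casts one vote for its own predicted label.
votes = [1 if p >= 0.5 else 0 for p in probs_class1]
hard_pred = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the probabilities, then pick the likelier class.
mean_prob = sum(probs_class1) / len(probs_class1)
soft_pred = 1 if mean_prob >= 0.5 else 0

print(hard_pred, soft_pred)
```

Two lukewarm "no" votes outnumber one confident "yes" under hard voting, while soft voting lets the confident classifier tip the average.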

A more complex ensembling technique is known as stacking (sometimes also referred to as 'blending'). In this model, classifiers are 'stacked' in levels. The outputs of the classifiers in the lower levels become the inputs to the classifier(s) in the upper levels.

An interesting variation of stacking, Feature-Weighted Linear Stacking, was developed for the Netflix Prize competition and was used by the 2nd place team in 2009. Rather than giving each model a fixed weight, it makes the blending weights linear functions of 'meta-features' and fits the resulting stacked model with linear regression. Joe Sill, a member of the 2nd place team, describes the technique in a video.
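
As a rough sketch of the idea (not code from the Netflix Prize team, and not something I use in my own Ensemble class), the blend can be written as a sum over products of meta-features f_j(x) and model predictions g_i(x), each with its own weight v_ji. Fitting then reduces to a linear regression on those products. The data, the weights in v_true, and the use of plain unregularized least squares are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# g: predictions from two hypothetical base models (one column per model).
g = rng.uniform(0.0, 1.0, size=(n, 2))
# f: meta-features; the first is a constant 1, so plain linear blending
# is a special case of this model.
f = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=n)])

# Synthetic target built from known weights v_true[j, i]
# (meta-feature j, base model i).
v_true = np.array([[0.6, 0.2],
                   [-0.3, 0.5]])
y = np.einsum('nj,ji,ni->n', f, v_true, g)

# Design matrix: one column per product f_j(x) * g_i(x).
design = (f[:, :, None] * g[:, None, :]).reshape(n, -1)

# Ordinary least squares estimates the blending weights.
v_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
v_hat = v_hat.reshape(v_true.shape)
print(np.round(v_hat, 3))
```

Because the synthetic target has no noise, the fitted weights should match v_true closely. On real data, some form of regularization would likely be needed.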

Implementation

I decided to approach this with a Python class (Ensemble) that implements both voting and stacking. I settled on four classifiers: GBC, MLP, RBM, and SVD. I ran all four through a grid search to find a good parameter set for each classifier on my particular data set.

Below is some code that tests each classifier in isolation and prints out a Pearson correlation matrix for the four classifiers. Good classification performance per classifier and low correlation between classifiers are both goals for an ensemble.

        pool = mp.Pool(processes=mp.cpu_count())
        results = []

        for name, clf in self.estimators.items():
            try:
                self.estimators[name] = joblib.load('./models/' + name + '.pkl')
            except FileNotFoundError:  
                logging.debug('{} not pickled'.format(name))    
                results.append(pool.apply_async(lvl1_fit, args=(clf, name, features_train, labels_train)))           
           
        pool.close()
        pool.join() 
        for result in results:
            item = result.get()
            name = item['name']
            self.estimators[name] = item['fittedclf']
        
        #Print confusion matrix and score for each clf.  
        corr_list = []
        clf_list = []
        for name, clf in self.estimators.items():
            preds = clf.predict(features_test)
            self.confusion_matrix(name, labels_test, preds)
            print()
            self.classification_report(name, labels_test, preds)
            corr_list.append((name, preds))
            clf_list.append(name)
        
        #Print a matrix of correlations between clfs
        frame = pd.DataFrame(index=clf_list, columns=clf_list, dtype=float)
    
        for pair in itertools.combinations(corr_list,2):
            res = pearsonr(pair[0][1],pair[1][1])[0]
            frame.loc[pair[1][0], pair[0][0]] = res  #.loc avoids chained assignment
            frame.loc[pair[0][0], pair[1][0]] = res
        frame['mean'] = frame.mean(skipna=True,axis=1)
        pd.options.display.width = 180
        print('Correlation Matrix')
        print(frame)

Lines 1-16: Load up the fitted classifiers. 'estimators' is an OrderedDict of scikit-learn classifier objects. If the fitted classifiers have already been cached to disk, use that. Otherwise, fit them from scratch on the training set in a multiprocessing pool.
Lines 18-27: Generate predictions for each classifier for the test set, then print a confusion matrix and classification report for each.
Lines 29-39: Generate the Pearson correlation between each pair of classifiers and then organize the results in a matrix.
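
For comparison, here is a shorter sketch of the same correlation step using NumPy's corrcoef, which computes every pairwise Pearson correlation in one call. The classifier names and predictions are invented for illustration. Note that the diagonal comes out as 1.0 rather than NaN, so it has to be excluded before averaging.

```python
import numpy as np

# Hypothetical 0/1 predictions from three classifiers on the same six test rows.
names = ['clf_a', 'clf_b', 'clf_c']
preds = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1],
])

# Each row is treated as one variable; the result is a symmetric matrix.
corr = np.corrcoef(preds)

# Mean correlation of each classifier with the others (diagonal excluded).
off_diag = corr - np.eye(len(names))
mean_corr = off_diag.sum(axis=1) / (len(names) - 1)

for name, row, mean in zip(names, corr, mean_corr):
    print(name, np.round(row, 3), round(float(mean), 3))
```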

Excerpt of the output below:

svd Confusion Matrix (66392 samples): 
[[ 8599  4654]
 [22715 30424]]

svd Classification Report
             precision    recall  f1-score   support

    Default       0.27      0.65      0.39     13253
       Paid       0.87      0.57      0.69     53139

avg / total       0.75      0.59      0.63     66392

Correlation Matrix
          gbc       mlp       rbm       svd      mean
gbc       NaN  0.746548  0.603429  0.532055  0.627344
mlp  0.746548       NaN  0.516596  0.538988  0.600711
rbm  0.603429  0.516596       NaN  0.408401  0.509475
svd  0.532055  0.538988  0.408401       NaN  0.493148

Voting

The voting portion is quite simple, as it is just a wrapper around the out-of-box implementation from scikit-learn.

Fitting of the voting classifier below:

        try:
            self.voteclf = joblib.load('./models/voteclf.pkl')
        except FileNotFoundError: 
            ti = time() 
            self.voteclf = VotingClassifier(estimators=list(self.estimators.items()), voting='soft',n_jobs=-1)      
            self.voteclf.fit(features_train, labels_train)
            joblib.dump(self.voteclf, './models/voteclf.pkl') #cache the fitted model to disk
        logging.debug('Exiting __fit_vote()')

Lines 1-8: If the voting classifier has already been fitted and cached to disk, load it. Otherwise, fit it from scratch and dump the fitted model to disk.
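
The same load-or-fit pattern shows up for every model in this article, so it could be factored into a small helper. The sketch below is only an illustration: load_or_fit and expensive_fit are hypothetical names, and it uses the standard library's pickle in place of joblib so that it runs without extra dependencies.

```python
import os
import pickle
import tempfile

def load_or_fit(path, fit_fn):
    """Return the object cached at `path`, or build it with fit_fn() and cache it."""
    try:
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        model = fit_fn()
        with open(path, 'wb') as fh:
            pickle.dump(model, fh)
        return model

calls = []

def expensive_fit():
    # Stand-in for clf.fit(...); records each call so reuse is visible.
    calls.append(1)
    return {'weights': [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'model.pkl')
    first = load_or_fit(cache_path, expensive_fit)   # cache miss: fits and dumps
    second = load_or_fit(cache_path, expensive_fit)  # cache hit: loads from disk

print(len(calls), first == second)
```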

Prediction with the voting classifier below:

preds = self.__predict_with_threshold(self.voteclf, features)

def __predict_with_threshold(self, clf, features):
        ti = time()
        predictions = Ensemble.__custom_predict(clf.predict_proba(features)[:, MINORITY_POS], \
                                                clf.predict(features), self.threshold)
        return predictions

__custom_predict = np.vectorize(vfunc, otypes=[int])  # np.int was removed in NumPy 1.24

def vfunc(probability, prediction, threshold):
    if probability >= threshold:
        return MINORITY_CLASS
    else:
        return prediction

Line 1: Pass the fitted voting classifier to a custom predict function that operates with thresholds.
Lines 3-15: Custom prediction function that uses a threshold to decide whether the minority class should be chosen as the prediction. This is a method to deal with unbalanced classes.
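
The thresholding rule can also be expressed without np.vectorize. The sketch below uses np.where to apply the same "override with the minority class when its probability clears the threshold" logic to whole arrays at once. The label value, threshold, and sample numbers are assumptions chosen for illustration.

```python
import numpy as np

MINORITY_CLASS = 0   # assumed label of the minority class
threshold = 0.3      # hypothetical cut-off

# Minority-class probabilities and default predictions for five samples.
proba_minority = np.array([0.10, 0.35, 0.55, 0.29, 0.30])
default_preds = np.array([1, 1, 0, 1, 1])

# Override the default prediction wherever the probability clears the threshold.
preds = np.where(proba_minority >= threshold, MINORITY_CLASS, default_preds)
print(preds)
```

Lowering the threshold below 0.5 trades precision on the minority class for recall, which is the point when the minority class (loan defaults here) is the costly one to miss.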

Stacking

The stacking implementation is significantly more complex. Fitting, for example, requires fitting each of the four 1st level classifiers on the full training set, and then fitting each of them again within K-Fold cross-validation to generate the out-of-fold predictions that serve as training data for the 2nd level classifier. I'm using an unbalanced data set from Lending Club that has over 600K records. Balancing takes its size to ~1M records. Fitting on data sets of this size mandates the use of multiprocessing and caching of the fitted classifiers to disk.

        pool = mp.Pool(processes=mp.cpu_count())
        results = [] #array for holding the result objects from the pool processes
        
        #fit 1st level estimators with a multiprocessing pool of workers
        for name, clf in self.estimators.items():
            try:
                self.estimators[name] = joblib.load('./models/' + name + '.pkl')
            except FileNotFoundError:  
                logging.debug('Level 1: {} not pickled'.format(name))    
                results.append(pool.apply_async(lvl1_fit, args=(clf, name, features_train, labels_train)))           
           
        pool.close()
        pool.join() 
       
        for result in results:
            item = result.get()
            name = item['name']
            self.estimators[name] = item['fittedclf'] #reassign a fitted clf to the estimator dictionary
        
        #fit 2nd level estimator with a multiprocessing pool of workers that perform a k-fold cross-val of 
        #training data
        pool = mp.Pool(processes=mp.cpu_count())
        del results[:]
        try:
            self.lrc = joblib.load('./models/lrc.pkl') #try to load the 2nd level estimator from disk
        except FileNotFoundError: #2nd level estimator not fitted yet
            logging.debug('Level 2: LRC not pickled') 
            folds = list(StratifiedKFold(n_splits=5).split(features_train, labels_train)) 
            #define a frame for holding the k-fold test results of the 1st level classifiers
            lvl2_frame = pd.DataFrame(index=range(0,len(features_train)), 
                                      columns=list(self.estimators.keys()))  
            lvl2_frame[LABEL_COL] = labels_train  
             
            #launch multiprocessing pool workers (1 per fold) that fit 1st level classifiers and perform
            #predictions that become the training data for the 2nd level classifier (Logistic Regression)
            for name,clf in self.estimators.items():
                fold = 1
                for train_idx, test_idx in folds:
                    X_train, X_test = features_train[train_idx], features_train[test_idx]
                    Y_train = labels_train[train_idx]
                    col_loc = lvl2_frame.columns.get_loc(name)
                    results.append(pool.apply_async(lvl2_fit, args=(clf, name, fold, test_idx, \
                                                                    col_loc, X_train, Y_train, X_test)))
                    fold = fold + 1
            pool.close()
            pool.join() 
           
            #fetch worker results and put them into a frame that will be used to train a 2nd Level/Logistic
            #regression classifier
            for result in results:
                item = result.get()
                name = item['name']
                test_idx = item['test_idx']
                col_loc = item['col_loc']
                preds = item['preds']
                lvl2_frame.iloc[test_idx, col_loc] = preds

            self.lrc = LogisticRegression(C=2.0)
            ti = time()
            X = lvl2_frame.drop(LABEL_COL, axis=1).values
            Y = lvl2_frame[LABEL_COL].values
            self.lrc.fit(X, Y)     
            logging.debug('LRC fit time: {:0.4f}'.format(time()-ti))
            joblib.dump(self.lrc, './models/lrc.pkl')  #cache the Logistic Regression model to disk

def lvl1_fit(clf, name, features_train, labels_train):
    logging.debug('Entering lvl1_fit() {}'.format(name))
    ti = time()
    fittedclf = clf.fit(features_train, labels_train)
    logging.debug('{} fit time: {:0.4f}'.format(name, time()-ti))
    joblib.dump(fittedclf, './models/' + name + '.pkl') #cache the fitted model to disk
    logging.debug('Exiting lvl1_fit() {}'.format(name))
    return {'name': name, 'fittedclf': fittedclf}

def lvl2_fit(clf, name, fold, test_idx, col_loc, features_train, labels_train, features_test):  
    logging.debug('Entering lvl2_fit() {} fold {}'.format(name, fold))
    ti = time()
    clf.fit(features_train, labels_train)
    logging.debug('{} fold {} fit time: {:0.4f}'.format(name, fold, time()-ti))
    preds = clf.predict_proba(features_test)[:, MINORITY_POS]
    logging.debug('Exiting lvl2_fit() {} fold {}'.format(name, fold))
    return {'name': name, 'test_idx' : test_idx, 'col_loc' : col_loc, 'preds' : preds}

Lines 1-18: Attempt to load fitted classifiers from disk. If they don't exist, use a pool of workers to fit each classifier to the full training set.
Lines 20-56: With K-Fold (5 folds) cross validation, fit each of the classifiers and then generate predictions with the test set in that fold. Save the predictions to a data frame.
Lines 58-64: Fit the 2nd level classifier (Logistic Regression) to the predictions from the 1st level classifiers. Dump the fitted Logistic classifier to disk.
Lines 66-73: Function for fitting the 1st level classifiers and dumping them to disk.
Lines 75-82: Function called within the K-folding for fitting 1st level classifiers and generating predictions to use to train the 2nd level classifier.
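
For a compact view of the same two-level flow, here is a sketch that leans on scikit-learn's cross_val_predict to produce the out-of-fold probabilities instead of a hand-rolled multiprocessing pool. It is not the code from my Ensemble class. The synthetic data set, the two 1st level classifiers, and the assumption that column 1 holds the minority class are all stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

# Two hypothetical 1st level classifiers.
level1 = {
    'gbc': GradientBoostingClassifier(n_estimators=20, random_state=0),
    'gnb': GaussianNB(),
}

cv = StratifiedKFold(n_splits=5)

# Out-of-fold probabilities: every row is scored by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
    for clf in level1.values()
])

# The 2nd level classifier learns how to weigh the 1st level outputs.
level2 = LogisticRegression(C=2.0).fit(oof, y)

# For prediction on new data, the 1st level classifiers are refit on all rows.
for clf in level1.values():
    clf.fit(X, y)

X_new = X[:5]  # placeholder for unseen rows
lvl1_new = np.column_stack([clf.predict_proba(X_new)[:, 1]
                            for clf in level1.values()])
stack_preds = level2.predict(lvl1_new)
print(stack_preds)
```

Training the 2nd level on out-of-fold predictions matters because predictions made on rows a classifier was fitted on are overly optimistic, and the 2nd level would learn to trust them too much. Recent scikit-learn releases (0.22 and later) also ship a StackingClassifier that packages this flow.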

Generating predictions from the stacked ensemble requires running the features through all of the 1st level classifiers and then sending their output (the predicted minority-class probabilities) to the 2nd level classifier - Logistic Regression.

 def __predict_stack(self, features):
        lvl1_frame = pd.DataFrame()
        #1st level predictions
        for name, clf in self.estimators.items():
            lvl1_frame[name] = clf.predict_proba(features)[:, MINORITY_POS]
            
        #2nd level predictions
        preds = self.__predict_with_threshold(self.lrc, lvl1_frame.values)
   
        return preds

Source: https://github.com/joeywhelan/Ensemble/

Friday, September 8, 2017
