Summary
This is the fifth installment in the series I've written on my experience implementing a machine learning engine for selecting loans for investment from Lending Club. In this article, I'll discuss and apply a couple of techniques for combining the results of multiple classifiers.
Part 3: API Integration
Part 4: Third Party Data Integration
Part 5: Ensembling
Explanation of Ensembling
There's quite a bit of literature out there on ensembling as it has been a popular (and winning) technique in Kaggle competitions. In a nutshell, ensembling attempts to combine the strengths of multiple classifiers to produce a classifier that is better than its parts.
The simplest ensemble is voting, which in fact has an out-of-the-box implementation in scikit-learn: VotingClassifier. In 'hard' voting, the majority prediction across the classifiers is taken to be the final prediction. In 'soft' voting, the predicted probabilities from the classifiers are summed and the class with the largest total is taken as the final prediction.
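Below is a minimal sketch of both voting modes with VotingClassifier. The classifiers and data here are just placeholders for illustration, not the ones used later in this article.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)  #placeholder data

estimators = [('gbc', GradientBoostingClassifier()),
              ('lrc', LogisticRegression(max_iter=1000)),
              ('mlp', MLPClassifier(max_iter=500))]

#'hard': each classifier casts one vote and the majority class wins
hard = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)

#'soft': the predicted probabilities are summed and the class with the largest total wins
soft = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)

print(hard.predict(X[:5]), soft.predict(X[:5]))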
A more complex ensembling technique is known as stacking (sometimes also referred to as 'blending'). In this model, classifiers are 'stacked' in levels. The outputs of the classifiers in the lower levels become the inputs to the classifier(s) in the upper levels.
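For orientation, newer versions of scikit-learn (0.22+) ship a StackingClassifier that handles this wiring for you. My implementation later in this article builds the levels by hand, but a minimal sketch of the same idea, again with placeholder classifiers and data, looks like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)  #placeholder data

#level 1: base classifiers; their out-of-fold predictions become the
#features for the level 2 (final) classifier
stack = StackingClassifier(
    estimators=[('gbc', GradientBoostingClassifier()),
                ('mlp', MLPClassifier(max_iter=500))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X, y)
print(stack.predict_proba(X[:5]))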
An interesting variation of stacking was developed during the Netflix Prize competition and was used by the team that took 2nd place in 2009. Feature-Weighted Linear Stacking combines 'meta-features' with the base models' predictions via linear regression. There's a video from Joe Sill, a member of the 2nd place team, describing the technique.
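Roughly, FWLS makes each model's blending weight a linear function of the meta-features, which algebraically reduces to a single linear regression over the products of each meta-feature with each model's prediction. A toy sketch of that reduction (all of the data, models, and meta-features here are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
model_preds = rng.random((n, 3))            #stand-in predictions from 3 base models
meta_feats = np.hstack([np.ones((n, 1)),    #constant meta-feature -> plain linear stacking as a special case
                        rng.random((n, 2))])  #2 made-up meta-features
target = rng.integers(0, 2, n)              #stand-in labels

#FWLS: prediction = sum_ij v_ij * meta_feat_j * model_pred_i, so build all
#pairwise products and fit one linear regression over them
products = np.einsum('ni,nj->nij', model_preds, meta_feats).reshape(n, -1)
fwls = LinearRegression().fit(products, target)
print(fwls.predict(products[:5]))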
Implementation
I decided to approach this with a Python class (Ensemble) that implements both voting and stacking. I settled on four classifiers: GBC, MLP, RBM, and SVD. I ran all four through a grid search to find a good parameter set for each classifier on my particular data set.
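As an example of that tuning step, the grid search for one of the classifiers might look something like the sketch below. The parameter grid and scoring metric are illustrative placeholders, not the exact grid I used; features_train and labels_train are the training split used throughout the rest of this article.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],
              'max_depth': [3, 5],
              'learning_rate': [0.05, 0.1]}

search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring='f1', cv=5, n_jobs=-1)
search.fit(features_train, labels_train)
print(search.best_params_, search.best_score_)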
Below is some code that tests each classifier in isolation and prints out a Pearson correlation matrix for the four classifiers. Good classification performance from each classifier and low correlation between classifiers are the goals for an ensemble.
pool = mp.Pool(processes=mp.cpu_count())
results = []
for name, clf in self.estimators.items():
    try:
        self.estimators[name] = joblib.load('./models/' + name + '.pkl')
    except FileNotFoundError:
        logging.debug('{} not pickled'.format(name))
        results.append(pool.apply_async(lvl1_fit, args=(clf, name, features_train, labels_train)))
pool.close()
pool.join()

for result in results:
    item = result.get()
    name = item['name']
    self.estimators[name] = item['fittedclf']

#Print confusion matrix and score for each clf.
corr_list = []
clf_list = []
for name, clf in self.estimators.items():
    preds = clf.predict(features_test)
    self.confusion_matrix(name, labels_test, preds)
    print()
    self.classification_report(name, labels_test, preds)
    corr_list.append((name, preds))
    clf_list.append(name)

#Print a matrix of correlations between clfs
frame = pd.DataFrame(index=clf_list, columns=clf_list)
for pair in itertools.combinations(corr_list, 2):
    res = pearsonr(pair[0][1], pair[1][1])[0]
    frame[pair[0][0]][pair[1][0]] = res
    frame[pair[1][0]][pair[0][0]] = res
frame['mean'] = frame.mean(skipna=True, axis=1)
pd.options.display.width = 180
print('Correlation Matrix')
print(frame)
Lines 1-16: Load up the fitted classifiers. 'estimators' is an OrderedDict of scikit-learn classifier objects. If the fitted classifiers have already been cached to disk, use that. Otherwise, fit them from scratch on the training set in a multiprocessing pool.
Lines 18-27: Generate predictions for each classifier for the test set, then print a confusion matrix and classification report for each.
Lines 29-39: Generate the Pearson correlation between each pair of classifiers and then organize the results in a matrix.
Excerpt of the output below:
svd Confusion Matrix (66392 samples):
[[ 8599  4654]
 [22715 30424]]

svd Classification Report
             precision    recall  f1-score   support

    Default       0.27      0.65      0.39     13253
       Paid       0.87      0.57      0.69     53139

avg / total       0.75      0.59      0.63     66392

Correlation Matrix
          gbc       mlp       rbm       svd      mean
gbc       NaN  0.746548  0.603429  0.532055  0.627344
mlp  0.746548       NaN  0.516596  0.538988  0.600711
rbm  0.603429  0.516596       NaN  0.408401  0.509475
svd  0.532055  0.538988  0.408401       NaN  0.493148
Voting
The voting portion is quite simple: it's just a wrapper around scikit-learn's out-of-the-box VotingClassifier implementation.
Fitting of the voting classifier below:
try:
    self.voteclf = joblib.load('./models/voteclf.pkl')
except FileNotFoundError:
    ti = time()
    self.voteclf = VotingClassifier(estimators=list(self.estimators.items()), \
                                    voting='soft', n_jobs=-1)
    self.voteclf.fit(features_train, labels_train)
    joblib.dump(self.voteclf, './models/voteclf.pkl')  #cache the fitted model to disk
logging.debug('Exiting __fit_vote()')

Lines 1-9: If the voting classifier has already been fitted and cached to disk, load it. Otherwise, fit it from scratch and dump the fitted model to disk.

Prediction with the voting classifier below:
preds = self.__predict_with_threshold(self.voteclf, features)

def __predict_with_threshold(self, clf, features):
    ti = time()
    predictions = Ensemble.__custom_predict(clf.predict_proba(features)[:, MINORITY_POS], \
                                            clf.predict(features), self.threshold)
    return predictions

__custom_predict = np.vectorize(vfunc, otypes=[np.int])

def vfunc(probability, prediction, threshold):
    if probability >= threshold:
        return MINORITY_CLASS
    else:
        return prediction
Line 1: Pass the fitted voting classifier to a custom predict function that operates with thresholds.
Lines 3-15: Custom prediction function that uses a threshold to decide whether the minority class should be chosen as the prediction. This is a method to deal with unbalanced classes.
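To see the effect of the threshold, here's a toy example using the same logic as vfunc. The class encoding (0 = 'Default', 1 = 'Paid') and the 0.3 threshold are just assumptions for the illustration: any loan whose minority-class probability reaches the threshold gets flagged as 'Default', even if the classifier's own argmax prediction was 'Paid'.

import numpy as np

MINORITY_CLASS = 0                      #'Default' (assumed encoding for this example)
probs = np.array([0.10, 0.35, 0.80])    #predicted probability of default per loan
argmax_preds = np.array([1, 1, 0])      #plain clf.predict() output (1 = 'Paid')
threshold = 0.3

thresholded = np.where(probs >= threshold, MINORITY_CLASS, argmax_preds)
print(thresholded)   #[1 0 0] -- the 0.35 loan is now flagged as a default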
Stacking
The stacking implementation is significantly more complex. Fitting, for example, requires fitting each of the four 1st level classifiers and then fitting each of them again under K-Fold cross-validation to generate the training data for the 2nd level classifier. I'm using an unbalanced data set from Lending Club that has over 600K records; balancing it takes its size to ~1M records. Fitting on data sets of this size mandates the use of multiprocessing and caching of the fitted classifiers to disk.

pool = mp.Pool(processes=mp.cpu_count())
results = []  #array for holding the result objects from the pool processes

#fit 1st level estimators with a multiprocessing pool of workers
for name, clf in self.estimators.items():
    try:
        self.estimators[name] = joblib.load('./models/' + name + '.pkl')
    except FileNotFoundError:
        logging.debug('Level 1: {} not pickled'.format(name))
        results.append(pool.apply_async(lvl1_fit, args=(clf, name, features_train, labels_train)))
pool.close()
pool.join()

for result in results:
    item = result.get()
    name = item['name']
    self.estimators[name] = item['fittedclf']  #reassign a fitted clf to the estimator dictionary

#fit 2nd level estimator with a multiprocessing pool of workers that perform a k-fold cross-val of
#training data
pool = mp.Pool(processes=mp.cpu_count())
del results[:]
try:
    self.lrc = joblib.load('./models/lrc.pkl')  #try to load the 2nd level estimator from disk
except FileNotFoundError:  #2nd level estimator not fitted yet
    logging.debug('Level 2: LRC not pickled')
    folds = list(StratifiedKFold(n_splits=5).split(features_train, labels_train))

    #define a frame for holding the k-fold test results of the 1st level classifiers
    lvl2_frame = pd.DataFrame(index=range(0, len(features_train)), columns=list(self.estimators.keys()))
    lvl2_frame[LABEL_COL] = labels_train

    #launch multiprocessing pool workers (1 per fold) that fit 1st level classifiers and perform
    #predictions that become the training data for the 2nd level classifier (Logistic Regression)
    for name, clf in self.estimators.items():
        fold = 1
        for train_idx, test_idx in folds:
            X_train, X_test = features_train[train_idx], features_train[test_idx]
            Y_train = labels_train[train_idx]
            col_loc = lvl2_frame.columns.get_loc(name)
            results.append(pool.apply_async(lvl2_fit, args=(clf, name, fold, test_idx, \
                                                            col_loc, X_train, Y_train, X_test)))
            fold = fold + 1
    pool.close()
    pool.join()

    #fetch worker results and put them into a frame that will be used to train a 2nd Level/Logistic
    #Regression classifier
    for result in results:
        item = result.get()
        name = item['name']
        test_idx = item['test_idx']
        col_loc = item['col_loc']
        preds = item['preds']
        lvl2_frame.iloc[test_idx, col_loc] = preds

    self.lrc = LogisticRegression(C=2.0)
    ti = time()
    X = lvl2_frame.drop(LABEL_COL, axis=1).values
    Y = lvl2_frame[LABEL_COL].values
    self.lrc.fit(X, Y)
    logging.debug('LRC fit time: {:0.4f}'.format(time()-ti))
    joblib.dump(self.lrc, './models/lrc.pkl')  #cache the Logistic Regression classifier to disk

def lvl1_fit(clf, name, features_train, labels_train):
    logging.debug('Entering lvl1_fit() {}'.format(name))
    ti = time()
    fittedclf = clf.fit(features_train, labels_train)
    logging.debug('{} fit time: {:0.4f}'.format(name, time()-ti))
    joblib.dump(fittedclf, './models/' + name + '.pkl')  #cache the fitted model to disk
    logging.debug('Exiting lvl1_fit() {}'.format(name))
    return {'name': name, 'fittedclf': fittedclf}

def lvl2_fit(clf, name, fold, test_idx, col_loc, features_train, labels_train, features_test):
    logging.debug('Entering lvl2_fit() {} fold {}'.format(name, fold))
    ti = time()
    clf.fit(features_train, labels_train)
    logging.debug('{} fold {} fit time: {:0.4f}'.format(name, fold, time()-ti))
    preds = clf.predict_proba(features_test)[:, MINORITY_POS]
    logging.debug('Exiting lvl2_fit() {} fold {}'.format(name, fold))
    return {'name': name, 'test_idx': test_idx, 'col_loc': col_loc, 'preds': preds}

Lines 1-18: Attempt to load fitted classifiers from disk. If they don't exist, use a pool of workers to fit each classifier to the full training set.
Lines 20-56: With K-Fold (5 folds) cross validation, fit each of the classifiers and then generate predictions with the test set in that fold. Save the predictions to a data frame.
Lines 58-64: Fit the 2nd level classifier (Logistic Regression) to the predictions from the 1st level classifiers. Dump the fitted Logistic classifier to disk.
Lines 66-73: Function for fitting the 1st level classifiers and dumping them to disk.
Lines 75-82: Function called within the K-folding for fitting 1st level classifiers and generating predictions to use to train the 2nd level classifier.
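As an aside, the out-of-fold bookkeeping in lines 20-56 is essentially what scikit-learn's cross_val_predict provides. A roughly equivalent (though single-process as written, so slower here) way to fill the 2nd level training columns would be:

from sklearn.model_selection import cross_val_predict

#one column of out-of-fold probabilities per 1st level classifier
for name, clf in self.estimators.items():
    lvl2_frame[name] = cross_val_predict(clf, features_train, labels_train, cv=5,
                                         method='predict_proba')[:, MINORITY_POS]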
Generating predictions from the stacked ensemble requires running the features through all of the 1st level classifiers and then sending their output (predictions) to the 2nd level classifier - Logistic Regression.
def __predict_stack(self, features):
    lvl1_frame = pd.DataFrame()
    #1st level predictions
    for name, clf in self.estimators.items():
        lvl1_frame[name] = clf.predict_proba(features)[:, MINORITY_POS]
    #2nd level predictions
    preds = self.__predict_with_threshold(self.lrc, lvl1_frame.values)
    return preds
Source: https://github.com/joeywhelan/Ensemble/
Copyright ©1993-2024 Joey E Whelan, All rights reserved.