Sunday, May 15, 2016

AI-Driven Investing with Lending Club - Part 3: API Integration


Summary

This is the final post in the series discussing my experiences with applying machine learning to Lending Club's data set.  This article will discuss how I integrated the techniques developed in the previous posts to make real-time investments utilizing the Lending Club REST API.  This post leverages my previous discussion on accessing that API. 


Right Hand Meet Left.  Please.

After all the data wrangling necessary to get the Lending Club historical set suitable for machine learning, one would hope the lessons/code there would directly apply to the real-time loan list that is accessed via API.  Unfortunately, that's not the case - at all.

I was disappointed in all the inconsistencies I saw between the historical and API data schemas.  My opinion only, but this is a clear indication of communication issues within the Lending Club Engineering organization.  The reporting and API folks simply aren't on the same page.

Schema Issue - Inconsistent Field Names

For the historical data, the apparent 'standard' is underscore-separated words.  For the API, it's camelCase.  Unfortunately, underscores vs. capitalization is not the end of the inconsistencies.  Example 1:  the amount of a loan is called 'loan_amnt' in the historical data set.  It's called 'loanAmount' in the API.  Example 2:  'verification_status' in the historical data set is 'isIncV' in the API.  Net, each and every field name has to be compared carefully, and then the names need to be normalized.

I chose to normalize the API schema to the historical set's naming conventions.  Below is a code snippet showing how to get these field/column names matched up:
api = list(api_cols)
dat = list(loader.dat_cols)
map_cols = dict(zip(api, dat))
frame = frame[api_cols]
frame.rename(columns=map_cols, inplace=True)
Lines 1-2:  I had to carefully create lists with the API and historical field/column names I'm using for my algorithm.  Those lists need to match up, exactly.  Extra painful.
Line 3:  Create a dictionary with the mapping between API and historical names.
Line 4:  For a frame loaded with API-read loan data, filter it to the columns you want to use.
Line 5:  Perform the mapping between API and historical naming conventions.
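
Because the two lists above have to stay in exactly the same order, one slip silently mislabels a column.  A minimal sketch of an alternative, assuming a single explicit mapping and operating on one raw API record (a dict) rather than a data frame.  API_TO_HIST and rename_fields are hypothetical names; only the two field pairs mentioned earlier come from the actual data:

```python
# Hypothetical mapping: API field name -> historical field name.
# Only these two pairs are taken from the article; extend as needed.
API_TO_HIST = {
    'loanAmount': 'loan_amnt',
    'isIncV': 'verification_status',
}


def rename_fields(loan, mapping=API_TO_HIST):
    """Keep only the mapped fields of one API loan record, renamed to historical names."""
    missing = [name for name in mapping if name not in loan]
    if missing:
        # Fail loudly instead of training/predicting on a misaligned schema.
        raise KeyError('API record is missing fields: {}'.format(missing))
    return {hist: loan[api] for api, hist in mapping.items()}


api_loan = {'id': 1, 'loanAmount': 12000.0, 'isIncV': 'VERIFIED'}
renamed = rename_fields(api_loan)
```

The same dictionary could also be passed to frame.rename(columns=...), with list(API_TO_HIST) serving as the column filter, so the pairing lives in one place.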

Schema Issue - Inconsistent Field Data Types and/or Values

Examples:

  • Loan Verification Status.  Historical Values: 'Verified', 'Source Verified', 'Not Verified'.  API Values: 'VERIFIED', 'SOURCE_VERIFIED', 'NOT_VERIFIED'
  • Loan Term.  Historical Values (string):  '36 months', '60 months'.  API Values (integer): 36, 60
  • Employment Length.  Historical Values (string):  varying strings representing a period in years.  Examples:  '10+ years', '3 years', '< 1 year'.  API Values (integer):  X, where X is a number of months between 0 and 120.
And trust me, there are more.

Net, I had to write an entire  'cleaner' function specifically for the API data set to normalize it to the historical.
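
To give a feel for what that involves, here is a minimal sketch covering only the three examples listed above.  It is not my actual cleaner: the helper names are hypothetical, the 'term' and 'empLength' keys are assumed API field names, 'NOT_VERIFIED' is assumed to be the API counterpart of 'Not Verified', and missing values are not handled:

```python
# Hypothetical helpers; the value formats follow the three examples above.
VERIFICATION = {
    'VERIFIED': 'Verified',
    'SOURCE_VERIFIED': 'Source Verified',
    'NOT_VERIFIED': 'Not Verified',  # assumed API value
}


def term_to_hist(months):
    """Convert an integer term, e.g. 36, to the historical string '36 months'."""
    return '{} months'.format(int(months))


def emp_length_to_hist(months):
    """Convert months employed (0-120) to the historical string buckets."""
    years = int(months) // 12
    if years < 1:
        return '< 1 year'
    if years >= 10:
        return '10+ years'
    return '1 year' if years == 1 else '{} years'.format(years)


def clean_values(loan):
    """Return a copy of one API loan record with historical-style values."""
    out = dict(loan)
    out['isIncV'] = VERIFICATION[loan['isIncV']]
    out['term'] = term_to_hist(loan['term'])
    out['empLength'] = emp_length_to_hist(loan['empLength'])
    return out


cleaned = clean_values({'isIncV': 'SOURCE_VERIFIED', 'term': 60, 'empLength': 40})
```

Going from API values to historical strings (rather than the reverse) keeps whatever encoding was fitted on the historical set usable as-is.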

Modifying Existing API Code

After writing the 'cleaner' function, integrating it into my existing Python app for accessing the Lending Club REST API was straightforward.  I simply needed to modify the existing loan-fetching code to use the 'cleaner' to create a filtered data set and call my machine learning classifier on that loan list.  Snippets below:
# snippet of __getLoans function
resp = requests.get(self.loanListURL, headers=self.header, params=payload)
resp.raise_for_status()
loans = cleaner.clean(self.alg, resp.json()['loans'])
logger.info('{} loans were found'.format(len(loans)))
logger.debug('Exiting __getLoans()')
return loans  
Line 4:  Invoke my custom 'cleaner' function.  Pass it the scikit-learn classifier to apply to the fetched loans.

# snippet of cleaner function
frame['prediction'] = alg.predict(features=frame.values)
frame = frame[(frame['prediction']==1)].sort_values(by='intRate',ascending=False)  
return [int(x) for x in frame.index.tolist()]
Line 2:  Run the scikit-learn prediction on the fetched loans and add the result as a column to the data frame.  The values will be 1 for loans predicted to be paid and 0 for predicted defaults.
Line 3:  Select the loans predicted to be paid and sort them by descending interest rate.
Line 4:  Return the sorted loan IDs.
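
The selection rule itself can be illustrated without pandas or a trained model.  In this sketch, rank_predicted_paid, the predict callable, and the 'dti' feature are hypothetical stand-ins; only the rule (keep label 1, sort by descending interest rate, return integer IDs) comes from the snippet above:

```python
def rank_predicted_paid(loans, predict, feature_keys):
    """Return IDs of loans predicted as paid (label 1), highest interest rate first."""
    rows = [[loan[key] for key in feature_keys] for loan in loans]
    labels = predict(rows)  # one label per loan: 1 = paid, 0 = default
    keep = [loan for loan, label in zip(loans, labels) if label == 1]
    keep.sort(key=lambda loan: loan['intRate'], reverse=True)
    return [int(loan['id']) for loan in keep]


def toy_predict(rows):
    """Toy rule standing in for a trained classifier: 'paid' when dti < 20."""
    return [1 if row[0] < 20 else 0 for row in rows]


loans = [
    {'id': '101', 'intRate': 7.9, 'dti': 12.0},
    {'id': '102', 'intRate': 15.3, 'dti': 30.0},
    {'id': '103', 'intRate': 11.5, 'dti': 18.0},
]
ranked = rank_predicted_paid(loans, toy_predict, feature_keys=['dti'])
```

Loan 102 has the highest rate but is predicted to default, so it is dropped before sorting; the buy loop then works through the remaining IDs in order.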

Finally, I modified the main code block to use the Python APScheduler package to execute the code at the four times each day that Lending Club posts new loans (the hours below are in Mountain Time, where I'm located).  Previously I was using UNIX cron.
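from apscheduler.schedulers.blocking import BlockingScheduler


def job(alg):
    try:
        lc = LendingClub(ConfigData(CONFIG_FILENAME), alg)
        # Keep buying while there is cash to invest and loans that pass the filter.
        while lc.hasCash() and lc.hasLoans():
            lc.buy()
    except Exception:
        # Log the traceback so one failed run doesn't go unnoticed.
        logger.exception('Investment job failed')


if __name__ == '__main__':
    # Train once at startup on the historical data set.
    ldr = loader.Loader('../data')
    frame = ldr.load()
    alg = MlAlg(clf=clf)  # clf: the classifier, built elsewhere (not shown)
    features_train, labels_train = ldr.get_trainset(frame, label_col='loan_status')
    alg.fit(features_train, labels_train)
    # Run the job at 07:00, 11:00, 15:00 and 19:00 local time.
    sched = BlockingScheduler()
    sched.add_job(lambda: job(alg), 'cron', hour='7,11,15,19', minute=0, second=0)
    sched.start()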

