Monday, May 23, 2016

Google Prediction API with Python


Summary

This post covers Google's cloud machine learning engine, the Prediction API.  I'm going to demonstrate the basic functionality of the engine with Python.  I'll use the Lending Club data set that I described here for training, and live/active loans for predictions.

Data Wrangling

The Prediction API expects training data in a particular format.  Specifically, it requires a CSV file whose first column is the label (the output).  Labels must all be either double-quoted strings or numerical values.  If the labels are strings, the prediction engine creates a classification model.  If the labels are numerical, a regression model is created.  Subsequent columns (features) in the CSV can be either quoted strings or numerical.  There can be no header row and no index column.

I was able to re-use the wrangling code I discussed here with only a few minor modifications: move the label column to column 1, and map all the labels to the strings "DEFAULT" and "PAID".  I left all the feature columns as floats.
def loan_toString_filter(x):
    if x == 0:
        return '"DEFAULT"'
    else:
        return '"PAID"'

samp['loan_status'] = samp['loan_status'].map(filters.loan_toString_filter)
samp.to_csv('../data/samp.csv', header=False, index=False, quoting=csv.QUOTE_NONE)
Lines 1-5:  Simple function for mapping '0's (defaults) and '1's (paid loans) to strings.  Note the double quotes nested inside the single quotes.
Line 7:  Application of the map.
Line 8:  That 'quoting' parameter combined with embedded quotes in the input seems to work as far as yielding a quoted string in the resulting csv file.  I tried other methods (e.g., csv.QUOTE_NONNUMERIC) that simply didn't work.  I think this may be a bug in Pandas.
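
If the pandas quoting behaviour gets in the way, the same file layout can also be produced with the standard library's csv module alone. Below is a minimal sketch, not the code used for this post: write_training_rows and the two sample rows are made up for illustration, and it assumes every feature is numeric. csv.QUOTE_NONNUMERIC quotes the string label and leaves the floats bare.

```python
import csv
import io


def write_training_rows(rows, out):
    """Write (label, features) pairs as label-first CSV rows with no header."""
    # QUOTE_NONNUMERIC double-quotes strings and leaves numbers unquoted.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for label, features in rows:
        writer.writerow([label] + [float(v) for v in features])


# Hypothetical rows: a string label followed by numeric features.
buf = io.StringIO()
write_training_rows([("DEFAULT", [0.12, 36]), ("PAID", [0.07, 60])], buf)
training_csv = buf.getvalue()
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file object instead would produce the file to upload.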

Project Creation, API Key, Data Upload

Create a project in the Console and note the ID of that project (you'll use it later).  Create and download an OAuth key file using the API Manager interface in Console.  Additionally, you need to upload the data set you created in the wrangling step to Google Cloud.  Create a bucket and upload your .csv file into it via the Storage interface in Console.


Google API Authentication

Google generally requires OAuth 2.0 authentication for access to their APIs.  Fortunately, they provide a simple Python interface to package up that handshake.
from oauth2client.service_account import ServiceAccountCredentials
from httplib2 import Http
from apiclient.discovery import build

scopes = ['https://www.googleapis.com/auth/prediction', 'https://www.googleapis.com/auth/devstorage.full_control', \
          'https://www.googleapis.com/auth/cloud-platform']
credentials = ServiceAccountCredentials.from_json_keyfile_name('path/to/your/keyfile.json', scopes=scopes)
http_auth = credentials.authorize(Http())
service = build('prediction', 'v1.6', http=http_auth)
papi = service.trainedmodels()    
Lines 5-6:  Specify the scopes you want to authorize.  Different API calls require different minimum scopes.
Lines 7-8:  Create a credentials object using the key file you created in API Manager.
Lines 9-10:  Create a service object with the credentials.

Prediction Model Creation

At this point, you've completed all the housekeeping necessary to start making calls to the Prediction API.  Below is an API call to create a new prediction model.
def create_model(project, mod, storage):
    response = papi.insert(project=project, body={'storageDataLocation': storage, \
                                                         'id':mod}).execute()
    print json.dumps(response, sort_keys=True, indent=4)

create_model(project='loaner-1316', mod='1000sample', storage='loanerbucket/1000samp.csv')
Lines 1-4:  Simple function wrapper to the Prediction API call for instantiating a model and printing its results.
Line 6: Executing the function.

Pretty simple - because the underlying machine learning algorithm is essentially a black box.  There's no way to tweak it other than deciding on whether you want a classification or regression model.
Executing the 'insert' command starts the fitting of the model, which can take quite a while depending on the size of your data set.  You can poll the status of the fitting using a 'get' command.  While the fitting is in progress, you'll get a 'trainingStatus' of 'RUNNING'.  Upon completion, the status will change to 'DONE'.
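
The polling can be wrapped in a small loop. The sketch below is an illustration rather than part of the original code: wait_for_training, its parameters, and the fake status sequence are all made-up names, and it treats any status other than 'RUNNING' as final.

```python
import time


def wait_for_training(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns something other than 'RUNNING'."""
    for _ in range(max_polls):
        status = fetch_status()
        if status != 'RUNNING':
            return status
        sleep(poll_seconds)
    raise RuntimeError('model still training after %d polls' % max_polls)


# Stand-in for the real service so the sketch runs offline and without waiting.
_fake_statuses = iter(['RUNNING', 'RUNNING', 'DONE'])
final_status = wait_for_training(lambda: next(_fake_statuses), sleep=lambda s: None)
```

With the real service, fetch_status could be something like lambda: papi.get(project=project, id=mod).execute()['trainingStatus'], built on the same 'get' call shown next.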
def get_model(project, mod):
    response = papi.get(project=project, id=mod).execute()
    print json.dumps(response, sort_keys=True, indent=4)

get_model(project='loaner-1316', mod='1000sample')
{
    "created": "2016-05-19T23:08:16.886Z", 
    "id": "1000sample", 
    "kind": "prediction#training", 
    "modelInfo": {
        "classificationAccuracy": "0.83", 
        "modelType": "classification", 
        "numberInstances": "1000", 
        "numberLabels": "2"
    }, 
    "selfLink": "https://www.googleapis.com/prediction/v1.6/projects/loaner-1316/trainedmodels/1000sample", 
    "trainingComplete": "2016-05-19T23:24:33.399Z", 
    "trainingStatus": "DONE"
}
Above is what the results look like for a model that has completed fitting/training.  Note the 83% accuracy score, which sounds somewhat encouraging; however, it's misleading.  As discussed here, the Lending Club data set has a fairly significant class imbalance.  The 'analyze' API call shows a clearer picture of what is going on.
def analyze_model(project, mod):
    response = papi.analyze(project=project, id=mod).execute()
    print json.dumps(response, sort_keys=True, indent=4)
"outputFeature": {
            "text": [
                {
                    "count": "171", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "829", 
                    "value": "PAID"
                }
            ]
        }

"id": "1000sample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "0.00", 
                "PAID": "16.83"
            }, 
            "PAID": {
                "DEFAULT": "0.00", 
                "PAID": "83.17"
            }
        }, 
Above is a snippet of the 'analyze' results.  This particular sample was just a random selection of 1000 entries from the 200K set.  The imbalance is still pronounced (829 paid loans to 171 defaults, close to 5:1), and the resulting confusion matrix is a mess: the DEFAULT column is zero in both rows.  Now we can see that the 'accuracy' here has degenerated to simply the frequency of the majority class (paid loans).
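
The arithmetic is easy to check by hand. Here is a small sketch (majority_baseline is a hypothetical helper, not part of the Prediction API) that computes the accuracy of always guessing the most frequent label, using the counts from the 'outputFeature' snippet above:

```python
def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / float(total)


# Class counts copied from the 'analyze' output for the 1000-row sample.
label, accuracy = majority_baseline({'DEFAULT': 171, 'PAID': 829})
print('%s %.3f' % (label, accuracy))  # PAID 0.829
```

That 0.829 lines up with the reported classificationAccuracy of 0.83.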

Since we have no control over Google's underlying algorithm, the only way to influence performance here is to manipulate the training set.  Below are results of a perfectly balanced data set created via down-sampling.
 "outputFeature": {
            "text": [
                {
                    "count": "743", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "743", 
                    "value": "PAID"
                }
            ]
        }
    }, 
    "id": "downsample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "49.00", 
                "PAID": "31.50"
            }, 
            "PAID": {
                "DEFAULT": "26.00", 
                "PAID": "42.50"
            }
        }, 
So, simple down-sampling made a noticeable improvement.  Recall on Defaults is now ~61%.
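
For reference, down-sampling just means randomly discarding majority-class rows until every class is the same size. The sketch below is one way to do it with the standard library; it is not the code that produced the 'downsample' model, and the function name, the seed, and the toy rows are assumptions for illustration.

```python
import random


def downsample(rows, label_of, seed=0):
    """Return a class-balanced copy of rows by sampling each class down to the smallest one."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy rows mirroring the 829/171 split of the 1000-row sample.
rows = [('PAID', i) for i in range(829)] + [('DEFAULT', i) for i in range(171)]
balanced = downsample(rows, label_of=lambda row: row[0])
```

The cost of this approach is that most of the majority-class rows are thrown away, which is part of why SMOTE is worth trying next.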

Below is another set of 'analyze' results, this time with a data set created via SMOTE.
 
 "outputFeature": {
            "text": [
                {
                    "count": "2000", 
                    "value": "DEFAULT"
                }, 
                {
                    "count": "1000", 
                    "value": "PAID"
                }
            ]
        }

"id": "smotesample", 
    "kind": "prediction#analyze", 
    "modelDescription": {
        "confusionMatrix": {
            "DEFAULT": {
                "DEFAULT": "177.25", 
                "PAID": "18.75"
            }, 
            "PAID": {
                "DEFAULT": "70.75", 
                "PAID": "33.25"
            }
        }, 
For this data set, I fabricated a 2:1 imbalance towards the minority class (Defaults).  Recall on Defaults is now up to 90%.
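
The recall figures quoted for the last two models can be reproduced from the confusion matrices. This sketch assumes the outer key is the actual class and the inner key is the predicted class, a reading of the output that matches the figures quoted here but is not stated in the API response itself; recall is a hypothetical helper.

```python
def recall(confusion, label):
    """Fraction of rows whose actual class is `label` that were predicted as `label`."""
    # The API returns the cell values as strings, so convert them first.
    row = {predicted: float(value) for predicted, value in confusion[label].items()}
    return row[label] / sum(row.values())


# Confusion matrices copied from the 'analyze' output above.
downsample_cm = {'DEFAULT': {'DEFAULT': '49.00', 'PAID': '31.50'},
                 'PAID': {'DEFAULT': '26.00', 'PAID': '42.50'}}
smote_cm = {'DEFAULT': {'DEFAULT': '177.25', 'PAID': '18.75'},
            'PAID': {'DEFAULT': '70.75', 'PAID': '33.25'}}

downsample_recall = recall(downsample_cm, 'DEFAULT')  # about 0.61
smote_recall = recall(smote_cm, 'DEFAULT')            # about 0.90
```

The same helper can be pointed at 'PAID' to see what the higher recall on Defaults costs on the other class.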

Executing a Prediction

After tweaking data sets to get some reasonable performance out of a model, you can run predictions against the model.
def predict(project, mod):
    authKey = 'yourLendingClubAPIKey'
    loanListURL = 'https://api.lendingclub.com/api/investor/v1/loans/listing'
    header = {'Authorization' : authKey, 'Content-Type': 'application/json'}
    payload = {'showAll' : 'false'}
    resp = requests.get(loanListURL, headers=header, params=payload)
    samp = cleaner.clean(resp.json()['loans'])
    
    loanid = samp.index[0]
    val = samp.iloc[0].tolist()
    body = {'input' : {'csvInstance': val}}
    response = papi.predict(project=project, id=mod, body=body).execute()
    print 'loanid', int(loanid)
    print json.dumps(response, sort_keys=True, indent=4)    

predict(project='loaner-1316', mod='smotesample')
Lines 2-5:  Parameter set-up for a Lending Club API call.
Line 6:  API call to Lending Club fetching the currently available loans.  I discussed accessing Lending Club's REST interface here.
Line 7:  'Wrangling' step to get the Lending Club loan data in a format acceptable to this prediction model.  This was discussed in more detail here.
Lines 9-10:  Pull a single loan out of the list returned.
Line 12:  API call to Google for a prediction on this single loan.

The results of this prediction call are below:
 

{
    "id": "smotesample", 
    "kind": "prediction#output", 
    "outputLabel": "DEFAULT", 
    "outputMulti": [
        {
            "label": "DEFAULT", 
            "score": "0.703670"
        }, 
        {
            "label": "PAID", 
            "score": "0.296330"
        }
    ], 
    "selfLink": "https://www.googleapis.com/prediction/v1.6/projects/loaner-1316/trainedmodels/smotesample/predict"
}
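
To use the result programmatically, the winning label and its score can be pulled out of 'outputMulti'. This is a minimal sketch: top_prediction is a hypothetical helper, and the dictionary below is a trimmed copy of the response shown above.

```python
def top_prediction(response):
    """Return (label, score) for the highest-scoring entry in 'outputMulti'."""
    # Scores arrive as strings, so compare them as floats.
    best = max(response['outputMulti'], key=lambda item: float(item['score']))
    return best['label'], float(best['score'])


# Trimmed copy of the prediction response above.
response = {
    'outputLabel': 'DEFAULT',
    'outputMulti': [
        {'label': 'DEFAULT', 'score': '0.703670'},
        {'label': 'PAID', 'score': '0.296330'},
    ],
}
best_label, best_score = top_prediction(response)
```

Having the score next to the label makes it possible to apply a stricter threshold than the API's default choice, for example only treating a loan as "PAID" above some chosen confidence.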
