This approach is one method of deploying a machine learning model to a production environment. We utilise many methods, as we have access to a wider array of technology, but this method works just as well. I will take you through training a model, making validation predictions with the model, evaluating the model fit, saving the trained model, loading the trained model into your production R environment, predicting unseen (production) data with the model and combining these predictions with your production data. Let's get started!
Training the model
The dataset – patients who have been readmitted (a binary flag of yes and no, i.e. readmitted vs not readmitted) – has been partitioned into a train and a test dataset. This partitioning is a common ML approach, and it is one of the methods you can learn when you come along to our Machine Learning training (enquire at email@example.com), where you can also find out what we put in our secret sauce to make models even more accurate.
We will use the caret (Classification and Regression Training) package to train the model:
set.seed(123)                        # fix the random seed for reproducibility
no_tree <- 2000                      # number of decision trees to grow
ctrl <- trainControl(method = "none")   # no resampling
rf_caret <- caret::train(factor(Readmitted) ~ .,
                         data = train,
                         method = "rf",
                         trControl = ctrl,
                         ntree = no_tree)
rf_caret
What is happening here:
- Set a random seed value – this can be any number you like. Random forests work by taking random samples of the data and growing a number of decision trees on those samples, so fixing the seed makes the results reproducible
- I then specify the number of decision trees we wish to grow
- With caret, the ctrl variable sets the train control for the model. This allows different resampling methods to be applied to the model; here method = "none" means no resampling is used
- I then store the random forest model in a variable called rf_caret. This uses the caret::train function to train on my readmitted-patients data. The ~ (tilde) is a special way of saying that on the left is my target variable and on the right are the predictor variables; the dot (.) signifies everything else in the data frame. The data argument is passed your training partition – i.e. the sample used to train the model and inform the relevant decision tree splits – and the number of trees to grow is referenced by the no_tree variable defined at the top of the code
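To make the formula notation concrete, here is a minimal base-R sketch (the column names are illustrative) showing how the dot expands to every remaining column of a data frame:

```r
# The left of ~ is the target; "." stands for every other column
f <- Readmitted ~ .

# Expand the formula against a toy data frame with illustrative columns
toy <- data.frame(Age = 70, LOS = 3, GenderCode = 1, Readmitted = "No")
predictors <- attr(terms(f, data = toy), "term.labels")
print(predictors)  # "Age" "LOS" "GenderCode"
```

So caret::train sees three predictors here, without you having to type each one out.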
Making predictions with the test dataset
You can then make predictions with the model using the test data, to validate how well the model will perform on unseen (production) data:
pred_from_test <- predict(rf_caret, newdata = test[-4])
This uses the trained model to make predictions from all the variables in the test data, bar the 4th column (as this is the thing we are trying to predict, i.e. readmission (yes/no)).
Evaluating the model fit
You can then assess your model using the confusion matrix to check its validity – refer to our previous post on this: https://www.draperanddash.com/machinelearning/2019/07/confusion-matrices-evaluating-your-classification-models/.
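As a quick illustration of what that evaluation looks like, here is a hedged base-R sketch that builds a confusion matrix with table() on made-up predicted and actual vectors – in practice you would use pred_from_test against the real Readmitted column (or caret's confusionMatrix function):

```r
# Made-up actual and predicted classes standing in for the real test data
actual    <- factor(c("Yes", "No", "No", "Yes", "No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"))

# Cross-tabulate predictions against actuals
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)  # 4 of 6 correct
```

The off-diagonal cells are the false positives and false negatives discussed in the post linked above.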
Saving your finalised model
When you are happy with how the model performs, it is time to package it up so it can be used on production (unseen / live) data. Please note the data structure must be the same as for the trained model: if you used 3 predictor variables (age, LOS and gender code), as in this example, then you would pass the same variables to make the prediction.
This can be achieved by the following code:
prod_ready_rf <- rf_caret
save(test, prod_ready_rf, file = "rf_data.RDA")
This saves the model and the test data frame to a file named rf_data.RDA – save() is the native R command for storing RDA files.
This script would be the one you use to retrain the model. How often you retrain depends on how frequently the data and its underlying patterns change. If it is an A&E department that is changing all the time, then retraining daily might be a good strategy. However, some models can take 5-12 hours to retrain, depending on complexity, so the data scientist who oversees setting this up would need to be cognisant of that fact.
Using the script with live data
To load the file back into your R environment you would use:

load("rf_data.RDA")
This will load the file from the working directory and you would have two objects – test and the prod_ready_rf model.
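One point worth knowing: save() records objects under their original variable names, and load() silently restores them under those same names. A self-contained sketch of that round trip, again with a stand-in lm() model:

```r
# Build and save a stand-in model under the name demo_model
demo_model <- lm(mpg ~ wt, data = mtcars)
path <- file.path(tempdir(), "rf_data_demo.RDA")
save(demo_model, file = path)

# Remove it, then load() brings it back under the same name
rm(demo_model)
load(path)
print(exists("demo_model"))  # TRUE
```

This is why the production script can refer to prod_ready_rf directly after the load() call, without any assignment.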
Next, I am going to use a portion of the test data to represent new unseen data – this would be the data that is being fed live through the model to make predictions:
live_data <- test[1:40,1:3]
This creates a variable called live_data, taking the first 40 rows as the new records and selecting the first three columns, as the 4th column is the readmitted flag – this cannot be passed to the model.
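Since the live data must carry exactly the predictor columns the model was trained on, a cheap guard in the production script can fail fast when the structure drifts. A sketch, assuming the three example predictors (age, LOS and gender code – the names here are illustrative):

```r
# The predictor columns the model expects (illustrative names)
train_predictors <- c("Age", "LOS", "GenderCode")

# A stand-in for the incoming live records
live_data <- data.frame(Age = c(71, 54), LOS = c(3, 8),
                        GenderCode = c(1, 2))

# Stop immediately if the live structure does not match the training one
stopifnot(identical(names(live_data), train_predictors))
print("structure check passed")
```

A check like this turns a confusing prediction error into a clear failure at the point the data arrives.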
Predicting with the production data
The final step is to use the model to generate predictions – that is the whole point of exporting the model in the first place. This can be done with the generic predict() function from base R:
live_predictions <- predict(prod_ready_rf, live_data, type = "raw")
print(live_predictions)
If you want to return the class probabilities, you would need to replace the parameter type = "raw" with type = "prob".
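To show the difference, here is a small base-R sketch of what type = "prob" output looks like and how a manual 0.5 cut-off turns those probabilities back into class labels (the probability values are made up):

```r
# Made-up class probabilities as returned for three live records
probs <- data.frame(No  = c(0.80, 0.30, 0.55),
                    Yes = c(0.20, 0.70, 0.45))

# Apply a 0.5 threshold on the "Yes" probability to recover labels
pred_class <- ifelse(probs$Yes >= 0.5, "Yes", "No")
print(pred_class)  # "No" "Yes" "No"
```

Working with probabilities rather than raw labels is useful when the cost of a missed readmission is higher than the cost of a false alarm, as you can move the threshold accordingly.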
The last step is to bind the predictions back to the master production dataset – in this case the test data frame standing in for a live dataset:
production_data_frame <- data.frame(Readmit_class=live_predictions, live_data)
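For completeness, a self-contained sketch of that final bind, showing that data.frame() simply prepends the prediction column to the live rows (the values are made up):

```r
# Stand-ins for the live records and the model's predictions
live_data <- data.frame(Age = c(71, 54), LOS = c(3, 8))
live_predictions <- factor(c("Yes", "No"))

# Prepend the predicted class to the production rows
production_data_frame <- data.frame(Readmit_class = live_predictions,
                                    live_data)
print(names(production_data_frame))  # "Readmit_class" "Age" "LOS"
```

The result is one frame per scoring run, ready to be written back to a database or picked up by a dashboard.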
Completion of the model deployment
The training model should be kept separate from the production model – this is the approach D&D use to segment our training models from our production models. There are other ways to deploy your model, such as via Azure Machine Learning or direct integration into a BI solution such as Qlik or Tableau.
This is a very simple example of a model – in practice you will spend a lot of time getting the data right, tweaking the accuracy of the model and selecting algorithms. Once you have that great model, though, it is good to know how to deploy it.
Remember, the great George Box once said “all models are wrong, but some are useful”. I’ll leave you with that one.