Bucketing and highlighting dominant predictors in your ML models

At D&D we endeavour to add custom solutions to our products. The example below shows how we innovate and how we understand the challenges that managers and senior stakeholders in the NHS face when interpreting the outputs of predictive models and of advanced analytics in general. It is a custom example of how we do things differently, and the method is now used in many of our predictive modules. So, how does it solve the problem?

Problem summary

ML models tend to produce obscure outputs. Terms like Kappa value, sensitivity or Gini index will, in most cases, mean nothing to the final user of the predictive algorithm. They indicate how accurate the model is and how well it fits a specific problem, and are mainly useful to the Data Scientist or ML Engineer.

One key element of a predictive ML model that suffers this fate is the list of variables the model relied on most to reach a decision. The model uses its inputs differently (e.g. age could be more relevant than gender for predicting height), and it automatically scores each input according to its contribution to the decision.

D&D’s approach to variable importance

This list of variables scored by importance will undoubtedly be relevant to the Data Scientist or ML Engineer but, again, it won’t give the final user the information they need.

In a binary decision-making ML model (one whose dependent variable, or target, such as stranded or not stranded, has a positive or negative result), the variable importance list will point at the variables that mattered most to both negative and positive decisions. To avoid this, and to obtain only the variables that led to a positive decision (e.g. a patient being readmitted in less than 30 days), at Draper & Dash we have developed a system that highlights the dominant predictors in a concise, precise and useful way. It is used in our Stranded and Readmissions apps, where it has pointed users to the elements that led to patients being stranded, or readmitted in less than 30 days, so that they can act on them more effectively. It is also easier to interpret than the standard model outputs.

Throwing features into a bucket

To do this, we start by bucketing our variables. If a variable is numeric, we establish cohorts of values and create one column per cohort, filled with 1s and 0s. Following the previous example, we could bucket age as Age 0-17, Age 18-30, and so on. If the variable is categorical, we create a column for each possible value, again filled with 1s and 0s indicating membership of that bucket.
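The bucketing step described above can be sketched in base R as follows. This is a minimal illustration and not the production code: the toy data, the age breaks, the column names and the helper `indicator_columns` are all assumptions made for the example.

```r
# Toy data; column names and age breaks are illustrative assumptions
patients <- data.frame(
  age    = c(5, 18, 25, 47, 82),
  gender = c("F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# Numeric variable: cut() assigns each value to exactly one cohort.
# right = FALSE makes the intervals [0,18), [18,31), [31,66), [66,Inf)
age_breaks <- c(0, 18, 31, 66, Inf)
age_labels <- c("Age_0_17", "Age_18_30", "Age_31_65", "Age_66_plus")
patients$age_bucket <- cut(patients$age, breaks = age_breaks,
                           labels = age_labels, right = FALSE)

# Hypothetical helper: one 1/0 indicator column per bucket or category value
indicator_columns <- function(values, prefix = "") {
  values <- factor(values)
  out <- sapply(levels(values), function(lv) as.integer(values == lv))
  colnames(out) <- paste0(prefix, levels(values))
  data.frame(out, check.names = FALSE)
}

bucketed <- cbind(
  indicator_columns(patients$age_bucket),
  indicator_columns(patients$gender, prefix = "Gender_")
)
print(bucketed)
```

Using half-open intervals means a boundary value such as 18 falls into exactly one cohort, so every row has a single 1 among the age columns.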

Once this is done and the model has been trained on the bucketed dataset, we generate an extra column in the dataset, called ImportantVariables, holding the names of all the bucketed variables that showed a “1” whenever the prediction for that row was over 50%. The code below then splits this column and totals a score per variable:

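library(stringr)     # str_split
library(dplyr)       # %>%
library(tidyr)       # gather
library(data.table)  # data.table, setnames

# Split each row's comma-separated ImportantVariables into a character vector
a <- str_split(m$ImportantVariables, ",")
# Pad every vector with NA up to the length of the longest one
n <- max(lengths(a))
a <- lapply(a, `length<-`, n)
a <- data.frame(a, stringsAsFactors = FALSE)
# Trim whitespace and turn empty strings into NA
a <- data.frame(apply(a, 2, function(x) gsub("^$", NA, trimws(x))), stringsAsFactors = FALSE)
…
# Score by position: the first listed variable gets the highest score
a$Score <- rev(seq_len(nrow(a)))
# Reshape to long format; `c` (the columns to gather) comes from the elided step
d <- a %>%
  gather(., key = "Row", value = "Variable", as.character(c))
# Total the scores per variable
d <- data.table(d, key = "Variable")
e <- d[, max(cumsum(Score)), by = key(d)]
setnames(e, "V1", "Value")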
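The code above expects `m` to already contain the ImportantVariables column. A minimal base R sketch of how such a column could be built is shown here. The toy data, the column names, the `Prediction` column and the choice to leave non-positive rows as empty strings are assumptions for illustration. The order of names within each string simply follows column order, which may differ from the original implementation.

```r
# Toy bucketed data with predicted probabilities; all names and values are illustrative
m <- data.frame(
  Age_66_plus = c(1, 0, 1, 0),
  Lives_alone = c(1, 1, 0, 0),
  Prior_admit = c(0, 1, 1, 0),
  Prediction  = c(0.81, 0.35, 0.66, 0.10)
)
bucket_cols <- setdiff(names(m), "Prediction")

# For each row predicted positive (over 50%), list the bucket columns equal to 1;
# rows at or below 50% get an empty string (an assumption of this sketch)
m$ImportantVariables <- vapply(seq_len(nrow(m)), function(i) {
  if (m$Prediction[i] <= 0.5) return("")
  flagged <- bucket_cols[unlist(m[i, bucket_cols]) == 1]
  paste(flagged, collapse = ",")
}, character(1))

print(m[, c("Prediction", "ImportantVariables")])
```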

Outputting the variable importance combinations

The code in the previous section generates a more understandable list of variables by counting how often a feature contributes to a positive decision. One consequence is that variables that present the value 1 less often than others will be considered less important, even if they were more significant in predicting the overall goal (readmission, stranded, on a pathway, etc.).

The final step is to use the outputs from the code above to generate a weighted model and create a variable score. Thanks to this system, we obtain a precise, easily understandable list of variables that is useful to the final user and can point to problems and, more importantly, solutions.
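The counting idea can be illustrated with a short base R sketch. It only shows the frequency count on made-up ImportantVariables strings and does not reproduce the positional scoring used in the code above. The variable names and values are assumptions.

```r
# Toy ImportantVariables values for four positive predictions (illustrative)
important <- c("Age_66_plus,Lives_alone",
               "Age_66_plus,Prior_admit",
               "Age_66_plus",
               "Lives_alone")

# Split on commas, trim whitespace, and pool every mention into one vector
mentions <- trimws(unlist(strsplit(important, ",", fixed = TRUE)))

# Count mentions per variable, most frequent first
counts <- sort(table(mentions), decreasing = TRUE)
ranking <- data.frame(Variable = names(counts),
                      Count = as.integer(counts),
                      stringsAsFactors = FALSE)
print(ranking)
```

In this toy data Prior_admit appears only once, so it ranks last regardless of how strongly it might drive the model overall, which is the trade-off described above.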

If you are interested in our stranded, readmissions and other predictive modules – please enquire with gary@draperanddash.com.

I hope you have found this useful.

Alfonso Portabales – Data Scientist