I started this blog series because many physicians and researchers have asked for a reader’s guide to help them navigate our peer-reviewed research papers. Prediction modeling requires extensive interdisciplinary work. I’d like to remind readers that I’m not a statistician or prediction modeler myself. But having directed and worked for seven years now with a top-notch team to develop and apply prediction modeling to infertility, I have come to see myself as a guide to the meshing of these two disciplines. I thank our Chief Statistician, Dr. Bokyung Choi, for discussions and review of this blog post to ensure its accuracy. Please feel free to send us comments or questions at firstname.lastname@example.org.
“How accurate is your IVF prediction model?” “How many times more accurate is it compared to an age-based prediction?” These questions and their variations are among the most common that I get. In most people’s minds, accuracy and quality of the prediction (also called predictive power) are one and the same, but in fact they are two different measures of how well a prediction model performs. I discussed how we assess predictive power in my previous blog post, and will now tackle “accuracy”.
Let’s say an IVF prediction test predicts that the probability of success is 40% for a particular patient. Take 100 such patients, all of whom are given a personalized success rate of 40%. Since no one will have 40% of a baby, each patient either has a baby or doesn’t. If this IVF prediction test has 100% accuracy (and 0% error), then 40 of these 100 patients will have a baby.
How is this different from predictive power? If the conventional age group method also predicted that each of these 100 women had a 40% chance of success, then the IVF prediction test would have 100% accuracy but a PLORA (posterior log of odds ratio compared to age; see previous blog post) of 0, meaning zero added value in terms of predictive power when compared to age.
In reality, even our large, multicenter data sets would not contain 100 women with a 40% success rate, another 100 women with a 39% success rate, another 100 women with a 38% success rate, and so on. If we were to personalize the accuracy as well as the actual predicted probability, the test data set (see previous blog post for what a “test data set” means) would need to have about 100 patients for each predicted probability percentage point. That means if you want to provide predictions to patients with success rates ranging from 10% to 60%, you would need 5,000 test cases (i.e. 50 percentage points x 100 IVF cycles) that are evenly spread out to represent 10% to 60% success rates. And that’s just the test set, which does not even include the training data set. (Even if we were to use a data set comprising 10,000 cases, it’s unlikely that there would be an equal distribution of patients for each predicted probability percentage point!)
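To make the arithmetic above concrete, here is a minimal Python sketch. The function name and its parameters are hypothetical, chosen only for illustration; the figures (10% to 60%, about 100 patients per percentage point) come from the paragraph above.

```python
def required_test_cases(low_pct, high_pct, patients_per_point=100):
    """Test cases needed if every percentage point gets its own group.

    Hypothetical helper for illustration only. It counts the span
    between the two percentages the same way the text does
    (10% to 60% is treated as 50 percentage points).
    """
    points = high_pct - low_pct
    return points * patients_per_point


print(required_test_cases(10, 60))  # 5000
```

This counts only the test set; the training data set would come on top of it.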
Therefore, even in personalized medicine, measuring the accuracy of a personalized prediction requires the grouping of patients. Grouping allows us to measure accuracy with a data set much smaller than 10,000 cases. Let’s say we have 1,000 patients in the IVF prediction test data set. We divide these patients into groups, each representing roughly a fifth (20%) of the whole set. For example, the top group may comprise the 200 patients who have the highest range of predicted probabilities; the next 200 patients have the next highest range of predicted probabilities, and so on. (The actual groups may have slightly more or fewer than 200 patients.) The prediction error of each group is the difference between the average predicted probability for that group (i.e. the expected probability) and the actual success rate seen in that group (i.e. the observed probability).
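The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in our papers: the function name is hypothetical, the patients are split into five equal-sized groups by rank, and the data are synthetic (random numbers generated so that the predictions are accurate by construction).

```python
import random


def grouped_prediction_error(predicted, outcomes, n_groups=5):
    """Return (size, expected, observed, error) for each group.

    Hypothetical helper: patients are sorted by predicted probability,
    highest first, and cut into n_groups groups of near-equal size.
    Assumes there are at least n_groups patients.
    """
    pairs = sorted(zip(predicted, outcomes), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    results = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        expected = sum(p for p, _ in group) / len(group)  # average predicted probability
        observed = sum(y for _, y in group) / len(group)  # actual success rate in the group
        results.append((len(group), expected, observed, expected - observed))
    return results


# Synthetic test set of 1,000 patients (made-up data, not real patients).
rng = random.Random(0)
predicted = [rng.uniform(0.10, 0.60) for _ in range(1000)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]

for size, expected, observed, error in grouped_prediction_error(predicted, outcomes):
    print(f"n={size}  expected={expected:.3f}  observed={observed:.3f}  error={error:+.3f}")
```

Each printed row corresponds to one fifth of the test set, from the highest predicted probabilities to the lowest. The last number in each row is that group's prediction error.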
I hope you see now that accuracy and predictive power measure different qualities of an IVF prediction model, but both qualities must be excellent for the model to be reliable and trustworthy. For example, the Univfy PreIVF (i.e. predicting success in a patient’s first IVF cycle) and Univfy PredictIVF (i.e. predicting success in a patient’s next IVF cycle) improve predictive power by 36% and 75%, respectively, over age-based predictions. Their group-based prediction error is at most 1.5% for most subgroups.
But even high measures of accuracy and predictive power in themselves do not necessarily make an IVF prediction model useful; tune in next time for more on that.