I took a break from writing last month to attend the American Society for
Reproductive Medicine Annual Meeting in San Diego, where my team and I received some very helpful and enthusiastic feedback about how Univfy can meet the needs of patients and providers with our prediction platform. Speaking of the ASRM Annual Meeting, my thoughts go out to colleagues and friends in the NJ/NY area. I hope that their lives will be back to normal soon.
I started this blog series because many physicians and researchers have asked for a “reader’s guide to the galaxy” to help them navigate our peer-reviewed research papers. Prediction modeling requires extensive interdisciplinary work. I’d like to remind readers that I’m not a statistician or prediction modeler myself. But having directed and worked for seven years now with a top-notch team to develop and apply prediction modeling to infertility, I have come to see myself as a guide to the meshing of these two disciplines. I thank our Chief Statistician, Dr. Bokyung Choi, for discussions and review of this blog post to ensure its accuracy. Please feel free to send us comments or questions at firstname.lastname@example.org.
Prediction modeling is downright exciting. I’ve already talked about our approach to building IVF prediction models, but for our statisticians, the more challenging task is to test whether a prediction model “works”. Anyone can build a prediction model these days, but how do we know if we can trust it?
There are many levels of validation. Many researchers perform a type of validation that is called internal validation, meaning that they test how well the prediction model works on a portion of the data that was used to develop or train that same model. This approach alone is not rigorous enough, because prediction models tend to do superbly when applied to the data that was used to build them. It’s like a self-fulfilling prophecy.
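To make the self-fulfilling prophecy concrete, here is a minimal sketch of my own (simulated data and an off-the-shelf scikit-learn classifier, not Univfy’s models or data): the same model looks far better when scored on its own training data than on data it has never seen.

```python
# Illustration only: a model evaluated on its own training data looks
# optimistically good compared with its performance on unseen data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                     # simulated clinical predictors
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p)                              # simulated live-birth outcomes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC on training data: {auc_train:.2f}")     # typically close to 1.0
print(f"AUC on held-out data: {auc_test:.2f}")      # noticeably lower
```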
Our research team performs external validation. This is one of those terms that mean completely different things to physicians, researchers, statisticians, and business people. First, I’ll talk about how most physicians and researchers interpret external validation, and then I will explain how statisticians think of this term.
Conventionally, researchers use external validation to mean that laboratory results (e.g., the predictive value of a molecular biomarker) that were established based on one clinical center’s experience can also be applied to another center’s patients. Thus, the word external in this context refers to a different patient population, or a healthcare facility that is geographically separate.
In our prediction modeling work, external validation means that we apply the model to an independent data set (called the test set) to test whether the predicted probabilities of outcomes are consistent with the true outcomes. The use of an independent data set is required to establish the accuracy and reproducibility of the IVF prediction model itself.
What exactly are we validating? In the validation work, we test how well the IVF prediction model performs on the independent test set in several measures: predictive power, discrimination, calibration, dynamic range, and reclassification. These quantitative measures allow us to compare different models using the same metrics. We cannot judge the performance or utility (usefulness) of a model unless we know how it performs in all these areas.
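To give a feel for two of these measures, here is another sketch of my own, again using simulated data rather than Univfy’s models: it checks discrimination with the ROC area and calibration by comparing mean predicted probabilities with observed live birth rates within risk groups, all on an independent test set.

```python
# Illustration only: scoring discrimination and calibration on an
# independent test set (hypothetical data and model).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate(n):
    X = rng.normal(size=(n, 5))                      # stand-in clinical factors
    p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.6 * X[:, 1])))
    return X, rng.binomial(1, p)

X_train, y_train = simulate(3000)                    # development data
X_test, y_test = simulate(3000)                      # independent test set

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict_proba(X_test)[:, 1]

# Discrimination: how well the model separates live births from non-live births.
print(f"Test-set AUC: {roc_auc_score(y_test, pred):.2f}")

# Calibration: within each predicted-probability quintile, does the
# observed live birth rate match the mean predicted probability?
df = pd.DataFrame({"pred": pred, "obs": y_test})
df["bin"] = pd.qcut(df["pred"], 5, labels=False)
print(df.groupby("bin")[["pred", "obs"]].mean())
```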
Predictive power measures how much better one prediction model represents the test data than another (e.g., the control model). Non-statisticians would ask, “Which model is better? Which model gives a more accurate prediction?” Statisticians would ask whether the data fit the model well, or whether the fit is good. To answer these questions, we measure predictive power with a number called the “log-likelihood,” which expresses how likely the observed data are under the model. We compute the posterior log-likelihood of the prediction model and compare it, as an odds ratio, with that of the age control model. (Univfy’s research team has coined this measure PLORA, for posterior log-likelihood of the odds ratio against the age model.)
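As a rough, hypothetical illustration of a likelihood-ratio comparison (not the exact procedure from our papers), the sketch below fits a full model and an age-only control model on simulated data, computes each model’s log-likelihood on a held-out test set, and takes the difference, which is the log of the likelihood ratio between the two.

```python
# Illustration only: comparing the test-set log-likelihood of a full
# prediction model against an age-only control model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)

n = 4000
age = rng.uniform(25, 43, size=n)
other = rng.normal(size=(n, 4))                       # other clinical predictors
logit = 2.5 - 0.08 * age + 0.5 * other[:, 0] - 0.4 * other[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))         # simulated live births

X_full = np.column_stack([age, other])
train, test = slice(0, 2000), slice(2000, n)

full = LogisticRegression().fit(X_full[train], y[train])
age_only = LogisticRegression().fit(age[train, None], y[train])

# log_loss is the negative mean log-likelihood per patient; multiplying by
# -n gives the total log-likelihood of the test data under each model.
n_test = test.stop - test.start
ll_full = -log_loss(y[test], full.predict_proba(X_full[test])[:, 1]) * n_test
ll_age = -log_loss(y[test], age_only.predict_proba(age[test, None])[:, 1]) * n_test

# The difference of log-likelihoods is the log of the likelihood ratio:
# how much more likely the test data are under the full model.
print(f"log-likelihood, full model: {ll_full:.1f}")
print(f"log-likelihood, age model:  {ll_age:.1f}")
print(f"log likelihood ratio (full vs age): {ll_full - ll_age:.1f}")
```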
Let’s say we’re building a prediction model to predict the probability of having a live birth with the first IVF cycle, given a set of clinical factors. The actual technical measures (log-likelihoods) of the prediction model and the age control model are negative numbers that may seem meaningless without a reference. Therefore, we establish a reference: the average live birth rate, i.e., the probability of having a live birth predicted without using any predictors, not even age. With this reference and a formula that we constructed, we can determine the improvement of a prediction model over the age control model and express it as a percentage. This percentage improvement helps us determine whether one prediction model is “better,” is “more accurate,” or “has higher predictive power” compared with another.
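Since the formula itself isn’t spelled out in this post, here is one purely hypothetical way such a percentage could be constructed, just to illustrate the idea: scale the model’s gain in log-likelihood over the age model by the age model’s own gain over the no-predictor reference.

```python
# Illustration only: one hypothetical way to express a model's gain over
# the age control model as a percentage, using a no-predictor reference.
# (The exact formula used by Univfy is not given in this post.)

def percent_improvement(ll_model, ll_age, ll_reference):
    """Gain of the model over the age model, expressed as a percentage of
    the age model's own gain over the no-predictor reference."""
    return 100.0 * (ll_model - ll_age) / (ll_age - ll_reference)

# Hypothetical test-set log-likelihoods (all negative, as noted above).
ll_reference = -690.0   # no predictors: every patient gets the average rate
ll_age = -660.0         # age-only control model
ll_model = -610.0       # full prediction model

print(f"Improvement over the age model: "
      f"{percent_improvement(ll_model, ll_age, ll_reference):.0f}%")
# -> roughly 167% in this made-up example
```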