In publications of randomized controlled trials, subgroup analyses are frequently performed. The idea behind such analyses is to determine whether one group or another has a different result from the overall result, for example, whether boys or girls benefit more from an intervention. Sometimes this is done to try to salvage some possibly positive results when the overall result is negative, sometimes to try to refine the indications for an intervention based on the results.
The first thing to realize is that it would be bizarre if every subgroup had exactly the same result from an intervention, just based on random variation. Simply because, to use my own example, girls had more improvement in a particular outcome than boys does not mean that the difference is due to some biologic difference between them; it may just be chance, and the next trial might show more impact in boys than in girls.
Interpretation of subgroup analyses always has to be taken with a grain (or even a handful) of salt.
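To see how easily chance alone produces "significant" subgroups, here is a small simulation sketch (all numbers are invented for illustration, not taken from any real trial): an intervention that truly does nothing, a trial split into ten subgroups, each tested at p < 0.05. Roughly 40% of such null trials will show at least one "significant" subgroup effect.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate_trial(n_subgroups=10, n_per_arm=50, rng=random):
    """One null trial: the intervention truly does nothing in any subgroup.
    Returns True if at least one subgroup test reaches p < 0.05."""
    for _ in range(n_subgroups):
        treat = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        ctrl = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        diff = sum(treat) / n_per_arm - sum(ctrl) / n_per_arm
        z = diff / math.sqrt(2 / n_per_arm)  # known sd = 1 in the simulation
        if two_sided_p(z) < 0.05:
            return True
    return False

random.seed(1)
n_sims = 2000
false_alarms = sum(simulate_trial() for _ in range(n_sims))
print(f"Null trials with >=1 'significant' subgroup: {false_alarms / n_sims:.2f}")
# theory: 1 - 0.95**10, about 0.40
```

The arithmetic behind the simulation is simple: with ten independent tests at the 5% level, the chance of at least one false positive is 1 − 0.95¹⁰ ≈ 40%.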
When you examine the results of your trial and then decide to do a subgroup analysis based on a suspicion that the girls did better, you are entering dangerous territory. Such post hoc subgroup analyses should be avoided like the plague; it is far too easy to be led astray. If by chance blond babies did much better with the intervention and brunettes only did slightly better, and you notice in your data set that this is the case, and then do a statistical analysis showing that the results are significant in blonds and not in brunettes, what should you do? The best idea is not to do such analyses at all. Stick with subgroup analyses that were decided before the study was started, based on a reasonable supposition that one group or another might have a different response. Deciding a priori on a small number of subgroups that might plausibly respond differently (and not listing a priori every subgroup that you can think of) is the first step. Then the statistical analysis requires an evaluation of the interaction between the intervention and the subgroup: it is not enough to show a significant result in one group and not in another; you need a statistical test showing that the responses are actually different from each other, and that such a difference is unlikely to be due to chance.
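The point about interaction tests can be made with a toy calculation (the effect sizes and standard errors below are made up for illustration): a treatment effect can reach p < 0.05 in girls but not in boys while the interaction test shows no real evidence that the two effects differ.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical summary statistics: treatment effect and its standard
# error, estimated separately in each subgroup.
effect_girls, se_girls = 0.30, 0.12
effect_boys, se_boys = 0.15, 0.12

p_girls = two_sided_p(effect_girls / se_girls)  # "significant"
p_boys = two_sided_p(effect_boys / se_boys)     # "not significant"

# The interaction test asks the right question: do the two subgroup
# effects differ from EACH OTHER by more than chance would explain?
z_int = (effect_girls - effect_boys) / math.sqrt(se_girls**2 + se_boys**2)
p_int = two_sided_p(z_int)

print(f"girls p={p_girls:.3f}, boys p={p_boys:.3f}, interaction p={p_int:.3f}")
```

Here the girls' effect is "significant" (p ≈ 0.01) and the boys' is not (p ≈ 0.21), yet the interaction p-value is around 0.38: no evidence at all of a genuine subgroup difference.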
Even when you do all that, the only way to be sure that the difference is real is to do a prospective trial, which might include only the group that had the apparent benefit, if the overall study was a null trial. Post hoc subgroup analyses are not usually strong enough evidence even to justify that, which is why a clear statement of whether a subgroup analysis was decided before or after commencing the trial is important, and why publication of protocols, including a description of planned subgroup analyses, is important.
Sometimes things change during a trial. I remember a trial of an established medication in which the company changed the preparation part way through, which changed bioavailability dramatically and mandated a subgroup analysis that was not planned before starting. Of course, in such a circumstance the publication should describe exactly what was done and why, and why the subgroup analysis became important. Something similar happened in the oxygen targeting trials: when Masimo recalibrated the oximeters in use in several of the trials, the change in the saturations actually achieved required a subgroup analysis.
A publication from 2012 investigated claims of significant subgroup effects in RCTs, and showed that only 50% reported a significant test of interaction (and only 2/3 of those actually reported the test or gave the data). Sun X, et al. Credibility of claims of subgroup effects in randomised controlled trials: systematic review. BMJ. 2012;344:e1553.
That study included a list of criteria for deciding whether a claim of a subgroup effect might be reliable:
Ten criteria used to assess credibility of subgroup effect
Was the subgroup variable a baseline characteristic?
Was the subgroup variable a stratification factor at randomisation?*
Was the subgroup hypothesis specified a priori?
Was the subgroup analysis one of a small number of subgroup hypotheses tested (≤5)?
Was the test of interaction significant (interaction P<0.05)?
Was the significant interaction effect independent, if there were multiple significant interactions?
Was the direction of subgroup effect correctly prespecified?
Was the subgroup effect consistent with evidence from previous related studies?
Was the subgroup effect consistent across related outcomes?
Was there any indirect evidence to support the apparent subgroup effect—for example, biological rationale, laboratory tests, animal studies?
A new publication in JAMA Internal Medicine (Wallach JD, et al. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Internal Medicine. 2017) specifically looked at subgroup analyses in published RCTs. The investigators examined whether such analyses were performed, whether appropriate statistical tests of interaction were performed, how common significant differences were, and whether any follow-up studies had been done. They found 64 RCTs with 117 analyses making claims of important subgroup differences, and:
Of these 117 claims, only 46 (39.3%) in 33 articles had evidence of statistically significant heterogeneity from a test for interaction. In addition, out of these 46 subgroup findings, only 16 (34.8%) ensured balance between randomization groups within the subgroups (eg, through stratified randomization), 13 (28.3%) entailed a prespecified subgroup analysis, and 1 (2.2%) was adjusted for multiple testing. Only 5 (10.9%) of the 46 subgroup findings had at least 1 subsequent pure corroboration attempt by a meta-analysis or an RCT. In all 5 cases, the corroboration attempts found no evidence of a statistically significant subgroup effect.
Most claims of a subgroup difference, then, are not supported even by the evidence in the actual publications where the claims are made (note to anyone involved in peer review: make sure that statistical tests of interaction are reported before accepting that subgroup differences might be real). In the few cases where later randomized trials were performed to determine whether there really were subgroup differences, they were all negative.
In neonatology, one study which satisfies most of the above criteria comes from the CAP trial: Davis PG, et al. Caffeine for Apnea of Prematurity Trial: Benefits May Vary in Subgroups. The Journal of Pediatrics. 2010;156(3):382-7.e3. That secondary analysis showed that age at starting treatment (a baseline characteristic, but not a prespecified subgroup or a factor for stratification) had a significant impact on the age of extubation and the age of stopping oxygen. Starting treatment before 3 days had a greater impact than after 3 days, and the interaction was significant, at least for postmenstrual age at last extubation and postmenstrual age of finally stopping CPAP. That publication also showed that caffeine had a greater impact on neurodevelopmental outcome among the infants who were receiving positive pressure ventilatory support at randomization. Both of these findings are biologically plausible, and both are accompanied by subgroup differences for other outcomes which, even if the interactions were not statistically significant, were in the same direction, such as a reduction in bronchopulmonary dysplasia.
Observational studies also need to be carefully interpreted. Methods for adjusting for baseline risk differences in cohort studies, such as multivariate regression, propensity analysis and instrumental variable analysis, might help to balance groups for prognostic variables, but there will always remain the potential for unknown prognostic variables to bias the results. A fantastic new addition to the “Users’ guides to the medical literature” series in JAMA has just been published: Agoritsas T, et al. Adjusted analyses in studies addressing therapy and harm: Users’ guides to the medical literature. JAMA. 2017;317(7):748-59. It is a great read for anyone who uses the medical literature and sometimes reads observational studies, which I think is most of us. The authors describe the various methods of adjustment (in non-statistician language, thankfully), including “instrumental variable analysis”, which was new to me as a term, but the concept is simple: when variations in the application of a treatment occur which are not related to prognosis, you can use that variation as a substitute for randomization. In other words, if a treatment is applied differently in one hospital compared to another (such as inhaled NO in the very preterm) but the hospitals treat the same kind of patients, with the same risk characteristics, then you can use that fact to mimic cluster randomized allocation. The problem is that even the statisticians can’t agree exactly how to do that, and there is still a possibility of imbalance in other prognostic factors.
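A rough sketch of the instrumental variable idea, using simulated data with made-up coefficients (not any real analysis): if hospital of birth strongly influences whether the treatment is given but is unrelated to the babies' prognosis, the hospital-driven variation in treatment recovers the true effect, even when an unmeasured prognostic factor badly biases the naive comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0

# U: an unmeasured prognostic factor that confounds treatment and outcome.
U = rng.normal(size=n)
# Z: the instrument -- which hospital the baby is born in. It is
# unrelated to U, but strongly influences treatment use.
Z = rng.integers(0, 2, size=n).astype(float)
# Treatment "dose" depends on both the hospital and the confounder.
T = 1.5 * Z + 0.8 * U + rng.normal(size=n)
# Outcome depends on treatment (true effect) and on the confounder.
Y = true_effect * T + 1.0 * U + rng.normal(size=n)

# Naive regression of Y on T is biased by the unmeasured confounder U.
naive = np.cov(T, Y)[0, 1] / np.var(T)

# Wald / two-stage estimator: use only the Z-driven variation in T.
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]

print(f"naive estimate: {naive:.2f}, IV estimate: {iv:.2f} (truth {true_effect})")
```

The naive estimate is inflated (about 1.36 under these made-up coefficients), while the instrumental variable estimate lands near the true effect of 1.0 — but notice that the whole construction rests on the assumption that the hospital affects the outcome only through the treatment, which is exactly the assumption that cannot be verified from the data.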
The authors of the article end with a list of major publications that reported observational studies showing a positive or negative effect of a medication, which was later disproved by prospective randomized trials:
Comparative effectiveness research relying on observational studies using conventional or novel adjustment procedures risks providing the misleading effect estimates seen with hormone replacement for cardiovascular risk, β-blockers for mortality in noncardiac surgery, antioxidant supplements for healthy people, and statins for cancer. If RCTs cannot be conducted, it will remain impossible to determine whether adjusted estimates are accurate or misleading
The abstract ends with this sentence “Although all these approaches can reduce the risk of bias in observational studies, none replace the balance of both known and unknown prognostic factors offered by randomization.”