I would never have actually thought to ask whether platelet transfusion might close the PDA, although early thrombocytopenia is associated with persistent PDA, and platelet plugs seem to be part of the mechanism of closure. A group in India have just published an RCT in preterm infants with a PDA (hemodynamically significant, whatever that means) who had a platelet count under 100,000: Kumar J, et al. Platelet Transfusion for PDA Closure in Preterm Infants: A Randomized Controlled Trial. Pediatrics. 2019. Gestational age averaged 30 weeks, and the babies were enrolled at a mean of 3 days of age. Median time to PDA closure was identical in the group randomized to receive transfusion (10, 15 or 20 mL/kg, depending on the count) and the control group, at 72 hours in each, based on echocardiograms repeated every 24 hours until closure. All babies also received ibuprofen or acetaminophen. 44 babies were enrolled; among the 22 in the transfusion group there were 9 new IVH (4 severe, grade 3 or 4) after enrolment, compared to 2 new IVH among the controls (both severe).
In the recent PLANET2 trial there were more serious bleeding episodes in the transfused babies than in the controls, and apparently most of them were IVH. I don’t have access to those numbers, but whatever they are, the effect appears to be in the same direction.
I would like to see a meta-analysis, which would have some limitations given the 3 different thresholds in those 3 trials (which are as far as I know the only RCTs of platelet transfusion at different thresholds), but if the PLANET2 data are indeed consistent, and with a much greater power than the 2 other small trials, that would be very powerful data. It would confirm that not only are platelet transfusions in general ineffective in preventing bleeding at these 3 threshold levels, but they likely increase the risk of IVH.
Why would that be the case? It may be that transfusing adult platelets into babies with newborn plasma, which is already hypercoagulable, causes the effect, either by capillary damage, or by causing infarctions which then become hemorrhagic, or by some other mechanism. It could just be the effect of volume expansion, which can certainly cause lesions in newborn beagle puppies (see Laura Ment’s studies from the 80s and 90s); many observational studies have also correlated volume expansion with IVH. Platelets are often given somewhat faster than red cell transfusions, often over 1 hour (the duration does not appear to have been specified in PLANET2; the dose was 15 mL/kg, but the infusion time isn’t mentioned in the protocol). Volume expansion with platelets is also probably more effective than with saline, much of which rapidly leaks out of the circulation. I think either some impact on the overall coagulation/anticoagulation balance, or hemodynamic changes, or both, may be responsible for the apparent increase in IVH.
This is the multicenter randomized controlled trial of sustained inflations at the onset of resuscitation for very preterm infants of at least 23 and less than 27 weeks gestation. Enrolled babies received face mask CPAP for up to 30 seconds, and if they needed PPV (i.e. apneic, or gasping, or heart rate <100) they were randomized to sustained inflation or standard NRP. Sustained inflation babies started with a 15 second inflation at 20 cmH2O; they were then evaluated on CPAP and, if apneic or gasping or heart rate <100, they switched to standard NRP; if none of those applied they received a second sustained inflation, at 25 cmH2O for 15 seconds. All of which was rather arbitrary, in terms of indications, pressures, and durations, but there weren’t any reliable data to make more evidence-based choices (and still aren’t).
The primary outcome of the study was the infamous “death or BPD”, which I have criticised here frequently enough, I think, but just to be really annoying: being dead and having oxygen at 36 weeks PMA are not equivalent, and a composite outcome which combines them carries the real risk that they could change in opposite directions and show no effect, or that mortality changes will be overwhelmed by the much more frequent occurrence of BPD. Mortality as one outcome, and BPD among survivors as another, makes much more sense. Even better would be a measure of lung injury which reflects respiratory outcomes of importance to families.
As many of you will know by now, the study was stopped by the DSMC after enrolment of 460 patients because of an excess of early deaths (under 48 hours of age) in the sustained inflation group, many of which were considered to be possibly associated with the intervention. As well as stopping the trial the DSMC mandated a Bayesian analysis, which revealed that it was highly unlikely that sustained inflation would be shown to be preferable if the study had continued, and that either a null result, or an advantage of standard care were far more likely results.
This is an important trial with an important message: if you want to do sustained inflation, don’t do it like this. If you want to do sustained inflation using a substantially different approach, you had better do a high quality study with careful surveillance for adverse effects, and don’t do it outside of an RCT.
Failing that, I think that sustained inflation as routine initiation of resuscitation of the preterm infant should be laid to rest.
The authors have done what other trials have also done recently, which is to report BPD at 36 weeks, or death before 36 weeks, as the components of the primary outcome. I still don’t understand this: it means that death after 36 weeks without BPD is considered a good outcome! Why not survival to discharge as part of the composite? The authors collected survival to discharge (it is secondary outcome number 22), but I cannot see the result in the article or appendix.
My recent discussions about significance and how to refer to results are well illustrated by the following sentence from the discussion.
An unexpected excess mortality rate with sustained inflation in the first 48 hours of life led to early trial closure, although mortality at 36 weeks’ postmenstrual age was not different.
Well, pardon me, but as far as I am concerned 20.9% IS different from 15.6%; they are clearly different numbers! The fact that the difference between 2 numbers is not “statistically significant” does not make them the same. As you can see from the survival curves below, they are a bit closer together at 12 weeks than they are at 7 days, but they remain different. It would be accurate to say “the p-value for the difference in death at 36 weeks is 0.17, with a relative risk of 1.3”, and to note that “relative differences in mortality at 36 weeks which are compatible with the data range from a 10% decrease with sustained inflation to a 90% increase”; but not just to say they are “not different”.
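Those compatibility limits can be recovered from just the reported relative risk and p-value. This is my own back-of-envelope check, not a calculation from the paper, and it assumes a simple normal approximation on the log-risk-ratio scale:

```python
# Recover the approximate 95% interval for the relative risk of death at
# 36 weeks (RR 1.3, two-sided p = 0.17), assuming normality of log(RR).
from math import exp, log
from statistics import NormalDist

rr, p = 1.3, 0.17
z = NormalDist().inv_cdf(1 - p / 2)   # z-score implied by the two-sided p-value
se = log(rr) / z                      # standard error of log(RR)
lo = exp(log(rr) - 1.96 * se)
hi = exp(log(rr) + 1.96 * se)
print(f"95% interval for the RR: {lo:.2f} to {hi:.2f}")
# → 95% interval for the RR: 0.89 to 1.89
```

Which is, give or take rounding, the “10% decrease to 90% increase” range quoted above.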
Other secondary outcomes vary between those which are practically identical between groups, such as severe IVH (9.8% vs 10.4%), and those which are very different, e.g. pneumothorax (5.1% with SI vs 9% standard NRP). None of them were “statistically significant”.
Almost simultaneously, the following appeared in print: Tingay DG, et al. Gradual Aeration at Birth is More Lung Protective than a Sustained Inflation in Preterm Lambs. Am J Respir Crit Care Med. 2019. This is a very interesting study in preterm lambs examining a sustained inflation strategy in which they used 35 cmH2O and maintained it until no more volume was entering the lungs, and then for another 10 seconds. This was compared to ventilation with PEEP, and to a third strategy of ventilation with PEEP plus progressive increases in PEEP until compliance was maximized, at which time PEEP was progressively decreased. The sustained inflation group had very uneven lung aeration and increased signs of lung injury. This confirms, I think, that we could still gain some benefit from finding novel ways of ensuring early, adequate, uniform lung inflation, but simple sustained inflation is not the answer, at least in the immature lung.
I started writing the previous post several weeks ago, and, of course, the ideas are not original with me; in fact, a whole recent issue of “The American Statistician” is dedicated not just to trying to eliminate talk of statistical “significance”, but to providing alternatives.
For example, consider a series of analyses of unintended effects of anti-inflammatory drugs. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation…. and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect.
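As an aside, the “our calculation” p-values in that quoted passage can be reproduced from nothing more than the published risk ratios and 95% confidence limits, again assuming a normal approximation on the log scale (a sketch of the arithmetic, not the quoted authors’ actual code):

```python
# Back-calculate a two-sided p-value from a risk ratio and its 95% CI,
# assuming the estimate is normally distributed on the log scale.
from math import log
from statistics import NormalDist

def p_from_rr_ci(rr, lo, hi):
    se = (log(hi) - log(lo)) / (2 * 1.96)   # CI width gives the standard error
    z = log(rr) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_from_rr_ci(1.2, 0.97, 1.48), 3))   # newer study  → 0.091
print(round(p_from_rr_ci(1.2, 1.09, 1.33), 4))   # earlier study → 0.0003
```

Identical risk ratios, very different p-values: the only difference is precision.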
Similar things happen all the time in our field, where results with wide confidence intervals which cross a relative risk of 1 are reported as showing “no effect” or “no statistically significant effect”.
Here is a real neonatal example: the classic interpretation of the Davidson study would be that inhaled NO does not prevent ECMO in term babies with hypoxic respiratory failure, as the 95% confidence intervals for their RR of 0.64 include 1.0. The classic interpretation of the other two studies is that inhaled NO does prevent ECMO, even though one of them, NINOS, had a relative risk that was actually less extreme than Davidson’s, at 0.71, but with confidence intervals that don’t include 1. In reality all 3 studies show about the same effect, two being more precise than the third. In some (most) journals you would have to state the results in that way, and would not be allowed, when reporting the Davidson trial, to note the fact that ECMO was less frequent after iNO (although clearly it was), because the difference is not “statistically significant”.
I think we have to be ready to embrace uncertainty, to realize that dichotomizing our research into reports of things that work and things that don’t work, is unhelpful and may retard clinical advances.
no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern famously observed, the difference between “significant” and “not significant” is not itself statistically significant.
So what should we do? There are useful suggestions at the end of that editorial, and the authors of each paper were asked to come up with positive suggestions, rather than just a list of “don’t”s.
Overall the suggestions are given the mnemonic “ATOM” Accept uncertainty, be Thoughtful, Open and Modest.
One specific suggestion is that we might continue to report p-values, but as exact continuous values (p = 0.08, or 0.46), without any threshold implications through the use of < or > notation. I think that could be useful as a way to eliminate the tyranny of p<0.05. It could reduce the risk of “p-hacking”, which is the tweaking of analysis, or even of data, in the search for a p-value which is just under 0.05. They further suggest that such exact p-values should be accompanied by other ways of presenting the results, such as s-values, second-generation p-values (SGPV), or the false positive risk, all of which they explain, and all of which carry their own difficulties or unknowns.
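The s-value, at least, is simple: it is just −log2(p), the number of consecutive heads in fair coin tosses that would be as surprising as the observed p-value. A minimal sketch:

```python
# Shannon "surprisal" (s-value): -log2(p), in bits.
# A p of 0.05 is only as surprising as about 4.3 heads in a row.
from math import log2

for p in (0.05, 0.08, 0.46, 0.005):
    print(f"p = {p}: s = {-log2(p):.1f} bits")
```

Framed that way, p = 0.05 (about 4.3 bits) does not look like a bright line between truth and falsehood.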
Another suggestion is to refer to what are now called confidence intervals as “compatibility intervals”, the idea being that you would state that your result is most compatible with a range of effect sizes between Y and Z, rather than concluding that if the 95% confidence interval includes 1 the difference is not real, but, if it just excludes 1, then there is a real difference between the results. (That would be no better than relying on p<0.05).
The nexus of openness and modesty is to report everything while at the same time not concluding anything from a single study with unwarranted certainty. Because of the strong desire to inform and be informed, there is a relentless demand to state results with certainty. Again, accept uncertainty and embrace variation in associations and effects, because they are always there, like it or not. Understand that expressions of uncertainty are themselves uncertain. Accept that one study is rarely definitive, so encourage, sponsor, conduct, and publish replication studies. Then, use meta-analysis, evidence reviews, and Bayesian methods to synthesize evidence across studies.
One unexpected benefit of eliminating the words “significant” and “significantly” as well as their opposites would be a reduction in the number of words in a manuscript, which could be used for other things. In the recent publication from the Stop-BPD trial that I posted about recently, the words significant and significantly were used 19 times.
In contrast, I am currently revising an article for publication, and it is actually quite difficult! It is so ingrained to think of p<0.05 being significant that trying to come up with other ways of talking about the results of statistical tests can require some actual thought about the meaning of your results!
More seriously, the tyranny of p<0.05 and the use of the words “significant” and “non-significant” lead to a distortion of the English language. For example, a study with 100 patients per group might find that one group has a mortality of 10% and the other a mortality of 20% (p=0.075). It would be dangerous and misleading to state “there was no difference in mortality” just because the p-value was too large (“p>0.05”, or “NS”).
This is also not a “trend”, a word which implies that things are moving in that direction; it is a real finding in the results, but like all real findings it can only give an estimate of what the actual difference would be if the 2 treatments were given to the entire population. That actual difference is unknowable, and we should be more careful about pretending we know what it is. Any result from a trial is only an estimate of the true impact of the intervention being tested, an estimate which gets closer to the probable true impact as the compatibility intervals become smaller, as long as there are no biases in the trial.
It is also, I think, wrong to suggest that the difference is “non-significant” only because of a lack of numbers. That presupposes that a larger trial would have found the same proportional difference (100/1000 vs 200/1000), and that it would then become significant (p<0.001, sorry about the < sign, but the software doesn’t give actual p-values when they are that small!). In reality a larger study might show a mortality difference anywhere within, or beyond, the compatibility intervals of the initial trial.
A better way of presenting those data would be the actual continuous p-value from the Yates-corrected chi-square, which is 0.075, the actual risk difference in deaths, 0.2 − 0.1, that is, 0.1, and the 95% compatibility intervals of that difference, which are −0.07 to +0.26. So the sentence in the results should read something like, “there was a 10% absolute difference in mortality between groups (10% vs 20%, p=0.075), a difference which is most compatible with a range of impacts on mortality between a 7% increase and a 26% decrease”. That is longer than saying “no difference in mortality”, but it has the advantage of being true, and of using some of the words you saved by eliminating “significant” from the paper. It also alerts readers and future researchers that there is a potential for substantial differences in a major clinically important outcome, which does not happen when the terms non-significant, NS, p>0.05, or no impact, are used.
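For anyone who wants to check that Yates-corrected chi-square, the arithmetic for the 10-vs-20-deaths-per-100 example can be done by hand (a sketch; any statistical package will give the same p-value):

```python
# Yates-corrected chi-square for a 2x2 table: 20/100 vs 10/100 deaths.
from math import erf, sqrt

a, b = 20, 80   # deaths, survivors in one group of 100
c, d = 10, 90   # deaths, survivors in the other group of 100
n = a + b + c + d
chi2 = n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p = 1 - erf(sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
risk_diff = a / (a + b) - c / (c + d)
print(f"p = {p:.3f}, risk difference = {risk_diff:.2f}")
# → p = 0.075, risk difference = 0.10
```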
I am going to do my best to avoid thinking of statistical tests as yes or no, true/not true, effective/not effective, and to avoid the word “significant” in my publications. I wonder how long it will be until an editor tells me that doesn’t work and I have to use it, or makes me say “no difference” because p>0.05.
I think the word “significant” should be banned. (Not in life; I am not a fascist; you can say whatever you want is significant, but in medical research there is so much confusion about the term that we would be better to never use it!)
I think authors who find a potentially positive result in a good quality study should be allowed to say things like “if there were no other unanticipated biases in our research design, the likelihood that our results are due solely to random variation is less than 1 in 20”, which is less sexy, but more accurate, compared to saying “our results were significant”. (If there are any real statisticians out there reading this, and I say anything which is not accurate, please let me know, I only have basic statistical training and would be happy to be corrected!)
It would certainly be much better than assuming that p<0.05 means that you definitely found an effect, or that p>0.05 means that there is nothing there!
In this blog I usually try to avoid the term “statistically significant” (or not), as the term is often used to imply “proven effect” as opposed to “proof of no effect”. I hope we all know that the threshold, where p=0.051 means no effect and p=0.049 means proven effect, is nonsense. Some journals have, as a result, banned the reporting of p-values and even confidence intervals. I think this is extreme: we should be able to report confidence intervals, but perhaps multiple confidence intervals (90%, 95%, and 99%) should be demanded, along with appropriate wording, similar to what I suggested above. The risk is that a 95% confidence interval which excludes unity will be considered proof that there is a real difference, which is no better than using a p-value threshold. The differing confidence intervals could be used to give an overall estimate of an effect, and its potential range.
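As an illustration of that suggestion, with made-up numbers (an assumed relative risk of 1.3 with a standard error of 0.19 on the log scale, not taken from any of the trials above), reporting several interval levels for the same estimate might look like this:

```python
# Print 90%, 95% and 99% compatibility intervals for one estimated RR,
# assuming normality on the log scale (illustrative numbers only).
from math import exp, log
from statistics import NormalDist

rr, se = 1.3, 0.19   # assumed point estimate and SE of log(RR)
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf((1 + level) / 2)
    lo, hi = exp(log(rr) - z * se), exp(log(rr) + z * se)
    print(f"{level:.0%} interval: {lo:.2f} to {hi:.2f}")
```

The three intervals together convey the shape of the uncertainty far better than a single yes/no cutoff.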
In this blog I probably sometimes get caught up in the usual patterns of referring to p-values, but usually I try to say something like “not likely to be due to chance alone”, which does not mean that a difference is necessarily due to a real effect of the intervention, but that the data would be unlikely if you picked the numbers at random out of a soup of numbers. All sorts of things might cause a p-value to be less than 0.05 when you compare outcomes between 2 groups with a different intervention, only a minority of which are due to a true impact of the intervention.
A study with an unknown bias might well provide a “significant” p-value when there is no real effect of the intervention, just as a study with a “non-significant” p-value might report a major advance in medicine.
The authors of that recent paper put it this way:
It is true that the smaller the P value, the more unusual the data would be if every single assumption were correct; but a very small P value does not tell us which assumption is incorrect. For example, the P value may be very small because the targeted hypothesis is false; but it may instead (or in addition) be very small because the study protocols were violated, or because it was selected for presentation based on its small size. Conversely, a large P value indicates only that the data are not unusual under the model, but does not imply that the model or any aspect of it (such as the targeted hypothesis) is correct; it may instead (or in addition) be large because (again) the study protocols were violated, or because it was selected for presentation based on its large size.
There have been recent publications suggesting that the critical p-value should be shifted to a much smaller number (such as p<0.005), particularly for epidemiological, rather than interventional, studies. But I think that will just shift the problem, and will make it harder to find really useful beneficial effects, or to detect potentially harmful ones.
Abandoning the term “statistically significant” should be enforced; it will force us to make more nuanced and reasonable evaluations of our data.
For a few years now Annie Janvier in our unit has been developing programs of partnership with families. Using contacts with mostly “veteran parents”, and occasionally veteran patients, we have developed partnerships in research, patient care, and education.
The new article is an in-depth evaluation of the PAF team development, evaluation, and improvement, some of the mistakes made along the way, and some principles, many of which are probably generalizable, that can be used to help in the process.
The title, I think, is apposite: although many of us have been discussing how to involve parents over the past few years, the involvement of parents has often been seen as a “nice extra”. In contrast, I think we should consider that everything we do will benefit from the full integration of resource parents in our teams, and that having a token parent seat at the table is not enough.
For anyone who doesn’t have full text access to the Journal of Pediatrics, Annie gave me permission to include the following link in this blog post: https://authors.elsevier.com/a/1YnEL55CrsVAw. The first 50 people accessing the link can download a free full text.
The PAF initiative costs very little, but there are some costs, mostly for parking, snacks, our wall of hope, and other minor costs. Our goal for fundraising this year is only $12,000 (Canadian), please consider making a small donation to our team. If you like this blog, please consider making a large donation!
Many babies receive some sort of “resuscitation” during their transition from intra-uterine to extra-uterine life.
How do we decide when a baby needs intervention? A baby who is active and breathing is usually left alone, a baby who is neither of those things might need intervention, and many of our decisions are based on the baby’s heart rate.
Bradycardia = needs ventilation. Mild bradycardia = optimize ventilation and reassess. Good heart rate = observe and wait. I like things to be simple!
Recent studies have focused on heart rate determination as the best indication that adaptation is appropriate, but that raises the question: how should we determine heart rate? Should we listen to the heart sounds, palpate the pulses, or watch the ECG? It seems that getting an accurate heart rate is faster with immediate ECG application (Katheria A, et al. A pilot randomized controlled trial of EKG for neonatal resuscitation. PLoS One. 2017;12(11):e0187730), and that this might lead to more rapid institution of corrective actions. But electrical activity of the heart does not mean that it is pumping well; in animal models pulseless electrical activity is frequent, and many immature animals, after resuscitation, have periods of electrical activity without mechanical activity. If that happens in babies, then we may have to readjust our algorithms; the presence of an ECG signal does not necessarily mean that you have adequate cardiac function.
What I think this means is that, when resuscitating depressed newborns, the ECG might be very helpful to get an accurate heart rate quickly, and if the heart rate is slow we should respond according to NRP algorithms.
At some point we should confirm that there is actually cardiac contraction, not just electrical activity. If the infant starts to move and breathe, that is probably enough evidence. BUT, if the ECG heart rate is present but the baby isn’t improving, we should immediately evaluate whether there is sufficient cardiac activity.
In the cases we report there was ECG activity, but no actual cardiac function detectable, when that was recognized and interventions followed, all the babies were severely damaged, and they all died. I wonder if the situation had been recognized faster, could there have been better outcomes? We could even ask if those babies would have been better treated without the ECG?
Maybe the introduction of the ECG as a routine measure of cardiac activity during neonatal resuscitation has been an error?
How should we determine that the heart is actually contracting effectively? I think if the pulse oximeter is giving a reliable signal, at the same rate as the ECG, that means there is at least some arterial pulsation in the right wrist/hand and probably perfusion is at least minimally effective: if the pulse oximeter is not (yet) functioning, then palpation of the pulses may be adequate, or perhaps clear heart sounds are enough evidence that the heart is actually moving…
I’m not sure what the best approach is, but recognizing that the ECG only identifies electrical activity, and that actual cardiac pumping is what the baby needs, is the first step.
We have a big problem in neonatal research. We have constructed composite outcomes that have become the “standard of design”, but are not of much use to anyone. Because we are, rightly, concerned that death and other diagnoses may be competing outcomes, we often use as the primary outcome measure “death or BPD”, or “death or severe retinopathy”, or “death or neurodevelopmental impairment”. We have done this because dead babies can’t develop BPD, or developmental delay.
The idea, of course, is that we want to see if an intervention will improve survival without lung injury, for example. There are two problems with this. What if the composite outcome is more frequent, but neither part of it is individually significantly affected? And what if death is less frequent but lung injury is more frequent? If the composite outcome is unchanged, then strictly speaking we can only say that the study found no effect on the outcome, and any analysis of the parts of the composite is considered a secondary analysis.
This happens. The SUPPORT trial showed no effect of oxygen saturation targets on the primary outcome, but the low target babies had more mortality, while the high target babies had more retinopathy.
Study designs like this are effectively equating the parts of the primary outcome in importance for the analysis.
By studying the outcome of “death or BPD” we are effectively saying that an adverse outcome is being dead or being on low-flow oxygen at 36 weeks. I don’t think many readers of this blog would agree, if they themselves were critically ill, that surviving with a need for long-term domiciliary oxygen and being dead were equivalent.
They had to be ventilator dependent at 7 to 14 days of age with a respiratory index (product of mean airway pressure and the fraction of inspired oxygen) equal to or greater than 3.5 for more than 12 h/d for at least 48 hours.
Which would mean for example a mean airway pressure of 8 and an FiO2 of 0.44.
During the initial months of the trial, participating centers noted that many infants receiving ventilation and considered at high risk of BPD had a respiratory index of less than 3.5 and were treated with corticosteroids outside the trial. Based on this feedback, the respiratory index threshold was reduced to 3.0 and finally to 2.5 (in May 2012 and December 2012, respectively) via approved protocol amendments.
By the end of the trial, then, an infant at 7 days of age, with a mean airway pressure of 8 on 32% oxygen or more would have been eligible.
The definition of BPD was oxygen requirement at 36 weeks (with an O2 reduction test if needing less than 30%). Death was also recorded to 36 weeks for the primary outcome. Which means that dying between 36 weeks and discharge would be considered a good outcome, if you didn’t have BPD.
The primary outcome occurred in 128/181 hydrocortisone babies (70.7%) and 140/190 controls (73.7%). In other words there was no impact of the hydrocortisone, which is what the abstract states. But at 36 weeks there were significantly, and substantially, more babies alive in the hydrocortisone group than among controls, 84.5% vs 76.3%, which was “statistically significant” (p=0.048). Between 36 weeks and hospital discharge there were several deaths in each group, and the difference had narrowed slightly, with 80% of hydrocortisone babies and 71% of control babies being alive (p=0.06).
This happened despite a very high rate of open-label hydrocortisone use in the control babies. In fact 108 of the 190 control babies received hydrocortisone.
The protocol is available with the publication, and it notes the following:
In case of life threatening deterioration of the pulmonary condition, the attending physician may decide to start open label corticosteroids therapy in an attempt to improve the pulmonary condition. At that point in time the study medication is stopped and the patient will be recorded as “treatment failure”.
This could occur during the 21 days of study drug use. In addition, physicians could give steroids after the 21 days of the study drug:
Late rescue therapy outside study protocol (late rescue glucocorticoids): Patients still on mechanical ventilation after completion of the study medication, i.e. day 22, may be treated with open label corticosteroids.
I’m not quite sure about this, but I think that 86 of those 108 control babies who received hydrocortisone got it during the 21 days study drug window, and 22 others received steroids after the study drug period. In the hydrocortisone group I can see no indication of how many got open-label steroids during the study drug period, but there are 6 who got steroids after the end of that period.
The substantial differences in mortality are despite a very high rate of treatment of babies randomized to control who received hydrocortisone, which will of course dilute the potential impact of the intervention.
There are modest differences in BPD between the groups, with the hydrocortisone babies having slightly more (100 cases vs 95), but if you express this result as “BPD among survivors”, the numbers are actually identical; just over 65% in each group.
I think the best interpretation of this study would be as follows: eligible babies who received immediate hydrocortisone, compared to those who waited and only received hydrocortisone in the case of a “life-threatening” deterioration, were less likely to die, but, if they survived had the same likelihood of developing BPD.
I hope there is neurological and developmental follow up planned for this trial, although the power of the study to say very much, when so many control babies received hydrocortisone, will be quite limited.
This is now a huge problem, the published article states there is no effect of hydrocortisone, but that is not what I get from the data.
Here is the cute graphic that accompanies the paper
What can we do about this? Based on this study, the use of hydrocortisone, in a similar dose, in infants with substantial oxygen requirements after 7 days of age would be a reasonable choice. Waiting for life-threatening deterioration (it would be interesting to know what that meant to the attending physicians!) seems to increase the risk of dying. I think it is unlikely that any neurological or developmental impacts of hydrocortisone are severe enough to be worse than dying, and I just hope that any long-term outcome study of these infants does not use the outcome “death or low Bayley scores”.
Analyzing the deaths differently, using survival curves, gives the following, with a p-value suggesting that this is unlikely to be due to chance alone. I know it’s a bit more than .05, but there is only 1 chance in 17 that completely random numbers would give a difference like this:
I think we have to stop using “death or BPD” as a composite dichotomous outcome for our studies.
There are alternatives, even when death and the other outcome of interest are competing.
One way is to analyze the same data differently. One method, for example, is to compare each baby’s outcome with that of every baby in the other group. A baby who dies scores zero points against each baby in the other group who died, and -1 point against each baby in the other group who survived. Each surviving baby with BPD scores +1 against each baby in the other group who died, zero against each with BPD, and -1 against each survivor without BPD. Babies without BPD score +1 against babies who died or survived with BPD, and zero against babies who survived without BPD. The ratio of winning to losing babies is then referred to as the “win ratio”.
For example, suppose you ran a study with 20 babies per group, and the results showed that group A had 5 deaths and 10 survivors with BPD, while group B had 10 deaths and 5 with BPD. Our usual analysis would say there was no impact on “death or BPD”. The analysis I have just suggested, in contrast, gives each of the dead babies in group A a score of -10, and each of those in group B a score of -15. The BPD babies each score +5 in group A and 0 in group B, and the survivors without BPD score +15 in both groups. The win ratio for the trial is 3.0 for group A, as there are 15 babies who win overall in their pairwise comparisons, and 5 who lose. Calculating the p-value for this is complicated, but well described, and so are methods for calculating the confidence interval of the win ratio.
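The worked example above can be sketched in a few lines of Python. This is a toy illustration only; none of these numbers come from a real trial, and the group labels and variable names are invented:

```python
# Toy worked example: 20 babies per group, outcomes ranked
# worst-to-best as death < survived-with-BPD < survived-without-BPD.
DEATH, BPD, NO_BPD = 0, 1, 2

group_a = [DEATH] * 5 + [BPD] * 10 + [NO_BPD] * 5
group_b = [DEATH] * 10 + [BPD] * 5 + [NO_BPD] * 5

def pairwise_score(baby, other_group):
    """Net +1/0/-1 score summed over all comparisons with the other group."""
    return sum((baby > other) - (baby < other) for other in other_group)

scores_a = [pairwise_score(b, group_b) for b in group_a]

# A baby "wins" overall if its net score is positive, "loses" if negative.
wins = sum(s > 0 for s in scores_a)     # 15 babies win
losses = sum(s < 0 for s in scores_a)   # 5 babies lose
win_ratio = wins / losses               # 3.0, as in the text
```

Each dead baby in group A scores -10 (ties with B’s 10 deaths, loses to B’s 10 survivors), each BPD survivor scores +5, and each survivor without BPD scores +15, reproducing the arithmetic above.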
Effectively, what this kind of analysis does is to rank the adverse outcomes, death being scored before BPD.
I would be fascinated to see what the results of STOP-BPD would look like if this kind of analysis were performed; by my calculation the win ratio of the hydrocortisone group works out to 5.2, compared to 3.4 for the controls. Such a difference could be statistically significant, and such an analysis might enable future trials to be designed using this method.
With this technique you can also examine different severities of BPD, scoring moderate and severe BPD separately. This kind of analysis can also include longitudinal quantitative measures, such as duration of home oxygen therapy, or number of admissions after discharge; things which are, I would suggest, far more important to parents than whether the oxygen is stopped before or after 36 weeks.
Before there are any other trials counting death and BPD as equally important outcome measures, or death and retinopathy, or death and developmental delay, or “death, BPD, NEC, LOS, IVH, ROP”, we should reconsider how we measure and analyze outcomes. We should include outcomes that are important to families, rank them according to their relative importance to parents, and analyze them using now well-validated methods that take that relative importance into account.
Two recent trials in adult ICU patients ask very interesting questions, questions which are only linked by testing something clinically simple versus a more technologically demanding evaluation.
This might be most important in countries with limited access to laboratory results, but in all countries, if our clinical evaluation of perfusion is accurate, can be followed in almost real time, and responds more quickly to changes in actual perfusion, then it would be an advantage for all of us to include it in our protocols.
28 hospitals in 5 South American countries participated in a trial which enrolled 424 adults with septic shock. The protocols for following either serum lactate or capillary refill time were clearly documented, and mostly followed. Capillary refill was measured in a highly standardized way: a glass slide was used to compress a fingernail bed until it blanched, pressure was held for 10 seconds and then released, and the return of colour was timed with a stopwatch.
To spare you all the details of the protocols, they were very similar apart from the methods used to evaluate perfusion.
Mortality was 35% in the capillary refill group and 43% in the lactate group. This was not “statistically significant” but it looks to me like a big deal! I guess I should say now that the study was underpowered, but having read this really clear blog post about using that phrase (https://towardsdatascience.com/why-you-shouldnt-say-this-study-is-underpowered-627f002ddf35) I will say that the study was underpowered to detect really important differences in mortality, differences that anyone who has septic shock might be interested in.
In fact the 95% confidence intervals for the hazard ratio only just include 1.0 (0.55 to 1.02) and the p-value was 0.06, which might not conventionally be “significant”, but is at least highly suggestive. The patients in the capillary refill group received about 0.5 L less fluid during the resuscitation phase, and had less organ dysfunction. This again suggests that maybe we often give too much fluid to patients with septic shock, leading to organ dysfunction and death.
There were no advantages shown to the more invasive strategy, so the simple clinically directed adjustment of PEEP was equally effective.
One reason for discussing this study is that they don’t analyze “BPD or chronic lung disease” as an outcome! The primary outcome and the method of analysis are fascinating:
The prespecified primary end point was a ranked composite score that incorporated death and days free from mechanical ventilation through day 28, calculated in such a manner that death constitutes a worse outcome than fewer days off the ventilator. Time free from mechanical ventilation was calculated as the number of days between successful liberation from the ventilator and study day 28. Each patient was compared with every other patient in the study and assigned a score (tie: 0, win: +1, loss: −1) for each pairwise comparison based on whom fared better. If one patient survived and the other did not, scores of +1 and −1 were assigned, respectively, for that pairwise comparison. If both patients in the pairwise comparison survived, the assigned score depended on which patient had more days free from mechanical ventilation: the patient with more days off the ventilator received a score of +1, while the patient with fewer days received a score of −1. If both patients survived and had the same number of days off the ventilator, or if both patients died, they both were assigned a score of 0 for that pairwise comparison. For each patient, scores for all pairwise comparisons were summed, resulting in a cumulative score for each patient. These cumulative scores were ranked and compared between treatment groups via the Mann-Whitney technique.
I think there are some very important lessons to be learned here. You can incorporate potentially competing outcomes without giving them the same importance in the analysis. In this analysis death is clearly considered more important than getting extubated more quickly.
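The quoted scoring scheme is straightforward to implement. Here is a minimal sketch on invented toy data; all numbers are hypothetical, and the final between-group comparison would use the Mann-Whitney rank-sum test as the authors describe:

```python
# Sketch of the ranked composite described above, on invented toy data.
# Each patient is represented by ventilator-free days through day 28
# if they survived, or None if they died; death always ranks worse
# than any number of ventilator-free days.

def compare(p, q):
    """+1 if p fared better than q, -1 if worse, 0 if tied."""
    if p is None and q is None:
        return 0                      # both died: tie
    if p is None:
        return -1                     # p died, q survived
    if q is None:
        return 1                      # p survived, q died
    return (p > q) - (p < q)          # both survived: more days off vent wins

def cumulative_scores(patients):
    """Each patient's summed score over pairwise comparisons with all others."""
    return [sum(compare(p, q) for j, q in enumerate(patients) if j != i)
            for i, p in enumerate(patients)]

# Hypothetical groups: None = died; numbers = ventilator-free days.
treatment = [None, 10, 20, 25]
control = [None, None, 5, 15]
scores = cumulative_scores(treatment + control)
# These cumulative scores would then be ranked and the two groups
# compared with a Mann-Whitney (rank-sum) test.
```

In this toy example every death scores -5 regardless of group, and the survivor with the most ventilator-free days (25) scores +7, beating everyone else: the ranking does exactly what the quote describes, placing death below any survival.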
The above title is the title of a talk I just gave at the NEO2019 conference. I have made available a .ppt file of the final slides from the talk, under the tab at the top of the page “presentations”. The version in the App, which is made available to participants at the conference, is slightly different to this final version.
My review of the literature led to the following conclusions:
Among very preterm or very low birth weight infants:
Growth and bone mineralization approaching desired standards can only be achieved by fortifying BM
Commercial bovine or human multicomponent fortifiers have become the standard of care
Desired growth can be achieved with maternal BM and fortification, or donor BM and fortification, if enough attention is paid to growth
Donor BM has less protein (and slightly fewer calories) than preterm maternal BM, for a few weeks, and requires higher supplementation
And then in terms of the scientific evidence about what and when:
When maternal breast milk supply is insufficient to meet the baby’s needs: supplementation with formula increases NEC compared to donor BM (Older and recent studies, moderate to good quality data)
Multi-component fortification, with powdered bovine-protein based products, has not been shown to affect NEC compared to no fortification (almost all studies before 2000, poor to moderate quality data, wide confidence intervals)
Multi-component fortification from different sources (bovine compared to human) has not been shown to change the incidence of NEC, when used with a strategy of maternal or donor breast milk (moderate quality data, wide confidence intervals)
Individualized fortification using BM analysis not shown to improve clinically important outcomes compared to adjustment according to growth (poor quality data, small studies, wide confidence intervals)
Early introduction of fortifiers has not been shown to adversely impact clinical outcomes or complications compared with introduction at >100 mL/kg/d (poor quality data, one study, wide confidence intervals)
The data about human versus bovine based fortifiers for expressed breast milk feeding in the very preterm are based on only 3 small studies, all of which I have discussed previously on this blog. I created a graphic using the data from these trials, comparing the overall incidence of stage 2 NEC or greater with each feeding strategy. The strategies are: 1. Maternal BM with donor BM as a supplement when MBM is insufficient, both fortified with human-milk-based fortifier (MBM/DBM+hmf); 2. Maternal BM with donor BM, fortified with bovine-milk-based fortifier (MBM/DBM+bmf); 3. Maternal BM fortified with bovine-milk-based fortifier, with preterm formula as a supplement; 4. Preterm formula.
Cristofalo EA, et al. Randomized trial of exclusive human milk versus preterm formula diets in extremely premature infants. The Journal of Pediatrics. 2013;163(6):1592-5 e1. Sullivan S, et al. An Exclusively Human Milk-Based Diet Is Associated with a Lower Rate of Necrotizing Enterocolitis than a Diet of Human Milk and Bovine Milk-Based Products. The Journal of Pediatrics. 2010;156(4):562-7.e1. Trang S, et al. Cost-Effectiveness of Supplemental Donor Milk Versus Formula for Very Low Birth Weight Infants. Pediatrics. 2018;141(3). O’Connor DL, et al. Effect of Supplemental Donor Human Milk Compared With Preterm Formula on Neurodevelopment of Very Low-Birth-Weight Infants at 18 Months: A Randomized Clinical Trial. JAMA. 2016;316(18):1897-905.
Strategy 1 was an intervention group in each of the first 3 of those trials, strategy 2 in Trang and O’Connor, strategy 3 in Trang and Sullivan, and strategy 4 in Cristofalo.
I stress that this is not a formal SR and meta-analysis! The p-values are the typical p-values from the initial publications.
I think it is interesting to compare the tiny amount of information we have about different sources of fortifiers and milks for their impact on NEC, to the evidence which exists from RCTs of probiotics. The latest systematic review/meta-analysis included about 3,600 patients per group.
I find it distressing that fortifiers, which are given to fragile, high-risk, babies 8 (or 12) times a day for several weeks, do not have to provide the same kind of proof of safety and efficacy as a new drug, which may be given 2 or 3 times a day for a week. We end up in a situation where thousands of babies are being exposed to these agents with very poor quality evidence that they are equivalent to each other.
Given these limitations of the evidence, an evidence-based protocol for breast milk fortification would look like this:
For infants at risk of NEC:
Promote maternal breast milk as much as possible, early expression, lactation consultants, pumps freely available everywhere…
When MBM insufficient, always use donor BM, until risk of NEC passed (no good data on when to stop donor milk, 34 weeks post-menstrual age seems reasonable)
As feeds advance, fortify breast milk as soon as TPN cannot meet the nutritional requirements of the infant (which will usually happen around 50 mL/kg/d)
Start with standard fortification, up to an assumed calorie density of 24 kcal/oz for maternal BM, and start at a higher dose for donor BM (assumed calorie density of 26 kcal/oz) because donor BM has less protein.
Use powder or liquid fortifier; there is no proven advantage of one over the other.
Use bovine- or human-based fortifier; there is no proven advantage of one over the other if MBM is supplemented with donor BM.
Concentrate on growth, review frequently and increase fortification if growth < target for 2 wk, at ≥ 160 mL/kg/d, then re-assess frequently.
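Purely as an illustration, the protocol above could be written out as decision logic. The thresholds (50 mL/kg/d for starting fortification, 24 vs 26 kcal/oz for maternal vs donor milk, growth review at full feeds of ≥160 mL/kg/d) come from the bullets; the function and parameter names are invented for this sketch and are not from any guideline:

```python
# Illustrative sketch only: encodes the fortification protocol bullets above.
# Thresholds come from the text; names and return strings are invented.

def fortification_plan(enteral_volume_ml_kg_d, milk_source,
                       growth_below_target_2wk, at_full_feeds):
    """Return a suggested fortification step for one review.

    milk_source: 'maternal' or 'donor'
    at_full_feeds: enteral intake >= 160 mL/kg/d
    """
    if enteral_volume_ml_kg_d < 50:
        # TPN can still meet requirements below roughly this volume.
        return "no fortifier yet: TPN still meets requirements"
    # Donor milk has less protein, so it starts at a higher
    # assumed calorie density than maternal milk.
    target_kcal_oz = 26 if milk_source == "donor" else 24
    if growth_below_target_2wk and at_full_feeds:
        return (f"increase fortification above {target_kcal_oz} kcal/oz "
                "and re-assess frequently")
    return f"standard fortification to {target_kcal_oz} kcal/oz"
```

This is only a way of making the decision points explicit, not a clinical tool; any real protocol would need local review.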
So what about the idea of trials examining even higher oxygen saturations? Is there a chance we might further reduce mortality by aiming for saturations in the mid-90’s rather than the low 90’s?
Not so fast, even for adults that may be a really bad idea!
A fascinating systematic review, the IOTA study, was published last year in The Lancet. Chu DK, et al. Mortality and morbidity in acutely ill adults treated with liberal versus conservative oxygen therapy (IOTA): a systematic review and meta-analysis. Lancet. 2018;391(10131):1693-705. This study found 25 trials including 16,000 adults; baseline saturations were in the mid to upper 90’s in most studies, and interventions varied, but included one, liberal, group with more oxygen and a higher target saturation, and another, conservative, with more restricted oxygen use (either room air, or only when necessary, or a lower concentration as a routine compared to the liberal group). The review found a 20% increase in in-hospital mortality, RR 1.21 (95% CI 1.03–1.43), with more liberal oxygen use. I remember my days, many years ago now, as an SHO (junior resident) in general internal medicine, when anyone with chest pain, breathlessness, or just looking a bit peaky would get oxygen cannulae slapped on the moment they arrived in the emergency room (or A&E as we called it). We didn’t have pulse oximeters back then in the dark ages, and didn’t want to poke everyone’s arteries for a blood gas analysis, so routine low flow oxygen was considered safe and maybe helpful. Even today, this study informs us, 34% of adults in an ambulance are given oxygen, and in the UK 15% of adults admitted to hospital are on oxygen therapy, which they may not need to maintain acceptable saturations.
IOTA also performed a meta-regression suggesting that there is a dose effect, that the higher the saturations achieved with liberal therapy the greater the adverse impact on mortality.
Of course adults don’t have some of the hazards that we face in neonatology, they aren’t very likely to develop retinopathy of prematurity!
Also, animal models use hyperoxia in the neonatal period to create pulmonary hypertension, which can be lifelong. There are numerous examples in the literature; here are a few, some of which note the differences, and the increased sensitivity of the neonate, to the effects of hyperoxia.
(Berkelhamer SK, et al. Developmental differences in hyperoxia-induced oxidative stress and cellular responses in the murine lung. Free Radic Biol Med. 2013;61:51-60. Jimenez J, et al. Progressive Vascular Functional and Structural Damage in a Bronchopulmonary Dysplasia Model in Preterm Rabbits Exposed to Hyperoxia. Int J Mol Sci. 2016;17(10). Menon RT, et al. Long-term pulmonary and cardiovascular morbidities of neonatal hyperoxia exposure in mice. Int J Biochem Cell Biol. 2018;94:119-24. Nakanishi H, et al. Morphological characterization of pulmonary microvascular disease in bronchopulmonary dysplasia caused by hyperoxia in newborn mice. Med Mol Morphol. 2018;51(3):166-75. Kumar VH, et al. Neonatal hyperoxia increases airway reactivity and inflammation in adult mice. Pediatr Pulmonol. 2016;51(11):1131-41. Patel A, et al. Exposure to supplemental oxygen downregulates antioxidant enzymes and increases pulmonary arterial contractility in premature lambs. Neonatology. 2009;96(3):182-92.)
To summarize, hyperoxia is toxic to the lungs, especially in the neonatal period when antioxidant defences are reduced; in addition to oxygen free radical effects, administering excessive oxygen inhibits guanylyl cyclase activity and increases PDE5 expression, leading to increased vascular contractility and a reduced effect of nitric oxide, endogenous or exogenous. The vascular impact of these changes can lead to vascular remodelling and permanent pulmonary hypertension, as well as reduced alveolarization. Even fairly brief exposure to hyperoxia can cause some of these changes (Lakshminrusimha S, et al. Pulmonary hemodynamics in neonatal lambs resuscitated with 21%, 50%, and 100% oxygen. Pediatr Res. 2007;62(3):313-8).
Not all studies of automated oxygen control show an effect on hypoxia (Dani C. Automated control of inspired oxygen (FiO2) in preterm infants: Literature review. Pediatr Pulmonol. 2019;54(3):358-63), which may be partly due to apneas; if you aren’t breathing, then adjusting the FiO2 is unlikely to have much impact! As Carlo Dani suggests in this review, there is no current evidence that clinically important outcomes are improved, but I am going to say something which is probably unique among the posts on this blog: I don’t think we need such evidence. I can’t see any downside to reducing episodic, or sometimes prolonged, hyperoxia and hypoxia in preterm infants, or indeed in infants at term either. As long as the devices actually work, and do adjust FiO2 to achieve more stable saturations, reduce episodes of desaturation followed by re-oxygenation, and reduce episodes of excessively high saturations, that would be sufficient for me to buy them for every baby in the NICU. I can’t see how it could possibly be harmful, and it would probably reduce nurses’ stress and turnover and decrease parental stress also!
Manual adjustment of the FiO2 by the nursing staff (usually in response to alarms) has been shown in some of the studies to be substantially reduced by automated oxygen control (Hallenberger A, et al. Closed-loop automatic oxygen control (CLAC) in preterm infants: a randomized controlled trial. Pediatrics. 2014;133(2):e379-85). Parents often become stressed when their baby’s alarms ring; by far the most common alarm is from the pulse oximeter, and when the nurse doesn’t come immediately (most often because there is a minor desaturation or high-saturation alarm, and the nurse knows from experience of that baby that waiting a little while will correct the saturation) they can sometimes freak out. The extra reassurance that there is an active device regulating the oxygen will help a little to reduce the stress of having a baby in intensive care.