Platelet transfusions don’t close the PDA, but they may increase IVH

I would never have actually thought to ask whether platelet transfusion might close the PDA, although early thrombocytopenia is associated with persistent PDA, and platelet plugs seem to be part of the mechanism of closure. A group in India have just published an RCT in preterm infants with a PDA (hemodynamically significant, whatever that means) who had a platelet count under 100,000 (Kumar J, et al. Platelet Transfusion for PDA Closure in Preterm Infants: A Randomized Controlled Trial. Pediatrics. 2019). Gestational age averaged 30 weeks, and the babies were enrolled at a mean of 3 days of age. Median time to PDA closure was identical, at 72 hours, in the group randomized to receive transfusion (10, 15 or 20 mL/kg depending on the count) and in the control group, based on repeated echo every 24 hours until closure. All babies also received ibuprofen or acetaminophen. 44 babies were enrolled; of the 22 in the transfusion group there were 9 new IVH (4 severe, grade 3 or 4) after enrolment, compared to 2 new IVH (both severe) among the controls.

In the much older study by Maureen Andrew and colleagues (Andrew M, et al. A randomized, controlled trial of platelet transfusions in thrombocytopenic premature infants. The Journal of pediatrics. 1993;123(2):285-91), preterm infants with a platelet count less than 150,000 were randomized to be transfused or not. 12/78 transfused babies developed a serious grade 3 or 4 IVH, compared to 9/79 controls. The roughly 35% relative increase in IVH was “not statistically significant” they said, but as you all know that doesn’t mean that it isn’t real!
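As a rough check on the counts quoted for the two small trials above, here is a minimal sketch that turns them into relative risks with 95% intervals. The log-scale (Katz) interval is my assumption for illustration; it is not necessarily the method either paper used, and `relative_risk` is just a helper name for this sketch.

```python
import math

def relative_risk(events_a, n_a, events_b, n_b, z=1.96):
    """Relative risk of group A vs group B with a log-scale (Katz) interval."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Counts as quoted in the text: (transfused events, n, control events, n).
trials = {
    "Kumar 2019, any new IVH": (9, 22, 2, 22),
    "Andrew 1993, grade 3 or 4 IVH": (12, 78, 9, 79),
}

for name, counts in trials.items():
    rr, lower, upper = relative_risk(*counts)
    print(f"{name}: RR {rr:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
```

With so few events the intervals are very wide, which is the point: the data are compatible with anything from a modest reduction to a large increase in IVH.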

In the recent PLANET2 trial there were more serious bleeding episodes in the transfused babies than in the controls, and apparently most of them were IVH. I don’t have access to those numbers, but whatever they are, the effect appears to be in the same direction.

I would like to see a meta-analysis, which would have some limitations given the 3 different thresholds in those 3 trials (which are as far as I know the only RCTs of platelet transfusion at different thresholds), but if the PLANET2 data are indeed consistent, and with a much greater power than the 2 other small trials, that would be very powerful data. It would confirm that not only are platelet transfusions in general ineffective in preventing bleeding at these 3 threshold levels, but they likely increase the risk of IVH.

Why would that be the case? It may be that transfusing adult platelets to babies with newborn plasma, which is already hypercoagulable, causes the effect, either by capillary damage, or by causing infarctions which then become hemorrhagic, or by some other mechanism. It could just be the effect of volume expansion, which can certainly cause lesions in newborn beagle puppies (see Laura Ment’s studies from the 80’s and 90’s), and many observational studies have correlated volume expansion with IVH. Platelets are often given somewhat faster than red cell transfusions, often over 1 hour (the rate does not appear to have been specified in PLANET2; the dose was 15 mL/kg, but the duration isn’t mentioned in the protocol). Volume expansion is also probably more effective than with saline, much of which rapidly leaks out of the circulation. I think either some impact on overall coagulation/anticoagulation balance or hemodynamic changes, or both, may be responsible for the apparent increase in IVH.

Posted in Neonatal Research | Tagged , , , | 2 Comments

Sail Away, Sail Away…

You could probably guess that a post about the SAIL trial (Kirpalani H, et al. Effect of Sustained Inflations vs Intermittent Positive Pressure Ventilation on Bronchopulmonary Dysplasia or Death Among Extremely Preterm Infants: The SAIL Randomized Clinical Trial. JAMA. 2019;321(12):1165-75.) would have to be accompanied by this, as it was when I reported on the presentation at last year’s PAS:

This is the multicenter randomized controlled trial of sustained inflations at the onset of resuscitation for very preterm infants less than 27 and at least 23 weeks gestation. Enrolled babies received face mask CPAP for up to 30 seconds, and if they needed PPV (i.e. apneic or gasping or heart rate <100) then they were randomized to sustained inflation or standard NRP. Sustained inflation babies started with a 15 second inflation at 20 cmH2O, they were then evaluated on CPAP and, if apneic or gasping or heart rate < 100, they switched to standard NRP, if those things didn’t apply they received a second sustained inflation to 25 cmH2O for 15 seconds. All of which was rather arbitrary, in terms of indications, pressures, and durations, but there wasn’t any reliable data to make more evidence based choices (and still isn’t).

The primary outcome of the study was the infamous “death or BPD”, which I have criticised here frequently enough, I think, but just to be really annoying: being dead and needing oxygen at 36 weeks PMA are not equivalent, and a composite outcome which combines them runs the real risk that they could change in opposite directions, and show no effect, or that mortality changes will be overwhelmed by the much more frequent occurrence of BPD. Mortality as one outcome, and BPD among survivors as another outcome, makes much more sense. Even better would be a measure of lung injury which reflects respiratory outcomes of importance to families.

As many of you will know by now, the study was stopped by the DSMC after enrolment of 460 patients because of an excess of early deaths (under 48 hours of age) in the sustained inflation group, many of which were considered to be possibly associated with the intervention. As well as stopping the trial the DSMC mandated a Bayesian analysis, which revealed that it was highly unlikely that sustained inflation would be shown to be preferable if the study had continued, and that either a null result, or an advantage of standard care were far more likely results.

This is an important trial with an important message: if you want to do sustained inflation, don’t do it like this. If you want to do sustained inflation using a substantially different approach, you had better do a high quality study with careful surveillance for adverse effects, and don’t do it outside of an RCT.

Failing that, I think that sustained inflation as routine initiation of resuscitation of the preterm infant should be laid to rest.

The authors have done what other trials have also done recently, which is to report BPD at 36 weeks, or death at 36 weeks, as the components of the primary outcome. I still don’t understand this, as it means that death after 36 weeks without BPD is considered a good outcome! Why not survival to discharge as part of the composite? The authors collected survival to discharge (it is secondary outcome number 22), but I cannot see the result in the article or appendix.

My recent discussions about significance and how to refer to results are well illustrated by the following sentence from the discussion.

An unexpected excess mortality rate with sustained inflation in the first 48 hours of life led to early trial closure, although mortality at 36 weeks’ postmenstrual age was not different.

Well, pardon me, but as far as I am concerned 20.9% IS different to 15.6%, they are clearly different numbers! That the difference between 2 numbers is not “statistically significant” does not make them the same. As you can see from the survival curves below, they are a bit closer together at 12 weeks than they are at 7 days, but they remain different. It would be accurate to say, ‘the p value for the difference in death at 36 weeks is 0.17 with a relative risk of 1.3’, and to note that ‘relative differences in mortality at 36 weeks which are compatible with the data range from a 10% decrease with sustained inflation to a 90% increase’; but not just to say they are “not different”.

Other secondary outcomes vary between those which are practically identical between groups, such as severe IVH (9.8% vs 10.4%), and those which are very different, e.g. pneumothorax (5.1% with SI vs 9% standard NRP). None of them were “statistically significant”.

Almost simultaneously, the following appeared in print: Tingay DG, et al. Gradual Aeration at Birth is More Lung Protective than a Sustained Inflation in Preterm Lambs. Am J Respir Crit Care Med. 2019. This is a very interesting study in preterm lambs examining a sustained inflation strategy, in which they used 35 cmH2O and maintained it until there was no more volume entering the lungs, and then for another 10 seconds. This was compared to ventilation with PEEP, and to a 3rd strategy of ventilation with PEEP plus progressive increases in PEEP until compliance was maximized, at which time PEEP was progressively decreased. The sustained inflation group had very uneven lung aeration, and increased signs of lung injury. This confirms, I think, that we could still have some benefit from finding novel ways of ensuring early adequate uniform lung inflation, but simple sustained inflation is not the answer, at least in the immature lung.


To p or not to p, what is the alternative?

I started writing the previous post several weeks ago, and, of course, the ideas are not original to me; in fact, a whole recent issue of “The American Statistician” is dedicated not just to trying to eliminate talk of statistical “significance”, but to providing alternatives.

One of the problems is illustrated by this figure from an editorial in “Nature” which discusses that journal issue (Amrhein V, et al. Scientists rise up against statistical significance. Nature. 2019;567(7748):305-7). The figure shows real-life data from 2 studies:

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation…. and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect.
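The P values in that quotation can be approximately recovered from the risk ratios and intervals alone, which makes it obvious that the two studies differ only in precision. This is a minimal sketch, assuming the intervals are symmetric on the log scale (the usual construction for ratio measures); `p_from_ratio_ci` is a helper name invented here.

```python
import math

def p_from_ratio_ci(estimate, lower, upper, z_crit=1.959964):
    """Two-sided p-value implied by a ratio estimate and its 95% interval."""
    # On the log scale the interval is estimate +/- z_crit * SE.
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se_log
    # Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Risk ratio 1.2 in both studies; only the interval width differs.
p_wide = p_from_ratio_ci(1.2, 0.97, 1.48)    # the "non-significant" study
p_narrow = p_from_ratio_ci(1.2, 1.09, 1.33)  # the earlier, more precise study

print(f"wide interval:   p = {p_wide:.3f}")
print(f"narrow interval: p = {p_narrow:.4f}")
```

The point estimate is identical in the two calls; the only input that changes is the width of the interval, and that alone moves the p-value across the 0.05 line.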

Similar things happen all the time in our field, where results with wide confidence intervals which cross a relative risk of 1 are reported as showing “no effect” or “no statistically significant effect”.

Here is a real neonatal example. The classic interpretation of the Davidson study would be that inhaled NO does not prevent ECMO in term babies with hypoxic respiratory failure, as the 95% confidence intervals for their RR of 0.64 include 1.0. The classic interpretation of the other two studies is that inhaled NO does prevent ECMO, but one, NINOS, had a relative risk that was actually less extreme than Davidson’s, at 0.71, although its confidence intervals don’t include 1. In reality all 3 studies show about the same effect, two being more precise than the third. In some (most) journals you would have to state the results in the classic way, and would not be allowed, when reporting the Davidson trial, to note the fact that ECMO was less frequent after iNO (although clearly it was), because it is not “statistically significant”.

I think we have to be ready to embrace uncertainty, to realize that dichotomizing our research into reports of things that work and things that don’t work, is unhelpful and may retard clinical advances.

The whole issue of “The American Statistician” is devoted to “moving to a world beyond p<0.05” and the opening editorial is well worth the read (Wasserstein RL, et al. Moving to a World Beyond “p < 0.05”. The American Statistician. 2019;73(sup1):1-19). One of the major themes is to stop using the term “statistically significant”, as the distinction between the statistical and the ordinary-world meaning of “significant” is now hopelessly lost.

no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern famously observed, the difference between “significant” and “not significant” is not itself statistically significant.

So what should we do? There are useful suggestions at the end of that editorial, and the authors of each paper were asked to come up with positive suggestions, rather than just a list of “don’t”s.

Overall the suggestions are given the mnemonic “ATOM”: Accept uncertainty, be Thoughtful, Open and Modest.

One specific suggestion is that we might continue to report P-values, but as exact continuous values, (p = 0.08, or 0.46) without any threshold implications by the use of < or > notation. I think that could be useful as a way to eliminate the tyranny of p<0.05. It could reduce the risk of “p-hacking”, which is the tweaking of analysis, or even of data, in the search for a p-value which is just under 0.05. They further suggest that such exact p-values should be accompanied by other ways to present the results, such as s-values, Second generation p-values (SGPV), or the false positive risk, all of which they explain, and all of which themselves carry difficulties or unknowns.
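Of those alternatives, the s-value is the simplest to compute: it is the base-2 logarithm of 1/p, which expresses the p-value as bits of information against the tested hypothesis (roughly, how many heads in a row from a fair coin would be equally surprising). A minimal sketch, with the example p-values chosen only for illustration:

```python
import math

def s_value(p):
    """Surprisal (s-value) in bits: s = -log2(p)."""
    return -math.log2(p)

# Illustrative p-values, including the two mentioned above.
for p in (0.46, 0.08, 0.05, 0.005):
    print(f"p = {p:<5} -> s = {s_value(p):.2f} bits")
```

On this scale p = 0.05 is a little over 4 bits, about as surprising as 4 heads in a row, which is a useful reminder of how weak that threshold is.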

Another suggestion is to refer to what are now called confidence intervals as “compatibility intervals”, the idea being that you would state that your result is most compatible with a range of effect sizes between Y and Z, rather than concluding that if the 95% confidence interval includes 1 the difference is not real, but, if it just excludes 1, then there is a real difference between the results. (That would be no better than relying on p<0.05).

The nexus of openness and modesty is to report everything while at the same time not concluding anything from a single study with unwarranted certainty. Because of the strong desire to inform and be informed, there is a relentless demand to state results with certainty. Again, accept uncertainty and embrace variation in associations and effects, because they are always there, like it or not. Understand that expressions of uncertainty are themselves uncertain. Accept that one study is rarely definitive, so encourage, sponsor, conduct, and publish replication studies. Then, use meta-analysis, evidence reviews, and Bayesian methods to synthesize evidence across studies.

I would recommend anyone involved in designing and analysing research to read the editorial and the article which immediately follows it (Ioannidis JPA. What Have We (Not) Learnt from Millions of Scientific Papers with P Values? The American Statistician. 2019;73(sup1):20-5) which is a review of many studies that John Ioannidis has published which show the insidious impacts of the term “statistical significance” and the focus on testing for p-values below a threshold.

One unexpected benefit of eliminating the words “significant” and “significantly” as well as their opposites would be a reduction in the number of words in a manuscript, which could be used for other things. In the recent publication from the Stop-BPD trial that I posted about recently, the words significant and significantly were used 19 times.

In contrast, I am currently revising an article for publication, and it is actually quite difficult! It is so ingrained to think of p<0.05 being significant that trying to come up with other ways of talking about the results of statistical tests can require some actual thought about the meaning of your results!

More seriously, the tyranny of p<0.05 and the use of the words “significant” and “non-significant” lead to a distortion of the English language. For example, a study with 100 patients per group might find that one group has a mortality of 10% and the other has a mortality of 20% (p=0.075). It would be dangerous and misleading to state “there was no difference in mortality” just because the p-value was too large (“p>0.05”, or “NS”).

This is also not a “trend”, a word which implies that things are moving in that direction. It is a real finding in the results, but like all real findings it can only give an estimate of what the actual difference would be if the 2 treatments were given to the entire population. That actual difference is unknowable, and we should be more careful about pretending we know what the actual difference is. Any result from a trial is only an estimate of the true impact of the intervention being tested, an estimate which gets closer to the true impact as the compatibility intervals become smaller, as long as there are no biases in the trial.

It is also, I think, wrong to suggest that the difference is “non-significant” only because of lack of numbers. That always presupposes that a larger trial would have found the same proportional difference (100/1000, vs 200/1000), and that it would then become significant (p<0.001, sorry about the < sign, but the software doesn’t give actual p-values when they are that small!). In reality a larger study might show a mortality difference anywhere within, or beyond, the compatibility intervals of the initial trial.

A better way of presenting those data would be the actual continuous p-value from Yates corrected chi-square, which is 0.075, the actual risk difference in deaths, 0.2 – 0.1, that is, 0.1, and the 95% compatibility intervals of that difference, which are -0.07 to +0.26. So the sentence in the results should read something like, “there was a 10% absolute difference in mortality between groups (10% vs 20%, p=0.075), a difference which is most compatible with a range of impacts on mortality between a 7% increase and a 26% decrease”. That is longer than saying “no difference in mortality”, but it has the advantage of being true, and of using some of the words you saved by eliminating “significant” from the paper. It also alerts readers and future researchers that there is a potential for substantial differences in a major clinically important outcome, which does not happen when the terms non-significant, NS, p>0.05, or no impact, are used.
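For anyone who wants to reproduce the p-value in this worked example, here is a minimal sketch of the Yates-corrected chi-square test for a 2×2 table, written with the standard library only. The function name is invented for this sketch, and the second call uses the hypothetical tenfold-larger trial discussed above.

```python
import math

def yates_chi_square_p(deaths_a, n_a, deaths_b, n_b):
    """Yates-corrected chi-square statistic and p-value for a 2x2 table."""
    a, b = deaths_a, n_a - deaths_a   # group A: died, survived
    c, d = deaths_b, n_b - deaths_b   # group B: died, survived
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2 (never below zero).
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# The example in the text: 10% vs 20% mortality, 100 patients per group.
chi2_small, p_small = yates_chi_square_p(10, 100, 20, 100)
# The same proportions in a hypothetical trial ten times larger.
chi2_large, p_large = yates_chi_square_p(100, 1000, 200, 1000)

risk_difference = 20 / 100 - 10 / 100
print(f"risk difference = {risk_difference:.2f}")
print(f"100 per group:  chi2 = {chi2_small:.2f}, p = {p_small:.3f}")
print(f"1000 per group: chi2 = {chi2_large:.1f}, p = {p_large:.1e}")
```

Writing the exact p-value out, rather than “NS” or “p<0.001”, costs one extra line of code.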

I am going to do my best to avoid thinking of statistical tests as yes or no, true/not true, effective/not effective, and to avoid the word “significant” in my publications. I wonder how long it will be until an editor tells me that doesn’t work, and I have to say it, or makes me say “no difference” because p>0.05.


To p or not to p, that is the question.

I can’t claim precedence for this title, although I wish I could. I copied it from an article published in an ENT journal (Buchinsky FJ, Chadha NK. To P or Not to P: Backing Bayesian Statistics. Otolaryngol Head Neck Surg. 2017;157(6):915-8).

I think the word “significant” should be banned. (Not in life; I am not a fascist; you can say whatever you want is significant, but in medical research there is so much confusion about the term that we would be better to never use it!)

I think authors who find a potentially positive result in a good quality study should be allowed to say things like “if there were no other unanticipated biases in our research design, the likelihood that our results are due solely to random variation is less than 1 in 20”, which is less sexy, but more accurate, compared to saying “our results were significant”. (If there are any real statisticians out there reading this, and I say anything which is not accurate, please let me know, I only have basic statistical training and would be happy to be corrected!)

It would certainly be much better than assuming that p<0.05 means that you definitely found an effect, or that p>0.05 means that there is nothing there!

In this blog I usually try to avoid the term “statistically significant” (or not), as the term is often used to imply “proven effect” as compared to “proof of no effect”. I hope we all know that the threshold, where p=0.051 means no effect, and p=0.049 means proven effect, is nonsense. Some journals have banned the reporting of p-values and even confidence intervals, as a result. I think this is extreme, I think we should be able to report confidence intervals, but that multiple confidence intervals, 90, 95, and 99% should perhaps be demanded. And also appropriate wording, similar to what I suggested above. The risk is that a 95% confidence interval which excludes unity will be considered to be proof that there is a real difference, which is no better than using a p-value threshold. The differing confidence intervals could be used to give an overall estimate of an effect, and its potential ranges.
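To show what reporting several confidence intervals side by side might look like, here is a minimal sketch. The counts (20/100 vs 10/100 events) are hypothetical, invented purely for illustration, and the log-scale interval for the relative risk is one common choice rather than the only one.

```python
import math
from statistics import NormalDist

def ratio_intervals(events_a, n_a, events_b, n_b, levels=(0.90, 0.95, 0.99)):
    """Relative risk with log-scale intervals at several confidence levels."""
    rr = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    intervals = {}
    for level in levels:
        # Two-sided critical value, e.g. about 1.96 for the 95% level.
        z = NormalDist().inv_cdf(0.5 + level / 2)
        intervals[level] = (
            math.exp(math.log(rr) - z * se_log),
            math.exp(math.log(rr) + z * se_log),
        )
    return rr, intervals

# Hypothetical trial: 20/100 events vs 10/100 events.
rr, intervals = ratio_intervals(20, 100, 10, 100)
print(f"RR = {rr:.2f}")
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} interval: {lower:.2f} to {upper:.2f}")
```

In this made-up example the 90% interval excludes 1 while the 95% and 99% intervals include it, which shows how arbitrary it is to let any single interval decide whether an effect is “real”.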

In this blog I probably sometimes get caught up in the usual patterns of referring to p-values, but usually I try to say something like “not likely to be due to chance alone”, which does not mean that a difference is necessarily due to a real effect of the intervention, but that the data would be unlikely if you picked the numbers at random out of a soup of numbers. All sorts of things might cause a p-value to be less than 0.05 when you compare outcomes between 2 groups with a different intervention, only a minority of which are due to a true impact of the intervention.

One recent paper that I liked was by Doug Altman and a group of co-workers (Greenland S, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-50). They list the many errors that people make when talking about statistical test results; reading the list makes me think of the many similar errors I have read, and probably made myself.

A study with an unknown bias might well provide a “significant” p-value when there is no real effect of the intervention, just as a study with a “non-significant” p-value might report a major advance in medicine.

The authors of that recent paper put it this way :

It is true that the smaller the P value, the more unusual the data would be if every single assumption were correct; but a very small P value does not tell us which assumption is incorrect. For example, the P value may be very small because the targeted hypothesis is false; but it may instead (or in addition) be very small because the study protocols were violated, or because it was selected for presentation based on its small size. Conversely, a large P value indicates only that the data are not unusual under the model, but does not imply that the model or any aspect of it (such as the targeted hypothesis) is correct; it may instead (or in addition) be large because (again) the study protocols were violated, or because it was selected for presentation based on its large size.

There have been recent publications suggesting that the critical P-value should be shifted to a much smaller number (such as p<0.005), particularly for epidemiological, rather than interventional studies. But I think that will just shift the problem, and will make it harder to find really useful beneficial effects, or to identify potentially harmful results.

Abandoning the term “statistically significant” should be enforced, and will force us to make more nuanced and reasonable evaluations of our data.


Partnering with parents

For a few years now Annie Janvier in our unit has been developing programs of partnership with families. Using contacts with mostly “veteran parents”, and occasionally veteran patients, we have developed partnerships in research, patient care, and education.

The “PAF” team (équipe Partenariat Famille) have now published a report of how such family partnerships can be developed, how their impacts can be evaluated, and how our partnerships have developed and expanded as a result of those evaluations (Dahan S, et al. Beyond a Seat at the Table: The Added Value of Family Stakeholders to Improve Care, Research, and Education in Neonatology. J Pediatr. 2019;207:123-9.e2). Last year we published a review of integration of parents in research endeavours (Janvier A, et al. Integrating Parents in Neonatal and Pediatric Research. Neonatology. 2019;115(4):283-91), and included in that review some of our endeavours and our research about family participation specifically in research. The group also published a review article about what has been published about family participation in the NICU (Bourque CJ, et al. Improving neonatal care with the help of veteran resource parents: An overview of current practices. Seminars in fetal & neonatal medicine. 2018;23(1):44-51).

The new article is an in-depth evaluation of the PAF team development, evaluation, and improvement, some of the mistakes made along the way, and some principles, many of which are probably generalizable, that can be used to help in the process.

The title, I think, is apposite. Although many of us have been discussing how to involve parents over the past few years, the involvement of parents has often been seen as a “nice extra”. In contrast, I think we should consider that everything that we do will benefit from the full integration of resource parents in our teams, and that having a token parent seat at the table is not enough.

For anyone who doesn’t have full text access to the Journal of Pediatrics, Annie gave me permission to include the following link in this blog post; the first 50 people accessing the link can download a free full text.

The PAF initiative costs very little, but there are some costs, mostly for parking, snacks, our wall of hope, and other minor costs. Our goal for fundraising this year is only $12,000 (Canadian), please consider making a small donation to our team. If you like this blog, please consider making a large donation!

Please follow the link to our fundraising page and click on “Donate Now”.


How should we evaluate heart rate during neonatal resuscitation?

Many babies receive some sort of “resuscitation” during their transition from intra-uterine to extra-uterine life.

How do we decide when a baby needs intervention? A baby who is active and breathing is usually left alone, a baby who is neither of those things might need intervention, and many of our decisions are based on the baby’s heart rate.

Bradycardia= needs ventilation. Mild bradycardia= optimize ventilation and reassess, good heart rate = observe and wait. I like things to be simple!

Recent studies have focused on heart rate determination as the best indication that adaptation is appropriate, but that begs the question: how to determine heart rate? Should we listen to their heart sounds, palpate their pulses, or watch their ECG? It seems that getting an accurate heart rate is faster with immediate ECG application (Katheria A, et al. A pilot randomized controlled trial of EKG for neonatal resuscitation. PLoS One. 2017;12(11):e0187730) and that this might lead to more rapid institution of corrective actions. But electrical activity of the heart does not mean that it is pumping well; in animal models pulseless electrical activity is frequent. Many immature animals, after resuscitation, have periods of electrical activity without mechanical activity. If that happens with babies, then we may have to readjust our algorithms; presence of an ECG signal does not mean that you necessarily have adequate cardiac function.

A group of us interested in the issues have been discussing this for a while, and decided to write a brief article, focusing on the results from Po-Yin Cheung and Georg Schmolzer’s lab in Edmonton (Patel S, et al. Pulseless electrical activity: a misdiagnosed entity during asphyxia in newborn infants? Archives of disease in childhood Fetal and neonatal edition. 2018). (That, I always like to point out, used to be my lab! (here’s one example) but Po-Yin and Georg are doing better work from that lab than I ever did.) The new article notes that PEA (which I always used to call electro-mechanical dissociation (EMD)) occurs frequently in animals that have been exposed to clinically relevant models of perinatal asphyxia.

Does this actually happen in human newborns? Yes (Luong D, et al. Cardiac arrest with pulseless electrical activity rhythm in newborn infants: a case series. Archives of disease in childhood Fetal and neonatal edition. 2019). Four cases are reported in this article, and I know personally of two others that I wasn’t able to get into the article (of which I am co-author), so this is not something that is vanishingly rare. How frequent is it? We really don’t know, but I think we should investigate that somehow.

What I think this means is that, when resuscitating depressed newborns, the ECG might be very helpful to get an accurate heart rate quickly, and if the heart rate is slow we should respond according to NRP algorithms.

At some point we should confirm that there is actually cardiac contraction, not just electrical activity. If the infant starts to move and breathe, that is probably enough evidence. BUT, if the ECG heart rate is present but the baby isn’t improving, we should immediately evaluate whether there is sufficient cardiac activity. 

In the cases we report there was ECG activity, but no detectable cardiac function. When that was recognized and interventions followed, all the babies were severely damaged, and they all died. I wonder, if the situation had been recognized faster, could there have been better outcomes? We could even ask if those babies would have been better treated without the ECG.

Maybe the introduction of the ECG as a routine measure of cardiac activity during neonatal resuscitation has been an error?

How should we determine that the heart is actually contracting effectively? I think if the pulse oximeter is giving a reliable signal, at the same rate as the ECG, that means there is at least some arterial pulsation in the right wrist/hand and probably perfusion is at least minimally effective: if the pulse oximeter is not (yet) functioning, then palpation of the pulses may be adequate, or perhaps clear heart sounds are enough evidence that the heart is actually moving…

I’m not sure what the best approach is, but recognizing that the ECG only identifies electrical activity, and that actual cardiac pumping is what the baby needs, is the first step.


Death or oxygen, which is worse?

We have a big problem in neonatal research. We have constructed composite outcomes that have become the “standard of design”, but are not of much use for anyone. Because we are, rightly, concerned that death and other diagnoses may be competing outcomes, we often use as the primary outcome measure “death or BPD” or “death or severe retinopathy” or “death or neurodevelopmental impairment”. We have done this because dead babies can’t develop BPD, or developmental delay.

The idea, of course, is that we want to see if an intervention will improve survival without lung injury, for example. There are two problems with this. The first is that we might find that the composite outcome is more frequent, but that neither part of the outcome is individually significantly affected. What then? The other problem is that we might well find that death is less frequent but that lung injury is more frequent. And what then? If the composite outcome is unchanged, then strictly speaking we can only say that the study found no effect on the outcome, and analyses of the parts of the composite outcome are considered secondary analyses.

This happens. The SUPPORT trial showed no effect of oxygen saturation targets on the primary outcome, but the low target babies had more mortality, while the high target babies had more retinopathy.

Study designs like this are effectively equating the parts of the primary outcome in importance for the analysis.

By studying the outcome of “death or BPD” we are effectively saying that an adverse outcome is being dead or being on low-flow oxygen at 36 weeks. I don’t think many readers of this blog would agree, if they themselves were critically ill, that surviving with a need for long-term domiciliary oxygen and being dead were equivalent.

This has again become painfully clear with the publication of the STOP-BPD trial. (Onland W, et al. Effect of Hydrocortisone Therapy Initiated 7 to 14 Days After Birth on Mortality or Bronchopulmonary Dysplasia Among Very Preterm Infants Receiving Mechanical Ventilation: A Randomized Clinical Trial. JAMA. 2019;321(4):354-63). This was a very high quality, important trial of hydrocortisone in ventilator dependent babies. Infants less than 1250 g birthweight and <30 wk gestation were randomized to placebo or to hydrocortisone 1.25 mg/kg/dose 4 times a day for a week, then 3 times a day for 5 days, then twice a day for 5 days then once a day for 5 days.

They had to be ventilator dependent at 7 to 14 days of age with a respiratory index (product of mean airway pressure and the fraction of inspired oxygen) equal to or greater than 3.5 for more than 12 h/d for at least 48 hours.

Which would mean for example a mean airway pressure of 8 and an FiO2 of 0.44.

During the initial months of the trial, participating centers noted that many infants receiving ventilation and considered at high risk of BPD had a respiratory index of less than 3.5 and were treated with corticosteroids outside the trial. Based on this feedback, the respiratory index threshold was reduced to 3.0 and finally to 2.5 (in May 2012 and December 2012, respectively) via approved protocol amendments.

By the end of the trial, then, an infant at 7 days of age, with a mean airway pressure of 8 on 32% oxygen or more would have been eligible.
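As a quick check of the arithmetic, here is a minimal Python sketch of the respiratory index as defined above (mean airway pressure multiplied by FiO2). The function names are mine, not the trial's, and the mean airway pressure of 8 is just the illustrative value used in this post.

```python
def respiratory_index(mean_airway_pressure, fio2):
    """Respiratory index as defined in the trial: MAP x FiO2."""
    return mean_airway_pressure * fio2


def min_fio2(threshold, mean_airway_pressure):
    """Lowest FiO2 that reaches a given respiratory index threshold."""
    return threshold / mean_airway_pressure


# The three thresholds used over the course of the trial,
# for a baby with a mean airway pressure of 8 (illustrative value).
for threshold in (3.5, 3.0, 2.5):
    fio2 = min_fio2(threshold, 8)
    print(f"threshold {threshold}: FiO2 of at least {fio2:.4f}")
```

With a mean airway pressure of 8, the minimum FiO2 falls from about 0.44 at the original threshold to about 0.31 at the final one, which is why a baby on 32% oxygen would have qualified by the end.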

The definition of BPD was oxygen requirement at 36 weeks (with an O2 reduction test if needing less than 30%). Death was also recorded to 36 weeks for the primary outcome. Which means that dying between 36 weeks and discharge would be considered a good outcome, if you didn’t have BPD.

The primary outcome occurred in 128/181 hydrocortisone babies (70.7%), and 140/190 controls (73.7%). In other words there was no impact of the hydrocortisone, which is what the abstract states. But at 36 weeks significantly, and substantially, more of the babies who received hydrocortisone were alive than controls, 84.5% vs 76.3%, which was “statistically significant”, p=0.048. Between 36 weeks and hospital discharge there were several deaths in each group, and the difference had narrowed slightly, with 80% of hydrocortisone babies and 71% of control babies being alive, p=0.06.

This happened despite a very high rate of open-label hydrocortisone use in the control babies. In fact 108 of the 190 control babies received hydrocortisone.

The protocol is available with the publication, and it notes the following:

In case of life threatening deterioration of the pulmonary condition, the attending physician may decide to start open label corticosteroids therapy in an attempt to improve the pulmonary condition. At that point in time the study medication is stopped and the patient will be recorded as “treatment failure”.

This could occur during the 21 days of study drug use. In addition, physicians could give steroids after the 21 days of the study drug:

Late rescue therapy outside study protocol (late rescue glucocorticoids): Patients still on mechanical ventilation after completion of the study medication, i.e. day 22, may be treated with open label corticosteroids.

I’m not quite sure about this, but I think that 86 of those 108 control babies who received hydrocortisone got it during the 21 days study drug window, and 22 others received steroids after the study drug period. In the hydrocortisone group I can see no indication of how many got open-label steroids during the study drug period, but there are 6 who got steroids after the end of that period.

The substantial differences in mortality occurred despite the very high proportion of babies randomized to control who received hydrocortisone, which will of course dilute the potential impact of the intervention.

There are modest differences in BPD between the groups, with the hydrocortisone babies having slightly more (100 cases vs 95), but if you express this result as “BPD among survivors”, the numbers are actually identical; just over 65% in each group.
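The way the composite hides the two components can be checked with a few lines of Python. This is only a sketch using the counts quoted in this post. The number of deaths by 36 weeks is not given directly above, so I derive it as the composite count minus the BPD count, which assumes that no baby is counted as both dead and having BPD.

```python
# Counts quoted in this post for the primary outcome at 36 weeks.
groups = {
    "hydrocortisone": {"n": 181, "death_or_bpd": 128, "bpd": 100},
    "control": {"n": 190, "death_or_bpd": 140, "bpd": 95},
}


def summarize(n, death_or_bpd, bpd):
    # Assumption: deaths = composite minus BPD (no overlap between the two).
    deaths = death_or_bpd - bpd
    survivors = n - deaths
    return {
        "deaths": deaths,
        "survivors": survivors,
        "composite_pct": 100 * death_or_bpd / n,
        "survival_pct": 100 * survivors / n,
        "bpd_among_survivors_pct": 100 * bpd / survivors,
    }


summary = {name: summarize(**counts) for name, counts in groups.items()}

for name, s in summary.items():
    print(
        f"{name}: death or BPD {s['composite_pct']:.1f}%, "
        f"alive at 36 weeks {s['survival_pct']:.1f}%, "
        f"BPD among survivors {s['bpd_among_survivors_pct']:.1f}%"
    )
```

The derived survival percentages match the 84.5% and 76.3% reported, which suggests the assumption about the deaths is reasonable.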

I think the best interpretation of this study would be as follows: eligible babies who received immediate hydrocortisone, compared to those who waited and only received hydrocortisone in the case of a “life-threatening” deterioration, were less likely to die, but, if they survived had the same likelihood of developing BPD.

I hope there is neurological and developmental follow up planned for this trial, although the power of the study to say very much, when so many control babies received hydrocortisone, will be quite limited.

This is now a huge problem. The published article states there is no effect of hydrocortisone, but that is not what I get from the data.

Here is the cute graphic that accompanies the paper

Effect of Hydrocortisone 7-14 Days After Birth in Very Preterm Infants Receiving Mechanical Ventilation

What can we do about this? Based on this study, the use of hydrocortisone in a similar dose, in infants with substantial oxygen requirements after 7 days of age, would be a reasonable choice. Waiting for life-threatening deterioration (it would be interesting to know what that meant to the attending physicians!) seems to increase the risk of dying. I think it is unlikely that any neurological or developmental impacts of hydrocortisone are severe enough to be worse than dying, and I just hope that any long term outcome study of these infants does not use the outcome “death or low Bayley scores”.

Analyzing the deaths differently using survival curves gives the following, with a p-value suggesting that this is unlikely to be due to chance alone. I know it’s a bit more than .05, but there is only 1 chance in 17 that completely random numbers would give a difference like this:

I think we have to stop using “death or BPD” as a composite dichotomous outcome for our studies.

There are alternatives, even when death and the other outcome of interest are competing.

One way is to analyze the same data differently. One method, for example, is to compare each baby’s outcome to the outcomes of all of the babies in the other group. A baby who dies receives zero points in comparison to each of the other group’s babies who died, and -1 point in comparison to each of the other group’s babies who survived. Each surviving baby with BPD is scored +1 point in comparison with each of the other group’s babies who died, zero points in comparison with those who survived with BPD, and -1 point in comparison with the surviving babies without BPD. Babies without BPD score +1 in comparison with babies who died or survived with BPD, and 0 in comparison with babies who survived without BPD. The ratio of winning to losing babies is then referred to as the “win ratio”.

This is a variant of the method used by the study I discussed in my last post, Beitler et al examining different ways of determining optimal PEEP. Finkelstein DM, Schoenfeld DA. Combining mortality and longitudinal measures in clinical trials. Statistics in Medicine. 1999;18(11):1341-54. In fact it is more generally applicable, and there have been multiple publications about the method (and other related methods) as well as many publications using the methods, mostly in cardiology, where composite outcomes including death or a revascularization procedure, as one example, are common, but recognized to have differing weights. Pocock SJ, et al. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal. 2012;33(2):176-82.

For example, suppose you ran a study with 20 babies per group, and the results showed that group A had 5 deaths and 10 survivors with BPD, while group B had 10 deaths and 5 survivors with BPD. Our usual analysis would say there was no impact on “death or BPD”. The analysis that I have just suggested, in contrast, gives each of the dead babies a score of -10 in group A, and -15 in group B. The BPD babies each score +5 in group A and 0 in group B, and the survivors without BPD score +15 in both groups. The win ratio for the trial is 3.0 for group A, as there are 15 babies who win overall in most of their pairwise comparisons, and 5 who lose. Calculating the p-value for this is complicated, but well described, as are methods for calculating the confidence interval of the win ratio.
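The scores in that example can be reproduced with a short Python sketch. This is only my reading of the scoring described in this post, with outcome labels and function names made up for the illustration. It does not attempt the p-value or confidence interval calculations, which are covered in the papers cited above.

```python
# Outcomes ranked from worst to best, so they can be compared directly.
DEAD, BPD, WELL = 0, 1, 2
LABELS = {DEAD: "died", BPD: "survived with BPD", WELL: "survived without BPD"}


def make_group(deaths, bpd, well):
    """Build a list with one ranked outcome per baby."""
    return [DEAD] * deaths + [BPD] * bpd + [WELL] * well


def score(outcome, other_group):
    """+1 for each better comparison, -1 for each worse one, 0 for ties."""
    return sum((outcome > other) - (outcome < other) for other in other_group)


def winners_and_losers(group, other_group):
    """Count babies whose total score is positive (win) or negative (lose)."""
    scores = [score(outcome, other_group) for outcome in group]
    return sum(s > 0 for s in scores), sum(s < 0 for s in scores)


# The hypothetical trial from the text: 20 babies per group.
group_a = make_group(deaths=5, bpd=10, well=5)
group_b = make_group(deaths=10, bpd=5, well=5)

for name, group, other in (("A", group_a, group_b), ("B", group_b, group_a)):
    for outcome in (DEAD, BPD, WELL):
        print(f"group {name}, {LABELS[outcome]}: score {score(outcome, other)}")

wins_a, losses_a = winners_and_losers(group_a, group_b)
print(f"group A: {wins_a} winners, {losses_a} losers, ratio {wins_a / losses_a}")
```

Ranking the outcomes as integers is what makes death count before BPD: a baby can only gain a point against a baby in the other group with a strictly worse outcome.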

Effectively, what this kind of analysis does is to rank the adverse outcomes, death being scored before BPD.

I would be fascinated to see what the results of STOP-BPD would look like if this kind of analysis were performed. The win ratio of the hydrocortisone group works out to 5.2 by my calculation, compared to 3.4 for the controls. It could be that such a difference is statistically significant, and such an analysis might enable future trials to be designed using this method.

You can also with this technique examine different severities of BPD, with BPD being scored as moderate vs severe. This kind of analysis can also include longitudinal quantitative measures, such as duration of home oxygen therapy, or number of admissions after discharge. Things which are, I would suggest, far more important to parents than whether the oxygen is stopped before or after 36 weeks.

Before there are any other trials counting death and BPD as equally important outcome measures, or death and retinopathy, or death and developmental delay, or “death, BPD, NEC, LOS, IVH, ROP” we should reconsider how we measure and analyze outcomes. We should be including outcomes that are important to families, rank them according to their relative importance to parents, and analyze them using methods which are now well validated which take into account their relative importance.
