Well and clearly written! This is a difficult concept to get across, as you and Kerr have discovered. The coin flipping example was good, but I wonder if there's another example that could be given that focuses more on 'torturing the data'.
Possible XKCD example? https://xkcd.com/882/
I'm looking forward to part 2, which I'm guessing will use the word Testosterone (or T) at least 20 times.
Another possible example:
I have 100 people flip a coin. I then cross reference head/tails with the flipper's age, sex, height, hair color, first initial, etc.
Statistically, I will eventually find a pattern in the noise (say, most people with a graduate-level degree flipped heads) and if I publish a paper saying "People with graduate degrees more likely to flip heads!" I would be HARKing.
If, instead, I started with the hypothesis that people with graduate degrees would be more likely to flip heads, and the noise fell just right that it looked true, I'd only be falling victim to the limits of confidence intervals and random chance, not HARKing.
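To put rough numbers on the coin-flip example, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the 20 made-up yes/no traits, and the choice of an exact binomial test at the 0.05 level. If the tests were independent, the chance of at least one false positive across 20 of them would be 1 - 0.95^20, or about 64%. The simulated figure comes out lower because every trait is checked against the same 100 flips and the exact test is conservative, but it is still far above the nominal 5%.

```python
import math
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def binomial_two_sided_p(heads, n):
    """Exact two-sided p-value for `heads` out of `n` fair coin flips."""
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = pmf[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def study_finds_a_pattern(rng, people=100, traits=20, alpha=0.05):
    """One simulated study: does any unrelated trait look 'significant'?"""
    flips = [rng.random() < 0.5 for _ in range(people)]
    for _ in range(traits):
        # Hypothetical yes/no trait, assigned with no connection to the coin.
        has_trait = [rng.random() < 0.5 for _ in range(people)]
        group = [flip for flip, trait in zip(flips, has_trait) if trait]
        if group and binomial_two_sided_p(sum(group), len(group)) < alpha:
            return True
    return False


def share_of_studies_with_pattern(studies=500, seed=0, **kwargs):
    """Fraction of simulated studies that turn up at least one 'finding'."""
    rng = random.Random(seed)
    hits = sum(study_finds_a_pattern(rng, **kwargs) for _ in range(studies))
    return hits / studies


if __name__ == "__main__":
    print("Independent-test formula, 20 traits:", familywise_error_rate(20))
    print("Simulated, 1 trait:  ", share_of_studies_with_pattern(traits=1))
    print("Simulated, 20 traits:", share_of_studies_with_pattern(traits=20))
```

Calling the shot in advance corresponds to the single-trait case: one test, one chance to be fooled. Searching through every trait afterwards corresponds to the 20-trait case.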
I think I've figured out a suitable example.
Imagine a wall with several overlapping targets. I shoot at this wall, hypothesizing that I can hit the wall. Then, when examining my shot, I notice that I happen to get a bullseye on one of the targets. I take this result and say "Look, I hit a bullseye, my aim is fantastic!"
The point being that it doesn't count unless you first call your shot. My experiment might be proof that I can hit a wall (a surprising result, but data is data), but cannot be used as proof that I can aim accurately enough to hit a bullseye.
I was thinking of this exact thing as I read :D
Excellent and thank you. The endless harping on "appearance congruence" really got to me. I mean, yes, if you give a female testosterone they will usually grow a beard, but are they happier?
The suicide rate in this cohort is terrible, but I know you're getting to that in part 2.
Thanks for making this public, I will share with colleagues!
Indeed. If they took a bunch of junkies, and had an "Opiate Use Satisfaction Scale", and then gave all the junkies opiates, I would guess that the OUSS would show a statistically significant increase.
Since when has "good medical care" been defined as "giving the patient what the patient wants"?
You can be reported to the medical board, sure, but there's a whole process before your medical license is in danger. I'm not doing anything to endanger my license, just my social standing.
Seriously. I am lucky in that I work for that rarest of things: a small independently owned medical office. My boss is a pretty classic western Libertarian and he supports what I do. My patients are in Gresham and Sandy and Boring and Estacada and they don't care what I say on gender. If I worked for Kaiser I think I might be muzzled and if I worked in Sellwood or the west side I might worry that I'd lose customers but where I'm at, I'm fine.
Am I reading this correctly that there was no 'control' group? They had a cohort of kids treated with hormones and they tracked their progress through 2 years, but didn't compare that to kids who didn't get hormones? The improvements they report are not hugely impressive. I'd definitely want to know how kids who didn't get blockers or hormones fared...
Am I missing something? It seems like to truly say much of anything you'd need a control group? Do they address that in the paper at all?
You're not missing anything. I'll get into this in Part 2, but one problem is that the kids with psychological difficulties also got medication, counseling, or both. In the absence of statistical controls there's genuinely no reason to attribute the improvements to hormones rather than meds or therapy (or other unobserved variables).
Also, let’s not forget the well-known regression to the mean phenomenon, which is obscured in an uncontrolled study like this one.
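For anyone unfamiliar with regression to the mean, here is a small Python sketch of how it can look like improvement in an uncontrolled cohort. All of the numbers are made up, and the enrollment rule (only people scoring above a cutoff at baseline are followed) is an assumption for illustration, not a description of how this study recruited. Each simulated person has a stable underlying distress level plus day-to-day noise, and nobody receives any treatment.

```python
import random
import statistics


def simulate_regression_to_mean(n=10000, cutoff=60.0, seed=0):
    """Return (baseline mean, follow-up mean) for people enrolled on a high score.

    Scores are on a hypothetical distress scale where higher is worse.
    There is no treatment effect anywhere in this simulation.
    """
    rng = random.Random(seed)
    baseline_scores, followup_scores = [], []
    for _ in range(n):
        true_level = rng.gauss(50, 10)            # stable underlying level
        baseline = true_level + rng.gauss(0, 10)  # noisy first measurement
        followup = true_level + rng.gauss(0, 10)  # noisy later measurement
        if baseline >= cutoff:                    # enrolled only if baseline is high
            baseline_scores.append(baseline)
            followup_scores.append(followup)
    return statistics.mean(baseline_scores), statistics.mean(followup_scores)


if __name__ == "__main__":
    before, after = simulate_regression_to_mean()
    print(f"Mean at baseline:  {before:.1f}")
    print(f"Mean at follow-up: {after:.1f}")
```

The enrolled group's average drops at follow-up purely because people were selected on an unusually bad day. A comparison group selected the same way would show the same drop, which is why it takes a control group to tell this apart from a real effect.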
That was my major problem with this paper. There's no way to determine if the changes in the study participants' psych assessment scores were any different from what their peers would have scored, if they had been included as a control group.
Academic medicine has its own brand of political bullshit and this isn't it. It feels off and transcends professional squabbling. I don't understand why so many journals and professional societies are pushing something so unscientific as gender ideology. It's ideological capture, but by whom? I've been paying attention since Jazz Jennings was on 20/20 as a small child and I still can't figure out what's driving the trans rights activists.
I believe transhumanism is the core ideology, the notion that the inner self transcends the physical body and its limitations.
A variety of LTTEs regarding this study are in preparation by a number of researchers. I have submitted an LTTE along with another researcher. Once that LTTE is either published or not accepted, I will post the text here.
I'd be curious to be kept abreast of this if you're able to email me
Please publicize the LTTE, Jesse, so we all see it when it happens.
There are a surprisingly high number of Google results for the Liberation Tigers of Tamil Eelam (??) that are preventing me from learning what LTTE means in this context, even when I add other search terms such as “study” or “research” or “published”
Same, but I believe it's "letter to the editor".
I ran into the same problem. Letter to the editor.
A total lay-person here. I'm not trying to be snarky- I just want to understand.
Can surveys of feelings of an individual (especially in a hot button area) really tell us anything?
I've lied on surveys because I know the lie will support my position or because I felt the survey was dumb.
How do the investigators know that the answers they are given are truthful? Can it be truthful in the moment the survey is taken but not truthful at other times?
In things like forensic psychology there are robust instruments that are made to detect when someone is being deceptive by having a multitude of questions where what's being measured is not so transparent.
I don't know anything about the instruments used in these studies but I would be surprised if they have any mechanisms like that, and would imagine these kinds of results suffer from those defects.
I think clinically there is the assumption that people want to get better and so are less likely to be deceitful in that environment which is probably fair outside of specific issues (like Munchausen).
Even I know that surveys ask similar Qs multiple times to be sure of the answers, and so I am careful to be consistent.
Given that it is hard to admit something you buy into may not be as good as you thought, maybe surveys reflect that bias?
I would wonder if participants are more likely to amplify their depression and anxiety at the beginning of the study in order to ensure treatment--you know, like a cry for help.
Oh interesting! I only know of the MMPI through a recent high-profile legal case (guess), but I was really fascinated with its design given the setting it is meant for. I had however assumed the methods it employs would be more broadly applied in non-clinical settings due to data integrity concerns in studies on contentious subjects.
I've been wondering the same thing for years. It's as if a collection of engineered sentiments, when convincingly formatted and word-smithed, can just identify as science.
Amazing work! But I think there might be a typo. Just above the ***, you say "Then, when it was time to report their data, they only told us what happened to six of those variables." Shouldn't it be "TWO of those variables"?
fixed -- ty
Thanks for this invaluable analysis. I noted various major flaws when I read the study- no control group, extremely short assessment period of 2 years, etc, but assume you'll go into that in Part 2.
Please take into account fully the love-bombing role schools play for kids who transition, and how suddenly being the brave and stunning hero with tons of glitter friends and admirers will affect kids' mood. Go to The Anti-Science Disaster of Gender Ideology in The Schools to see what schools teach and do at https://caroldansereau.substack.com/p/the-anti-science-disaster-of-gender.
Thanks for once again doing the work that peer reviewers and editors inexplicably refuse to do.
My normal reaction to this kind of sloppiness is to assume that people are just plain incompetent and ignorant. "Protocol? What protocol?" "Those instruments? Nah, we'll just use these over here." But in this case the stakes are just too high, and the ideology too entrenched, for this not to be deliberate deceit. What I don't get is how they could not have known that Jesse Singal would see right through their shoddy results.
It's not that they "inexplicably" refuse to do it. At least one editor has an NB kid that I know of. They want this study to prove that affirmative care is a raging success, so they overlook any of its not-at-all-glaring-or-major flaws.
I have to wonder whether any of the valid push-back on this train wreck will result in a retraction or if NEJM will bury and ignore the negative response. Thank goodness for Jesse and the others who provide cogent, valid feedback, but really, it's not enough given a medical community that buys this garbage without any critical thought if it's published anywhere of note.
"At least one editor has a NB kid" – I did not know that, but it isn't too surprising.
https://www.washingtonpost.com/news/parenting/wp/2018/07/09/why-i-had-a-hard-time-calling-my-transgender-child-they-and-why-im-doing-it-anyway/
"Debra Malina is the perspective editor of the New England Journal of Medicine."
So not one of the medical editors, but I would be surprised if there weren't others or if there was no influence.
I think that there will be significant pushback. Publishing such an obviously flawed study is unethical. It's a giant middle finger to researchers and clinicians who care about doing good science and improving patients' quality of life. It's too big to ignore.
I can't wait for the pop sci outlets to cover this angle! /s
Actually I'm most surprised by the small effect sizes. 2 years of treatment, cherry picked outcome measurements and all we get are marginal improvements?
My concern with transition medicine has always been poor screening, and a move away from gender dysphoria (basically a subset of body dysmorphia) towards gender ideology which I believe is incoherent. I assumed that treatments resulted in big improvements for those who need it. Perhaps that is the case and the poor screening offsets those benefits in the aggregate even for those receiving treatment.
Edit: I forgot to include, very clear for such an in-the-weeds topic. Glad you're writing on this Jesse.
As an MD I can speak to the cultural phenomenon of people wanting to mine the data until something pops positive. There’s pressure to publish-publish-publish and having a statistically significant result is how you do that. Promotions in academic medicine are most often based (in large part) on scholarly output.
I agree in part w/ the folks you mentioned who didn’t necessarily see anything wrong with reporting on statistically significant relationships, even if it wasn’t what was being looked for. But framing and context are vitally important and what WASN’T found is just as important as what WAS (but this is often harder to publish).
It is good that the NEJM included protocols for this paper (how else would you have been able to discern that something may be missing in this analysis?), but I am disappointed in the framing/overstatement of study findings by the authors, and I am also very surprised that the most important scales/results (suicidality-related) were not mentioned at all. I think that if the authors wanted to publish this info separately, it could have been mentioned that “so and so pattern is emerging in these results, not statistically significant, authors plan to continue to collect longitudinal data…” To leave out any mention of suicide or self-harm measures seems extremely strange.
Thank you for digging into this. I read through the study when it came out, and none of this would have occurred to me to look into.
What I did notice (and am curious about your take on): is it normal that they averaged the scores, rather than reporting on what percentage of the patients showed a clinically significant improvement or decline in each measure? As a parent, that's what I'd want to know: the percentage of kids that benefit from treatment.
More specifically, the way I'm reading the measures, it seems possible that dramatic improvements in a minority of children could even out minor declines in a larger group. I'm not a researcher, so I don't know how common it is to average scores like this. But I could see how you could come up with those numbers even if, say, just 20% of the kids dramatically improved, 50% stayed about the same, and 30% got marginally worse.
Could this be just as likely an outcome as anything else? And if so, are there any sort of ethics standards that would stop researchers from doing this?
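To make the arithmetic behind that 20/50/30 scenario concrete, here is a short Python sketch. The point changes are invented for illustration (a 10-point improvement for the 20%, no change for the 50%, a 1-point decline for the 30%), and positive numbers mean improvement. None of it comes from the study's data.

```python
from statistics import mean


def summarize(changes):
    """Summarize per-patient score changes (positive means improvement)."""
    n = len(changes)
    return {
        "mean_change": mean(changes),
        "pct_improved": 100 * sum(c > 0 for c in changes) / n,
        "pct_unchanged": 100 * sum(c == 0 for c in changes) / n,
        "pct_worse": 100 * sum(c < 0 for c in changes) / n,
    }


# Hypothetical cohort of 100 kids: 20 improve a lot, 50 stay put, 30 slip slightly.
changes = [10] * 20 + [0] * 50 + [-1] * 30
summary = summarize(changes)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```

With these made-up numbers the average change is (20 × 10 + 50 × 0 + 30 × -1) / 100 = 1.7 points of improvement, even though only one kid in five got better and more got worse than improved. The average on its own cannot distinguish this pattern from one where everyone improves a little.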
If you poke around in the Supplementary Appendix they do at least have a table showing what percentage of kids met certain clinical thresholds for anxiety and depression at each wave of the study. Not quite as fine-grained as what you're looking for -- I agree it would be good to even just know what percentage of kids improved -- but it's something, at least.
It does seem like a lot of numbers and graphs to sift through and still be unable to answer the question: for each measure, what percentage of kids showed improvement?
What these studies should report on is the change from baseline. For all with severe depression, what percentage are better, same, worse? For all patients with no depression, what percentage are better, same, worse? Same with anxiety. This is also true with other studies like the Tordoff study that Jesse reported on earlier.
The declines in the anxiety and depression are statistically significant, but the effect size is quite small. I'm more than a little skeptical that those declines are even clinically significant. CBT would probably have a bigger effect.
Some of the comments here suggest that the authors of this study are politically motivated and knowingly engaged in misconduct. That may be true. However, I think this kind of sloppy science is actually the norm across many areas of research. Everyone is doing it, there are professional incentives to find positive results, there are professional disincentives to criticizing the work of colleagues too seriously, etc...
I don’t mean to defend bad research practices, just to suggest the causes can be very mundane, not malicious.
Sloppy science generally doesn't get published in the New England Journal. The editors allowed this study. Why?
In many, if not most, cases the "professional incentives" are dollars. Look at where the authors work, and imagine what happens if a well publicized study says what they offer is a wonderful thing. $$$$
Totally agree. This happens fairly often. Some of it is due to the lack of statistical training of medical authors as well, and the reviewers/editors at NEJM. Not that any of this is good news, just that I don't think it's as easy to jump to conclusions on the motivations of the authors.
Jesse, the Sex Matters technical paper (Dec. 2022), "Gender-questioning teenagers: puberty blockers and hormone treatment v placebo", is also important to consider. It finds that the average improvement in mental health over the course of gender treatments is no bigger than for placebo in other mental health-measuring studies. You probably have already seen it, but just in case:
Link: https://sex-matters.org/posts/publications/gender-questioning-teenagers-puberty-blockers-and-hormone-treatment-vs-placebo/
For any statistical laypeople who are interested in understanding how and why p-hacking, HARKing, and other methodological malpractice have operated (and continue to operate) within the scientific 'community', and what approaches - such as open science - are being encouraged to reduce their prevalence, this is a pretty good breakdown: https://www.youtube.com/watch?v=0a9MmloTRO4 (you can probably skip the first 6 mins unless you are interested in an unnecessarily long introduction)
I don't think it is too heavy in statistical jargon, so should be relatively easy to follow.
It’s called confirmation bias. “Science” is riddled with it. Won’t end until being “wrong” becomes normalized. Wrong is also a valid result, but only if you are doing real science, aka seeking truth.