Welcome to MIT's Computer Science and Artificial Intelligence Lab's Alliances podcast. I'm Kara Miller. [MUSIC PLAYING]
On today's show, technology has given us a lot more power to make predictions in clinical health care settings. But--
It's not a question anymore whether we can predict something. The answer is almost always yes, we can. The question you should be asking is, should we predict this? And if we do, what action will we take as a result?
Marzyeh Ghassemi is an associate professor in electrical engineering and computer science at MIT, and her focus is AI in health care. How can you use it to help clinicians and patients? Well, companies are building solutions, but those solutions may have some holes.
So far, what we've seen is that the generative AI technologies being proposed for these tasks haven't been well vetted in these settings. And so I think we as technologists really have a responsibility to the people in health to make sure that when we're helping them with something that's going to affect their patients and their own practice, we know it's going to work at least as well as what they are doing right now.
Today, I'll look at AI in health care through the lens of somebody developing models and seeing the problems under the surface. That's all coming right up.
But first, in so many of our conversations here, we talk about generative AI. It's one of the game-changing technologies of the last few years. So how can you understand it better? How can it help your business? Join CSAIL for the online course, Driving Innovation with Generative AI.
If you're interested in more info, check out our show notes on Apple Podcasts or wherever you get your podcasts. There you're going to find a link to more info about the class. And listeners to the podcast get 10% off. Use the code MIT X Pod 10. Again, the code MIT X Pod 10.
[MUSIC PLAYING]
In 2011, roboticist Ayanna Howard put together an experiment.
Ayanna Howard is a personal hero of mine. I saw her present this paper at a conference when I was a graduate student.
Howard wanted to understand how much trust people put in robots. Here's Marzyeh Ghassemi.
And what she does in this experiment is she has a group of people go into a room, and they're doing some task, watching something, doing something else. And then into this room she sets off some artificial smoke and a fire alarm.
And so there's a sense of urgency. They need to evacuate. Where do they go? And then there's a little rescue robot that comes up to guide them to safety.
But this, not surprisingly, was a setup. A bunch of the people in the experiment had just seen the robot do terribly with a navigational task, so it probably was not a good robot to follow when you needed to get somewhere safe.
And yet, in this experiment, all of the people in the room follow the robot to a dark room with no discernible exit, walking past the safe exit that they came in on. And so, for me as a graduate student, this was such a powerful example of how trusting people can be of technology, even when maybe there are some signs that they shouldn't trust it in a specific situation.
And that's something that we struggle with in health care, the area I work in, but in any field where we have technology interacting with humans, we need to be very careful about how we deliver advice and understand how people will trust it and follow it.
And every day, many people who work in health care have access to all sorts of technology, almost all of it meant to be helpful. So the question is, should there be more questions about how often we tend to follow that advice?
Many of the prominent uses of generative AI in health care right now are administrative: things like automating billing, completing patient notes, or maybe creating drafts of messages for a provider to send to a patient. It's unclear how we make sure, in these generative use cases, that the information being generated is relevant only to the clinical setting and doesn't retain any of the societal biases that are normal for language models to propagate because they've been trained on human-generated text.
But there are other kinds of biases that can filter in through technology. Risk scores, for example, which can help clinicians predict what might be to come for a patient, are too often based on surprisingly small data sets of just a few hundred people, according to Ghassemi, who also holds MIT affiliations with the Jameel Clinic and CSAIL.
It would be nice if there were just national, publicly available, de-identified data sets in the United States that any researcher could look at to understand what kind of health care works best for different people, but we don't have that right now. And so instead, what happens is if you want to develop a risk score, you have to look at a specific source, get permission to use that data, and build a model based off of it. And that's going to reflect the specific biases and population distributions of that site.
Specifically in some of these older risk scores, we also have the problem that they may reflect historical practices. So it's not just the case that we're taking maybe a smaller sample size of a few hundred or a few thousand patients, which isn't as representative as we would like. It's also the case that you're looking at a single center, and so that may reflect the way that training was done there, that may reflect the specific practice of a few doctors in that subspecialty.
And the challenge is even greater when it comes to minority groups, as she notes, because minorities are, by definition, smaller in number, which means fewer data points and the possibility of models that just aren't that robust.
And when we train a model, even a very simple model, to predict a label based on this data, often the way we do it is with something called empirical risk minimization, or ERM. And what this is is an instruction to an optimizer to build a model that works best overall for the most samples.
But you can imagine if you had 100 samples, and 10 of them belonged to a minority group, and it was just harder to fit this minority group. If the model doesn't know that it should be very careful about misclassifying different groups at different rates, it would say, well, you just told me to do best overall, so I can just ignore these 10 patients, do best on these 90 patients, and then I'll get a good score.
And so historically, people have used methods like empirical risk minimization without awareness of different subgroup performance rates. And I think that's made it very difficult to generalize to minority populations, but then even to move from site to site. And so now there's been a shift in the field toward ensuring that risk scores, when created, work very well across as many different groups as possible.
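To make the 100-sample illustration above concrete, here is a minimal sketch of how a model fit "best overall" in the ERM sense can look fine on aggregate while failing a small subgroup. The data, labels, and group split are entirely hypothetical and just constructed so the minority group follows a different pattern; this is not data or code from the paper discussed in the episode.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical cohort: 90 majority-group samples, 10 minority-group samples.
# The label depends on the feature in the opposite direction for the minority
# group, so a single model optimized for average loss favors the majority.
X_maj = rng.normal(0, 1, size=(90, 1))
y_maj = (X_maj[:, 0] > 0).astype(int)
X_min = rng.normal(0, 1, size=(10, 1))
y_min = (X_min[:, 0] < 0).astype(int)   # flipped relationship

X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])
group = np.array([0] * 90 + [1] * 10)   # 0 = majority, 1 = minority

# Standard ERM-style fit: one model, optimized for the average.
model = LogisticRegression().fit(X, y)
preds = model.predict(X)

overall_acc = (preds == y).mean()
per_group_acc = {g: (preds[group == g] == y[group == g]).mean() for g in (0, 1)}

print(f"overall accuracy:   {overall_acc:.2f}")
print(f"per-group accuracy: {per_group_acc}")  # minority accuracy can be near zero
```

Reporting only the overall number here would look acceptable; it is the per-group breakdown that exposes the problem being described.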
But this isn't just an optimization problem, it can also be a data problem fundamentally. Meaning you may not have a model with high enough capacity to understand the way that specific features, like somebody's biological sex at birth or self-reported race, interact with a specific clinical feature.
For example, we know that African-Americans often access health care later on average than White Americans. And so what this can mean is that we tend to see their diseases at a slightly later stage than we do for other majority groups and so this impacts what a model may believe about the potential diagnoses and predictions that it should make for these groups.
But it may be missing the piece that it was a less serious form of the disease at some point; it's just that this wasn't captured by the medical system.
Models have no idea what they don't know.
Right. Right.
It's not as if they have some bird's eye view of all of the different sociotechnical decisions that impact data that gets into the system.
So this ties a little bit into why you wanted to work in this area. But I wonder, and we'll talk more about vulnerabilities with data, whether you feel like there are places where computer modeling has worked in health care, places that give you hope that, yeah, here's somewhere this has been useful.
There are a lot of advances that I think are really exciting. A few of them are all of the work on protein folding, understanding new ways to advance drug discovery, predicting new binding targets for things that we previously considered undruggable. These are all things that humans just can't do or we're really bad at doing, and AI is extending our capacity.
And what's really cool about that is it's transformative. It's not like we're asking a model to mimic our existing practice, which maybe is not so great, but just do it faster. We're asking AI to do something better than us, something that we're actually not great at, and I think that's really exciting. Within clinical settings specifically, there are some risk scores that work really, really poorly and really would be better informed if we had more data modalities or just higher capacity models.
So, for example, a fellow MIT faculty member, Regina Barzilay, has developed a breast cancer screening model that uses mammogram data collected from many international sites. And that model works better than the historical model that was based on risk stratification, using aggregated metrics for different groups and assumptions about what different group risks might be.
And so having that kind of personal risk score is really impressive. But she did so much work to get there, and I think the same is true for some of the other really impressive advances that we've seen. And some of the best examples of advances that I have seen in the practice of health care focus on areas where we know human doctors perform really poorly.
These would be things like domestic partner violence identification or endometriosis identification, where there's such a long delay between the onset of a condition and a clinician recognizing that it's happening and then giving you resources. And other chronic conditions that are really hard to detect and subtype.
These kinds of things are very difficult to do, especially when we have primary care shortages across the United States. And so a model being able to offer more resources to underserved groups, I think is a win for everybody.
So when you look forward, let's say, 10 years, what do you hope that technology could inject into hospitals and into doctors' offices that would be really useful to patients?
There are so many things. Being hopeful, I think, focusing on really positive outcomes: I could imagine a patient with a chronic condition being detected earlier, something like endometriosis. I could imagine a targeted treatment being recommended much more quickly.
There are some chronic conditions where you have a first set of medications that you try and then you sort of randomly try to understand, by sample selection and failure, which one will work best. For example, which antidepressant should I use? It's unclear.
Knowing which kind of treatment might work best for a specific kind of patient would be extremely valuable. And even beyond that, understanding the impact of different kinds of treatments on different kinds of patients, and how that interacts not just with the singular condition it's intended for, but with all the other parts of a patient's life: not just their sociodemographics, but other conditions they might have.
What kind of lifestyle they have, what sort of resources they have access to, whether they are in a food desert, whether they have regular access to other kinds of health screening. All of these could be a consideration. And again, help us direct more health resources to patients who really need them, even when they haven't seen a primary care provider as regularly as we would like.
And would tech be helpful there because it can take in many more inputs than, like, a person can? Like, it can know your zip code. It could also know all the meds you're on, all your history. Like, it can take all that stuff in and--
Yeah, high capacity models really can be extremely informative if you know the biases of the data coming in and how to train them to produce a well-calibrated output. I think what we've seen historically is that this space is new enough that we don't have models that have been applied to existing health care problems in general, and then we've been able to deploy them and see that impact. See the patient outcomes improve.
And so what I'm really excited for is some of these models that have been proposed using clinical notes, vitals, labs, longitudinal imaging data over a decade. Those can be used to try to inform clinicians and patient care staff as to how best to support a patient in future years.
And again, if it takes on average 18 months to schedule a primary care appointment, maybe we can use these kinds of technologies to do early prediction and get those appointments scheduled a little bit more quickly so that resources can be allocated or that specialists can be seen.
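On the "well-calibrated output" point above, here is a minimal sketch of one common way to check calibration: bin the predicted risks and compare the average predicted risk in each bin to the observed outcome rate. The predictions, outcomes, and bin count below are hypothetical, made up only to illustrate the check, not anything from the episode.

```python
import numpy as np

def calibration_report(pred_probs, outcomes, n_bins=10):
    """Compare predicted risk to observed event rate within equal-width probability bins."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((pred_probs * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((b / n_bins, (b + 1) / n_bins,
                         pred_probs[mask].mean(),   # average predicted risk in the bin
                         outcomes[mask].mean(),     # observed event rate in the bin
                         int(mask.sum())))
    return rows

# Hypothetical example: a model that systematically overestimates risk.
rng = np.random.default_rng(1)
probs = rng.uniform(0, 1, size=2000)
events = rng.uniform(0, 1, size=2000) < (probs * 0.7)  # true risk is lower than predicted

for lo, hi, mean_pred, obs_rate, n in calibration_report(probs, events):
    print(f"[{lo:.1f}, {hi:.1f}) predicted {mean_pred:.2f} vs observed {obs_rate:.2f} (n={n})")
```

A well-calibrated model would show predicted and observed values close in every bin; the gap here is the kind of mismatch a deployer would want to catch before acting on the scores.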
When you talk to people in health care, what do you hear from them in terms of what they want or whether they even have any sense of what AI could provide them?
I think my friends and colleagues in health care are in a crisis, right. I think in many ways it's been hard to move on from some of the burnout post the COVID-19 pandemic. And I think the reason that many of the applications of generative AI right now are so focused on administrative tasks is because they're feeling that burden even more keenly than they did pre-pandemic.
And so I think many of the questions from clinicians that I get anyway are related to can you remove these administrative burdens, these things that are not relevant to patient care? Things that I don't think should be really part of my job, can you help me automate those out so that I can focus on taking care of patients?
And it's a very reasonable ask, right. There's been this, sort of, explosion of administrative tasks that clinical staff have had to handle. But so far what we've seen is that the generative AI technologies being proposed for these tasks haven't been well vetted in these settings.
And so I think we, as technologists, really have a responsibility to the people in health to make sure that when we're helping them with something that's going to affect their patients and their own practice, we know it's going to work at least as well as what they are doing right now.
So I wonder then what you think the state of play is for creating and for implementing various models in health care. And I'll just insert here that you were part of a paper that got quite a lot of coverage, and it involved a model that looked at chest X-rays and basically tried to diagnose whether somebody had a potential problem, whether they needed to be further examined. And quite famously, from an X-ray, basically just bones, right, the model defied all expectations and accurately predicted self-reported race.
So given the trickiness, as you've shown, of using models in a good way and of using data in an appropriate way, I just wonder what you make of the fact that, yeah, there are a lot of models that are being developed right now in health care, in academia, in industry. Do you think these models are useful or are we in a nascent stage when it comes to AI in health care?
I think we are still in a nascent stage where many of the models being developed have not been properly evaluated. So one of the things we found in the same paper is-- we were really unhappy that these models performed poorly on different groups when they could detect demographics. And so we used some state-of-the-art algorithms to force the models to predict fairly.
And so if you use techniques like group distributionally robust optimization, you can actually get models that perform really well on average and have very, very small fairness gaps. And that was so relieving to see and also made me very hopeful.
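For listeners curious what "group distributionally robust optimization" looks like in code, here is a minimal sketch of the core idea: instead of minimizing the average loss, each training step minimizes the loss of the currently worst-performing group. This is a simplified hard-max variant for illustration; published versions typically use soft, exponentially updated group weights. The model, data, and group labels below are hypothetical, not from the paper being discussed.

```python
import torch

def group_dro_step(model, optimizer, loss_fn, X, y, group_ids):
    """One training step that optimizes the worst-group loss instead of the average loss."""
    optimizer.zero_grad()
    logits = model(X)
    group_losses = []
    for g in torch.unique(group_ids):
        mask = group_ids == g
        group_losses.append(loss_fn(logits[mask], y[mask]))
    worst_group_loss = torch.stack(group_losses).max()  # focus on the hardest group
    worst_group_loss.backward()
    optimizer.step()
    return worst_group_loss.item()

# Hypothetical usage with a tiny classifier and two groups of very different sizes.
model = torch.nn.Linear(5, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

X = torch.randn(100, 5)
y = torch.randint(0, 2, (100,))
group_ids = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])

for _ in range(20):
    group_dro_step(model, optimizer, loss_fn, X, y, group_ids)
```

The design choice is the point: the optimizer is no longer allowed to "ignore these 10 patients" to score well on the other 90.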
But then we tried training these models on data from Massachusetts, taking these optimally fair models that perform really well on average and have no fairness gaps, and then deploying them or testing them on X-ray data from California. And the performance drops so much.
And it's not just the overall performance, which does suffer; it's specifically the fairness. What we found is that when you apply a well-trained model, something that's, sort of, optimal in your setting, to a new data setting, often the fairness is the thing whose generalization you have no idea about.
And so we have basically even odds of a model that is perfectly performant and fair doing much better or much worse in a new setting. And this is concerning for deployments because most of the models that I see developed, they're developed on a single site with a lot of data, but still a single site, right? They're then tested on unseen patients from that same site, but then they're approved to be deployed in much larger settings.
And so we don't really have, I think, right now as a community, good ways of doing broad evaluation and also just local validation. Before I use this model in my hospital, can I get an evaluation of how well it's going to do on patients that look like the patients from my hospital?
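As an illustration of the kind of local validation being described, here is a minimal sketch of evaluating an already-trained model on data from a new site and reporting both overall performance and the largest per-group gap. The function names, site data, and group labels are hypothetical; this is one plausible way to structure such a check, not the method used in the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def external_validation_report(model, X_site, y_site, groups_site):
    """Evaluate a trained classifier on data from a new site, overall and per group."""
    scores = model.predict_proba(X_site)[:, 1]
    overall_auc = roc_auc_score(y_site, scores)
    per_group_auc = {}
    for g in np.unique(groups_site):
        mask = groups_site == g
        # Skip groups too small to evaluate or with only one outcome class present.
        if mask.sum() >= 30 and len(np.unique(y_site[mask])) == 2:
            per_group_auc[g] = roc_auc_score(y_site[mask], scores[mask])
    gap = (max(per_group_auc.values()) - min(per_group_auc.values())) if per_group_auc else None
    return {"overall_auc": overall_auc, "per_group_auc": per_group_auc, "worst_case_gap": gap}

# Hypothetical usage: a model trained at site A, checked before use at site B.
# report = external_validation_report(model_from_site_a, X_site_b, y_site_b, groups_site_b)
# print(report)
```

The point of reporting the gap alongside the overall number is exactly the Massachusetts-to-California finding above: overall performance and fairness can move in different directions when you change sites.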
This is actually a perfect segue to my conversation last month, on this podcast, with Dr. Karandeep Singh, who's the chief health officer at the University of California at San Diego Health and a practicing doctor in the hospital. And I basically asked him almost what you described.
Essentially, if you have great data from a hospital in Connecticut, can you say, oh yeah, we can totally use that in New Mexico? But of course, different demographics, different incomes, the whole thing. Here is what he said.
You've just described the modern state of health care AI which is precisely that. Companies come to you in some cases with a lot of high quality data from a system that potentially resembles yours and potentially doesn't.
And the challenge is it's not just a demographic thing. This is not just a matter of the model working in this patient population but not in this other one; it's also data quality. And the way that your clinical care gets documented and generates that digital footprint, that process might be slightly different at one hospital or one health system versus another, and that could lead to big differences.
So here's this big promise of what AI can add to health care. But as both of you are describing, this isn't like the iPhone, which we can test in California and send to New Mexico and it still works OK.
Yeah. It's true.
And so if you're in this situation where, I mean, not only can you not test something in Connecticut and send it over to New Mexico and be like, that totally worked in Connecticut, it's going to work great in New Mexico, you probably can't even send it to another part of Connecticut where there's totally different dem-- maybe you can, I don't know. How do you surmount these hurdles?
So other application areas have faced these issues and have surmounted them in different ways. So talking about machine learning in speech, right. I don't know if you know anything about how systems like Alexa and Siri were developed.
These companies paid their employees, or in some cases maybe didn't pay their employees, to go into rooms and say things, and they recorded them saying things, right. They had to bootstrap their own training sets of all of the different ways that somebody might be interacting with Alexa or Siri. People of different ages, with different accents, of different sexes.
And there were many reports in the early days of voice assistants that these models didn't work well for all women, for children, for people with any accent at all and so they had to go back and get more and more data and improve these models. And so it's not the case that in other areas we have solutions that just work on everybody when we first deploy them, it is the case that you have to figure out oh it's not working on this group, I need to improve it.
But in that case, worst case scenario, I'm angry that my order didn't go through and now I talk to a customer service representative. Customer service says, oh, it seems like I've gotten a lot of complaints from women; women on average have higher-pitched voices, we need to fix this. The same thing has been true in computer vision. So for a very long time it's been well established that examples of things that are easy to classify in a North American context are almost impossible in non-North American contexts.
So if you do a wedding in the United States, often there's a big white dress, there's a man in a tux. And if you do a wedding in India, it's classified by state-of-the-art systems as a performance art exhibition.
OK. OK.
So this is because we have trained models with a specific bias because that's where most of the data has come from. This is where most of the labeling is coming from. This is where most of the evaluation sets are coming from.
And so there's been a concerted effort in the speech community and the vision community and the natural language processing community to collect data sets that are more representative. I know there was a huge effort in the NLP community to try to collect more examples of paired language tasks, so that low-resource languages had more examples that models could at least be evaluated on.
If I'm trying to translate medical text from English into Thai, I need to know that it's accurate. But how do I even know that if I don't have examples to evaluate based on? And so many, many communities are dealing with these issues now and have dealt with them in the past, this problem where we're using a model and it just works in the setting where we trained it.
I think the issue in health care is when something doesn't work on a new patient population, the result can be really disastrous, because often we don't know why a model fails. It's just giving us a prediction, and maybe it's wrong because I'm in a new setting. But if I'm classifying somebody's wedding as performance art, that often doesn't have the same kind of life-changing, urgent, acute impact. It can be dealt with. We can go back and change the model.
And so I think we, as a health community, are working really hard to make sure that models are robust, that we know how to correctly characterize their generalization capacity, that we know how to express uncertainty in a way that people can use and incorporate into decision making. But it's a hard problem that the whole field is struggling with right now.
How confident are you that hospitals, that regulators, that companies are up to the task? Because you're also describing something that is, I mean, labor intensive, capital intensive, right. I think this needs to be an extremely well-orchestrated effort.
I think many parts of the regulatory system need to be engaged. I think we can take examples from other spaces where regulation has brought something that used to be-- a technological integration that used to be risky-- into a safer domain, and that's aviation.
I am so impressed that an airplane door can fall off an airplane mid-flight and the airplane lands and everybody survives. That doesn't seem possible to me. But air flight didn't use to be that safe, right. This is because we now have multiple regulatory systems in the United States that were created by different acts of Congress, separated by decades, in order to ensure that there was the right auditing, the right regulation, the right investigation when something bad happens.
And so I think, as a community, we need to get to a place where we, as experts, are really comfortable working with regulators, giving them the resources and support that they need to provide guidance to model developers and model deployers about how best to use technology in the context it's going to be used in health care. Right now we don't have enough capacity and we don't have enough integration.
And so I think there's a lot of work that needs to be done, a huge amount of work that needs to be done, to cover not just AI but also these risk scores that don't have AI in them and don't work for certain populations, right. Any time we deploy a model in a health care setting, we want to know that it does work, and we want to know when it might not work.
It sounds like there's a lot that needs to happen with regulation. I mean, I think of the FAA and their oversight so much. Not that the airlines don't-- like, they obviously are a very important part of this, but it sounds like regulators haven't created the system yet.
I think we don't have the rules yet. And so it's hard to play by the rules when there aren't any. And so if I, as a developer, don't know that an acceptable difference between any two groups is a certain threshold, then I have to make that call on my own. If I, as a model deployer, don't know that a model has to perform at a specific rate for a prediction task every year, otherwise it's moved out of distribution and it needs to be updated, I don't know when to stop it.
These kinds of guidances need to be made clear, but we don't have them right now. Right now individual developers have to make decisions about whether their model is generalizable enough, whether it's robust, whether it's performing well enough. And then individual deployers have to make decisions about when to check models, how to check models, and when they're bad enough to stop using them. And that's not a state that I think should continue if we want safe deployments that are improving human health.
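To make the kind of guidance being described concrete, here is a minimal sketch of a periodic deployment check that flags a model when its performance or its group gap drifts past an agreed limit. Every threshold, metric, and number here is hypothetical, invented only to show the shape of such a policy; no regulator has published these values.

```python
from dataclasses import dataclass

@dataclass
class DeploymentPolicy:
    # Hypothetical thresholds a regulator or health system might agree on.
    min_overall_auc: float = 0.75
    max_group_gap: float = 0.05

def review_deployment(overall_auc: float, per_group_auc: dict, policy: DeploymentPolicy) -> list:
    """Return the reasons, if any, that a deployed model should be paused and re-evaluated."""
    flags = []
    if overall_auc < policy.min_overall_auc:
        flags.append(f"overall AUC {overall_auc:.2f} below {policy.min_overall_auc:.2f}")
    if per_group_auc:
        gap = max(per_group_auc.values()) - min(per_group_auc.values())
        if gap > policy.max_group_gap:
            flags.append(f"group AUC gap {gap:.2f} exceeds {policy.max_group_gap:.2f}")
    return flags

# Hypothetical quarterly review on fresh local data.
flags = review_deployment(0.72, {"group_a": 0.80, "group_b": 0.68}, DeploymentPolicy())
if flags:
    print("Pause model and re-evaluate:", flags)
```

The value of written guidance like this is that the decision to stop using a model no longer rests on an individual developer's or deployer's judgment call.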
So a final question for you. This is a moment that feels a little bit like a gold rush in terms of the creation of AI. And so I wonder if you worry about that. There are a lot of companies that are like, we could do this, or even hospital chains that are like, we could do this.
And if they're rushing to put things together and the system kind of-- they're trying to put the car together as they drive it, so to speak, like, do you worry? I worry a lot. Not just about this, about many things.
I worry a lot. I think there is the potential for any kind of technology to be used for harm. I think we as a society, we know this, right. There are ways to use any kind of technology that will harm humans.
And this extends to education, to finance, to employment. But in health, I think there is, sort of, a natural understanding of the urgency of ensuring that we're not harming people. And I don't think that right now there's the same kind of contemplation that we should have about the time, the place, the manner in which a model should be deployed.
I often hear people say, I have this great idea, we could do this. We could predict whatever health outcome. And my response is always you could predict almost anything at this point. We have very good models.
I am so impressed with the technological landscape and how model capacity has grown even recently. And so that's not a question. It's not a question anymore whether we can predict something. The answer is almost always, yes, we can.
The question you should be asking is, should we predict this? And if we do, what action will we take as a result? And I've found it very difficult to walk through the positive use cases of some of these prediction tasks with different deployers because I think there's so much excitement about the fact that we can do something that we're not focusing on whether we should use AI in many use cases and what the decisions are going to be that are informed by AI in those contexts.
Marzyeh Ghassemi is an associate professor at MIT in electrical engineering and computer science. She holds affiliations with the Jameel Clinic and CSAIL. Thank you. This was great.
Thank you for having me.
[MUSIC PLAYING]
And before we go here, a reminder about CSAIL's upcoming online GenAI course. If you're interested in more info, you can email us, podcast at csail.mit.edu or you can check out our show notes on Apple Podcasts. There you will find links with more info. And listeners to the podcast get 10% off the course with the code, MIT X Pod 10.
I'm Kara Miller. The podcast is produced by Matt Purdy with help from Andrew Zukowski and Audrey Woods. Join us again next time. And stay ahead of the curve.