Getting the Most out of Multimodal Models with MIT CSAIL Postdoc Jehanzeb Mirza

Audrey Woods, MIT CSAIL Alliances | September 15, 2025


In January of 2021, over a year before ChatGPT would become a household name, OpenAI released a neural network called CLIP, which effectively combined computer vision and language models. While CLIP flew under the radar of the general public, in the computer science community this landmark paper blew open the doors for multimodal learning, or training AI models with multiple modalities like text, image, and sound/speech.

Now, MIT CSAIL Postdoctoral Associate Jehanzeb Mirza is working to expand what’s possible in multimodal learning with a focus on enhancing fundamental abilities, like in-context learning or fine-grained understanding. Inspired by human perception and sensory development, Dr. Mirza believes the future of AI is multimodal.

FROM AUTONOMOUS CARS TO AI MODELS
Dr. Mirza always wanted to do something with machine learning, but there were limited options available while he was earning his bachelor’s and master’s degrees in electrical engineering. As an undergraduate in Pakistan, he says, “there was only one very introductory machine learning course.” His first hands-on experience with machine learning came while working in the autonomous car industry in Germany, where he was exposed to computer vision as it relates to self-driving vehicles. An internship at Intel solidified his interest in the area, and he moved to Austria to pursue a PhD in computer vision at the Technical University of Graz.

“At the time, there was not a lot of multimodal learning,” Dr. Mirza explains. “There was only unimodal learning, where people were working either solely with natural language or with vision.” The field changed when OpenAI released CLIP, which he describes as “the first large-scale multimodal network where people saw that you can combine language and vision.” Even though he was still technically working in autonomous driving during graduate school, thanks to a supportive advisor, Dr. Mirza spent the second part of his PhD exploring multimodal learning as it relates to unsupervised learning approaches. When he graduated and was looking for postdoc positions, he was introduced to MIT CSAIL Senior Research Scientist Jim Glass, who works on cross-modal learning between audio, speech, and vision, and “the rest is history.”

“The thing which fascinates me about multimodal learning is that human perception is based on multiple modalities. Human perception is built with vision, language, audio components, and other senses, which are difficult to model.” Dr. Mirza believes that the intersection between multimodal learning and self-supervised learning has a “huge role to play in how we will achieve this Holy Grail of AGI, or Artificial General Intelligence.”

CURRENT WORK: BETTER PROMPTS AND PERSONALIZATION
Though Dr. Mirza is working in multiple areas related to multimodal learning, three projects capture the current trajectory of his research. First is GLOV (Guided Large Language Models as Implicit Optimizers for Vision Language Models), which he presented at the 2025 CSAIL Alliances Annual Meeting. GLOV uses LLMs to automatically enhance the prompts given to Vision Language Models (VLMs), improving performance on downstream vision tasks such as image classification. “The problem is that these models are very sensitive to the input language being provided,” Dr. Mirza explains. Their method creates a feedback loop that leverages the optimization capabilities of LLMs: candidate prompts are scored on the downstream vision task, and that feedback guides the LLM to propose progressively better prompts for the vision model. Fundamentally, GLOV offers automated prompt optimization, an exciting area with widespread implications as AI models proliferate.
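To make the loop concrete, here is a minimal sketch of what a GLOV-style prompt-optimization loop could look like in Python; the function names (`llm_propose`, `vlm_accuracy`) and the scoring setup are illustrative placeholders, not the released GLOV code.

```python
# Minimal sketch of a GLOV-style prompt-optimization loop (illustrative, not the released code).
# `llm_propose` and `vlm_accuracy` are placeholder callables the reader would supply.

def optimize_prompts(llm_propose, vlm_accuracy, seed_prompts, n_rounds=5, keep_top=3):
    """Iteratively ask an LLM for better VLM prompts, guided by downstream performance.

    llm_propose(history) -> list[str]: an LLM call that returns new candidate prompts,
        given (prompt, score) pairs from earlier rounds as feedback.
    vlm_accuracy(prompt) -> float: scores a prompt by running the VLM on a small
        labeled set (e.g., zero-shot image classification accuracy).
    """
    history = [(p, vlm_accuracy(p)) for p in seed_prompts]
    for _ in range(n_rounds):
        # Feed the best-scoring prompts back to the LLM so it can refine them.
        history.sort(key=lambda pair: pair[1], reverse=True)
        candidates = llm_propose(history[:keep_top])
        history += [(p, vlm_accuracy(p)) for p in candidates]
    return max(history, key=lambda pair: pair[1])  # best (prompt, score) found
```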

The second project Dr. Mirza is excited about is a paper called “Can We Talk Models Into Seeing the World Differently?” This research analyzed how visual features are processed in VLMs, what exactly these models are looking at, and when they might be, for example, more biased toward making decisions based on texture vs. shape. Dr. Mirza and his colleagues then investigated whether the model outputs could be steered to consider texture more than shape, or vice versa. “How we do this is super simple, because we have the language component in our VLMs. We can alter the prompt to shift the texture bias or shape bias of the model.” In this way, they showed that VLMs can be steered by prompt engineering. As with GLOV, this project showed “these VLMs have more degrees of freedom, which is interesting going forward if we have more modalities (like speech or audio). You can do a lot of things just by altering a simple textual prompt.” This research has potential applications in medical imaging, where a radiologist could steer a VLM to prioritize tumor shape over image texture, or in autonomous driving, where directing a car to pay more attention to signs and pedestrians (shapes) than shadows or road quality (textures) could enhance safety and robustness.
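As a simple illustration of this kind of prompt-level steering, the snippet below shows how the same image and model can be nudged toward shape or texture cues just by changing the instruction; `query_vlm` is a stand-in for any chat-style VLM interface, and the prompt wording is an assumption rather than the exact phrasing from the paper.

```python
# Illustrative only: steering a VLM's texture/shape bias purely through the text prompt.
# `query_vlm(image, prompt)` is a placeholder for any chat-style VLM API; the wording
# below is an assumption, not the prompts used in the paper.

NEUTRAL = "What object is shown in this image?"
SHAPE_STEERED = ("Identify the object in this image based only on its overall shape "
                 "and outline; ignore surface texture, color, and material.")
TEXTURE_STEERED = ("Identify the object in this image based only on its surface texture "
                   "and material; ignore its overall shape.")

def classify_with_bias(query_vlm, image, bias=None):
    """Same image, same model; only the instruction shifts the shape/texture bias."""
    prompt = {"shape": SHAPE_STEERED, "texture": TEXTURE_STEERED}.get(bias, NEUTRAL)
    return query_vlm(image, prompt)
```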

Dr. Mirza’s third project diverges from prompt optimization, although it still falls within multimodal learning. This paper, titled “Teaching VLMs to Localize Specific Objects from In-context Examples,” explores how to achieve personalization in VLMs. In-context learning is a popular technique in natural language processing where practitioners provide a few examples directly in the prompt to show the model what kind of output is required. “But we find that, although VLMs are built upon LLMs, they do not inherit this ability of in-context learning.” Dr. Mirza and his team set out to teach VLMs to learn from context by fine-tuning the model on a curated set of images so it can locate a personalized object, like a specific cat or mug. The project, which Dr. Mirza says was the first to address this particular challenge, has “huge potential applications in image generation because there you want to have this aspect of personalization,” something that is currently a problem for most commercially available models. For example, this work could help automatically find and tag personal items in photo libraries, isolate a specific actor or prop in film editing, track products or tools across surveillance footage in warehouses, or follow a specific object of concern (e.g., a child near a vehicle or a pet in the rear view) across a car’s camera views.
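A rough sketch of how such in-context localization could be posed to a VLM is shown below; the `vlm_generate` interface, prompt wording, and bounding-box format are illustrative assumptions, not the pipeline from the paper.

```python
# Rough sketch of in-context, personalized object localization with a VLM.
# `vlm_generate(images, text)` is a placeholder for a multi-image VLM interface;
# the prompt wording and [x1, y1, x2, y2] box format are assumptions for illustration.

def localize_personal_object(vlm_generate, support_images, support_boxes, query_image,
                             name="my mug"):
    """Show the VLM a few labeled examples of one specific object, then ask it to
    return that object's bounding box in a new image."""
    context_lines = [
        f"Example {i + 1}: {name} is at box {box}."
        for i, box in enumerate(support_boxes)
    ]
    prompt = ("\n".join(context_lines)
              + f"\nNow locate {name} in the final image and answer with a box "
              "[x1, y1, x2, y2].")
    return vlm_generate(images=list(support_images) + [query_image], text=prompt)
```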

Dr. Mirza emphasizes that all these papers are open source, and the code is available on GitHub. He hopes people will try out the models and join the communal effort of answering the exciting question: “where do we go from here?”

THE FUTURE OF THE FIELD: CHALLENGES & NEXT STEPS
Dr. Mirza believes everything is going in a multimodal direction. “Models have already become more and more multimodal, but in the next couple of years I think this trend will skyrocket.” However, there are challenges to address along the way. “The biggest challenge is definitely going to be the compute factor,” Dr. Mirza laments, explaining that “once you have this LLM paired with a very strong vision processing backbone, it really quickly gets out of hand.” On the technical side, unlike in natural language processing, which Dr. Mirza feels is “quite ahead,” the vision community “still hasn’t really figured out how to best incorporate vision models into LLMs.” There currently isn’t a clear consensus on the best way to integrate the vision modality into language models, so more research is needed in that area. Also, with AI models across the board growing in size and parameters, data is becoming a major concern. “Where is this data coming from? Let’s say we scrape all the data and we have learned all that there is to learn from the Internet. At one point, we will definitely run out of data.” Finally, Dr. Mirza is concerned about the ethics and safety of multimodal AI. As image and video generators become better, misinformation and false content are spreading rapidly and eroding public trust. Despite his excitement about AGI, he admits, “this is something I also fear.”

While Dr. Mirza isn’t yet sure if he’s going to stay in academia or enter industry, one thing’s for sure: “whatever happens, I will still be in research.” He loves the thrill of invention and is grateful to be working in an unprecedented historical time. “There is no dull day in the field of machine learning, computer vision, multimodal learning, or whatever subfield you take.” Keeping up with this rapid rate of progress keeps Dr. Mirza excited and motivated, and he’s ready to be a part of what comes next. 

Learn more about Dr. Mirza on his website or CSAIL Page.