Written By: Audrey Woods
Introduction
Launched in September 2020 and entering its third year, MachineLearningApplications@CSAIL is a CSAIL Alliances research initiative focused on the development and application of machine learning solutions to industry problems. Like other Alliances initiatives, this program brings together a small group of companies around a topic that is important to industry and where CSAIL research is at a stage where it can make an impact. Partnering with industry members gives researchers real-world problems to work on, and joining an initiative “broadens the industry partner’s experience across the lab,” says CSAIL Alliances Managing Director Lori Glover, exposing them to projects and ideas they would not otherwise have seen but that might prove useful to their businesses. On December 5th, 2022, those involved in MachineLearningApplications@CSAIL—both on the industry side and in academia—gathered to share the work they’ve been doing and the impact these projects could have.
Professor Daniela Rus: Keynote
Speaking on the current state of artificial intelligence (AI) and machine learning (ML), CSAIL Director Professor Daniela Rus said, “AI is impacting just about any field. No matter what industry you’re in, chances are you have been impacted by AI.” She went on to explain how, despite the exciting advancements in this area of computer science, there are still major challenges left to address, especially when applying ML in safety-critical situations such as autonomous driving and medical assistance. Some of these challenges include acquiring enough data to cover unusual edge cases, understanding how a model works so failure modes can be anticipated, and ensuring models can adapt to new or changing environments. Professor Rus elaborated that “today’s greatest advances are due to decades-old ideas that are enhanced by vast amounts of data and computation… We need new ideas, because otherwise we’re going to keep plowing the same field.”
Walking through a few of these new ideas, Professor Rus first introduced how her team is changing the autonomous driving pipeline. Currently, most programs for self-driving vehicles “break up the problem, solve each problem individually, and then stitch everything together.” This, Professor Rus explained, restricts the application because parameters are needed for every step of the process and the program has limited adaptability to scenarios it hasn’t explicitly been trained on. To address this problem, CSAIL researchers developed a machine learning-based approach, showing that it is possible for a model to directly interpret sensor data and actuate on that information without needing to be trained on every step in between. In other words, they created a model that doesn’t need specific training for driving at night or during the day, or in winter vs. summer conditions. As Professor Rus says, “you just drive.”
However, such learning-based approaches require enormous volumes of data. Simply sending a car out on the road isn’t enough to cover all situations and environments it might face in action, especially things like crashes, erratic drivers, and dangerous conditions. Therefore, Professor Rus and her team developed Virtual Image Synthesis and Transformation for Autonomy (VISTA). Using simulated data, they were able to give the model an array of experience, teaching it to drive in various seasons, with diverse sensor data, and even respond to crash or near-crash scenarios. Professor Rus explained how VISTA is just the beginning and that the trend of using simulators for cheaper, comprehensive, and efficient training is an exciting new frontier of AI research.
Professor Rus’s group is also developing new kinds of neural networks that enable this direct learning-to-actuation pipeline. She explained how the neural networks of today are often black boxes, with so many nodes and connections that it’s impossible “for users of the system to look inside the box and see what is happening.” This makes it difficult to anticipate failure modes tied to rare inputs that could lead to potentially catastrophic consequences. But, according to Professor Rus, one goal of the Machine Learning Applications initiative has been the creation of explainable models with robustness across multiple tasks and situations. As an example, she walked the audience through a project in which researchers reduced a model with thousands of neurons and an erratic attention map—one that attended to the bushes around a road instead of the road itself while driving—to a model with only 19 neurons and a much cleaner attention map focused on what a person would focus on while driving. This streamlined solution is made possible by continuous-time neural networks, which are adaptable, understandable, and causal. These new models can “dynamically adapt after training, adapt to the nature of the task, [and] relate cause and effect,” bringing us that much closer to truly intelligent machines. The limitation of continuous-time neural networks is that they only work on time series data, although researchers are currently expanding on this idea to create models that will work on static data as well.
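To make the idea concrete, below is a minimal numerical sketch of the kind of continuous-time (“liquid”) neuron update underlying these models, assuming a simple Euler discretization; the dimensions, weights, and input signal are illustrative stand-ins rather than the actual architecture Professor Rus described.

```python
import numpy as np

def ltc_step(x, I, W, tau, A, dt=0.05):
    """One Euler step of a liquid time-constant (LTC) style neuron update.

    x   : hidden state, shape (n,)
    I   : input drive to each neuron, shape (n,)
    W   : recurrent weights, shape (n, n)
    tau : base time constants, shape (n,)
    A   : equilibrium/bias term, shape (n,)
    The state relaxes toward A at a rate modulated by the synaptic input,
    which is what makes the effective time constant input-dependent.
    """
    f = np.tanh(W @ x + I)                 # nonlinear synaptic activation
    dxdt = -(1.0 / tau + f) * x + f * A    # input-dependent decay toward A
    return x + dt * dxdt

# Toy usage: 19 neurons processing a short stand-in sensor time series.
rng = np.random.default_rng(0)
n = 19
x = np.zeros(n)
W = 0.1 * rng.standard_normal((n, n))
tau, A = np.ones(n), rng.standard_normal(n)
for t in range(100):
    I = 0.5 * np.sin(0.1 * t) * np.ones(n)  # illustrative input signal
    x = ltc_step(x, I, W, tau, A)
print(x[:5])
```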
Professor Rus ended by emphasizing how these new approaches “show great promise” and have applications in multiple fields beyond robotics, including finance and text-based tasks. “The model itself is amenable to any kind of data when we have a time series representation,” she said. Overall, Professor Rus summarized, “the two trends to keep in mind are (1) using simulation-based training to increase your data and to capture corner cases when those are difficult to capture in real life and (2) thinking beyond the traditional methods for machine learning for new kinds of models… where we have causality, where we have explainability, where we have compactness, and we have fast training and inference.”
She concluded, “if we stretch ourselves a little bit, we can actually develop foundational solutions that hopefully will take us beyond where we can go with existing models today.”
More Efficient Machine Learning
Computational Limits of Deep Learning: Real-time inference on edge devices via Adaptive Model Streaming
Representing Professor Mohammad Alizadeh’s lab, PhD student Mehrdad Khani Shirkoohi presented recent advancements in real-time inference on edge devices. He pointed out how edge devices such as cellphones, drones, or robots are limited in both battery and computing power when it comes to running and training large neural networks. One solution he presented was an approach called Adaptive Model Streaming, which uses a remote server to continuously train and adapt a smaller model running on an edge device. This process improves the smaller model’s performance at a moderate bandwidth cost and can be applied to any problem with temporal locality, meaning any situation in which a program is repeatedly accessing the same data.
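As a rough illustration of the adaptive-model-streaming idea, the sketch below shows a server-side adaptation round in which a large “teacher” network pseudo-labels recently sampled edge data and fine-tunes a copy of the lightweight edge model before streaming the updated weights back. The tiny linear models and random features are placeholders, not the actual Adaptive Model Streaming implementation.

```python
import torch
import torch.nn as nn

# Stand-in models: in practice the "teacher" would be a large server-side
# network and the "student" the lightweight model running on the edge device.
teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
student = nn.Sequential(nn.Linear(32, 10))
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

def server_round(frames):
    """One adaptation round: label recent edge samples with the teacher,
    fine-tune the student on them, and return the updated weights
    (which would be streamed back to the device)."""
    with torch.no_grad():
        pseudo = teacher(frames).argmax(dim=1)       # teacher pseudo-labels
    for _ in range(5):                               # a few gradient steps
        opt.zero_grad()
        loss = nn.functional.cross_entropy(student(frames), pseudo)
        loss.backward()
        opt.step()
    return student.state_dict()                      # "stream" to the edge

# Toy usage: a batch of recent feature vectors sampled from the edge stream.
updated_weights = server_round(torch.randn(64, 32))
```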
However, a problem occurs when the model is forced to adapt to a significant and abrupt scene change, such as a car driving into a tunnel. The necessary retraining for such a new situation can be expensive, resource-intensive, and delayed. But Shirkoohi explained how he and fellow researchers saw an opportunity to reuse already specialized models in a process called Responsive Resource-Efficient Continuous Learning for Video Analytics (RECL). In RECL, a “model zoo” is created which includes the previous expert models trained for all edge devices. When an abrupt situation change happens, such as the example of the car entering a tunnel, RECL rapidly selects an accurate model from the model zoo and dynamically optimizes the process of updating the edge device model. Using RECL, a system’s response time, accuracy, and resource efficiency can be significantly enhanced, improving the performance of models on commonly used edge devices.
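The model-zoo selection step can be sketched in a similar spirit: score each stored expert on a small buffer of data from the new scene and warm-start adaptation from the best match. The scoring rule and stand-in models below are assumptions for illustration, not RECL’s actual gating logic.

```python
import torch
import torch.nn as nn

def select_expert(model_zoo, frames, teacher):
    """RECL-style gating sketch: score each previously trained expert on a
    small buffer of recent samples, using the large server-side teacher as
    the reference labeler, and return the best match as the warm start for
    further adaptation."""
    with torch.no_grad():
        reference = teacher(frames).argmax(dim=1)
        scores = {
            name: (expert(frames).argmax(dim=1) == reference).float().mean().item()
            for name, expert in model_zoo.items()
        }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage with stand-in models and random "frames" (e.g. captured just
# after the car enters a tunnel).
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
zoo = {"daytime": nn.Linear(32, 10), "tunnel": nn.Linear(32, 10)}
best_name, agreement = select_expert(zoo, torch.randn(16, 32), teacher)
print(best_name, agreement)
```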
MLA: Data-Efficient, Debiased, Social Machine Learning
CSAIL Research Scientist Andrei Barbu presented some of the work happening in Principal Research Scientist Boris Katz’s group to improve training data—especially in image recognition—and use human learning patterns to better understand the future of machine learning. He began by introducing ObjectNet, a database of 50,000 images of objects in unexpected locations or positions, assembled with an automated tool that made the collection easier to scale. Using an app and Amazon Mechanical Turk, the researchers captured a wide range of objects outside of their normal placement—and therefore more difficult for AI to identify—like knives in bathrooms or chairs on their side. ObjectNet is useful for illustrating how brittle most image detection programs actually are (many of the off-the-shelf systems went from 75% accuracy to 11% accuracy when tested on ObjectNet) and opens up avenues for future research to create more human-like computer vision.
One example of further research Dr. Barbu went on to describe is the group’s work calibrating the difficulty of an image dataset. To do this, researchers showed images to people for randomly varied durations, from 17 milliseconds to 10 seconds. “Easy” images were ones that could be labeled in the shorter timeframes, while “hard” images took up to the full 10 seconds to identify. Most of the image data tested—in both ImageNet and ObjectNet—fell into the easy category, meaning current training datasets generally lack the kind of hard examples that would allow machines to make gains on what humans can do. This research also offers a way for developers to measure the difficulty curve of images they might want to classify with ML and anticipate which use cases will be most difficult. Dr. Barbu ended by touching on the Aligned Multimodal Movie Tree-Bank project, which compiles data on 40+ hours of subjects watching movies with implanted electrodes, offering another resource to computer scientists wishing to understand how humans process visual and audio data.
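A simplified version of this viewing-time calibration could look like the following, where hypothetical human trials are reduced to the shortest presentation time at which each image was correctly identified; the time bins and trial format are illustrative assumptions, not the study’s protocol.

```python
from collections import defaultdict

def difficulty_curve(trials, time_bins=(17, 50, 100, 250, 1000, 10000)):
    """Bucket images by the shortest presentation time (ms) at which human
    labels were correct. `trials` is a list of (image_id, presentation_ms,
    was_correct) tuples; the bin edges here are illustrative."""
    shortest_correct = {}
    for image_id, ms, correct in trials:
        if correct and ms < shortest_correct.get(image_id, float("inf")):
            shortest_correct[image_id] = ms
    histogram = defaultdict(list)
    for image_id, ms in shortest_correct.items():
        edge = next(b for b in time_bins if ms <= b)
        histogram[edge].append(image_id)
    return histogram  # images keyed by their difficulty bin (ms)

# Toy usage: one "easy" image solved at 17 ms, one "hard" image needing 10 s.
print(difficulty_curve([("cat.jpg", 17, True),
                        ("knife_in_bathroom.jpg", 100, False),
                        ("knife_in_bathroom.jpg", 10000, True)]))
```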
Mining the Pareto Frontier for More Efficient ML Models
PhD Student Karima Ma began her presentation by introducing the challenge of demosaicking, the process of reconstructing a full-color image from raw camera data in which each pixel records only one of red, green, or blue. As one might imagine, this is a critical step in image processing, but it comes with some significant challenges, namely the balance between speed and image quality. Today’s state-of-the-art demosaicking programs that produce the best image quality are expensive to run and require significant processing power. On the other hand, the quick and cheap demosaicking programs that we have, for example, in our cellphones often produce jarring artifacts like moiré or maze-like patterns. This tradeoff is inevitable, but ideally there would be a way for developers to find a program that fits their constraints, optimizing for a specific set of requirements.
To address this, Ma and Professor Jonathan Ragan-Kelley, in collaboration with researchers at Adobe and UCSD, created a method to automatically search over demosaicking algorithms to find the Pareto frontier, thereby helping users pick the optimal algorithm for a given limitation in computing power or resolution needs. This system performs multi-objective genetic search over differentiable programs to optimize both speed and quality. Their method offers significant gains over state-of-the-art demosaicking methods at a fraction of the computational cost, combining features of both classical and deep learning-based algorithms into more efficient hybrid programs that Pareto-dominate prior approaches. Ma showed how their system can be applied to any raw camera data pattern, such as Bayer or X-Trans, and can also generalize to other image processing tasks like super-resolution.
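The core Pareto-frontier idea can be illustrated with a short sketch that, given candidate programs measured on (runtime cost, reconstruction error), keeps only those not dominated by any other candidate. The candidates and numbers below are made up; the real system searches over differentiable programs rather than scoring a fixed list.

```python
def pareto_frontier(programs):
    """Given candidate demosaicking programs measured as
    (name, runtime_cost, error), return those not dominated by any other:
    no other program is both at least as fast and at least as accurate,
    and strictly better in one of the two."""
    frontier = []
    for name, cost, err in programs:
        dominated = any(
            (c <= cost and e <= err) and (c < cost or e < err)
            for _, c, e in programs
        )
        if not dominated:
            frontier.append((name, cost, err))
    return sorted(frontier, key=lambda p: p[1])

# Toy usage: a fast-but-noisy filter, an expensive deep model, and hybrids.
candidates = [("bilinear", 1, 9.0), ("deep_net", 120, 1.2),
              ("hybrid_a", 10, 3.0), ("hybrid_b", 12, 5.0)]
print(pareto_frontier(candidates))  # hybrid_b is dominated by hybrid_a
```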
Leveraging Data for Better Machine Learning Applications
LANCET: Labeling Complex Data at Scale
Addressing the problem of labeling large datasets, Research Scientist Lei Cao from the Data Systems Group introduced LANCET, a system for labeling complex data at scale. Dr. Cao explained how labeling data is a “key bottleneck that prevents cutting edge machine learning techniques such as deep learning from being useful in enterprise applications.” This manual process can be costly, difficult, and, in many fields, dependent on getting the limited time of domain experts like doctors or engineers. LANCET offers an automated approach by solving three key problems in data labeling: automatically generating labels, selecting label candidates (knowing which objects humans should manually label), and deciding when labeling can stop. With this approach, LANCET has achieved a 40% improvement over current methods such as Snuba and GOGGLES across a variety of datasets. Dr. Cao went on to discuss another data labeling system that has come out of Professor Samuel Madden’s lab called RITA (publication forthcoming), which pre-trains medical AI models to automatically detect seizures from EEG data, minimizing the demand for expert labeling.
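As a hedged illustration of the label-candidate-selection step, the sketch below uses a generic uncertainty heuristic (prediction entropy) to pick which unlabeled objects a domain expert should label next; LANCET’s actual selection criterion is more sophisticated than this.

```python
import numpy as np

def select_for_human_labeling(probabilities, budget=10):
    """Pick the most uncertain unlabeled items (highest prediction entropy)
    for a domain expert to label. `probabilities` has shape
    (num_items, num_classes); this is a generic active-learning heuristic."""
    probs = np.clip(probabilities, 1e-12, 1.0)
    entropy = -(probs * np.log(probs)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]  # indices, most uncertain first

# Toy usage: 5 unlabeled items, 3 classes; item 2 is the most ambiguous.
pool = np.array([[0.98, 0.01, 0.01],
                 [0.70, 0.20, 0.10],
                 [0.34, 0.33, 0.33],
                 [0.90, 0.05, 0.05],
                 [0.60, 0.30, 0.10]])
print(select_for_human_labeling(pool, budget=2))  # -> [2 4]
```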
The second half of Dr. Cao’s presentation focused on reducing labeling costs in data curation. Right now, Dr. Cao said, “data scientists spend 80% of their time on data curation,” the process of finding, merging, and cleaning up data to create a workable dataset. Using the example of merging two large datasets, he pointed out how an algorithm might not understand that two entries are the same if the text is slightly different, or might fail to resolve an error humans would immediately pick up on, like a mistakenly entered address. Inspired by GPT-3, Dr. Cao and other researchers combined the strengths of an untrained GPT-type model and a specifically trained model into a generic entity resolution model, which has the accuracy of the specific model but doesn’t require any task-specific training. During this research, they were also able to speed up the generic model from 60 hours to 10 seconds using a Siamese network. Dr. Cao finished by describing how the next step of this research is to design a single generic model for multiple data labeling, sharing, and curation tasks.
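A rough sketch of the Siamese-style speedup is shown below: every record is embedded once with a shared encoder, and matches are declared by similarity, so an expensive model never has to score every record pair jointly. The toy character-frequency embedding and the similarity threshold are stand-ins for a real pretrained encoder, not the group’s actual pipeline.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def resolve_entities(records_a, records_b, embed, threshold=0.9):
    """Siamese-style matching sketch: encode every record once with a shared
    embedding function, then declare two records the same entity when their
    embeddings are close enough."""
    emb_a = [embed(r) for r in records_a]  # each record embedded only once,
    emb_b = [embed(r) for r in records_b]  # which is what makes this fast
    matches = []
    for i, ea in enumerate(emb_a):
        j = max(range(len(emb_b)), key=lambda k: cosine(ea, emb_b[k]))
        if cosine(ea, emb_b[j]) >= threshold:
            matches.append((records_a[i], records_b[j]))
    return matches

# Toy usage with a character-frequency "embedding" standing in for a real encoder.
def toy_embed(text):
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v

print(resolve_entities(["77 Mass Ave, Cambridge"],
                       ["77 Massachusetts Ave, Cambridge"], toy_embed, 0.9))
```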
Few-Shot Learning with Large Language Models: Efficiently Adapting to New Domains
Assistant Professor Yoon Kim started by discussing how natural language processing (NLP) models have traditionally been viewed: as siloed programs designed for particular tasks. Professor Kim explained that up until recently, NLP models were data-hungry and time-consuming to create, requiring massive, annotated datasets for a task-specific situation. However, in recent years there’s been what he called a “paradigm shift” where predictive language models can perform tasks on unlabeled raw data with minimal or even zero additional training. This research has led to three fundamental questions. First, can generically trained large language models (LLMs) work on specialized domains such as clinical notes? Second, can LLMs work for structured output tasks, like creating a bullet-point list of medications? And third, how can these models be deployed cost-effectively?
To answer these questions, Professor David Sontag walked through his group’s research using LLMs as clinical information extractors. Professor Sontag first explained how extracting information from clinical notes could be enormously useful when it comes to designing personalized medical solutions or studying the medical outcomes of certain treatments. But clinical text is notoriously messy and difficult to process without expert annotation. However, Professor Kim, Professor Sontag, and other researchers have shown that LLMs such as GPT-3 can perform few-shot and even zero-shot information extraction on clinical notes, can produce structured outputs, and can be pruned down and tweaked for cost-efficient deployment, as with a model they built that was over a thousand times smaller than GPT-3. While the focus of this research has been clinical data, Professor Sontag elaborated, “these methods that we developed in the health care domain will work out of the box in other domains as well.”
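The few-shot setup can be pictured with a small prompt-building sketch like the one below, where hand-written (note, medication list) demonstrations are prepended to a new clinical note before the text is sent to a large language model; the prompt wording and the hypothetical examples are assumptions for illustration, not the group’s actual prompts.

```python
def build_extraction_prompt(note, examples):
    """Assemble a few-shot prompt for extracting medications from a clinical
    note as a bullet list. `examples` are hand-written (note, medications)
    demonstration pairs; zero-shot extraction corresponds to an empty list."""
    parts = ["Extract the medications mentioned in each clinical note as a bullet list.\n"]
    for ex_note, meds in examples:
        parts.append(f"Note: {ex_note}\nMedications:\n" +
                     "\n".join(f"- {m}" for m in meds) + "\n")
    parts.append(f"Note: {note}\nMedications:\n")
    return "\n".join(parts)

# Toy usage: one hand-written demonstration, then the note we want structured.
demo = [("Pt started on metformin 500mg BID; continue lisinopril.",
         ["metformin 500mg", "lisinopril"])]
prompt = build_extraction_prompt(
    "Ibuprofen prn for pain, hold warfarin before procedure.", demo)
print(prompt)
# The resulting prompt would then be sent to a GPT-3-class language model.
```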
Off-the-Shelf Natural Language Explanations of Deep Networks
Offering a different perspective on NLP research, Professor Jacob Andreas introduced his group’s work using language tools to explain the function of individual neurons in deep neural networks. Today’s deep learning architectures are very good at many applications, but the challenge is understanding exactly what’s happening inside. Professor Andreas explained, “as these models get better and better, it becomes important to develop tools that let us answer questions like: what is this model trying to do here? What has it learned from my training set?” To address this issue, Professor Andreas and fellow researchers developed a general-purpose technique to automatically label what individual neurons are identifying in an image.
There are several important applications of this research. Knowing what individual neurons are doing allows the user to determine when models are picking up on features of the training data that developers specifically don’t want them to be sensitive to. For example, in applications that require processing images of people, it can be important to turn off the neurons registering faces to preserve privacy and prevent bias. Furthermore, this research can help defend against text-based adversarial attacks, in which an image carries a word that doesn’t describe what’s in the image, for example a picture of an apple with “iPod” written on it. By turning off the neuron responsible for text processing, the model can be made more robust in that dimension. Finally, this system can help users locate neurons that are processing surprising combinations, like both hermit crab shells and tanks, anticipating failure modes and creating more accurate models. Professor Andreas finished by introducing the next steps of this research, which would be to automate the process of flagging surprising neuron descriptions, making it even easier to locate and deal with problematic neurons.
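One common way to automate neuron labeling, in the spirit of this line of work, is to compare a neuron’s thresholded activation maps against segmentation masks for candidate visual concepts and report the best-overlapping concept. The NetDissect-style sketch below uses synthetic masks and an illustrative threshold rather than the group’s exact procedure.

```python
import numpy as np

def label_neuron(activation_maps, concept_masks, threshold=0.5):
    """Label a neuron with the visual concept whose segmentation masks best
    overlap the neuron's thresholded spatial activations (highest IoU).
    `activation_maps` has shape (num_images, H, W); each entry of
    `concept_masks` is a boolean array of the same shape."""
    binarized = activation_maps > threshold
    best_concept, best_iou = None, 0.0
    for concept, masks in concept_masks.items():
        inter = np.logical_and(binarized, masks).sum()
        union = np.logical_or(binarized, masks).sum()
        iou = inter / union if union else 0.0
        if iou > best_iou:
            best_concept, best_iou = concept, iou
    return best_concept, best_iou

# Toy usage: 4 images of size 8x8; the neuron fires where "text" pixels are.
rng = np.random.default_rng(1)
text_masks = rng.random((4, 8, 8)) > 0.7
activations = text_masks * 0.9 + rng.random((4, 8, 8)) * 0.1
print(label_neuron(activations, {"text": text_masks,
                                 "face": rng.random((4, 8, 8)) > 0.7}))
```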
Responsible Machine Learning
Responsible AI: Can dropping a little data change results and conclusions?
Professor Tamara Broderick began by addressing the exciting capabilities of modern machine learning. She said the algorithms being developed today could have “life-changing applications” in medicine, cybersecurity, financial policy, and more. However, this creates a burden of responsibility on those creating the machine learning models to make sure they’re creating AI that will truly improve quality of life. One aspect of this responsibility is to ensure that conclusions made from training data will generalize to new data encountered in the future. For example, suppose that researchers analyzed data to conclude that an economic intervention (like administration of microcredit) increased small-business profit on average. But it turned out that removing a single data point from a data set of tens of thousands of training data points changed the conclusion: without the data point, the same machine learning method concludes that the economic intervention hurts small-business profit. In this case, the researcher should be concerned by how much that single data point is driving the conclusion and worry about how much one can trust this economic intervention to be helpful at new places and times. It is therefore important to know whether there exists a very small subset of data that drives conclusions. But manually testing every small subset of data requires an astronomical computational cost, even for moderately sized data sets. To address this problem, Professor Broderick presented a fast method to automatically check whether a very small fraction of data can be removed to substantively change conclusions. Her group has tested this method on several famous economic studies; in some cases, they were able to overturn conclusions by removing less than 0.1% of the data set.
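A toy version of this robustness check is sketched below for ordinary least squares: approximate each point’s influence on the coefficient of interest with a standard first-order formula, drop the most influential small fraction, refit, and compare the coefficients. This is a simplified stand-in on illustrative data, not the group’s actual method or code.

```python
import numpy as np

def most_influential_drop(X, y, coef_index=1, frac=0.001):
    """Approximate each data point's influence on one OLS coefficient with a
    first-order (leave-one-out) formula, drop the small fraction of points
    pushing that coefficient upward the most, refit, and return the original
    and refitted coefficient so their signs can be compared."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    residuals = y - X @ beta
    influence = (X @ XtX_inv)[:, coef_index] * residuals   # per-point effect
    k = max(1, int(frac * len(y)))
    keep = np.argsort(influence)[:-k]                       # drop top-k upward
    beta_refit = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta[coef_index], beta_refit[coef_index]

# Toy usage: a near-null "treatment effect" driven by a handful of outliers.
rng = np.random.default_rng(0)
n = 10000
treat = rng.integers(0, 2, n).astype(float)
profit = rng.standard_normal(n)
profit[:5] += 50 * treat[:5]                                # a few huge wins
X = np.column_stack([np.ones(n), treat])
print(most_influential_drop(X, profit, coef_index=1, frac=0.001))
```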
One thing such a method can defend against is the notorious issue of p-hacking, where data analysts run various analyses until they get a statistically significant result. Professor Broderick explained how her group’s method has proven useful in checking against coincidental false positives and can be a defense against p-hacking. She finished by emphasizing that, while her presentation focused on economics because of its rigorous rules about sharing code and data, the concerns she presented are not specific to economics. Other areas, such as medical AI, can sometimes struggle with reproducibility, a prerequisite for checking the kind of robustness she described. Therefore, she and other researchers have proposed a taxonomy of methods which developers could use to increase trust at each step of the machine learning process.
Uncovering Unexpected Behavior of ML Models Pre-Deployment
When deploying ML models, it’s important to think about potential failure modes in advance. Knowing when a model might make a mistake, and being able to anticipate and alter training to adjust for these potential error points, can help create more trustworthy algorithms. However, doing so can be both difficult and time-consuming. PhD Student Saachi Jain laid out the problem with an example image-recognition model trained to recognize cats. A common failure mode in such a model could be that it comes to associate cats with being inside and might label an image of a cat outside as, say, a dog. A current method of identifying this issue would be to create a saliency map of every image and manually inspect each one in the dataset, which is infeasible at scale. Instead, Jain presented a new method of automatically distilling a model’s failure modes by representing errors as directions in latent space, extracting an axis that defines the direction of failure. This allows developers to not only identify but also caption challenging subpopulations within the dataset.
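A minimal sketch of the latent-direction idea, assuming precomputed embeddings and a record of which examples the base model got wrong, is shown below: it fits a linear classifier separating errors from correct predictions and treats its weight vector as the failure direction. The synthetic “indoor/outdoor” dimension is an illustrative assumption, and this is a simplified stand-in for the method presented, not the authors’ code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def failure_direction(embeddings, model_correct):
    """Fit a linear classifier separating the base model's mistakes from its
    correct predictions within one class; its weight vector points toward the
    error-prone subpopulation, and the examples most aligned with that
    direction are the hard cases."""
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(embeddings, model_correct.astype(int))      # 1 = correct, 0 = error
    direction = -svm.coef_[0]                           # points toward errors
    direction /= np.linalg.norm(direction)
    hardness = embeddings @ direction                   # per-example alignment
    return direction, np.argsort(hardness)[::-1]        # hardest first

# Toy usage: "cat" embeddings where one latent dimension encodes indoor/outdoor
# and the model fails mostly on the outdoor (high-value) side.
rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 8))
outdoor = emb[:, 3] > 0.5
correct = ~outdoor | (rng.random(200) > 0.8)            # outdoor cats usually missed
direction, ranked = failure_direction(emb, correct)
print(direction[3], ranked[:5])                         # dimension 3 should dominate
```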
The advantage of knowing which data a model will struggle with is that one can then train the model to specifically target this weakness. Jain explained how, using the caption for their model’s failure mode, researchers were then able to apply an off-the-shelf image generator to produce images with that same caption and feed them back into the training set. Such a pipeline can automatically augment a dataset in ways that improve the ML model. Taking the research one step further, Jain also explained how the process can be streamlined by cutting out the captioning “middleman,” minimizing potential confusion with, for example, hard-to-label poses. Jain finished by laying out potential next steps of this research: leveraging a language model to generate candidate options and disentangling multiple failure modes within each class.
Bias-Free ML, Responsible ML, Detecting Bias, and Moving Toward Explainable AI
Presenting with undergraduate student Ariba Khan, CSAIL Research Scientist Amar Gupta spoke about the process of detecting bias in ML models and methods for minimizing the potential harm such bias might create. They first introduced the problem that, while there has been research done to mitigate bias in datasets, hardly any methods have dealt with intersectional fairness, or taking multiple bias parameters into account. Previous approaches, which generally use “fairness through unawareness,” have only looked at one factor and are therefore unable to account for the full complexity of real-world data. To address this, Dr. Gupta and his students created DualFair, a fair loan classifier pipeline. DualFair splits the dataset into subsets based on combinations of sensitive parameters, balances class and label levels to mitigate the accuracy-fairness tradeoff, removes biased datapoints via situation testing with a pre-trained ML model, and then trains using the modified dataset. Tested on public U.S. mortgage data, this method retains relative accuracy while significantly increasing fairness.
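The subgroup-splitting and balancing steps can be sketched roughly as below, where the data is partitioned by every combination of sensitive attributes plus the label, and each partition is oversampled to a common size. The field names and the simple random oversampling are illustrative assumptions rather than DualFair’s exact procedure.

```python
from collections import defaultdict
import random

def balance_subgroups(rows, sensitive_keys=("race", "sex"), label_key="approved"):
    """Split the data by every combination of sensitive attributes and label,
    then oversample so each subgroup reaches the size of the largest one."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in sensitive_keys) + (row[label_key],)
        groups[key].append(row)
    target = max(len(g) for g in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        balanced.extend(random.choices(members, k=target - len(members)))
    return balanced

# Toy usage: a tiny, skewed mortgage-style dataset.
data = [{"race": "A", "sex": "F", "approved": 1}] * 6 + \
       [{"race": "A", "sex": "F", "approved": 0}] * 2 + \
       [{"race": "B", "sex": "M", "approved": 1}] * 3 + \
       [{"race": "B", "sex": "M", "approved": 0}] * 5
print(len(balance_subgroups(data)))  # every subgroup padded to the largest size
```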
Building on that research, Dr. Gupta and Khan spoke about an allied challenge researchers are currently working on: surmounting the bottleneck of addressing both group and individual fairness. To strike a balance, Dr. Gupta discussed how they are treating the machine learning process as a multi-objective optimization problem in which they try to optimize for group and individual fairness simultaneously. Dr. Gupta finished by introducing research identifying bias in opioid prescriptions, in which real-world data was used to demonstrate significant disparities by race and sex, underscoring the urgent need to develop and implement new anti-bias metrics and approaches.
Industry Panel Takeaways
At the end of each series of presentations, a group of industry specialists sat with the researchers to discuss the significance of and considerations for machine learning solutions in the market. Joining from BT, Head of Strategic U.S. University Research Partnerships Steve Whittaker began by saying, “efficiency, to my mind, is the strategic objective of what we need to do now,” emphasizing how the new models and ways of thinking about machine learning problems are “forcing us to rethink our domains through new lenses.” Dr. Ali Payani, Head of Responsible AI Research at Cisco, agreed with Steve on the importance of efficiency, adding that he also sees a trend toward “getting more realistic about what [machine learning] can do” with an emphasis on “connecting academic research to real use cases in the industry.” Adding insight from financial services, Vishal Gossain, Practice Leader of Risk Analytics and Strategy at Ernst & Young, discussed his excitement for privacy-preserving options such as federated learning and methods for debiasing datasets, both of which offer exciting solutions for his field.
One theme that came out of the panel discussions was the importance of, as Gossain touched upon, anticipating and minimizing bias in ML models. There were several questions regarding the tradeoff of accuracy and bias, and Dr. Payani explained how the ability to mathematically code for human intuition about fairness is a “very important problem” that Cisco is currently working on, one which he believes “no one mathematical formula will resolve once and for all.” As a component of countering bias, explainability also came up several times in the conversations. Knowing what’s happening inside models is the first step toward creating fairer and more responsible ML, and while all three industry panelists expressed excitement for the potential of natural language processing, they also emphasized that robust explainability and interpretability are important before any such models can be rolled out into the market. Dr. Payani even hopes future models will be “inherently more explainable by design.”
Finally, the question of how future policy will shape the direction of machine learning applications played a large role in the post-presentation dialogues. Responding to questions about strategy, Professor Broderick and Dr. Gupta both highlighted the importance of considering future policy directions. As Dr. Gupta put it, “we don’t want to think of [machine learning] in isolation.” Relating to the trend of researchers putting their efforts toward real-world applications, there was a consensus about the importance of considering the legal and societal environment in the development of AI models. As Steve Whittaker laid out, there are four main forces that define a model’s success: (1) architectural design, (2) legal framework, (3) market space, and (4) social norms. As these models adopt larger and more important roles in society, it will become ever more important to implement a holistic approach to ML solutions.
Conclusions
In general, participants from both industry and academia agreed that this is a rich, vibrant, and exciting field with many directions for new research and industry use cases to explore going forward. CSAIL Alliances Managing Director Lori Glover ended by emphasizing the importance of sharing the groundbreaking research being done at CSAIL and then, conversely, hearing “how the work we’re doing is actually being used and how it is impactful,” which helps inform future investigation. In short, the day highlighted how industry and academia can partner to develop innovative solutions that address current challenges.