Written By: Audrey Woods
The average consumer might not think much about the machinery that supports the services we use. When we check our email, browse social media, or shop online, most of us don’t realize how much computational equipment goes into hosting our favorite websites. On the business side, cloud computing and data centers are a pivotal part of strategy, as companies must consider their online presence and the functionality of their websites. But even for those in industry, the details of cloud computing can remain abstract.
“When we talk about cloud computing and data centers, we’re usually talking about systems that have at least 10,000 servers, typically more than that, and that host distributed internet scale applications,” says CSAIL Assistant Professor Christina Delimitrou. She explains that these systems can be split into two main categories: private internal data centers, like those of Google, Microsoft, and Facebook, and public ones, which are available for general-purpose rent. From her perspective, the main advantage of cloud computing is cost: it offers users economies of scale and the use of commodity equipment, meaning equipment that doesn’t need to be specialized for a specific purpose. “These are servers that are not unlike what you would have for your desktop or your laptop,” Professor Delimitrou says, the difference being the number of machines in play.
As the hardware and software behind these giant systems evolve to meet current demands and adapt to the changing realities of the computer science field—such as the end of Moore’s Law—Professor Delimitrou’s research aims to improve the performance, efficiency, predictability, and ease of use in cloud computing systems.
Ahead of the Cloud: Finding Her Interest
Professor Delimitrou was first drawn to computer systems in her third year at the National Technical University of Athens when she took a course in computer architecture. She says that she most enjoyed “the intuitiveness behind the design decisions of how systems are designed as well as the fact that large-scale systems like datacenters were very new at the time so the potential for making a real impact on how they should be designed was high.” This led her to study electrical engineering at Stanford, writing her PhD thesis on improving the resource efficiency of cloud computing. After graduating, she went on to become an assistant professor at Cornell University before coming to CSAIL, where she leads the Systems Architecture and Infrastructure Lab (SAIL).
When asked what she finds most fascinating about her field of research, Professor Delimitrou explains that “the scale of the system means that a lot of traditional wisdom we’ve used for years about how systems are designed goes out the window. For example, a lot of traditional architectures use empirical heuristics to determine how various components should be designed and sized. That does not apply in cloud computing because the scale of the system alone would mean that empirical design is completely impractical, so we have to turn to other more automated and reliable techniques.” For this reason, one major focus of her group’s work is applying machine learning techniques to large-scale systems problems like resource management, design, and debugging.
Machine Learning Solutions
Given the size of cloud computing systems, approaching problems manually can be a daunting task. Take, for example, debugging. Current techniques rely on retroactive debugging, which is inefficient, time-consuming, and leaves the system exposed to performance unpredictability that can cause domino effects across related microservices. To address this, Professor Delimitrou’s group is investigating machine learning methods that can quickly diagnose where an issue originates and prevent this negative cascade.
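To illustrate the general idea (this is a toy sketch, not Professor Delimitrou’s actual system), a minimal detector might compare each microservice’s recent latency against its healthy baseline and flag the service that deviates most, localizing a slowdown before it cascades downstream. All service names and latency numbers below are invented:

```python
from statistics import mean, stdev

# Hypothetical per-microservice latency samples (ms): a healthy
# baseline window, and a recent window in which one service degrades.
baseline = {
    "frontend": [12, 11, 13, 12, 11, 12],
    "cart":     [20, 21, 19, 20, 22, 20],
    "database": [35, 34, 36, 35, 33, 35],
}
recent = {
    "frontend": [13, 12, 14],
    "cart":     [21, 20, 22],
    "database": [80, 85, 90],   # the anomaly originates here
}

def anomaly_scores(baseline, recent):
    """Z-score of each service's recent mean latency vs. its baseline."""
    scores = {}
    for svc, samples in baseline.items():
        mu, sigma = mean(samples), stdev(samples)
        scores[svc] = (mean(recent[svc]) - mu) / sigma
    return scores

scores = anomaly_scores(baseline, recent)
culprit = max(scores, key=scores.get)
print(culprit)  # prints "database"
```

A real system would learn far richer signals (request traces, queue depths, dependencies between services) rather than a single z-score, but the goal is the same: automatically pinpoint the root-cause service instead of debugging after the fact.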
Professor Delimitrou also highlights how machine learning might be used to optimize these large systems, which sometimes run at less than 25% capacity. Because the current setup requires users to request the resources they think they’ll need, people err on the safe side and overestimate their utilization. But with machine learning, customers could tell the system what performance their application requires, and the algorithm could allocate the necessary resources, pushing utilization higher and minimizing expensive inefficiency. This would also offer sustainability benefits, since the same number of energy-consuming machines could accommodate more computation.
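A small hypothetical example shows why this matters. Suppose users reserve cores defensively, while a learned model predicts what each job actually needs to hit its performance target; granting the prediction (plus a safety margin) frees capacity for additional work on the same machines. All job names and numbers here are invented for illustration:

```python
# Hypothetical tenants: cores each user reserved vs. cores a learned
# model predicts the job actually needs to meet its performance target.
jobs = [
    {"name": "web",       "reserved_cores": 16, "predicted_cores": 4},
    {"name": "batch",     "reserved_cores": 32, "predicted_cores": 20},
    {"name": "analytics", "reserved_cores": 24, "predicted_cores": 6},
]

TOTAL_CORES = 96
HEADROOM = 1.25  # safety margin to absorb prediction error

# Reservation-based allocation: users over-request to stay safe.
reserved_cores = sum(j["reserved_cores"] for j in jobs)

# Prediction-based allocation: grant the model's estimate plus headroom.
predicted_cores = sum(round(j["predicted_cores"] * HEADROOM) for j in jobs)

freed = reserved_cores - predicted_cores
print(f"reservation-based: {reserved_cores}/{TOTAL_CORES} cores held; "
      f"prediction-based: {predicted_cores}; freed for new work: {freed}")
```

In this toy scenario the cluster holds 72 of 96 cores for three jobs under reservations, but only 38 under prediction-based allocation, freeing 34 cores for more tenants on the same hardware. The hard part in practice, of course, is making the predictions reliable enough that applications still meet their performance requirements.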
However, studying these issues in cloud computing requires frequent and open access to realistic cloud applications, a challenging hurdle when businesses might not want to release their code for open use. Therefore, another important avenue of Professor Delimitrou’s research is designing representative cloud services with systems like Ditto.
Representative Cloud Services: Ditto
Ditto is an application cloning system that creates a proxy of a cloud service without revealing details that would let someone reverse-engineer and copy it, offering one solution to the roadblock of researcher access to proprietary code. “Essentially,” Professor Delimitrou says of Ditto, “we want to clone the performance and resource characteristics, but not clone the individual instructions or individual memory accesses which is where a lot of information leakage would come from.” Unlike previous work in this area, which focused on simple applications running at the user level, the innovation behind Ditto is that it replicates the entire stack of a microservice topology, offering scientists like Professor Delimitrou a way to conduct their research and proving that it is possible to clone such complex applications.
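The core intuition can be sketched in miniature (this toy is not Ditto, which reproduces much richer characteristics such as instruction mix, memory access behavior, and the full microservice topology). Given a measured profile of how a request splits its time between compute-bound and memory-bound work, a synthetic request can reproduce that timing profile without executing any of the original service’s instructions; the profile numbers below are invented:

```python
import time

# Hypothetical profile measured from a proprietary service: how long
# each request spends in compute-bound vs. memory-bound work.
profile = {"compute_ms": 2.0, "memory_ms": 1.0}

def synthetic_request(profile):
    """Mimic the profiled service time of one request using generic
    work, leaking none of the original service's code."""
    # Compute-bound phase: spin on arithmetic for compute_ms.
    deadline = time.perf_counter() + profile["compute_ms"] / 1000
    x = 0
    while time.perf_counter() < deadline:
        x += 1
    # Memory-bound phase: stride through a 1 MiB buffer for memory_ms.
    buf = bytearray(1 << 20)
    deadline = time.perf_counter() + profile["memory_ms"] / 1000
    i = 0
    while time.perf_counter() < deadline:
        buf[i % len(buf)] ^= 1
        i += 641  # large stride to generate real memory traffic
    return x

start = time.perf_counter()
synthetic_request(profile)
elapsed_ms = (time.perf_counter() - start) * 1000
# elapsed_ms is roughly compute_ms + memory_ms (about 3 ms here)
```

The synthetic proxy looks like the original to a scheduler or a hardware vendor benchmarking servers, while the proprietary logic stays private.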
Representative services are not only useful for academics studying large systems; they could also serve cloud providers themselves. Professor Delimitrou says, “if you think about a software cloud provider like Facebook that wants to buy a new generation of servers, they have to share some version of their application to the hardware provider to get an idea of what performance they will get from the next generation of servers.” In use cases like this, where providers don’t want to share proprietary code, applications like Ditto could create a synthetic version to fill that gap.
Overcoming Misconceptions & Looking Ahead
Professor Delimitrou says the biggest misconception about the field is that academics are not well positioned to do major research on cloud computing because they lack access to such systems and can’t replicate their studies at the scale of a real data center. With systems like Ditto, she says, “we’ve tried to remove the first part.” As for the second concern, she points out that the public cloud gives researchers the opportunity to experiment with large-scale systems and find bottlenecks that would have been missed on a smaller setup.
Overall, Professor Delimitrou feels that research such as hers is important because “we can try much more radical ideas when it comes to changing the cloud system stack” than a business or cloud provider might be able to experiment with. She concludes, “I think academia can play a very important role in advancing what cloud computing systems look like.”