WRITTEN BY: Audrey Woods
Statistics is hardly a new field. People have been making formal use of data since numbers were invented, and modern statistics has existed as a discipline for at least a century. However, with the rise of ever larger and more complex datasets, statisticians have been forced to adapt. As CSAIL Assistant Professor Sam Hopkins says, “the whole theory of statistics was basically developed to deal with low dimensional data,” or data where the information collected for each individual data point is relatively limited. But what about datasets built from images, where there are not only many images but many measurements within each image? Or genomic research, where each distinct genome contains vast volumes of useful information? Dealing with such complex datasets requires a new way of thinking, and that is the field of high-dimensional statistics.
According to Professor Hopkins, high-dimensional statistics can be roughly understood as the study of datasets in which “the number of parameters that you measure for each sample is also getting larger as the dataset gets larger.” Extracting useful information from such complex data requires “fundamental changes” to our way of thinking, especially when it comes to the algorithms computer scientists create to process it. Professor Hopkins’s research focuses on “studying computational problems [in] statistics where we don't know algorithms that can solve them and trying to design algorithms to solve them,” with the aim of discovering general-purpose algorithms that can create a new foundation for a field of mathematics “still in its infancy.”
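To make that definition concrete, the toy sketch below (a simple Gaussian example, not drawn from the interview) simulates the high-dimensional regime: the dimension d of each sample grows along with the number of samples n, and even for the basic task of estimating a mean, the error of the standard estimator is governed by the ratio d/n rather than by n alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: let the dimension d grow with the sample size n,
# which is the defining feature of the high-dimensional regime.
for n in [100, 1_000, 10_000]:
    d = n // 10                       # d scales with n instead of staying fixed
    true_mean = np.zeros(d)
    samples = rng.normal(loc=0.0, scale=1.0, size=(n, d))

    # The l2 error of the empirical mean concentrates around sqrt(d / n),
    # so growing n alone no longer guarantees a better estimate.
    err = np.linalg.norm(samples.mean(axis=0) - true_mean)
    print(f"n={n:>6}  d={d:>5}  error={err:.3f}  sqrt(d/n)={np.sqrt(d / n):.3f}")
```

In the classical setting d stays fixed, so more data always helps; here d/n stays constant as the dataset grows, and the error does not shrink at all.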
In his conversation with CSAIL Alliances, Professor Hopkins walked through three major tradeoffs to consider in high-dimensional statistics. First is the robustness-accuracy tradeoff, in which one must consider the “fundamental limit to the amount of accuracy that I can get based on having a dataset which has a certain percent of bad data.” In every dataset, there will inevitably be outliers that violate the basic assumptions underlying the methods used to process the data. In other words, mathematicians must make assumptions to create algorithms that will work over a given corpus of information, but the more data points that violate those assumptions, the “dirtier” the dataset is and the less accurate one can expect the algorithms to be. Therefore, Professor Hopkins explains, “what I’d like to do is design [algorithmic] methods for extracting useful information from a large, noisy dataset [where], as you make the dataset dirtier, the accuracy doesn’t get too much worse.”
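As a rough illustration of that robustness-accuracy tradeoff (a deliberately simple sketch, not Professor Hopkins’s own algorithms, which rely on far more sophisticated techniques), the snippet below contaminates an eps fraction of a dataset and compares a non-robust estimator, the sample mean, with a crude robust one, the coordinate-wise median.

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupted_sample(n, d, eps):
    """n points in d dimensions drawn around the true mean (zero),
    with an eps fraction replaced by adversarial outliers."""
    data = rng.normal(0.0, 1.0, size=(n, d))
    data[: int(eps * n)] = 50.0       # plant outliers far from the true mean
    return data

for eps in [0.0, 0.05, 0.10, 0.20]:
    data = corrupted_sample(n=5_000, d=50, eps=eps)
    mean_err = np.linalg.norm(data.mean(axis=0))       # sample mean: pulled ~ 50*eps in every coordinate
    med_err = np.linalg.norm(np.median(data, axis=0))  # coordinate-wise median: barely moves here
    print(f"eps={eps:.2f}  sample-mean error={mean_err:7.2f}  coordinate-median error={med_err:5.2f}")
```

The sample mean is dragged arbitrarily far by the planted outliers, while the median’s error grows only gradually as the dataset gets dirtier, which is the kind of graceful degradation the quote describes.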
A similar tradeoff that mathematicians must work around is that of privacy and accuracy. In the modern world, privacy is an ever-growing concern, especially in medical and genomic studies. Patients don’t want their data exposed, even indirectly, through the possibility that attackers could reverse-engineer a model and deduce information about them. But to draw conclusions from a given dataset, some information must be released in the form of results. Therefore, Professor Hopkins explains that an ideal algorithm won’t “depend too heavily on any one individual,” meaning “if that individual were or were not in the dataset, the output of that algorithm wouldn’t change too much.” This is similar to the robustness-accuracy tradeoff, where it’s important to be able to remove specific data points (either “dirty” or private) without the conclusions noticeably changing. Professor Hopkins says that he and fellow researchers are pursuing differential privacy by “studying how you can qualitatively control the amount of information you release about any particular individual while still getting good conclusions that reflect the broad statistics of an underlying dataset.” It’s important to note that unlike fully private approaches such as homomorphic encryption, using differential privacy in statistics means “you are trying to release something and mitigate the consequences [instead of] trying to release nothing.”
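A minimal sketch of the differential-privacy idea (the standard Laplace mechanism applied to a bounded mean, with hypothetical data; it is not code from Professor Hopkins’s group) shows how noise calibrated to one individual’s maximum influence keeps the released answer close to the truth while hiding any single person’s presence.

```python
import numpy as np

rng = np.random.default_rng(2)

def private_mean(values, lo, hi, epsilon):
    """Release the mean of values known to lie in [lo, hi] using the
    standard Laplace mechanism. Changing any one individual's value moves
    the true mean by at most (hi - lo) / n, so noise at that scale
    (divided by epsilon) masks each person's contribution."""
    values = np.clip(values, lo, hi)
    sensitivity = (hi - lo) / len(values)
    return values.mean() + rng.laplace(0.0, sensitivity / epsilon)

# Hypothetical bounded measurements, e.g. lab readings clipped to [100, 300].
readings = rng.uniform(100, 300, size=10_000)

for epsilon in [0.1, 1.0, 10.0]:      # smaller epsilon = stronger privacy, more noise
    released = private_mean(readings, lo=100, hi=300, epsilon=epsilon)
    print(f"epsilon={epsilon:>4}: released mean={released:.3f}  true mean={readings.mean():.3f}")
```

Smaller values of epsilon buy stronger privacy at the cost of a noisier release, which is exactly the privacy-accuracy tradeoff described above.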
The third tradeoff is a core theme of Professor Hopkins’s research: the balance between statistical inference and computational complexity. In low-dimensional statistics, the degree to which inference can produce a meaningful result depends largely on the volume of good data available. In high-dimensional statistics, on the other hand, the issue is often not the amount of data but the amount of computing power that can be thrown at the problem. In genomic research, for example, there is an enormous quantity of information available in DNA, but the roadblock is having enough computational power to process such a body of data. Professor Hopkins says, “a big question addressed in my research is: when is that roadblock fundamental? Are there problems in high-dimensional statistics where there is enough information in a dataset in a theoretical sense to extract a useful conclusion from it, but that no computationally efficient algorithm could ever do?” In other words, “there’s all kinds of mathematics that goes into understanding: is this really an information computation gap or have we just not found a better algorithm?”
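A standard textbook example of such an information-computation gap (chosen here for illustration; it is not one cited in the conversation) is the planted clique problem, where a clique of size k is hidden inside a random graph on n vertices:

```latex
\[
\underbrace{k \;\gtrsim\; 2\log_2 n}_{\text{enough information to find the clique}}
\qquad\text{vs.}\qquad
\underbrace{k \;\gtrsim\; \sqrt{n}}_{\text{what known efficient algorithms need}}
\]
```

For clique sizes between roughly 2 log₂ n and √n, the answer is determined by the data, yet no polynomial-time algorithm is known to recover it; whether that gap is fundamental is exactly the kind of question the quote raises.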
Professor Hopkins is focused on creating these better algorithms, exploring how to “squeeze out all of the algorithmic power” that’s possible. He compares this process to an art form, with each high-dimensional statistics problem requiring “an entire PhD’s worth of research to understand what are the right algorithms for doing whatever task [we] want to do.” While the field is still relatively new, Professor Hopkins imagines a future in which there is a set of mathematical tools for high-dimensional statistics comparable to the general-purpose tools we currently have for other areas of mathematics. He hopes that someday there will be a way to “take a newly presented problem and pretty quickly figure out what the optimal algorithm is for this problem.” Ideally, these tools will be predictive, explanatory, and useful for telling statisticians what they need to add to “improve on the guarantees of the best possible algorithm.”
Professor Hopkins concluded with his belief that “we haven’t hit the fundamental limits” of these tradeoffs. “There’s still room for algorithmic improvements,” he says, and he’s excited to find them.