Written by Matthew Busekroos
Originally from China, Eugenie Lai moved to Vancouver, where she studied at the University of British Columbia before starting her PhD at CSAIL. Lai said she was fortunate to experience both research and industry as an undergraduate. She preferred research to industry work because of the autonomy and ownership it gave her over her work. Specifically, she loved the iterative process of extracting and formulating research problems from real-world challenges, as well as the technical aspects of solving open-ended problems.
Lai now works in the Data Systems Group at CSAIL alongside her advisor Professor Mike Cafarella.
“Every advisor I’ve worked with is beyond excellent and Mike is no exception,” Lai said. “One thing I learned to appreciate more is the power of mentorships. To do well in a career in research, doing quality work is the foundation, and so many other skills go on top of that, such as networking, public speaking, leadership, teamwork, etc. I genuinely appreciate Mike for taking on a mentor role to help us develop those skills by giving us feedback and opportunities to practice, in addition to being a research advisor.”
Lai’s current research focuses on developing tools to help data scientists and novices transform and make sense of their data. She said one of the projects she is most proud of is bottom-up standardization for data preparation, which is a system that helps users find a standardized version of their data preparation script.
“From scientific projects in academia to data-driven decision-making in industry, data preparation is an essential step in every data-related effort,” Lai said. “Typically, data preparation is not a contribution in a project — it transforms your raw data into a format that enables further innovative work. But the reality is that data scientists spend up to 80% of their time on data preparation before modeling. This is because data preparation scripts are highly project-specific and often written in general-purpose programming languages. These attributes make them tedious to understand and difficult to verify.”
Lai said that, as a result, data preparation scripts can be a breeding ground for poor engineering and statistical practices (e.g., mistakes go unnoticed), and even when the scripts are perfectly built, it’s difficult for data scientists to reuse them as a whole.
“Ideally, data preparation scripts should be ‘admirably boring’ — they should serve the project, but otherwise be as simple and as standard as possible,” she said. “We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized version of itself. Our framework takes the user’s script not as an unchangeable definition of correctness, but as a sketch of the user’s intent. We embedded this framework in a system with a UI, which allows the user to interact with the system and adopt data preparation steps using their discretion.”
She added that with the rise of data-driven initiatives in fields like healthcare, finance, and public policy, there’s an increasing need to develop tools that make data more accessible to a wide range of people. Data preprocessing is on the critical path of the data-to-insight pipeline, according to Lai. She said another aspect of the potential impact is how data preprocessing and transformation enable and benefit downstream tasks such as machine learning model training and data mining.
Another project Lai is working on, in collaboration with Microsoft Research, is assisting Power BI users with table transformations and BI modeling. She said Power BI is a business intelligence (BI) product that helps users extract business insights from tabular data. The project can help both data scientists and novices save time on the “chores” and get to their downstream tasks faster.
“I’m grateful for having the opportunity to work with real-world user data and to potentially deploy my research on an application with over 100 million users worldwide,” she said.
Lai said one way to look at research is as having two parts: the problem space and the method space.
“Being able to solve real-world challenges is one of my main drivers,” she said. “My problem space is data preprocessing and data transformation. It has been exciting to work in this space because who doesn’t use data in their applications these days? Everybody consciously or unconsciously generates so much data every day, and the data is then consumed by numerous programs. Data preprocessing and transformation are on the critical path of that pipeline. Because the participants in the pipeline keep evolving, we face new real-world challenges constantly.”
She added that the technical aspects of research motivate her as well.
“The method space for us is just as exciting,” she said. “For example, LLMs could give us a leap in solving challenges related to capturing and measuring semantics in the data preprocessing space.”
Lai said that after her time at CSAIL, she remains open to all possible paths, with a slight preference for academia.
“As a first-generation college student from a low-income family, I received a lot of help and mentorships from all the professors I’ve worked with. In addition to contributing to research in academia, it’d be awesome to have the opportunity to give back to the community by passing the kindness and knowledge I received to the future generation of students,” she said.
For more on Eugenie Lai, check out her personal website: https://eugenielai.github.io/