
Beyond the Data: Uncovering How Modern Machine Models Learn Best

By MIHIR RELAN | September 18, 2024


COURTESY OF MIHIR RELAN 

Mahoney described the new Interpolating Information Criterion as a useful tool, with several applications, for evaluating models that have more parameters than data points.

On Thursday, September 12, the Department of Applied Mathematics and Statistics hosted Michael Mahoney of the Department of Statistics at UC Berkeley and the International Computer Science Institute. In his talk, titled “Model Selection and Ensembling When There Are More Parameters Than Data,” Mahoney addressed why modern machine learning models work so well in practice even though existing mathematical theory cannot fully explain their success.

Mahoney began by introducing the bias-variance tradeoff in traditional machine learning models: simpler models may not capture all the details in the training data (high bias) but tend to perform better on new, unseen data (low variance), while complex models fit the training data well (low bias) yet can become too specific to it and perform worse on new, unseen data (high variance). Mahoney explained that large datasets are often necessary to strike a balance, since they let models learn general patterns rather than memorize specific details.
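To make the tradeoff concrete, consider fitting polynomials of increasing degree to a small noisy dataset. The snippet below is a minimal sketch, not drawn from the talk, assuming Python with NumPy and scikit-learn: as the degree grows, training error keeps shrinking while test error eventually worsens again.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 40)
x_train, y_train, x_test, y_test = x[:30], y[:30], x[30:], y[30:]

for degree in (1, 4, 15):
    # Higher degree means a more flexible model: lower bias, higher variance.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(degree, train_err, test_err)  # training error falls; test error typically rises again at high degree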

“If you’ve taken a statistics class, even if you haven’t, you probably have this intuition that if you’re going to do a linear model or a least squares regression or anything, you need lots of data and a small number of parameters, and you look at some bias-variance tradeoff, and you declare a success,” he said. However, Mahoney pointed out that modern machine learning models, which often have more parameters than data points, challenge this assumption. These overparameterized models still perform well despite having fewer data points than parameters, which contradicts traditional statistical expectations.
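Numerically, overparameterization can look like the following minimal sketch, assuming NumPy (illustrative only, not Mahoney's experiments): with more coefficients than observations, infinitely many parameter settings fit the training data exactly, and the minimum-norm choice among them can still predict new data reasonably well.

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 500                                # 50 data points, 500 parameters
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p) / np.sqrt(p)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.pinv(X) @ y                 # minimum-norm interpolating solution
print(np.abs(X @ w_hat - y).max())            # ~0: the training data is fit exactly

X_new = rng.normal(size=(1000, p))            # fresh data from the same distribution
print(np.mean((X_new @ w_hat - X_new @ w_true) ** 2))  # yet prediction error stays modest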

“With a lot of models these days, they’re trained with obscene amounts of data with super obscene amounts of computing [power] and super duper obscene amounts of parameters,” Mahoney said. “So more parameters than data.” 

Mahoney expanded upon this by describing the limitations of diagnostic criteria for these machine learning models. He highlighted that traditional metrics like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are designed for models with more data than parameters; for overparameterized models, these metrics fail to capture model performance accurately.
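For reference, the textbook definitions of the two criteria, written here as a short sketch rather than anything presented in the talk, reward fit through the maximized log-likelihood while penalizing the number of parameters k, under the assumption that the number of observations n is large relative to k.

import math

def aic(log_likelihood: float, k: int) -> float:
    # Akaike Information Criterion: 2k - 2 ln(L_max); lower is better.
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    # Bayesian Information Criterion: k ln(n) - 2 ln(L_max); lower is better.
    return k * math.log(n) - 2 * log_likelihood

When k exceeds n, a flexible model can interpolate the training data, and these penalties no longer track how well the model will generalize.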

“There’s model selection criteria, there’s AIC and BIC, that stand on top of a bunch of statistical theory that assume [the number of data points] goes to infinity,” Mahoney said. “What if that’s not the case?”

To address this, Mahoney introduced the Interpolating Information Criterion (IIC), a new metric designed specifically for overparameterized models. Unlike AIC or BIC, which assume the data size is larger than the number of parameters, IIC accounts for models where the number of parameters exceeds the data points. This can be particularly useful in informing model selection for different real-world applications. 

“If you want to use these machine learning models beyond internet and social media applications but for what is a pain point for users, I need to be able to take the model you gave me and certify that it has some reasonable quality level,” Mahoney said.

Mahoney explained that the IIC has a variety of applications beyond informing model selection, such as computing uncertainty quantification (UQ) for scientific and engineering machine learning models and improving regression diagnostics for neural networks. Most notably, Mahoney described how the IIC can be used to assess the effectiveness of ensembling machine learning models. Ensembling is the practice of combining multiple models to improve accuracy, and the IIC provides a framework to measure how much improvement is achieved when models are ensembled.

“We are interested in characterizing how much ensembling improves performance relative to the performance of any one classifier, on average,” he said.
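The benefit of averaging can be seen even in a toy setting: combining the outputs of several noisy predictors reduces the variance of the prediction roughly in proportion to the number of near-independent predictors combined. Below is a minimal sketch assuming NumPy, illustrative only and not from the talk.

import numpy as np

rng = np.random.default_rng(2)
truth = 1.0
# Each 'model' predicts the truth plus its own independent error.
predictions = truth + rng.normal(0.0, 0.5, size=(10_000, 25))   # 10,000 trials, 25 models

single_mse = np.mean((predictions[:, 0] - truth) ** 2)           # one model alone
ensemble_mse = np.mean((predictions.mean(axis=1) - truth) ** 2)  # average of 25 models
print(single_mse, ensemble_mse)  # roughly 0.25 vs. 0.01: about a 25-fold reduction

In practice the errors of different models are correlated, which limits the gain; quantifying the improvement that can actually be expected is the kind of question a criterion for ensembling has to answer.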

Mahoney concluded the talk by presenting empirical results on how the IIC can measure the effectiveness of ensembling. He explained that ensembling becomes less useful for overparameterized models, such as neural networks or large language models, which can easily interpolate the training data and achieve zero training error; in these cases, ensembling offers minimal additional benefit. By contrast, methods built from simpler learners, such as the decision trees in a random forest, are particularly suited for ensembling: once an individual tree reaches zero training error it cannot improve further on its own, so combining many trees significantly improves performance.
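That contrast is easy to reproduce with off-the-shelf tools. Below is a hedged illustration assuming scikit-learn, not the speaker's experiments: each fully grown tree in a random forest essentially fits its training sample exactly, yet adding more trees keeps improving test accuracy, which a single interpolating tree cannot do on its own.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    print(n_trees,
          forest.score(X_tr, y_tr),   # training accuracy stays high: fully grown trees fit their samples
          forest.score(X_te, y_te))   # test accuracy generally improves as trees are added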

In an interview with The News-Letter, Zekun Wang, a first-year doctoral student in Applied Mathematics and Statistics, said the talk resonated with his research interests. 

“It was related to my research domain,” Wang said. “My research focuses on large-language models, and what he talks about is about deep learning especially. I really appreciated his lecture. [Some ideas] may not be something immediately that I can use, but taking part in this talk was more about absorbing something potentially I might be able to use.”

