
Beyond the Data: Uncovering How Modern Machine Models Learn Best

By MIHIR RELAN | September 18, 2024


COURTESY OF MIHIR RELAN 

Mahoney described the new Interpolating Information Criterion as a useful tool, with several applications, for evaluating models that have more parameters than data points.

On Thursday, September 12, the Department of Applied Mathematics and Statistics hosted Michael Mahoney of the Department of Statistics at UC Berkeley and the International Computer Science Institute. In his talk, titled “Model Selection and Ensembling When There Are More Parameters Than Data,” Mahoney addressed why modern machine learning models work so well in practice even though existing mathematical theory cannot fully explain their success.

Mahoney began by introducing the bias-variance tradeoff in traditional machine learning models: simpler models may not capture all the details in the training data (high bias) but tend to perform better on new, unseen data (low variance), while complex models fit the training data well (low bias) yet can become too specific to it and perform worse on new, unseen data (high variance). Mahoney explained that large datasets are often necessary to strike a balance, since they let models learn general patterns rather than memorize specific details.
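To make the tradeoff concrete, consider fitting polynomials of increasing degree to a small noisy dataset. The snippet below is a minimal sketch, not drawn from the talk, assuming Python with NumPy and scikit-learn: as the degree grows, training error keeps shrinking while test error eventually worsens again.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 40)
x_train, y_train, x_test, y_test = x[:30], y[:30], x[30:], y[30:]

for degree in (1, 4, 15):
    # Higher degree means a more flexible model: lower bias, higher variance.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(degree, train_err, test_err)  # training error falls; test error typically rises again at high degree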

“If you’ve taken a statistics class, even if you haven’t, you probably have this intuition that if you’re going to do a linear model or a least squares regression or anything, you need lots of data and a small number of parameters, and you look at some bias-variance tradeoff, and you declare a success,” he said. However, Mahoney pointed out that modern machine learning models, which often have more parameters than data points, challenge this assumption. These overparameterized models still perform well despite having fewer data points than parameters, which contradicts traditional statistical expectations.
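Numerically, overparameterization can look like the following minimal sketch, assuming NumPy (illustrative only, not Mahoney's experiments): with more coefficients than observations, infinitely many parameter settings fit the training data exactly, and the minimum-norm choice among them can still predict new data reasonably well.

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 500                                # 50 data points, 500 parameters
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p) / np.sqrt(p)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.pinv(X) @ y                 # minimum-norm interpolating solution
print(np.abs(X @ w_hat - y).max())            # ~0: the training data is fit exactly

X_new = rng.normal(size=(1000, p))            # fresh data from the same distribution
print(np.mean((X_new @ w_hat - X_new @ w_true) ** 2))  # yet prediction error stays modest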

“With a lot of models these days, they’re trained with obscene amounts of data with super obscene amounts of computing [power] and super duper obscene amounts of parameters,” Mahoney said. “So more parameters than data.” 

Mahoney expanded upon this by describing the limitations of diagnostic criteria for these machine learning models. He highlighted that traditional metrics like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are designed for models with more data than parameters; for overparameterized models, these metrics fail to capture model performance accurately.
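For reference, the textbook definitions of the two criteria, written here as a short sketch rather than anything presented in the talk, reward fit through the maximized log-likelihood while penalizing the number of parameters k, under the assumption that the number of observations n is large relative to k.

import math

def aic(log_likelihood: float, k: int) -> float:
    # Akaike Information Criterion: 2k - 2 ln(L_max); lower is better.
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    # Bayesian Information Criterion: k ln(n) - 2 ln(L_max); lower is better.
    return k * math.log(n) - 2 * log_likelihood

When k exceeds n, a flexible model can interpolate the training data, and these penalties no longer track how well the model will generalize.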

“There’s model selection criteria, there’s AIC and BIC, that stand on top of a bunch of statistical theory that assume [the number of data points] goes to infinity,” Mahoney said. “What if that’s not the case?”

To address this, Mahoney introduced the Interpolating Information Criterion (IIC), a new metric designed specifically for overparameterized models. Unlike AIC or BIC, which assume the data size is larger than the number of parameters, IIC accounts for models where the number of parameters exceeds the data points. This can be particularly useful in informing model selection for different real-world applications. 

“If you want to use these machine learning models beyond internet and social media applications but for what is a pain point for users, I need to be able to take the model you gave me and certify that it has some reasonable quality level,” Mahoney said.

Mahoney explained that the IIC has a variety of applications beyond informing model selection, such as computing uncertainty quantification (UQ) for scientific and engineering machine learning models and improving regression diagnostics for neural networks. Most notably, Mahoney described how the IIC can be used to assess the effectiveness of ensembling machine learning models. Ensembling is the practice of combining multiple models to improve accuracy, and the IIC provides a framework to measure how much improvement is achieved when models are ensembled.

“We are interested in characterizing how much ensembling improves performance relative to the performance of any one classifier, on average,” he said.
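The benefit of averaging can be seen even in a toy setting: combining the outputs of several noisy predictors reduces the variance of the prediction roughly in proportion to the number of near-independent predictors combined. Below is a minimal sketch assuming NumPy, illustrative only and not from the talk.

import numpy as np

rng = np.random.default_rng(2)
truth = 1.0
# Each 'model' predicts the truth plus its own independent error.
predictions = truth + rng.normal(0.0, 0.5, size=(10_000, 25))   # 10,000 trials, 25 models

single_mse = np.mean((predictions[:, 0] - truth) ** 2)           # one model alone
ensemble_mse = np.mean((predictions.mean(axis=1) - truth) ** 2)  # average of 25 models
print(single_mse, ensemble_mse)  # roughly 0.25 vs. 0.01: about a 25-fold reduction

In practice the errors of different models are correlated, which limits the gain; quantifying the improvement that can actually be expected is the kind of question a criterion for ensembling has to answer.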

Mahoney concluded the talk by presenting empirical results on how the IIC can measure the effectiveness of ensembling. He explained that ensembling becomes less useful for overparameterized models, such as neural networks or large language models, which can easily interpolate the training data and achieve zero training error; in these cases, ensembling offers minimal additional benefit. By contrast, methods built from simpler learners, such as the decision trees in a random forest, are particularly suited for ensembling: once an individual tree reaches zero training error it cannot improve further on its own, so combining many trees significantly improves performance.
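That contrast is easy to reproduce with off-the-shelf tools. Below is a hedged illustration assuming scikit-learn, not the speaker's experiments: each fully grown tree in a random forest essentially fits its training sample exactly, yet adding more trees keeps improving test accuracy, which a single interpolating tree cannot do on its own.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    print(n_trees,
          forest.score(X_tr, y_tr),   # training accuracy stays high: fully grown trees fit their samples
          forest.score(X_te, y_te))   # test accuracy generally improves as trees are added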

In an interview with The News-Letter, Zekun Wang, a first-year doctoral student in Applied Mathematics and Statistics, said the talk resonated with his research interests. 

“It was related to my research domain,” Wang said. “My research focuses on large-language models, and what he talks about is about deep learning especially. I really appreciated his lecture. [Some ideas] may not be something immediately that I can use, but taking part in this talk was more about absorbing something potentially I might be able to use.”

