Alex Carlin

Ablation studies on LLMs from Apple

I think Apple is going to turn out to be one of the most effective companies at harnessing the value of AI. They have invested heavily in their own silicon, which happens to be great at all the things GPUs are great at, and they maintain their own ML frameworks both for the popular ML languages and for the ones you need to develop apps. They have a billion active devices out in the real world, full of AI systems that are constantly generating data and being used in real-world scenarios. And they conduct really clean, careful AI research.

This recent paper, "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", explores training a series of multimodal LLMs. But more importantly, the authors conduct hundreds of detailed and informative experiments, ablation studies, and other pieces of empirical fact-finding to determine what works, what doesn't, and why.
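
To make the shape of such an ablation study concrete, here is a minimal Python sketch: fix a baseline configuration, flip one design choice at a time, and record how the metric moves. The config fields loosely echo the axes the paper studies (image encoder, vision-language connector, data mix), but the specific names and the `train_and_score` function are hypothetical stand-ins, not the authors' code.

```python
# A toy ablation loop: fix a baseline config, flip one design choice
# at a time, and report the metric delta against the baseline.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    image_encoder: str = "vit-l"      # which vision backbone (hypothetical name)
    connector: str = "c-abstractor"   # how image tokens reach the LLM
    interleaved_data: bool = True     # include interleaved image-text documents

def train_and_score(cfg: Config) -> float:
    # Stand-in for a real training run; returns a placeholder score
    # so the loop runs end to end.
    return 0.5 + 0.01 * (hash(cfg) % 7)

baseline = Config()
ablations = {
    "smaller image encoder": replace(baseline, image_encoder="vit-b"),
    "average-pool connector": replace(baseline, connector="avg-pool"),
    "no interleaved data": replace(baseline, interleaved_data=False),
}

base_score = train_and_score(baseline)
for name, cfg in ablations.items():
    delta = train_and_score(cfg) - base_score
    print(f"{name:>24s}: {delta:+.3f} vs. baseline")
```

The point of the structure is that everything except the one flipped field is held constant, so any metric movement can be attributed to that single choice.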

The paper is fascinating and well worth a read. One of the most interesting aspects is the heatmap charts showing the performance of the model across huge, dense hyperparameter sweeps. These are Fig. 9 in the supplement, on page 29. The first thing you notice is that some of the cliffs between a good and a bad hyperparameter are quite steep, and that, at least for learning rate and weight decay, the further you stray from what have become the "standard" transformer hyperparameters, the worse the performance gets on average.
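
As a rough picture of how such a chart gets made, here is a minimal Python sketch that grids over learning rate and weight decay and renders the scores as a heatmap, the same view as the paper's Fig. 9. The grid values are illustrative, and `train_eval` is a hypothetical stand-in (a synthetic score with a single peak), not the MM1 training code.

```python
import itertools
import numpy as np
import matplotlib.pyplot as plt

learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]  # illustrative grid
weight_decays = [0.0, 0.01, 0.1, 0.3]

def train_eval(lr: float, wd: float) -> float:
    # Stand-in for a real training run: a smooth synthetic score
    # peaked near lr=3e-4, wd=0.01, so the heatmap has structure.
    return float(np.exp(-np.log10(lr / 3e-4) ** 2
                        - np.log10((wd + 1e-3) / 0.011) ** 2))

# One cell per (weight decay, learning rate) pair.
scores = np.zeros((len(weight_decays), len(learning_rates)))
for (i, wd), (j, lr) in itertools.product(
    enumerate(weight_decays), enumerate(learning_rates)
):
    scores[i, j] = train_eval(lr, wd)

fig, ax = plt.subplots()
im = ax.imshow(scores, origin="lower", aspect="auto")
ax.set_xticks(range(len(learning_rates)), [f"{lr:g}" for lr in learning_rates])
ax.set_yticks(range(len(weight_decays)), [f"{wd:g}" for wd in weight_decays])
ax.set_xlabel("learning rate")
ax.set_ylabel("weight decay")
fig.colorbar(im, ax=ax, label="eval score")
plt.show()
```

Plotted this way, the "cliffs" show up as sharp color transitions between adjacent cells, which is exactly what makes the paper's dense sweeps so readable.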

The authors, having examined all the current tricks rigorously, are able to build a family of models, which they call MM1, with a bunch of nice properties. The models are small, data-efficient, and qualitatively nice: they produce pleasing outputs. Very cool paper showing how careful, scientific experimentation can bring some order to the fast-paced chaos that is ML research.