With the advance of Large Language Models and Vision Models, many empirical results have shown that model performance is positively correlated with model capacity (number of parameters) and data quantity (number of training samples). Such correlations have been modeled by so-called scaling laws. But despite their success in vision models and large language models, the generalizability of scaling laws to ML self-driving models has remained an open question.
At Nuro, we take an AI-first approach by using ML everywhere, including perception, behavior prediction, and motion planning. In this blog, we’ll show some initial exploration of the scaling laws of our models, explain how we benefit from them, and discuss future directions.
What are Scaling Laws?
In general, an ML model performs better with larger numbers of parameters and more training samples/tokens. But how much improvement will we see? How many more parameters and data are needed? And is there any upper bound? Scaling laws answer these questions by providing mathematical models. In this article, we will introduce two types of scaling laws: Model performance scaling laws and Optimal model scaling laws.
Model performance scaling laws
How does model performance (represented by training or evaluation losses) improve as the model and training set sizes scale up? We fit a Model performance scaling law based on training and evaluation losses to answer this question.
We start with some basic notation:
- N — Model capacity: number of parameters
- D — Training dataset size: number of samples
- S — Number of training iterations
- C — Training budget in FLOPs
- L(N, D) — Training or eval loss metric; we assume the loss is a function of N and D
We model the performance scaling as a power law w.r.t. N and D[1]:

L(N, D) = E + A/N^𝛼 + B/D^𝛽

where E, A, B, 𝛼, 𝛽 are parameters to fit.
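As an illustration, the five parameters can be fitted with nonlinear least squares. This is a minimal sketch on synthetic data (all model/data sizes and parameter values below are hypothetical), not our actual fitting pipeline:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_loss(X, E, A, B, alpha, beta):
    """Power-law form: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Hypothetical measurements: a 3x3 grid of (model size, dataset size).
N = np.array([1e6, 4e6, 1.6e7] * 3)
D = np.repeat([1e5, 3e5, 1e6], 3)
rng = np.random.default_rng(0)
L = scaling_loss((N, D), 0.5, 400.0, 80.0, 0.35, 0.30) + rng.normal(0, 1e-4, N.shape)

# Fit the five parameters; p0 is a rough initial guess.
(E, A, B, alpha, beta), _ = curve_fit(
    scaling_loss, (N, D), L, p0=(1.0, 300.0, 100.0, 0.3, 0.3), maxfev=50000)
```

In practice the fit is sensitive to the initial guess, so it helps to seed p0 from a coarse log-log regression of the single-variable slices.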
Optimal model scaling laws
Another question in ML design is: what are the optimal model size and training data quantity under a given computational budget? We try to answer this question by fitting an Optimal model scaling law.
Let’s define the optimal model capacity, data quantity, and training steps (assuming a fixed batch size, so D ∝ S) at a certain FLOPs budget C as:

(N_opt(C), D_opt(C)) = argmin L(N, D) subject to FLOPs(N, D) = C

and fit power laws of the form N_opt(C) ∝ C^a and S_opt(C) ∝ C^b.
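For a power-law loss of the form L(N, D) = E + A/N^a + B/D^b, the constrained minimum even has a closed form under the common approximation C ≈ 6ND FLOPs, an assumption borrowed from the LLM literature and not something stated in this post:

```python
def compute_optimal(C, A, B, a, b):
    """Compute-optimal (N, D) for L = E + A/N^a + B/D^b under C = 6*N*D.

    Minimizing with a Lagrange multiplier gives
      N_opt(C) = G * (C/6)^(b/(a+b)),  D_opt(C) = (C/6) / N_opt(C),
    with G = (a*A / (b*B))^(1/(a+b)).
    """
    G = (a * A / (b * B)) ** (1.0 / (a + b))
    N_opt = G * (C / 6.0) ** (b / (a + b))
    D_opt = (C / 6.0) / N_opt
    return N_opt, D_opt
```

Note that N_opt and D_opt each follow a power law in C, which is exactly the functional form fitted empirically below.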
Scaling Laws in Behavior
Nuro’s behavior encoder is a transformer model that encodes track and context information in a scene. To demonstrate the scaling law here, we conducted multiple experiments by scaling the model capacity from 1X to 32X, and the training data from 1X to 7X. For a fair comparison, we trained all models with Nvidia 40GB A100 GPUs with the same optimizer and configuration. The training time varied from 1 day to 14 days.
Model Performance Scaling Laws
Fig. 1 shows the evaluation loss scaling law w.r.t. model capacity N and data quantity D. As expected, we observe better loss with larger N and D, while the impact of N is much larger than that of D, indicating that we need larger models to handle our current data. Moreover, by fixing a target L, we can fit the needed (N, D) profile, which guides model design.
Figure 1. Fitted Model Performance Scaling Function L(N, D), and profile curve (black) at 1% loss improvement.
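The iso-loss profile in Fig. 1 can be computed directly from a fitted power law: fix a target loss and solve for the data quantity each model size would need. A minimal sketch, with hypothetical parameter values:

```python
def data_needed(N, L_target, E, A, B, alpha, beta):
    """Solve L(N, D) = L_target for D along the iso-loss profile.

    Returns None when the target is unreachable at this model size.
    """
    residual = L_target - E - A / N**alpha  # loss budget left for the data term
    if residual <= 0:
        return None  # even infinite data cannot reach L_target at this N
    return (B / residual) ** (1.0 / beta)

# Hypothetical fitted parameters (E, A, B, alpha, beta) and a 10M-param model.
D_req = data_needed(1e7, 3.0, 0.5, 400.0, 80.0, 0.35, 0.30)
```

Sweeping N and plotting D_req traces out the black profile curve.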
In our experiment, we obtained the following training and eval scaling laws:
Optimal Model Scaling laws
We study the scaling laws between the optimal model capacity, training steps, and computational budget, based on our 7X data experiments. We collect the (L, C) points as depicted in Fig. 2(a); the lower envelope shows the optimal (lowest-loss) model capacities[2]. These optimal model points are then used to fit the scaling laws of N and S, as in Fig. 2(b) and (c).
Figure 2. Fit model capacity (number of parameters) and training steps scaling laws using training loss envelope.
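Extracting the envelope and fitting the power laws is mechanical once the (C, L) points are collected; a sketch of the procedure, assuming a simple running-minimum envelope:

```python
import numpy as np

def lower_envelope(C, L):
    """Given (FLOPs, loss) points pooled from many runs, keep the points
    that match the lowest loss seen so far as compute increases."""
    order = np.argsort(C)
    C, L = np.asarray(C)[order], np.asarray(L)[order]
    keep = L <= np.minimum.accumulate(L)  # points on the running minimum
    return C[keep], L[keep]

def fit_power_law(C, N):
    """Fit N_opt ~ k * C^a by linear regression in log-log space."""
    a, logk = np.polyfit(np.log(C), np.log(N), 1)
    return np.exp(logk), a
```

The same fit_power_law call is reused for the training-step law S_opt ∝ C^b.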
In our experiment, we obtained the following optimal model scaling laws:
Behavior Evaluation Results
Our behavior evaluation is shared by many behavior tasks (prediction, planning, behavior understanding, etc.); here we will show the results of the trajectory prediction task. We evaluated all trained models with a dataset of ~175 km (~15 hrs) total driving; the minADE metric values are shown in Fig. 3.
We found that smaller models (<8x) scale well when the dataset is small (<5X). Our 16x model almost scales linearly with the dataset size. The largest 32x model overfits significantly when the dataset is small, and doesn’t fully converge within limited training epochs when the dataset is larger.
Figure 3. Behavior eval improvements with scaling. Negative/positive numbers indicate relative improvements/regressions.
Another interesting finding was how this scaling generalizes across model versions. We experimented with two model versions (~1 year time gap) for the same behavior task, but with different model architectures, input features, and losses. We observed that the scaling laws still held and the performance improvements were comparable.
Scaling Laws in Perception
After measuring the effect of scaling laws on behavior, we turned to computer vision. For perception, we conducted experiments with ConvNeXt[3] as the backbone, training and evaluating on 2D object detection tasks. We used a VGG architecture as the baseline (1X) and scaled up the model size to ~11.3X (ConvNeXt XL).
Similar to the behavior model scaling law, we conducted experiments and fit scaling law functions for perception models. Fig. 4 shows how 2D detection mAP scales with N and D, and we reach the same conclusions as in the behavior model experiments.
Figure 4. Fitted Model Performance Scaling Function L(N, D)
We then tried to find the optimal model scaling law with regard to computational budget C. As shown in Fig. 5, the upper envelope (rather than a lower envelope, since we are maximizing the mAP eval metric rather than minimizing a training or eval loss) shows the optimal model at each budget level. We also observe that: 1) the 1X model has superior performance in the low-FLOPs regime, and 2) further experiments are needed to show the benefit of larger models, which would incur massive costs.
Figure 5. Use the mAP envelope to find the optimal models at different budgets. “Behavior” vertical lines are for the comparison of complexity difference to Perception.
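For a metric that is maximized, the envelope logic from the loss case simply flips sign; a minimal sketch:

```python
import numpy as np

def upper_envelope(C, m):
    """For a metric to maximize (e.g. mAP), keep the points that match
    the highest value seen so far as compute increases."""
    order = np.argsort(C)
    C, m = np.asarray(C)[order], np.asarray(m)[order]
    keep = m >= np.maximum.accumulate(m)  # points on the running maximum
    return C[keep], m[keep]
```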
In Fig. 6 we present the scaled model evaluation results at different model and data scales. In general, we see that model scaling works, but it requires a larger data scale. For example, with 7.5% of the data the improvement is smaller and plateaus; with 50% of the data the improvement is similar, but with a much steeper increasing trend (slope). We also noticed that despite the significant mAP improvements from model scaling reported in the original ConvNeXt paper, the improvement on our in-house data is not as significant.
Figure 6. The evaluation result of the perception model at different scales.
Conclusion
We have presented the mathematical formulations of two types of scaling laws, and shown how scaling laws apply to our state-of-the-art, industry-level self-driving models. We observed that scaling laws hold for both perception and behavior ML models trained with our massive in-house data. Different extrapolation methods have their own pros and cons but mostly coincide in their ordering of models. Even without significant model changes, we can do better with more training samples/steps and a larger model capacity. For a more detailed look at a range of experiments, check out what we presented at ML4AD ‘23.
That said, we observed several limitations:
- Scaling laws are biased towards the extremum models; we can miss similarly good models that are an order of magnitude less complex
- Scaling laws may not generalize to different evaluation datasets and metrics, especially as data scales up[4]
- The improvements are sometimes marginal but the cost can be multiplicative.
If you’re interested, here are some recommendations for further research:
- Explore different model capacity scaling approaches (width vs depth vs other params) and LR schedules
- Explore the double descent[5] and broken neural scaling laws
- Verify the predictive power of scaling laws in a much larger (e.g. >1e20 FLOPs) budget regime
- Explore even larger-scale data (e.g. web-scale) and more advanced scaling approaches such as mixture-of-experts (MoE) etc.
If any of these topics interest you, check out our open positions.
References
(1) Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. 2020 Jan 23.
(2) Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, Casas DD, Hendricks LA, Welbl J, Clark A, Hennigan T. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. 2022 Mar 29.
(3) Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2022 (pp. 11976–11986).
(4) Diaz F, Madaio M. Scaling Laws Do Not Scale. arXiv preprint arXiv:2307.03201. 2023 Jul 5.
(5) Caballero E, Gupta K, Rish I, Krueger D. Broken neural scaling laws. arXiv preprint arXiv:2210.14891. 2022 Oct 26.
By: Brian Yao, Adi Ganesh, Zhuwen Li, Aleksandr Petiushko