Determining Model Size and Training Horizon through Scaling Laws

A framework for optimizing model size and training token count from scaling-law coefficients — accounting for inference cost, data repetition, and system efficiency.

1. Introduction

How big should the model be, and how long should we train it? This post walks through how to answer that from scaling-law coefficients, with the usual twists: inference budget, limited data, system efficiency.

Three variants come up in practice: pick the compute-optimal size and horizon, go bigger than optimal to squeeze out more accuracy, or go smaller and overtrain it to save on inference.

The framework below handles the whole picture — training budget, inference budget, limited data, intentional repetition. Give it a compute budget and a rough sense of how much inference you’ll serve, and it returns the optimal size and horizon, plus the % compute overhead of deviating from them.

1.1. Problem formulation and notation

The approach builds on the parametric scaling law from the Chinchilla paper. We define the following variables:

  • $N$: Number of model parameters (model size).
  • $D$: Number of training tokens (total tokens processed during training, i.e. training steps times batch size in tokens).
  • $C_{\text{train}}$: Available training compute budget, measured in floating point operations (FLOPs) or an equivalent unit. This constrains how large $N$ and $D$ can be, since more parameters and more tokens both consume more compute.
  • $M$: Projected number of inference tokens the model will serve in its lifetime. For example, if we expect ~1 billion queries with an average of 1000 tokens each, $M = 10^{12}$ tokens.
  • $D_c$: Number of unique tokens in the available training corpus (the size of the dataset). If $D > D_c$, the dataset is repeated to supply that many training tokens.
  • Loss function $L(N, D)$: A proxy for model quality after training, given by the scaling law. Lower loss corresponds to a better model. We use a parametric form informed by the Chinchilla paper:

$$L(N, D) = A N^{-\alpha} + B D^{-\beta} + E$$

where $A$, $\alpha$, $B$, $\beta$, $E$ are the scaling-law coefficients, determined by the data and the model architecture. The equation captures the empirically observed power-law improvement as model and data scale, with $E$ representing the irreducible loss (approached as $N, D \to \infty$).

These coefficients can be obtained by regression, following the procedure in "[Juntei] How to run a scaling ladder ablation".
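As a minimal sketch, the parametric loss can be written as a small Python function. The coefficient values are the fit reported in the Chinchilla paper and are used here purely for illustration; the name `loss` is an assumption of this example, and the constants should be replaced with your own regression results.

```python
# Illustrative coefficients: the fit reported in the Chinchilla paper.
# These are placeholders; substitute the values from your own scaling ladder.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = A * N^-alpha + B * D^-beta + E."""
    return A * n_params ** -ALPHA + B * n_tokens ** -BETA + E


if __name__ == "__main__":
    # Same 1B-parameter model, trained on 20B vs. 200B tokens.
    print(loss(1e9, 20e9))
    print(loss(1e9, 200e9))
```

More tokens at a fixed size lower the predicted loss, but it never drops below `E`.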

2. Compute Optimality Approach

We’ll build this up in layers: start with vanilla compute-optimal training, then add data constraints, inference cost, and system efficiency one at a time.

2.1. Training compute only

To set up the optimization, recall the classic compute-optimal training problem (the Chinchilla setting): for a given training compute budget $C_{\text{train}}$, choose $N$ and $D$ such that $L(N, D)$ is minimized, subject to $C_{\text{train}} = \eta_{\text{train}} N D$, where $\eta_{\text{train}}$ is the training FLOPs per parameter per token (usually 6). In other words,

$$\begin{aligned} & \min_{N, D} \; L(N, D) \\ & \text{s.t.} \quad \eta_{\text{train}} N D = C_{\text{train}} \end{aligned}$$

From this formulation, one can derive that the compute-optimal parameter count $N_{\text{opt}}$ and token count $D_{\text{opt}}$ each follow a power law in compute:

$$N_{\text{opt}}(C_{\text{train}}) = G \left(\frac{C_{\text{train}}}{\eta_{\text{train}}}\right)^{\frac{\beta}{\alpha+\beta}}, \quad D_{\text{opt}}(C_{\text{train}}) = G^{-1} \left(\frac{C_{\text{train}}}{\eta_{\text{train}}}\right)^{\frac{\alpha}{\alpha+\beta}}, \quad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}$$
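The closed form translates directly into code. The following is a sketch, assuming the illustrative Chinchilla-paper coefficients and $\eta_{\text{train}} = 6$; `compute_optimal` is a hypothetical helper name, not part of any library.

```python
# Illustrative coefficients (Chinchilla paper fit); replace with your own.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
ETA_TRAIN = 6.0  # training FLOPs per parameter per token


def compute_optimal(c_train: float) -> tuple[float, float]:
    """Return (N_opt, D_opt) for a training budget given in FLOPs."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    k = c_train / ETA_TRAIN  # the product N * D allowed by the budget
    n_opt = g * k ** (BETA / (ALPHA + BETA))
    d_opt = k ** (ALPHA / (ALPHA + BETA)) / g
    return n_opt, d_opt


if __name__ == "__main__":
    n_opt, d_opt = compute_optimal(1e23)
    print(f"N_opt = {n_opt:.3e} params, D_opt = {d_opt:.3e} tokens")
    print(f"tokens per parameter = {d_opt / n_opt:.1f}")
```

Because the two exponents sum to one, the returned pair always satisfies the budget constraint $\eta_{\text{train}} N D = C_{\text{train}}$.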

2.2. Data availability consideration

If $D > D_c$, not all of those $D$ tokens are unique. The scaling law $L(N, D) = A N^{-\alpha} + B D^{-\beta} + E$ assumes i.i.d. data and does not distinguish unique from repeated tokens. To capture the diminishing returns of repeated data, we modify the data term. One simple approach is to introduce an effective-token coefficient $\tau_{\text{eff}}(R)$ that discounts the actual token count to an equivalent number of unique tokens, where $R = \frac{D}{D_c} - 1$ is the number of repetitions beyond the first epoch (take $R = 0$ when $D \le D_c$). A form that fits the empirical data well uses a decay constant $R^{*}$, loosely the half-life of data utility: the marginal value of a repeated token falls off as $e^{-R/R^{*}}$, so after $R^{*}$ repetitions it is worth about $1/e \approx 37\%$ of a fresh token. The coefficient then takes the following form:

$$\tau_{\text{eff}}(R) = \frac{1 + R^{*}\left(1 - e^{-\frac{R}{R^{*}}}\right)}{R + 1}$$
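A small sketch of the discount, to make its behavior concrete. The function names are hypothetical, `r_star = 15.0` in the usage lines is a placeholder rather than a fitted value, and clipping $R$ at zero for $D \le D_c$ is an assumption of this example.

```python
import math


def tau_eff(repetitions: float, r_star: float) -> float:
    """Discount applied to the raw token count after R repetitions."""
    r = max(repetitions, 0.0)  # assumption: no discount within the first epoch
    return (1.0 + r_star * (1.0 - math.exp(-r / r_star))) / (r + 1.0)


def effective_tokens(d_tokens: float, d_unique: float, r_star: float) -> float:
    """Equivalent unique-token count for D training tokens drawn from D_c."""
    repetitions = d_tokens / d_unique - 1.0
    return tau_eff(repetitions, r_star) * d_tokens


if __name__ == "__main__":
    d_unique = 1e11  # hypothetical corpus size
    for epochs in (1, 2, 4, 16, 64):
        d = epochs * d_unique
        print(epochs, f"{effective_tokens(d, d_unique, r_star=15.0):.3e}")
```

Since $R + 1 = D / D_c$, the effective token count simplifies to $D_c \left(1 + R^{*}(1 - e^{-R/R^{*}})\right)$, which saturates at $D_c (1 + R^{*})$ no matter how many epochs are run.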

This changes the compute optimality formulation to

$$\begin{aligned} & \min_{N, D} \; L\!\left(N, \; \tau_{\text{eff}}\!\left(\tfrac{D}{D_c} - 1\right) D\right) \\ & \text{s.t.} \quad \eta_{\text{train}} N D = C_{\text{train}} \end{aligned}$$
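The data-constrained problem has no tidy closed form, but the constraint fixes $N$ once $D$ is chosen, so a one-dimensional scan is enough. The sketch below assumes the illustrative Chinchilla-paper coefficients, a placeholder `R_STAR`, and a simple log-spaced grid; `data_constrained_optimum` is a hypothetical name, and a proper optimizer could replace the scan.

```python
import math

# Illustrative coefficients (Chinchilla paper fit); replace with your own.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69
ETA_TRAIN = 6.0
R_STAR = 15.0  # placeholder decay constant, not a fitted value


def effective_tokens(d: float, d_unique: float) -> float:
    r = max(d / d_unique - 1.0, 0.0)  # repetitions beyond the first epoch
    tau = (1.0 + R_STAR * (1.0 - math.exp(-r / R_STAR))) / (r + 1.0)
    return tau * d


def data_constrained_optimum(c_train: float, d_unique: float, points: int = 4001):
    """Scan D on a log grid; return (loss, N, D) with the lowest loss."""
    k = c_train / ETA_TRAIN
    # Scan from D = 1e6 tokens up to the D at which N shrinks to 1e6 params.
    lo, hi = math.log(1e6), math.log(k / 1e6)
    best = None
    for i in range(points):
        d = math.exp(lo + (hi - lo) * i / (points - 1))
        n = k / d  # the budget constraint fixes N
        value = A * n ** -ALPHA + B * effective_tokens(d, d_unique) ** -BETA + E
        if best is None or value < best[0]:
            best = (value, n, d)
    return best


if __name__ == "__main__":
    for d_unique in (1e10, 1e11, 1e13):
        value, n, d = data_constrained_optimum(1e23, d_unique)
        print(f"D_c={d_unique:.0e}: N={n:.3e}, D={d:.3e}, loss={value:.4f}")
```

When the corpus is large enough that no repetition is needed, the scan lands on the unconstrained optimum; as $D_c$ shrinks, the optimum shifts toward more parameters and fewer training tokens.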

2.3. Inference compute consideration

We also want to account for the compute required to serve $M$ inference tokens. For a model with $N$ parameters, inference on a single token costs roughly $O(N)$ FLOPs (one forward pass). The total inference cost is therefore approximately $\eta_{\text{inf}} N M$, where $\eta_{\text{inf}}$ is another proportionality constant (usually $\eta_{\text{inf}} = 2$, i.e. $\eta_{\text{inf}} \approx \tfrac{1}{3} \eta_{\text{train}}$).

Next, we consider two scenarios for the compute-optimality problem that take inference into account.

Scenario 2.3.1. Fixed total compute budget between training and inference

Here a fixed compute budget $C$ is split between training and inference, for a number of inference tokens $M$ that is known beforehand. This only requires changing the optimization constraint to include the inference compute:

$$\begin{aligned} & \min_{N, D} \; L(N, D) \\ & \text{s.t.} \quad \eta_{\text{train}} N D + \eta_{\text{inf}} N M = C \end{aligned}$$
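One way to solve this numerically: the constraint gives $N = C / (\eta_{\text{train}} D + \eta_{\text{inf}} M)$, and the resulting objective is convex in $\log D$, so a ternary search converges. This is a sketch under the illustrative Chinchilla-paper coefficients; `optimum_with_inference` and the search bounds are assumptions of this example.

```python
import math

# Illustrative coefficients (Chinchilla paper fit); replace with your own.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69
ETA_TRAIN, ETA_INF = 6.0, 2.0


def optimum_with_inference(c_total: float, m_tokens: float, iters: int = 200):
    """Return (N, D) minimizing L(N, D) under the joint train + inference budget."""

    def params_for(d: float) -> float:
        # The budget constraint fixes N once D is chosen.
        return c_total / (ETA_TRAIN * d + ETA_INF * m_tokens)

    def objective(log_d: float) -> float:
        d = math.exp(log_d)
        return A * params_for(d) ** -ALPHA + B * d ** -BETA + E

    # Ternary search over log D; valid because the objective is convex there.
    lo, hi = math.log(1e3), math.log(c_total)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    d = math.exp(0.5 * (lo + hi))
    return params_for(d), d


if __name__ == "__main__":
    for m in (0.0, 1e12, 1e13):
        n, d = optimum_with_inference(1e23, m)
        print(f"M={m:.0e}: N={n:.3e}, D={d:.3e}, tokens/param={d / n:.1f}")
```

With $M = 0$ this reduces to the training-only optimum; as $M$ grows, the solution moves to a smaller model, since every parameter is paid for again on each served token.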

Scenario 2.3.2. Under fixed training compute, overtrain an undersized model to reduce inference cost

In this scenario, we train a smaller-than-optimal model for longer so that it reaches the same loss as the compute-optimal size and horizon. Doing so costs extra training compute, which buys cheaper inference down the line.

In this setting, we scale the parameters by $k_N$ and the training tokens by $k_D$ so as to achieve the same loss as the compute-optimal parameter count $N_{\text{opt}}$ and token count $D_{\text{opt}}$. In other words:

$$L(N_{\text{opt}}, D_{\text{opt}}) = L(k_N N_{\text{opt}}, \; k_D D_{\text{opt}})$$

Solving for $k_D$, we have

$$k_D = \left(1 - (k_N^{-\alpha} - 1) \frac{A N_{\text{opt}}^{-\alpha}}{B D_{\text{opt}}^{-\beta}}\right)^{-\frac{1}{\beta}}$$

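With that, we can derive the new total training compute and the compute overhead relative to the compute-optimal setting:

$$C_{\text{new}} = \eta_{\text{train}} \cdot k_N N_{\text{opt}} \cdot k_D D_{\text{opt}} = k_N k_D \, C_{\text{opt}}$$

At the compute-optimal point the first-order condition gives $\alpha A N_{\text{opt}}^{-\alpha} = \beta B D_{\text{opt}}^{-\beta}$, so the ratio $\frac{A N_{\text{opt}}^{-\alpha}}{B D_{\text{opt}}^{-\beta}}$ in the expression for $k_D$ equals $\frac{\beta}{\alpha}$, and the overhead depends only on $k_N$ and the two exponents:

$$\rho_{\text{overhead}} = \frac{C_{\text{new}} - C_{\text{opt}}}{C_{\text{opt}}} = k_N k_D - 1 = k_N \left(1 - \frac{\beta}{\alpha} \left(k_N^{-\alpha} - 1\right)\right)^{-\frac{1}{\beta}} - 1$$

The expression is defined only while the bracket is positive, i.e. for $k_N > \left(1 + \tfrac{\alpha}{\beta}\right)^{-\frac{1}{\alpha}}$; below that size, no amount of additional data recovers the compute-optimal loss.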
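The overhead can be evaluated numerically by computing $k_D$ from the iso-loss condition and then forming $k_N k_D - 1$. The sketch below assumes the illustrative Chinchilla-paper coefficients; `token_multiplier` and `compute_overhead` are hypothetical helper names.

```python
# Illustrative coefficients (Chinchilla paper fit); replace with your own.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69
ETA_TRAIN = 6.0


def loss(n: float, d: float) -> float:
    return A * n ** -ALPHA + B * d ** -BETA + E


def compute_optimal(c_train: float) -> tuple[float, float]:
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    k = c_train / ETA_TRAIN
    return g * k ** (BETA / (ALPHA + BETA)), k ** (ALPHA / (ALPHA + BETA)) / g


def token_multiplier(k_n: float, n_opt: float, d_opt: float) -> float:
    """k_D such that L(k_N * N_opt, k_D * D_opt) equals L(N_opt, D_opt)."""
    ratio = (A * n_opt ** -ALPHA) / (B * d_opt ** -BETA)
    bracket = 1.0 - (k_n ** -ALPHA - 1.0) * ratio
    if bracket <= 0.0:
        raise ValueError("model too small: no token count matches the optimal loss")
    return bracket ** (-1.0 / BETA)


def compute_overhead(k_n: float, c_train: float) -> float:
    """Fractional extra training compute, k_N * k_D - 1."""
    n_opt, d_opt = compute_optimal(c_train)
    return k_n * token_multiplier(k_n, n_opt, d_opt) - 1.0


if __name__ == "__main__":
    for k_n in (1.0, 0.75, 0.5, 0.25):
        print(f"k_N={k_n}: overhead={compute_overhead(k_n, 1e23):.1%}")
```

The overhead is zero at $k_N = 1$ and grows quickly as the model shrinks, which is the price paid up front for the lower per-token inference cost.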

3. System Efficiency Consideration

FLOP counts are a convenient abstraction, but what you actually pay for is machine time. For a fixed amount of compute, higher system efficiency means less machine time and lower training cost. For inference, higher efficiency means more throughput per node, so a smaller fleet for the same traffic. Folding efficiency in turns the compute-optimal story into something you can actually use for production decisions.

3.1. Training efficiency

To account for training efficiency in the training cost, we need model FLOPs utilization (MFU) and goodput. MFU is the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of the system operating at peak FLOPs. Goodput is loosely defined as the fraction of the training job's elapsed time that is spent computing useful new steps.

Machine time $T_{\text{machine}}$ falls out in three divisions. Divide training compute $C_{\text{train}}$ by MFU $\rho_{\text{MFU}}$ to get the actual FLOPs $C_{\text{actual}}$ the system has to push through. Divide that by the peak throughput $S$ (in FLOP/s) for the ideal machine time $T_{\text{ideal}}$. Finally, divide by goodput $\rho_{\text{goodput}}$ to get real elapsed time:

$$T_{\text{machine}} = \frac{C_{\text{train}}}{\rho_{\text{MFU}} \cdot \rho_{\text{goodput}} \cdot S}$$
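The three divisions can be sketched as a short function. The cluster numbers in the usage lines (1024 accelerators at $10^{15}$ FLOP/s peak each, 40% MFU, 90% goodput) are hypothetical values chosen only to show the arithmetic, and `machine_time_seconds` is an assumed name.

```python
def machine_time_seconds(
    c_train: float, peak_flops_per_s: float, mfu: float, goodput: float
) -> float:
    """Elapsed wall-clock time for a training run, in seconds."""
    if not (0.0 < mfu <= 1.0 and 0.0 < goodput <= 1.0):
        raise ValueError("mfu and goodput must lie in (0, 1]")
    c_actual = c_train / mfu               # FLOPs the hardware must push through
    t_ideal = c_actual / peak_flops_per_s  # time with no stalls or restarts
    return t_ideal / goodput               # real elapsed time


if __name__ == "__main__":
    cluster_peak = 1024 * 1e15  # hypothetical: 1024 chips at 1e15 FLOP/s each
    seconds = machine_time_seconds(1e23, cluster_peak, mfu=0.4, goodput=0.9)
    print(f"{seconds / 86400:.2f} days")
```

Because MFU and goodput multiply in the denominator, a loss in either one extends the run by the same proportion.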

3.2. Inference efficiency

Inference efficiency works the same way, but the details differ. Serving setups are much smaller than training clusters, so goodput is modeled differently. The workload is forward-only, which changes how performance engineering and MFU look. And Scenario 2.3.2 — overtraining an undersized model — is the more common case anyway, and it has no pre-determined token count. There, the natural target is throughput per host.