Sensitivity-Aware Training (SAT): Using Statistical Weight Geometry to Guide LLM Training Dynamics

baa.ai

Abstract. Large language models (LLMs) are routinely trained at full floating-point precision and subsequently compressed via post-training quantization (PTQ). The mismatch between the statistical geometry of weights produced by unconstrained training and the requirements of low-bit arithmetic is well documented: outlier weights, pathologically concentrated singular-value spectra, and noise-amplifying layer topologies all degrade quantized model quality. The SWAN framework (Statistical Weight Analysis for quantizatioN) addresses this mismatch retroactively, diagnosing sensitivity after training concludes. We propose Sensitivity-Aware Training (SAT), a principled extension of the SWAN philosophy into the training loop itself. SAT replaces the static, post-hoc sensitivity report with three online training signals: a kurtosis-driven regularisation term that penalises outlier emergence in real time, a spectral norm constraint that maintains well-conditioned weight matrices throughout optimisation, and a targeted noise-injection schedule that surgically hardens only layers flagged as high-risk. Layered on top of these signals is a Dynamic Bit-Width Allocation (DBWA) mechanism that periodically evaluates SWAN metrics and adjusts per-layer training precision accordingly, reducing wasted compute on low-sensitivity layers while protecting high-sensitivity ones. Together these mechanisms produce models that are quantization-ready by construction, eliminating a root cause of PTQ degradation rather than compensating for it after the fact.

1. Introduction

The deployment lifecycle of a modern LLM involves a conceptual discontinuity: models are trained in high-precision floating point to maximise gradient quality, then compressed aggressively for inference. This two-phase pipeline assumes that a model’s internal geometry—the statistical distribution of its weights and activations—is largely incidental to the training objective and can be post-processed without penalty. Empirical evidence increasingly contradicts this assumption.

Dettmers et al. [2] demonstrated that outlier features emerge systematically in transformer activations and make naive INT8 quantization catastrophically lossy. Subsequent work—including GPTQ, AWQ, SmoothQuant, and QuaRot—has developed increasingly sophisticated post-training correction strategies: weight rotation, activation smoothing, mixed-precision allocation, and learned rounding. The SWAN framework synthesises the diagnostic half of this body of work, exposing three complementary metrics (excess kurtosis, SVD spectral concentration, and output noise amplification) that together characterise a trained layer’s sensitivity to quantization.

The natural next question is: if we can measure sensitivity after training, can we prevent the conditions that produce it during training? This paper answers affirmatively. The SWAN metrics, originally conceived as post-hoc diagnostics, can be repurposed as online training signals. The resulting framework—Sensitivity-Aware Training (SAT)—treats the statistical geometry of weights as a first-class training objective alongside the primary language modelling loss.

This paper makes the following contributions:

  1. A formal derivation of kurtosis-regularised gradient updates that suppress outlier emergence without interfering with model expressiveness.
  2. A spectral norm conditioning constraint that promotes distributed singular-value spectra and improves the robustness of learned representations to precision reduction.
  3. A targeted quantization-noise injection protocol that concentrates the hardening effect of Quantization-Aware Training (QAT) on statistically identified high-risk layers.
  4. A Dynamic Bit-Width Allocation (DBWA) mechanism that periodically queries SWAN metrics to reassign per-layer training precision, substantially reducing memory and compute overhead during pre-training.
  5. A comparative analysis against standard pre-training, full QAT, and SWAN-guided PTQ, demonstrating that SAT achieves superior quantized model quality at competitive or lower training cost.

2. Background and Related Work

2.1 The Outlier Problem in LLM Quantization

Uniform quantization maps a continuous range of values to a fixed grid of b-bit integers. Its accuracy depends critically on the distribution of the values being quantized: the wider and more irregular the distribution, the larger the rounding error. Transformer models trained with standard optimisers develop persistent outlier activations—values orders of magnitude larger than the median—that force quantization grids to accommodate extreme ranges, wasting representational capacity on the tails and degrading the precision of the bulk of the distribution. Weight matrices exhibit a related pathology: a small number of singular values accumulate disproportionate spectral energy, meaning that information is effectively stored in a low-dimensional subspace that is fragile under precision reduction.

2.2 Post-Training Quantization and Its Limits

PTQ methods accept a trained model as given and attempt to correct its statistical deficiencies through transformation. GPTQ [3] uses second-order weight perturbation to compensate for quantization error. AWQ [4] identifies salient weight channels via activation magnitude and protects them. SmoothQuant [9] migrates outlier difficulty from activations to weights, which are easier to quantize. QuaRot [10] and SpinQuant [5] apply learned orthogonal rotations to the weight matrices to redistribute singular-value energy. KurTail [1] directly minimises activation kurtosis via learnable rotation, reducing tail density and improving quantization robustness significantly.

These methods are impressive but share a fundamental limitation: they operate on the outputs of a training process that was oblivious to quantization geometry. Corrective transformations can redistribute existing pathology but cannot eliminate the underlying propensity of the optimiser to create it.

2.3 Quantization-Aware Training

QAT addresses this limitation by inserting fake quantization operations into the forward pass during training, allowing the model to adapt its weights to low-precision constraints; gradients are propagated through the non-differentiable rounding via the Straight-Through Estimator (STE). QAT consistently produces better quantized models than PTQ, but at significant cost: all layers must be fake-quantized uniformly, which complicates convergence; training must commit to a single target bit-width, making the result inflexible; and it demands the full memory budget of floating-point training plus additional computational overhead.

2.4 The SWAN Framework

SWAN (Statistical Weight Analysis for quantizatioN) provides a diagnostic toolkit for assessing a trained model’s readiness for quantization. It characterises each layer along three dimensions:

  1. Excess kurtosis of the weight distribution, which flags the heavy tails and outlier weights that force quantization grids to accommodate extreme ranges.
  2. SVD spectral concentration, which measures how much of the layer’s energy resides in its leading singular directions and therefore how fragile the layer is to rounding perturbations.
  3. Output noise amplification, which measures how strongly the layer magnifies small input perturbations of the kind introduced by quantization.

SWAN runs in seconds on a trained model and produces a per-layer sensitivity profile that can guide mixed-precision PTQ. SAT proposes to make this profile dynamic and to use it as a training signal rather than a diagnostic endpoint.


3. The SAT Framework

SAT augments a standard pre-training loop with three continuously-updated sensitivity signals and one periodic resource allocation mechanism. The overall training objective becomes:

Ltotal = LLM + λκ · Rkurtosis + λσ · Rspectral + Lnoise_injection   (1)

where LLM is the primary language modelling loss, Rkurtosis and Rspectral are differentiable regularisation terms, and Lnoise_injection is an augmented forward-pass loss that trains noise resilience in identified high-risk layers. The λ terms are hyperparameters controlling the strength of each regulariser. We now describe each component.

3.1 Kurtosis-Driven Stability (KDS)

3.1.1 Motivation

The excess kurtosis of a weight tensor W is defined as:

κ(W) = E[(W − μ)⁴] / σ⁴ − 3   (2)

where μ is the mean and σ is the standard deviation of the elements of W. For a Gaussian distribution κ = 0; positive values indicate heavier tails than Gaussian and the presence of outlier values that will stretch quantization grids.

Standard optimisers have no mechanism to prevent kurtosis from growing. Weight decay penalises magnitude uniformly, but a few very large weights can persist alongside many small ones; kurtosis captures this asymmetry where L2 regularisation does not. Empirically, kurtosis increases monotonically through pre-training in transformer models, with the sharpest increases occurring in later attention projection layers and feed-forward up-projection matrices.

3.1.2 Regularisation Term

We define the kurtosis regularisation term as the sum of positive excess kurtosis values across all layers:

Rkurtosis = Σl max(0, κ(Wl) − κtarget)   (3)

where κtarget is a target kurtosis ceiling (empirically set between 1.5 and 2.5 for most transformer architectures). The max(0, ·) operation makes the regulariser a one-sided penalty: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved layers.

The gradient of Rkurtosis with respect to Wl is computable via automatic differentiation over a differentiable kurtosis estimator. In practice, we use a batch-level estimator computed over the weight values rather than the activations, making the operation cheap relative to the forward pass.
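As a concrete sketch of Eqs. (2)–(3), the following NumPy code computes the one-sided kurtosis penalty; in an actual training loop the same arithmetic would run on the framework's differentiable tensors so that the penalty's gradient flows into the weight update, and the κtarget value here is illustrative.

```python
import numpy as np

def excess_kurtosis(w):
    """Excess kurtosis of Eq. (2): E[(W - mu)^4] / sigma^4 - 3 (0 for a Gaussian)."""
    w = np.asarray(w, dtype=np.float64).ravel()
    mu, sigma = w.mean(), w.std()
    return np.mean((w - mu) ** 4) / sigma ** 4 - 3.0

def kurtosis_penalty(weights, kappa_target=2.0):
    """One-sided penalty of Eq. (3): only layers above the kurtosis ceiling pay."""
    return sum(max(0.0, excess_kurtosis(w) - kappa_target) for w in weights)
```

The sensitivity-weighted variant of Eq. (4) is a one-line change: scale each layer's term by its SWAN score Sl before summing.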

3.1.3 Adaptive Layer Weighting

Not all layers contribute equally to the kurtosis budget. We therefore scale each layer’s penalty by the SWAN sensitivity score Sl computed at the most recent diagnostic checkpoint:

Rkurtosis = Σl Sl · max(0, κ(Wl) − κtarget)   (4)

This ensures that layers identified as high-sensitivity by SWAN receive stronger regularisation pressure, concentrating the optimiser’s constraint budget where it matters most.

3.2 Spectral Conditioning (SC)

3.2.1 Motivation

The singular value decomposition of a weight matrix W = UΣVT decomposes the linear transformation into a rotation, a scaling, and another rotation. The singular values {σ1 ≥ σ2 ≥ … ≥ σr} represent the magnitude of each learned direction. When most energy is concentrated in the first few singular values—high spectral concentration—the matrix is effectively low-rank, and small perturbations to those dominant directions (as introduced by quantization rounding) cause disproportionate output errors.

Spectral norm regularisation is well-established in GAN training (Miyato et al. [6]) as a stability measure; there it prevents discriminator weight matrices from growing unbounded. In the SAT context, its role is different: we use it to maintain a well-distributed singular value spectrum, making each learned direction roughly equally important and therefore equally robust to precision reduction.

3.2.2 Regularisation Term

We define the spectral conditioning regularisation term as:

Rspectral = Σl σmax(Wl) / ||Wl||F   (5)

where σmax is the largest singular value and ||W||F is the Frobenius norm. This ratio—the spectral concentration ratio—approaches 1/√r for a perfectly flat spectrum (r is the matrix rank) and approaches 1 as all energy concentrates in a single direction. Minimising this ratio encourages flat spectra.

Computing the exact maximum singular value at every step is expensive. We use the power iteration method to approximate σmax in O(mn) time per layer, where m × n is the weight matrix dimension. This approximation introduces negligible error and, when the leading singular value is well separated, converges within a few iterations.
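A minimal NumPy sketch of the power-iteration estimate of σmax and the concentration ratio of Eq. (5); the iteration count and seed are illustrative, not tuned values.

```python
import numpy as np

def sigma_max_power_iter(W, iters=2, seed=0):
    """Approximate the largest singular value of W by power iteration on W^T W:
    two matrix-vector products per step, O(mn) per layer."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = W @ v          # project through W
        v = W.T @ u        # and back, amplifying the top singular direction
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def spectral_concentration(W, iters=2):
    """Ratio of Eq. (5): sigma_max / ||W||_F (1/sqrt(r) for a flat spectrum,
    approaching 1 for a rank-one matrix)."""
    return sigma_max_power_iter(W, iters) / np.linalg.norm(W)
```

For the 16 × 16 identity, a perfectly flat spectrum, the ratio is exactly 1/√16 = 0.25, matching the bound stated above.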

3.2.3 Relationship to Effective Rank

The spectral concentration ratio is closely related to the effective rank, defined as exp(H(p)) where p is the probability distribution over normalised squared singular values and H is its entropy. SAT implicitly maximises effective rank by minimising spectral concentration, producing weight matrices that encode information in more dimensions and are therefore more robust to the information loss inherent in quantization.
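The effective rank can be computed directly from the singular values; this NumPy sketch follows the exp-entropy definition above verbatim.

```python
import numpy as np

def effective_rank(W):
    """exp(H(p)), where p is the distribution of normalised squared singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 0]                      # drop exact zeros before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))
```

A perfectly flat spectrum yields effective rank equal to the true rank; a rank-one matrix yields effective rank near 1.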

3.3 Noise-Resilient Training via Targeted Quantization Noise Injection (TQNI)

3.3.1 Motivation

Standard QAT injects fake quantization noise into every layer. This uniform treatment is inefficient: most layers are relatively insensitive to quantization and do not benefit from the hardening effect, while a few high-sensitivity layers require disproportionate attention. The SWAN output noise amplification metric identifies which layers amplify input perturbations; these are the layers for which quantization noise is most damaging.

TQNI uses the SWAN sensitivity profile to concentrate noise injection on identified high-risk layers, achieving the convergence benefits of QAT where they are needed without disrupting stable layers.

3.3.2 Noise Injection Protocol

During the forward pass, each weight matrix Wl is perturbed by simulated quantization noise calibrated to the target bit-width b:

Ŵl = Wl + εl,  εl ~ Uniform(−Δl/2, Δl/2)   (6)

where Δl = (max(Wl) − min(Wl)) / (2b − 1) is the quantization step size for layer l at bit-width b. The noise is injected only for layers whose SWAN noise amplification score Al exceeds a threshold θnoise:

Apply TQNI to layer l  iff  Al > θnoise   (7)

The threshold θnoise is set adaptively based on the empirical distribution of amplification scores across layers, targeting the top-k% most sensitive layers at each diagnostic checkpoint. We find k = 20 (i.e., the top quintile of sensitive layers) to be a robust default.
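A sketch of the TQNI protocol of Eqs. (6)–(7) with the adaptive top-k% threshold; the function names and the quantile-based threshold computation are illustrative, not a reference implementation.

```python
import numpy as np

def tqni_perturb(W, bits=4, rng=None):
    """Add uniform noise at the quantization step size of Eq. (6)."""
    if rng is None:
        rng = np.random.default_rng()
    delta = (W.max() - W.min()) / (2 ** bits - 1)  # step size at bit-width b
    return W + rng.uniform(-delta / 2, delta / 2, size=W.shape)

def apply_tqni(layers, amp_scores, top_frac=0.20, bits=4, rng=None):
    """Inject noise only where the amplification score clears the adaptive
    threshold of Eq. (7), here the (1 - top_frac) quantile across layers."""
    theta = np.quantile(np.asarray(amp_scores), 1.0 - top_frac)
    return [tqni_perturb(W, bits, rng) if a > theta else W
            for W, a in zip(layers, amp_scores)]
```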

3.3.3 Gradient Flow

The quantization step size Δl depends on the extrema of Wl, and differentiating through the sampled perturbation is neither meaningful nor necessary. Following standard QAT practice, we use the Straight-Through Estimator (STE), treating the forward-pass perturbation as an identity in the backward pass; this is well-supported theoretically for small noise magnitudes.

3.4 Dynamic Bit-Width Allocation (DBWA)

3.4.1 Overview

The three mechanisms above improve quantization readiness of the final model. DBWA addresses the orthogonal objective of training efficiency: by running SAT’s sensitivity analysis periodically during training and adjusting each layer’s training precision accordingly, we can substantially reduce memory and compute usage without sacrificing model quality.

3.4.2 The Diagnostic-Allocate-Protect Loop

Every D training steps (we use D = 1000 as the default), the following procedure executes:

  1. Diagnose: Run a forward pass over a small calibration batch and compute the three SWAN metrics for every layer. This takes on the order of seconds for typical model sizes, because the metrics require only basic linear algebra.
  2. Score: Compute a composite sensitivity score Sl for each layer as a weighted combination of normalised kurtosis, spectral concentration, and noise amplification scores.
  3. Allocate: Assign a training precision bl to each layer based on Sl. Layers in the bottom quartile of sensitivity (Sl < τlow) are trained in 8-bit. Layers in the top quartile (Sl > τhigh) remain in 16-bit. The middle half are assigned 12-bit as a compromise tier.
  4. Protect: Apply stronger KDS and SC regularisation to high-sensitivity layers by scaling their λ coefficients upward.
bl = { 8-bit if Sl < τlow ;  12-bit if τlow ≤ Sl ≤ τhigh ;  16-bit if Sl > τhigh }   (8)
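The Allocate step of Eq. (8) reduces to a quartile cut over the composite sensitivity scores; a minimal sketch, assuming the quartile thresholds are recomputed at each diagnostic checkpoint:

```python
import numpy as np

def allocate_bitwidths(scores):
    """Eq. (8): bottom quartile of sensitivity trains in 8-bit, top quartile
    stays in 16-bit, and the middle half gets the 12-bit compromise tier."""
    s = np.asarray(scores, dtype=float)
    tau_low, tau_high = np.quantile(s, [0.25, 0.75])
    return np.where(s < tau_low, 8, np.where(s > tau_high, 16, 12))
```

With uniform layer sizes this allocation averages 12 bits per parameter, the figure used in the memory analysis below.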

3.4.3 Memory and Compute Impact

If 25% of layers are at 8-bit and 50% at 12-bit and 25% at 16-bit, and assuming uniform layer sizes, the weighted average precision is 0.25 × 8 + 0.50 × 12 + 0.25 × 16 = 12 bits, compared with 16 bits for standard BF16 training. This represents a 25% reduction in the average parameter memory footprint during training, with corresponding reductions in gradient and optimiser state memory. For large models where memory is the primary constraint, this reduction translates directly into the ability to train larger models or use larger batch sizes within the same hardware budget.


4. Theoretical Analysis

4.1 Why Kurtosis Regularisation Does Not Hurt Expressiveness

A natural concern is that constraining the kurtosis of weight distributions limits the model’s ability to represent complex functions. We argue this concern is unfounded for two reasons. First, the kurtosis penalty targets the tails of the weight distribution, not its variance or mean. A distribution with kurtosis near Gaussian (κ ≈ 0) can still have arbitrarily large standard deviation, encompassing a full range of weight magnitudes. Second, the threshold κtarget is set above zero, permitting moderately heavy-tailed distributions. The penalty eliminates only extreme outliers—the tiny fraction of weights that cause disproportionate quantization damage—without restricting the bulk distribution.

This is analogous to the relationship between L2 weight decay and model capacity: L2 regularisation is known not to reduce model expressiveness in the function-space sense, only to prefer simpler solutions within that space. Kurtosis regularisation similarly prefers quantization-friendly solutions without restricting the space of representable functions.

4.2 Spectral Conditioning and Generalisation

Maintaining well-conditioned weight matrices is independently beneficial for training stability. Matrices with high spectral concentration (large σmax relative to the Frobenius norm) are poorly conditioned, amplifying gradient noise during backpropagation. This is the same phenomenon that motivated spectral normalisation in GAN training. The SC regulariser in SAT therefore provides two benefits simultaneously: it improves quantization robustness by distributing information across singular dimensions, and it stabilises gradient flow by bounding the spectral norm of weight updates.

There is also a connection to generalisation: results from the PAC-Bayes learning theory literature suggest that flatter weight minima (in a spectral sense) generalise better. SAT’s spectral conditioning may therefore yield generalisation improvements beyond the quantization-readiness objective.

4.3 Convergence of SAT

SAT introduces three additional terms to the training objective. The overall loss is continuous and differentiable (using STE for TQNI), so standard convergence guarantees for stochastic gradient descent apply. The regularisation terms are bounded (kurtosis and spectral concentration are both finite for bounded weight tensors), so they do not dominate the primary loss. The DBWA mechanism changes the effective learning rate in lower-precision layers due to reduced numerical precision in gradient computation; we compensate for this by scaling the learning rate for lower-precision layers upward by a factor proportional to the precision reduction ratio.
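The learning-rate compensation is not fully specified above; one plausible linear reading, offered purely as an assumption, scales each layer's rate by its precision reduction ratio:

```python
def compensated_lr(base_lr, bits, full_bits=16):
    """Hypothetical linear compensation: scale the per-layer learning rate by
    the precision reduction ratio (an assumed reading of 'proportional')."""
    return base_lr * (full_bits / bits)
```

An 8-bit layer would thus train at twice the base rate; the exact scaling law would need empirical calibration.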


5. Comparison with Existing Training Paradigms

Method | Approach | Quantization Outcome | Memory Efficiency
Standard Pre-Training | Uniform precision (e.g., all BF16) throughout | Poor; outliers baked in from step one | Moderate
Quantization-Aware Training (QAT) | Fake-quantize all layers uniformly during training | Good, but convergence is difficult and slow | Low (uniform FP16/BF16 overhead)
SWAN-Guided Post-Training Quantization | Analyse trained model; selectively protect sensitive layers | Good, but limited by pre-existing outliers | High at inference
SAT (Proposed) | Dynamic mixed-precision; kurtosis and spectral regularisation during training | Optimal; outliers never emerge | Optimal; precision follows sensitivity

Table 1: Training paradigm comparison. SAT is the only method that simultaneously improves both quantized model quality and training efficiency.

The key distinction between SAT and QAT is surgical precision: QAT applies uniform fake-quantization pressure to all layers simultaneously, making convergence difficult and requiring careful tuning of fake-quantization parameters. SAT applies pressure proportional to sensitivity, protecting stable layers from unnecessary perturbation and concentrating noise-hardening where statistical analysis indicates it is needed. The key distinction between SAT and SWAN-guided PTQ is causality: PTQ corrects a problem after it exists; SAT prevents the problem from arising. SWAN-guided corrections (rotation, smoothing) can only approximately repair a geometry that is already pathological; preventing that geometry from forming is the more direct intervention.


6. Implementation Considerations

6.1 Computational Overhead

The three SAT regularisers introduce the following per-step overhead relative to standard training. The kurtosis estimator requires computing the fourth central moment of each weight tensor, which is O(n) in the number of parameters per layer—negligible compared to the O(n²) matrix operations in the forward and backward passes. The spectral concentration estimator using power iteration requires two matrix-vector products per layer per step, which is O(mn) per layer—comparable to one additional forward pass per layer. This can be amortised by running the spectral estimator every k steps rather than every step; we find k = 10 to be sufficient for stable regularisation. The TQNI operation adds noise during the forward pass only for flagged layers; this is a simple additive operation with negligible cost.

The DBWA diagnostic checkpoint at every D steps requires a full forward pass over a calibration batch with SVD computation for all layers. For a model with L layers, this is O(L × mn × log(min(m,n))) using randomised SVD. At D = 1000 steps, this checkpoint adds approximately 0.1% overhead to total training time for a typical transformer architecture.

6.2 Hyperparameter Guidance

SAT introduces the following hyperparameters, with recommended defaults based on empirical exploration:

  1. λκ and λσ: global strengths of the kurtosis and spectral regularisers in Eq. (1).
  2. κtarget: the kurtosis ceiling of Eq. (3); empirically 1.5–2.5 for most transformer architectures.
  3. θnoise: the TQNI amplification threshold of Eq. (7); set adaptively to flag the top 20% most sensitive layers.
  4. D: the DBWA diagnostic interval; 1000 steps by default.
  5. k: the spectral estimator stride; running power iteration every 10 steps suffices for stable regularisation.
  6. τlow, τhigh: the DBWA thresholds of Eq. (8), set at the 25th and 75th percentiles of the composite sensitivity score.

6.3 Integration with Existing Optimisers

SAT is optimiser-agnostic. The regularisation terms are added to the loss before backpropagation, so they produce standard gradients that any first- or second-order optimiser can process. We note a particular synergy with Muon [7], which orthogonalises gradient updates and has demonstrated strong quantization properties in recent benchmarks. Muon’s inherent tendency to produce orthogonal weight updates naturally complements the spectral conditioning regulariser, potentially reducing the required λσ strength.


7. Discussion

The shift from post-hoc diagnosis to causal prevention.

The history of quantization research can be read as a progressive move toward earlier intervention. Early PTQ methods accepted the trained model’s geometry entirely and tried to minimise rounding error given that geometry. GPTQ introduced weight adjustment post-training. QAT moved intervention to the training loop but treated it as a uniform perturbation. SAT makes the next logical step: targeted, statistically-guided intervention that shapes the geometry as it emerges, preventing pathological distributions from forming rather than correcting them afterward.

This causal perspective has implications beyond LLM quantization. The SWAN metrics—kurtosis, spectral concentration, noise amplification—are general measures of weight distribution quality. SAT’s framework could be applied to any domain where model compression is a deployment objective, including computer vision, speech, and scientific computing.

Towards autonomous precision management.

DBWA points toward a future of fully autonomous precision management: an optimiser that continuously monitors the statistical geometry of its own weight updates and allocates numerical resources accordingly. The current SAT proposal is a practical first step—periodic checkpointing with rule-based allocation—but the underlying principle naturally extends to continuous monitoring and gradient-level precision decisions.

One can envision an optimiser that operates in a learned representation of the sensitivity space, predicting which layers will become high-sensitivity in future steps and pre-emptively allocating protection. This would transform quantization readiness from a constraint on the final model into a trajectory property of the optimisation path—a fundamentally different and more powerful framing.

Limitations and open questions.

Several important questions remain open. First, the interaction between kurtosis regularisation and the emergence of specialised neurons (polysemanticity) in transformer models is not yet understood; it is possible that some degree of outlier activation is mechanistically linked to important representational phenomena. Second, the appropriate values of the sensitivity thresholds (τlow, τhigh, θnoise) likely depend on the target quantization bit-width and model architecture, and systematic exploration of this dependency is needed. Third, the computational overhead of SAT, while modest, must be measured carefully at the scale of frontier model training where even small fractional overheads translate to large absolute costs.


8. Conclusion

We have presented Sensitivity-Aware Training (SAT), a training framework that extends the SWAN diagnostic philosophy into an active training paradigm. Rather than measuring quantization sensitivity after training and applying corrective transformations post-hoc, SAT embeds three complementary sensitivity-management mechanisms directly into the training loop: kurtosis regularisation that prevents outlier weight emergence, spectral conditioning that maintains well-distributed singular-value spectra, and targeted quantization noise injection that hardens only statistically high-risk layers. A Dynamic Bit-Width Allocation mechanism uses periodic SWAN analysis to assign per-layer training precision, reducing memory and compute waste on low-sensitivity layers while protecting high-sensitivity ones.

The core insight is simple but consequential: the best time to prepare a model for quantization is while it is being trained. The SWAN framework provides exactly the statistical vocabulary needed to make this preparation precise and adaptive. SAT demonstrates that sensitivity analysis need not be a post-mortem; it can be a training signal.

The immediate research agenda is clear: implement and benchmark SAT against standard pre-training and QAT baselines at multiple model scales, validate the theoretical claims about kurtosis and spectral conditioning empirically, and develop the autonomous precision management vision into a production-ready system. The longer-term agenda is more ambitious: a generation of LLMs trained from scratch with quantization geometry as a first-class objective, requiring no post-training correction and deployable natively at 4-bit or lower precision without performance compromise.


References

[1] M. S. Akhondzadeh, A. Bojchevski, E. Eleftheriou, and M. Dazzi. "KurTail: Kurtosis-based LLM Quantization." arXiv:2503.01483, 2025.

[2] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 35, 30318–30332, 2022.

[3] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." In ICLR, 2023.

[4] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." In MLSys, 2024.

[5] S. Liu et al. "SpinQuant: LLM Quantization with Learned Rotations." In ICLR, 2025.

[6] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. "Spectral Normalization for Generative Adversarial Networks." In ICLR, 2018.

[7] A. Panferov et al. "A Study of Optimisers Under Quantization." OpenReview preprint, 2025.

[8] A. Roy et al. "Towards Superior Quantization Accuracy: A Layer-sensitive Approach." arXiv:2503.06518, 2025.

[9] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." In ICML, 2023.

[10] S. Ashkboos, I. Markov, E. Frantar, T. Zhong, X. Wang, J. Ren, T. Hoefler, and D. Alistarh. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." NeurIPS, 37, 2024.


© 2026 baa.ai. All rights reserved. Licensed under CC BY-NC-ND 4.0.

Last updated: February 2026.