Most ONNX models are exported in FP32, that is, 32 bits of floating point per weight. NVIDIA's Tensor Cores can run inference in FP16 (half precision: up to roughly 2× the throughput and half the memory) or FP8 (a quarter of the memory, supported on Hopper and Ada GPUs and newer), with minimal accuracy loss for most models. This article covers when reducing precision actually pays off in an MQL5 EA, how to convert your model, and the pitfalls to watch for.
The four precisions
| Precision | Bits | Hardware required | Throughput vs FP32 |
|---|---|---|---|
| FP32 | 32 | Any CUDA GPU | 1.0× (baseline) |
| FP16 | 16 | Volta+ (compute 7.0+); consumer cards from Turing (compute 7.5) | Up to 2× on Tensor Cores |
| BF16 | 16 | Ampere+ (compute 8.0+) | Up to 2×; wider numerical range than FP16, but fewer mantissa bits |
| FP8 | 8 | Ada (compute 8.9) or Hopper+ (compute 9.0+) | Up to 4×, but requires careful calibration |
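The hardware column can be turned into a quick capability check. The sketch below is illustrative only: `supported_precisions` is a hypothetical helper, and it uses compute 7.0 as the FP16 threshold because FP16 Tensor Cores first shipped with Volta (consumer cards start at Turing, compute 7.5).

```python
def supported_precisions(major: int, minor: int) -> list[str]:
    """Return the precisions a GPU with the given CUDA compute capability
    can accelerate, following the thresholds in the table above."""
    cc = (major, minor)
    precisions = ["FP32"]            # runs on any CUDA GPU
    if cc >= (7, 0):                 # Volta and newer (Turing is 7.5)
        precisions.append("FP16")
    if cc >= (8, 0):                 # Ampere and newer
        precisions.append("BF16")
    if cc >= (8, 9):                 # Ada is 8.9, Hopper is 9.0
        precisions.append("FP8")
    return precisions


# Example: an Ampere consumer card reports compute capability 8.6.
print(supported_precisions(8, 6))
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current GPU.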
When it pays off
Honest answer: only if your model is large enough to be bottlenecked by GPU compute or memory, and your hardware actually has the relevant tensor units.
- Small models (under 1M params): FP16 makes essentially no difference in inference time on retail GPUs. Don't bother.
- Medium models (1M–100M params, LSTMs, small transformers): FP16 gives a real 1.5–2× speedup on Turing+ cards. Worth doing if inference is on the hot path.
- Large transformers (100M+ params): FP16 is mandatory; FP8 helps further on Hopper/Ada. But this is rare in retail trading.
- CPU inference: FP16 is not faster on CPU, and in many cases it is slower because of conversion overhead. Stay with FP32 if you are CPU-bound.
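A rough way to see why model size matters is to compute the weight footprint at each precision. This is back-of-the-envelope arithmetic that counts weights only (activations and runtime overhead are ignored), and the parameter counts are made-up examples for the three size classes above.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def weight_memory_mib(n_params: int, precision: str) -> float:
    """Approximate weight storage in MiB (weights only)."""
    return n_params * BYTES_PER_WEIGHT[precision] / 2**20


# Hypothetical small, medium, and large models.
for n_params in (500_000, 20_000_000, 300_000_000):
    row = ", ".join(
        f"{p}: {weight_memory_mib(n_params, p):8.1f} MiB"
        for p in ("FP32", "FP16", "FP8")
    )
    print(f"{n_params:>11,} params -> {row}")
```

A model under 1M parameters occupies only a few MiB even in FP32, which is one plausible reason that halving it rarely shows up in inference timing.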
Converting an FP32 model to FP16
Do the conversion at export time in Python, not at runtime in MQL5. Two options:
Option A: train in FP32, convert post-hoc
This takes a single conversion call after export and works for most models. The conversion library rewrites the FP32 weights as FP16 and inserts explicit type casts where needed (e.g., around layers that must stay in full precision).
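The article does not name the conversion library, so the following is only a sketch of what Option A could look like, assuming the `onnx` and `onnxconverter-common` packages. The function name `convert_to_fp16` and the file names are hypothetical.

```python
def convert_to_fp16(src_path: str, dst_path: str, keep_fp32_ops=None) -> None:
    """Convert an FP32 ONNX file to FP16 (assumes onnx + onnxconverter-common)."""
    # Imported lazily so this module loads even where the packages are absent.
    import onnx
    from onnxconverter_common import float16

    model = onnx.load(src_path)
    model_fp16 = float16.convert_float_to_float16(
        model,
        keep_io_types=True,           # inputs/outputs stay FP32 for the caller
        op_block_list=keep_fp32_ops,  # op types to leave in FP32; None = library default
    )
    onnx.save(model_fp16, dst_path)


# Hypothetical usage:
# convert_to_fp16("model_fp32.onnx", "model_fp16.onnx")
```

Setting `keep_io_types=True` is a design choice rather than a requirement: it keeps the model's inputs and outputs in FP32, so the calling code does not need to know that the internals changed.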
Option B: train in mixed precision, export native FP16
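Use PyTorch's automatic mixed precision during training (torch.autocast; the older torch.cuda.amp.autocast spelling is deprecated in recent releases). The model computes in mixed FP16/FP32 while the optimizer keeps FP32 master weights, so the weights still have to be cast to FP16 at export time (for example with model.half()). Because the network has already been trained under FP16 arithmetic, this can be slightly more accurate than post-hoc conversion for very deep networks, but it is more work.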
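A minimal sketch of Option B follows, assuming a recent PyTorch and a CUDA device. The function names, tensor names, and the `scaler` argument (a gradient scaler such as `torch.amp.GradScaler("cuda")`) are illustrative assumptions. Note that autocast leaves the stored weights in FP32, so the sketch casts the model explicitly before export.

```python
def train_step_amp(model, batch, targets, optimizer, scaler, loss_fn):
    """One mixed-precision training step (assumes model and data are on CUDA)."""
    import torch  # lazy import so the sketch loads without PyTorch installed

    optimizer.zero_grad(set_to_none=True)
    # Eligible ops run in FP16 inside this context; the weights stay FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(batch), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, skips the step on Inf/NaN
    scaler.update()
    return loss.item()


def export_fp16(model, example_input, path):
    """Cast the trained model to FP16 and export it to ONNX."""
    import torch

    model = model.half().eval()    # the weights become FP16 here
    torch.onnx.export(
        model,
        (example_input.half(),),   # the example input must match the weight dtype
        path,
        input_names=["input"],
        output_names=["output"],
    )
```

A model exported this way takes FP16 inputs and produces FP16 outputs, which matters for how the caller feeds it data.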
What to verify after conversion
- Accuracy on validation data. Run inference on a holdout set in both FP32 and FP16 and compare the outputs. FP16 carries about three significant decimal digits, so small relative differences (typically well under 1% per prediction) are normal; large divergence means the model has numerically sensitive layers that need to stay in FP32 (put them on the converter's "block list").
- Inference timing on target hardware. Benchmark on the actual GPU you'll deploy to. FP16 only speeds up where Tensor Cores activate — some layers don't get any benefit.
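- No NaN/Inf in outputs. FP16 has a narrow numeric range (maximum about 65504). Activations that exceed it overflow to Inf, which then tends to turn into NaN in later layers. A post-hoc converter can clamp out-of-range weights, but it cannot see runtime activations, so check the outputs on real data.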
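The three checks above can be scripted. The sketch below uses plain Python so it runs anywhere. It assumes you have already collected flat lists of outputs from the FP32 and FP16 models on the same holdout inputs, and the 1% tolerance is an arbitrary example threshold, not a recommendation.

```python
import math
import time


def compare_outputs(reference, candidate, rel_tol=1e-2):
    """Compare FP16 outputs (candidate) against FP32 outputs (reference)."""
    if len(reference) != len(candidate):
        raise ValueError("output lists must have the same length")
    non_finite = sum(1 for y in candidate if not math.isfinite(y))
    max_rel_err = 0.0
    for ref, out in zip(reference, candidate):
        if not math.isfinite(out):
            continue                      # already counted as non-finite
        denom = max(abs(ref), 1e-6)       # avoid dividing by ~0
        max_rel_err = max(max_rel_err, abs(out - ref) / denom)
    return {
        "non_finite": non_finite,
        "max_rel_err": max_rel_err,
        "ok": non_finite == 0 and max_rel_err <= rel_tol,
    }


def time_inference(run_once, warmup=10, iters=100):
    """Average wall-clock milliseconds per call, after a warm-up phase."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters * 1e3


# Made-up outputs: the last FP16 value has overflowed to Inf.
fp32_out = [0.5123, -1.25, 0.003]
fp16_out = [0.5120, -1.25, float("inf")]
print(compare_outputs(fp32_out, fp16_out))
```

Pass `time_inference` a zero-argument callable that runs one full inference and blocks until the result is available, and call it once per model on the GPU you will deploy to.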
FP8: more aggressive, more careful
FP8 is supported by the ONNX Runtime CUDA provider on Hopper and Ada cards. The savings are real (4× less memory than FP32 and a large throughput gain on compatible matmul ops), but calibration is non-trivial: you typically need a representative dataset to determine activation scaling factors before conversion. For most retail trading workloads, this is more engineering effort than the speedup justifies, so stick to FP16.
If you're running a large transformer that genuinely needs FP8, look at NVIDIA's TensorRT-LLM or the Hugging Face Optimum library for the conversion tooling.
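To make the calibration step concrete, here is a sketch of the simplest scheme, per-tensor max ("amax") scaling. Real toolchains often use percentile or histogram-based variants, so treat this as an illustration rather than what any particular library does. The helper names and sample values are hypothetical. 448 is the largest finite value in the E4M3 FP8 format.

```python
FP8_E4M3_MAX = 448.0     # largest finite value in the E4M3 format
FP8_E5M2_MAX = 57344.0   # largest finite value in the E5M2 format


def calibrate_scale(activation_batches, fp8_max=FP8_E4M3_MAX):
    """Per-tensor scale: map the largest observed magnitude onto fp8_max."""
    amax = 0.0
    for batch in activation_batches:
        for x in batch:
            amax = max(amax, abs(x))
    if amax == 0.0:
        return 1.0                   # all-zero tensor: any scale works
    return amax / fp8_max


def scale_and_clamp(x, scale, fp8_max=FP8_E4M3_MAX):
    """Divide by the scale and saturate to the FP8 range.
    Rounding to the FP8 grid is omitted for brevity."""
    return max(-fp8_max, min(fp8_max, x / scale))


# Made-up calibration activations from a representative dataset.
batches = [[0.5, -3.0], [12.0, 896.0]]
scale = calibrate_scale(batches)
print(scale, scale_and_clamp(896.0, scale), scale_and_clamp(2000.0, scale))
```

Values larger than anything seen during calibration saturate at the format maximum, which is why the calibration set has to be representative of live data.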
MQL5 side: nothing changes
From MQL5's perspective, an FP16 ONNX model that keeps FP32 inputs and outputs is loaded and run identically to an FP32 one: the runtime handles the internal dtype, and your OnnxCreate, OnnxSetInputShape, and OnnxRun calls don't change. Keep using matrixf/vectorf on the MQL5 side; they are FP32. If the model's inputs or outputs are themselves FP16, check the conversion path before deploying: MQL5 has no native half-precision scalar type, and the documented route is to pass ushort arrays that you fill with ArrayToFP16 and read back with ArrayFromFP16.