Most ONNX models are exported in FP32: 32 bits of floating point per weight. NVIDIA's Tensor Cores can run inference at FP16 (half precision, up to about 2× throughput, half the memory) or FP8 (a quarter of the memory, supported on Ada, Hopper, and newer) with minimal accuracy loss for most models. This article covers when reducing precision actually pays off in an MQL5 EA, how to convert your model, and the pitfalls.

The three precisions

| Precision | Bits | Hardware required | Throughput vs FP32 |
|---|---|---|---|
| FP32 | 32 | Any CUDA GPU | 1.0× (baseline) |
| FP16 | 16 | Volta+ (compute 7.0+; Turing is 7.5) | 2× on Tensor Cores |
| BF16 | 16 | Ampere+ (compute 8.0+) | 2×, better numerical range than FP16 |
| FP8 | 8 | Ada (compute 8.9) or Hopper+ (compute 9.0+) | 4×, but requires careful calibration |

When it pays off

Honest answer: only if your model is large enough to be bottlenecked by GPU compute or memory, and your hardware actually has the relevant tensor units.
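To put a rough number on "large enough", a back-of-the-envelope estimate of weight memory helps. The sketch below assumes a hypothetical 2-million-parameter model; the parameter count and helper names are illustrative only, and the estimate ignores activations and runtime overhead.

```python
# Bits per weight for each precision discussed in this article.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8}


def weight_bytes(n_params: int, precision: str) -> int:
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * BITS_PER_WEIGHT[precision] // 8


def print_footprint(n_params: int) -> None:
    """Print the weight footprint in MiB for every precision."""
    for name in BITS_PER_WEIGHT:
        mib = weight_bytes(n_params, name) / 2**20
        print(f"{name}: {mib:.2f} MiB")


if __name__ == "__main__":
    # Hypothetical small model: 2 million parameters.
    print_footprint(2_000_000)
```

For a model of this size the FP32 weights come to under 8 MiB, so halving them is unlikely to change anything you can measure. The picture is different for models with hundreds of millions of parameters.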

Converting an FP32 model to FP16

Do the conversion at export time in Python, not at runtime in MQL5. Two options:

Option A: train in FP32, convert post-hoc

```python
# Post-training FP16 conversion
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("eurusd_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32)
onnx.save(model_fp16, "eurusd_fp16.onnx")
```

One command. Works for most models. The library converts FP32 weights to FP16 and adds explicit type casts where needed (e.g., for layers that need full precision).
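A variation worth considering is to keep the model's inputs and outputs in FP32 while converting everything in between, so the calling code sees the same tensor types as before. The sketch below relies on the `keep_io_types` and `op_block_list` arguments of `convert_float_to_float16` and on `float16.DEFAULT_OP_BLOCK_LIST`, as I understand the `onnxconverter_common` API; check them against your installed version. The function names, the naming helper, and the file name are hypothetical.

```python
from pathlib import Path


def fp16_path(src: str) -> str:
    """Hypothetical naming helper: eurusd_fp32.onnx -> eurusd_fp16.onnx."""
    p = Path(src)
    if "fp32" in p.stem:
        stem = p.stem.replace("fp32", "fp16")
    else:
        stem = p.stem + "_fp16"  # never overwrite the source file
    return str(p.with_name(stem + p.suffix))


def convert_with_fp32_io(src: str, extra_blocked_ops=()) -> str:
    """Convert src to FP16 internally while leaving graph inputs/outputs FP32."""
    # Imported lazily so this module loads even where the packages are absent.
    import onnx
    from onnxconverter_common import float16

    model_fp32 = onnx.load(src)
    model_fp16 = float16.convert_float_to_float16(
        model_fp32,
        keep_io_types=True,  # graph inputs and outputs stay FP32
        # Extend (rather than replace) the converter's default block list.
        op_block_list=float16.DEFAULT_OP_BLOCK_LIST + list(extra_blocked_ops),
    )
    dst = fp16_path(src)
    onnx.save(model_fp16, dst)
    return dst


if __name__ == "__main__":
    # Assumes eurusd_fp32.onnx exists; "Softmax" is only an example op type.
    print(convert_with_fp32_io("eurusd_fp32.onnx", extra_blocked_ops=["Softmax"]))
```

Passing op types in `extra_blocked_ops` keeps those operators in FP32, which is one way to handle numerically sensitive layers.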

Option B: train in mixed precision, export native FP16

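Use PyTorch's torch.cuda.amp.autocast during training, so the forward pass runs in mixed FP16/FP32 and the network learns weights that tolerate reduced precision. Autocast itself keeps the stored weights in FP32, so cast the model to half precision (model.half()) before calling torch.onnx.export if you want FP16 weights in the exported file. This can be slightly more accurate for very deep networks than post-hoc conversion, but it is more work.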

What to verify after conversion

  1. Accuracy on validation data. Run inference on a holdout set in both FP32 and FP16 and compare the outputs. A few percent difference per prediction is normal for FP16; large divergence means the model has numerically sensitive layers that need to stay FP32. The converter takes a "block list" for this: op_block_list excludes operator types, and node_block_list excludes individual nodes by name.
  2. Inference timing on target hardware. Benchmark on the actual GPU you'll deploy to. FP16 only speeds up where Tensor Cores activate — some layers don't get any benefit.
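  3. No NaN/Inf in outputs. FP16 has a narrower numeric range (max ~65504). Activations that exceed it overflow to Inf, which usually turns into NaN a few operations later. The post-hoc converter usually catches this, but it only inspects the graph and weights, not your runtime activations, so double-check on real data.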
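Checks 1 and 3 can be sketched as a small comparison routine. The example below uses only the standard library: `to_fp16` simulates FP16 rounding through the `struct` module's half-precision format, and `compare_outputs` reports non-finite values and the worst relative error. In a real check the two lists would be the outputs of your FP32 and FP16 models on the same holdout inputs; the sample values and the 5% tolerance are assumptions for illustration.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range.
        return math.copysign(math.inf, x)


def compare_outputs(ref, test, rel_tol=0.05):
    """Compare reference (FP32) outputs against reduced-precision outputs."""
    non_finite = sum(1 for y in test if not math.isfinite(y))
    max_rel_err = 0.0
    for r, y in zip(ref, test):
        if math.isfinite(y):
            denom = max(abs(r), 1e-12)  # avoid division by zero
            max_rel_err = max(max_rel_err, abs(y - r) / denom)
    return {
        "non_finite": non_finite,
        "max_rel_err": max_rel_err,
        "ok": non_finite == 0 and max_rel_err <= rel_tol,
    }


if __name__ == "__main__":
    # Hypothetical outputs; 70000.0 is beyond the FP16 range.
    fp32_out = [0.1234567, 1.5, 70000.0]
    fp16_out = [to_fp16(v) for v in fp32_out]
    print(compare_outputs(fp32_out, fp16_out))
```

A result with `non_finite` above zero points to an overflow somewhere in the graph and is the cue to move the offending layers onto the block list.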

FP8: more aggressive, more careful

FP8 is supported by the ONNX Runtime CUDA provider on Hopper and Ada cards. The savings are real (4× memory reduction, big throughput gain on compatible matmul ops), but the calibration is non-trivial — you typically need a representative dataset to determine activation scaling factors before conversion. For most retail trading workloads, this is more engineering effort than the speedup justifies. Stick to FP16.
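To make "activation scaling factors" concrete, here is a minimal sketch of the simplest calibration scheme: take the largest absolute activation seen across a representative dataset and map it onto the largest finite value of the FP8 E4M3 format (448). Production toolchains typically use more robust statistics such as percentiles; the function names and sample data here are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def calibrate_scale(batches) -> float:
    """Per-tensor scale from the max absolute activation over all batches."""
    amax = max(abs(v) for batch in batches for v in batch)
    if amax == 0.0:
        return 1.0  # all-zero activations: any scale works
    return amax / FP8_E4M3_MAX


def fits_in_fp8(value: float, scale: float) -> bool:
    """True if value / scale stays inside the E4M3 finite range."""
    return abs(value / scale) <= FP8_E4M3_MAX


if __name__ == "__main__":
    # Hypothetical activations recorded from two calibration batches.
    calibration = [[0.5, -3.5], [1.0]]
    scale = calibrate_scale(calibration)
    print("scale:", scale)
    print("3.5 fits:", fits_in_fp8(3.5, scale))
    print("4.0 fits:", fits_in_fp8(4.0, scale))
```

The last check shows why the dataset has to be representative: any live activation larger than what calibration saw falls outside the scaled range and gets clipped or overflows.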

If you're running a large transformer that genuinely needs FP8, look at NVIDIA's TensorRT-LLM or the Hugging Face Optimum library for the conversion tooling.

MQL5 side: nothing changes

From MQL5's perspective, an FP16 ONNX model is loaded and run identically to an FP32 one. The runtime handles the dtype internally. Your OnnxCreate, OnnxSetInputShape, and OnnxRun calls don't change. Use matrixf/vectorf on the MQL5 side — they're FP32, and MQL5 converts to FP16 at the runtime boundary if needed. (Or use FP16-native MQL5 types if your model accepts FP16 input directly.)