Build 5572 added ONNX_ENABLE_PROFILING — a flag that, when passed to OnnxCreate, makes the runtime dump a per-node execution trace to disk on session close. The output is a JSON file in the Chrome tracing format, with one entry per operator execution, tagged with timing, memory, and which execution provider ran it.
It's the closest thing MQL5 has to a real ML profiler. This article walks through how to enable it, where to find the output, and how to read the file to answer the three most useful questions: where is the time going, is GPU actually being used, and what should I optimize first?
Enabling profiling
Add ONNX_ENABLE_PROFILING to the flags argument of OnnxCreate or OnnxCreateFromBuffer, combining it with any flags you already pass using the bitwise OR operator (|).
The flag can be combined with anything else (logging levels, GPU device pinning, CPU-only force). Profiling adds a small overhead per OnnxRun call — not significant in practice, but enough that you shouldn't leave it on in production. Enable only during diagnosis.
Finding the output file
When the EA stops, via OnDeinit or when MT5 closes, the runtime writes the JSON file to the OnnxProfileReports subfolder inside the terminal data folder.
Open the data folder via File → Open Data Folder in MT5. The OnnxProfileReports subfolder is created automatically on the first profile dump. Each session of each EA produces one file, so files accumulate; clean them up periodically.
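Because the reports accumulate, it helps to pick out the latest one programmatically. The following is a minimal sketch, assuming the reports carry a .json extension and that you pass in the folder path yourself; `newest_profile` is a hypothetical helper name, not part of MT5.

```python
from pathlib import Path
from typing import Optional


def newest_profile(reports_dir: str) -> Optional[Path]:
    """Return the most recently modified .json file in reports_dir, or None.

    Assumption: profile reports use the .json extension. The folder path is
    supplied by the caller; nothing here knows where MT5 stores its data.
    """
    candidates = [p for p in Path(reports_dir).glob("*.json") if p.is_file()]
    if not candidates:
        return None
    # Modification time is a reasonable proxy for "latest session".
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Call it with the path to your OnnxProfileReports folder; it returns None when the folder holds no reports yet.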
JSON structure
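The file is a single JSON array. Each element is an event object in the Chrome trace event format. The fields this article relies on are dur (how long the event took) and args.provider (which execution provider ran the node).
Each operator in your graph appears at least once per OnnxRun call. If the EA ran 10 inferences before stopping, you'll see each node at least 10 times.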
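To make the structure concrete, here is a small sketch that parses a trace and keeps only the per-operator events. The sample is hypothetical and modeled on what ONNX Runtime's own profiler emits (a "cat" of "Node", plus "op_name" and "provider" under "args"); the file MT5 writes may name or nest fields differently, so check a real file before relying on these keys.

```python
import json

# Hypothetical sample, modeled on ONNX Runtime's profiler output.
SAMPLE = """
[
  {"cat": "Session", "name": "model_run", "ph": "X", "pid": 1, "tid": 1,
   "ts": 100, "dur": 950, "args": {}},
  {"cat": "Node", "name": "lstm_1_kernel_time", "ph": "X", "pid": 1, "tid": 1,
   "ts": 120, "dur": 700,
   "args": {"op_name": "LSTM", "provider": "CPUExecutionProvider"}}
]
"""


def node_events(events):
    """Keep only per-operator events (assumed: cat == "Node" with a dur)."""
    return [e for e in events if e.get("cat") == "Node" and "dur" in e]


events = json.loads(SAMPLE)
for e in node_events(events):
    args = e.get("args", {})
    # ts and dur are microseconds in the Chrome trace event format.
    print(e["name"], args.get("op_name"), args.get("provider"), e["dur"], "us")
```

For a real report, replace `json.loads(SAMPLE)` with `json.load(open(path))`.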
Question 1: where is time spent?
For each OnnxRun call, sum the dur field across all nodes; that approximates the total inference time. The Chrome trace event format records dur in microseconds. Then look at which individual nodes have the largest dur. The 80/20 rule applies: one or two nodes usually eat most of the time.
Common findings:
- One big LSTM/GRU node taking 60%+ of the time. Expected for sequence models. Reduce by shortening the sequence length, shrinking the hidden size, or switching from LSTM to GRU.
- Multiple small ops summing to 30%+. Often shape manipulation or constants. Use onnx-simplifier to fold them.
- A single matmul dominating. Expected if you have a wide dense layer. Reduce by shrinking the hidden dim or by quantizing.
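The summing step can be sketched as follows. It totals dur per node name across all runs and reports each node's share of the total. The event keys ("cat", "name", "dur") follow ONNX Runtime's profiler layout and are an assumption about MT5's file; `time_by_node` is a hypothetical helper. If the file also contains extra per-node bookkeeping events, filter those out first or the totals will be slightly inflated.

```python
from collections import defaultdict


def time_by_node(events):
    """Return (name, total_us, share) per node, sorted by total time, descending."""
    totals = defaultdict(int)
    for e in events:
        # Assumed layout: operator events carry cat == "Node" and a dur in microseconds.
        if e.get("cat") == "Node" and "dur" in e:
            totals[e["name"]] += e["dur"]
    grand_total = sum(totals.values())
    rows = [
        (name, total, total / grand_total if grand_total else 0.0)
        for name, total in totals.items()
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Hypothetical events from two OnnxRun calls.
events = [
    {"cat": "Node", "name": "lstm", "dur": 700},
    {"cat": "Node", "name": "reshape", "dur": 40},
    {"cat": "Node", "name": "matmul", "dur": 270},
    {"cat": "Node", "name": "lstm", "dur": 680},
    {"cat": "Node", "name": "reshape", "dur": 40},
    {"cat": "Node", "name": "matmul", "dur": 270},
]

rows = time_by_node(events)
for name, total, share in rows:
    print(f"{name:12s} {total:6d} us {share:6.1%}")
```

Dividing each total by the number of runs gives the average per-inference cost of a node.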
Question 2: GPU or CPU per node?
The args.provider field shows which execution provider ran each node. Filter by "provider":"CUDAExecutionProvider" vs "provider":"CPUExecutionProvider".
- All CUDA: clean GPU execution. Ideal.
- Mix: partial fallback. Each hand-off between a CPU node and a GPU node forces a Memcpy; see the article on Memcpy nodes.
- All CPU: total fallback. The GPU isn't doing anything. Diagnose by following the article on verifying that CUDA is used.
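The three cases above can be checked mechanically. This sketch counts node events per provider and labels the result. It assumes the provider string lives at args.provider as described, and that operator events carry "cat": "Node" as in ONNX Runtime's profiler; `provider_summary` and `classify` are hypothetical helper names.

```python
from collections import Counter


def provider_summary(events):
    """Count node events per execution provider."""
    counts = Counter()
    for e in events:
        if e.get("cat") != "Node":
            continue  # skip session-level events
        provider = e.get("args", {}).get("provider")
        if provider:
            counts[provider] += 1
    return counts


def classify(counts):
    """Map provider counts onto the three cases: all CUDA, mix, all CPU."""
    has_cuda = counts.get("CUDAExecutionProvider", 0) > 0
    has_cpu = counts.get("CPUExecutionProvider", 0) > 0
    if has_cuda and has_cpu:
        return "mix"
    if has_cuda:
        return "all CUDA"
    if has_cpu:
        return "all CPU"
    return "unknown"


# Hypothetical events: one GPU node and one CPU fallback.
events = [
    {"cat": "Node", "name": "conv", "dur": 120,
     "args": {"provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "topk", "dur": 30,
     "args": {"provider": "CPUExecutionProvider"}},
]

counts = provider_summary(events)
print(dict(counts), "->", classify(counts))
```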
Question 3: what to optimize first
Rank nodes by dur, descending. The top 3 entries are your optimization budget; everything else combined is usually close to rounding error. For each top node:
- If it's on CPU but is a normal op (matmul, conv): most likely the CUDA provider doesn't support its dtype or shape. Re-export with simpler types.
- If it's on GPU but slow: the model architecture itself is the bottleneck. Smaller dimensions, fewer layers, different architecture.
- If it's a Memcpy node: upstream/downstream nodes are mismatched between CPU/GPU. Simplify the graph.
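The ranking and the three rules above can be combined into one pass. This is a sketch under the same assumptions as before: the keys "cat", "dur", args.op_name and args.provider follow ONNX Runtime's profiler layout, and copy nodes are assumed to have an op name starting with "Memcpy". `top_nodes` and `hint` are hypothetical helpers.

```python
def top_nodes(events, n=3):
    """Return the n node names with the largest summed dur, with op and provider."""
    totals = {}
    for e in events:
        if e.get("cat") != "Node" or "dur" not in e:
            continue
        args = e.get("args", {})
        entry = totals.setdefault(
            e["name"],
            {"dur": 0, "op": args.get("op_name", ""), "provider": args.get("provider", "")},
        )
        entry["dur"] += e["dur"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["dur"], reverse=True)
    return ranked[:n]


def hint(op, provider):
    """Map a node onto the three rules from the list above."""
    if op.startswith("Memcpy"):
        return "memcpy: CPU/GPU mismatch around this node, simplify the graph"
    if provider == "CPUExecutionProvider":
        return "on CPU: check dtype/shape support, re-export with simpler types"
    if provider == "CUDAExecutionProvider":
        return "on GPU but slow: architecture is the bottleneck"
    return "unknown provider: inspect manually"


# Hypothetical events from a single run.
events = [
    {"cat": "Node", "name": "lstm", "dur": 700,
     "args": {"op_name": "LSTM", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "Memcpy_0", "dur": 300,
     "args": {"op_name": "MemcpyToHost", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "matmul", "dur": 200,
     "args": {"op_name": "MatMul", "provider": "CPUExecutionProvider"}},
    {"cat": "Node", "name": "relu", "dur": 10,
     "args": {"op_name": "Relu", "provider": "CUDAExecutionProvider"}},
]

top = top_nodes(events)
for name, info in top:
    print(f"{name:10s} {info['dur']:5d} us  {hint(info['op'], info['provider'])}")
```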
Viewing in chrome://tracing
Reading the raw JSON is fine for small models. For a visual timeline view:
- Open Chrome (or any Chromium-based browser).
- Go to chrome://tracing in the URL bar.
- Click "Load" and select your .json file.
- Drag and zoom across the timeline. Each box is an event; click to inspect.
This is the same tool used to profile Chromium itself. The JSON format is identical, so MT5's output loads cleanly. Visual inspection makes it obvious where the time goes and which boxes are CUDA vs CPU.