Build 5572 added ONNX_ENABLE_PROFILING — a flag that, when passed to OnnxCreate, makes the runtime dump a per-node execution trace to disk on session close. The output is a JSON file in the Chrome tracing format, with one entry per operator execution, tagged with timing, memory, and which execution provider ran it.

It's the closest thing MQL5 has to a real ML profiler. This article walks through how to enable it, where to find the output, and how to read the file to answer the three most useful questions: where is the time going, is the GPU actually being used, and what should I optimize first?

Enabling profiling

Add ONNX_ENABLE_PROFILING to the flags in OnnxCreate or OnnxCreateFromBuffer:

enable profiling at session creation
ExtHandle = OnnxCreateFromBuffer( ExtModel, ONNX_DEFAULT | ONNX_ENABLE_PROFILING );

The flag can be combined with anything else (logging levels, GPU device pinning, forcing CPU-only execution). Profiling adds a small overhead to every OnnxRun call. It is too small to distort the measurements, but there is no reason to pay it in production, so enable the flag only while diagnosing.

Finding the output file

When the EA stops — via OnDeinit or when MT5 closes — the runtime writes the JSON to:

file location
<Terminal Data Folder>\MQL5\Files\OnnxProfileReports\<EA name>_<date>_<time>.json

Open the data folder via File → Open Data Folder in MT5. The OnnxProfileReports subfolder is created automatically on first profile dump. Each session of each EA produces one file — they accumulate, so clean up periodically.
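Because the files accumulate, it helps to pick the latest one automatically. The following is a minimal Python sketch, assuming the folder layout described above; the function name newest_report is illustrative, and data_folder is whatever path File → Open Data Folder opens on your machine.

```python
from pathlib import Path

def newest_report(data_folder):
    """Return the most recently modified .json report, or None if none exist."""
    report_dir = Path(data_folder) / "MQL5" / "Files" / "OnnxProfileReports"
    if not report_dir.is_dir():
        return None  # no profile has been dumped yet
    reports = list(report_dir.glob("*.json"))
    if not reports:
        return None
    # Modification time is used instead of parsing the file name,
    # since the exact <date>_<time> format is not assumed here.
    return max(reports, key=lambda p: p.stat().st_mtime)
```

Sorting by modification time avoids depending on the exact date and time format in the file name.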

JSON structure

The file is a single JSON array. Each element is an event:

one profile event (simplified)
{
  "cat": "Node",
  "name": "LSTM_0",
  "ts": 1027,   // timestamp in microseconds since start
  "dur": 423,   // duration in microseconds
  "args": {
    "provider": "CUDAExecutionProvider",
    "op_name": "LSTM"
  }
}

Each operator in your graph appears at least once per OnnxRun call. If the EA ran 10 inferences before stopping, you'll see each node 10 times.
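A quick way to check this is to load the report and count how often each node name occurs. The sketch below is in Python and assumes only the fields shown in the simplified event above; the sample data is made up, and the "Session" entry is a hypothetical non-node event included to show the filter. A real report may carry extra fields and event categories.

```python
import json
from collections import Counter

def load_node_events(text):
    """Parse a profile report and keep only the per-operator ("Node") events."""
    events = json.loads(text)
    return [ev for ev in events if isinstance(ev, dict) and ev.get("cat") == "Node"]

def executions_per_node(node_events):
    """Count how many times each node name appears in the report."""
    return Counter(ev["name"] for ev in node_events)

# Made-up report covering two OnnxRun calls (no // comments: the file is plain JSON).
sample = """[
  {"cat": "Session", "name": "hypothetical_non_node_event", "ts": 0, "dur": 5},
  {"cat": "Node", "name": "LSTM_0", "ts": 1027, "dur": 423,
   "args": {"provider": "CUDAExecutionProvider", "op_name": "LSTM"}},
  {"cat": "Node", "name": "Gemm_1", "ts": 1460, "dur": 50,
   "args": {"provider": "CUDAExecutionProvider", "op_name": "Gemm"}},
  {"cat": "Node", "name": "LSTM_0", "ts": 2100, "dur": 400,
   "args": {"provider": "CUDAExecutionProvider", "op_name": "LSTM"}},
  {"cat": "Node", "name": "Gemm_1", "ts": 2510, "dur": 60,
   "args": {"provider": "CUDAExecutionProvider", "op_name": "Gemm"}}
]"""

counts = executions_per_node(load_node_events(sample))
for name, n in counts.items():
    print(name, n)
```

If every node shows the same count, that count is the number of inferences the session ran.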

Question 1: where is time spent?

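For each OnnxRun call, sum the dur field across all nodes. That gives the time spent executing operators, which is a good approximation of the total inference time as long as nodes run one after another. Then look at which individual nodes have the largest dur. The 80/20 rule usually applies: one or two nodes eat most of the time.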
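The summing and ranking can be sketched in a few lines of Python. This assumes the event fields from the simplified example (cat, name, dur) and adds up repeated executions of the same node across all OnnxRun calls in the file; rank_nodes and the sample numbers are illustrative.

```python
from collections import defaultdict

def rank_nodes(events):
    """Return (name, total_dur_us, share_of_total) tuples, slowest first."""
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") == "Node":
            totals[ev["name"]] += ev.get("dur", 0)
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, dur, dur / grand_total if grand_total else 0.0)
            for name, dur in ranked]

# Made-up events from two inference calls.
events = [
    {"cat": "Node", "name": "LSTM_0", "dur": 423},
    {"cat": "Node", "name": "Gemm_1", "dur": 50},
    {"cat": "Node", "name": "Softmax_2", "dur": 10},
    {"cat": "Node", "name": "LSTM_0", "dur": 400},
    {"cat": "Node", "name": "Gemm_1", "dur": 60},
    {"cat": "Node", "name": "Softmax_2", "dur": 12},
]

for name, dur, share in rank_nodes(events):
    print(f"{name:12s} {dur:6d} us  {share:6.1%}")
```

The share column makes the 80/20 pattern visible at a glance: in this made-up data the LSTM node accounts for most of the total.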

Common findings:

Question 2: GPU or CPU per node?

The args.provider field shows which execution provider ran each node. Filter by "provider":"CUDAExecutionProvider" vs "provider":"CPUExecutionProvider".
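The same filter can be applied programmatically. The Python sketch below totals dur per execution provider, assuming the args.provider field from the simplified event; time_by_provider and the sample events are illustrative, and events without a provider are grouped under "unknown".

```python
from collections import defaultdict

def time_by_provider(events):
    """Sum node durations (microseconds) per execution provider."""
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") != "Node":
            continue
        provider = ev.get("args", {}).get("provider", "unknown")
        totals[provider] += ev.get("dur", 0)
    return dict(totals)

# Made-up events: two nodes on the GPU, one that fell back to the CPU.
events = [
    {"cat": "Node", "name": "LSTM_0", "dur": 423,
     "args": {"provider": "CUDAExecutionProvider", "op_name": "LSTM"}},
    {"cat": "Node", "name": "Gemm_1", "dur": 50,
     "args": {"provider": "CUDAExecutionProvider", "op_name": "Gemm"}},
    {"cat": "Node", "name": "Softmax_2", "dur": 10,
     "args": {"provider": "CPUExecutionProvider", "op_name": "Softmax"}},
]

for provider, dur in time_by_provider(events).items():
    print(provider, dur, "us")
```

If the CUDA total is zero or missing, the GPU was not used for this session at all.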

Question 3: what to optimize first

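Rank nodes by dur, descending, adding up repeated executions of the same node across OnnxRun calls. The top three entries are where to spend your optimization effort; in most models everything else combined is close to rounding error. For each top node: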

  1. If it's on CPU but is a normal op (matmul, conv): the CUDA provider most likely doesn't support its dtype or shape. Re-export with simpler types.
  2. If it's on GPU but slow: the model architecture itself is the bottleneck. Try smaller dimensions, fewer layers, or a different architecture.
  3. If it's a Memcpy node: upstream and downstream nodes are split between CPU and GPU, so data has to be copied across. Simplify the graph so neighbouring nodes run on the same provider.
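The three rules above can be sketched as a small triage function in Python. This is a rough heuristic under stated assumptions, not a definitive classifier: it assumes the args.provider and args.op_name fields from the simplified event, that copy nodes have an op_name or name starting with "Memcpy", and that COMMON_OPS (an illustrative, non-exhaustive set) stands in for "normal ops".

```python
COMMON_OPS = {"MatMul", "Gemm", "Conv"}  # illustrative, not exhaustive

ADVICE = {
    "memcpy": "CPU/GPU mismatch around this node; simplify the graph",
    "cpu-fallback": "common op on CPU; try re-exporting with simpler types",
    "gpu-slow": "on GPU already; the architecture is the bottleneck",
    "other": "no rule matched; inspect manually",
}

def triage(event):
    """Map one top-ranked node event to one of the rule tags above."""
    args = event.get("args", {})
    op = args.get("op_name", "")
    provider = args.get("provider", "")
    # Rule 3 is checked first because a copy node can be reported under any provider.
    if op.startswith("Memcpy") or event.get("name", "").startswith("Memcpy"):
        return "memcpy"
    if provider == "CPUExecutionProvider" and op in COMMON_OPS:
        return "cpu-fallback"
    if provider == "CUDAExecutionProvider":
        return "gpu-slow"
    return "other"

# Made-up top node: a matrix multiply that ended up on the CPU.
top = {"cat": "Node", "name": "MatMul_3", "dur": 900,
       "args": {"provider": "CPUExecutionProvider", "op_name": "MatMul"}}
print(top["name"], "->", ADVICE[triage(top)])
```

Run it only on the few slowest nodes; applying it to every event mostly produces noise.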

Viewing in chrome://tracing

Reading the raw JSON is fine for small models. For a visual timeline view:

  1. Open Chrome (or any Chromium-based browser).
  2. Go to chrome://tracing in the URL bar.
  3. Click "Load" and select your .json file.
  4. Drag and zoom across the timeline. Each box is an event; click to inspect.

This is the same tool used to profile Chromium itself. The JSON format is identical, so MT5's output loads cleanly. Visual inspection makes it obvious where the time goes and which boxes are CUDA vs CPU.