"Memcpy nodes added to graph for CUDAExecutionProvider"

If you enabled verbose logging during ONNX session creation, you may have seen something like this:

MT5 Experts log — verbose ONNX

[WARNING] 4 Memcpy nodes are added to the graph main for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.

It's a warning, not an error — your model still loads and runs. But the warning is the runtime telling you something specific: part of your computation graph is being executed on the CPU while the rest runs on the GPU, and data has to ferry back and forth across the PCIe bus on every inference. That ferrying is the "Memcpy node" being mentioned.

Memcpy nodes are not always avoidable, and a small number of them is fine. But if you have many — or just one in a hot inner loop — they can erase the GPU speedup you went through the trouble of enabling. This article explains why they get added, how to count them, and how to reduce them.

What a Memcpy node actually is

ONNX Runtime represents your model as a directed graph of operators. Normally, when you've selected the CUDA execution provider, every node runs on the GPU. But not every operator is implemented for every execution provider — the CUDA provider doesn't have a GPU kernel for every possible op, and some ops only make sense on CPU (shape manipulation, certain int64 operations, etc.).

When the runtime finds an op that has to run on CPU even though its surrounding context is GPU, it inserts a Memcpy node — literally a memory copy — to move the relevant tensor from GPU memory to CPU memory before the CPU op, and another Memcpy to move the result back to GPU before the next GPU op.

conceptually

// What you exported:
[GPU op A] → [CPU op B] → [GPU op C]

// What ONNX Runtime executes:
[GPU op A] → [Memcpy G→C] → [CPU op B] → [Memcpy C→G] → [GPU op C]

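Each Memcpy is a one-way copy across the PCIe bus, so a CPU op sandwiched between GPU ops costs a full round-trip. A copy takes on the order of microseconds for small tensors and up to milliseconds for very large ones, and the GPU work that depends on it has to wait. The warning is the runtime telling you "I had to insert these; you might want to know."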
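The insertion rule in the diagram above can be sketched as a toy function. This is only a model of the idea, not ONNX Runtime's actual partitioning code: the `insert_memcpy` helper and the `(name, device)` node format are made up for illustration, and the `MemcpyToHost` / `MemcpyFromHost` labels are used here simply to distinguish the two copy directions.

```python
from typing import List, Tuple


def insert_memcpy(nodes: List[Tuple[str, str]]) -> List[str]:
    """Return the execution order with a copy node at every device boundary.

    nodes: (op_name, device) pairs in execution order; device is "gpu" or "cpu".
    Hypothetical helper for illustration only.
    """
    executed: List[str] = []
    prev_device = None
    for name, device in nodes:
        if prev_device is not None and device != prev_device:
            # Entering a CPU op: copy GPU memory to host memory first.
            # Entering a GPU op: copy host memory back to the GPU first.
            executed.append("MemcpyToHost" if device == "cpu" else "MemcpyFromHost")
        executed.append(name)
        prev_device = device
    return executed


exported = [("A", "gpu"), ("B", "cpu"), ("C", "gpu")]
plan = insert_memcpy(exported)
print(" -> ".join(plan))  # A -> MemcpyToHost -> B -> MemcpyFromHost -> C
```

One CPU op in the middle of a GPU graph produces two copies, which is why a single unsupported operator shows up as a pair in the warning count.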

Why ONNX Runtime adds them

Common reasons your model has Memcpy-requiring ops:

Shape manipulation operators (Shape, Slice, Concat with dynamic dimensions). These often run on CPU because they reason about tensor metadata, not data. Common after an LSTM or Gather output.
Int64 arithmetic. Many CUDA kernels are float-only, so int64 operations often get pushed to CPU.
Control-flow operators (Loop, If). Sometimes only partly supported on CUDA depending on what's inside them.
Subgraphs with mixed-dtype operations. A graph that mixes float16, float32, and int32 in ways the CUDA provider doesn't fully cover.

None of these are bugs in your code — they're consequences of how the model was exported and what the runtime supports on each device.
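Before profiling, it can help to see how many nodes in the exported graph fall into the categories listed above. The following is a rough sketch: `WATCH_LIST` and `tally_suspects` are hypothetical names, the watch list simply repeats the op types named in this section, and a match only means "worth checking", not "will run on CPU".

```python
from collections import Counter
from typing import Dict, Iterable

# Op types this section names as frequent CPU fallbacks (a watch list, not a rule).
WATCH_LIST = {"Shape", "Slice", "Concat", "Loop", "If"}


def tally_suspects(op_types: Iterable[str]) -> Dict[str, int]:
    """Count how often each watch-listed op type appears in the graph."""
    counts = Counter(op_types)
    return {op: n for op, n in counts.items() if op in WATCH_LIST}


# With the onnx package installed, the op types would come from the model file:
#   import onnx
#   op_types = [n.op_type for n in onnx.load("eurusd.onnx").graph.node]
# A hand-written list stands in for a real model here.
op_types = ["LSTM", "Shape", "Gather", "Shape", "MatMul", "Add"]
print(tally_suspects(op_types))  # {'Shape': 2}
```

The tally says nothing about dtypes, so int64 and mixed-precision fallbacks still need the profiling step described later.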

When it actually hurts performance

The honest answer: it depends entirely on where the Memcpy node sits.

Low impact (don't worry about it)

1–2 Memcpy nodes near the model's boundary (input or output). These run once per inference. A few extra microseconds, lost in the noise.
Memcpy nodes on small tensors (shapes, scalar metadata). The data being copied is tiny.

High impact (worth fixing)

Memcpy nodes inside a loop or recurrent unit. If you have an LSTM with 120 timesteps and there's a Memcpy node inside the unrolled loop, that's 120 transfers per inference.
Memcpy nodes on large activation tensors. A (batch=1, seq=120, hidden=256) float32 tensor is 122,880 bytes (120 KiB); moving it across PCIe on every inference, or on every timestep, adds up.
Memcpy patterns that block CUDA Graph capture. ONNX Runtime can capture a GPU-only inference as a CUDA Graph and replay the whole kernel sequence with a single launch. Memcpy nodes break that, and the warning specifically calls this out.
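To put rough numbers on the cases above, here is a back-of-the-envelope estimate. The tensor size is exact arithmetic; the per-copy latency (10 µs) and effective PCIe bandwidth (12 GB/s) are illustrative assumptions, not measurements, so substitute figures from your own profile.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor in bytes (float32 by default)."""
    return prod(shape) * bytes_per_element


def copy_time_us(num_bytes, latency_us=10.0, bandwidth_gb_s=12.0):
    """Estimated time for one copy: fixed latency plus transfer time.

    Both defaults are assumed values for illustration only.
    """
    return latency_us + num_bytes / (bandwidth_gb_s * 1e9) * 1e6


size = tensor_bytes((1, 120, 256))  # 1 * 120 * 256 * 4 = 122880 bytes
per_copy = copy_time_us(size)       # about 20 µs under the assumptions above
per_inference_ms = 120 * per_copy / 1000.0

print(size, round(per_copy, 2), round(per_inference_ms, 2))
```

Under these assumptions a single copy is negligible, but the same copy repeated 120 times costs a couple of milliseconds per inference, which can exceed the compute time of a small model.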

Counting and locating them

The fastest way: enable profiling and read the trace.

enable profiling in OnnxCreate

ExtHandle = OnnxCreateFromBuffer( ExtModel, ONNX_DEFAULT | ONNX_ENABLE_PROFILING );

Run the EA for a minute, stop it, and check MQL5\Files\OnnxProfileReports\. The JSON file lists every node execution, including the Memcpy nodes, with its timing and execution provider. Search the JSON for Memcpy to find them all.

The full profiling JSON walkthrough is in the profiling guide.
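The search can also be scripted. The sketch below assumes the report is a JSON array of trace events with `name`, `dur` (microseconds), and an `args` object carrying `op_name`; those field names and the sample events are assumptions, so check them against your own report before relying on the output. `summarize_memcpy` and `load_profile` are hypothetical helper names.

```python
import json


def summarize_memcpy(events):
    """Return {op_or_name: (event_count, total_duration_us)} for Memcpy events."""
    summary = {}
    for ev in events:
        op = (ev.get("args") or {}).get("op_name", "")
        name = ev.get("name", "")
        if "Memcpy" not in op and "Memcpy" not in name:
            continue
        key = op or name
        count, total_us = summary.get(key, (0, 0))
        summary[key] = (count + 1, total_us + ev.get("dur", 0))
    return summary


def load_profile(path):
    """Read a profiling report from disk (path is whatever file you found)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hand-written sample events standing in for a real report.
sample = [
    {"cat": "Node", "name": "Memcpy_token_1_kernel_time", "dur": 35,
     "args": {"op_name": "MemcpyToHost", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "MatMul_3_kernel_time", "dur": 120,
     "args": {"op_name": "MatMul", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "Memcpy_token_2_kernel_time", "dur": 28,
     "args": {"op_name": "MemcpyFromHost", "provider": "CUDAExecutionProvider"}},
]
print(summarize_memcpy(sample))
```

If the report records several events per node execution, the counts reflect events rather than distinct nodes. The total duration per key is the more useful figure when deciding whether the copies matter.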

How to reduce the count

The Memcpy nodes are inserted by the runtime based on what's in your model graph. To reduce them, you change the graph — which means changing what you export from Python.

Strategy 1: Simplify the input/output boundary

If your model takes raw bar data, normalizes inside the graph, and produces softmax probabilities, you have lots of small ops near the boundaries that may run on CPU. Move the normalization out of the model (do it in MQL5 before OnnxRun). Move the softmax out of the model and compute argmax in MQL5 after OnnxRun; softmax preserves ordering, so the argmax of the raw logits picks the same class. The graph becomes pure float32 matmuls, which is clean GPU territory.
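Dropping the softmax is safe when all you need is the predicted class, because softmax is monotonic. The short check below illustrates this with made-up logits; the same comparison carries over directly to MQL5 on the output array.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 2.5]  # made-up model outputs for three classes
assert argmax(logits) == argmax(softmax(logits))
print(argmax(logits))  # 2
```

If the EA needs actual probabilities, for example for a confidence threshold, the softmax can still be computed in MQL5 on the handful of output values.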

Strategy 2: Fix the dtype mismatches

Export with consistent dtypes. If your training code mixes float16 and float32, the export inherits the mixing — and the CUDA provider may push some ops to CPU. Standardize on float32 unless you're doing intentional FP16 inference (see FP16 inference in MQL5).

Strategy 3: Set input shapes statically where possible

Dynamic shapes force shape-manipulation ops at runtime — ops that often run on CPU. If your batch is always 1 and your sequence is always 120, fix them as constants in the exported model. OnnxSetInputShape in MQL5 then has no work to do, and the graph has fewer dynamic-shape ops.

Strategy 4: Run constant folding on the exported model

Use onnx-simplifier as a post-export step:

simplify the graph

pip install onnx-simplifier
python -m onnxsim eurusd.onnx eurusd.simplified.onnx

It eliminates constant subgraphs, merges redundant ops, and often reduces the number of CPU-only nodes in the graph as a side effect. Compare the Memcpy count before and after.

Strategy 5: Accept it and run on CPU

If you've tried all the above and still have many Memcpy nodes — particularly if profiling shows that CPU+GPU mixed mode is barely faster than pure CPU — the answer is just to use CPU. Set ONNX_USE_CPU_ONLY, drop the GPU, lose the Memcpy overhead, and ship. Pure CPU is often within 30% of fragmented GPU for retail-scale models, and it eliminates the entire class of problems.

The decision rule we recommend in "verify CUDA is used" is to benchmark both. If GPU isn't at least 2× faster than CPU on your actual workload, the operational complexity isn't worth it.
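The 2× rule is easy to encode once you have timings for both modes. A minimal sketch, assuming you can wrap one inference call in a Python callable (or paste in timings measured in the terminal); `median_ms` and `gpu_is_worth_it` are hypothetical helper names and the example timings are invented.

```python
import statistics
import time


def median_ms(run, repeats=50, warmup=5):
    """Median wall-clock time of run() in milliseconds, after a warm-up."""
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def gpu_is_worth_it(cpu_ms, gpu_ms, min_speedup=2.0):
    """Apply the rule: keep the GPU only if it is at least min_speedup faster."""
    return gpu_ms > 0 and cpu_ms / gpu_ms >= min_speedup


# Invented timings, for illustration only.
print(gpu_is_worth_it(cpu_ms=4.0, gpu_ms=3.1))  # False
print(gpu_is_worth_it(cpu_ms=4.0, gpu_ms=1.5))  # True
```

The median is used rather than the mean so that an occasional slow run does not skew the comparison.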

Summary

"N Memcpy nodes added" is a warning, not an error. The model runs.
Each Memcpy is a CPU ↔ GPU copy inserted by the runtime when an op can't run on the selected device.
A few at the boundary: harmless. Many inside a loop: meaningful overhead.
Profile to count and locate them. Simplify the graph at export time to reduce them.
If the overhead negates the GPU win, switch to CPU and stop fighting it.