🔧 Fast Memory Copying in C#/.NET (Cache, AVX, Threads, Unsafe, Alternatives)


🔗 Source: dev.to

Numerous copy routine implementations are readily available in .NET. If I were to simply list them alongside a few benchmark numbers and charts, it wouldn't make for a very interesting article.

⚠️ What if I told you upfront that none of these routines is designed to be the absolute fastest?

If you're interested in a basic comparison, I recommend checking out this article here on dev.to: What is the best way to copy an array? or a more detailed, older comparison: High performance memcpy gotchas in C#.

Here, I'll outline a list of options, but this is far from the whole story:

  • A simple for loop (hint: foreach is usually a bit faster)
  • Array.Copy
  • Span.CopyTo
  • Buffer.BlockCopy
  • Buffer.MemoryCopy
  • Marshal.Copy
  • Unsafe.CopyBlock
  • Imported memcpy

If you're currently struggling with slow array or memory copy operations, try one of the functions on this list.
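
For orientation, here is a minimal sketch of how most of the options above look in code. The class and method names (`CopyOptions`, `ForLoop`, and so on) are hypothetical and exist only for this illustration. The unsafe methods assume the project is compiled with `AllowUnsafeBlocks` enabled, and an imported `memcpy` is left out because its declaration is platform specific.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical helper class: each method copies all of src into dst.
// dst must be at least as long as src.
public static class CopyOptions
{
    public static void ForLoop(byte[] src, byte[] dst)
    {
        for (int i = 0; i < src.Length; i++)
            dst[i] = src[i];
    }

    public static void ArrayCopy(byte[] src, byte[] dst) =>
        Array.Copy(src, dst, src.Length);

    public static void SpanCopy(byte[] src, byte[] dst) =>
        src.AsSpan().CopyTo(dst);

    // Offsets and count are given in bytes, not in elements.
    public static void BlockCopy(byte[] src, byte[] dst) =>
        Buffer.BlockCopy(src, 0, dst, 0, src.Length);

    public static unsafe void MemoryCopy(byte[] src, byte[] dst)
    {
        // Pin both arrays so the GC cannot move them during the copy.
        fixed (byte* s = src, d = dst)
            Buffer.MemoryCopy(s, d, dst.Length, src.Length);
    }

    public static unsafe void MarshalCopy(byte[] src, byte[] dst)
    {
        if (dst.Length < src.Length) throw new ArgumentException("dst is too small");
        if (src.Length == 0) return;
        // Marshal.Copy targets unmanaged memory; here it gets a pinned pointer.
        fixed (byte* d = dst)
            Marshal.Copy(src, 0, (IntPtr)d, src.Length);
    }

    public static void UnsafeCopyBlock(byte[] src, byte[] dst)
    {
        if (dst.Length < src.Length) throw new ArgumentException("dst is too small");
        if (src.Length == 0) return;
        // No bounds checks are performed here, hence the manual checks above.
        Unsafe.CopyBlock(ref dst[0], ref src[0], (uint)src.Length);
    }
}
```

All of these produce the same result; they differ in safety checks, in the memory they can address (managed arrays versus raw pointers) and in call overhead.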

However, there are some elephants in the room, and I plan to uncover a few of them.

Elephant No.1 - Cache Pollution

What you might read in many places is that framework functions are already highly optimized. This is true, but they are not necessarily optimized for the highest possible speed.

The built-in functions provided by .NET are optimized in various ways. One significant consideration is preventing cache pollution. Wait - what? Yes, x86 CPUs achieve their speed primarily thanks to their cache. If the cache is disabled or not utilized properly, code execution speed drops dramatically.

Cache Pollution Explained

  • Cache pollution when copying large data blocks: This occurs when frequently used data gets evicted from the CPU cache by the data being copied. Chances are that once you are done copying, you won't touch that data ever again, so there is no benefit in keeping it in the CPU cache.
    For example, imagine a network stack - once data is sent, it's unlikely to be touched again. Similarly, when loading textures with the CPU for GPU usage, caching may be unnecessary.

  • Standard .NET functions mitigate this issue when copying larger blocks (usually sizes above 1MB) by using non-temporal access, bypassing the cache.

  • This might not suit your use case. For instance, if you are waiting for the copy to complete before doing anything else, some cache pollution is an acceptable tradeoff, especially when the data is likely to be used again soon, as in simulations or CPU rendering.

▶️ How can we observe this? And what can we do about it?
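
One way to observe the effect is to time the same copy call for different buffer sizes. The following is a minimal sketch, not the benchmark used for the charts in this article: `CopyBenchmark` and `MeasureGBps` are hypothetical names, the 64-byte alignment is an assumption, and the code assumes .NET 6 or later (for `NativeMemory.AlignedAlloc`) with `AllowUnsafeBlocks` enabled.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Hypothetical measurement helper for illustration only.
public static unsafe class CopyBenchmark
{
    // Returns the throughput in GB/s (1 GB = 1e9 bytes) of copying
    // `bytes` bytes `iterations` times, one Buffer.MemoryCopy call per copy.
    public static double MeasureGBps(int bytes, int iterations)
    {
        if (bytes <= 0) throw new ArgumentOutOfRangeException(nameof(bytes));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        byte* src = (byte*)NativeMemory.AlignedAlloc((nuint)bytes, 64);
        byte* dst = (byte*)NativeMemory.AlignedAlloc((nuint)bytes, 64);
        try
        {
            // Touch every page first so page faults are not part of the timing.
            new Span<byte>(src, bytes).Fill(1);
            new Span<byte>(dst, bytes).Clear();

            // One untimed warm-up copy.
            Buffer.MemoryCopy(src, dst, bytes, bytes);

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                Buffer.MemoryCopy(src, dst, bytes, bytes);
            sw.Stop();

            return (double)bytes * iterations / sw.Elapsed.TotalSeconds / 1e9;
        }
        finally
        {
            NativeMemory.AlignedFree(src);
            NativeMemory.AlignedFree(dst);
        }
    }
}
```

Calling `CopyBenchmark.MeasureGBps` for sizes such as 1MB, 8MB and 100MB gives a rough picture of how throughput changes with buffer size. For numbers worth publishing, a dedicated benchmarking tool is the safer choice, because a simple stopwatch loop is sensitive to clock frequency changes and background load.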

[Chart: copy throughput by buffer size (1MB to 100MB) and copy method]

The chart above compares copy methods across buffer sizes ranging from 1MB to 100MB. The leftmost data points, for a 1MB block, show great performance for Buffer.MemoryCopy and Unsafe.CopyBlock, the two best methods available in .NET for memory copying. However, performance falls sharply past 1MB.

To illustrate the real hardware limits, compare the AVX-based copy routines using 256-bit (orange) and 512-bit (red) vectors, and note the difference between the regular 512-bit variant and its non-temporal counterpart (light blue). These tests were run on a Ryzen 9 9950X, with 48kB of L1 data cache, 1024kB of L2 and 32MB of L3 available to any single CPU core. Total cache sizes are 1280kB of L1 (16x 32kB + 16x 48kB), 16MB of L2 (16x 1024kB) and 64MB of L3 (2x 32MB).

With cached AVX loads, high copy speed is sustained even for blocks of 8MB and larger. However, when the block is large relative to the cache size (e.g. 100MB), the built-in functions regain their advantage.
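
To make the cached versus non-temporal distinction concrete, here is a minimal sketch using 256-bit vectors. It is not the routine measured above: `AvxCopy` is a hypothetical name, the non-temporal variant is assumed to use non-temporal stores, the byte count is assumed to be a multiple of 32, and the destination of the non-temporal variant is assumed to be 32-byte aligned. It needs `AllowUnsafeBlocks` and a CPU with AVX.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration of cached vs. non-temporal vector stores.
public static unsafe class AvxCopy
{
    private static void Check(nuint bytes)
    {
        if (!Avx.IsSupported) throw new PlatformNotSupportedException("AVX is required");
        if (bytes % 32 != 0) throw new ArgumentException("bytes must be a multiple of 32");
    }

    // Regular stores: the written data ends up in the CPU cache.
    public static void Cached(byte* src, byte* dst, nuint bytes)
    {
        Check(bytes);
        for (nuint i = 0; i < bytes; i += 32)
            Avx.Store(dst + i, Avx.LoadVector256(src + i));
    }

    // Non-temporal stores: the written data bypasses the cache.
    // dst must be 32-byte aligned.
    public static void NonTemporal(byte* src, byte* dst, nuint bytes)
    {
        Check(bytes);
        for (nuint i = 0; i < bytes; i += 32)
            Avx.StoreAlignedNonTemporal(dst + i, Avx.LoadVector256(src + i));

        // Make the non-temporal stores visible before the caller continues.
        Sse.StoreFence();
    }
}
```

A 512-bit version would follow the same pattern with wider vectors. Which variant wins depends on whether the destination is read again soon: if it is, the cached stores keep it close to the core; if not, the non-temporal stores leave the cache to other data.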

Elephant No.2 - Overhead

My first benchmark shows that splitting larger buffers into smaller chunks can improve performance at the expense of increased CPU cache utilization.

There are other factors causing slowdowns, for example:

  • Managed memory introduces framework checks for each access.
  • Unaligned memory access can reduce CPU cache efficiency.
  • And finally, even with unmanaged, 64-byte aligned memory, one question remains:

▶️ What is the optimal block size?

This depends on the actual CPU, but we are trying to strike a balance between call overhead and efficient CPU resource utilization.

[Chart: throughput for copying a 32MB buffer, by block size and method]

The chart above shows the throughput for transferring a 32MB buffer using various block sizes and methods. The key takeaway is that Buffer.MemoryCopy (blue) and Unsafe.CopyBlock (yellow) perform best with block sizes between 8kB and 1MB. Notably, in that range the chunked variants outperform a single call of the same method over the whole 32MB buffer (orange).
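
The chunked approach can be sketched as a loop that hands one block at a time to Buffer.MemoryCopy. This is a minimal illustration under assumptions, not the benchmarked code: `ChunkedCopy` is a hypothetical name, the block size is left to the caller, and the code needs `AllowUnsafeBlocks`.

```csharp
using System;

// Hypothetical helper: copies a large buffer as a sequence of smaller blocks.
public static unsafe class ChunkedCopy
{
    public static void Copy(byte* src, byte* dst, nuint bytes, nuint blockSize)
    {
        if (blockSize == 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        nuint offset = 0;
        while (offset < bytes)
        {
            // The last block may be shorter than blockSize.
            nuint remaining = bytes - offset;
            nuint n = remaining < blockSize ? remaining : blockSize;

            Buffer.MemoryCopy(src + offset, dst + offset, (long)n, (long)n);
            offset += n;
        }
    }
}
```

A call such as `ChunkedCopy.Copy(src, dst, 32 * 1024 * 1024, 128 * 1024)` copies a 32MB buffer in 128kB blocks. The best block size depends on the CPU, so it is worth measuring a few values in the 8kB to 1MB range on the target machine.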

Methods represented by horizontal lines do not use variable block sizes. It is worth mentioning that the AVX variants always load 8 vectors (of 256 or 512 bits) before storing them, thus effectively working with 256-byte and 512-byte blocks regardless of the total buffer size.
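
The "load 8 vectors, then store 8 vectors" pattern might look like the following for the 256-bit case. This is a sketch of the idea rather than the author's routine: `AvxUnrolledCopy` is a hypothetical name, the byte count is assumed to be a multiple of 256, and the code needs `AllowUnsafeBlocks` and a CPU with AVX.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration: 8 x 256-bit vectors = 256 bytes per iteration.
public static unsafe class AvxUnrolledCopy
{
    public static void Copy(byte* src, byte* dst, nuint bytes)
    {
        if (!Avx.IsSupported) throw new PlatformNotSupportedException("AVX is required");
        if (bytes % 256 != 0) throw new ArgumentException("bytes must be a multiple of 256");

        for (nuint i = 0; i < bytes; i += 256)
        {
            byte* s = src + i;
            byte* d = dst + i;

            // Issue all eight loads first...
            Vector256<byte> v0 = Avx.LoadVector256(s);
            Vector256<byte> v1 = Avx.LoadVector256(s + 32);
            Vector256<byte> v2 = Avx.LoadVector256(s + 64);
            Vector256<byte> v3 = Avx.LoadVector256(s + 96);
            Vector256<byte> v4 = Avx.LoadVector256(s + 128);
            Vector256<byte> v5 = Avx.LoadVector256(s + 160);
            Vector256<byte> v6 = Avx.LoadVector256(s + 192);
            Vector256<byte> v7 = Avx.LoadVector256(s + 224);

            // ...then write the eight vectors back out.
            Avx.Store(d, v0);
            Avx.Store(d + 32, v1);
            Avx.Store(d + 64, v2);
            Avx.Store(d + 96, v3);
            Avx.Store(d + 128, v4);
            Avx.Store(d + 160, v5);
            Avx.Store(d + 192, v6);
            Avx.Store(d + 224, v7);
        }
    }
}
```

Grouping the loads ahead of the stores gives the CPU several independent memory operations to work on at once, which is the reason such routines behave like fixed 256-byte (or 512-byte) block copies.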

Elephant No.3 - Multi Threading

So far, we have tested only single-threaded performance, as memory speed and cache capacity are often the limiting factor. However, that leaves some CPU resources unused, which might cost us some performance!

▶️ Can we combine the previous techniques with multiple threads?

Modern CPUs, even desktop ones, are becoming increasingly heterogeneous. By splitting the workload across multiple threads, we might take advantage of more CPU resources:

[Chart: throughput of the chunked, multi-threaded copy]

The chart above shows the chunked, multi-threaded approach. For reference, the test system uses dual-channel DDR5 memory at 6000MT/s, with a theoretical peak copy performance of about half of the total memory throughput (50% of 90GB/s), since a copy both reads and writes every byte.

What we observe here is a synergistic effect between smaller blocks and multiple threads!
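
Combining both ideas, a chunked, multi-threaded copy can be sketched with Parallel.For, where each iteration copies one block. This is an illustration under assumptions, not the code behind the chart: `ParallelChunkedCopy` is a hypothetical name, the block size and thread count are left to the caller, source and destination must not overlap, and the code needs `AllowUnsafeBlocks`.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: splits the buffer into blocks and copies them in parallel.
public static unsafe class ParallelChunkedCopy
{
    public static void Copy(byte* src, byte* dst, long bytes, int blockSize, int threads)
    {
        if (bytes < 0) throw new ArgumentOutOfRangeException(nameof(bytes));
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));
        if (threads <= 0) throw new ArgumentOutOfRangeException(nameof(threads));

        // Number of blocks, rounding up so a shorter tail block is included.
        long blocks = (bytes + blockSize - 1) / blockSize;
        var options = new ParallelOptions { MaxDegreeOfParallelism = threads };

        Parallel.For(0L, blocks, options, b =>
        {
            long offset = b * blockSize;
            long n = Math.Min(blockSize, bytes - offset);

            // Blocks do not overlap, so no synchronization is needed.
            Buffer.MemoryCopy(src + offset, dst + offset, n, n);
        });
    }
}
```

Parallel.For returns only after all blocks are copied, so the caller can keep the buffers pinned or allocated for the duration of the call. For small buffers the scheduling overhead can outweigh the gain, so this only pays off above a certain size.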

Final Thoughts

Standard framework functions are well-optimized, safe and sufficient for most use cases. If extreme performance is required, there are additional techniques and tradeoffs available.
For example, a single call to Buffer.MemoryCopy on an 8MB buffer reaches ~30GB/s. With multiple transfers of 128kB blocks spread across multiple threads on the same CPU, it is possible to reach 220GB/s, which is more than a 7x improvement.
