Performance Improvements in .NET Core 3.0

Source: devblogs.microsoft.com

Back when we were getting ready to ship .NET Core 2.0, I wrote a blog post exploring some of the many performance improvements that had gone into it. I enjoyed putting it together so much and received such a positive response to the post that I did it again for .NET Core 2.1, a version for which performance was also a significant focus. With //build last week and .NET Core 3.0’s release now on the horizon, I’m thrilled to have an opportunity to do it again.

.NET Core 3.0 has a ton to offer, from Windows Forms and WPF, to single-file executables, to async enumerables, to platform intrinsics, to HTTP/2, to fast JSON reading and writing, to assembly unloadability, to enhanced cryptography, and on and on and on… there is a wealth of new functionality to get excited about. For me, however, performance is the primary feature that makes me excited to go to work in the morning, and there’s a staggering amount of performance goodness in .NET Core 3.0.

In this post, we’ll take a tour through some of the many improvements, big and small, that have gone into the .NET Core runtime and core libraries in order to make your apps and services leaner and faster.

Setup

BenchmarkDotNet has become the preeminent tool for benchmarking .NET libraries, and so, as I did in my 2.1 post, I’ll use BenchmarkDotNet to demonstrate the improvements. Throughout the post, I’ll include the individual snippets of benchmarks that highlight the particular improvement being discussed. To be able to execute those benchmarks, you can use the following setup:

  1. Ensure you have .NET Core 3.0 installed, as well as .NET Core 2.1 for comparison purposes.
  2. Create a directory named BlogPostBenchmarks.
  3. In that directory, run dotnet new console.
  4. Replace the contents of BlogPostBenchmarks.csproj with the following:
    <Project Sdk="Microsoft.NET.Sdk">
    
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
        <TargetFrameworks>netcoreapp2.1;netcoreapp3.0</TargetFrameworks>
      </PropertyGroup>
    
      <ItemGroup>
        <PackageReference Include="BenchmarkDotNet" Version="0.11.5" />
        <PackageReference Include="System.Drawing.Common" Version="4.5.0" />
        <PackageReference Include="System.IO.Pipelines" Version="4.5.0" />
        <PackageReference Include="System.Threading.Channels" Version="4.5.0" />
      </ItemGroup>
    
    </Project>
  5. Replace the contents of Program.cs with the following:
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Jobs;
    using BenchmarkDotNet.Running;
    using BenchmarkDotNet.Toolchains.CsProj;
    using Microsoft.Win32.SafeHandles;
    using System;
    using System.Buffers;
    using System.Collections;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Collections.Immutable;
    using System.Diagnostics;
    using System.Drawing;
    using System.Drawing.Drawing2D;
    using System.Globalization;
    using System.IO;
    using System.IO.Compression;
    using System.IO.Pipelines;
    using System.Linq;
    using System.Net;
    using System.Net.Http;
    using System.Net.NetworkInformation;
    using System.Net.Security;
    using System.Net.Sockets;
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;
    using System.Security.Authentication;
    using System.Security.Cryptography.X509Certificates;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading;
    using System.Threading.Channels;
    using System.Threading.Tasks;
    using System.Xml;
    
    [MemoryDiagnoser]
    public class Program
    {
        static void Main(string[] args) => BenchmarkSwitcher.FromTypes(new[] { typeof(Program) }).Run(args);
    
        // ... paste benchmark code here
    }

To execute a particular benchmark, unless otherwise noted, copy and paste the relevant code to replace the // ... above, and execute dotnet run -c Release -f netcoreapp2.1 --runtimes netcoreapp2.1 netcoreapp3.0 --filter "*Program*". This will compile and run the tests in release builds, on both .NET Core 2.1 and .NET Core 3.0, and print out the results for comparison in a table.

Caveats

A few caveats before we get started:

  1. Any discussion involving microbenchmark results deserves a caveat that measurements can and do vary from machine to machine. I’ve tried to pick stable examples to share (and have run these tests on multiple machines in multiple configurations to help validate that), but don’t be too surprised if your numbers differ from the ones I’ve shown; hopefully, however, the magnitude of the improvements demonstrated carries through. All of the shown results are from a nightly Preview 6 build of .NET Core 3.0. Here are my Windows and Linux configurations, as summarized by BenchmarkDotNet:
    BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.437 (1809/October2018Update/Redstone5)
    Intel Core i7-7660U CPU 2.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
    .NET Core SDK=3.0.100-preview6-011854
      [Host]     : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
      Job-RODBZD : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
      Job-TVOWAH : .NET Core 3.0.0-preview6-27712-03 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26005), 64bit RyuJIT
    
    BenchmarkDotNet=v0.11.5, OS=ubuntu 18.04
    Intel Xeon CPU E5-2673 v4 2.30GHz, 1 CPU, 4 logical and 2 physical cores
    .NET Core SDK=3.0.100-preview6-011877
      [Host]     : .NET Core 2.1.10 (CoreCLR 4.6.27514.02, CoreFX 4.6.27514.02), 64bit RyuJIT
      Job-SSHMNT : .NET Core 2.1.10 (CoreCLR 4.6.27514.02, CoreFX 4.6.27514.02), 64bit RyuJIT
      Job-CHXNFO : .NET Core 3.0.0-preview6-27713-12 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26307), 64bit RyuJIT
  2. Unless otherwise mentioned, benchmarks were executed on Windows. In many cases, performance is equivalent between Windows and Unix, but in others, there can be non-trivial discrepancies between them, in particular in places where .NET relies on OS functionality, and the OS itself has different performance characteristics.
  3. I mentioned posts on .NET Core 2.0 and .NET Core 2.1, but not .NET Core 2.2. .NET Core 2.2 was primarily focused on ASP.NET, and while there were terrific performance improvements at the ASP.NET layer in 2.2, the runtime and core libraries in that release were largely in servicing mode, with most post-2.1 improvements skipping 2.2 and going into 3.0.

With that out of the way, let’s have some fun.

Span and Friends

One of the more notable features introduced in .NET Core 2.1 was Span<T>, along with its friends ReadOnlySpan<T>, Memory<T>, and ReadOnlyMemory<T>. The introduction of these new types came with hundreds of new methods for interacting with them, some on new types and some with overloaded functionality on existing types, as well as optimizations in the just-in-time compiler (JIT) for making working with them very efficient. The release also included some internal usage of Span<T> to make existing operations leaner and faster while still enjoying maintainable and safe code. In .NET Core 3.0, much additional work has gone into further improving all such aspects of these types: making the runtime better at generating code for them, increasing the use of them internally to help improve many other operations, and improving the various library utilities that interact with them to make consumption of these operations faster.

To work with a span, one first needs to get a span, and several PRs have made doing so faster. In particular, passing around a Memory<T> and then getting a Span<T> from it is a very common way of creating a span; this is, for example, how the various Stream.WriteAsync and ReadAsync methods work, accepting a {ReadOnly}Memory<T> (so that it can be stored on the heap) and then accessing its Span property once the actual bytes need to be read or written. PR dotnet/coreclr#20771 improved this by removing an argument validation branch (both for {ReadOnly}Memory<T>.Span and for {ReadOnly}Span<T>.Slice), and while removing a branch is a small thing, in span-heavy code (such as when doing formatting and parsing), small things done over and over and over again add up. More impactful, PR dotnet/coreclr#20386 plays tricks at the runtime level to safely eliminate some of the runtime checked casting and bit masking logic that had been used to enable {ReadOnly}Memory<T> to wrap various types, like string, T[], and MemoryManager<T>, providing a seamless veneer over all of them. The net result of these PRs is a nice speed-up when fishing a Span<T> out of a Memory<T>, which in turn improves all other operations that do so.

private ReadOnlyMemory<byte> _mem = new byte[1];

[Benchmark]
public ReadOnlySpan<byte> GetSpan() => _mem.Span;

Method Toolchain Mean Error StdDev Ratio
GetSpan netcoreapp2.1 3.873 ns 0.0927 ns 0.0822 ns 1.00
GetSpan netcoreapp3.0 1.843 ns 0.0401 ns 0.0375 ns 0.48

 

Of course, once you get a span, you want to use it, and there are a myriad of ways to use one, many of which have also been optimized further in .NET Core 3.0.

For example, just as with arrays, to pass the data from a span to native code via a P/Invoke, the data needs to be pinned (unless it’s already immovable, such as if the span were created to wrap some natively allocated memory not on the GC heap or if it were created for some data on the stack). To pin a span, the easiest way is to simply rely on the C# language’s support added in C# 7.3 that supports a pattern-based way to use any type with the fixed keyword. All a type need do is expose a GetPinnableReference method (or extension method) that returns a ref T to the data stored in that instance, and that type can be used with fixed. {ReadOnly}Span<T> does exactly this. However, even though {ReadOnly}Span<T>.GetPinnableReference generally gets inlined, a call it makes internally to Unsafe.AsRef was getting blocked from inlining; PR dotnet/coreclr#18274 fixed this, enabling the whole operation to be inlined. Further, the aforementioned code was actually tweaked in PR dotnet/coreclr#20428 to eliminate one branch on the hot path. Both of these combine to result in a measurable boost when pinning a span:

private readonly byte[] _bytes = new byte[10_000];

[Benchmark(OperationsPerInvoke = 10_000)]
public unsafe int PinSpan()
{
    Span<byte> s = _bytes;
    int total = 0;

    for (int i = 0; i < s.Length; i++)
        fixed (byte* p = s) // equivalent to `fixed (byte* p = &s.GetPinnableReference())`
            total += *p;

    return total;
}

Method Toolchain Mean Error StdDev Ratio RatioSD
PinSpan netcoreapp2.1 0.7930 ns 0.0177 ns 0.0189 ns 1.00 0.00
PinSpan netcoreapp3.0 0.6496 ns 0.0109 ns 0.0102 ns 0.82 0.03

 

It’s worth noting, as well, that if you’re interested in these kinds of micro-optimizations, you might also want to avoid using the default pinning at all, at least on super hot paths. The {ReadOnly}Span<T>.GetPinnableReference method was designed to behave just like pinning of arrays and strings, where null or empty inputs result in a null pointer. This behavior requires an additional check to be performed to see whether the length of the span is zero:

// https://github.com/dotnet/coreclr/blob/52aff202cd382c233d903d432da06deffaa21868/src/System.Private.CoreLib/shared/System/Span.Fast.cs#L168-L174

[EditorBrowsable(EditorBrowsableState.Never)]
public unsafe ref T GetPinnableReference()
{
    // Ensure that the native code has just one forward branch that is predicted-not-taken.
    ref T ret = ref Unsafe.AsRef<T>(null);
    if (_length != 0) ret = ref _pointer.Value;
    return ref ret;
}

If in your code by construction you know that the span will not be empty, you can choose to instead use MemoryMarshal.GetReference, which performs the same operation but without the length check:

// https://github.com/dotnet/coreclr/blob/52aff202cd382c233d903d432da06deffaa21868/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/MemoryMarshal.Fast.cs#L79

public static ref T GetReference<T>(Span<T> span) => ref span._pointer.Value;

Again, while a single check adds minor overhead, when executed over and over and over, that can add up:

private readonly byte[] _bytes = new byte[10_000];

[Benchmark(OperationsPerInvoke = 10_000, Baseline = true)]
public unsafe int PinSpan()
{
    Span<byte> s = _bytes;
    int total = 0;

    for (int i = 0; i < s.Length; i++)
        fixed (byte* p = s) // equivalent to `fixed (byte* p = &s.GetPinnableReference())`
            total += *p;

    return total;
}

[Benchmark(OperationsPerInvoke = 10_000)]
public unsafe int PinSpanExplicit()
{
    Span<byte> s = _bytes;
    int total = 0;

    for (int i = 0; i < s.Length; i++)
        fixed (byte* p = &MemoryMarshal.GetReference(s))
            total += *p;

    return total;
}

Method Mean Error StdDev Ratio RatioSD
PinSpan 0.6524 ns 0.0129 ns 0.0159 ns 1.00 0.00
PinSpanExplicit 0.5200 ns 0.0111 ns 0.0140 ns 0.80 0.03

 

Of course, there are many other (and generally preferred) ways to operate over a span’s data than to use fixed. For example, it’s a bit surprising that until Span<T> came along, .NET didn’t have a built-in equivalent of memcmp, but nevertheless, Span<T>’s SequenceEqual and SequenceCompareTo methods have become go-to methods for comparing in-memory data in .NET. In .NET Core 2.1, both SequenceEqual and SequenceCompareTo were optimized to utilize System.Numerics.Vector for vectorization, but the nature of SequenceEqual made it the more amenable of the two to taking best advantage. In PR dotnet/coreclr#22127, @benaadams updated SequenceCompareTo to take advantage of the new hardware intrinsics APIs available in .NET Core 3.0 to specifically target AVX2 and SSE2, resulting in significant improvements when comparing both small and large spans. (For more information on hardware intrinsics in .NET Core 3.0, see platform-intrinsics.md and using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios.)

private byte[] _orig, _same, _differFirst, _differLast;

[Params(16, 256)]
public int Length { get; set; }

[GlobalSetup]
public void Setup()
{
    _orig = Enumerable.Range(0, Length).Select(i => (byte)i).ToArray();
    _same = (byte[])_orig.Clone();

    _differFirst = (byte[])_orig.Clone();
    _differFirst[0] = (byte)(_orig[0] + 1);

    _differLast = (byte[])_orig.Clone();
    _differLast[_differLast.Length - 1] = (byte)(_orig[_orig.Length - 1] + 1);
}

[Benchmark]
public int CompareSame() => _orig.AsSpan().SequenceCompareTo(_same);

[Benchmark]
public int CompareDifferFirst() => _orig.AsSpan().SequenceCompareTo(_differFirst);

[Benchmark]
public int CompareDifferLast() => _orig.AsSpan().SequenceCompareTo(_differLast);

Method Toolchain Length Mean Error StdDev Ratio
CompareSame netcoreapp2.1 16 16.955 ns 0.2009 ns 0.1781 ns 1.00
CompareSame netcoreapp3.0 16 4.757 ns 0.0938 ns 0.0732 ns 0.28
CompareDifferFirst netcoreapp2.1 16 11.874 ns 0.1240 ns 0.1100 ns 1.00
CompareDifferFirst netcoreapp3.0 16 5.174 ns 0.0543 ns 0.0508 ns 0.44
CompareDifferLast netcoreapp2.1 16 16.644 ns 0.2146 ns 0.2007 ns 1.00
CompareDifferLast netcoreapp3.0 16 5.373 ns 0.0479 ns 0.0448 ns 0.32
CompareSame netcoreapp2.1 256 43.740 ns 0.8226 ns 0.7292 ns 1.00
CompareSame netcoreapp3.0 256 11.055 ns 0.1625 ns 0.1441 ns 0.25
CompareDifferFirst netcoreapp2.1 256 12.144 ns 0.0849 ns 0.0752 ns 1.00
CompareDifferFirst netcoreapp3.0 256 6.663 ns 0.1044 ns 0.0977 ns 0.55
CompareDifferLast netcoreapp2.1 256 39.697 ns 0.9291 ns 2.6054 ns 1.00
CompareDifferLast netcoreapp3.0 256 11.242 ns 0.2218 ns 0.1732 ns 0.32

 

As background, “vectorization” is an approach to parallelization that performs multiple operations as part of individual instructions on a single core. Some optimizing compilers can perform automatic vectorization, whereby the compiler analyzes loops to determine whether it can generate functionally equivalent code that would utilize such instructions to run faster. The .NET JIT compiler does not currently perform auto-vectorization, but it is possible to manually vectorize loops, and the options for doing so have significantly improved in .NET Core 3.0. Just as a simple example of what vectorization can look like, imagine having an array of bytes and wanting to search it for the first non-zero byte, returning the position of that byte. The simple solution is to just iterate through all of the bytes:

private byte[] _buffer = new byte[10_000].Concat(new byte[] { 42 }).ToArray();

[Benchmark(Baseline = true)]
public int LoopBytes()
{
    byte[] buffer = _buffer;
    for (int i = 0; i < buffer.Length; i++)
    {
        if (buffer[i] != 0)
            return i;
    }
    return -1;
}

That of course works functionally, and for very small arrays it’s fine. But for larger arrays, we end up doing significantly more work than is actually necessary. Consider instead, in a 64-bit process, re-interpreting the array of bytes as an array of longs, which Span<T> nicely supports. We then effectively compare 8 bytes at a time rather than 1 byte at a time, at the expense of added code complexity: once we find a non-zero long, we need to look at each byte it contains to determine the position of the first non-zero one (though there are ways to improve that, too). Similarly, the array’s length may not evenly divide by 8, so we need to be able to handle the overflow.

[Benchmark]
public int LoopLongs()
{
    byte[] buffer = _buffer;
    int remainingStart = 0;

    if (IntPtr.Size == sizeof(long))
    {
        Span<long> longBuffer = MemoryMarshal.Cast<byte, long>(buffer);
        remainingStart = longBuffer.Length * sizeof(long);

        for (int i = 0; i < longBuffer.Length; i++)
        {
            if (longBuffer[i] != 0)
            {
                remainingStart = i * sizeof(long);
                break;
            }
        }
    }

    for (int i = remainingStart; i < buffer.Length; i++)
    {
        if (buffer[i] != 0)
            return i;
    }

    return -1;
}

For longer arrays, this yields really nice wins:

Method Mean Error StdDev Ratio
LoopBytes 5,462.3 ns 107.093 ns 105.180 ns 1.00
LoopLongs 568.6 ns 6.895 ns 5.758 ns 0.10

 

I’ve glossed over some details here, but it should convey the core idea. .NET includes additional mechanisms for vectorizing as well. In particular, the aforementioned System.Numerics.Vector type allows for a developer to write code using Vector and then have the JIT compiler translate that into the best instructions available on the current platform.

[Benchmark]
public int LoopVectors()
{
    byte[] buffer = _buffer;
    int remainingStart = 0;

    if (Vector.IsHardwareAccelerated)
    {
        while (remainingStart <= buffer.Length - Vector<byte>.Count)
        {
            var vector = new Vector<byte>(buffer, remainingStart);
            if (!Vector.EqualsAll(vector, default))
            {
                break;
            }
            remainingStart += Vector<byte>.Count;
        }
    }

    for (int i = remainingStart; i < buffer.Length; i++)
    {
        if (buffer[i] != 0)
            return i;
    }

    return -1;
}

Method Mean Error StdDev Ratio
LoopBytes 5,462.3 ns 107.093 ns 105.180 ns 1.00
LoopLongs 568.6 ns 6.895 ns 5.758 ns 0.10
LoopVectors 306.0 ns 4.502 ns 4.211 ns 0.06

 

Further, .NET Core 3.0 includes new hardware intrinsics that allow a properly-motivated developer to eke out the best possible performance on supporting hardware, utilizing extensions like AVX or SSE that can compare well more than 8 bytes at a time. Many of the improvements in .NET Core 3.0 come from utilizing these techniques.
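To make that concrete, here’s a rough sketch (my own illustration, not code from the runtime; the FirstNonZero helper name is made up) of the same first-non-zero-byte search written against the SSE2 intrinsics, comparing 16 bytes per iteration:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class FirstNonZero
{
    public static unsafe int IndexOf(ReadOnlySpan<byte> buffer)
    {
        int i = 0;
        if (Sse2.IsSupported)
        {
            fixed (byte* p = buffer)
            {
                Vector128<byte> zero = Vector128<byte>.Zero;
                while (i <= buffer.Length - Vector128<byte>.Count)
                {
                    // Compare 16 bytes against zero at once; MoveMask then packs
                    // the per-byte comparison results into a 16-bit mask.
                    int mask = Sse2.MoveMask(Sse2.CompareEqual(Sse2.LoadVector128(p + i), zero));
                    if (mask != 0xFFFF)
                    {
                        // A 0 bit marks a byte that differed from zero; the lowest
                        // such bit is the first non-zero byte in this chunk.
                        return i + BitOperations.TrailingZeroCount(~mask & 0xFFFF);
                    }
                    i += Vector128<byte>.Count;
                }
            }
        }

        // Scalar fallback for the remainder (and for hardware without SSE2).
        for (; i < buffer.Length; i++)
        {
            if (buffer[i] != 0)
                return i;
        }

        return -1;
    }
}
```

Note this requires unsafe code (AllowUnsafeBlocks is already enabled in the csproj above), and a production version would also want an AVX2 path and alignment handling, which I’ve omitted for brevity.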

Back to examples, copying spans has also improved, thanks to PRs dotnet/coreclr#18006 from @benaadams and dotnet/coreclr#17889, in particular for relatively small spans…

private byte[] _from = new byte[] { 1, 2, 3, 4 };
private byte[] _to = new byte[4];

[Benchmark]
public void CopySpan() => _from.AsSpan().CopyTo(_to);

Method Toolchain Mean Error StdDev Ratio
CopySpan netcoreapp2.1 10.913 ns 0.1960 ns 0.1737 ns 1.00
CopySpan netcoreapp3.0 3.568 ns 0.0528 ns 0.0494 ns 0.33

 

Searching is one of the most commonly performed operations in any program, and searches with spans are generally performed with IndexOf and its variants (e.g. IndexOfAny and Contains). In PR dotnet/coreclr#20738, @benaadams again utilized vectorization, this time to improve the performance of IndexOfAny when operating over bytes, a particularly common case in many networking-related scenarios (e.g. parsing bytes off the wire as part of an HTTP stack). You can see the effects of this in the following microbenchmark:

private byte[] _arr = Encoding.UTF8.GetBytes("This is a test to see improvements to IndexOfAny.  How'd they work?");
[Benchmark] public int IndexOf() => new Span<byte>(_arr).IndexOfAny((byte)'.', (byte)'?');

Method Toolchain Mean Error StdDev Ratio
IndexOf netcoreapp2.1 12.828 ns 0.1805 ns 0.1600 ns 1.00
IndexOf netcoreapp3.0 4.504 ns 0.0968 ns 0.0858 ns 0.35

 

I love these kinds of improvements, because they’re low-enough in the stack that they end up having multiplicative effects across so much code. The above change only affected byte, but subsequent PRs were submitted to cover char as well, and then PR dotnet/coreclr#20855 made a nice change that brought these same changes to other primitives of the same sizes. For example, we can recast the previous benchmark to use sbyte instead of byte, and as of that PR, a similar improvement applies:

private sbyte[] _arr = Encoding.UTF8.GetBytes("This is a test to see improvements to IndexOfAny.  How'd they work?").Select(b => (sbyte)b).ToArray();

[Benchmark]
public int IndexOf() => new Span<sbyte>(_arr).IndexOfAny((sbyte)'.', (sbyte)'?');

Method Toolchain Mean Error StdDev Ratio
IndexOf netcoreapp2.1 24.636 ns 0.2292 ns 0.2144 ns 1.00
IndexOf netcoreapp3.0 9.795 ns 0.1419 ns 0.1258 ns 0.40

 

As another example, consider PR dotnet/coreclr#20275. That change similarly utilized vectorization to improve the performance of To{Upper/Lower}{Invariant}.

private string _src = "This is a source string that needs to be capitalized.";
private char[] _dst = new char[1024];
[Benchmark] public int ToUpperInvariant() => _src.AsSpan().ToUpperInvariant(_dst);

Method Toolchain Mean Error StdDev Ratio
ToUpperInvariant netcoreapp2.1 64.36 ns 0.8099 ns 0.6763 ns 1.00
ToUpperInvariant netcoreapp3.0 26.48 ns 0.2411 ns 0.2137 ns 0.41

 

PR dotnet/coreclr#19959 optimizes the Trim{Start/End} helpers on ReadOnlySpan<char>, another very commonly-applied method, with equally exciting results (it’s hard to see with the white space in the results, but the results in the table go in order of the arguments in the Params attribute):

[Params("", " abcdefg ", "abcdefg")]
public string Data;

[Benchmark]
public ReadOnlySpan<char> Trim() => Data.AsSpan().Trim();

Method Toolchain Data Mean Error StdDev Ratio
Trim netcoreapp2.1 12.999 ns 0.1913 ns 0.1789 ns 1.00
Trim netcoreapp3.0 3.078 ns 0.0349 ns 0.0326 ns 0.24
Trim netcoreapp2.1 abcdefg 17.618 ns 0.3534 ns 0.2951 ns 1.00
Trim netcoreapp3.0 abcdefg 7.927 ns 0.0934 ns 0.0828 ns 0.45
Trim netcoreapp2.1 abcdefg 15.522 ns 0.2200 ns 0.1951 ns 1.00
Trim netcoreapp3.0 abcdefg 5.227 ns 0.0750 ns 0.0665 ns 0.34

 

Sometimes optimizations are just about being smarter about code management. PR dotnet/coreclr#17890 removed an unnecessary layer of functions that were on many globalization-related code paths, and just removing those extra unnecessary method invocations results in measurable speed-ups when working with small spans, e.g.

[Benchmark]
public bool EndsWith() => "Hello world".AsSpan().EndsWith("world", StringComparison.OrdinalIgnoreCase);
Method Toolchain Mean Error StdDev Ratio
EndsWith netcoreapp2.1 37.80 ns 0.3290 ns 0.2917 ns 1.00
EndsWith netcoreapp3.0 12.26 ns 0.1479 ns 0.1384 ns 0.32

 

Of course, one of the great things about span is that it’s a reusable building block that enables many higher-level operations. That includes operations on both arrays and strings…

Arrays and Strings

As a theme that’s emerged within .NET Core, wherever possible, new performance-focused functionality should not only be exposed for public use but also be used internally; after all, given the depth and breadth of functionality within .NET Core, if some performance-focused feature doesn’t meet the needs of .NET Core itself, there’s a reasonable chance it also won’t meet the public need. As such, internal usage of new features is a key benchmark as to whether the design is adequate, and in the process of evaluating such criteria, many additional code paths benefit, and these improvements have a multiplicative effect.

This isn’t just about new APIs. Many of the language features introduced in C# 7.2, 7.3, and 8.0 are influenced by the needs of .NET Core itself and have been used to improve things that we couldn’t reasonably improve before (other than dropping down to unsafe code, which we try to avoid when possible). For example, PR dotnet/coreclr#17891 speeds up Array.Reverse by taking advantage of the C# 7.2 ref locals feature and the 7.3 ref local reassignment feature. Using the new feature allows for the code to be expressed in a way that lets the JIT generate better code for the inner loop, and in turn results in a measurable speed-up:

private int[] _arr = Enumerable.Range(0, 256).ToArray();

[Benchmark]
public void Reverse() => Array.Reverse(_arr);

Method Toolchain Mean Error StdDev Ratio RatioSD
Reverse netcoreapp2.1 105.06 ns 2.488 ns 7.337 ns 1.00 0.00
Reverse netcoreapp3.0 74.12 ns 1.494 ns 2.536 ns 0.66 0.02
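The ref-local technique behind that change can be sketched roughly as follows (my own illustration in the spirit of the PR, not the actual coreclr implementation; the SpanReverser name is made up):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class SpanReverser
{
    public static void Reverse<T>(Span<T> span)
    {
        if (span.Length <= 1)
            return;

        // Walk two refs inward from each end, swapping as we go. Ref locals
        // (C# 7.2) and ref local reassignment (C# 7.3) let the loop advance
        // references directly rather than re-indexing into the span each time,
        // which helps the JIT generate tighter code for the inner loop.
        ref T first = ref MemoryMarshal.GetReference(span);
        ref T last = ref Unsafe.Add(ref first, span.Length - 1);
        do
        {
            T temp = first;
            first = last;
            last = temp;
            first = ref Unsafe.Add(ref first, 1);      // ref reassignment
            last = ref Unsafe.Subtract(ref last, 1);
        }
        while (Unsafe.IsAddressLessThan(ref first, ref last));
    }
}
```

Without ref reassignment, code like this previously had to either index with bounds-check-prone arithmetic or drop down to unsafe pointers.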

 

As another example for arrays, the Array.Clear method was improved in PR dotnet/coreclr#24302, which works around an alignment issue that could make the underlying memset used to implement the operation up to 2x slower. The change manually clears up to a few bytes one by one, such that the pointer then handed off to memset is properly aligned. If you got “lucky” previously and the array happened to be aligned, performance was fine, but if it wasn’t aligned, there was a non-trivial performance hit incurred. This benchmark simulates the unlucky case:

[GlobalSetup]
public void Setup()
{
    while (true)
    {
        var buffer = new byte[8192];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        if (((long)handle.AddrOfPinnedObject()) % 32 != 0)
        {
            _handle = handle;
            _buffer = buffer;
            return;
        }
        handle.Free();
    }
}

[GlobalCleanup]
public void Cleanup() => _handle.Free();

private GCHandle _handle;
private byte[] _buffer;

[Benchmark] public void Clear() => Array.Clear(_buffer, 0, _buffer.Length);
Method Toolchain Mean Error StdDev Ratio
Clear netcoreapp2.1 121.59 ns 0.8349 ns 0.6519 ns 1.00
Clear netcoreapp3.0 87.91 ns 1.7768 ns 1.6620 ns 0.73

 

That said, many of the improvements are in fact based on new APIs. Span is a great example of this. It was introduced in .NET Core 2.1, and the initial push was to get it to be usable and expose sufficient surface area to allow it to be used meaningfully. But at the same time, we started utilizing it internally in order to both vet the design and benefit from the improvements it enables. Some of this was done in .NET Core 2.1, but the effort continues in .NET Core 3.0. Arrays and strings are both prime candidates for such optimizations.

For example, many of the same vectorization optimizations applied to spans are similarly applied to arrays. PR dotnet/coreclr#21116 from @benaadams optimized Array.{Last}IndexOf for both bytes and chars, utilizing the same internal helpers that were written to enable spans, and to similar effect:

private char[] _arr = "This is a test to see improvements to IndexOf.  How'd they work?".ToCharArray();

[Benchmark]
public int IndexOf() => Array.IndexOf(_arr, '.');
Method Toolchain Mean Error StdDev Ratio RatioSD
IndexOf netcoreapp2.1 34.976 ns 0.6352 ns 0.5631 ns 1.00 0.00
IndexOf netcoreapp3.0 9.471 ns 0.6638 ns 1.1091 ns 0.29 0.04

 

And as with spans, thanks to PR dotnet/coreclr#24293 from @dschinde, these IndexOf optimizations also now apply to other primitives of the same size.

private short[] _arr = "This is a test to see improvements to IndexOf.  How'd they work?".Select(c => (short)c).ToArray();

[Benchmark]
public int IndexOf() => Array.IndexOf(_arr, (short)'.');
Method Toolchain Mean Error StdDev Ratio
IndexOf netcoreapp2.1 34.181 ns 0.6626 ns 0.6508 ns 1.00
IndexOf netcoreapp3.0 9.600 ns 0.1913 ns 0.1598 ns 0.28

 

Vectorization optimizations have been applied to strings, too. You can see the effect of PR dotnet/coreclr#21076 from @benaadams in this microbenchmark:

[Benchmark]
public int IndexOf() => "Let's see how fast we can find the period towards the end of this string.  Pretty fast?".IndexOf('.', StringComparison.Ordinal);
Method Toolchain Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
IndexOf netcoreapp2.1 75.14 ns 1.5285 ns 1.6355 ns 1.00 0.0151 32 B
IndexOf netcoreapp3.0 11.70 ns 0.2382 ns 0.2111 ns 0.16

 

Also note in the above that the .NET Core 2.1 operation allocates (due to converting the search character into a string), whereas the .NET Core 3.0 implementation does not. That’s thanks to PR dotnet/coreclr#19788 from @benaadams.

There are of course pieces of functionality that are more unique to strings (albeit also applicable to new functionality exposed on spans), such as hash code computation with various string comparison methods. For example, PR dotnet/coreclr#20309 improved the performance of String.GetHashCode when performing OrdinalIgnoreCase operations, which along with Ordinal (the default) represent the two most common modes.

[Benchmark]
public int GetHashCodeIgnoreCase() => "Some string".GetHashCode(StringComparison.OrdinalIgnoreCase);
Method Toolchain Mean Error StdDev Ratio
GetHashCodeIgnoreCase netcoreapp2.1 47.70 ns 0.5751 ns 0.5380 ns 1.00
GetHashCodeIgnoreCase netcoreapp3.0 14.28 ns 0.1462 ns 0.1296 ns 0.30

 

OrdinalIgnoreCase has been improved for other uses as well. For example, PR dotnet/coreclr#20734 improved String.Equals when using StringComparison.OrdinalIgnoreCase, both by vectorizing (checking two chars at a time instead of one) and by removing branches from an inner loop:

[Benchmark]
public bool EqualsIC() => "Some string".Equals("sOME sTrinG", StringComparison.OrdinalIgnoreCase);
Method Toolchain Mean Error StdDev Ratio
EqualsIC netcoreapp2.1 24.036 ns 0.3819 ns 0.3572 ns 1.00
EqualsIC netcoreapp3.0 9.165 ns 0.0589 ns 0.0551 ns 0.38
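The scalar core of such a comparison relies on a classic ASCII trick: upper- and lowercase ASCII letters differ only in the 0x20 bit. Here’s a simplified sketch of that idea (illustrative only; the actual coreclr code also vectorizes and handles non-ASCII input, and the helper name here is made up):

```csharp
using System;

public static class OrdinalIgnoreCaseAscii
{
    public static bool Equals(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        if (a.Length != b.Length)
            return false;

        for (int i = 0; i < a.Length; i++)
        {
            char ca = a[i], cb = b[i];
            if (ca == cb)
                continue;

            // ORing with 0x20 folds an ASCII letter to lowercase. If the folded
            // characters still differ, they can't be equal; if they match but
            // aren't letters (e.g. '@' vs '`' also differ only in the 0x20 bit),
            // the original characters weren't case-insensitively equal either.
            if ((ca | 0x20) != (cb | 0x20) || (uint)((ca | 0x20) - 'a') > 'z' - 'a')
                return false;
        }

        return true;
    }
}
```

Folding both characters and comparing once avoids the per-character branching of a naive "uppercase each side, then compare" approach.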

 

The previous cases are examples of functionality in String‘s implementation, but there is lots of ancillary string-related functionality that has seen improvements as well. For example, various operations on Char have been improved, such as Char.GetUnicodeCategory via PRs dotnet/coreclr#20983 and dotnet/coreclr#20864:

[Params('.', 'a', '\u05D0')]
public char Char { get; set; }

[Benchmark]
public UnicodeCategory GetCategory() => char.GetUnicodeCategory(Char);
Method Toolchain Char Mean Error StdDev Ratio RatioSD
GetCategory netcoreapp2.1 . 1.8001 ns 0.0160 ns 0.0142 ns 1.00 0.00
GetCategory netcoreapp3.0 . 0.4925 ns 0.0141 ns 0.0132 ns 0.27 0.01
GetCategory netcoreapp2.1 a 1.7925 ns 0.0144 ns 0.0127 ns 1.00 0.00
GetCategory netcoreapp3.0 a 0.4957 ns 0.0117 ns 0.0091 ns 0.28 0.01
GetCategory netcoreapp2.1 א 3.7836 ns 0.0493 ns 0.0461 ns 1.00 0.00
GetCategory netcoreapp3.0 א 2.7531 ns 0.0757 ns 0.0633 ns 0.73 0.02

 

Those PRs also highlight another case of benefiting from a language improvement. As of C# 7.3, the C# compiler is able to optimize properties of the form:

static ReadOnlySpan<byte> s_byteData => new byte[] { … /* constant bytes */ }

Rather than emitting this exactly as written, which would allocate a new byte array on each call, the compiler takes advantage of the facts that a) the bytes backing the array are all constant and b) it’s being returned as a read-only span, which means the consumer is unable to mutate the data using safe code. As such, with PR dotnet/roslyn#24621, the C# compiler instead emits this by writing the bytes as a binary blob in metadata, and the property then simply creates a span that points directly to that data, making it very fast to access the data, more so even than if this property returned a static byte[].
// Run with: dotnet run -c Release -f netcoreapp2.1 --filter *Program* --runtimes netcoreapp3.0

private static byte[] ArrayProp { get; } = new byte[] { 1, 2, 3 };

[Benchmark(Baseline = true)]
public ReadOnlySpan<byte> GetArrayProp() => ArrayProp;

private static ReadOnlySpan<byte> SpanProp => new byte[] { 1, 2, 3 };

[Benchmark]
public ReadOnlySpan<byte> GetSpanProp() => SpanProp;
Method Mean Error StdDev Median Ratio
GetArrayProp 1.3362 ns 0.0498 ns 0.0416 ns 1.3366 ns 1.000
GetSpanProp 0.0125 ns 0.0132 ns 0.0110 ns 0.0080 ns 0.009

 

Another string-related area that's gotten some attention is StringBuilder. It's not that StringBuilder itself has been the main target (although it has received some improvements, for example a new overload in PR dotnet/coreclr#20773 from @Wraith2 that helps avoid accidentally boxing a ReadOnlyMemory<char> appended to the builder and creating a string from it). Rather, in many situations StringBuilder has been used for convenience but at a cost, and with just a little work (and in some cases the new String.Create method introduced in .NET Core 2.1), that overhead can be eliminated, in both CPU usage and allocation. Here are a few examples…

  • PR dotnet/corefx#33598 removed a StringBuilder used in marshaling from Dns.GetHostEntry:
[Benchmark]
public IPHostEntry GetHostEntry() => Dns.GetHostEntry("34.206.253.53");
Method Toolchain Mean Error StdDev Median Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
GetHostEntry netcoreapp2.1 532.7 us 16.59 us 46.79 us 526.8 us 1.00 0.00 1.9531 4888 B
GetHostEntry netcoreapp3.0 527.7 us 12.85 us 37.06 us 542.8 us 1.00 0.11 616 B

 

  • Formatting a DateTime with a Hebrew-calendar culture has also gotten cheaper, with roughly half the allocation:

private static CultureInfo CreateCulture()
{
    var c = new CultureInfo("he-IL");
    c.DateTimeFormat.Calendar = new HebrewCalendar();
    return c;
}

private CultureInfo _hebrewIsrael = CreateCulture();

[Benchmark]
public string FormatHebrew() => new DateTime(2018, 11, 20).ToString(_hebrewIsrael);

Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
FormatHebrew netcoreapp2.1 626.0 ns 7.917 ns 7.405 ns 1.00 0.00 0.2890 608 B
FormatHebrew netcoreapp3.0 570.6 ns 10.504 ns 9.825 ns 0.91 0.02 0.1554 328 B

 

  • PhysicalAddress.ToString is another example, now running faster and allocating far less:

private readonly PhysicalAddress _short = new PhysicalAddress(new byte[1] { 42 });
private readonly PhysicalAddress _long = new PhysicalAddress(Enumerable.Range(0, 256).Select(i => (byte)i).ToArray());

[Benchmark]
public string PAShort() => _short.ToString();

[Benchmark]
public string PALong() => _long.ToString();

Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
PAShort netcoreapp2.1 33.68 ns 1.0378 ns 2.9271 ns 1.00 0.00 0.0648 136 B
PAShort netcoreapp3.0 17.12 ns 0.4240 ns 0.7313 ns 0.55 0.04 0.0153 32 B
PALong netcoreapp2.1 2,761.80 ns 50.1515 ns 46.9117 ns 1.00 0.00 1.1940 2512 B
PALong netcoreapp3.0 787.78 ns 27.4673 ns 80.1234 ns 0.31 0.01 0.5007 1048 B

 

  • PR dotnet/corefx#29605 removed StringBuilders from various properties of X509Certificate:

private X509Certificate2 _cert = GetCert();

private static X509Certificate2 GetCert()
{
    using (var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
    {
        client.Connect("microsoft.com", 443);
        using (var ssl = new SslStream(new NetworkStream(client)))
        {
            ssl.AuthenticateAsClient("microsoft.com", null, SslProtocols.None, false);
            return new X509Certificate2(ssl.RemoteCertificate);
        }
    }
}

[Benchmark]
public string CertProp() => _cert.Thumbprint;

Method Toolchain Mean Error StdDev Median Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
CertProp netcoreapp2.1 209.30 ns 4.464 ns 10.435 ns 204.35 ns 1.00 0.00 0.1256 264 B
CertProp netcoreapp3.0 95.82 ns 1.822 ns 1.704 ns 96.43 ns 0.45 0.02 0.0497 104 B

 

and so on. These PRs demonstrate that good gains can be had simply by making small tweaks that make existing code paths cheaper, and that extends well beyond StringBuilder. There are lots of places within .NET Core, for example, where String.Substring is used, and many of those cases can be replaced with AsSpan and Slice, as was done in PR dotnet/corefx#29402 by @juliushardt and in PRs dotnet/coreclr#17916 and dotnet/corefx#29539, or as in PRs dotnet/corefx#29227 and dotnet/corefx#29721, which removed string allocations from FileSystemWatcher by delaying the creation of such strings until it was known they were absolutely necessary.
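Some of the rewrites above use the String.Create method mentioned earlier. As a hypothetical sketch of that pattern (this is not code from any of the PRs; the helper name is mine), here's how bytes can be formatted as hex into a single, exactly-sized string with no StringBuilder and no intermediate allocations:

```csharp
using System;

// Illustration of the String.Create pattern: the callback writes chars
// directly into the new string's buffer before the string is ever
// exposed, so immutability is preserved and only one allocation occurs.
static string ToHex(byte[] bytes) =>
    string.Create(bytes.Length * 2, bytes, (span, state) =>
    {
        const string digits = "0123456789ABCDEF";
        for (int i = 0; i < state.Length; i++)
        {
            span[2 * i] = digits[state[i] >> 4];
            span[2 * i + 1] = digits[state[i] & 0xF];
        }
    });

Console.WriteLine(ToHex(new byte[] { 0x2A, 0xFF })); // 2AFF
```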

WebUtility.HtmlDecode has likewise been improved to avoid unnecessary work and allocation when the input contains nothing to decode:

[Benchmark]
public string HtmlDecode() => WebUtility.HtmlDecode("水水水水水水水");
Method Toolchain Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
HtmlDecode netcoreapp2.1 638.2 ns 8.474 ns 7.077 ns 1.00 0.1516 320 B
HtmlDecode netcoreapp3.0 153.7 ns 2.776 ns 2.461 ns 0.24 0.0191 40 B

 

Another example of using new APIs to improve existing functionality is String.Concat. .NET Core 3.0 has several new String.Concat overloads that accept ReadOnlySpan<char> instead of string. These make it easy to avoid allocating copies of substrings when concatenating pieces of other strings: instead of pairing String.Concat with String.Substring, it can be paired with String.AsSpan(...) or Slice. In fact, PRs dotnet/coreclr#21766 and dotnet/corefx#34451, which implemented, exposed, and added tests for these new overloads, also moved tens of call sites across .NET Core onto them. Here's an example of the impact one of those has, improving the performance of accessing Uri.DnsSafeHost:

[Benchmark]
public string DnsSafeHost() => new Uri("http://[fe80::3]%1").DnsSafeHost;
Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
DnsSafeHost netcoreapp2.1 733.7 ns 14.448 ns 17.20 ns 1.00 0.00 0.2012 424 B
DnsSafeHost netcoreapp3.0 450.1 ns 9.013 ns 18.41 ns 0.63 0.02 0.1059 224 B

 
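To make the Substring-versus-AsSpan difference concrete, here's a small usage sketch (the URL and offsets are made up for illustration):

```csharp
using System;

string url = "http://example.com/path";
int start = 7, len = 11; // the "example.com" portion

// Before: Substring allocates a throwaway intermediate string.
string host1 = string.Concat("host=", url.Substring(start, len));

// After: AsSpan slices without allocating; only the result is allocated,
// thanks to the new Concat(ReadOnlySpan<char>, ReadOnlySpan<char>) overload.
string host2 = string.Concat("host=".AsSpan(), url.AsSpan(start, len));

Console.WriteLine(host1 == host2); // True
Console.WriteLine(host2);          // host=example.com
```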

Another example, using Path.ChangeExtension to change from one non-null extension to another:

[Benchmark]
public string ChangeExtension() => Path.ChangeExtension("filename.txt", ".dat");
Method Toolchain Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
ChangeExtension netcoreapp2.1 30.57 ns 0.7124 ns 0.6664 ns 1.00 0.0495 104 B
ChangeExtension netcoreapp3.0 24.54 ns 0.3398 ns 0.2838 ns 0.80 0.0229 48 B

 

Finally, a very closely related area is that of encoding. A bunch of improvements were made in .NET Core 3.0 around Encoding, both in general and for specific encodings, such as PR dotnet/coreclr#18263, which allowed an existing corner-case optimization for Encoding.Unicode.GetString to be applied in many more cases.
