By default PerfView causes the Runtime to log an event at the beginning and end of each .NET Garbage Collection, as well as every time approximately 100KB of objects have been allocated. As with all events, the precise time is logged, so the amount of time spent in the GC can be computed. Most applications spend less than 10% of their total CPU time in the GC itself. If your application exceeds this percentage, it usually means that your allocation pattern is causing many expensive Gen 2 GCs to occur.
If the GC heap is a large percentage of the total memory used, then GC heap optimization is worthwhile; use the Memory->Take Heap Snapshot feature to drill into GC heap usage. See Memory Usage Auditing For .NET Applications for more on memory optimization.
During SOME GCs the application itself has to be suspended so the GC can update object references. Thus the application will pause when a GC happens. If these pause times are larger than 100ms or so, they can impact the user experience. The GC statistics track the maximum and average pause time to allow you to confirm that this bad GC behavior is not happening.
By default, PerfView causes the Runtime to log an event every time a managed object is finalized, meaning that its finalizer (denoted in C# with the ~ syntax) is executed. For a detailed look at the costs involved in finalization, see this blog post.
PerfView is unable to determine the stack that allocated a finalizable object, but it is able to accurately report each type of object that had a finalizer executed along with the number of instances of that type that were finalized. This data is shown on the GC Stats report in the Finalized Object Counts table.
In an ideal application implementation, all finalizable objects would be cleaned up deterministically via an object's IDisposable.Dispose implementation, which should suppress the finalization of the object. Not deterministically disposing of finalizable objects can lead to degradation of both the reliability and performance of the app. You can examine the counts reported in the Finalized Object Counts table to determine whether any types have significant numbers of instances being left for finalization. Based on that, you can examine code that creates instances of these types and determine why those instances are being left for non-deterministic cleanup rather than being cleaned up deterministically with Dispose.
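For reference, here is a minimal sketch of the common dispose pattern (the class and handle here are hypothetical). Dispose performs the cleanup deterministically and calls GC.SuppressFinalize, so the finalizer never runs and the type stays out of the Finalized Object Counts table:

    using System;

    public sealed class NativeResourceHolder : IDisposable
    {
        private IntPtr _handle;   // hypothetical unmanaged resource

        public NativeResourceHolder(IntPtr handle) => _handle = handle;

        // Deterministic cleanup: releases the resource and tells the GC
        // that the finalizer no longer needs to run.
        public void Dispose()
        {
            ReleaseHandle();
            GC.SuppressFinalize(this);
        }

        // Finalizer: only runs if Dispose was never called, and will show
        // up in the Finalized Object Counts table when it does.
        ~NativeResourceHolder() => ReleaseHandle();

        private void ReleaseHandle()
        {
            if (_handle != IntPtr.Zero)
            {
                // release the unmanaged resource here
                _handle = IntPtr.Zero;
            }
        }
    }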
PerfView tracks detailed information about which methods were Just In Time (JIT) compiled. This data is mostly useful for optimizing startup (because that is when most methods get JIT compiled). If large numbers of methods are being compiled, it can noticeably affect startup time. This report tells you why the JIT was invoked and exactly how much time is being spent on each compilation.
The summary statistics show JIT time broken down by the three different ways the JIT can be triggered. First, foreground jitting occurs when a thread running managed code wants to invoke a particular method that has not yet been compiled, in which case the JIT is invoked synchronously to produce the code. Second, background jitting occurs when the runtime predicts a method will be invoked in the future and then pre-emptively invokes the JIT to compile it on a background thread. These compilations occur in parallel with compilations on the foreground thread and thus reduce total startup time. Third, tiered compilation is a feature that compiles code a second time to generate higher quality code for frequently used methods. Most tiered compilation JIT activity should not occur right at startup, but rather very shortly afterwards. The JIT time used for tiered compilation determines how quickly after startup the application transitions from running at a modest speed to running at an optimal steady-state speed.
If a large amount of time is spent in the sum of Foreground and Multicore JIT Background compilations, the application can start slowly. There are several techniques to improve startup performance.
As an additional option enabled with the JITInlining feature, PerfView can track all of the decisions made by the JIT about whether to inline or not at every call site. For very hot paths, the overhead of invoking methods has the potential to add measurable cost, and it's typical for developers to attempt to streamline their code as much as possible, with the intent that small methods and properties will be inlined in order to avoid these overheads. The information provided by PerfView and the JIT can be valuable in understanding when and where such attempts fail, with the JIT providing the reason it chose not to inline a particular call site, e.g. the callee had exception handling that prevented inlining, the callee was too big, the callee was explicitly annotated to prevent inlining, etc. Such information can then be used by the developer to tweak their code in pursuit of a faster outcome.
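As an illustration (a sketch, not tied to any particular application), the standard MethodImpl attributes below are the usual way developers influence the decisions that the JITInlining data reports on:

    using System.Runtime.CompilerServices;

    public static class InliningExamples
    {
        // A small method like this is normally inlined automatically.
        public static int Square(int x) => x * x;

        // Asks the JIT to inline even if its heuristics would decline
        // (it can still refuse, e.g. for methods with exception handling).
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int Cube(int x) => x * x * x;

        // Forces a real call; this call site will show up in the
        // JITInlining data as a blocked inline.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public static int Identity(int x) => x;
    }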
Background JIT compilation is a feature that was introduced in Version 4.5 of the .NET runtime. The basic idea is to take advantage of the multiple processors available on most machines to speed up startup time by doing Just in Time (JIT) compilation on a background thread. Note that the .NET runtime's preferred solution to the cost of JIT compilation is to precompile the code with NGEN. This reduces the cost of JIT compilation to 0, whereas background JIT compilation cannot do nearly as well (it tends to cut the cost in half), so using NGEN as part of application deployment should be considered first. However, if using the NGen Tool is impossible (XCOPY deployment, non-admin deployment, Silverlight, IL code generated at runtime), background JIT is the next best option.
There is a fundamental problem with trying to push JIT compilation onto background threads, namely that the set of methods you will want to JIT compile depends on program execution and is not known until just before each method is used. Thus to take advantage of multiple processors you need an 'oracle' that will tell you which methods you need to compile well before you actually need to execute them.
The solution the runtime uses is to rely on PREVIOUS runs of the same program to act as this oracle, predicting the methods that need to be compiled. For this to work, the runtime needs to store on disk information about which methods were JIT compiled on the last run. Moreover, you really don't want the COMPLETE list of methods compiled, because that list will include methods used well after startup and would cause you to JIT compile things that are not that important. Thus to make background JIT compilation work well, we need help from the application. This is exposed in two new methods of the System.Runtime.ProfileOptimization class introduced in .NET Version 4.5.
When your code encounters a 'StartProfile' operation the runtime will do the following:
Thus by placing two simple calls in your program (typically at the beginning of Main()), you can opt into background JIT.
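A minimal sketch of that opt-in (the directory and profile file name here are hypothetical):

    using System.Runtime;

    class Program
    {
        static void Main(string[] args)
        {
            // Directory where the runtime stores the profile gathered on previous runs.
            ProfileOptimization.SetProfileRoot(@"C:\MyAppProfiles");

            // Starts background JIT compilation using (and updating) this profile.
            ProfileOptimization.StartProfile("Startup.profile");

            // ... rest of application startup ...
        }
    }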
Background JIT has the following characteristics:
It is important to realize that background JIT compilation does NOT reduce JIT time. If anything it INCREASES it, because it JIT compiles methods HOPING that they will be used shortly by the application. If they are not used, then that time is 'wasted'. However, the time background JIT uses is on a parallel thread (and it is only attempted if there are two or more processors), so JIT time on the background thread is effectively 'free'. Thus the important metric is how much JIT time was REMOVED from the foreground threads. As mentioned, you typically recover about half, but the exact number is application specific and depends on how well the previous trace predicts the methods that need to be JIT compiled on this run.
If you have activated background JIT by placing the SetProfileRoot and StartProfile calls into your program you can view its effectiveness by turning on special background JIT compilation events. You do this by checking the 'Background JIT' checkbox on the advanced options of the 'Collection' dialog box. When you do this, the JITStats report is enhanced in several ways for processes that have called SetProfileRoot and StartProfile.
What can go wrong with background JIT compilation.
If your program has used SetProfileRoot and StartProfile, but the JITStats view shows little or no background JIT compilation, there are several issues that may be responsible. Fundamentally, an important design goal was to ensure that background JIT compilation does not change the behavior of the program under any circumstances. Unfortunately, this means that the algorithm tends to bail out quickly. In particular:
Tiered compilation is a feature that was introduced in .Net Core 2.1. It improves both startup performance and steady-state performance by hot-swapping between different compilations of the same method at runtime.
Different compilations of the same method are referred to as tiers: 'Tier0' (also known as Quick JIT) and 'Tier1'. Tier0 is the initial code for each method, regardless of whether it was obtained from the JIT or from a ReadyToRun image. Tier1 refers to the optimized jitted code that is compiled on a background thread.
Some method types bypass tiering and are always jitted with full optimization (if optimization is enabled):
Methods marked with the AggressiveOptimization attribute (see the sketch after this list)
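For reference, a small sketch of opting a method out of tiering with this attribute (available in .NET Core 3.0 and later):

    using System.Runtime.CompilerServices;

    public static class HotPath
    {
        // Bypasses Tier0 entirely: the method is jitted once,
        // directly at full optimization.
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static long Sum(int[] values)
        {
            long total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }
    }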
When a tiering-eligible method is called, if there is Tier1 code for the method, that version of the method executes, otherwise the Tier0 code is executed.
In .NET Core 3 and later tiered compilation is enabled by default. It can be disabled by any of these mechanisms:
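For example (a sketch of the standard runtime configuration knobs; any one of them suffices):

    In the project file:   <TieredCompilation>false</TieredCompilation>
    Environment variable:  DOTNET_TieredCompilation=0   (use the COMPlus_ prefix before .NET 6)
    runtimeconfig.json:    "System.Runtime.TieredCompilation": false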
The "CPU (with Optimization Tiers) Stacks" view in the Advanced Group will annotate each method with its tiering information. A method that is eligible for tiering can create multiple entries in this view, one for each tiering level.
On-Stack Replacement is a feature enabled by default in .Net 7. It allows most methods with loops to participate in tiered compilation.
In earlier releases of .Net, methods with loops would bypass tiered compilation by default, because a single call to one of these methods might invoke the Tier0 method and run for a long time, adversely impacting performance. OSR allows individual method executions to jump from Tier0 to Tier1 in the middle of a method by creating a specially crafted OSR version of the method.
OSR versions of methods are logically Tier1 but can have slightly different performance characteristics than the full Tier1 version. OSR methods are annotated specially in the "CPU (with Optimization Tiers) Stacks" view noted above.
Each process shows a high level summary table indicating JIT time broken down by trigger. One of these triggers is 'Tiered Compilation Background'. This category contains all the Tier1 methods. The Tier0 methods aren't identified explicitly, as they can come from several sources:
In the individual method listings, the 'Trigger' column contains the value 'TC' for each Tier1 background recompilation due to Tiered Compilation.
Dynamic PGO is a feature that was introduced in .Net 6. It further improves steady-state performance by instrumenting Tier0 versions of methods to collect profile data, which is then used to better optimize the Tier1 version.
Dynamic PGO can be enabled as follows:
For .Net 6, the performance benefit of TieredPGO can be further enhanced by disabling ReadyToRun and enabling QuickJitForLoops. This can adversely impact startup.
For .Net 7, the performance benefit of TieredPGO can be further enhanced by disabling ReadyToRun. This can adversely impact startup.
For .Net 8, no other changes are needed to get maximum performance.
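A sketch of the corresponding environment variables (the COMPlus_ prefix also works in place of DOTNET_):

    DOTNET_TieredPGO=1             # enable Dynamic PGO (on by default in .NET 8)
    DOTNET_ReadyToRun=0            # .NET 6/7 option: disable ReadyToRun so more methods gather PGO data
    DOTNET_TC_QuickJitForLoops=1   # .NET 6 option: let methods with loops participate in tiering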
Dynamic PGO introduces new "instrumented" versions of methods. Both Tier0 and Tier1 versions may be instrumented. In .NET 8 and later, instrumented versions are annotated specially in the "CPU (with Optimization Tiers) Stacks" view noted above.
PerfView tracks detailed information about which runtime loader operations were performed. This view provides a process- and thread-specific view into the detailed behavior of the CLR runtime as it executes code. Unlike the JIT view, this view shows more granular data such as ReadyToRun operations and assembly load operations. This typically makes the data harder to understand and less useful to most consumers, but it can be more useful for detailed investigations of runtime behavior. When a loader operation in this view occurs during another operation, the nesting of the operations is represented in the view, and the outer operation is broken up to show the time spent around the inner operation.

To enable data about all loader operations, set the ".NET Loader" checkbox when collecting data. Information about all loader operations is restricted to analysis of .NET Core runtimes. R2R information is only available in .NET Core 3 and above. TypeLoad information is only available in .NET 5 and above. Otherwise, the data captured will be restricted to JIT and assembly load operations. The /RuntimeLoading switch may also be used.
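For example, from the command line (a sketch; /RuntimeLoading is the switch named above, placed before the command as is usual for PerfView qualifiers):

    PerfView /RuntimeLoading collect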
EventPipe is a technology available in .NET Core 3.1 and later that allows the collection of events and CPU sampling on all platforms. The CPU sampling performed by EventPipe is only aware of managed code, which means that transitions into native code, e.g., P/Invokes, won't appear in the trace. This means that stacks containing native code will end with the last managed frame.
For example, if an application has a sequence of methods Main->A->B->C->MyNativeFunction, where MyNativeFunction P/Invokes into native code, then samples collected while that native code is on the stack will only contain Main->A->B->C->MyNativeFunction. Any native functions after the P/Invoke method won't appear in the trace.
During a typical CPU usage investigation, you may want to distinguish on-CPU from off-CPU time to determine when your code is blocked waiting versus actively doing work. Because it only knows about managed frames, CPU sampling via EventPipe can't give exact on/off-CPU information. Instead, the TraceEvent library uses a heuristic to add pseudo-frames to the trace that indicate whether there are additional native frames on the stack or not. When a nettrace file is opened in the "Thread Time" view or is exported to another format, e.g., SpeedScope, the heuristic inserts either UNMANAGED_CODE_TIME or CPU_TIME onto the stacks. UNMANAGED_CODE_TIME represents stacks where there are one or more native frames after the last managed frame; those frames may be blocked waiting or actively on the CPU. CPU_TIME represents stacks where the last managed frame is the function currently on the CPU and doing work.
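For example, a nettrace file collected with the dotnet-trace tool (a common EventPipe client; the process id below is hypothetical) will show these pseudo-frames when opened in PerfView's Thread Time view:

    dotnet-trace collect --process-id 1234 --profile cpu-sampling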