PerfView User's Guide

PerfView is a tool for quickly and easily collecting and viewing both time and memory performance data. PerfView uses the Event Tracing for Windows (ETW) feature of the operating system, which can collect machine-wide information on a variety of useful events, as described in the advanced collection section. ETW is the same powerful technology the Windows performance group uses almost exclusively to track and understand the performance of Windows, and it is the basis for their Xperf tool. PerfView can be thought of as a simplified and user-friendly version of that tool. In addition, PerfView has the ability to collect .NET GC heap information for doing memory investigations (even for very large GC heaps). PerfView's ability to decode .NET symbolic information as well as the GC heap makes PerfView ideal for managed code investigations.

Deploying and Using PerfView

PerfView was designed to be easy to deploy and use. To deploy PerfView, simply copy PerfView.exe to the computer you wish to use it on. No additional files or installation step is needed. PerfView's features are 'self-discoverable'. The initial display is a 'quick start' guide that leads you through collecting and viewing your first set of profile data. There is also a built-in tutorial. Hovering the mouse over most GUI controls will give you short explanations, and hyperlinks send you to the most appropriate part of this user's guide. Finally, PerfView is 'right click enabled', which means that when you want to manipulate data in some way, right clicking allows you to discover what PerfView can do for you.

PerfView is a V4.6.2 .NET application. Thus you need a V4.6.2 .NET Runtime installed on the machine on which you actually run PerfView. Windows 10 and Windows Server 2016 already include .NET V4.6.2. On other supported operating systems you can install .NET 4.6.2 from the standalone installer. PerfView is not supported on Win2K3 or WinXP. While PerfView itself needs a V4.6.2 runtime, it can collect data on processes that use V2.0 and V4.0 runtimes. On machines that don't have V4.6.2 or later of the .NET runtime installed, it is also possible to collect ETL data with another tool (e.g. XPERF or PerfMonitor) and then copy the data file to a machine with V4.6.2 and view it with PerfView.

What can PerfView do for you?

PerfView was designed to collect and analyze both time and memory scenarios.

  1. CPU Investigation: One of the more useful events (and one that is turned on by default) is the 'profile' sampling event. This event samples the instruction pointer of each of the machine's CPUs every millisecond. Each sample captures the complete call stack of the thread currently executing, giving very detailed and useful information about what that thread was doing at both high and low levels of abstraction. PerfView aggregates these stack traces and presents them in a stack viewer that has powerful grouping operations that make understanding this data significantly simpler than most profilers. If your application's performance problem is associated with excessive CPU usage, then PerfView will tell you that and give you the tools you need to understand exactly what portion of your application is misbehaving. See Starting a CPU Analysis for more.
  2. Managed Memory Investigations: PerfView also has the ability to take a snapshot of the .NET GC heap. Because these heaps can be very large, PerfView allows control over how large of a sample is taken, and goes to some trouble to take a representative sample if the heap is too big to capture in its entirety. It then converts the graph of objects in the heap into a tree, and displays this in the same stack viewer that was used for CPU investigations. See Investigating Memory and Starting a GC Heap Analysis for more
  3. Response Time Investigations: Collecting with the 'ThreadTime' option gathers enough information that PerfView can determine what every thread is doing (blocked or not), gather all the thread time associated with every request, and display it as a tree. This is what the 'Thread Time (with Start-Stop Activities)' view is. See Making Server Investigation Easy for more.
  4. Wall Clock / Blocked Time Investigations: If your program is too slow but it is not consuming excessive CPU, then it must be blocked waiting on something else (disk, network, ...). PerfView can instruct the OS to log events whenever threads sleep or wake up, and has a display for visualizing where your program is waiting. See Blocked / Wall Clock Time Investigation for more.
  5. Memory Investigations: You can also turn on events every time the OS heap memory allocator allocates or frees an object. Using these events you can see what call stacks are responsible for the most net unmanaged memory allocations. See Investigating Memory and Unmanaged Heap Analysis for more.
  6. Linux CPU Investigations: PerfView has the ability to read the output of the Linux 'Perf Events' collector that is built into the Linux kernel. See Viewing Linux Data for more.
  7. Viewing your own hierarchical data in PerfView's stack viewer: PerfView's stack viewer is powerful, but it is also very flexible. PerfView defines a very simple XML or JSON format that it can read into this viewer. This allows you to easily generate data that you can then view in PerfView's powerful stack viewer. See Viewing External Data for more.

See also PerfView Reference Guide.


Sending feedback / Asking Questions about PerfView

Hopefully the documentation does a reasonably good job of answering your most common questions about PerfView and performance investigation in general. If you have a question, you should certainly start by searching the user's guide for information.

Inevitably, however, there will be questions that the docs don't answer, features you would like to have that don't yet exist, or bugs you want to report. PerfView is an open source project on GitHub, and you should log questions, bugs or other feedback at

PerfView Issues

If you are just asking a question, there is a label called 'Question' that you can use to indicate that. If it is a bug, it REALLY helps if you supply enough information to reproduce the bug. Typically this includes the data file you are operating on. You can drag small files into the issue itself; more likely, however, you will need to put the data file in the cloud somewhere and refer to it in the issue. Finally, if you are making a suggestion, the more specific you can be the better. Large features are much less likely to ever be implemented unless you yourself help with the implementation. Please keep that in mind.


Getting the latest version of PerfView

You can get the latest version of PerfView by going to the PerfView GitHub Download Page.



Tutorial of a Time-Based Investigation

See Also Tutorial of a GC Heap Memory Investigation

Perhaps the best way to get started is to simply try out the tutorial example. On Windows 7 it is recommended that you dock your help window as described in help tips. PerfView comes with two tutorial examples 'built in'. We also strongly suggest that any application you write have a performance plan as described in part 1 and part 2 of Measure Early and Often for Performance.

  1. Tutorial.exe - A simple program that calls 'DateTime.Now' repeatedly until it detects that at least 5 seconds have gone by. To make this example more interesting, it does this using two mutually recursive methods (RecSpin and RecSpinHelper). Each of these helpers spins for a second and then calls the other helper to spin for the rest of the time. See Tutorial.cs for the complete source; a rough sketch is shown below.
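The following is a minimal sketch of the shape of the tutorial program, paraphrased from the description above (Tutorial.cs is the authoritative source; the exact method bodies differ):

    using System;

    class Tutorial
    {
        static void Main()
        {
            RecSpin(5);                           // burn CPU for ~5 seconds total
        }

        static void RecSpin(int secs)             // mutually recursive with RecSpinHelper
        {
            if (secs <= 0) return;
            SpinForASecond();                     // spin for one second...
            RecSpinHelper(secs - 1);              // ...then let the helper spin the rest
        }

        static void RecSpinHelper(int secs)
        {
            if (secs <= 0) return;
            SpinForASecond();
            RecSpin(secs - 1);
        }

        static void SpinForASecond()
        {
            DateTime start = DateTime.Now;        // the 'hot' call seen in the profile
            while ((DateTime.Now - start).TotalSeconds < 1)
                ;                                 // spin, calling DateTime.Now repeatedly
        }
    }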

To run the 'Tutorial' example:

  1. Click on the 'Run a command' hyperlink on the main page. This will bring up a dialog indicating the command to run and the name of the data file to create.
  2. Enter 'Tutorial.exe' in the 'command' text dialog and hit <enter>. 
  3. Unless you started PerfView from an elevated environment, the operating system will bring up a User Account Control prompt to run PerfView as administrator (collecting profile data is a privileged activity). Click OK to accept.
  4. At this point it will begin running the command. The status bar will blink to indicate that it is working on your command. You can monitor its progress by hitting the 'Log' button in the lower right corner. After it has completed, it brings up a process selection dialog box. PerfView is asking which process you are focused on. In this case we are interested in the 'Tutorial' process, so we should select that. If you are interested in all processes, there is a button for that too.

You can also run the tutorial example by typing 'PerfView run tutorial' at the command line.    See collecting data from the command line for more.

After selecting 'Tutorial.exe' as the process of interest, PerfView brings up the stack viewer looking something like this:

[Image: StackView]

This view shows you where CPU time was spent.   PerfView took a sample of where each processor is (including the full stack), every millisecond (see understanding perf data) and the stack viewer shows these samples.   Because we told PerfView we were only interested in the Tutorial.exe process this view has been restricted (by 'IncPats') to only show you samples that were spent in that process.  

It is always best to begin your investigation by looking at the summary information at the top of the view. This allows you to confirm that the bulk of your performance problem is indeed related to CPU usage before you go chasing down exactly where the CPU is spent. This is what the summary statistics are for. We see that the process spent 84% of its wall clock time consuming CPU, which merits further investigation. Next we simply look at the 'When' column for the 'Main' method in the program. This column shows how CPU was used by that method (or any method it calls) over the collection time interval. Time is broken into 32 'TimeBuckets' (in this case we see from the summary statistics that each bucket was 197 msec long), and a number or letter represents what % of 1 CPU is used. 9s and As mean you are close to 100%, and we can see that over the lifetime of the Main method we are close to 100% utilization of 1 CPU most of the time. Areas outside the Main method are probably not interesting to us (they deal with runtime startup and the times before and after process launch), so we probably want to 'zoom in' to that area.
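As a sanity check on these numbers: with 32 buckets of 197 msec each, the view covers

    32 buckets x 197 msec/bucket ≈ 6.3 seconds

of trace time, which is consistent with a program that spins for about 5 seconds plus startup and shutdown overhead.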

Zooming in to a time range of interest

It is pretty common that you are only interested in part of the trace. For example, you may only care about startup time, or the time between when a mouse was clicked and when the menu was displayed. Thus zooming in is typically one of the first operations you will want to do. Zooming in is really just selecting a region of time for investigation. The region of time is displayed in the 'start' and 'end' textboxes. These can be set in three ways:

  1. Manually entering values into the text boxes.
  2. Selecting two cells (typically the 'First' and 'Last' cells) of a particular method of interest, right clicking, and selecting 'SetTimeRange'.
  3. Selecting a 'When' cell. If you click the cell again, the cell will become editable, at which point you can select a region of text, right click, and select 'SetTimeRange' (or hit Alt-R) to select the time range associated with your selected characters.

Try out each of these techniques. For example, to 'zoom into' just the Main method, simply drag the mouse over the 'First' and 'Last' times to select both, right click, and select 'SetTimeRange'. You can hit the 'Back' button to undo any changes you made so you can re-select. Also notice that each text box remembers the last several values it held, so you can 'go back' to particular past values by selecting the drop down (the small down arrow to the right of the box) and selecting the desired value.

For GUI applications, it is not uncommon to take a trace of the whole run but then 'zoom into' points where the user triggered activity. You can do this by switching to the 'CallTree' tab. This will show you CPU starting from the process itself. The first line of the view is 'Process32 tutorial.exe' and is a summary of the CPU time for the entire process. The 'When' column shows you CPU for the process over time (32 time buckets). In a GUI application there will be lulls where no CPU was used, followed by bursts of higher CPU use corresponding to user actions. These show up in the numbers in the 'When' column. By clicking on a cell in the 'When' column, selecting a range, right clicking and selecting SetTimeRange (or Alt-R), you can zoom into one of these 'hot spots' (you may have to zoom in more than once). Now you have focused in on what you are interested in (you can confirm this by looking at the methods that are called during that time). This is a very useful technique.

For managed applications, you will always want to zoom into the main method before starting your investigation.  The reason is that when profile data is collected, after Main has exited, the runtime spends some time dumping symbolic information to the ETW log.   This is almost never interesting, and you want to ignore it in your investigation.  Zooming into the Main method will do this. 

Resolving unmanaged symbols

After zooming into the region of interest, if you are doing an unmanaged investigation, you may need to resolve symbols. Unlike managed code, unmanaged code stores its symbolic information in external PDB files, which need to be downloaded and matched up. Because this can take a while, it is not done by default. Instead you see question marks in the trace (like ntdll!?), indicating that PerfView knows the sample came from ntdll, but it can't resolve the name further. For many DLLs you will never need to resolve these symbols because you simply don't care (you don't own or call that code). However, if you do care, you can quickly get the symbols. Simply select a cell with a DLL!? in it, right click, and select 'Lookup Symbols'. PerfView will then look up the symbols for that DLL and redraw the screen. Try looking up the symbols for ntdll by selecting a cell containing ntdll!?, right clicking, and selecting 'Lookup Symbols'. After the lookup, those cells show full method names instead of ntdll!?.

If you are doing an unmanaged investigation, there are probably a handful of DLLs you will need symbols for. A common workflow is to look at the ByName view and, while holding down the CTRL key, select all the cells that contain DLLs with large CPU time but unresolved symbols. Then right click -> 'Lookup Symbols', and PerfView will look them all up in bulk. See symbol resolution for more details or if symbol lookup fails.

A Bottom Up Investigation

PerfView starts you in the 'ByName' view for doing a bottom-up analysis (see also starting an analysis). In this view you see every method that was involved in a sample (either a sample occurred in the method, or the method called a routine that had a sample). Samples can either be exclusive (occurred within that method) or inclusive (occurred in that method or any method that method called). By default the ByName view sorts methods based on their exclusive time (see also Column Sorting). This shows you the 'hottest' methods in your program.

Typically the problem with a 'bottom-up' approach is that the 'hot' methods in your program are

  1. Not very hot (they use < 5% of the CPU), or
  2. 'Helper' routines (either in your program, in libraries, or in the runtime) that are used 'everywhere' and are already well tuned.

In both cases, you don't want to see these helper routines, but rather the lowest 'semantically interesting' routine. This is where PerfView's powerful grouping features come into play. By default PerfView groups samples by:

  1. Using the GroupPats 'Just my code' pattern to form two groups. The first group is any method in any module that is in the same directory (recursively) as the 'exe' itself. This is the 'my code' group, and these samples are left alone. Any sample that is NOT in that first group is in the 'OTHER' group. These samples are grouped according to the method that was called to enter the group.
  2. Using the Fold % feature. This is set to 1, which means that any method that has fewer than 1% of the samples (inclusively, measured over all the samples indicated in the summary at the top of the view) is not 'interesting' and should not be shown. Instead its samples are folded (inlined) into its caller.

For example, the top line in the ByName view is the node

OTHER<<mscorlib!System.DateTime.get_Now()>>

This is an example of an 'entry group'. 'OTHER' is the group's name, and mscorlib!System.DateTime.get_Now() is the method that was called to enter the group. From that point on, any methods that get_Now() calls that are within that group are not shown; rather their time is simply accumulated into this node. Effectively this grouping says 'I don't want to see the internal workings of functions that are not my code, but I do want to see the public methods I used to call that code.' To give you an idea of how useful this feature is, simply turn it off (by clearing the value in the 'GroupPats' box) and view the data. You will see many more methods with names of internal functions used by 'get_Now', which just make your analysis more difficult. (You can use the 'Back' button to quickly restore the previous group pattern.)

The other feature that helps 'clean up' the bottom-up view is the Fold % feature. This feature causes all 'small' call tree nodes (less than the given %) to be automatically folded into their parent. Again you can see how much this feature helps by clearing the textbox (which means no folding). With that feature off, you will see many more entries that have 'small' amounts of time. These small entries again tend to just add 'clutter' and make investigation harder.

More Folding

Because of the grouping and folding that PerfView did for you, you can quickly see that 'DateTime.get_Now()' is the 'hot' method (74.6% of all samples). However, also note that PerfView did not do a 'perfect' job. We notice that the view has the groups <ntdll!?> and <ntoskrnl!?>, two important operating system DLLs that take up 9.5% and 2% of the CPU, and knowing only that some function in those DLLs was called is not terribly useful. We have two choices:
  1. Resolve the symbols for these DLLs so that we have meaningful names.   See symbol resolution for more.
  2. Fold these entries away. 

A quick way of accomplishing (2) is to add the pattern '!?' to the FoldPats textbox. This pattern says to fold away any nodes that don't have a method name. See the FoldPats textbox for more. This leaves us with a very 'clean' function view that has only semantically relevant nodes in it.

Review: what is all this time selection, grouping and folding for?

The first phase of a perf investigation is forming a 'perf model'. The goal is to assign times to SEMANTICALLY RELEVANT nodes (things the programmer understands and can do something about). We do that by either forming a semantically interesting group and assigning nodes to it, by folding a node into an existing semantically relevant group, or (most commonly) by leveraging entry points into large groups (modules and classes) as handy 'pre-made' semantically relevant nodes. The goal is to group costs into a relatively small number (< 10) of SEMANTICALLY RELEVANT entries. This allows you to reason about whether each cost is appropriate or not (which is the second phase of the investigation).

Broken Stacks

One of the nodes that is left is a node called 'BROKEN'.  This is a special node that represents samples whose stack traces were determined to be incomplete and therefore cannot be attributed properly.   As long as this number is small (< a few %) then it can simply be ignored.  See broken stacks for more.

Time and Percentage

PerfView displays both the inclusive and exclusive time both as a metric (msec) and as a %, because both are useful. The percentage gives you a good idea of the relative cost of the node; however, the absolute value is useful because it very clearly represents 'clock time' (e.g. 300 samples represent 300 msec of CPU time). The absolute value is also useful because when it gets significantly less than 10 it becomes unreliable (when you have only a handful of samples, they might have happened 'by pure chance' and thus should not be relied upon).

CallTree View (top-down investigations)

The bottom-up view did an excellent job of determining that the get_Now() method as well as 'SpinForASecond' consume the largest amount of time and thus are worth looking at closely. This corresponds beautifully to our expectations given the source code in Tutorial.cs. However, it can also be useful to understand where CPU time was consumed from the top down. This is what the CallTree view is for. Simply clicking the 'CallTree' tab of the stack viewer will bring you to that view. Initially the display only shows the root node, but you can open the node by clicking on the check box (or hitting the space bar). This will expand the node. As long as a node only has one child, the child node is also auto-expanded, to save some clicking. You can also right click and select 'expand-all' to expand all nodes under the selected node. Doing this on the root node yields the following display:

[Image: CallTreeView]

Notice how clean the CallTree view is, without a lot of 'noise' entries. In fact this view does a really good job of describing what is going on. It clearly shows that Main calls 'RecSpin', which runs for 5 seconds (from 894 msec to 5899 msec), consuming 4698 msec of CPU while doing so. (The CPU is not 5000 msec because of the overhead of actually collecting the profile, other OS overhead not attributed to this process, and broken stacks; these typically run in the 5-10% range. In this case it seems to be about 6%.) The 'When' column also clearly shows how one instance of RecSpin runs SpinForASecond (for exactly a second) and then calls RecSpinHelper, which consumes close to 100% of the CPU for the rest of the time. The CallTree view is a wonderful top-down synopsis.
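The 6% figure comes straight from the numbers in the view:

    (5000 msec wall clock - 4698 msec CPU) / 5000 msec ≈ 6%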

Getting a 'coarser' view

All of the filtering and grouping parameters at the top of the view affect all of the views (ByName, Caller-Callee, and CallTree) equally. We can use this fact and the 'Fold %' functionality to get an even coarser view of the 'top' of the call tree. With all nodes expanded, simply right click on the window and select 'Increase Fold %' (or, easier, hit the F7 key). This increases the number in the Fold % textbox by 1.6X. By hitting the F7 key repeatedly, you keep trimming down the 'bottoms' of the stacks until you see only the methods that use a large amount of CPU time. The following image shows the CallTree view after hitting F7 seven times.

[Image: CallTreeView]

You can restore the previous view by either using the 'Back' button, the Shift-F7 key (which decreases the Fold%) or by simply selecting 1 in the Fold% box (e.g. from the drop down menu). 

The Caller-Callee view

Getting a coarse view of the tree is useful, but sometimes you just want to restrict your attention to what is happening at a single node. For example, if the inclusive time for BROKEN stacks is large, you might want to view the nodes under BROKEN stacks to get an idea of what samples are 'missing' from their proper position in the call tree. You can do this easily by viewing the BROKEN node in the Caller-Callee view. To do this, right click on the BROKEN node and select Goto -> Caller-callee (or type Alt-C). Because so few samples in our trace are BROKEN, this node is not very interesting. By setting Fold % to 0 (blank) you get the following view:

[Image: CallerCalleeView]

The view is broken into three grids. The middle piece shows the 'current node', in this case 'BROKEN'. The top grid shows all nodes that call into this focus node; in the case of BROKEN, the nodes are on only one thread. The bottom grid shows all nodes that are called by 'BROKEN', sorted by inclusive time. We can see that most of the broken nodes came from stacks that originated in the 'ntoskrnl' DLL (this is the Windows OS kernel). To dig in more, we would first need to resolve symbols for this DLL. See symbol resolution for more.

Drilling into Groups (Ungrouping)

While groups are a very powerful feature for understanding the performance of your program at a 'coarse' level, inevitably you will wish to 'drill into' those groups and understand PARTICULAR nodes in detail. For example, if we were the developer responsible for DateTime.get_Now(), we would not be interested in the fact that it was called from the 'SpinForASecond' routine, but rather in what was going on inside it. Moreover, we DON'T want to see samples from other parts of the program 'cluttering' the analysis of get_Now(). This is what the 'Drill Into' command is for. If we go back to the 'ByName' view, select the 3792 samples in the 'Inc' column of 'get_Now', right click, and select 'Drill Into', it brings up a new window where ONLY THOSE 3792 samples have been extracted.

Initially drilling in does not change any filter/grouping parameters. However, now that we have isolated the samples of interest, we are free to change the grouping and folding to understand the data at a new level of abstraction. Typically this means ungrouping something. In this case we would like to see the detail of how mscorlib!get_Now() works, so we want to see details inside mscorlib. To do this we select the 'mscorlib!DateTime.get_Now()' node, right click, and select 'Ungroup Module'. This indicates that we wish to ungroup any methods that were in the 'mscorlib' module. This allows you to see the 'inner structure' of that routine (without ungrouping completely). The result is the following display:

[Image: Ungrouped]

At this point we can see that most of the 'get_Now' time is spent in functions called 'GetUtcOffsetFromUniversalTime' and 'GetDatePart'. We have the full power of the stack viewer at our disposal: folding, grouping, and using the CallTree or Caller-Callee views to further refine our analysis. Because the 'Drill Into' window is separate from its parent, you can treat it as 'disposable' and simply discard it when you are finished looking at this aspect of your program's performance.

In the example above we drilled into the inclusive samples of a method. However, you can also drill into exclusive samples. This is useful when user callbacks or virtual functions are involved. Take for example a 'sort' routine that has internal helper functions. In that case it can be useful to segregate the samples that were part of the node's 'internal helpers' (which would be folded up as exclusive samples of 'sort') from those that were caused by the user 'compare' function (which would typically not be grouped as exclusive samples because it crossed a module boundary). By drilling into the exclusive samples of 'sort' and then ungrouping, you get to see just those samples in 'sort' that were NOT part of the user callback. Typically this is EXACTLY what the programmer responsible for the 'sort' routine would want to see.

Viewing Source (Line level analysis)

Once the analysis has determined which methods are potentially inefficient, the next step is to understand the code enough to make an improvement. PerfView helps with this by implementing the 'Goto Source' functionality. Simply select a cell with a method name in it, right click, and choose 'Goto Source' (or use Alt-D; D for definition). PerfView will then attempt to look up the source code and, if successful, will launch a text editor window. For example, if you select the 'SpinForASecond' cell in the ByName view and select 'Goto Source', the following window is displayed:

[Image: Ungrouped]

As you can see, the particular method is displayed, and each line has been prefixed with the cost (in this case CPU msec) spent on that line. In this view, 4.9 seconds of CPU time are attributed to the first line of the method.

Caveats with Source code

Unfortunately, prior to V4.5 of the .NET Runtime, the runtime did not emit enough information into the ETL file to resolve a sample down to a line number (only to a method). As a result, while PerfView can bring up the source code, it can't accurately place samples on particular lines unless the code was running on V4.5 or later. When PerfView does not have the information it needs, it simply attributes all the cost to the first line of the method. This is in fact what you see in the example above. If you ran this example on a V4.5 runtime, you would get a more interesting distribution of cost. This problem does not exist for native code (you will get line-level resolution). Even on old runtime versions, however, you at least have an easy way to navigate to the relevant source.

PerfView finds the source code by looking up information in the PDB file associated with the code. Thus the first step is that PerfView must be able to find the PDB file. By default most tools will place the complete path of the PDB file inside the EXE or DLL they build, which means that if you have not moved the PDB file (and are on the machine you built on), then PerfView will find the PDB. It then looks in the PDB file, which contains the full path name of each of the source files, and again, if you are on the machine that built the binary, PerfView will find the source. So if you run on the same machine you build on, it 'just works'.

However it is common to not run on the machine you built on, in which case PerfView needs help. PerfView follows the standard conventions for other tools for locating source code. In particular if the _NT_SYMBOL_PATH variable is set to a semicolon separated list of paths, it will look in those places for the PDB file. In addition if _NT_SOURCE_PATH is set to a semicolon separated list of paths, it will search for the source file in subdirectories of each of the paths. Thus setting these environment variables will allow PerfView's source code feature to work on 'foreign' machines. You can also set the _NT_SYMBOL_PATH and _NT_SOURCE_PATH inside the GUI by using the menu items on the File menu on the stack viewer menu bar.
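For example, before launching PerfView you might set (illustrative paths; substitute your own symbol cache and source locations):

    set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
    set _NT_SOURCE_PATH=C:\dev\MyApp;\\buildserver\sources

The first line uses the standard symbol-server syntax (a local cache directory followed by a server URL); the second is simply a semicolon-separated list of directories whose subdirectories are searched for source files.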


Tutorial for GC Heap Memory Analysis

See Also Tutorial of a Time-Based Investigation. While there currently is no tutorial on doing a GC heap analysis, if you have not walked the time based investigation tutorial you should do so. Many of the same concepts are used in a memory investigation. You should also take a look at

TUTORIAL NOT COMPLETE



Performance Investigation Best Practices

Investigating Time

Collecting Event (Time Based) Profile Data

As mentioned in the introduction, ETW is a lightweight logging mechanism built into the Windows operating system that can collect a broad variety of information about what is going on in the machine. PerfView supports two ways of collecting ETW profile data:

  1. The Collect->Run (Alt-R) menu item, which prompts for a data file name to create and a command to run. PerfView turns on profiling, runs the command, and then turns profiling off. The resulting file is then displayed in the stack viewer. This is the preferred mechanism when it is easy to launch the application of interest. If the command produces output, it will be captured in the log (click the 'Log' button in the lower right corner of the main view).
  2. The Collect->Collect (Alt-C) menu item, which only prompts for a data file name to create. After clicking the 'Start Collection' button you are then free to interact with the machine in any way necessary to capture the activity of interest. Since profiling is machine wide, you are guaranteed to capture it. Once you have reproduced the problem, you can dismiss the dialog box to stop profiling and proceed to analyze the data.

You can also automate the collection of profile data by using command line options. See collecting data from the command line for more.
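For example, from an elevated command prompt (a sketch; see collecting data from the command line for the authoritative list of qualifiers):

    PerfView run tutorial.exe        (profile while a command runs, like Collect->Run)
    PerfView collect                 (machine-wide collection stopped interactively, like Collect->Collect)
    PerfView /threadTime collect     (also collect the events needed for wall clock investigations)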

If you intend to do a wall clock time investigation

By default PerfView chooses a set of events that does not generate too much data but is useful for a variety of investigations. However, wall clock investigations require events that are too voluminous to collect by default. Thus if you wish to do a wall clock investigation, you need to check the 'Thread Time' checkbox in the collection dialog.

If you intend to copy the ETL file to another machine for analysis

By default, to save time, PerfView does NOT prepare the ETL file so that it can be analyzed on a different machine (see merging). Moreover, there is symbolic information (PDBs for NGEN images) that also needs to be included if the data is to work well on any machine. If you intend to do this, you need to merge and include the NGEN PDBs by using the 'ZIP' option. You can do this either by checking the 'Zip' checkbox in the collection dialog, or by passing the /zip qualifier on the command line.

Once the data has been zipped not only does the file contain all the information needed to resolve symbolic information, but it also has been compressed for faster file copies. If you intend to use the data on another machine, please specify the ZIP option.


Viewing Stack Data

Selecting a Process of Interest

The result of collecting data is an ETL file (and possibly a .kernel.ETL file, as discussed in merging). When you double click on the file in the main viewer, it opens up 'children views' of the data that was collected. One of these items will be the 'CPU Stacks' view. Double clicking on that will bring up a stack viewer to view the samples collected. The data in the ETL file contains CPU information for ALL processes in the system, however most analyses concentrate on a single process. Because of this, a dialog box to select the process of interest is displayed before the stack viewer.

By default, this dialog box contains a list of all processes that were active at the time the trace was collected sorted by the amount of CPU time each process consumed.     If you are doing a CPU investigation, there is a good chance the process of interest is near the top of this list.  Simply double clicking on the desired process will bring up the stack viewer filtered to the process you chose.

The process view can be sorted by any of the columns by clicking on the column header. Thus if you wish to find the process that was started most recently, you can sort by start time to find it quickly. If the view is sorted by name, typing the first character of the process name will navigate to the first process with that name.

Process Filter Textbox: the box just above the list of processes. If you type text in this box, then only processes that match this string (PID, process name, or command line, case insensitive) will be displayed. The * character is a wild card. This is a quick way of finding a particular process.

If you wish to see samples for more than one process in your analysis, click the 'All Procs' button.

Note that the ONLY effect of the process selection dialog box is to add an 'IncPats' filter that matches the process you chose. Thus the dialog box is really just a 'friendly interface' to the more powerful filtering options of the stack viewer. In particular, the stack viewer still has access to all the samples (even those outside the process you selected); it simply filters them out because of the include pattern that was set by the dialog box. This means that you can remove or modify this filter at a later point in the analysis.


Understanding Perf Data

The data shown by default in the PerfView stack viewer are stack traces taken every millisecond on each processor on the system. Every millisecond, whatever process is running is stopped and the operating system 'walks the stack' associated with the running code. What is preserved when taking a stack trace is the return address of every method on the stack. Stack walking may not be perfect. It is possible that the OS can't find the next frame (leading to broken stacks), or that an optimizing compiler has removed a method call (see missing frames), which can make analysis more difficult. However, for the most part the scheme works well and has low overhead (typically a 10% slowdown), so monitoring can be done on 'production' systems.

On a lightly loaded system, many CPUs are typically in the 'Idle' process that the OS runs when there is nothing else to do. These samples are discarded by PerfView because they are almost never interesting. All other samples are kept, however, regardless of what process they were taken from. Most analyses focus on a single process and further filter out all samples that did not occur in the process of interest, however PerfView also allows you to look at samples from all processes as one large tree. This is useful in scenarios where more than one process is involved end-to-end, or when you need to run an application several times to collect enough samples.

How many samples do you need?

Because the samples are taken every millisecond per processor, each sample represents 1 millisecond of CPU time. However, exactly where the sample is taken is effectively 'random', and so it is really 'unfair' to 'charge' the full millisecond to the routine that happened to be running at the time the sample was taken. While this is true, it is also true that as more samples are taken, this 'unfairness' decreases as the square root of the number of samples. If a method has just 1 or 2 samples, it could be just random chance that they happened in that particular method, but a method with 10 samples is likely to have truly used between 7 and 13 samples' worth of time (30% error). Routines with 100 samples are likely to be within 90 and 110 (10% error). For 'typical' analysis this means you want at least 1000, and preferably more like 5000, samples (there are diminishing returns after 10K). By collecting a few thousand samples you ensure that even moderately 'warm' methods will have at least 10 samples, and 'hot' methods will have hundreds, which keeps the error acceptably small. Because PerfView does not allow you to vary the sampling frequency, this means that you need to run the scenario for at least several seconds (for CPU bound tasks), and 10-20 seconds for less CPU bound activities.

If the program you wish to measure cannot easily be changed to loop for the required amount of time, you can create a batch file that repeatedly launches the program and use that to collect data. In this case you will want to view the CPU samples for all processes, and then use a GroupPat that erases the process ID (e.g. process {%}=>$1) and thus groups all processes of the same name together.
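For example, a driver batch file might look like this ('MyApp.exe' is a placeholder for your program):

    @echo off
    rem Launch the scenario 10 times so enough samples accumulate
    for /L %%i in (1,1,10) do MyApp.exe

Collect with PerfView while this runs, then select all processes in the viewer and apply the GroupPat above so the runs aggregate into a single node per method.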

Even with 1000s of samples, there is still 'noise' that is at least in the 3% range (sqrt(1000) ≈ 30, which is 3% of 1000). This error gets larger as the methods / groups being investigated have fewer samples. This makes it problematic to use sample based profiling to compare two traces to track down small regressions (say 3%): the noise is likely to be at least as large as the 'signal' (diff) you are trying to track down. Increasing the number of samples will help, however you should always keep in mind the sampling error when comparing small differences between two traces.
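The rule of thumb behind these numbers is that the statistical error on N samples is roughly sqrt(N), so the relative error falls as 1/sqrt(N):

    N = 10       sqrt(10)    ≈ 3       ~30% error
    N = 100      sqrt(100)   = 10      ~10% error
    N = 1000     sqrt(1000)  ≈ 32      ~3% error
    N = 10000    sqrt(10000) = 100     ~1% error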

Exclusive and Inclusive Metrics

Because a stack trace is collected for each sample, every node has both an exclusive metric (the number of samples that were collected in that particular method) and an inclusive metric (the number of samples collected in that method or any method that method called). Typically you are interested in inclusive time, however it is important to realize that folding (see FoldPats and Fold %) and grouping artificially increase exclusive time (it is the time in that method (group) plus anything folded into that group). When you wish to see the internals of what was folded into a node, you Drill Into the group to open a view where the grouping or folding can be undone.
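As a simple example, suppose a single sample is taken while method A is executing, and A was called from Main:

    ROOT -> Main -> A      (the sampled stack)
    A:    exclusive 1, inclusive 1
    Main: exclusive 0, inclusive 1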


Starting a CPU Analysis

If you have not done so, consider walking through the tutorial and best practices from Measure Early and Often for Performance.

The default stack viewer in PerfView analyzes CPU usage of your process.   There are three things that you should always do immediately when starting a CPU analysis of a particular process.

  1. Determine that you have at least a few thousand samples (preferably over 5000). See how many samples do I need for more.
  2. Determine that the process is actually CPU bound over the time of interest.
  3. Ensure that you have the symbolic information you need. See symbol resolution for more.

If any of the above conditions fail, the rest of your analysis will very likely be inaccurate. If you don't have enough samples, you need to recollect so that you get more, either by modifying the program to run longer or by running the program many times to accumulate more samples. If your program runs long enough (typically 5-20 seconds) and you still don't have at least 1000 samples, it is likely that CPU is NOT the bottleneck. It is very common in STARTUP scenarios that CPU is NOT the problem, but rather that the time is being spent fetching data from the disk. It is also possible that the program is waiting on network I/O (server responses) or responses from other processes on the local system. In all of these cases the time being wasted is NOT governed by how much CPU time is used, and thus a CPU analysis is inappropriate.

You can quickly determine if your process is CPU bound by looking at the 'When' column for your 'top most' method. If the When column has lots of 9s or As in it over the time it is active, then it is likely the process was CPU bound during that time. This is the time you can hope to optimize, and if it is not a large fraction of the total time of your app, then optimizing it will have little overall effect (see Amdahl's Law). Switching to the CallTree view and looking at the 'When' column of some of the top-most methods in the program is a good way of confirming that your application is actually CPU bound.

Finally, you may have enough samples, but you lack the symbolic information to make sense of them. This manifests as names with ? in them. By default, .NET code should 'just work'. For unmanaged code you need to tell PerfView which DLLs you are interested in getting symbols for. See symbol resolution for more. You should also quickly check that you don't have many broken stacks, as these too will interfere with analysis.

Top-down and Bottom-up Analysis

Once you have determined that CPU is actually important to optimize, you have a choice of how to do your analysis. Performance investigations can either be 'top-down' (starting with the Main program and seeing how the time spent there is divided among the methods it calls) or 'bottom-up' (starting at 'leaf' methods, where samples were actually taken, and looking for methods that used a lot of time). Both techniques are useful, however 'bottom-up' is usually a better way to start because methods at the bottom tend to be simpler, and thus easier to understand and have intuition about how much CPU they should be using.

Phase 1: Choosing How to Group Methods

PerfView starts you out in the 'ByName' view, which is the appropriate starting point for a bottom-up analysis. It is particularly important in a bottom-up analysis to group methods into semantically relevant groupings. By default PerfView picks a good starting grouping (called 'just my code'). In this grouping, any method in any module that lives in a directory OTHER than the directory where the EXE lives is considered 'OTHER', and the entry group feature is used to group them by the method used to call out to this external code. See the tutorial for more on the meaning of 'Just My Code' grouping, and the GroupPats reference for more on grouping.

For simple applications the default grouping works well. There are other predefined groupings in the dropdown of the GroupPats box, and you are free to create or extend these as you need. You know that you have a 'good' set of groupings when what you see in the 'ByName' view are method names that are semantically relevant (you recognize the names and know what their semantic purpose is), there are not too many of them (fewer than 20 or so with an interesting amount of exclusive time), but enough to break the program into 'interesting' pieces that you can focus on in turn (by Drilling Into).

One very simple way of doing this is to increase the Fold %, which folds away small nodes. There are shortcuts that increase (F7 key) or decrease (Shift-F7) this by 1.6X. Thus by repeatedly hitting F7, you can 'clump' small nodes into larger nodes until only a few survive and are displayed. While this is fast and easy, it does not pay attention to how semantically relevant the resulting groups are. As a result it may group things in poor ways (folding away small nodes that were semantically relevant, and grouping them into 'helper routines' that you don't much want to see). Nevertheless, it is so fast and easy that it is always worth at least trying to see what happens. Moreover, it is almost always valuable to fold away truly small nodes. Even if a node is semantically relevant, if it uses < 1% of the total CPU time, you probably don't care about it.

Typically the best results occur when you use Fold % in the 1-10% range (to get rid of the smallest nodes) and then selectively fold away any semantically uninteresting nodes that are left. This can be done easily by looking at the 'ByName' view, holding the 'Shift' key down, and selecting every node on the graph that has some exclusive time (they will be toward the top) and that you DON'T recognize. After you have completed your scan, simply right click and select 'Fold Item', and these nodes will be folded into their callers, disappearing from the view. Repeat this until there are no semantically irrelevant nodes in the display that use exclusive time. What you have left is what you are looking for.

Phase 2: Drilling Into Groups

During the first phase of an investigation you spend your time forming semantically relevant groups so you can understand the 'bigger picture' of how the time spent in hundreds of individual methods can be assigned a 'meaning'.    Typically the next phase is to 'Drill into' one of these groups that seems to be using too much time.  In this phase you are selectively ungrouping a semantic group to understand what is happening at the next 'lower level' of abstraction. 

You accomplish this with two commands

  1. Drill Into - By selecting a cell that represents samples (an inclusive or exclusive column), right clicking, and selecting 'Drill Into', a new stack viewer is brought up that has been loaded with JUST THOSE SAMPLES. This allows you to change the filtering and grouping in that view WITHOUT having the samples from the rest of the run interfere with the analysis.
  2. Ungroup - Once you have a new window in which you can change the grouping / folding, you typically want to ungroup one of the selected nodes so you can 'see inside'. The way you ungroup depends on the way the group was formed. Possibilities include:
       - If the node was an entry point group (e.g., OTHER<<mscorlib!get_Now()>>), you can indicate that you want just that entry point to be ungrouped. This is what right clicking and selecting 'Ungroup' does. Note that any methods that the original entry point calls now become entry points to the group, so this only ungroups 'one level'.
       - If the node was an entry point group (e.g., OTHER<<mscorlib!get_Now()>>), you can instead indicate that you want ALL methods in that MODULE to be ungrouped by selecting the node and using the 'Ungroup Module' command. This tends to show most of the interesting internal structure of that group in one shot.
       - If the node is a normal group (e.g., module mscorlib), you can indicate that you want just that group ungrouped. The 'Ungroup' command does this.
       - If the node has many other nodes folded into it (either because of FoldPats or Fold %), then simply removing these will 'explode' the group. There is a right click shortcut 'Clear all Folding' which does this.

Typically, if the 'Ungroup' or 'Ungroup Module' commands do not work well, use 'Clear all Folding'. If that does not work well, clear the 'GroupPats' textbox, which will show you the most 'ungrouped' view. If this view is too complex, you can then use explicit folding (or making ad-hoc groups) to build up a new semantic grouping (just like in the first phase of analysis).

Summary

In summary, a CPU performance analysis typically consists of three phases:

  1. Confirming that CPU is indeed the bottleneck and that you have enough samples to do an accurate analysis.
  2. Using grouping and folding so that methods are clustered into semantically relevant groups
  3. Drilling into the groups of most interest by selectively ungrouping to understand finer detail. 

Investigating Memory

When to care about Memory

The benefit of optimizing for time is pretty clear: your program goes faster, which means your users are not waiting as long. For memory it is not as clear. If your program uses 10% more memory than it could, who cares? There is a useful MSDN article called Memory Usage Auditing for .NET Applications, which is summarized here. Fundamentally, you really only care about memory when it affects speed, and this happens when your app gets big (memory used as indicated by Task Manager > 50 Meg). Even if your application is small, however, it is so easy to do a '10 minute memory audit' of your application's total memory usage and its .NET GC heap that you really should do so for any application where performance matters at all. Literally in seconds you can get a dump of the GC heap and see whether the memory usage 'is reasonable'. If your app does use 50 Meg or 100 Meg of memory, then it probably is having an important performance impact and you need to take more time to optimize its memory usage. See the article for more details.


When to care about the GC Heap

Even if you have determined that you care about memory, it is still not clear that you care about the GC heap. If the GC heap is only 10% of your memory usage, then you should be concentrating your efforts elsewhere. You can quickly determine this by opening Task Manager, selecting the 'Processes' tab, and finding your process's 'Memory (Private Working Set)' value. (See Memory Usage Auditing for .NET Applications for an explanation of private working set.) Next, use PerfView to take a heap snapshot of the same process (Memory -> Take Heap Snapshot). At the top of the view will be the 'Total Metric', which in this case is bytes of memory. If the GC heap is a substantial part of the total memory used by the process, then you should be concentrating your memory optimization on the GC heap.

If you find that your process is using a lot of memory but it is NOT the GC heap, you should download the free SysInternals VMMap tool. This tool gives you a breakdown of ALL the memory used by your process (it is nicer than the vadump tool mentioned in Memory Usage Auditing for .NET Applications). If this utility shows that the managed heap is large, then you should be investigating that. If it shows that the 'Heap' (which is the OS heap) or 'Private Data' (which is VirtualAllocs) is large, you should be investigating unmanaged memory.

Collecting GC Heap Data

If you have not already read When to care about Memory and When to care about the GC Heap please do so to ensure that GC memory is even relevant to your performance problem.

The Memory->Take Heap Snapshot menu item allows you to take a snapshot of the GC heap of any running .NET application. When you select this menu item it brings up a dialog box displaying all the processes on the system from which to select.

[Image: Memory Collection dialog]

By typing a few letters of the process name in the filter textbox you can quickly reduce the number of processes shown. In the image above simply typing 'x' reduces the number of processes to 7 and typing 'xm' would be enough to reduce it to a single process (xmlView). Double clicking on the entry will select the entry and start the heap dump. Alternatively you can simply select the process with a single click and continue to update other fields of the dialog box.

If PerfView is not run as administrator, it may not show the process of interest (if that process is not owned by you). Click on the 'Elevate to Admin' hyperlink to restart PerfView as admin and see all processes.

The process to dump is the only required field of the dialog; you can set the others if desired (see the Memory Collection Dialog reference for more). To start the dump, either click the 'Dump Heap' button or simply press the Enter key.

Understanding GC Heap Perf Data

Once you have some GC Heap data, it is important to understand what exactly you collected and what its limitations are. Logically what has been captured is a snapshot of objects in the heap that were found by traversing references from a set of roots (just like the GC itself). This means that you only discover objects that were live at the time the snapshot was taken. However two factors make this characterization inaccurate in the normal case.

Understanding GC Heap Sampling

For some applications GC heaps can get quite large (> 1GB, and possibly 50GB or more). When a GC heap has 1,000,000 objects or more, it slows the viewer quite a bit, as well as making the heap dump file very large.

To avoid this problem, by default PerfView collects complete GC heap dumps only for heaps with fewer than 50K objects. Above that, PerfView only takes a sample of the GC heap. PerfView goes to some trouble to pick a 'good' sample. In particular:

  1. The whole heap (both live and dead objects) is considered when performing the sample.
  2. PerfView actually collects the whole heap graph in memory and, for each type, counts how many objects there are of that type. It also knows the total number of objects in the heap.
  3. Based on the total number of objects in the heap and the 'target' number of objects (by default 50K), it computes a 'sampling ratio', and from that computes a 'quota' of objects for each type.
  4. It then walks the heap (linearly), randomly selecting objects to hit the quota for each type.
  5. However, we also require that each sampled object come not only with itself, but also with a 'path to root'. To ensure this, the objects along a path from a root to each sampled object are also included in the sample.
  6. In addition, large objects (with size > 85,000 bytes) are ALWAYS collected.
  7. After all samples are selected, any references from nodes in the sampled graph are included.

The result is that every sample always contains at least one path to root (but maybe not all paths), all large objects are present, and each type has at least a representative number of samples (there may be more because of reasons (5) and (6)).
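The following is a minimal, self-contained sketch of this per-type quota sampling (hypothetical code, not PerfView's actual implementation; 'Item' stands in for a heap object, and the path-to-root step is only indicated in a comment):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record Item(string TypeName, long Size);

    static class QuotaSampler
    {
        public static List<Item> Sample(List<Item> heap, int targetCount)
        {
            var rand = new Random();
            double ratio = (double)targetCount / heap.Count;         // overall sampling ratio
            var quota = heap.GroupBy(i => i.TypeName)                // per-type quota
                            .ToDictionary(g => g.Key, g => (long)(g.Count() * ratio) + 1);
            var taken = new Dictionary<string, long>();
            var sample = new List<Item>();
            foreach (var item in heap)                               // linear walk of the heap
            {
                taken.TryGetValue(item.TypeName, out long soFar);
                if (item.Size > 85000)
                    sample.Add(item);                                // large objects are always kept
                else if (soFar < quota[item.TypeName] && rand.NextDouble() < ratio)
                {
                    sample.Add(item);
                    taken[item.TypeName] = soFar + 1;
                }
            }
            // The real collector additionally adds, for every sampled object, the
            // objects on a path from a GC root, plus references between sampled nodes.
            return sample;
        }
    }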

Understanding GC Heap Scaling

GC heap sampling dumps only a fraction of the objects in the GC heap, but we wish for that sample to represent the whole heap. PerfView does this by scaling the counts. Unfortunately, because of the requirement to include every large object and a path to root for every object, no single number will correctly scale the sampled heap so that it represents the original heap. PerfView solves this by remembering the total size of each type in the original graph as well as the total counts in the sampled graph. Using this information, for each type it scales the COUNT for that type so that the SIZE of that type matches the original GC heap. Thus what you see in the viewer should be pretty close to what you would see in the original heap (just much smaller and easier for PerfView to digest). In this way, large objects (which are ALWAYS taken) will not have their counts scaled, but the most common types (e.g. string) will be heavily scaled. You can see the original statistics and the ratios that PerfView uses to scale by looking at the log when a .gcdump file has been opened.
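Expressed as a formula, the per-type scaling is:

    scale(T) = totalSizeOfT(original heap) / totalSizeOfT(sample)
    displayedCount(T) = sampledCount(T) * scale(T)

For example, if the original heap contained 100 MB of strings but the sample kept only 1 MB of them, every sampled string is displayed with a weight of 100, while a large array that was taken unconditionally has a scale of 1.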

When PerfView displays a .gcdump file that has been sampled (and thus needs to be scaled), it will display the Average amount the COUNTS of the types have been scaled as well as the average amount the SIZES had to be scaled in the summary text box at the top of the display.   This is your indication that sampling/scaling is happening, and to be aware that some sampling distortions may be present. 

It is important to realize that while the scaling tries to counteract the effect of sampling (so what is displayed 'looks' like the true, unsampled graph), it is not perfect. The PER-TYPE statistic SIZE should always be accurate (because that is the metric that was used to perform the scaling), but the COUNTs may not be. In particular, for types whose instances can vary in size (strings and arrays), the counts may be off (however, you can see the true numbers in the log file). In addition, the counts and sizes for SUBSETS of the heap can be off.

For example if you drill down to one particular part of the heap (say the set of all Dictionary<string, MyType>), you might find that the count of the keys (type string) and the count of values (type MyType) are not the same. This is clearly unexpected, because each entry should have exactly one of each. This anomaly is a result of the sampling. The likelihood of an anomaly like this is inversely proportional to the size of the subset of the heap you are reasoning over. Thus when you reason about the heap as a whole, there should be no anomaly, but if you reason about a small number of objects deep in some sub-tree, the likelihood is very high.

Generally speaking, these anomalies do not tend to affect the analysis much. This is because you usually care about LARGE parts of your heap, and this is exactly where sampling is most accurate. Thus typically the correct response to these anomalies is to simply ignore them. If however they are interfering with your analysis, you can reduce or eliminate them by simply doing less sampling. The Sampling is controlled by the 'Max Dump K Objs' field. By default 250K objects are collected. If you set this number to be larger you will sample less. If you set it to some VERY large number (say 1 Billion), then the graph will not be sampled at all. Note that there is a reason why PerfView samples. When the number of objects being manipulated gets above 1 million, PerfView's viewer will noticeably lag. Above 10 million and it will be a VERY frustrating experience. There is also a good chance that PerfView will run out of memory when manipulating such large graphs. It will also make the GCDump files proportionally bigger, and unwieldy to copy. Thus changing the default should be considered carefully. Using the sampled dump is usually the better option.

As mentioned, GC heap collection (for .NET) collects DEAD as well as live objects. PerfView does this because it allows you to see the 'overhead' of the GC (space consumed but not being used for live objects). It is also more robust (if roots or objects can't be traversed, you don't lose large amounts of data). When the graph is displayed, dead objects can be identified because they pass through the '[not reachable from roots]' node. Typically you are not interested in the dead objects, so you can exclude them by excluding this node (Alt-E).

GC Heap collection: To Freeze or not to Freeze?

PerfView has the ability either to freeze the process or to allow it to run while the GC heap is being collected. If the process is frozen, the resulting heap is accurate for that point in time; however, since even sampling the GC heap can take 10s of seconds, the process will not be running for that amount of time. For 'always up' servers this is a problem, as 10s of seconds of downtime is quite noticeable. On the other hand, if you allow the process to run as the heap is collected, the heap references are changing over time. In fact GCs can occur, objects that used to be reachable may die, and conversely new objects will be created that are not rooted by the roots captured earlier in the dump. Thus the heap data will be somewhat inaccurate.

Thus we have a trade-off between accuracy (freezing the process) and intrusiveness (letting it run).

PerfView allows both, but by default it will NOT freeze the process. The rationale is that for most apps, you take a snapshot while the process is waiting for user input (and thus the process acts as if it were frozen anyway). The exception is server applications, but that is precisely the case where stopping the process for 10s of seconds would likely be bad. Thus defaulting to allowing the process to run is better in most cases.

In addition, if the heap is large, you are already not dumping all the objects in the heap. As long as the objects missed because the process is running are statistically similar to the ones that were captured (likely in a server process), your heap stats are likely to be accurate enough for most performance investigations.

Nevertheless, if for whatever reason you wish to eliminate the inaccuracy of a running process, simply use the Freeze checkbox or the /Freeze command line qualifier to indicate your desire to PerfView.

Converting a Heap Graph to a Heap Tree

As described in Understanding GC heap data the data actually captured in a .GCDump file may only be an approximation to the GC heap. Nevertheless the .GCDump does capture the fact that the heap is an arbitrary reference graph (a node can have any number of incoming and outgoing references and the references can form cycles). Such arbitrary graphs are inconvenient from an analysis perspective because there is no obvious way to 'roll up' costs in a meaningful way. Thus the data is further massaged to turn the graph into a tree.

The basic algorithm is a weighted breadth-first traversal of the heap, visiting every node at most once and keeping only the links that were traversed during the visit. Thus the arbitrary graph is converted into a tree (no cycles, and every node except the root has exactly one parent). The default weighting is designed to pick the 'best' nodes to be 'parents'. The intuition is that if you have a choice between two nodes to be the parent of a particular node, you want to pick the more semantically relevant one.

Using Priorities to control graph-to-tree conversion

The viewer for GC heap memory data has an extra 'Priority' text box, which contains patterns that control the graph-to-tree conversion by assigning each object a floating point numeric priority. This happens in two steps: first priorities are assigned to type names, and then through types each object is assigned a priority.

The Priority text box is a semicolon-separated list of expressions of the form

PAT -> NUM

where PAT is a pattern as defined in Simplified Pattern matching and NUM is a floating point number. The algorithm for assigning priorities to types is simple: find the first pattern in the list that matches the type name, and assign the corresponding priority. If no pattern matches, assign a priority of 0. In this way every type is given a priority.

The algorithm for assigning a priority to an object is equally simple. It starts with the priority of the object's type, and adds 1/10 of the priority of its 'parent' in the spanning tree being formed. Thus a node passes part of its priority on to its children, which tends to encourage breadth-first behavior (all other priorities being equal, a node 2 hops away from a high-priority node will have a higher priority than a node 3 hops away).

Having assigned a priority to all 'about to be traversed' nodes, the choice of the next node is simple. PerfView chooses the highest priority node to traverse next. Thus nodes with high priority are likely to be part of the spanning tree that PerfView forms. This is important because all the rest of the analysis depends on this spanning tree.
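
The following Python sketch pulls these pieces together: pattern-based type priorities, the divide-by-10 inheritance rule, and a highest-priority-first traversal that keeps only the first arc reaching each node (all names are illustrative; PerfView's implementation differs in detail):

    import heapq, itertools

    def type_priority(type_name, priority_pats):
        # priority_pats: ordered (compiled_pattern, num) pairs; first match wins.
        for pat, num in priority_pats:
            if pat.search(type_name):
                return num
        return 0.0                                        # no match: priority 0

    def graph_to_tree(roots, children_of, type_of, priority_pats):
        parent, seen = {}, set(roots)
        tie = itertools.count()              # tie-breaker so tuples never compare nodes
        # heapq is a min-heap, so priorities are negated.
        frontier = [(-type_priority(type_of(r), priority_pats), next(tie), r)
                    for r in roots]
        heapq.heapify(frontier)
        while frontier:
            neg_pri, _, node = heapq.heappop(frontier)    # highest priority first
            for child in children_of(node):
                if child in seen:
                    continue                              # first arc wins: result is a tree
                seen.add(child)
                parent[child] = node
                # A child gets its own type priority plus 1/10 of its parent's.
                child_pri = type_priority(type_of(child), priority_pats) - neg_pri / 10
                heapq.heappush(frontier, (-child_pri, next(tie), child))
        return parent                                     # the spanning tree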

You can see the default priorities in the 'Priority' text box. The rationale behind these defaults is to give user-defined types the highest priority, framework types a lower priority, and anonymous runtime handles the lowest priority.

Thus the algorithm tends to traverse user defined types first and find the shortest path that has the most user defined types in the path. Only when it runs out of such links does it follow framework types (like collection types, GUI infrastructure, etc), and only when those are exhausted, will anonymous runtime handles be traversed. This tends to assign the cost (size) of objects in the heap to more semantically relevant objects when there is a choice.

Best Practices for assigning priorities to your types

The defaults work surprisingly well and often you don't have to augment them. However, if you do assign priorities to your types, you generally want to choose a number between 1 and 10. If all types follow this convention, then child nodes will generally have a lower priority (because the inherited priority is divided by 10) than any type given an explicit priority. If you want to give a node a priority so high that even its children win priority comparisons, you can give it a number between 10 and 100. Making the number larger still will force even the grandchildren to 'win' most priority comparisons. In this way you can force whole areas of the graph to be high priority. Similarly, if there are types that you don't want to see, you should give them a number between -1 and -10.

The GUI has the ability to quickly set the priority of a particular type. If you select a cell in the GUI and right click -> Priority -> Raise Item Priority (Alt-P), that type's priority will be increased by 1. There is similarly a 'Lower Item Priority' (Shift-Alt-P). Likewise, there is a 'Raise Module Priority' (Alt-Q) and a 'Lower Module Priority' (Shift-Alt-Q), which match any type in the same module as the selected cell.

Because the graph has been converted to a tree, it is now possible to unambiguously assign the cost of a 'child' to the parent. In this case the cost is the size of the object, and thus at the root the costs will add up to the total (reachable) size of the GC heap (that was actually sampled).

Viewing the resulting heap tree

Once the heap graph has been converted to a tree, the data can be viewed in the same stack viewer as was used for ETW call stack data. However, in this view the data is not the stack at allocation but rather the connectivity graph of the GC heap. You don't have callers and callees but referrers and referees. There is no notion of time (the 'when', 'first' and 'last' columns), but the notions of inclusive and exclusive cost still make sense, and the grouping and folding operations are just as useful.

It is important to note that this conversion to a tree is inaccurate in that it attributes all the cost of a child to one parent (the one in the traversal), and no cost to any other nodes that also happened to point to that node. Keep this in mind when viewing the data.

Primary vs Secondary Nodes in the Stack Viewer

As described in Converting a Heap Graph to a Heap Tree, before the memory data can be displayed it is converted from a graph (where arcs can form cycles and nodes can have multiple parents) to a tree (where there is always exactly one path from a node to the root). References that are part of this tree are called primary refs and are displayed in black in the viewer. However, it is useful to also see the references that were trimmed. These other references are called secondary nodes. When secondary nodes are present, primary nodes are shown in bold and secondary nodes in normal font weight. Because secondary nodes can clutter the display, there is a 'Pri1 Only' check box which, when selected, suppresses the display of secondary nodes.

Primary nodes are much more useful than secondary nodes because there is an obvious notion of 'ownership' or 'inclusive' cost: it makes sense to talk about the cost of a primary node and all of its children. Secondary nodes do not have this characteristic. It is also very easy to 'get lost' opening secondary nodes, because you could be following a cycle without realizing it. To help avoid this, each secondary node is labeled with its 'minimum depth'. This number is the shortest PRIMARY path from that node to the root. Thus if you are trying to find a path to the root through secondary nodes, following nodes with small depth will get you there.

Generally, however, it is better NOT to spend time opening secondary nodes. The real purpose of showing these nodes is to allow you to determine whether your priorities in the Priority text box are appropriate. If you find yourself being interested in secondary nodes, there is a good chance that the best response is to simply add a priority that will make those secondary nodes primary ones. By doing this you get sensible inclusive metrics, which are the key to making sense of the memory data.

One good way of setting priorities is to use the right click -> Priority -> Raise Item Priority (Alt-P) and right click -> Priority -> Lower Item Priority (Shift-Alt-P) commands. By selecting a node that is either interesting or explicitly uninteresting and executing these commands, you can raise or lower its priority and thus cause it to be in the primary tree (or not).


Starting an Analysis of GC Heap Dump

This section assumes you have determined that the GC heap is relevant, that you have collected a GC heap snapshot, and that you understand how the heap graph was converted to a tree and how the heap data was scaled. In addition to the 'normal' heap analysis described here, it can also be useful to review the bulk behavior of the GC with the GCStats report as well as the GC Heap Alloc Ignore Free (Coarse Sampling) view.

Bottom up Analysis

Like a CPU time investigation, a GC heap investigation can be done bottom up or top down, and as with CPU, bottom up is a good place to start. This is even more true for memory than it was for CPU. The reason is that unlike CPU, the tree being displayed in the view is not the 'truth', because the tree does not represent the fact that some nodes are referenced by more than one node (that is, they have multiple parents). Because of this, the top down representation is a bit 'arbitrary', since you can get different trees depending on exactly how the breadth-first traversal of the graph was done. A bottom up analysis is relatively immune to such inaccuracy and thus is a better choice.

Like a CPU investigation, a bottom up heap investigation starts with forming semantically relevant groups by 'folding away' any nodes that are NOT semantically relevant. This continues until the groups are big enough to be interesting. The 'Drill Into' feature can then be used to start a sub-analysis. Please see the CPU Tutorial if you are not familiar with these techniques.

The Goto callers view (F10) is particularly useful for a heap investigation because it quickly summarizes the paths to the GC roots, which indicate why the object is still alive. When you find objects that have outlived their usefulness, one of these links must be broken for the GC to collect them. It is important to note that because the view shows the TREE and not the GRAPH of objects, there may be other paths to the object that are not shown. Thus to make an object die, it is NECESSARY that one of the paths in the callers view be severed, but it may not be SUFFICIENT.

Grouping and Folding for GC Heap Investigation

Typically, GC heaps are dominated by

  1. Strings (these typically account for 20-25% of the total size of the GC heap!)
  2. Arrays (often byte[]). These often account for 10% or more.

Unfortunately, while these types dominate the size of the heap they do not really help in analysis. What you really want to know is not that you use a lot of strings, but WHAT OBJECTS YOU CONTROL are using a lot of strings. The good news is that this is the 'standard problem' of a bottom up analysis, which PerfView is really good at solving. By default PerfView adds folding patterns that cause the cost of all strings and arrays to be charged to the object that refers to them (as if the field were 'inlined' into the structure that referenced it). Thus other objects (which are much more likely to be semantically relevant to you) are charged this cost. Also by default, the 'Fold%' textbox is set to 1, which says that any type that uses less than 1% of the GC heap should be removed and its cost charged to whatever referred to it.

The bottom up analysis of a GC heap proceeds in much the same way as a CPU investigation. You use the grouping and folding features of the Stack Viewer to eliminate noise and to form bigger, semantically relevant groups. When these get large enough, you use the Drill Into feature to isolate one such group and understand it at a finer level of detail. This detailed understanding of your application's memory use tells you the most valuable places to optimize.

Once you have determined a type to focus on, it is often useful to understand where the types have been allocated. See the GC Alloc Stacks view for more on this.

Memory Leaks

A common type of memory problem is a 'memory leak'. This is a set of objects that have served their purpose and are no longer useful, but are still connected to live objects and thus cannot be collected by the GC. If your GC heap is growing over time, there is a good chance you have a memory leak. Caches of various types are a common source of 'memory leaks'.

A memory leak is really just an extreme case of a normal memory investigation. In any memory investigation you are grouping together semantically relevant nodes and evaluating whether the costs you see are justified by the value they bring to the program. In the case of a memory leak the value is zero, so it is just about finding the cost. Moreover, there is a very straightforward way of finding a leak: take a heap snapshot as a baseline, perform the operations that should consume no net memory, take a second snapshot, and diff the two.

Note that because programs often have 'one time' caches, the procedure above often needs to be amended. You need to perform the set of operations once or twice before taking the baseline. That way any 'one time' caches will have been filled by the time the baseline is captured and thus will not show up in the diff.

When you find a likely leak, use the 'Goto callers view' (F10) on the node to find a path from the root to that particular node. This shows you the objects that are keeping this object alive. To fix the problem you must break one of these links (typically by nulling out one of the object fields).

Top Down Analysis of the GC Heap

While a Bottom up Analysis is generally the best way to start, it is also useful to look at the tree 'top down' by looking at the CallTree view. At the top of a GC heap are the roots of the graph. Most of these roots are either local variables of actively running methods, or static variables of various classes. PerfView goes to some trouble to try to get as much information as possible about the roots and group them by assembly and class. Taking a quick look at which classes are consuming a lot of heap space is often a quick way of discovering a leak.

However, this technique should be used with care. As mentioned in the section on Converting a Heap Graph to a Heap Tree, while PerfView tries to find the most semantically relevant 'parents' for a node, if a node has several parents PerfView is really only guessing. Thus it is possible that there are multiple classes 'responsible' for an object and you are only seeing one, so it may be 'unfair' to blame the class that was arbitrarily picked as the sole 'owner' of the high cost nodes. Nevertheless, the path in the calltree view is at least partially to blame, and is at least worthy of additional investigation. Just keep in mind the limitations of the view.

Root Information Caveats

PerfView uses the .NET debugger interface to collect symbolic information about the roots of the GC heap. There are times (typically when the program is running on an old .NET runtime) when PerfView can't collect this information. If so, PerfView still dumps the heap, but the GC roots are anonymous (e.g. everything falls under 'other roots'). See the log from the time of the GC heap dump to determine exactly why this information could not be collected.

GC Stats Report

A typical GC memory investigation includes a dump of the GC heap. While this gives very detailed information about the heap at the time the snapshot was taken, it gives no information about the GC's behavior over time. This is what the GCStats report provides. To get a GCStats report you must Collect Event Data as you would for a CPU investigation (the GC events are on by default). When you open the resulting ETL file, one of the children will be a 'GCStats' view. Opening this will give you a report for each process on the system detailing how big the GC heap was, when GCs happened, and how much each GC reclaimed. This information is quite useful for getting a broad idea of how the GC heap changes over time.

GC Heap Alloc Ignore Free (Coarse Sampling) Stacks

In addition to the information needed for a GC Stats report, a normal ETW event data collection also includes coarse information on where objects were allocated. Every time 100KB of GC objects has been allocated, a stack trace is taken. These stack traces can be displayed in the 'GC Heap Alloc Stacks' view of the ETL file.

These stacks show where many bytes were allocated; however, they do not tell you which of these objects died quickly and which lived on to add to the size of the overall GC heap. It is the latter objects that are the more serious performance issue. By looking at a heap dump you CAN see the live objects, and after you have determined that a particular type has many instances that live a long time, it can be useful to see where they are being allocated. This is what the GC Heap Alloc Stacks view will show you.

Please keep in mind that the coarse sampling is pretty coarse. Only the objects that happen to 'trip' the 100KB sample counter are actually sampled. What is true, however, is that ALL objects over 100KB in size will be logged, and any small object that is allocated frequently will likely be logged as well. In practice this is good enough.

Large Objects

The .NET GC segregates the heap into 'large objects' (over 85K) and small objects (under 85K) and treats them quite differently. In particular, large objects are only collected on Gen 2 GCs (pretty infrequently). If these large objects live for a long time, everything is fine; however, if large objects are allocated frequently, then either you are using a lot of memory or you are creating a lot of garbage that will force many Gen 2 collections (which are expensive). Thus you should not be allocating many large objects. The GC Heap Alloc Stacks view has a special 'LargeObject' pseudo-frame that it injects if the object is big, making it VERY easy to find all the stacks where large objects are allocated. This is a common use of the GC Heap Alloc Stacks view.

Net GC Heap Allocations Stacks (GC Heap Net Mem view)

The first choice when investigating excessive memory usage of the .NET GC heap is to take a heap snapshot of the GC heap. This is because objects are only kept alive because they are rooted, and this information shows you all the paths that are keeping the memory alive. However, there are times when knowing the allocation stack is useful. The GC Heap Alloc Stacks view shows you these stacks, but it does not know when objects die. It is also possible to turn on extra events that allow PerfView to trace object freeing as well as allocation and thus compute the NET amount of memory allocated on the GC heap (along with the call stacks of those allocations). There are two verbosity levels to choose from. They are both in the advanced section of the collection dialog box:

  1. .NET Alloc - This option logs an event (and stack) every time an object is allocated on the GC heap.
  2. .NET SampAlloc - This option logs an event every time 10KB of objects is allocated on the GC heap.

In both cases, PerfView also logs when objects are destroyed (so that the net can be computed). The option of firing an event on every allocation is VERY verbose. If your program allocates a lot, it can slow the program down by a factor of 3 or more. In such cases the files will also be large (> 1GB for 10-20 seconds of trace). Thus it is best to start with the second option of firing an event every 10KB of allocation. This typically has well under 1% overhead, and thus does not impact run time or file size much. It is sufficient for most purposes.

When you turn on these events, they are logged only for .NET processes that start AFTER data collection begins. Thus if you are profiling a long-running service, you would have to restart the application to collect this information.

Once you have the data, you can view it in the 'GC Heap Net Mem' view, which shows you the call stacks of all allocations, where the metric is bytes of net GC heap. The most notable difference between 'GC Heap Alloc Stacks' and 'GC Heap Net Mem' is that the former shows allocation stacks of all objects, whereas the latter shows allocation stacks of only those objects that have not yet been garbage collected.

There is basically no difference in what is displayed between traces collected with the '.NET Alloc' checkbox and the '.NET SampAlloc' checkbox. It is just that in the case of .NET SampAlloc the information may be less accurate, since a particular call stack and type are 'charged' with 10KB of size. However, statistically speaking, it should give you the same averages if enough samples are collected.

The analysis of .NET net allocations works the same way as unmanaged heap analysis.



PerfView Reference Guide

Canceling Operations and Status Log

One of the goals of PerfView is for the interface to remain responsive at all times.   The manifestation of this is the status bar at the bottom of most windows.  This bar displays a one line output area as well as an indication of whether an operation is in flight, a 'Cancel' button and a 'Log' button.  Whenever a long operation starts, the status bar will change from 'Ready' to 'Working' and will blink.   The cancel button also becomes active.   If the user grows impatient, he can always cancel the current operation.    There is also a one line status message that is updated as progress is made. 

When complex operations are performed (like taking a trace or opening a trace for the first time), detailed diagnostic information is also collected and stored in a Status log.  When things go wrong, this log can be useful in debugging the issue.    Simply click on the 'Log' button in the lower right corner to see this information. 


Quick Start for PerfView's Main View

You have three basic choices in the main view:

Quick Start for collecting Event (Time) data

While we do recommend that you walk the tutorial and review Collecting Event Data and Understanding Performance Data, if your goal is to see your time-based profile data as quickly as possible, follow these steps

Quick Start for Collecting GC Heap data

While we do recommend that you walk the tutorial and review Collecting GC Heap Data and Understanding GC Heap Data, if your goal is to see your memory profile data as quickly as possible, follow these steps

Live Process Collection

Process Dump Collection


Main View Tips

In addition to the General Tips, here are tips specific to the Main View.


PerfView's Main View

The Main view is what greets you when you first start PerfView. The main view serves three main purposes:

  1. It serves as a quick introduction to PerfView with links to important starting points in the user's guide.
  2. It hosts all the data collection capabilities of PerfView.
  3. Its left pane acts as a 'perf explorer' which allows you to decide which performance data  you wish to examine.  Double clicking on items will open them, and right clicking will do other operations. 

Directory TextBox - At the top of the left pane is the directory textbox (also settable with the File -> 'Go To Directory' menu option (Ctrl-L)). This is set to the directory to inspect. You can also enter file names into it, which causes them to be opened. When you open directory items in the view, this textbox is updated to stay in sync.

File Filter Textbox - The box just below the directory textbox. If you type text in this box, then only files that match this string (case insensitive) will be displayed. The * character is a wild card. This is a quick way of finding a particular file in a large directory.

The following image highlights the important parts of the Main View. 

MainViewer

Data Collection

Typically when you first use PerfView, you use it to collect data.  PerfView can currently collect data for the following kinds of investigations

  1. Time Investigations: ETW data (with many variations). You collect this data with items in the 'Collect' menu entry. See collecting ETW data for more.
  2. .NET Memory Investigations: the .NET runtime managed heap. You collect this data with the 'Memory' menu entry; see collecting memory data for more.

Types of Performance Data / Views

The types of data PerfView understands


Quick Start for the Object Viewer

TODO NOT DONE


Object Viewer Tips

In addition to the General Tips, here are tips specific to the Object Viewer.


The Object Viewer

The object viewer is a view that lets you see specific information about an individual object on the GC heap.

TODO NOT DONE


Quick Start for the Stack Viewer

While we do recommend that you walk the tutorial, if your goal is to understand what the stack viewer is showing you, follow these steps


Setting Defaults in Stack Viewer

You can set the default values used in the GroupPats and Fold textboxes using the "File -> Set As Default Grouping/Folding" menu item. These values are persisted across PerfView sessions for that machine. The 'File -> Clear User Config' menu item will reset these persisted values to their defaults, which is a simple way to undo a mistake.

Quick Start for the GC Heap Viewer

While we do recommend that you walk the tutorial and review Understanding GC Heap Perf Data and Starting an Analysis of GC Heap Dump, if your goal is to see your memory profile data as quickly as possible, follow these steps

  1. Determine if memory is of interest (see When to care about Memory and in particular When to care about the GC Heap), and take a GC heap snapshot (Memory -> Take Heap Snapshot).
  2. Understand what the GC stack viewer is showing you, and in particular the difference between primary and secondary nodes.
  3. Do a bottom up analysis of objects as described in Starting a GC Heap Analysis.

Stack Viewer Tips

In addition to the General Tips, here are tips specific to the Stack Viewer.


The Stack Viewer

The stack viewer is the main window for doing performance analysis. If you have not walked through the tutorial or the sections on starting an analysis and understanding perf data, these would be good to read. Here is the layout of the stack viewer:

StackViewer

The stack viewer has three main views: ByName, Caller-Callee, and CallTree. Each view has its own tab in the stack viewer and can be selected using these tabs. More typically, however, you use right click or keyboard shortcuts to jump from a node in one view to the same node in another view. In fact, double clicking on any node in any view will bring you to the Caller-Callee view with the focus set to that node.

Regardless of which view is selected, the samples under consideration and the grouping of those samples are the same for every view. This filtering and grouping is controlled by the text boxes at the top of the view and is described in detail in the section on grouping and filtering.

At the very top of the stack viewer is the summary statistics line. This gives you statistics about all the samples, including count and total duration. It computes the 'TimeBucket' size, which is defined as 1/32 of the total time interval of the trace. This is the amount of time represented by each character in the When column.
It also computes the Metric/Interval ratio. This is a quick measurement of how CPU bound the trace is as a whole: a value of 1 indicates a program that on average consumes all the CPU of a single processor. Unless that value is high, your problem is not CPU (it may be some blocking operation like a network or disk read).
However, this metric is averaged over the whole time data was collected, so it can include time when the process of interest is not even running. Thus it is typically better to use the When column for the node representing the process as a whole to determine how CPU bound the process is.
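
Both derived statistics are simple computations over the trace interval; here is a minimal sketch following the formulas above:

    def summary_stats(total_metric_msec, start_msec, end_msec):
        interval = end_msec - start_msec
        time_bucket = interval / 32                   # time per When-column character
        cpu_ratio = total_metric_msec / interval      # Metric/Interval; ~1.0 == one CPU busy
        return time_bucket, cpu_ratio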

In addition to the grouping/filtering textboxes, the stack viewer also has a find textbox, which allows you to search (using .NET regular expressions) for nodes with particular names.

Column Descriptions

The columns displayed in the stack viewer grids are independent of the view displayed. Columns can be reordered simply by dragging the column headers to the location you wish, and most columns can be sorted by clicking on an (often invisible) button in the column header directly to the right of the column header text. The columns that are displayed are:

Column Sorting

Many of the columns in the PerfView display can be used to sort the display. You do this by clicking on the column header at the top of the column. Clicking again switches the direction of the sort. Be sure to avoid clicking on the hyperlink text (it is easy to accidentally click on the hyperlink). Clicking near the top typically works, but you may need to make the column header larger (by dragging one of the column header separators). There is already a request to change the hyperlinks so that it is easier to access the column sorting feature.

There is a known bug that once you sort by a column the search functionality does not respect the new sorted order. This means that searches will seem to randomly jump around when finding the next instance.

ByName View (Group by Method)

The default view for the stack viewer is the ByName view. In this view EVERY node (method or group) is displayed, sorted by the total EXCLUSIVE time for that node. This is the view you would use for a bottom up analysis. See the tutorial for an example of using this view. Double clicking on entries will send you to the Caller-Callee view for the selected node.

 See stack viewer for more. 

CallTree View

The call tree view shows how each method calls other methods and how many samples are associated with each of these calls, starting at the root. It is an appropriate view for doing a top down analysis. Each node has a checkbox that displays all the children of that node when checked. By checking boxes you can drill down into particular methods and thus discover how any particular call contributes to the overall CPU time used by the process.

CallTreeView

The call tree view is also well suited for 'zooming in' to a region of interest. Often you are only interested in the performance of a particular part of the program (e.g., the time between a mouse click and the display update associated with that click). These regions of time can typically be discovered by either looking for regions of high CPU utilization using the When column on the main program node, or by finding the name of a function known to be associated with the activity and using the 'SetTimeRange' command to limit the scope of the investigation.

Like all stack-viewer views, the grouping/filtering parameters are applied before the calltree is formed. 

If the stack viewer window was started to display the samples from all processes, each process is just a node off the 'ROOT' node.    This is useful when you are investigating 'why is my machine slow' and you don't really know what process to look at.   By opening the ROOT node and looking at the When column, you can quickly see which process is using the CPU and over  what time period. 

See the tutorial for an example of using this view.   See stack viewer for more.  See flame graph for different visual representation.

Caller Callee View

The caller-callee view is designed to allow you to focus on the resource consumption of a single method. Typically you arrive here from either the ByName or CallTree view by double-clicking on a node name. If you have a particular method you are interested in, search for it (find textbox) in the ByName view and then double click on the entry.

CallerCalleeView

The Caller-Callee view has the concept of the 'Current Node'. This is the node of interest and is the grid line in the center of the display. The display shows all nodes (methods or groups) that were called by the current node in the lower grid, and all nodes that called the current node in the upper grid. By double clicking on nodes in either the upper or lower pane you can change the current node to a new one, and in that way navigate up and down the call tree.

Unlike the CallTree view, however, a node in the Caller-Callee view represents ALL calls of the current node. For example, in the CallTree view the node representing 'SpinForASecond' represents only the instances of that function that have the SAME PATH TO THE ROOT, and thus you will see several instances of 'SpinForASecond' in the CallTree view. If you were trying to understand the impact of 'SpinForASecond' on the whole program, it would be hard to do so in the CallTree view because you would have to combine all those nodes. The Caller-Callee view aggregates all the different paths to 'SpinForASecond' so you can quickly understand ALL the callers and ALL the callees of 'SpinForASecond' over the entire program.

It is important to realize that as you double click on different nodes to make them current, the SET OF SAMPLES CHANGES. When the current node is 'SpinForASecond', this view shows ONLY samples that had 'SpinForASecond' in their call stack. However, if you double click on 'DateTime.get_Now' (a child of 'SpinForASecond'), the view will now include samples where 'DateTime.get_Now' was called from call stacks that did not include 'SpinForASecond', and will NOT include call stacks that called 'SpinForASecond' but not 'DateTime.get_Now'. This can be confusing if you are not aware it is happening.

Sometimes you wish to view all the ways you can get to the root from a particular node. You can't do this using the caller-callee view directly because of the issue of changing sample sets. You could simply search for the node in the CallTree view, but that will not sort the paths by weight, which makes finding the 'most important' path more difficult. You can, however, select the current node, right click, and select 'Include Item'. This will cause all samples that do NOT include the current node to be filtered away. This should not change the current caller-callee view, because that view already considered only samples that included the current node. Now, however, as you make other nodes current, they TOO will consider only samples that include the original node as well as the new current node. By clicking on caller nodes you can trace a path back to the root.

Because the caller-callee view aggregates ALL samples that have the current node ANYWHERE in their call stack, there is a fundamental problem with recursive functions. If a single method occurs multiple times on the stack, a naive approach would count the same SINGLE sample MULTIPLE times (once for each instance on the call stack), leading to erroneous results. You could solve the double-counting problem by counting the sample only for the first (or last) instance on the stack, but this skews the caller-callee view (it would look like the recursive function never calls itself, which is also inaccurate). The solution that PerfView chooses is to 'split' the sample: if a function occurs N times on the stack, each instance is given a sample size of 1/N. Thus the sample is not double-counted, but it still shows all callers and callees in a reasonable way.
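
A sketch of the splitting rule (illustrative only):

    def focus_weight(stack, focus):
        # Weight each instance of the focus frame 1/N, where N is the number of
        # times the frame appears on this sample's stack.
        n = stack.count(focus)
        return 1.0 / n if n else 0.0

    # e.g. the stack ['main', 'f', 'f', 'g'] with focus 'f' gives each of the
    # two 'f' instances a weight of 0.5, so the sample is counted exactly once.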

 See stack viewer for more. 

Callers View

The callers view shows you all possible callers of a method. It is a treeview (like the calltree view), but the 'children' of the nodes are the 'callers' of the node (thus it is 'backwards' from the calltree view). A very common methodology is to find a node in the ByName view that is reasonably big, look at its callers (by double clicking on the entry in the ByName view), and then look to see if there are better semantic groupings 'up the stack' that this node should be folded into.

If you double click on an entry in the Callers view it becomes the focus node for the callers view, callees view and caller-callees view.  Thus it is fairly common to double click on an entry, switch to the Callees view, double click on another entry and switch back. 

In the callers view the top node is always the aggregation of all uses of a particular method regardless of the caller. Thus the top line's statistics should always agree with the statistics in the 'By Name' view. Moreover, any children of a node in the Callers view represent the callers of the parent node. This means the children will always have an exclusive time of 0, because by definition a caller is NOT the terminal method of the stack (since it called something else).

Handling of Recursion in the Caller and Callees view

Both the callers view and the callees view are formed by finding all samples that contain the focus frame and looking at the appropriately related frame (caller or callee). However, when the focus frame is a recursive function there is an ambiguity, because there are multiple choices for the caller and callee depending on which recursion instance is chosen.

PerfView resolves this by always choosing the 'deepest' instance of the recursive function in the stack. Thus if A calls B calls C calls B calls D, and the focus node was B, then this sample would have a caller of C (not A) and a callee of D (not C).
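
A sketch of that rule, assuming stacks are listed root-first (illustrative only):

    def caller_and_callee(stack, focus):
        # Use the DEEPEST occurrence of the focus frame to pick caller and callee.
        i = len(stack) - 1 - stack[::-1].index(focus)
        caller = stack[i - 1] if i > 0 else None
        callee = stack[i + 1] if i + 1 < len(stack) else None
        return caller, callee

    # caller_and_callee(['A', 'B', 'C', 'B', 'D'], 'B') == ('C', 'D')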

Callees View

The callees view is a treeview that shows all possible callees of a given node. It is very similar to the calltree view, but where the calltree view always starts at the root, the callees view starts at the 'focus' node and includes ALL stacks that reach that node. In the calltree view the different instances of the node would be scattered across the call tree and would be hard to focus on.

If you double click on an entry in the Callees view it becomes the focus node for the callees view, callers view and caller-callees view.  Thus it is fairly common to double click on an entry, switch to the Callers view, double click on another entry and switch back. 

Like the Callers view, there is an issue with double counting when recursive functions are involved. See Handling of Recursion in the Caller and Callees view for more.

Flame Graph View

The flame graph view shows the same data as the call tree view, but using a different visualization that gives you a very intelligible overview. The graph starts at the bottom. Each box represents a method in the stack; every parent is the caller, and children are the callees. The wider the box, the more time it was on-CPU. The sample count is shown in the tooltip and in the bottom panel. To change the content of the flame graph, you apply filters to the call tree view. To learn more about flame graphs please visit http://www.brendangregg.com/flamegraphs.html

FlameGraphView

The flame graph view in PerfView traditionally reflects the amount of consumed memory, but this changes when we graph stack differences. After a garbage collection, the amount of memory consumed by a type can be negative when inspected in a stack difference. In those cases the corresponding flame graph boxes are drawn with a blue hue, indicating a decrease in memory. Increasing memory usage is drawn with a yellow/red tint as usual.

FlameGraphDiffView

Notes View

This view allows you to keep notes. It contains the same data as the 'Notes Pane' that you can toggle with the F2 key. These notes are saved when the view is saved, which allows you to keep information like the leads you need to follow up on during the investigation. The notes pane is particularly useful if you need to 'hand off' the investigation to another person. By putting the 'explanation' of the performance problem in the notes pane and sending the saved view, the next person can pick up where you left off.


Reusing Filtering Parameters

Naming Parameter sets

It is often the case that the grouping and filtering parameter definitions get reasonably complex even though they have a relatively simple semantic meaning. It is also useful to be able to save and reuse these parameters for other investigations. To facilitate this, a filter parameter set can be given a name (simply by entering text in the Name text box), and this name can later be used to identify the set.

Named parameter sets are currently not used by PerfView.

Diffing Two Traces

PerfView has the capability of taking the difference between two stack views.  This is very useful for understanding the cause of a regression caused by a recent change.   To use this capability you should

PerfView will then open a stack view which contains the difference between the 'test' view and the 'baseline' you selected. The algorithm it uses to do this is VERY simple: it negates the metric for the baseline and then combines these samples with the samples of the test (which are unmodified). The result is a view containing the union of the samples from the 'test' and the 'baseline', but where the count and metric values for all the baseline samples are NEGATIVE. This means that the counts and metric values will often 'cancel out', leaving just what is in the test but not the baseline.
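
In sketch form, the diff really is this simple (the sample representation is illustrative):

    def diff_samples(test, baseline):
        # Each sample is (stack, count, metric). Matching stacks cancel when
        # the viewer aggregates counts and metrics.
        return (list(test) +
                [(stack, -count, -metric) for stack, count, metric in baseline])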

Like a normal investigation you should start your 'diff' investigation using the 'By Name' view.    In a typical investigation the 'test' trace has strictly more metric (the regression) than the baseline, and this is reflected in the totals for the diff (the total metric for the diff should be the total metric for the test minus the total metric for the baseline).   The 'ByName' view  then shows you where this difference came from with respect to the groups that have been selected with the 'GroupPats' (just like a normal trace).  

If you are lucky, each line in the 'By Name' view is positive (or a very small negative number). This is the 'easy' case, and when this happens you have the information you are interested in (the precise groups that have additional cost in the test but not the baseline are at the top of the By Name view). From this point the diff investigation works just like a normal investigation (you can drill down, look at other views, change groupings, fold, etc.).

However, it is not uncommon to have large negative values in the view. When this happens the diff is not that useful, because we are interested in the ADDITIONAL time in the test trace, but the negative numbers tell us there are big places where the baseline used more time than the test. Clearly the sum has to add up to the final regression, but as long as there are large negative values in the view, we can't trust the large positive values, because they MAY be canceled by the negative values.

Thus analysis of a diff trace always has an additional step: after you have formed the diff view but before you do any analysis, you must use the grouping/folding/filtering operators to ensure that negative values have been 'canceled out' sufficiently. The view should contain only positive metric numbers (or inconsequential negative ones).

In fact, PerfView already helps with this. Normally a process and thread node in the stack display contains the process and thread ID for that node. While this is useful information, it also means the nodes from the baseline and test traces are likely to NEVER match (since they have different IDs). If left uncorrected, this would cause the 'TreeView' to become pretty useless (it would show a large positive number under the 'test' process and a slightly smaller large negative number under the 'baseline', with no cancellation). PerfView fixes this by providing groupings that effectively remove the process and thread ID from the nodes. Now the nodes match and you get the desired cancellation.

PerfView can only do so much, however.   It can anticipate the need to rewrite the process and thread IDs, but it can't know that you renamed some function, or that lazy initialization caused the cost of some initialization to move from one place to another.   In short PerfView can't know all the 'expected' differences that you wish to ignore.  It is your job as the analyst to make 'expected' differences 'match exactly' and thus cancel out. 

PerfView's powerful folding and grouping operators are the tools you will use to create this cancellation. The mantra to remember is 'grouping is your friend': keep your groups as large as possible. In particular:

The rationale behind this strategy is straightforward.   The larger the groups you form, the more likely 'inconsequential' differences will simply 'cancel out'.    Modules tend to be the most useful 'big group' and thus grouping all samples by module is likely to show you a view where cancellation worked (only small negative numbers in the view).   Once you identify the samples in a particular module that are responsible for the regression, you can then use the 'Drill Into' functionality to isolate JUST THOSE SAMPLES, and change the groupings to show you more detail.   This tends to be a very useful strategy. 

More Diffing Cancellation Strategies

The main technique for achieving cancellation in a diff is to pick big groups and then Drill into only those samples that are of interest.   However there are some other useful things to remember.

  1. Keep the scenario as small as possible.  
  2. Typically only a 'bottom up' analysis works for diffs. It is just too easy for there to be differences 'near the top' of the stack that will frustrate cancellation. Avoid this by doing a bottom up analysis (the 'By Name' view and the callees view).

Fixing Renamed functions

Grouping lets you literally rename any node name to any other node name.  Thus you can 'fix' any 'expected' differences in a trace.   For example if MyDll!MethodA was renamed to MyDll!MethodB, you could add the grouping pattern

MyDll!MethodA->MethodA;MyDll!MethodB->MethodA

which 'renames' both of them to simply 'MethodA' and resolves the diff. Folding can also be used to resolve differences like this. For example, if these two methods are not even interesting (you don't need to see them on the call stacks), then you could simply fold both of them away with the folding pattern

MethodA;MethodB

which makes both of them disappear (and thus they can't cause a difference).


Regression Investigation with Overweight Analysis

Overweight analysis is a fairly simple technique in which the inclusive costs of all symbols from two traces are compared. Normally a time metric is used, but any inclusive cost could work.

The idea is this: using the base and the test runs, it's easy to get the overall size of the regression. Let's say it was 10%. From there you could take as your null hypothesis that everything is just 10% slower. What you're looking for is symbols that changed more than 10% and are therefore in some sense more responsible for the change. The overweight report in this case simply computes the ratio of the actual growth to the expected growth of 10%. When you find symbols with greater than 100% overweight, those are of great interest.

Suppose main calls f and g and does nothing else. Each takes 50ms for a total of 100ms. Now suppose f gets slower, to 60ms. The total is now 110, or 10% worse. How is this algorithm going to help? Well let's look at the overweights. Of course main is 100 going to 110, or 10%, it's all of it so the expected growth is 10 and the actual is 10. Overweight 100%. Nothing to see there. Now let's look at g, it was 50, stayed at 50. But it was 'supposed' to go to 55. Overweight 0/5 or 0%. And finally, our big winner, f, it went from 50 to 60, gain of 10. At 10% growth it should have gained 5. Overweight 10/5 or 200%. It's very clear where the problem is! But actually it gets even better.

Suppose that f actually had two children x and y. Each used to take 25ms but now x slowed down to 35ms. With no gain attributable to y, the overweight for y will be 0%, just like g was. But if we look at x we will find that it went from 25 to 35, a gain of 10 and it was supposed to grow by merely 2.5 so its overweight is 10/2.5 or 400%. At this point the pattern should be clear:

The overweight number keeps going up as you get closer to the root of the subtree which is the source of the problem. Everything below that will tend to have the same overweight. For instance if the problem is that x is being called one more time by f you'd find that x and all its children have the same overweight number.

This brings us to the second part of the technique. You want to pick a symbol that has a big overweight but is also responsible for a largeish fraction of the regression. So we compute its growth and divide by the total regression cost to get the responsibility percentage. This is important because sometimes you get leaf functions that had 2 samples and grew to 3 just because of sampling error. Those could look like enormous overweights, so you have to concentrate on methods that have a reasonable responsibility percentage and also a big overweight. The report automatically filters out anything with less than +/- 2% responsibility.
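
Putting the pieces together, here is a sketch of the whole report (illustrative names; it assumes a 'ROOT' pseudo-node whose inclusive metric is the total for the trace):

    def overweight_report(base, test):
        # base/test: dicts mapping symbol -> inclusive metric
        total_base, total_test = base['ROOT'], test['ROOT']
        growth = (total_test - total_base) / total_base   # e.g. 0.10 for a 10% regression
        regression = total_test - total_base
        rows = []
        for sym in base.keys() & test.keys():
            expected = base[sym] * growth                 # the null-hypothesis growth
            actual = test[sym] - base[sym]
            overweight = 100.0 * actual / expected if expected else float('inf')
            responsibility = 100.0 * actual / regression
            if abs(responsibility) >= 2.0:                # drop < +/- 2% responsibility
                rows.append((sym, overweight, responsibility))
        return sorted(rows, key=lambda row: -row[1])

Running this on the example above (main 100 -> 110, f 50 -> 60) yields overweights of 100% for main and 200% for f; g, with 0% responsibility, is filtered out.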

Most of this summary is available online with more examples here.


Quick Start for the Event Viewer

The Event Viewer is a relatively advanced feature that lets you see the 'raw' events collected in an ETL file.   To get started as quickly as possible


Event Viewer Tips

In addition to the General Tips, here are tips specific to the Event Viewer.


The Event Viewer

Some data files (currently only XPERF CSV and CSVZ files) support a view of arbitrary events sorted by time. The Event Viewer is a window designed to display this data. Basically it is a view of events in chronological order, which can be filtered and searched. A typical scenario is that the application has been instrumented with events (like System.Diagnostics.Tracing.EventSource), and these events are used to determine a time of interest.

EventViewer

The view has two main panels. The panel on the left contains all the event types in the trace. You select the ones of interest by clicking on them with the control key held down (to select several simultaneously). The right panel contains the actual event records. It is relatively expensive to perform the scan over the data to form the list, so you must explicitly ask for the right panel to be updated. You can do so in several ways:

  1. Click the 'Update' button in the upper left corner
  2. Hit F5
  3. Double click on an entry in the left panel (If you have multiple selections you must also hold the Ctrl key down to not lose your selection)
  4. Right click and select the 'Update' menu item.
  5. Hit enter in any filtering text boxes at the top of the window.  

Filtering by Process

In addition to filtering by event type, you can also filter by process by placing text in the 'Process Filter' text box. This text is a .NET regular expression, and only records from processes that match it will be selected. The matching is case insensitive and only has to match a substring of the process name. You can use the standard regular expression ^ and $ operators to force matches of the complete string. Note that for context switch events, the process filter matches both the process being switched from (OldProcessName) and the new process being switched to (ProcessName).

Limiting the number of records returned

Traces can be very large, and thus a very large number of results can be returned in the right panel. To speed things up, only a reasonable number of records (by default 10,000) is returned. This is the 'MaxRet' value. If it is too small, you can update this textbox to something larger.

Filtering by Text

In addition to filtering by process, you can also filter by text in the returned events. Only records whose displayed text matches the pattern will be shown. Thus changing the columns that are displayed CAN affect the filtering if there is text in the 'Text Filter' box. The string in the 'Text Filter' is interpreted as a .NET regular expression, and as with the process filter, the match only has to match a substring to succeed. If the pattern begins with a '!' character, then only entries that do NOT match the pattern will be shown.

Selecting Columns

Fields that are specific to the event are shown as a series of NAME=VALUE pairs in the 'Data' column. This data column can be quite long, and often the most interesting elements are at the end, making the view inconvenient. You can fix this by indicating which of these event-specific columns you wish to have displayed, by placing field names (case insensitive) in the 'Columns to Display' textbox. This can be populated easily by clicking on the 'Cols' button, which displays a popup list of all the columns; simply click on the ones of interest (shift- and ctrl-clicking to select multiple entries) and hit 'enter' to continue. The columns will display in the order that you selected them, and the '*' character can be used as a wild card representing all columns that have not already been selected. A maximum of 4 fields will be displayed in their own columns; after the first 4, the rest of the specified columns will be displayed in the 'Rest' column.

Filtering On Select Columns

Events can be filtered using the Columns to Display textbox by specifying expressions, combined with the boolean operators || and &&, based on a selected column named within square brackets ([]). The format of an individual query is: LeftOperand Operator RightOperand.

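
For illustration only (the column names and operators here are hypothetical; the exact operator set is described by the dialog itself), such queries take shapes like:

    [ThreadID] == 1234
    [DURATION_MSEC] > 500 && [ProcessName] == w3wp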

Event Types

The left hand panel contains all the event types that are in the trace.   These include the events collected by the OS kernel, as well as by the .NET runtime, and any others that you indicated when you collected the data.

Filtering the event list

Because the number of event types can be large (typically dozens), there is a 'Filter' text box at the top of the event type pane.   If you are looking for a particular event, simply type some part of the event name in this text box and the displayed list will be filtered to those events that contain the typed text somewhere in the name. The text you type here is really a .NET regular expression, which means you can use wild cards (. and *) and, perhaps most importantly, the | operator to mean 'or'. This allows you to filter out all but some interesting events quickly. Also remember that Ctrl-A will select everything in the view.

Event Histogram

When the event view is updated, in addition to populating the main listbox, it also generates a histogram of event counts which shows how the frequency of the selected events varies over time. The time interval designated by the Start and End textboxes is divided into 100 buckets and the event count for each of these buckets is calculated. This number is then scaled so that the largest bucket represents 100%, and the same convention used in the stackviewer's When Column is used to convert this percentage into a number (or letter). This is displayed just above the listbox. Like the When Column, you can select a portion of this display and 'zoom in' by using the 'Set Range Filter' command (Alt-R). In addition, when you change the selection in the histogram text box, PerfView will calculate the start and end times, total event count and average event rate and display these values in the status bar.

 Important Kernel Events

Here are some kernel and .NET events that are worth knowing more about:


The ETW Data Collection Dialog

Before starting collection, PerfView needs to know some parameters.   It fills in defaults for all but the command to run. Thus in the common scenario you only need to fill in the command to run (assuming you are using the 'Run' command) and hit return to start collecting data. 

Whether you use the 'Run' or 'Collect' command, profile data is collected machine wide.   In order to collect profile data you must have administrator rights.  If you do not, PerfView will try to elevate (bring up a UAC dialog box), and relaunch itself with administrator privileges.  

Advanced Options

PerfView chooses a useful default set of ETW events to log, which allows common performance analysis to be done; however, there are numerous other ETW events that could be turned on.  Here is a sampling of some of the most useful of these more advanced events. 

In addition to the more advanced events there are additional advanced options that you rarely have to change.

Provider Browser

The Provider Browser is a dialog box generated from the '...' button on the right of the additional providers textbox. The Provider Browser allows the user to inspect the providers that are available, as well as the keywords available for any particular provider.

Because there are so many ETW providers available machine wide, the Browser also allows the search to be filtered to only those providers that are relevant for a particular process.

Viewing Manifests

While the name of the provider and its keywords are often sufficient to decide which events to turn on, it is not unusual to want more information about what the possible events are. This is what the 'View Manifest' button is for. Many providers register an XML document called a manifest that describes all the events the provider can generate in relatively fine detail. Included in this manifest is:

This information is typically sufficient to determine the optimal keywords to set for any given application. See the official docs for more details of the information in the manifest.

The Abort command

The model for ETW data collection is that data is collected machine-wide.  Moreover, data collection can exceed the lifetime of the process that started collection.  While this characteristic is useful (it allows independent start and stop command line commands), it also means that it is possible to accidentally leave ETW collection running for an indefinite period of time.    PerfView goes to some length to ensure that data collection is stopped in typical cases; however, if PerfView was terminated abnormally, or if the command line 'start' operation was used, it is possible that ETW data collection is left on.  The Collect->Abort command is designed for this case.   It ensures that any ETW providers turned on by PerfView are off. 

Finally, it is also easy to launch PerfView from the command line to collect profile data.  See collecting data from the command line for more.
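
For instance (MyApp.exe stands in for your own application):

    PerfView run MyApp.exe       (collect machine wide while MyApp.exe runs, then stop)
    PerfView collect             (collect machine wide until told to stop)
    PerfView abort               (ensure all PerfView ETW sessions are turned off)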


Memory Collection Dialog

The memory collection Dialog box allows you to select the input and output for collecting GC Heap data as well as set additional options on how that data is collected.

Filtering / Grouping Stack Data

Simplified Pattern matching

Unfortunately the syntax for normal .NET regular expressions is not very convenient for matching patterns for method names.   In particular '.', '\', '(', ')' and even '+' and '?' are used in method or file names and would need to be escaped (or worse, users would forget they need to escape them and get misleading results).   As a result PerfView uses a simplified set of patterns that avoid these collisions.   The patterns are:

  1. '*' matches any number of any characters (like '.*' in .NET regular expressions)
  2. '%' matches any number of alphanumeric characters
  3. '^' anchors the match to the beginning of the frame name
  4. '{' and '}' capture parts of the pattern for use as $1, $2, ... in the group name

This simplified pattern matching is used in the GroupPats, FoldPats, IncPats, and ExcPats text boxes. If you need more powerful matching operators, you can get them by prefixing the ENTIRE PATTERN with a '@'. That indicates to PerfView that the rest of the pattern follows .NET regular expression syntax.

Simplified pattern matching is NOT used in the 'Find' box.  For that true .NET regular expressions are used. 

Grouping (The GroupPats TextBox)

See also Simplified Pattern matching.

Fundamentally, what is collected by the PerfView profiler is a sequence of stacks.  A stack is collected every millisecond for each hardware processor on the machine.   This is wonderfully detailed information, but it is very easy to lose sight of the 'forest' (the semantic component consuming an unreasonable amount of time) because of the 'trees' (the data on hundreds or even thousands of 'helper' methods that are used by many different components).     One very important tool to tame this complexity is to group methods into semantic groups.    PerfView provides a simple but very powerful way of doing just this. 

Every sample consists of a list of stack frames, each of which has a name associated with it.  Initially a frame name looks something like this:
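
(a hypothetical frame, for illustration; the exact path and method vary)

    C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscorlib!System.Reflection.Assembly::Load(string)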

In particular the name consists of the full path of the DLL that contains the method (with the file name suffix removed), followed by a '!', followed by the full name (including namespace and signature) of the method.   By default PerfView simply removes the directory path from the name and uses that for display.   However, you can instead ask PerfView to group together methods that match a particular pattern.  There are two ways of doing this:  

  1. PAT->GROUPNAME        Replace any frame names matching PAT with the text GROUPNAME.  
  2. PAT=>GROUPNAME        Like PAT->GROUPNAME but remember the 'entry point' into the group.  (See Entry Groups)

The first form is the easiest to understand.   Basically it is just search and substitute on all the frame names.     Any frame that matches the given pattern will be replaced (in its entirety) with GROUPNAME.   This has the effect of creating groups (all methods that match a particular pattern).   For example the specification:
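
    mscorlib!Assembly::->class Assembly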

Will match any frame that contains mscorlib!Assembly:: and replace the entire frame name (not just the part that matched) with the string 'class Assembly'.   This has the effect of grouping all methods from the class Assembly into a single group.  With one simple command you can group together all methods from a particular class.

Like .NET regular expressions, PerfView regular expressions allow you to 'capture' parts of the string that match the pattern and use them in forming the group name.   By surrounding parts of the pattern with {} you capture that part of the pattern, and then you can reference the string that matched by using $1, $2, ... to signify the first, second, ... capture.  For example:
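
    {%}!->module $1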

Says to match any frame that has alphanumeric characters before '!', and to capture those alphanumeric characters into a $1 variable.   Whatever was matched is then used to form the group name.   This has the effect of grouping all samples by the module that contained them (the 'module level view').  

It is useful to have more than one group specification, so the group syntax supports a semicolon-separated list of grouping commands.  For example, here is another useful one:
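
A pair of patterns along these lines (illustrative; the exact pattern can vary):

    {%!*}.%(->class $1;{%!*}::->class $1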

There are two patterns in this specification.  The first pattern captures the text right before the '!' as well as everything up to the last '.' before a '('.   This captures the 'class and namespace' part of a .NET style method name.   The second pattern does something very similar with C++ style names (which use '::' to separate the class name from the method name).    Thus the specification above groups methods by class.   Powerful!

Another useful technique is to take advantage of the fact that the full path name of a module is matched, to group even more broadly than by module.  For example, because '*' matches any number of any characters, the pattern:
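
    system32\*!->OS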

Will have the effect of grouping the methods of ANY module that has 'system32' as part of its path into the group 'OS'.   This is very convenient because typically this is what people want.  They don't want to see any of the details of methods INTERNAL to the operating system; they want them grouped together.  This simple command does this in one swoop.

Grouping precedence and exclusion groups

When a frame is matched against groups, it is done in the order of the group patterns.   Once a match occurs, no further processing of the group patterns is done for that frame (first one wins).   Moreover, if the GROUPNAME is omitted, it means 'do no transformation'.   These two behaviors can be combined to force certain methods to NOT be in a group.  For example the specification:
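
(a sketch; 'myDirectory' stands in for the directory containing your own code)

    *myDirectory*!->;{%}!->module $1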

Forces a module level view for all modules (the second grouping pattern); however, because of the first pattern, any modules that have 'myDirectory' in their path are NOT grouped by the second pattern (they are excluded).  This can be used to create a 'just my code' effect.  Functions of every module except the code that lives under 'myDirectory' are grouped together.   Powerful!

Entry Groups

The examples so far are 'simple groups'.   The problem with simple groups is that you lose track of valuable information about how you 'entered' the group.  Consider the earlier example of grouping all modules in System32 into a group called OS.   This works well, but has limitations.  You might see that a particular function 'Foo' calls into the OS and that whatever it did in the OS takes a lot of time.   Now it may be possible, simply by looking at the body of 'Foo', to 'guess' what OS function was being called, but this is clearly an unnecessary pain.    The data collected knows exactly which OS function was entered; it is just that our grouping has stripped that information. 

This is the problem entry groups solve.   They are just like normal groups but use '=>' instead of '->' to indicate they are entry groups.   An entry group creates the same group as a normal group, but it instructs the parsing logic to take the caller into account.  Effectively a group is formed for each 'entry point' into the group.   If a call is made from outside the group to inside the group, the name of the entry point is used as the name of the group.   As long as that method calls other methods within the group, the stack frame is marked as being in the group.     Thus boundary methods are left alone (they always form another group), but internal methods (methods that call within the group) are assigned to whatever entry point group called them.

This fits very nicely into people's normal notion of modularity.  While grouping all functions within the OS as a single group is reasonable in some cases, it is also reasonable to group them by 'public surface area' (a group for every entry point into the OS).   This is what entry groups do.   Thus the command:
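
    system32\*!=>OS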

Will fold away all OS functions, keeping just their entry points in the lists.  This is VERY powerful!

Group Descriptions (comments)

Groups can be a powerful feature, but often the semantic usefulness of a group is not clear simply by looking at the pattern definition.   Because of this, groups are allowed to have a description that precedes the actual group pattern.  This description is enclosed in square brackets [].   PerfView ignores these descriptions, but they are very useful for humans trying to understand the intent of the pattern. 
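
For example, the entry group shown earlier might be written:

    [group OS entry points] system32\*!=>OS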

Folding (inlining)

Folding by name (FoldPats TextBox)

See also Simplified Pattern matching.

It is not uncommon that a particular helper method will show up 'hot' in a profile.  You have looked at this helper method and it is as efficient as it can be made.  There is no way to make it better.   Thus it is no longer interesting to see this method in the profile.   You would prefer that this method was 'inlined' into each of its callers so that they get charged for the cost (rather than it showing up in the helper).  This is exactly what folding does.   The 'FoldPats' text box is simply a semicolon-separated list of patterns to fold away.   Thus the pattern:
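
    MyHelperFunction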

Will remove MyHelperFunction from the trace, moving its time into whoever called it (as exclusive time).  It has the effect of 'inlining' MyHelperFunction into all its callers. 

Grouping transformations occur before folding (or filtering), so you can use the names of groups to specify folding.  Thus the fold specification:
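
    OS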

Will fold away all OS functions (into their parents) in one simple command. 

Folding away small nodes (The Fold % TextBox)

Generally speaking, if a method does not consume more than, say, 1% of the total in the view, then it is usually just 'cluttering' up the display. The Fold % TextBox is designed to remove this noise. Any method whose total aggregate inclusive metric (that is, what is shown in the ByName view in the 'Inc' column) is less than the Fold % value is removed, and its metric is given to its direct parent.

While it is tempting to increase this number to a large value (say 10% or more) to force most call stacks to be 'big', this generally produces inferior results. The reason is that the % does not take into account the semantic relevance of the node. Thus folding might fold a very semantically meaningful node into a 'helper' of some higher level function. It is usually better to select nodes that 'you don't understand' to fold away, so that what you are left with is nodes that are meaningful to you.

Filtering

Filtering Stacks with Particular Frames (The ExcPats TextBox)

Grouping and folding have the attribute that they do not affect the total sample count in the trace.   Samples are not removed; they are simply renamed or assigned to another node.    It is also useful to exclude nodes altogether.    The ExcPats text box is a semicolon-separated list of simplified regular expressions (see Simplified Pattern matching).  If any frame in the stack matches ANY of the patterns in this list, then the sample is removed from the view.   The pattern does not have to match the complete frame name unless it is anchored (e.g. using ^).   The patterns are matched AFTER grouping and folding.  

A common use of exclusion filtering is to find the 'second most problematic' performance problem in an app.   In this scenario you discover that a particular method (say 'Foo') was poorly designed and you even understand how you might fix it, but you also know that it is not your only problem.   What you want is to find the next most important issue.   By excluding the samples that call 'Foo' you can effectively simulate how the program would behave if Foo was 'perfect' (took no time).   This is typically a good approximation of what the program will look like after the fix is applied.   Thus by simply excluding these samples you can look for the next performance problem, and in this way tackle many of them quickly.

Filtering any Stacks that do not Include a Particular Frame (The IncPats TextBox)

By default events are captured machine wide, but often you are only interested in some of the samples.  For example it is very common to only be interested in one process, or one thread, or isolate yourself to only one method.   This is what the IncPats textbox does.   The contents of the text box is a semicolon separated list of simplified regular expressions (see Simplified Pattern matching).    It is required that a stack matches at least ONE of the patterns in the IncPats list for it to be included in the trace.  The pattern does not have to match the complete frame name unless it is anchored (e.g. using ^).   The patterns are matched AFTER grouping and folding.  

As mentioned, it is very common to use the IncPats textbox to restrict your analysis to a single process.   It is also very useful to use the '|' (or) operator here so that you can include just two (or more) processes and exclude the rest. 

Filtering by Time (The Start and End TextBoxes)

It is very useful to 'zoom in' to a particular time of interest and filter out samples outside this range.   This is done by setting the 'Start TextBox' and 'End TextBox' appropriately.  These ranges are inclusive (on both ends), and are expressed as msecs from the start of the trace.     You can of course enter times manually or cut and paste numbers from other parts of the display.   In addition, if you paste two numbers into the 'start' textbox it will set both the start and end values. There are a few other nice shortcuts for setting a time interval. 

Selecting Time Ranges

The 'First' and 'Last' columns of a tree node are often a useful range to filter on.  To do this easily, simply select both cells (either by dragging or by holding the 'Ctrl' key as you click additional entries).  Once you have selected two cells you can right click and select 'Set Time Range', which will set the start and end time to the first and last column. You can also select a time range by copying two numbers to the clipboard (select two cells and press Ctrl-C) and then pasting the numbers into the 'Start' textbox. This textbox is smart enough to recognize that the pasted value is a range and will set the 'End' time appropriately.

It is also very useful to select time ranges based on the 'When' column.  To do this, first select a 'When' cell of interest.   This will cause the status bar at the bottom of the view to display the 'When' text.   By dragging the mouse over the characters, highlight the region of interest (typically the region of high cost).   Then move your mouse off the selected region, right click, and select 'Set Time Range'.  This will set the 'Start' and 'End' time to the region you selected.   You may end up repeating this process to further 'zoom in' to a region. 

Speeding up StackViewer display with sampling.

If there are more than 1M data samples being viewed in the stack viewer, the responsiveness becomes very sluggish (it can take more than 10 seconds to update). To avoid this, some stack sources (most notably the memory stack source) support the concept of sampling. The basic idea behind sampling is to only process every Nth sample. Thus by setting the sampling text box to 10, the stack viewer will only have to process 1/10 of the data and thus should be 10 times faster. When sampling is enabled, the stack viewer automatically scales all counts (and therefore metrics too) in the view by the sampling rate. Thus the resulting metrics and counts are approximately the same as without sampling (you can see this because all counts are a multiple of the sampling rate). 

Finding Items in the View (The Find TextBox)

Text searches of names in the view can be performed by typing a search pattern in the 'Find:' text box in the upper right corner of the stack viewer.   Ctrl-F will bring you to this search box quickly.   The search pattern uses .NET regular expressions, and is case insensitive.   Searching starts at the current cursor position and will wrap around until all text is searched.   The F3 key can be used to find the next instance of the pattern.  When all the text has been searched the app will beep; the next F3 after that starts over. Expressions combined with boolean criteria can be specified in the same way as when filtering on select columns in the Columns to Display textbox.  

Presets (Save Grouping and Folding Preferences)

The GroupPats, FoldPats and Fold % text boxes can be edited to contain custom patterns. These patterns, combined together, can be saved as a named preset.

To create a new preset, use the Preset -> Save As Preset menu item. If the GroupPats text box contains a description (enclosed in []), then the description will be offered as the preset name. Otherwise an automatically generated name will be suggested.

All created presets are added to the Preset menu of all active PerfView windows. Select an item in the Preset menu to activate a preset. The name of the preset will be shown in [] in the GroupPats textbox. Presets are saved across sessions. The Preset -> Manage Presets menu item allows editing existing presets as well as deleting them.


Blocked/Wall Clock Time Investigation: The Thread Time Views

Why Blocked/Wall Clock Time Investigations are harder

Wall clock time investigations break down into two cases.  Either most of that wall clock time is dominated by CPU (in which case a CPU investigation will work), or it is not dominated by CPU time, in which case you also need to understand the blocked (non-CPU) time being consumed.    Thus the 'hard part' of doing a wall clock investigation is understanding blocked time.  

Blocked time investigations are inherently harder than CPU investigations.  CPU investigations are reasonably straightforward because in most scenarios any CPU usage is 'interesting' to investigate regardless of where it happens.  Thus the trivial algorithm of attaching the same weight to every msec of CPU regardless of where it happened is appropriate.   This is actually not true in some scenarios.  For example, if there was a background CPU-bound task on a multi-processor machine, the CPU associated with that background task is likely not very interesting because it is not consuming 'precious' resources and is not on the critical path of some user operation.   Thus if you were investigating CPU on such an application you would need a way of filtering out this 'background' activity so you could concentrate on the 'important' CPU use.   Typically this is easy to do because the threads that execute such background CPU activity are dedicated to background activities (so you can just exclude all samples from those threads).   However, imagine that the background thread was a 'service' and important foreground CPU activity was scheduled on it, interleaved with the idle background activity.  This would make analysis quite difficult.  

This bad situation is EXACTLY the situation you have with blocked time.    Typically there are many threads that spend most of their time blocked, and most of this blocked time is never interesting because it is not part of a critical path.   However, these threads do wake up at least some of the time, and PARTS of their execution can be on the critical path (and thus are very interesting).   Unfortunately there is no simple, general way of separating 'important' blocked time (on a critical path) from uninteresting blocked time without additional 'help' (annotation) about the INTENT of the program.   Thus the 'trick' to doing a blocked time analysis is to use scenario-specific mechanisms to tag the 'important' blocked time and allow it to be separated from the (large amount of) unimportant blocked time.  

Understanding Thread Time

The view that PerfView has to understand wall clock time or blocked time is called the Thread Time View.   This view is based on the observation that, at any instant in time, every thread is doing 'something'.  It might be consuming CPU, or it is not (which we will define as BLOCKED).   If it is BLOCKED, it might be because it is waiting for its turn to use a processor (which we call READIED), or it may be waiting on something else (e.g. for a DISK request to respond, for the NETWORK to respond, or for some synchronization object (e.g. Event, Mutex, Semaphore ...) to change state).  Whatever it is doing, there is a stack associated with it.   Thus at every instant of time every thread has a stack, and that stack can be marked with a metric that represents the wall clock time the thread consumed at that call stack.    This is a 'perfect' model of what every thread is doing on the system.

If you set the 'Thread Time' checkbox on the collection dialog, or pass the /ThreadTime qualifier on the command line, PerfView will ask the operating system to collect the following information:

  1. Every millisecond, what stack each processor (CPU) is working on (this is present even without the /ThreadTime qualifier)
  2. On every context switch (when a thread transitions from running to blocked) the stack of the thread that is starting to run
  3. The time any thread gets created or destroyed. 

With this data we have 'perfect' information on where we are blocked.  We know the exact time when we started to block and when we ended, and thus can attribute exactly the correct amount of time to that particular stack.   We also have approximate information on where CPU time is spent.    If we get a sample (which might be a CPU sample or a context switch) we can attribute that stack with the time spent since the last sample was taken (which again is either a context switch (e.g. if the thread had the CPU less than 1 msec) or another CPU sample (e.g. if it has been longer than 1 msec since the last context switch)).  Thus with the events above we can do a VERY good job of detailing exactly where each thread spent its time.   It is interesting to note that you get 'perfect' information on EXACTLY how much CPU time things use (since you know exactly when threads start and stop consuming CPU); the only imperfection is that the stacks associated with CPU are only a sampling. 

This transformation of context switch and CPU samples is the foundation of the 'Thread Time Stacks' view in PerfView, and it is the view of choice for understanding wall clock time (or blocked time).   Like the CPU stacks view, the Thread Time Stacks view shows an inclusive 'tree' which aggregates all these stacks of where threads spend their time.   At the bottom (away from thread start) end of each stack, a pseudo-frame is appended which indicates what is known about that stack (CPU_TIME, DISK_TIME, HARD_FAULT (disk time to fetch mapped files), NETWORK_TIME, READIED_TIME or BLOCKED_TIME).   For some things more is known (like the file or network port), so pseudo-frames get inserted for those too.    These tags make it easy to use PerfView's folding, grouping and filtering capabilities to look at only certain causes of delay. 

A Wall Clock Time Investigation

In broad strokes, a clock time investigation consists of the following steps

  1. Collect a trace with the Thread Time events.   This is done using the PerfView Run or PerfView Collect commands, but you need to tell PerfView to also collect the context switch information by either
    1. Setting the ThreadTime checkbox in the Data collection dialog box
    2. Passing the /ThreadTime qualifier on the command line to PerfView
  2. Open the 'Thread Time Stacks' View of the resulting ETW data.
  3. Find the segment of time in a single thread that is interesting to you.   This is the critical part because you really only want to see the wall clock time (or blocked time) that is on your critical path.   Techniques for doing this depend on your scenario.    Here are some possibilities for 'easier' cases:
    1. For simple sequential programs with synchronous I/O (a very common case, including typical application startup), you simply need to find the method that represents the 'work' you are interested in, and use the 'Include Item' (Alt-I) operation to narrow the view to that method (which is on a single thread). 
    2. For ASP.NET applications that don't use asynchronous I/O, the ASP.NET Thread Time View will group together the fragments of threads that were on the critical path for a particular request.   Thus by using 'Include Item' on the frame representing a request (or group of requests), you can see only the 'interesting' time.
    3. If the application uses System.Threading.Tasks.Task, you can use the 'Thread Time (with Tasks)' view.  This marks the segment of a thread that is executing a single task with the ID of that task.  It also attributes a task's time to the call stack of the task that activated it.   In this way concurrent programs can be analyzed as if they were single-threaded sequential programs.
    4. You can use System.Diagnostics.Tracing.EventSource to emit events for interesting (often small) operations in your application.  If these operations do not do async I/O or otherwise spawn work on another thread, the events can be used to find an interesting segment of a single thread.  You can then use 'Include Item' on the thread of interest, as well as the 'start' and 'end' time ranges, to find an interesting part of a thread to analyze. 
  4. Once you have narrowed your interest to the time range of a single thread, you can proceed to analyze it.   Typically you do this by switching to the 'By Name' view and simply looking at the 'types' of time being consumed (CPU, BLOCKED, HARD_FAULT, READIED, DISK, NETWORK).  From here the analysis is much like a CPU analysis. 

To recap, a Wall clock (or blocked time) investigation always starts with filtering to find 'interesting' wall clock time (typically on a single thread).  Until you get to this point you can't sensibly interpret the 'Thread Time View', but after you have found the interesting time, it proceeds much like a CPU analysis. 

Blocked time and Causality (ReadyThread)

Sometimes identifying the size and call stack of blocked time is sufficient to understand a particular performance problem.   For example, analyzing the cold startup time of an application falls into this category because the reason the blocked time is as long as it is is clear (a disk read was needed), so the only questions are how long these operations are and where they occurred (what stack caused them).    However, in other scenarios the issue is understanding why a delay is as long as it is.  For example, if a thread is blocked waiting on a lock, the interesting question is: why was some other thread holding the lock so long?  To answer this question you need to determine which thread was holding the lock.   Questions like this are what the ReadyThread event helps answer.

When you turn on the /ThreadTime events, not only do you turn on the context switch events, you also turn on the ReadyThread events.   A ReadyThread event fires when one thread causes another thread to change from being BLOCKED to being runnable (that is, it makes a thread READY to run).   Thus if thread A is waiting on a lock that thread B owns, when thread B releases the lock it makes thread A ready to run.    When a ReadyThread event fires in this example, it logs both threads A and B as well as the stack of thread B.   Loosely speaking, READYTHREAD logs the fact that thread B CAUSED thread A to wake up. 

PerfView has a special view for displaying READYTHREAD information called the 'Thread Time (with ReadyThread)' view.   This view works just like the 'Thread Time' view, but in addition, every stack where a thread blocks is 'extended' with additional frames that tell you the thread and stack that woke it up.   These extra frames are suffixed with '(READIED_BY)' so that you can easily see these are not ordinary frames (and you can fold them away if you like).  In the example of a thread A waiting on a lock and being awakened by thread B releasing the lock, you would see:
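
In sketch form (illustrative; 'X!LockEnter' and 'X!LockExit' stand in for your locking functions, and the exact shape of the extra frames may differ):

    X!LockEnter
      BLOCKED_TIME
        Thread B (READIED_BY)
          X!LockExit (READIED_BY)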

Which clearly shows that after blocking in 'X!LockEnter' the thread was awakened by thread B calling 'X!LockExit'. 

How Tasks make Thread Time Easy (The Thread Time (with Tasks) View)

If you have not already read the basics of Understanding Thread Time you should read that now. This section builds on those basics.

It is strongly recommended that if you need to do asynchronous or parallel operations, you use the .NET System.Threading.Tasks.Task class to represent the parallel activity or the 'continuation' of the thread after an asynchronous operation completes (the 'await' feature in C# uses Tasks).    What makes Tasks valuable to PerfView is that this class logs events when a task is created (along with an ID for the created task), when the body of the task is invoked (along with the task's ID), and when the task's body completes (again along with the ID).   This helps us in two important ways:

  1. Task bodies represent real user work, and thus can be used to segregate 'important blocked time' from 'uninteresting infrastructure time' (time threads spend blocked waiting for user work).   This is VERY useful.
  2. Tasks know where they were created (who 'caused' them), so there is a very natural way of 'charging' all the time (or other resources) a task uses to its creator.

The 'Thread Time (with Tasks)' view does exactly this.   When a thread calls a task creation method, this view inserts a pseudo-frame at that point indicating that a task has been scheduled, and then inserts all the events for the body of that task at that point.  Here is an example.

  In this example the 'Main' program called 'DoWork', which had code along these lines:
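
(a reconstruction for illustration; 'ComputeStuff' is a hypothetical method standing in for the real work)

    // Inside DoWork (requires 'using System.Threading.Tasks;'):
    // schedule work on another thread via an anonymous delegate.
    Task t = Task.Factory.StartNew(delegate
    {
        ComputeStuff();   // the actual work; runs on a threadpool thread
    });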

  This call causes another thread (in this case thread 848) to start up and begin executing the body (the delegate {...}).  This 'inline delegate' code is called an anonymous delegate, and the C# compiler generates a name for it (in this case 'c__DisplayClass5.<DoWork>b__3'), which does the work (note that PerfView's 'Goto Source' (Alt-D) option is VERY handy at this point for seeing exactly what this code is). 

The important part here is that from a source code level it is very natural to think that any costs (time) spent in this anonymous delegate should be 'charged' to 'DoWork', because that code caused the delegate to actually run (on a different thread).  This is EXACTLY what the Thread Time (with Tasks) view does.  If your application uses Tasks, you should be using this view. 

Making Server Investigations Easy (The Thread Time (with Start-Stop Tasks) View)

At its heart, a server investigation is typically about response time. Thus to do a server investigation you would like all costs that contribute to making this response time longer rolled up together in the display. This is exactly what the Thread Time with Start-Stop Tasks View does.

This is best shown by example. Here is an ASP.NET Web server that was monitored using 'PerfView /threadTime collect'. Because we used the /ThreadTime qualifier, information on context switches and tasks was collected, which allows the 'Thread Time' views to be displayed, including the 'Thread Time (with StartStop Tasks)' display. Here is the result of opening this view and focusing on the W3WP process (which is the web server process).

(Screenshot: the 'Thread Time (with StartStop Tasks)' view)

At the top of the tree, we see the process node, but then immediately all costs are segregated into two parts, things that are associated with some start-stop activity, and everything else. Thus this lets you quickly focus on the thread time that is likely to be of interest.

Under the 'Activities' node you see all 'top level' start-stop activities, sorted by cost (that is, the thread time attributed to that activity). In the view above we opened the 'IISRequest' activity (which has a particular ID number and URL) that happens to have 730.7 msec of thread time. This IISRequest activity happens to cause another nested start-stop pair for an AspNetReq activity, so that is shown, and from there all stacks associated with the AspNetReq activity are shown. In this example we can see the call stack through user code to the method MyOtherAsyncMethod, which does an 'await' that takes 524.5 msec.

Hopefully you can immediately see how useful this view is. Basically it takes all the thread time associated with semantically relevant things (start-stop tasks that someone instrumented into the code), and displays the stacks based on causality (thus even if execution hops threads, the stacks 'follow' it). It becomes trivial to see exactly where time is being spent.

A typical strategy is to immediately select the '(Activities)' node and right click -> Include Item, which will exclude all the non-activity thread time. This works well most of the time; however, keep in mind that some important costs may be in the '(Non-Activities)' node, in particular things like the GC (in server or background GC), or any non-threadpool threads that did work but never logged a start and stop event. This is why PerfView does not hide this node; typically you start by looking at the activities, and only look outside them if you are led there. In that case you will usually filter to just the non-activities and only the CPU_TIME, to see what is 'interesting' in that group.

Thread Time is not Elapsed Wall Clock Time

It is important to note that what is being shown is STILL thread time, NOT wall clock time. Thus if there is concurrency going on, the total metric is very likely to add up to more than the elapsed wall clock time. It is easy to determine whether this is the case (because you will see more than one thread as children of the activity), and you can even see the overlap (by looking at the 'when' column of each of the children). Still, it is something to be aware of. See Understanding Thread Time for more.

It is also possible for the thread time to be LESS than the elapsed wall clock time. This should be a much rarer case. It happens when the code causes work to happen on another thread but does not use the mechanisms that have been instrumented to detect that the work was caused by the current thread. The current thread may return to the threadpool (at which point its time is NOT attributed to the activity anymore), but because the work on the other thread is unknown to PerfView, PerfView can't properly attribute that time to the activity (it ends up under the non-activities node). Thus there can be 'gaps' in the thread time for a request. PerfView tries to fill these gaps with a pseudo-node called 'UNKNOWN_ASYNC', so that the cost in the view is never less than the wall clock time for sorting purposes, but sometimes PerfView's algorithm is not perfect. In any case, it becomes very difficult to determine what was going on during these gaps. Hopefully this simply won't happen to you...

Making your own Start-Stop tasks

Often the 'standard' instrumentation in the .NET Framework gives you good 'starting' activities to work with (as IISRequest and AspNetReq did above). However, if those are not sufficient, you can define start-stop activities of your own. If your code is running on V4.6 of the .NET Framework or beyond, then it is trivial to add new start-stop activities that will show up in this view. See EventSource Activities for details. You will want to turn your events on using /Provider=*YOUR_EVENT_SOURCE_NAME when collecting data, and this view will incorporate them automatically.
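
As a minimal sketch (the EventSource name 'MyCompany-MyApp' and the event methods are hypothetical; methods whose names end in 'Start' and 'Stop' with a common prefix define a start-stop activity pair):

    using System.Diagnostics.Tracing;

    [EventSource(Name = "MyCompany-MyApp")]
    sealed class MyAppEventSource : EventSource
    {
        public static readonly MyAppEventSource Log = new MyAppEventSource();

        // The Start/Stop suffixes mark this pair as a start-stop activity.
        public void ProcessOrderStart(string orderId) { WriteEvent(1, orderId); }
        public void ProcessOrderStop() { WriteEvent(2); }
    }

    // Bracket the operation of interest:
    //   MyAppEventSource.Log.ProcessOrderStart(id);
    //   ... the work to be measured ...
    //   MyAppEventSource.Log.ProcessOrderStop();

You would then collect with /Provider=*MyCompany-MyApp (note the '*' prefix) so these events are included in the trace.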


Unmanaged Memory Analysis

PerfView can also be used to do unmanaged memory analysis.      Typically the first step in a memory investigation (whether it be a managed or unmanaged memory investigation) is to use a tool like the free SysInternals VMMap tool to determine what the memory makeup of your process is.   This tool can break down the current memory usage into half a dozen categories including:

  1. Mapped DLLs and EXEs
  2. Memory allocated by the .NET runtime (the GC heap)
  3. Memory allocated by the unmanaged OS heap (e.g. C malloc or the C++ 'new' operator; called simply 'Heap' by vmmap)
  4. Memory allocated with VirtualAlloc directly (called 'Private Data' by vmmap)

Depending on which of these is big (and thus interesting), you attack it differently.   If mapped DLLs or EXEs are the issue, you need to load fewer of them.  PerfView's 'Image Load Stacks' view will show you where you are loading DLLs.   If the problem is the GC heap, you need to do a GC heap investigation as described in 'When to care about the GC heap'.    If the problem is either of the last two, then this section tells you how to drill into that problem. 

In the end, all memory in a process is either mapped (e.g. DLLs or EXEs) or is allocated by the Windows VirtualAlloc API.  PerfView allows you to collect a stack trace on every VirtualAlloc call (and every VirtualFree call) by checking the 'Virtual Alloc' checkbox on the advanced collection dialog box.  VirtualAlloc was designed to be used to allocate large chunks of data (in fact the minimum size is 64K), so turning this option on is not likely to affect the performance of your app; feel free to do so.   However, precisely because VirtualAlloc is called infrequently (typically when another allocator needs more memory), this information is often 'too coarse' and is only useful when your user code directly calls this API (which is unusual). 

Much more commonly, you will notice in VMMap that the 'Heap' entry in the display is large, and thus you want to drill into the OS heap.  To do this we need to collect data every time an OS heap allocation or free happens.  This is MUCH more common; in fact it is so common that the operating system does not provide a way to turn it on system wide (that would be too much data).  Instead there are two textboxes in the advanced section of the collection dialog box: 

  1. The OS Heap Exe textbox - Specify an EXE name (no path or extension) to turn on OS heap events for a process which has not yet started.
  2. The OS Heap Process textbox - Specify an EXE name or process ID to turn on OS heap events for a process that is already started. 

Using one of these two techniques you can turn on OS heap events for the process of interest.   Optionally you can also turn on VirtualAlloc events. 

Once you have done this and collected data, you will get the following views:

  1. The OS Heap Alloc Stacks view if you asked for OS heap events
  2. The VirtualAlloc Stacks view if you asked for VirtualAlloc events.

The two views work the same way.     Every allocation in the trace is given a weight equal to the number of bytes allocated.   Every free is given a negative weight and the CALL STACK OF THE ALLOCATION (this way allocations and frees perfectly 'cancel out').  Frees that can't be matched up with allocations in the trace as a whole are ignored.   After this, PerfView treats the stacks just like any other stack-based data it processes: it only considers samples that match its filters and displays the result.   Note that this means VALUES CAN BE NEGATIVE.  If you select a time range where only frees happen, then you will get a negative number.   The basic invariant is that the view shows you the NET memory allocation for the range you select.  Because metrics can now be negative, the 'When' column might need to show negative numbers.   These are displayed using lower case letters (see When Column for more). 

Note that this means that if you display the TOTAL execution of a program, in theory you should see a value of 0 (you freed everything you allocated).  In practice this is not exactly true, but what IS true is that you are usually not interested in the FINAL memory used just before process termination, but in the PEAK memory allocation.   To find that, you need to find the time where memory allocation was at its peak.

You can do this (roughly) by going to the 'CallTree View' and selecting the When Column for the root of the hierarchy.   As you drag over regions of the When column, PerfView will compute the net and peak metric in the region that you dragged.   Thus by dragging you can quickly determine where the peak is.  Then you simply need to hit 'Set Range' (Alt-R) and you have the region of time where you built up to the peak memory usage. 

You can also easily investigate the net memory usage of any particular operation by selecting the time range over that operation.  All the normal filtering, folding and grouping operators work for the memory case.    Finally, by opening two views you can use the Diff feature to do an analysis of two runs of the application.


Directory Size Analysis

The directory size menu entry will generate a *.directorySize.perfView.xml.zip file that is a hierarchical summation of the sizes of all files in a directory (recursively). Thus it is a very good tool for determining what is taking up disk space on a disk drive and 'cleaning up' less valuable files.

Selecting this menu entry will bring up a directory chooser that you use to select the directory to analyze, as well as the name of the file that will hold the gathered data. Once selected, PerfView will do a recursive scan of that directory (which may take a while for large directories), and when it finishes it will automatically open the data file it generated. You may reopen the file at any time later simply by clicking on it in PerfView's main tree view.

The 'when' field for directory size works a bit differently than for most performance data. For each file, its 'Timestamp' is the number of days (which can be fractional) from the time the data was collected to the time the file was last modified. Thus by selecting the time range from 0 to 7 you will see all files that were modified less than one week ago. This information can be very useful for seeing how 'old' the data is (which is often useful for deciding whether to keep it or not).

Image Size Analysis

Collecting data

Selecting the Size -> Image Size menu entry will bring up a dialog box you use to specify the DLL or EXE to do the size analysis on. In addition it will allow you to set the name of the output file that holds the resulting data. The dialog will derive an output file name from the input file name, and generally this default is fine.

Analyzing the data

The image size menu entry will generate a .imageSize.xml file that describes the breakdown of the size of a DLL or EXE file. It does this by looking up every symbol for the DLL/EXE in its PDB file and using those names for each chunk of the file. It also looks for references from one part of the file to another (for example, pointers in memory blobs or assembly code to other memory blobs or assembly code). Because these references can form arbitrary graphs of dependency, in the same way that GC heap objects form a graph of dependency, PerfView displays this data in very much the same way as a GC heap. Like a GC heap, the 'When', 'First' and 'Last' columns do not show time but instead represent the address where the particular item is in the virtual address space when loaded. Thus you can also use this to get an idea of the locality of different symbols within the file when loaded.

Flattening the Trace

As mentioned, by default PerfView tries to create a 'GC heap' of the items in the DLL: if one item refers to another, there will be a link from the referencer to the object being referenced. However, this behavior can interfere with some analysis. In particular, if you use the 'include pats' or 'exclude pats' textboxes, they will include or exclude based ON THE ENTIRE PATH. When this is not what you want, one easy way to fix the problem is to 'flatten' the graph.

Flattening a set of nodes takes one set of nodes, and returns a new 'GC Heap' where:

Thus if you go to the 'RefTree' view, select the metric associated with the 'ROOT' node, right click, and select 'Flatten', you will get a new view in which there are no links between nodes. Now the 'include pats' and 'exclude pats' will select a node based ONLY ON THAT NODE'S NAME (not the name of any of its parents).

Meaning of certain tags in a Image Size analysis

Many of the names used in the image size report are symbolic names that have a direct relationship with the names in the source code. However, other names describe entities of the Portable Executable (PE) file format which are needed to prepare the code/data in the DLL/EXE to be run. Here we describe some of these that may show up prominently in the output.

Other names are associated with the .NET Runtime Native file format.

IL Size Analysis

Collecting data

Selecting the Size -> IL Size menu entry allows you to do an analysis of what is in a .NET Intermediate Language (IL) file, which is what .NET compilers like C# and VB create. It will generate a .gcdump file containing a graph of the types, methods, fields and other structures in the IL file, where each node of the graph indicates how big it is in the file, and the arcs between the nodes are references from one item to another. Thus you can do dependency analysis (what things refer to what other things) in the same way as for objects in a GC heap.

The Size -> IL Size menu entry will bring up a dialog box you use to specify the DLL or EXE to do the size analysis on. This file needs to be a DLL or EXE that contains .NET IL (e.g. the output of a .NET compiler). In addition it will allow you to set the name of the output file that holds the resulting data. The dialog will derive an output file name from the input file name, and generally this default is fine.

Analyzing the data

The IL size menu entry will generate a .gcdump file that describes the breakdown of types, methods, fields and other items in the IL file. It works in much the same way as the GC heap analysis or the native Image Size Analysis.

Multi-File heap

The menu entry only allows you to specify one IL file when creating the node-arc graph for the IL code. Any references outside this file are not traversed, but simply marked with a special 'external reference' node. It is sometimes useful to select a group of IL files (e.g. representing a complete application) which are all traversed, so that 'external reference' nodes are used only when you leave this group. You can do this with the 'ILSize.ILSize' user command. Thus the command:
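
(an illustrative invocation; File1.dll etc. stand in for your own assemblies)

    PerfView userCommand ILSize.ILSize File1.dll File2.dll File3.dll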

Will create a GC heap of File1.dll File2.dll and File3.dll as if they were one file.

Multi-Scenario Analysis (Aggregating Traces)

Often, it is useful to analyze performance of one program across multiple traces. These traces might represent one large project in a variety of scenarios, or the behavior of a common library being used by multiple programs. PerfView supports several features for this sort of multi-scenario analysis.

A main challenge when doing analysis of multiple scenarios (data files) simultaneously is simply the quantity of data being manipulated.   Individual scenarios can often have an ETL file that is 100s of megabytes, and if you have 100 such scenarios you are now talking 10-100 GB of information to process.  Because of this, the process is designed to reduce the data volume as quickly as possible and to persist this 'lean' form, so that the data volumes at viewing time are kept under control.   Thus there are two main steps in working with multiple scenarios: 

  1. For each .ETL (or .ETL.ZIP) file, create a new file (a .PERFVIEW.XML.ZIP file) that contains just the information needed to view the data in the PerfView StackViewer.   This reduces the data volume by a factor of 100 or more.   This step can be done 'off-line' and, once complete, does not need to be repeated until new data comes in.   The tool is 'smart' in that if new input files are added to an existing set of data files, it skips the files that were already converted.   This process takes a few seconds to tens of seconds for each data file actually converted.   If you have important unmanaged DLLs in your scenario, it is important that the PDB symbol path (e.g. _NT_SYMBOL_PATH) is set properly at this stage.  Once converted to an XML.ZIP, it is no longer possible to resolve symbols. 
  2. A new kind of viewing file (a .SCENARIOSET.XML file) represents the aggregation of a set of PERFVIEW.XML.ZIP files.  When you open a file of this type, PerfView shows you the data from all the data files simultaneously.   You can generate many of these files to form different subsets of the same data files.  When PerfView opens these files, each data file is given a 'top node' (above the 'process node') that represents the data file.   PerfView's standard grouping techniques can then be used to zero in on the area of interest (e.g. how much a particular library or function is used across all scenarios, or where CPU time is spent 'on average' over all scenarios).   In addition, PerfView has special features (the 'which column') that help you quickly understand which scenarios are contributing to any particular metric.   Once 'hot' areas are discovered, you can use the 'which column' to understand how uniformly the problem is distributed across scenarios. 

The following is more detailed instructions on performing these steps.

Step 1: Preprocessing ETL Data and Forming the ScenarioSet Representing All the Data Files

The first step in viewing multiple data files simultaneously is to preprocess the data into a 'Scenario Set'.    You can do this with the 'SaveScenarioCPUStacks' user command (currently only CPU sampling aggregation is supported).   You can run it from the PerfView GUI using the 'File->UserCommand' menu item, or from the command line by executing the following:
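
(the directory name is illustrative)

    PerfView userCommand SaveScenarioCPUStacks C:\ScenarioData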

The SaveScenarioCPUStacks command takes one argument. This argument can be a directory name (as in the example above) or the path to an XML config file.

If you pass in a directory, SaveScenarioCPUStacks will run in "automatic" mode. It will process all ETL and ETL.ZIP files found in the directory (or any sub-directory), using a heuristic to automatically detect the process of interest for each trace.  The heuristic used to pick the process of interest is:

  1. If the trace contains a Win8 Store app, then the first Windows Store app is chosen. 
  2. If there is no Windows Store app, then the first executable to start that runs for more than half the trace length is chosen (this tends to ignore setup scripts).
  3. If no app matches (2), then the first app to start after the trace starts is chosen.

Typically this heuristic approach works well, however if you need control over how SaveScenarioCPUStacks runs, you can pass in an XML configuration file that gives you fine control over the processing of the ETL files.   Here's an example XML config file:

<ScenarioConfig>
    <Scenarios files="*.etl" name="Win8 Store scenario [$1]" />
    <Scenarios files="ScenarioProcess.etl.zip" name="PerfView" process="procexp64"
         start="1000" end="5000" />
</ScenarioConfig>

As you can see, a config file is composed of a root ScenarioConfig element, which contains one or more Scenarios elements. Each Scenarios element has attributes that control how scenarios are processed:

Running the SaveScenarioCPUStacks command produces the following output files:

If you'd like, you can also generate your own scenarioSet.xml file. A scenarioSet file is similar to a scenario config file, but with slightly different attributes.  Here is an example scenarioSet file:

<ScenarioSet>
    <Scenarios files="*.perfView.xml.zip" namePattern="Example scenario [$1]" />
    <Scenarios files="foo.perfView.xml.zip" namePattern="Example scenario [baz]" />
</ScenarioSet>

As you can see, it is basically a list of file patterns (which indicate which files in the directory (or any subdirectory) holding the ScenarioSet.xml file should be included), as well as a pattern that allows you to take each file name and convert it to a scenario name.    You can make your own XML files to create interesting subsets of some data.

Step 2: Viewing Multiple Scenarios

Once you've processed your scenario data, you can then proceed to view it. To do this, use the treeview in the main view to browse to the generated scenarioSet.xml data file and double-click to open it.

For the most part, this is the familiar Stack viewer you use on a single ETL file; the main difference is that each stack from a particular data file (scenario) has a new pseudo-frame at the very top that identifies the scenario that the sample comes from.   Thus stacks belong to threads, threads belong to processes, and processes belong to scenarios.   Everything else about the stack viewer works as it did in the single-scenario case.  The stack view appears as if every scenario ran simultaneously on the same machine.

In addition to the new 'top' node for each stack, the viewer has a couple of enhancements that are only visible in the multi-scenario case, most notably the 'which' column.

In the same way that the 'when' column shows, for every row in the view, a small graph displaying the samples as a function of time (a histogram), the 'which' column shows you a histogram of the scenarios that contributed samples to that row.   Thus you can quickly determine whether the cost of that row was uniformly distributed across scenarios or whether just a handful of scenarios contributed to the cost. 

The which field has a number of handy features associated with it.


Merging

If you intend to transfer the data collected with PerfView to another machine, an additional step called merging is needed. 

PerfView uses the Event Tracing for Windows (ETW) facility built into windows to collect profiling information.   This infrastructure does not naturally create a single file for the data, but segregates data that came from the OS kernel from other events.   Thus the 'raw' data generated consists of two files (one with the .etl suffix, and another with the .kernel.etl suffix).   Moreover these files are missing some information that is needed to fully decode the file on another machine (most notably, the mapping of OS kernel names to NTFS file names, and the symbol server 'keys' that allow unambiguous lookup of symbolic information (PDBs)).   Neither of these limitations is a problem if you consume the data on the same machine it was collected on, but if you wish to transfer it to another machine, you should first merge the data. 

Merging is a process by which the .kernel.etl file is merged into the main .etl file.   In addition the missing system-specific information is gathered up and also placed in the .etl file.  The result is a single file that can be copied to a different machine for analysis.   This process can take a non-trivial amount of time (10s of seconds), which is why PerfView does not do it by default.    You can perform merging by:

  1. Checking the 'Merge' checkbox in the collection dialog before collecting the data.
  2. Right clicking on the file in PerfView's main file view and selecting the 'Merge' command.
  3. Running the 'PerfView merge' command from the command line.

Once the file is merged, you can simply copy the single file to another machine for 'off-line' analysis.    Note however that while the ETL file contains symbolic information for .NET Runtime code, it does NOT contain symbolic information for unmanaged code.   Thus if it is important to see the symbolic names for unmanaged code, you need to ensure that the machine on which analysis occurs has access to the PDB files that contain this information.  


NGen Pdbs (and Zipping)

Merging is an operation necessary to view ETL files on a machine other than the machine the data was collected on. However it is not sufficient for all cases. While the resulting merged file has all the information needed to look up symbolic information (for stack traces), it is not guaranteed that the symbolic information will actually be available. In particular, when collecting traces whose processes use the .NET runtime, it is necessary to reference the symbolic information (PDB files) for the native code images (NGEN images) of the managed code (if it was NGENed). These NGEN PDBs are NOT the PDB files for the IL images (something created by IL compilers like CSC.exe or VBC.exe). The NGEN PDBs are generated by the NGen.exe command that comes with the .NET framework and can only be reliably generated on the machine that generated the NGEN image.

As part of the ZIPPing process, PerfView will look up all addresses in the ETL file and determine which NGEN images were used, and if necessary generate the PDB files for those images. It will then ZIP both the ETL file as well as any NGEN PDBs into a single ZIP file that can now be viewed on any machine (PerfView knows how to automatically unpack these files).


Collecting Data from the Command Line (Scripting, Automation)

See also PerfView Extensions for advanced automation by building an extension for PerfView.

See also Command Line Reference for a complete list of the options you can use at the command line

PerfView is designed so that you can automate collecting profile data by using a batch file or other script. The three likely scenarios are:

  1. The user simply wants to quickly collect data from the command line for immediate analysis, either on the same machine or a different machine.
  2. The user wants to make a simple script to automate data collection but still needs to be present during collection (e.g., hand testing a GUI app), but does not wish to immediately analyze the data (someone else will do that).
  3. Data collection is completely automated, for completely unmonitored collection.

In the first case you are likely to want to use either the 'run' or 'collect' commands.
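
    PerfView run MyApp.exe
    PerfView collect

(MyApp.exe above stands in for whatever command launches the program you wish to measure.)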

The 'run' command immediately runs the command and launches the stack viewer.   This is the preferred option if it is easy to launch the program and it can be run to completion.   Sometimes, however, it is difficult to do this (the app is part of a service, or is activated by a complicated script); in that case you can start system wide collection with the 'collect' command. 

Skipping Rundown (/NoRundown)

By default the 'collect' command performs a 'rundown', in which the information needed to properly decode symbolic information is collected before profiling stops.   This operation can be relatively expensive (it takes seconds, and increases file size by 10s of Meg).    This information is naturally provided when processes shut down, but the 'collect' command does not know if you shut down the process of interest, so it performs the rundown.     If you know that the process of interest has exited, then rundown is pointless and can be avoided by specifying the /NoRundown qualifier.   This option can save time and file size. 

Suppressing Viewing  (/NoView)

By default PerfView assumes you wish to immediately view the data you collected, but if the person collecting the data (e.g. a tester) is not the person analyzing the data (e.g. a developer), then we wish to suppress the viewer.   This is what the /noView qualifier does, and it works on both the 'collect' and 'run' commands.  Thus
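
    PerfView /noView run tutorial.exe

(tutorial.exe here is just an example of a command to run.)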

Will turn on logging and run the given command.  It will also merge the file, under the assumption that the file is likely to be moved off the current system.  It will however still bring up the GUI, and it will not exit automatically when it is done (so that the user can react to any failures or messages; keeping the GUI up is also required for the 'collect' command so that the user can indicate when collection should stop). 

Automating Collection  (/LogFile:FileName)

See also Command Line Reference for a complete list of the options you can use at the command line

The /NoView qualifier makes sense where it is hard to fully automate data collection (measuring an ad-hoc scenario in a GUI app).   However for fully automatic collection you don't want the GUI at all.  This is what the /LogFile qualifier is for.   By specifying this qualifier you indicate that no GUI should be opened and that the program should exit after running the command on the command line.   Any error messages that would have been reported in the GUI are instead APPENDED to the log file (we append so you can use the same file for several PerfView commands).   The exit code of the PerfView process will indicate the success or failure of the collection, and the log file will contain the detailed diagnostic messages.   

Note that the /LogFile qualifier will suppress the GUI, but it will not suppress the generation of a console if the 'Collect' command is specified and no /MaxCollectSec qualifier is given. The reason is that without /MaxCollectSec=XXX the Collect command could run forever and you would have no way of stopping it cleanly (you would have to kill the process). If you wish to use /LogFile and Collect (because you wish to use the /StopOn* qualifiers), and wish to suppress any consoles, you can do this by specifying a very large /MaxCollectSec value.

In addition to the /logFile qualifier it is good to also apply the /AcceptEula qualifier to scripts that call PerfView. By default, the first time PerfView is run on any particular computer it displays a pop-up that asks the user to accept the usage agreement (EULA). This can be problematic for scripts since it requires human interaction. To avoid this you can use the /AcceptEula qualifier on the command line, which does this operation silently.

Thus a typical use of the /logFile and /AcceptEula qualifiers is the command
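
    PerfView /logFile=perfView.log /AcceptEula run tutorial.exe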

which runs the 'tutorial.exe' from a script (no GUI).   If you need to collect system wide (you want to use 'collect', not 'run'), there is a problem because PerfView does not know when to stop.  There are two ways to solve this problem.  The first is to use the '/MaxCollectSec' qualifier.  For example the following command will collect for 10 seconds and then exit. 
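
    PerfView /logFile=perfView.log /AcceptEula /maxCollectSec=10 collect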

If you wish to control the stopping by some other means besides a time limit, you can also use the 'start' and 'stop' and 'abort' commands. 
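
    PerfView /logFile=perfView.log /AcceptEula start
    PerfView /logFile=perfView.log stop
    PerfView /logFile=perfView.log abort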

These are meant to be used in scripts.   The first will start logging and leave it on even after program exit.  The second stops logging.     You should avoid using these (use collect /MaxCollectSec instead) if you can.   The reason is that if the script were to fail between the start and stop commands, logging might not be stopped and will run 'forever'. Thus some care is necessary in using these.   The 'abort' command is meant to help ensure that PerfView is not logging.    It is meant to be called at locations where you know that PerfView should NOT be running, and it ensures that indeed it is not.   You should use it liberally in scripts that use the 'start' command.

Minimizing Impact of Collection on the System  (/LowPriority)

The normal Event Tracing for Windows (ETW) logging is generally very efficient (often < 3%) however after a trace has completed, PerfView normally does relatively expensive things to package up the data (including merging, NGEN symbol creation and ZIP compression). These operations obviously can use resources that may slow down whatever else is running on the machine.

If you pass the /LowPriority option to PerfView on the command line, PerfView will do these operations at low CPU priority. This can significantly slow down the time it takes to package up the data, but it minimizes the impact to the system.


Using PerfView inside Windows Server (Docker) Containers

A container can best be thought of as a light weight virtual machine. See Windows Containers on Windows 10 for more background on containers for windows. In particular windows supports a light weight container called a 'Windows Server Container' in which the kernel is shared among all the containers running on a machine. Such containers are used in conjunction with a tool called Docker, which allows you to create OS images and run applications in the virtualized environment.

Ideally containers should be irrelevant to using PerfView, since containers are a kind of windows operating system and PerfView is just a windows application running there. This is mostly true, but there are some differences that need to be considered.

  1. Because containers share the kernel, and the ETW events that PerfView relies on are generated by the kernel, it requires special support in the operating system to 'virtualize' the events and forward them to the ETW session in the appropriate container. This support was added in the RedStone (RS) 3 version of the operating system (also called version 1709, released 10/2017). The command 'cmd /c ver' will tell you the BUILD version of the OS you are currently running on, and the Windows 10 version history page can correlate that to your Windows 10 version. Note that as of that release only the CPU and context switch events are supported, but that is enough to do a lot of useful analysis.
  2. Containers don't have GUIs, and PerfView is a GUI app. What this means is that if you run PerfView from a command prompt in a container, it will seem to do nothing. What it is actually doing is launching the GUI, which you don't see, and detaching from the current console. Thus it is doing exactly what it always does, it is just not as useful in a container. However PerfView supports powerful command line options to automate collection, and these work fine in a container.

Thus PerfView works in a container, but you need to ensure you have a new enough version of the operating system, and that you use the techniques in Automating Collection to collect data without using the GUI.

Container Use Example

An example is worth a thousand explanations, so here is an example. First you need to install Docker for Windows from the web. There are plenty of good tutorials online for that. Once you have Docker set up you can do the following

  1. docker run -it microsoft/windowsservercore:1803 cmd

which will pull down the 1803 version of Windows Server Core (it is about 5GB) and run the 'cmd' command in it. Obviously you can pull down later versions as well (1803 is the RS-4 version, and was released in 4/2018). The important part is that it is RS-3 or later. The result is a C> command prompt.

At this point you can copy PerfView into your container (e.g. 'net use \\SomeShare\SomeSpot'). Once you have PerfView copied you can do

  1. PerfView /logFile=log.txt /maxCollectSec=30 collect

Which will cause PerfView to disconnect from the console, logging any diagnostics to log.txt. Ultimately this command will create a PerfViewData.etl file in the normal way. You can do 'type log.txt' to see how things are progressing as it runs. If you put this command in a batch file, it will not detach from the console and thus the batch file will not continue until the collection is done. Thus you can make a batch file that calls PerfView, and then copies the resulting file somewhere. You can also use the 'start' and 'stop' PerfView commands instead of the 'collect' command if you wish to have your batch file start collection, kick off some operation while monitoring, and then stop it. The point is that this works just like normal windows, and PerfView is very flexible. You will be able to do just about anything.

Windows Nanoserver and PerfViewCollect

The windowsservercore docker image is a pretty complete version of windows. In particular it has a complete .NET Runtime on it, which is what PerfView needs to run. Microsoft also supports an even smaller Docker image of windows called microsoft/nanoserver (which is 300MB, not 5GB). This OS does support ETW, and thus in theory you could collect PerfView data on it, but it does not have the desktop .NET runtime, so the PerfView.exe tool itself can't run. This is what the 'PerfViewCollect' tool is for.

PerfViewCollect is a version of PerfView that has been stripped of its GUI (it only does collection), and built using the .NET Core runtime. When building .NET Core applications you can build them to be self-contained meaning that the application comes with all the .NET runtime and framework DLLs needed to run it. Thus you only need the basic OS functionality, and in particular it will run on the NanoServer.

Currently we don't create a binary distribution of PerfViewCollect; it must be built from the source code at https://github.com/Microsoft/perfview. To build, however, you don't need Visual Studio; you only need the .NET Core SDK.  Thus the procedure is
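
    git clone https://github.com/Microsoft/perfview
    cd perfview
    dotnet publish -c Release -r win-x64 --self-contained src\PerfViewCollect

(The dotnet publish arguments above are representative; any self-contained win-x64 publish of the src\PerfViewCollect project should produce the tool.)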

This last command will build the PerfViewCollect application as a self contained application. The tool tells you where it put it, but it should be in src\PerfViewCollect\bin\Release\netcoreapp3.1\win-x64\publish. The tool is the PerfViewCollect.exe in that directory. You can do a PerfViewCollect /? to get some help (it will be exactly the same command line help as PerfView.exe).

If you copy this directory to your nanoserver you should be able to run the PerfViewCollect.exe there as well.  Thus you can run the command
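
    PerfViewCollect /logFile=log.txt /maxCollectSec=30 collect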

to collect data on Windows nanoserver.

Known issues (in Windows Version 1803 or earlier)

There is a known issue as of 10/2018 (or earlier). Basically the issue is that DLLs that are part of the operating system in the container (e.g. the kernel, ntdll, kernelbase ...) end up using the HOST paths, not the CONTAINER paths. This would not be that big of a deal, except that the DLL load events do NOT contain the special unique identifier that is used to find the symbol file for the DLL on the Microsoft symbol server. Normally, as part of preparation (merging) of the file to be copied off system, these unique IDs are added to the trace. However because this is done IN THE CONTAINER and the events have the HOST paths, the logic that does this fails, so there are no unique IDs for the system DLLs. This means PerfView can't look up the symbol names.

There is a work-around. If you get the correct symbol files (PDBs) and place them in a directory and use the File -> Set Symbol Path to include this directory, AND you pass the /UnsafePDBMatch option to PerfView, then it should work.

There are a variety of ways of getting the correct symbol file, but one way is to use a debugger in the container and ask the debugger to load the necessary system files. Then go to where the debugger put them.


Production Monitoring

See also Command Line Reference for a complete list of the options you can use at the command line

PerfView has a few features that are designed specifically to collect data on production workloads to diagnose performance problems that only occur under real-world loads. We have already seen the /noView option, which indicates that after data collection completes PerfView should simply exit (rather than try to display the data). There are a couple of other useful command line options that can be used for production monitoring. First is the /MaxCollectSec:N qualifier. The command
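
    PerfView /LogFile:collectionLog.txt /MaxCollectSec:20 collect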

Will indicate that PerfView should collect for at most 20 seconds. Thus this command needs no user interaction to collect a sample of data. Because the /logFile option was also given, any diagnostic information about the collection will be sent to 'collectionLog.txt'. Thus this completely automates collection of data on a server machine in a single command line command.

Using Performance Counters to trigger collection stop (Stop Trigger qualifier)

The /MaxCollectSec qualifier is useful to collect sample immediately. However it is not uncommon that servers experience intermittent performance problems (e.g. bouts of high CPU or high GC usage etc). Thus what is desired is the ability to monitor the server and only capture a sample when something 'interesting' is happening. This is what the /StopOnPerfCounter option is for. The basic syntax for the /StopOnPerfCounter qualifier is
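
    /StopOnPerfCounter:CATEGORY:COUNTERNAME:INSTANCE OP NUM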

Where CATEGORY:COUNTERNAME:INSTANCE indicates a particular performance counter (following the same naming convention that PerfMon uses), OP is either a < or a > and NUM is a number. For example
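
    PerfView "/StopOnPerfCounter:.NET CLR Memory:% Time in GC:_Global_>20" collect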

Indicates that PerfView should collect data until the _Global_ instance (which represents sum of all GC heaps for all processes on the system) of the '% Time in GC' for the '.NET CLR Memory' category is greater than 20%. Thus this specification will trigger when GC time is high. By default the 'collect' runs in 'circular buffer mode' with a default size of 500MB. Thus the command above will only collect 500MB of data (typically this is a few minutes of data) and then it starts discarding the oldest data. When the performance counter triggers, then the command stops and you will have the last few minutes of data that lead up to the 'bad perf' (in this case high GC time).

Some counters (like the system global counter 'Memory:Committed Bytes') do not have an instance because there is only one for the whole machine. For these, specify an empty string. For example
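
    PerfView "/StopOnPerfCounter:Memory:Committed Bytes:>50000000000" collect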

will stop collection when the committed bytes for the entire machine exceed 50GB. Notice that the counter is still CATEGORY:NAME:INSTANCE, but in this case INSTANCE is the empty string (the trailing :).

The performance counter will trigger when PerfView detects that the counter has satisfied the condition for a certain number of seconds, defaulting to 3 seconds. You can control this with the flag /MinSecForTrigger:N to set the threshold to N seconds.

When the performance counter triggers, PerfView actually collects 10 more seconds of trace before stopping. This way you get both the conditions up to and slightly after the event that you are interested in. PerfView logs an event called StopReason to the ETW event stream when the performance counter triggers, so you can see exactly when this happened when looking at the data.

To find the exact names of performance counters to use in the /StopOnPerfCounter qualifier you can use the PerfMon utility built into windows. To start it simply type 'start PerfMon' at a command line. Then click on the 'Performance Monitor' icon in the left hand pane. This brings up the performance counter graph in the right hand pane. You can click on the + icon at the top to add new performance counters. This will bring up an 'Add Counters' dialog box with the performance counter categories populated. For example you can open the '.NET CLR Memory' category and you will see counters like '# bytes in all heaps' and '% time in GC'. Selecting one of these will then show you all the instances (processes) that have those counters. These three names (category, counter, instance) are the values you need to give to the /StopOnPerfCounter qualifier.

You will want to test your /StopOn* specification before waiting a long time to see if it captures a trace properly. If you open the log (or use /MaxCollectSec=XXX to force it to stop quickly and then look at the file specified by /LogFile, or look for the captured log file in the 'TraceInfo' view of the *.etl.zip), you will find diagnostic messages as it monitors the perf counter. You should see messages that show it setting up the perf counter as well as the values it sees every few seconds. This can give you confidence that you did not misspell the counter, that you have the correct instance, and that you picked a reasonable threshold.

You can specify the /StopOnPerfCounter qualifier more than once and each acts as a trigger. Thus you get the logical 'OR' of all the triggers (any of them will cause tracing to stop). There is currently no way of specifying a logical 'AND'.

If the process you want to monitor lives a long time, then you can specify the instance of that process in the /StopOnPerfCounter qualifier. Sometimes, however, it is difficult to identify the process instance you want. Some counters (like the GC counters) have a special instance that represents 'all' processes in some way. Look for these in the 'instances' listbox in PerfMon; these can be handy. If you don't have an aggregate instance, you can specify a /StopOnPerfCounter for each process instance that MIGHT exist. This is not hard to do because perf counter instances are given names like EXE, EXE#1, EXE#2 etc. Thus you can specify a /StopOnPerfCounter for each EXE#N from 1 up to the maximum number of instances you expect. PerfView is robust to instances that don't exist (it waits for them to exist), so you get the behavior you want.

Here are some other useful /StopOnPerfCounter examples
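
    /StopOnPerfCounter:Processor:% Processor Time:_Total>90                       (total machine CPU is high)
    /StopOnPerfCounter:.NET CLR Memory:# Bytes in all Heaps:_Global_>1000000000   (GC heaps exceed roughly 1GB)

(These specifications are only illustrative; use PerfMon to confirm the exact category, counter and instance names on your system.)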

Monitoring Performance Counters in the ETL file.

It is often useful to have performance counter data logged to the ETL file so that you can correlate the data in the performance counter with the other ETW data. This is what the /MonitorPerfCounter=spec qualifier does. It has the format CATEGORY:COUNTERNAME:INSTANCE@NUM where CATEGORY:COUNTERNAME:INSTANCE identifies a performance counter (just as for PerfMon) and NUM is a number representing seconds. The @NUM part is optional and defaults to 2. You can have several of these qualifiers when collecting data. The value of the performance counter is logged to the ETL file as an event every NUM seconds. Thus
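
    PerfView "/MonitorPerfCounter=Memory:Available MBytes:@10" collect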

This command logs the Available MBytes performance counter every 10 seconds. This data shows up in the 'events' view under the PerfView/PerformanceCounterUpdate event. Monitoring the server's RPS load or memory usage in this way is often useful.

Using long HTTP requests as the trigger to stop

A reasonably common scenario is that you have a web service and you are interested in investigating cases where response time is long. However most of the time response time is good; thus simply collecting a sample is not likely to be useful. What you need is to run as a 'flight recorder' until a long request happens and then stop. This is what the /StopOnRequestOverMSec qualifier does. The command
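
    PerfView /StopOnRequestOverMsec:2000 collect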

Will stop when an IIS (e.g. ASP.NET) request takes longer than 2000 msec. You can also add the /CollectMultiple:N option so that you collect N of these (the file name is morphed to add a .1, .2 ....).

Finally you can also cause PerfView to stop when messages are written to the windows Application event log. Thus the command:
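
    PerfView /StopOnEventLogMessage:Pattern collect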

Will stop when a message is written to the Windows Event Log that matches the .NET Regular expression pattern 'Pattern'. By default PerfView monitors the Application event log, but if you wish to monitor another log you can do so by prefixing 'Pattern' with the name of the event log followed by an @.

Using long .NET GCs as the trigger to stop

Another reasonably common scenario is that you have some non-HTTP based service that is experiencing pause times and you have a large .NET Heap.   Using the /gccollectOnly option for collection you were able to take a very long trace (hours to days) and discovered that there are long GCs that happen from time to time, but only sporadically.   These long GCs are blocking and thus are likely to be responsible for the long pause times, and you wish to have detailed information about the long GCs.    This is what the /StopOnGCOverMSec qualifier does. The command
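
    PerfView /ThreadTime /StopOnGCOverMsec:5000 /CollectMultiple:3 collect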

will collect detailed information that will capture about 2 minutes of detailed information right before any GC that takes over 5 seconds.   This detailed information includes information on contexts switches (the /ThreadTime qualifier) and will collect up to three separate files (named the default: PerfViewData.etl.zip, PerfViewData.1.etl.zip and PerfViewData.2.etl.zip) for 3 separate long GCs before shutting down. 

Using Exceptions to trigger a stop

Another common scenario is to trigger a stop after an exception has been thrown. This allows you to see what was happening just before the exception happened. You can match on the exception type name or on text in the exception message. For example
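
    PerfView /StopOnException:ApplicationException /Process:MyService collect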

Will stop whenever an exception whose type contains 'ApplicationException' is thrown from the MyService process (note that /Process picks the FIRST process with the given name to focus on, NOT all processes with that name). The pattern argument for /StopOnException can be any .NET Regular expression.
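
    PerfView "/StopOnException:FileNotFound.*Foo.dll" collect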

Will stop whenever an exception that has 'FileNotFound' in its type and 'Foo.dll' somewhere in the text of its message is thrown. Notice that you can use the .NET Regular expression .* in the pattern; you can use the full power of .NET regular expressions.

Collecting multiple instances of a problem

By default when any of the /Stop* arguments are given, PerfView will stop and exit after the trigger fires. It is often useful to collect multiple instances of a problem in one session; this is what the /CollectMultiple:N qualifier does. For example
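
    PerfView /StopOnRequestOverMsec:5000 /CollectMultiple:3 collect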

Will only trigger for ASP.NET requests over 5000 msec. However once triggered, it will go back and resume monitoring until 3 such examples are collected. Thus a maximum of 3 files will be generated by the command above. The resulting .ETL.ZIP files have a number just before the .ETL.ZIP suffix that makes the file names unique.

Restricting the trigger to a particular process 

By default the /StopOn*OverMsec and /StopOnException qualifiers will trigger when ANY process satisfies the trigger.   On servers with many services running this can lead to false triggers if you are only interested in a particular process.   This is what the /Process:processNameOrID qualifier can be used for.  For example
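
    PerfView /StopOnRequestOverMsec:5000 /Process:3543 collect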

Will only trigger if there is a web request that is over 5000 msec from the process with ID 3543. You can also use a process name (exe without path or extension) for the filter, however this name is just used to look up the FIRST PROCESS with that name. Thus if there is more than one process with that name at the time the collection is started the exact process that is picked is effectively random. Thus you need to use numeric IDs for existing processes unless the process name is unique on the system. Processes that start after the collect starts can use the name unambiguously.

Using the /DecayToZeroHours:XX option

One issue that you can run into when using the /StopOn*Over or /StopOnPerfCounter qualifiers is choosing a good threshold number.  Choosing a number too high will mean that the trigger will never fire.  Choosing a number too low will cause it to trigger on uninteresting cases.   This is what the /DecayToZeroHours option is for.  The basic idea is that you set the trigger to a number that is on the upper range of what you believe is likely.  You also set /DecayToZeroHours:XX to a value that is 'long' (typically something like 24 hours).  By specifying this option you have indicated that the original trigger value should slowly decay to zero over that time.  Thus the command
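
    PerfView /StopOnRequestOverMsec:5000 /DecayToZeroHours:24 collect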

Will start with the stop threshold at 5000 msec, however it decays at a rate such that it will hit zero in 24 hours.  Thus in 12 hours it will be at 2500 msec.  Thus over that time period the trigger will eventually get small enough to fire, but odds are that it will trigger well before that at a 'reasonably big' case. 


Logging while collecting with the /StopOn* options

When the /StopOn* trigger options are active, PerfView will log, both to the PerfView log and to the ETL file, messages about the average and maximum request in 10 second intervals.  You can see these logs while data collection is happening by clicking the 'log' button on the Main window (even when the collection dialog box is up).  They will also be in the ETL file and can be viewed in the 'events' view by filtering to the 'PerfView/PerfViewLog' events.   These can be helpful in understanding more about how the maximum changes over time.

Capturing more data after the stop Trigger has fired

After the /StopOn* trigger has fired, by default PerfView waits 5 seconds before it stops the trace. This ensures that you see not only the period just before the trigger, but also 5 seconds afterward. This is sufficient for most scenarios, but if you need more you can use /DelayAfterTriggerSec=N to specify a longer period. Keep in mind, however, that typically the default 500Meg circular buffer will only hold 2-3 min of trace, so specifying a number larger than 100-200 seconds is likely to allow the period of time before triggering to get overwritten with new data.

Executing an external command when the stop Trigger fires.

In some cases there is other logging being collected along with the PerfView data, and when PerfView triggers the stop it is useful to execute a command that stops this other logging as well. This is what the /StopCommand qualifier is for. The argument can use the variable name %OUTPUTDIR% or %OUTPUTBASENAME% in it to represent the directory and the base name (file name without the directory or file extension) to pass to the external command.
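
    PerfView /StopOnRequestOverMsec:5000 "/StopCommand:StopOtherLogger.cmd %OUTPUTDIR% %OUTPUTBASENAME%" collect

(StopOtherLogger.cmd is a hypothetical script of your own; the %OUTPUTDIR% and %OUTPUTBASENAME% variables are expanded by PerfView before the command is run.)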

Stopping on arbitrary ETW events or arbitrary start-stop pairs

The /StopOnRequestOverMSec qualifier is wired to measure the duration between the IIS start and IIS stop events. Many services use IIS to route their requests and thus this option is useful much of the time. However it is also possible to trigger a stop on either a single ETW event occurring, or on a start-stop pair having a duration longer than a trigger amount, using the /StopOnEtwEvent qualifier. The general syntax is
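
    /StopOnEtwEvent:Provider/EventName;Key1=Value1;Key2=Value2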

Where the 'Provider' can be the name of an ETW provider registered with the operating system, an EventSource specification of the form *Name (see EventSources), or an explicit provider GUID.

And 'EventName' can be the name of an event as PerfView displays it, which is typically either a simple event name or a TaskName/OpcodeName pair (e.g. ProcessStop/Stop).

In general the event name shown in the 'Events' view of PerfView is the correct thing to use. Finally the key-value pairs give additional 'options' that affect the semantics. They are all optional; the valid keys include (among others) Keywords= and Level= (controlling which events the provider emits), StopEvent= and TriggerMSec= (turning a start-stop pair into a duration trigger), FieldFilter= (restricting the trigger to events whose field values match an expression), and Process= (restricting the trigger to a particular process).

Examples of /StopOnEtwEvent use

As you can see there are a lot of options, but mostly you don't need them. This option is perhaps most useful for your own EventSource Events. If you defined an event 'MyWarning' you could stop on that warning condition by doing
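
    PerfView "/StopOnEtwEvent:*MyEventSource/MyWarning" collect

(Here MyEventSource stands in for the name of your own EventSource.)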

If you defined your provider 'MyEventSource', and had two events 'MyRequestStart' and 'MyRequestStop', you could stop whenever your requests took more than 2 seconds by doing
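
    PerfView "/StopOnEtwEvent:*MyEventSource/MyRequestStart;StopEvent=MyRequestStop;TriggerMSec=2000" collect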

If you want to stop when the process named 'GCTest' (that is, the exe is named GCTest.exe) stops (you can also use a process number), you can do
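
    PerfView "/StopOnEtwEvent:Microsoft-Windows-Kernel-Process/ProcessStop/Stop;Process=GCTest" collect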

If you want to stop when a process starts, it is a bit more problematic because the 'start' event actually occurs in the process that spawned the new process, not the process being created. Instead you can use the fact that the ProcessStart event has an 'ImageName' field and use the ~ operator of the FieldFilter option to trigger on that. Thus to stop when a process called GCTest.exe is launched you can do
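
    PerfView "/StopOnEtwEvent:Microsoft-Windows-Kernel-Process/ProcessStart/Start;FieldFilter=ImageName~GCTest.exe" collect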

Here is a slightly more complex example where we only stop if the GCTest.exe executable fails with a non-zero exit code. Here we use the ImageName field to find a particular Exe as well as the ExitCode field to determine if the process fails. You can use this to stop PerfView when a particular process in a large script fails (which is a reasonably common scenario).
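
    PerfView "/StopOnEtwEvent:Microsoft-Windows-Kernel-Process/ProcessStop/Stop;FieldFilter=ImageName~GCTest.exe;FieldFilter=ExitCode!=0" collect

(A sketch of the idea; whether two FieldFilter options can be combined this way is best confirmed in the Command Line Reference.)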

Here is an example where we want to stop when a particular URL is serviced by an ASP.NET server. Basically we stop when an ASP.NET Request event fires with a 'FullUrl' field that matches the pattern (ends in /stop.aspx).
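
    PerfView "/StopOnEtwEvent:Microsoft-Windows-ASPNET/Request/Start;FieldFilter=FullUrl~.*/stop.aspx" collect

(The provider and event names here are representative; use the names shown in PerfView's 'Events' view for your trace.)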

Here is an example where we want to stop when a disk I/O takes longer than 10000 ms. We want to monitor Windows Kernel Trace/DiskIO/Read events and use 'DiskServiceTimeMSec' field in a FieldFilter expression.
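
    PerfView "/StopOnEtwEvent:Windows Kernel Trace/DiskIO/Read;FieldFilter=DiskServiceTimeMSec>10000" collect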

In general the option is pretty powerful, especially if you have the ability to add ETW events to your code (EventSource). Coupled with the FieldFilter option, you can use this to stop on particular DLLs loading or unloading in particular processes, registry keys being touched, files being opened, as well as any of your specific EventSource events happening (testing their arguments).

Using Keywords on /StopOnEtwEvent providers

In the previous examples we turned on all the 'keywords' associated with a particular provider. For example, to trace the starts and stops of processes we turned on all the events in the Microsoft-Windows-Kernel-Process provider. While this works, it can mean that the triggering logic has to look at and discard many events that are unimportant. You can improve the efficiency, as well as make any debugging of triggering easier, by reducing the number of events subscribed to using the 'Keywords' option. For example
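
    PerfView "/StopOnEtwEvent:Microsoft-Windows-Kernel-Process/ProcessStop/Stop;Process=GCTest;Keywords=0x10" collect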

This is the same as the previous example but it has the Keywords=0x10 option placed on it. This tells PerfView to only turn on the particular events designated by the 0x10 bitfield. The only issue is how do you know what 0x10 means? You can determine this by looking at the manifest for the Microsoft-Windows-Kernel-Process provider. You can do this by opening the advanced section of the 'collection' dialog box and clicking on the Provider Browser button. Select the provider of interest in the 'Providers' listbox and then click the 'View Manifest' button. This will bring up the complete XML manifest for the provider. You will find a 'keywords' section, and in that you will find the definitions of each keyword. Thus we find that the WINEVENT_KEYWORD_PROCESS keyword has the value 0x10, and since we can see that the event of interest (ProcessStop/Stop) is tied to this keyword, we know that this is the only keyword we actually need. Thus we know the 'magic' number to give to the 'Keywords' option above. Another way to find the keywords is using 'logman query providers "provider"'. Note you don't have to do this, but it does make debugging easier and processing more efficient (since there are fewer events to filter out).

Debugging Triggering Issues

It is not uncommon for you to try out a /StopOnEtwEvent qualifier and find that it does not do what you want (typically because it did not trigger). Sometimes what is in the log will help, however PerfView can't place too much in the log because it might flood the log. Instead it emits special PerfView StopTriggerDebugMessage events into the ETW stream so that you can look at data in the 'events' view and figure out why it is not working properly. If you have issues with Triggering you will definitely want to look at these events.

Using Performance Counters to trigger collection start (Start Trigger qualifier)

For many scenarios, simply using /StopOnPerfCounter is sufficient (along with perhaps a /DelayAfterTriggerSec) to collect data at an interesting point (when a performance counter is unusually high or low). However that technique has the disadvantage of requiring that collection be on continuously. This is inefficient if the point of interest is well after the performance counter triggers. In this case it makes more sense to not even start collection until the interesting time. This is what the /StartOnPerfCounter option is for. Its syntax is identical to /StopOnPerfCounter except that it will not even start collecting until this trigger trips. The flag /MinSecForTrigger:N applies to /StartOnPerfCounter as well, to control how many seconds the performance counter has to satisfy the condition before triggering collection (the default is 3 seconds).


Using PerfView with EventSources

The .NET V4.5 Runtime comes with a class called System.Diagnostics.Tracing.EventSource which can be used to log ETW events in a very convenient way. For example here is a trivial EventSource called MyCompanyEventSource which has a 'Load' and 'Unload' event. Each event logs whatever interesting information makes sense for that event, in this case the 'imageBase' of the load as well as the name.

        using System.Diagnostics.Tracing;       // EventSource lives here in .NET V4.5 and later

        sealed class MyCompanyEventSource : EventSource
        {
            public static MyCompanyEventSource Log = new MyCompanyEventSource();    // The log itself
            public void Load(long ImageBase, string Name) { WriteEvent(1, ImageBase, Name); }
            public void Unload(long ImageBase) { WriteEvent(2, ImageBase); }
        }

        // In other code
        MyCompanyEventSource.Log.Load(myImageBase, "MyName");
        // In another place 
        MyCompanyEventSource.Log.Unload(myImageBase);

Because EventSources can log to the ETW logging file in a standard way, PerfView can display these events in useful ways. This section describes some of the common techniques.

Naming EventSources

Like all ETW providers, an EventSource has a 16 byte GUID that uniquely identifies it. Normally GUIDs are not convenient to use, and you would prefer to use a name. If an ETW provider registers itself with the operating system, PerfView can ask the OS to look up the name and get the GUID. However typically EventSources do not do this because it complicates the deployment of the application. Instead EventSources typically use an internet standard way of generating a GUID from a name. Thus given a name you can find the GUID without the EventSource ever needing to register itself. PerfView supports using this convention with the *NAME syntax. If a provider name starts with a *, it is taken to be the provider GUID that results from hashing NAME in the standard way (the hash is case insensitive). EventSource names are either the name supplied by the Name parameter of the EventSourceAttribute applied to the EventSource class, or the simple name of the class (no namespace) if no name is given explicitly. Once you know the name of the EventSource you can use the /providers qualifier to turn on the EventSource. For example
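
    PerfView /providers=*MyCompanyEventSource collect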

Will turn on all keywords (event groups) of the EventSource called 'MyCompanyEventSource' at the verbose level. Notice that all of this is just 'standard' ETW. The only special part is the * used to refer to the EventSource without it being registered.

In the previous example the MyCompanyEventSource was activated IN ADDITION TO the standard kernel and CLR providers. This is great for monitoring fine-grained performance, however it is too verbose for simple monitoring. While you can use the /kernelEvents=none /clrEvents=none /NoRundown qualifiers to turn off the default logging there is a '/onlyProviders' qualifier that makes this even easier. Thus
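
    PerfView /onlyProviders=*MyCompanyEventSource collect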

Will collect ONLY from the providers mentioned (in this case the MyCompanyEventSource), turning off all other default logging. Thus the files tend to remain very small, which is suitable when you only wish to see your EventSource messages.

You can achieve the same effect of the /OnlyProviders qualifier in the GUI by opening the 'Advanced' dropdown, unchecking the '.NET Rundown' 'Kernel Base' and '.NET' checkboxes, and adding your EventSource specification in the 'Additional Providers' textbox.

Just like any other ETW source, you can change the 'keywords' (groups) of events or the verbosity of your logging by specifying these to the /OnlyProviders qualifier.  See the help on AdditionalProviders for more details on this syntax. One very interesting option here is to turn on the 'stacks' option for the provider, which will log a stack trace every time your ETW event fires. This can then be viewed in the 'Any Stacks' view of the resulting log file.

Once you have collected your data, you can look at it with PerfView in the normal way.  This almost certainly means opening the 'Events' view, selecting the events of interest, and updating the display. If desired, the events can be saved as XML or CSV files by using the right click context menu in the events view.

Converting EventSource Data to XML

Looking at the output of an EventSource in the event viewer is great for ad-hoc investigations since the GUI allows quick filtering and conversion to CSV or XML files (right click in the EventViewer).    However it may be that you want to simply parse the data with other tools that you would like to remain very loosely coupled to PerfView/ETW.  For these applications all you want is something that takes an ETL file and converts it to an XML file, which you can then process using other tools.   There is a PerfView command that does this.  
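
    PerfView userCommand DumpEventsAsXml PerfViewData.etl.zip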

The command above runs the 'UserCommand' called 'DumpEventsAsXml', giving it the parameter 'PerfViewData.etl.zip'.   This will create a file called PerfViewData.etl.xml which is an XML dump of all the ETL data in the original file (thus the file can get big).    It works on any ETL or ETL.ZIP file; however it is meant for files produced with the /OnlyProviders qualifier that only have EventSources turned on and thus will produce relatively little output.

The attentive user will wonder what a 'UserCommand'  is.  PerfView has 'built in' commands, but it also has the ability to be extended with code that the user provides (see PerfView Extensions for more).   Some of these user commands become useful enough that they ship with PerfView itself by default.   DumpEventsAsXml is one of these commands.   You can see all the user commands that PerfView currently knows about by looking at the Help -> User Command Help menu option.


PerfView Extensions (Automating PerfView)

PerfView has the ability to collect data with command line commands, which can be used to automate simple collection tasks; however it is also useful to automate analysis as well as collection. For this, simple command line options are not sufficient; you need the full power of a programming language to support an unbounded variety of useful data manipulations. This is what PerfView extensions are for. PerfView allows you to create an extension, which is a .NET DLL that lives alongside PerfView.exe and defines user-defined commands. These commands can control PerfView's collection or analysis capabilities. It is very powerful and opens up a broad range of automation scenarios including

  1. Computing complex metrics like startup time which requires you to find the difference between two events (e.g. process start and first render event).
  2. Custom groupings and other analysis based on names in the stacks.
  3. Custom reports on Disk I/O, reference set or other metrics
  4. Automating not only ETW collection, but also symbol resolution, reducing data to a single process, and saving various views as PERFVIEW.XML.ZIP files, dramatically reducing the amount of data (so you can archive more of it) and speeding up use of that data (since symbols are resolved and file sizes are so small)

Invoking user defined commands

Along with the built in command line commands like 'run', 'collect' and 'view', there is also a 'userCommand'. A user command is one way to activate user-defined functionality in PerfView. For example, when you run the command
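
    PerfView userCommand Global.DemoCommandWithDefaults arg1 arg2

(arg1 and arg2 stand in for whatever parameters the command expects.)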

PerfView will look for a DLL called 'PerfViewExtensions\Global.dll' next to PerfView.exe. It will then look for a type called 'Commands' and create an instance of it. Then it looks for a method within that type called 'DemoCommandWithDefaults'. It then passes the rest of the parameters of the command to that method. Often the method target is varargs (its last argument is 'params string[]') which allows it to handle any number of arguments.

The extension named 'Global' is special in that if the user command has no '.' in it, then the extension is assumed to be the 'Global' extension. Thus the command above could be shortened to
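
    PerfView userCommand DemoCommandWithDefaults arg1 arg2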

Invoking user defined commands from the GUI

You can also invoke user commands from the GUI by using the File -> UserCommand menu option (Alt-U) on the Main Viewer. This command will bring up a dialog box in which you can enter your command. PerfView remembers the user commands you have previously executed (even across invocations of the program), so typing just the first few characters is typically enough to select a command you have executed in the past. Hitting the tab key will commit the completion and hitting Enter will run the command. Thus in just a few keystrokes you can be executing your user defined commands.

Help on User defined commands

The Help -> 'User Defined Commands' menu entry, as well as the 'Command Help' button on the user command dialog, will open a dialog that contains help on the various user defined commands.

Creating a PerfView Extension (creating user commands)

Before you can invoke a user defined command, you need to create an extension DLL which contains the command. This is what the PerfView CreateExtensionProject command does. Because extension DLLs are located by looking RELATIVE to PerfView.exe, the first step in creating your own extensions is to copy PerfView.exe to a location that you control. For example:
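
    mkdir C:\MyPerfView
    copy PerfView.exe C:\MyPerfView

(C:\MyPerfView is just an example location.)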

Once you do this you can execute the command (notice we launch the LOCAL copy of perfview)
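
    C:\MyPerfView\PerfView.exe CreateExtensionProject MyExtension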

This command creates the PerfViewExtensions directory next to PerfView.exe, and does three things

  1. Creates a new C# project in PerfViewExtensions\ExtensionNameSrc. If ExtensionName is missing/empty, the extension name 'Global' is used.
  2. Creates/Modifies the solution file PerfViewExtensions\Extensions.sln to include the new project.
  3. Opens the PerfViewExtensions\Extensions.sln in Visual Studio 2010.

Thus after running the CreateExtensionProject command you can simply open the PerfViewExtensions\Extensions.sln to compile and test your new PerfView extension. If you have VS2010 installed, you can be up and running in seconds.

Thus probably the best way to get started is to simply run the CreateExtensionProject command as shown above and then modify the demo commands it generates.

Exploring the PerfView Object Model

  1. INTELLISENSE IS YOUR FRIEND! Only the PerfViewExtensibility namespace is open by default, and this is where the most important classes in PerfView's object model reside. This means that there is a good chance that if you type some characters, you will find what you are looking for.
  2. CommandEnvironment is a good place to start. This is the class that defines 'global' methods. If you put the cursor on CommandEnvironment and hit F12, you can browse the other global methods. These methods will return other important types in the object model (e.g. EtlFile, Events, Stacks).
  3. Understand classes in PerfViewExtensibility first. You can use the object browser (Ctrl-W J) and look under the PerfView.PerfViewExtensibility namespace.
  4. Take a look at the example commands. These use many of the important features (logging, symbol lookup, HTML report) in context, which is quite helpful.

Once you have familiarized yourself with the PerfView object model, you need to be aware of an important consideration: the PerfView object model is NOT guaranteed to remain compatible from one version of PerfView to the next.

What this means is that if you were to upgrade PerfView.exe to a newer version there is a good chance you will have to update your extension to match any changes that were made to PerfView since the last version. The reason for this is simple. The PerfView object model is really best thought of as being a 'Beta' release, because there simply has not been enough time to find the best API surface. Thus changes are inevitable, and the cost of keeping compatibility is simply not worth it. Thus you are free to create PerfView extensions, but you must be ready to pay the porting cost on upgrades when you decide to create an extension.

Extending the GUI with User Commands

User commands give you the ability to call your code to create specialized views of data, but they are not integrated into the GUI itself. This section shows how to make your user commands become part of the normal GUI experience. The key to doing this is the 'PerfViewStartup' file in the 'PerfViewExtensions' directory next to the PerfView.exe file. If such a file exists, the commands in this file are executed at startup of PerfView. The file is read line by line, and each line is a command (for example, a command that runs one of your user commands at startup, or one that declares a new view for files with a particular file extension).


Viewing Linux Data

Linux has a kernel level event logging system called Perf Events which is not unlike ETW, and in particular knows how to capture CPU stacks at a periodic interval (e.g. 1msec).  PerfView knows how to read this data, so it is possible to collect data using the Perf Events tool on Linux, copy the data over to a Windows machine, and view it with PerfView's stack viewer. Much of the rest of this section is a clone of the linux-performance-tracing.md document. You may wish to check there for the latest version of these instructions.

Setup

Getting perfcollect script

There is a BASH (shell) script that Brian Robbins wrote that will run Perf.exe, resolve symbols, and collect all the information into a ZIP file for transfer to another machine. You can download it using either a web browser or the 'cURL' utility
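
    curl -OL https://aka.ms/perfcollect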

Once downloaded, to allow it to run you have to make it executable
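
    chmod +x perfcollect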

If that works you should be able to do
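
    ./perfcollect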

And it should print out some help.

Installing Linux Perf tool

You will need the Perf.exe command as well as the LTTng package.  You can get these by doing
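
    sudo ./perfcollect install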

Note that you need to be super-user to do this, which is why the command above uses the sudo command to elevate to super-user before executing the install script.

Collecting Data

If you are running a .NET Runtime application you must set an environment variable that will tell the runtime to emit symbol information about Just in Time (JIT) compiled methods. Thus you must make sure that the following environment variable is set before running the application
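
    export COMPlus_PerfMapEnabled=1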

At this point you can start collection. To do so open another command window and run the following command.
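
    sudo ./perfcollect collect FILENAME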

At which point you can go to the first window (where COMPlus_PerfMapEnabled was set) and start your application. After the application completes you can use Ctrl-C to stop the collection. The result is a FILENAME.trace.zip file. This contains the trace as well as all other files to resolve symbolic information.

Viewing data with PerfView

Once you have created the FILENAME.trace.zip file you can transfer it to a windows machine and simply open it with PerfView. It will open the file in a stack window of the CPU samples, and all the normal techniques of CPU investigation are applicable.

What is going on under the hood is that PerfView opens the FILENAME.trace.zip file, locates a file within the archive with the suffix *.data.txt, and reads that. This file is expected to be the output of running the 'perf script' command. PerfView also knows how to read files with the *.data.txt suffix directly, so if you don't wish to use the 'perfcollect' script when collecting your Linux data, you can still easily feed the data to PerfView. (You can also zip up your *.data.txt file into a file with the suffix *.trace.zip and PerfView will happily open it.)


Viewing External Data

One of the most powerful aspects of PerfView is its stack viewer. Perhaps one of the most interesting things about this viewer is that it is VERY generic. The data that is shown in this viewer is simply a set of samples where each sample contains

  1. An (optional) floating point value representing the time.
  2. A value (defaults to 1) representing the metric or cost of the sample.
  3. A list of names representing the stack or path in a hierarchical tree.

All the rest of the magic of the stack viewer, the inclusive and exclusive cost, the timeline, filtering, and the callers and callees views, are all just different aggregations of this data.

What this means is that pretty much any hierarchical data can be usefully displayed in the stack viewer. For example the size on disk view simply takes the path of each file name to form the 'stack' and the size of the file as the metric to form the model of the total size on disk. This means that data from other profilers, or any other place where the data forms a hierarchy, can be viewed with the stack viewer.

Simple .perfView.xml Format

Now inside the implementation of PerfView is a class called a 'StackSource' that represents the list of samples with stacks that PerfView's viewer displays. There is also a class called an 'InternStackSource' that is designed to make it easy to read other formats and turn that data into a StackSource. However PerfView also has two formats that make it very easy for other tools to output stacks that PerfView can simply read. One of these formats is XML based and the other is JSON based, and neither of them will be surprising; they are simply the 'obvious' encoding of the data that the stack viewer needs in those formats. For example here is a sample of the .perfView.xml format

        <StackSource>
          <Samples>
           <Sample Time="10" Metric="10"> 
                HelperNested 
                Helper1 
                Func3 
                Func 
                Main 
           </Sample>
           <Sample Time="20" Metric="10"> 
                Func3 
                Func 
                Main 
           </Sample>
           <Sample Time="30" Metric="10"> 
                HelperX 
                Helper1 
                Func3 
                Func 
                Main 
           </Sample>
           <Sample Time="40" Metric="10"> 
                Func 
                Main 
           </Sample>
          </Samples>
         </StackSource>
            

You can see that the format can be very straightforward. There is a 'StackSource' element that has a member 'Samples', which in turn contains a list of Samples, each of which has a time and a metric (both of these are optional; time defaults to 0 and metric defaults to 1). Inside each sample is a list of stack frames, one per line. These are ordered from the most specific (or deepest call tree nesting) to the least specific (main program). That is all you need to generate in order for PerfView to read the data. You can try this out by simply pasting the above text into a '*.perfView.xml' file and then opening the file in PerfView. PerfView will open that data in the stack viewer (try it!).

There is a corresponding *.perfView.json format which is completely analogous to the XML format. The basic structure is the same: a StackSource has a list of Samples, and each sample has a time, a metric, and a list of names that represent the stack. Here is an example. Like the previous example, you can cut and paste it into a *.perfView.json file and open it in PerfView to see the data in the stack viewer.

    {
      "StackSource" :  {
        "Samples" : [
           { "Time" : "10", "Metric": "10",
             "Stack": [
                "HelperNested",
                "Helper1",
                "Func",
                "Main" 
             ]
           },
           { "Time" : "20", "Metric": "10",
             "Stack": [
                "Func3",
                "Func",
                "Main" 
             ]
           },
           { "Time" : "30", "Metric": "10",
             "Stack": [
                "HelperX",
                "Helper1",
                "Func3",
                "Func",
                "Main" 
             ]
           },
           { "Time" : "40", "Metric": "10",
             "Stack": [
                "Func",
                "Main" 
             ]
           }
        ]
      }
    }

Advanced .perfView.xml Format

The simple format is nice because it is so easy to explain, but it is very inefficient. You can see that each stack has to be repeated in its entirety for each sample, and most of the time the stacks are very similar to one another. Moreover when you read the samples into the viewer, you don't get any defaults for PerfView's grouping, folding and filtering options, which makes the experience less than ideal.

Well, the .perfView.xml format is actually more complex than what has been shown so far. In fact you can assign IDs to each unique frame of the stack and use the ID instead of the name (saving a lot of space). Similarly you can assign IDs to each unique stack (built from frame IDs) that can be used in the samples (saving more space). This compression dramatically reduces the time to load the data. Finally, it is possible to specify the defaults and options for each of the stack viewer's textboxes (e.g., the Group Pats, Fold Pats, and Include Pats textboxes). In short, with a little more work when you generate your .perfView.xml file, you can make the experience significantly nicer.

Rather than document the specific format here, it is easier to simply show you an example. The PerfView stack viewer has a File -> Save command, which saves the current stack view as a .perfView.xml.zip file. If you unzip this file, you will see the representation of the data in this more complete, efficient format. Thus you can take one of the examples above, open it, add some data to the text boxes (which remember their history), and then save the view. Then you can unzip the result and look at the format. The format is completely straightforward.
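
To give a flavor of the idea, here is an illustrative sketch of the ID scheme (the element and attribute names here are approximate; the file you save with File -> Save shows the authoritative format):

        <StackSource>
          <Frames>
            <Frame ID="0">Main</Frame>
            <Frame ID="1">Func</Frame>
          </Frames>
          <Stacks>
            <!-- Each stack names its top frame and its caller stack (-1 means no caller) -->
            <Stack ID="0" CallerID="-1" FrameID="0"/>
            <Stack ID="1" CallerID="0" FrameID="1"/>
          </Stacks>
          <Samples>
            <!-- Samples refer to a stack by ID rather than repeating all the frames -->
            <Sample Time="10" Metric="10" StackID="1"/>
          </Samples>
        </StackSource>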


Working with WPA (Windows Performance Analyzer)

Windows Performance Analyzer (WPA) is a tool built by the Windows team and available for no charge as part of the Windows Assessment and Deployment Kit. Along with the Windows Performance Recorder (WPR), it can be used to collect and view ETW data. Because both tools use the same data format (ETW trace log (ETL) files), it is easy to collect using one tool and view using another. This is useful because WPA has very powerful ways of graphing and viewing data that PerfView does not have, and PerfView has powerful ways of collecting data and other views that are not present in WPA.

Using PerfView to collect data and WPA to view data.

PerfView has a number of Production Monitoring (e.g. /StopOnPerfCounter) capabilities that at present WPR does not have. In addition, the fact that PerfView is easy for anyone to download from the web and XCOPY deploy as a single EXE makes PerfView ideal for collecting data in the field. In this case you can simply collect with the PerfView collect command (with the /threadTime option if you may be doing a wall clock investigation) and the result will be a .ETL.ZIP file ready for uploading. Unfortunately, at present WPA will not open the ETL.ZIP file, but you can use a command like the following (FILE.etl.zip stands for your collected data file)
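
    PerfView /wpr unzip FILE.etl.zip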

which will unzip the data file as well as any NGEN PDBs and store them in a .NGENPDB folder in the way that WPR would. Thus after unzipping in this way, you can run the WPA command on the data file to view the data in WPA.

In the scenario above PerfView will set the ETW providers as it normally would. However PerfView also has the ability to mimic the providers that WPR would turn on by default. Thus if you wish to use PerfView to collect data and mimic WPR as closely as possible, collect the data with the following command.
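
    PerfView collect /wpr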

This should produce data files that are very close, if not identical, to what WPR would produce. In particular it does not produce a ZIPped file but outputs the .ETL file and the .NGENPDB directory just as WPR would. Like all collection commands, you can use the /Providers qualifier to add more providers, as well as the /KernelEvents or /ClrEvents qualifiers to fine-tune the Kernel and .NET provider events.

If you wish to generate a file as WPR would but take advantage of PerfView's ZIPPing capability you can combine the /wpr and /zip commands as follows.
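
    PerfView collect /wpr /zip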

This command will turn on the providers as WPR would, but ZIP the result as PerfView would. This is useful for remote collection: you can use this to collect the data, and use the PerfView /wpr unzip command (shown above) to unpack it at its destination for viewing with WPA.

Using PerfView to View data collected with WPR.

PerfView has a number of views and viewing capabilities that WPA does not have. Thus it is often useful to view data in PerfView that was collected with WPR. This scenario 'just works': PerfView already knows how to open the ETL files, and it is smart enough to notice the .NGENPDB directory for the symbolic information and use it appropriately.


Command Line Reference

Most functionality that is not intimately tied to viewing is available from the command line to allow for easy automation of data collection. At the command line, typing
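
    PerfView /?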

or navigating to Help -> Command Line Help from the main PerfView window will give you more complete details.

See also PerfView Extensions for information on building extensions for PerfView.

Using PerfView in Scripts (/LogFile qualifier)

By default PerfView always brings up a GUI window when performing any operation, including data collection. It does this to allow errors to be reported back. For unattended automation this can be undesirable. This is what the /LogFile:FileName qualifier is for. When this qualifier is specified, instead of launching the GUI the command sends all output to the specified file. The intent is that scripts would use this qualifier to avoid the GUI. The exit code for PerfView will be 0 if the command was successful.
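
For example, a script might collect for 30 seconds and check the exit code like this (the log file name and the 30 second limit are illustrative; /MaxCollectSec stops collection automatically):

    PerfView /LogFile:collect.log /MaxCollectSec:30 collect
    if %ERRORLEVEL% NEQ 0 echo Collection failed, see collect.log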

Advanced Data Collection

PerfView data collection is based on Event Tracing for Windows (ETW). This is a general facility for logging information in a low overhead way. It is used extensively throughout the Windows OS, in particular by both the Windows OS Kernel and the .NET CLR Runtime. By default PerfView picks a default set of these events that have high value for the kinds of analysis PerfView can visualize. However PerfView can also be used simply as a data collector, at which point it can be useful to turn on other events. This is what the /KernelEvents:, /ClrEvents: and /Providers: qualifiers do.

All ETW events log the following information

  1. The time (to 100ns resolution) when the event happened
  2. The provider that logged the event (e.g., the Kernel, CLR or some user provider).
  3. The event number (which indicates how to decode the payload)
  4. The process and thread associated with the event (for some events there is no useful process or thread ID, but most have one)

Kernel Events

By far, the ETW events built into the Windows Kernel are the most fundamental and useful. Almost any data collection will want to turn at least some of these on. PerfView groups the kernel events into three groups. See Kernel ETW Events.

The Default Kernel Group

The default group is the group that PerfView turns on by default. The most verbose of these events is the 'Profile' event, which triggers a stack trace every millisecond on each CPU of the machine (so you know what your CPU is doing). Thus on a 4 processor machine you will get 4000 samples (with stack traces) every second of trace time. This can add up; assume you will get at least 1 MB of file size per second of trace. If you need to run very long traces (100s of seconds), you should strongly consider using the circular buffer mode to keep the logs under control. Here are the events you get under the default group:

  1. Default = DiskIO | DiskFileIO | DiskIOInit | ImageLoad | MemoryHardFaults | NetworkTCPIP | Process | ProcessCounters | Profile | Thread
  2. DiskIO - Fires every time a physical disk read completes; indicates the size and how long the operation took.  No stack trace.
  3. DiskIOInit - Fires each time Disk I/O operation begins (where DiskIO fires when it ends).  Unlike DiskIO this logs a stack trace. 
  4. DiskFileIO - Logs the mapping between OS file object handles and the name of the file.  Without this many kernel events are not useful because you can't relate the operation to a meaningful name.    You almost always want this event.  No stack trace.
  5. ImageLoad - Fires when a DLL or EXE is loaded into memory for execution (LoadLibraryEx is called).  Needed if you want to map memory addresses back to symbolic names.  Logs a stack trace.
  6. MemoryHardFaults - Fires when the OS had to cause a physical disk read in response to mapping virtual memory.   Logs a stack trace.
  7. NetworkTCPIP - Fires when TCP  or UDP packets are sent or received.   Logs the two end points and the size.  No stack trace.
  8. Process - Fires when a process is created or destroyed.  Indicates the command line (on start) or exit code (on end).  Logs a stack trace.
  9. ProcessCounters - Logs process memory statistics before a process dies or the trace ends.   No stack trace.
  10. Profile - Fires every 1 msec per processor; indicates where the instruction pointer currently is and takes a stack trace.
  11. Thread - Fires every time a thread is created or destroyed.   Logs a stack trace. 

The following Kernel events are not on by default because they can be relatively verbose or are for more specialized performance investigations. 

  1. ThreadTime = Default | ContextSwitch | Dispatcher - This is the most common of the verbose options; it turns on ContextSwitch and Dispatcher events in addition to all the default events. This option is needed if you want to use the 'Thread Time' view in PerfView (see the example command after this list).
  2. Verbose = Default | ContextSwitch | DiskIOInit | Dispatcher | FileIO | FileIOInit | MemoryPageFaults | Registry | VirtualAlloc
  3. ContextSwitch - Fires each time the OS stops running one thread and switches to another.  It indicates the thread losing the processor and the thread getting it.  This event can fire > 10K times a second depending on scenario, but can be VERY useful for determining why some process is waiting.  Logs a stack trace.
  4. Dispatcher - (Also known as ReadyThread) Fires when a thread goes from waiting to ready (note that the thread may not actually run if there is no CPU available).  This can also fire > 10K / sec, but is very useful in understanding why waits are happening. 
  5. FileIO - Fires when a file operation completes, even if the operation did not cause a disk read (because the data was in the file system cache).  Does not log a stack trace.
  6. FileIOInit - Fires when a file operation starts.  Unlike FileIO this will log a stack trace. 
  7. MemoryPageFaults - Fires when a virtual memory page is made accessible (backed by physical memory).   This fires not only when the page needed to be fetched from disk, but also if it was already in the file system cache, or only needed to be zeroed.    Logs a stack trace.
  8. Registry - Fires when a registry operation occurs.   Logs a stack trace.
  9. VirtualAlloc - Fires when a virtual memory allocation or free operation occurs.  All memory in a process either was memory-mapped or was allocated through VirtualAlloc operations.
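
For example, to collect with the ThreadTime group turned on, a command along these lines should work (the group names above are the values the /KernelEvents qualifier accepts):

    PerfView /KernelEvents:ThreadTime collect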

The final set of kernel events are typically useful for people writing device drivers or trying to understand why hardware or low level OS software is misbehaving 

  1. OS = AdvancedLocalProcedureCalls | DeferedProcedureCalls | Driver | Interrupt
  2. AdvancedLocalProcedureCalls - Logged when an OS machine-local procedure call (ALPC) is made.
  3. DeferedProcedureCalls - Logged when an OS deferred procedure call (DPC) is made.
  4. SplitIO - Logged when a disk I/O had to be split into pieces.
  5. Driver - Logged when various hardware driver events occur.
  6. Interrupt - Logged when a hardware interrupt occurs.

CLR Events

In addition to the kernel events, if you are running .NET Runtime code you are likely to want the CLR ETW events turned on as well. PerfView turns a number of these on by default. See CLR ETW Events for more information on these events. An example command for selecting specific CLR events follows the list below.

  1. Default = GC | Type | GCHeapSurvivalAndMovement | Binder | Loader | Jit | NGen | SupressNGen | StopEnumeration | Security | AppDomainResourceManagement | Exception | Threading | Contention | Stack | JittedMethodILToNativeMap | ThreadTransfer
  2. GC - Fires when GC starts and stops
  3. Binder - Currently only useful for the CLR team.
  4. Loader - Fires when assemblies are loaded or unloaded
  5. Jit - Fires when methods are Just In Time (JIT) compiled.
  6. NGen - Fires when operations associated with precompiled NGEN images happen
  7. Security - Fires on various security checks
  8. AppDomainResourceManagement - Fires when certain appdomain resource management events occur.
  9. Contention - Fires when managed locks cause a thread to sleep.
  10. Exception - Fires when a managed exception happens.
  11. Threading - Fires on various System.Threading.ThreadPool operations
  12. StartEnumeration - Dumps symbolic information as early as possible (not recommended)
  13. StopEnumeration - Dumps symbolic information as late as possible (typically at process stop). This is the default.
  14. JitTracing - Verbose information on Just in time compilation (why things were inlined ...)
  15. Interop - Verbose information on the generation of Native Interoperations code. 
  16. Stack - Turn on stack traces for various CLR events. 
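
For example, to collect with only the GC and Stack CLR events turned on, a command along these lines should work (event names are combined with '+'):

    PerfView /ClrEvents:GC+Stack collect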

ASP.NET Events

ASP.NET has a set of events that are sent as each request is processed.   PerfView has a special view that you can open when ASP.NET events are turned on.   By default PerfView turns on ASP.NET events; however, you must also have selected the 'Tracing' option when ASP.NET was installed for these events to work.  Thus if you are not seeing ASP.NET events but you are running an ASP.NET scenario, this is one likely reason why you are not getting data.

To turn on ASP.NET Tracing

The easiest way to turn on tracing is with the DISM tool that comes with the operating system.   Run the following command from an elevated command prompt
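
    DISM /online /enable-feature /featurename:IIS-HttpTracing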

Note that this command will restart the web service (so that the change takes effect), which may cause complications if your ASP.NET service handles long (many second) requests, since DISM will either delay (for a reboot) or abort the outstanding requests. Thus you may wish to schedule this with other server maintenance. Once this configuration is done on a particular machine, it persists.

You can also do this configuration by hand using a GUI interface.  You first need to get to the dialog for configuring windows software.  This differs depending on whether you are on a Client or Server version of the operating system.


Symbol Resolution

See also Source Code Lookup.

At collection time, when a CPU sample or a stack trace is taken, it is represented by an address in memory.    This memory address needs to be converted to symbolic form to be useful for analysis.   This happens in two steps. 

  1.  First determine if the code belongs to a particular DLL (module) or not. 
  2. Given the DLL, look up detailed symbolic information

If the first step fails (uncommon), then the address is given the symbolic name ?!? (unknown module and method).   If only the second step fails (more common), then at least the module is known, and the address is given the symbolic name module!?.

?!? Methods

Code that does not belong to any DLL must have been dynamically generated.   If this code was generated by the .NET Runtime by compiling a .NET method, it should have been decoded by PerfView.   However if you specified /NoRundown or the log file is otherwise incomplete, it is possible that the information necessary to decode the address has been lost.    More commonly, however, there are a number of 'anonymous' helper methods that are generated by the runtime, and since these have no name, there is not much to do except leave them as ?!?.    These helpers are typically uninteresting (they don't have much exclusive time), and can be folded into their caller during analysis (add ?!? to the FoldPats textbox).  They typically occur at the boundary of managed and unmanaged code.

module!? Methods

Code that was not generated at runtime is always part of the body of a DLL, and thus the DLL name can always be determined.   Precompiled managed code lives in NGEN images, which have .ni in their name, and the information to decode them should be in the ETL file PerfView collected.    If you see unknown function names in modules that have .ni in them, it implies that something went wrong with CLR rundown (see ?!? methods).  For unmanaged code (modules that do not have .ni), the addresses need to be looked up in the symbolic information associated with that DLL.   This symbolic information is stored in program database files (PDBs), and resolving it for a large trace can be fairly expensive (10s of seconds or more).   Because of this PerfView by default does not resolve any unmanaged symbols.

Instead it waits until you as the user request more symbolic information.  Typically this is done in the stack viewer by right clicking on a cell with a module!? name in it and selecting 'Lookup Symbols'.  This indicates that PerfView should search for the PDB file and resolve any names that it can in that module.  Problems finding the correct PDB are not uncommon, so this is not guaranteed to succeed, and it can take a few seconds to complete.   See the log file if 'Lookup Symbols' fails.

In general PerfView supports executing a command on multiple cells.  This can be handy for symbol resolution.  For example if there are several unresolved modules that look interesting to you (because they have high CPU usage), you can select them all (by dragging or shift-clicking) and then select 'Lookup Symbols'.  

It is possible to 'prefetch' symbols from the command line.   You do this by specifying the /SymbolsForDlls:dll1,dll2 ... qualifier when launching PerfView.   The DLLs in the list passed to /SymbolsForDlls do NOT have a file name extension or path.
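
For example, the following (with illustrative DLL names and data file) prefetches symbols for ntdll and kernelbase while opening a trace:

    PerfView /SymbolsForDlls:ntdll,kernelbase PerfViewData.etl.zip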

Default Symbol Path

By far, the most common unmanaged DLLs of interest are the DLLs that Microsoft ships as part of the operating system.    Thus if you don't specify a _NT_SYMBOL_PATH, PerfView uses the following 'standard' one
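
    SRV*%TEMP%\SymbolCache*https://msdl.microsoft.com/download/symbols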

This says to look up PDBs at the standard Microsoft symbol server https://msdl.microsoft.com/download/symbols and cache them locally in %TEMP%\SymbolCache.   Thus by default you can always find the PDBs for standard Microsoft DLLs.

However if you are interested in symbols for DLLs that Microsoft does not publish (e.g. your own unmanaged code), you must supply a _NT_SYMBOL_PATH before launching PerfView that specifies where to look.

Setting _NT_SYMBOL_PATH in the GUI

If you need to change the symbol path, you can either set the _NT_SYMBOL_PATH environment variable before you launch PerfView, or you can use the File -> SetSymbolPath menu option on the stack viewer window.   This command will bring up a simple dialog box showing the current value of the _NT_SYMBOL_PATH variable and allow you to change it.   The _NT_SYMBOL_PATH is a semicolon delimited list of places to look for symbols. Each such entry can be either

  1. A simple file system path. These can be relative, but absolute paths are recommended
  2. Syntax of the form SRV*localPath*symbolServer, where localPath is optional and specifies a location on your local machine to cache files fetched from the symbol server.   Using this is always recommended, and PerfView will add it for you (using %TEMP%\SymbolCache) if you don't enter it.    SymbolServer is the name of the symbol server; it is either a UNC file name (e.g. \\MySymbols\symbols) or a URL (e.g. https://msdl.microsoft.com/download/symbols)
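
For example, the following value (with C:\MyApp\symbols standing in for wherever your own PDBs live) checks a local directory first and then the Microsoft symbol server:

    C:\MyApp\symbols;SRV*%TEMP%\SymbolCache*https://msdl.microsoft.com/download/symbols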

Typically, if you don't get unmanaged symbols when you do 'Lookup Symbols', you should check the log and if necessary add new paths to the symbol path.   See also symbol resolution.

PerfView supports Azure DevOps symbol servers and it will automatically authenticate either using local development credentials (Visual Studio or VSCode) or by prompting you to sign in.

Summary

Thus typically all you need to get good symbols is

  1. If you are investigating performance problems of unmanaged DLLs or EXEs that did not come from Microsoft (e.g. you built them yourself), you have to set the _NT_SYMBOL_PATH to include the location of these PDBs before launching PerfView.

  2. Select cells that have !? in them in the viewer, right click and select 'Lookup Symbols'

Source Code Lookup

One very useful feature that is easy to miss is PerfView's source code support. This support is activated by selecting a name in the stack viewer and typing Alt-D (D for definition), or right clicking and selecting 'Goto Source'. This will bring up the source code for that name in a text editor, where every line has been annotated with the metric for that line. This feature is indispensable for doing analysis within a method, and is also generally useful for understanding what the code is doing.

Source code support is a relatively fragile mechanism because, in addition to having all the information to symbolically look up method names (PDBs), PerfView also needs line level information as well as access to the source code itself. It is easy for these extra conditions to fail, which breaks the feature. However source code support is typically so useful that it is worth the trouble to get it working.

In order for source code to work you need the following

  1. The code must support line level symbolic information. This includes
  2. PerfView must be able to find the source code. This can be accomplished in a number of ways.

PerfView gives detailed messages in PerfView's log of the steps it took to find the source code. Thus if there is any issue with looking up source code this log is the place to start.

Setting _NT_SOURCE_PATH in the GUI

Often you don't need to set the _NT_SOURCE_PATH variable, because by default PerfView will search both the original build time location (which will work if you build on the same machine you run) as well as the source server specified in the PDB symbol file (which works if the code was indexed with a source server). However in other cases you must set the _NT_SOURCE_PATH. Just like the case of _NT_SYMBOL_PATH, you can set this variable in the GUI by going to the File -> 'Set Source Path' menu entry of the stack viewer. This value is persisted across different invocations of the PerfView program.

See also Source Code Lookup.

Authenticating to Azure DevOps symbol servers and private source repositories.

If your symbols are on an Azure DevOps artifacts store, or your source code is not public, then PerfView may prompt you to sign in. Support currently exists for Azure DevOps and private GitHub repositories. PerfView will try to use the Git Credential Manager (typically installed with Git For Windows) if it is present. If the Git Credential Manager is not installed, PerfView will fall back to alternate authentication mechanisms. The authentication mechanisms can be configured on the Authentication submenu of the Options menu in the main PerfView window. The authentication options are described below.


'BROKEN' Stack Frame in Trace.  

When a sample is taken, the ETW system attempts to take a stack trace.    For a variety of reasons it is possible that this will fail before a complete stack is taken.    PerfView uses the heuristic that all stacks should end in a frame in a particular OS DLL (ntdll) which is responsible for creating threads.   If a stack does not end there, PerfView assumes that it is broken, and injects a pseudo-node called 'BROKEN' between the thread and the part of the stack that was fetched (at the very least it will have the address of where the sample was taken).    Thus BROKEN stacks should always be direct children of some frame representing an OS thread.  

When the number of BROKEN stacks is small (say < 3% of total samples), they can simply be ignored.  This is the common case.   However the more broken stacks there are, the less useful a 'top-down' analysis (using the CallTree view) is, because effectively some non-trivial fraction of the samples are not being placed in their proper place, giving you skewed results near the top of the stack.    A 'bottom-up' analysis (where you look first at the methods where samples occurred) is not affected by broken stacks (however as that analysis moves 'up the stack', it can be affected).

Broken stacks occur for the following reasons

  1. In 32 bit processes, ETW relies on the compiler to mark the stack by emitting an 'EBP Frame'.  When the compiler fails to do this and uses the EBP register for other purposes, it breaks the stack.   This should not happen for operating system code or for .NET Runtime code, but may occur for 3rd party code.
  2. In a 32 bit process on 64 bit Windows 7 or Windows Server 2008, there is a bug in which stacks are uniformly dropped in some sessions.  The good news is that it only happens intermittently, so if you collect the data again, you are likely to sidestep this bug.   This should be fixed in Windows 8.
  3. In a 64 bit process, ETW relies on a different mechanism to walk the stack.  In this mechanism the compiler generates 'unwind information'.    Currently this ETW mechanism does not work properly for dynamically generated code (as generated by the .NET runtime JIT compiler).  This causes stacks to be broken at the first JIT compiled method on the stack (you see the JIT compiled method, but no callers of that method).    This issue is fixed in Windows 8 but not in previous OS versions.
  4. Asynchronous activities.   Stack crawling is a 'best effort' service.   If the sample is taken at a time when it would be impossible to do the logging safely, then the OS simply skips it.   For example, if while crawling the stack in the kernel the stack page is found to be swapped out to disk, then stack crawling is simply aborted.

Working around 64 bit stack breaks:

If you are profiling a 64 bit process, there is a pretty good chance that you are being affected by scenario (3) above.    There are three workarounds to broken stacks in that instance

  1. NGEN the application.   The failures occur at JIT compiled code.  If you NGEN the application, JIT compilation will not be necessary and the broken stacks will disappear.   To NGEN your application simply type 

    C:\Windows\Microsoft.NET\Framework64\v4.0.30319\NGen install YourApp.exe

    You will have to repeat this every time your application is recompiled. If your code is called from a server, you need to NGEN all the DLLs that are important to you (same command line as above).

    For server applications there is often not a main EXE that you can pass to the NGEN command above, however you can NGEN particular DLLs using the same syntax (NGEN install DLLPATH). If you don't know the path names to your DLLs you can find them by going to the 'Events' view and selecting the 'ModuleLoad' and 'ModuleDCStop' events as well as the 'ModuleILPath' and 'ModuleNativePath' columns. Any DLL without a 'ModuleNativePath' is a candidate for NGEN.

  2. Switch to 32 bit.   If your code is pure managed code, then it can run as either a 32 or a 64 bit process.  By switching to a 32 bit process, you avoid the problem.   This does not work if you took dependencies on native code that only exists for 64 bit.    You can convert your application to run 32 bit by using the CorFlags utility that comes as part of the .NET SDK.   It also comes as part of Visual Studio (open the VS command prompt).   To switch, simply type CorFlags /32bit+ YourApp.exe. You will have to repeat this every time your application is recompiled.

    For ASP.NET applications you can set it so that your page is loaded in a 32 bit process by following the instructions in this blog

  3. Perform only a bottom-up analysis.   Even with many broken stacks, there is a lot of information in the profile, and a 'bottom-up' analysis is possible. 

Missing frames on stacks (Stack says A calls C, when in the source A calls B which calls C)

Missing stack frames are different from a broken stack because it is frames in the 'middle' of the stack that are missing.   Typically only one or maybe two methods are missing.   There are three basic reasons for missing frames.

  1. Inlining.   If A calls B calls C, if B is very small it is not unusual for the compiler to have simply 'inlined' the body of B into the body of A.   In this case obviously B does not appear because in a very real sense B does not exist at the native code level.
  2. Tail-calling.   If the last thing method B does before returning is to call C, the compiler can do another optimization.   Instead of calling C and then returning to A, B can simply jump to C.    When C returns it will simply return to A directly.    From a profiler's point of view, when the CPU is executing C, B has been removed from the stack and thus does not show up in the trace.   Note also that B does not need to be small for this optimization to be beneficial.  The only requirement is that calling C is the last thing that B does.  
  3. EBP Frame optimization.  In 32 bit processes (64 bit processes don't use EBP Frames), the profiler relies on the compiler to 'mark' the call by emitting code at the beginning of the method called the EBP Frame.    If the compiler does not set up a frame at all and uses the EBP register for its own purposes, it results in a broken stack.   However even when the compiler is aware of the need to generate EBP Frames, there is overhead in doing so (2 instructions at the beginning and end of the method).   For small methods (too big to inline, but still small), the compiler can opt to simply omit the generation of the frame (but leave EBP untouched).   This results in a missing frame.   It should be noted that the EBP Frame that a method sets up marks the CALLER, not itself.   Thus if method B seems to be missing, it is not because B omitted its EBP Frame but because method C did.    Thus this kind of frame omission happens when method C is small, not when B is small.

While missing frames can be confusing and thus slow down analysis, they rarely truly block it.   Missing frames are the price paid for profiling unmodified code in a very low overhead way.


Troubleshooting

Main View Troubleshooting

Stack Viewer Troubleshooting

Event Viewer Troubleshooting


Tips

Here are useful techniques that may not be obvious at first:

General Tips


Frequently Asked Questions (FAQ)

 

Release Notes