ICS Technical Report 96-16
This paper explores some of the issues that arise in the effective use of measures to monitor and improve formal technical review practice in industrial settings. It focuses on measurement dysfunction: a situation in which the act of measurement affects the organization counter-productively, yielding results directly opposed to those the organization intended the measurement to achieve.
The next section discusses several general concepts in measurement and measurement dysfunction. The following section applies these concepts to formal technical review and catalogs some of the potential measurement dysfunctions that might occur in FTR. The final section proposes some strategies for minimizing the threat of measurement dysfunction in FTR.
According to Austin, whenever you measure an attribute of an organization with the goal of improving the organization's performance, you run the risk of worsening that performance as a direct result of the measurement. As a simple illustration, consider the apocryphal story of the Soviet boot factory. Faced with high production quotas, the factory's managers chose to produce only size-7 left boots. The desired productivity was achieved, as defined by the measurement, though the basic goals of the organization were sacrificed in the process.
For the purposes of exploring measurement dysfunction in formal technical review, a very simple analysis of measurement practice can suffice. Using Austin's terminology, there are at least two uses to which a given measurement can be applied: for information and for motivation.
Informational measurement "tells about an organizational process... It is used to learn from and to plan." In more FTR-specific terms, informational measures support process improvement.
Motivational measurement, on the other hand, "is used to quantify the value of compensation for compliance with objectively verifiable standards of work." In other words, motivational measurement is used to evaluate the performance of individuals.
One tenet of Austin's book is that an individual measure is "value-free" with respect to its application: it can be used for informational purposes, motivational purposes, or both. Importantly, it is impossible for an organization to guarantee that a measure, once taken, will never be used for motivational purposes. Thus, individuals in an organization may tend to operate under the assumption that any measure of individual performance can be used for motivational purposes, regardless of the organization's stated intention with respect to that measure at the time it is taken.
Measurement dysfunction occurs when individuals behave in a manner that produces good values of the measure for the purposes of motivation (i.e., performance evaluation), although this behavior simultaneously undermines the value of the measure for the purposes of information (i.e., process improvement).
Austin does not claim that measurement dysfunction is guaranteed to occur whenever a measurement can be applied for both information and motivation. However, he does provide a compelling argument that dysfunction is a natural outcome and, unfortunately, that its presence and degree are typically quite difficult to detect.
Austin's book provides much more detail on the general issues and remedies for measurement dysfunction. The next section turns to the specific case of dysfunction in formal technical review.
What Austin's work reveals, however, is that it is not enough for the organization to claim that it will not use a review metric for personnel appraisal. As long as the review metric is available to management for use in that manner, reviewers may act as if it will be used in that way (either now or in the future). The problem with software review processes such as Gilb's Inspection (as well as with review processes of my own design, such as CSRS/FTArm) is that they are designed in such a way that virtually all measures of review are available to management for use as motivational measures. Measurement dysfunction is a potential outcome of such a review process, regardless of whether the organization ever actually uses the measures for such purposes.
To detail this problem, this section provides explicit descriptions of eight different forms of measurement dysfunction possible in FTR.
Measurement occurs during FTR at both the individual and the group level. Individuals generate measurable data during planning, review preparation, and rework. Groups generate measurable data during the review meeting. Dysfunction can occur in both individual and group measures. For each type of dysfunction described below, its occurrence at either or both of the individual or group level is noted.
Measurement dysfunction can occur when an organizational goal is to increase the number of "important" defects found during FTR, as measured by metrics such as the percentage or frequency of non-minor errors found during FTR. Groups and/or individuals can artificially inflate the severity level accorded to defects to achieve the desired measurement improvement, even if the "real" measure is stable or falling. Such inflation may occur subconsciously, without any direct intent by individuals or groups to deceive the organization.
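As a purely hypothetical illustration (the counts below are invented for this sketch, not drawn from any actual review), a small shift in borderline severity judgments can move the metric substantially without any change in what was actually found:

    # Hypothetical sketch: severity inflation and the
    # "percentage of non-minor defects" metric. All counts invented.

    # A review finds 20 defects; honest classification:
    minor, major = 15, 5
    print(major / (minor + major))   # 0.25 non-minor

    # The same 20 defects, with 5 borderline items
    # (perhaps subconsciously) reported as "major":
    minor, major = 10, 10
    print(major / (minor + major))   # 0.50 non-minor

    # The metric doubles, yet the defect profile of the
    # work product itself is unchanged.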
Measurement dysfunction can occur when an organizational goal is to improve FTR defect detection, as measured by an increase in the defect density measure over time. In this case, groups and/or individuals can improve this number over time by beginning to classify as defects certain items that were previously placed in some non-defect category (such as "minor issues", "syntax", "formatting", or "author question"). Again, such a change might be entirely subconscious, and could result in an improvement in the reported measure even though the "real" measure is stable or falling.
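Again as an invented sketch (the figures are assumptions, not data from any real project), such reclassification can produce a rising reported defect density, a "false trend", while the true density never moves:

    # Hypothetical "false trend" in defect density over four periods.
    # Density is defects per KLOC reviewed; all figures invented.
    kloc_reviewed = 10
    true_defects = [40, 40, 40, 40]   # genuinely stable
    reclassified = [0, 5, 10, 15]     # former "author questions" now logged as defects

    for period, (true, extra) in enumerate(zip(true_defects, reclassified), 1):
        print(f"Period {period}: reported {(true + extra) / kloc_reviewed:.1f}, "
              f"true {true / kloc_reviewed:.1f}")

    # Reported density climbs from 4.0 to 5.5 defects/KLOC;
    # the true density stays at 4.0 throughout.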
Measurement dysfunction can occur when an organizational goal is to improve defect closure rates or efficiency, as measured by metrics such as the number of open defects or the average time to close a defect. An organization might also weight this measure by defect severity in order to direct more effort toward closing the more severe defects.
In this case, groups and/or individuals can improve this number over time in at least two ways. First, they can begin classifying certain difficult-to-fix issues as "enhancements" rather than "defects", which lowers the measured number of hard defects in the work product and thus improves the measure. Second, in the case of weighted measures, they can gradually lower the severity classification assigned to defects of a given type, which improves the value of the weighted measure.
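To make the weighted case concrete, here is a minimal sketch of a severity-weighted open-defect measure; the weights and defect lists are assumptions invented for this example:

    # Hypothetical severity-weighted open-defect measure.
    # Weights and classifications are invented for illustration.
    weights = {"critical": 10, "major": 5, "minor": 1}

    def weighted_open(open_defects):
        """Sum the severity weights of all open defects."""
        return sum(weights[severity] for severity in open_defects)

    honest   = ["critical", "major", "major", "minor", "minor"]
    deflated = ["major", "minor", "minor", "minor", "minor"]  # gradual severity deflation

    print(weighted_open(honest))    # 22
    print(weighted_open(deflated))  # 9

    # The measure improves by more than half, although not a
    # single defect has actually been closed.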
Measurement dysfunction can occur when an organizational goal is to improve or assess review (or reviewer) quality as measured by review preparation time. Measurement dysfunction can also occur as a result of the way in which reviewer preparation time data is collected. Typically, preparation time is publicly reported by each reviewer to the moderator at the beginning of the review meeting. In this latter case, measurement dysfunction can result from pressure to be viewed as having worked as hard on review preparation as the other participants, and thus not to be viewed as a "slacker".
Since reviewer preparation is a private activity, individuals are free to report a greater-than-actual preparation time to the moderator without fear of being found out. Furthermore, there is substantial pressure on participants never to report "no preparation", since this could result in rescheduling of the meeting. Low preparation effort could also show up (either implicitly or explicitly) as a negative factor in a future individual performance evaluation.
Preparation time inflation can also occur subconsciously, as when a reviewer simply "rounds off" a preparation time of a little over an hour and a half to two hours. Even such rounding results in a 30 percent error in the measure.
Measurement dysfunction can occur when an organizational goal is to improve individual or group efficiency as measured by defect discovery rate. In this case, reviewers may feel pressured to adjust their reported preparation times downward in order to move their efficiency upward. As noted above, reviewer preparation times are particularly vulnerable to alteration, since they are reported publicly and are essentially unverifiable.
For example, a reviewer who spends three hours preparing (rather than the recommended two hours) would only benefit from reporting a preparation time closer to the recommended value: her efficiency would appear to be greater, and the disparity in preparation effort between her and her colleagues would probably be reduced.
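The arithmetic behind this example (the defect count is an invented assumption) shows how large the distortion can be:

    # Efficiency measured as defects found per preparation hour.
    # The defect count is invented for illustration.
    defects_found = 6
    actual_hours = 3.0      # real preparation effort
    reported_hours = 2.0    # the "recommended" value reported instead

    print(defects_found / actual_hours)    # 2.0 defects/hour (true efficiency)
    print(defects_found / reported_hours)  # 3.0 defects/hour (reported efficiency)

    # Underreporting one hour of preparation inflates apparent
    # efficiency by 50 percent, and nothing in the recorded data
    # can expose the discrepancy.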
Measurement dysfunction can occur when an organizational goal is to improve either defect density or defect detection rates for individuals. In this case, individuals will feel pressure to increase the number of defects they can report as found during preparation. One way, noted previously, is to inflate severity. Another way is to talk with other reviewers about the document and find out what errors they have discovered. The reviewer can then report those errors on his own preparation defect report. The result is an increase in the individual's personal measures, although these duplicate issues are simply collated during the meeting and do not result in any net quality improvement.
Measurement dysfunction can occur when an organizational goal is to increase review usage, as measured by review participation, review coverage, or both. In either case, there is pressure simply to maximize participation and/or coverage while minimizing the actual effort allocated to review. The result is high-frequency but low-quality review.
Dysfunction can occur as soon as one or more individuals recognize that a measure might be used either now or in the future for performance evaluation. Thus, promises by current management to use the measures "appropriately" are largely ineffectual in preventing measurement dysfunction. As everyone knows, managers and management policies are subject to periodic and unpredictable change.
As shown in several of the examples above, dysfunction can occur without any explicit, conscious, malicious attempt on the part of developers to subvert the data. In many cases, it manifests itself gradually over an extended period of time as a "false trend". Determining its existence would require extraordinary effort in most organizations, such as comparing similar defect types over a period of a year or more to determine whether severity inflation is occurring. Other forms of dysfunction, such as in preparation time reporting, may be impossible to detect.
First, if individuals and groups were provided with a measurement system that could not be used for performance evaluation, there would be little motivation to record measurements inaccurately. Second, if the measurement system were clearly useful for informational purposes, there would be clear motivation to record the measures consistently and reliably. Obtaining FTR measurement with minimal dysfunction appears to require that both of these conditions be met.
Let us assume that the measures listed in the previous section have value as informational measures, and thus that the goal is to determine how to obtain accurate values for them for use in process improvement while minimizing the possibility of measurement dysfunction. What would a formal technical review process look like under these circumstances? Here are some design principles for formal technical review processes based upon the goal of minimal measurement dysfunction:
The degree to which a review method minimizes the possibility of dysfunction is directly correlated with the degree to which it protects the privacy of individual data. Any individual data not protected by a review method is subject to potential dysfunction and may not truly represent the process.
Some of the more important implications of this principle are:
A solution to this problem is to ensure that any data generated by individuals is always presented in aggregate form along with data from other individuals, and with all ownership information removed.
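As a minimal sketch of this aggregation principle (the record format, field names, and statistics are assumptions of this example, not features of any particular review tool), per-reviewer records might be reduced to anonymous group-level statistics before anyone outside the group sees them:

    # Minimal sketch: reduce per-reviewer records to anonymous
    # group-level aggregates. Field names are invented.
    from statistics import mean, median

    def aggregate(records):
        """Strip reviewer identity; report only group-level statistics."""
        times = [r["prep_hours"] for r in records]
        counts = [r["defects_found"] for r in records]
        return {
            "reviewers": len(records),
            "total_defects": sum(counts),
            "mean_prep_hours": mean(times),
            "median_prep_hours": median(times),
        }

    prep_data = [
        {"reviewer": "A", "prep_hours": 2.0, "defects_found": 7},
        {"reviewer": "B", "prep_hours": 1.5, "defects_found": 4},
        {"reviewer": "C", "prep_hours": 2.5, "defects_found": 9},
    ]

    # Only the aggregate, with ownership information removed,
    # leaves the group.
    print(aggregate(prep_data))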
Some implications of this principle are:
To satisfy both requirements, there must be some way for individuals to personally benefit from the effort invested in collecting review measures. A design requirement for a minimally dysfunctional review process is to provide mechanisms for analysis of individual review data that support individual process improvement. Fortunately, there exists a rich source of insight into empirically driven personal software process improvement in the form of Watts Humphrey's "A Discipline for Software Engineering" (Addison-Wesley, 1995).
Here are some example review measures of group activity that, in order to minimize dysfunction, should be kept private by the group and not made available to management:
This implies that groups must be trained and provided with the tools necessary to do analysis and process improvement of their own data.
As a worst-case scenario, if measurement dysfunction is found to be a substantial and widespread problem in formal technical review practice, then:
It is important to note that susceptibility to measurement dysfunction does not necessarily translate into actual dysfunction; such susceptibility must be viewed as a continuum. Whether an organization needs to change some, most, or all of its FTR measurement mechanisms depends upon its culture and history.
Finally, there is also a silver lining. Formal technical review, despite having been shown repeatedly to provide positive benefits, still suffers from a low adoption rate. Perhaps at least some of these adoption problems stem from the perceived invasiveness of FTR measures. As new FTR methods appear that actively attempt to prevent measurement dysfunction, at least one of the current obstacles to FTR adoption might be ameliorated.