If you have missed the Twittersphere brouhaha and you want to get more details on what I think about metrics, you can listen to my recorded webinar on the topic. You can ask the other folks about what their objections are.
Once you do, I’d be interested in knowing your thoughts. Are test metrics useful to you? What problems have you had? What benefits have you received? What different situations have you used metrics in (e.g., Agile vs. waterfall, browser-based apps, level of risk, etc.), and how did that context affect the metrics? Let me know…
]]>Hello,
I wanted to respond to your Twitter reply regarding my first attempt at using PRAM (Twitter post Feb 18, 2013 6:58am) in more detail.
The real success of this testing effort will of course be measured once the software is out in the field, but this is the “small” success: I was assigned a project to coordinate testing involving 25 integrating applications. Testing needed to be coordinated between unit and production testers, who would be testing at the same point in the development pipeline. (The company is exploring new testing methodologies, so utilizing testing resources in different ways is part of this.) I needed to find a common ground from which to develop a test plan for this project. I had just seen your presentation on PRAM and brainstorming, so I thought focusing on the risks would bring testers together. From the software specs, I compiled a list of quality risk categories. I put these on the board, and as I passed out Post-it notes, I asked the newly formed team of unit/production testers from each application to identify the test types they would use to mitigate the risks. One of the interesting outcomes was that one area of risk did not have as many Post-it notes as the others. We then needed to decide whether this was a risk area that did not need to be a priority, or whether it needed to be tested but not by all apps. As a result of this exercise, testers collaborated on their test plans for their specific application, while simultaneously evaluating the effect this project had on the product as a whole.
How does this experience differ from how you might have coordinated the test planning for this type of project? Any suggestions for the future?
Thank you for your interest,
Shoshannah Gil
First, glad to hear that you’re having success with the PRAM technique.
One thing that struck me about your e-mail is that you only mentioned testers as participants in your risk analysis process. Testers do have an excellent perspective on product quality risk, but their perspective, like that of any single stakeholder group, is incomplete.
As you continue to refine your use of the technique, be sure to include other business and technical stakeholders. You can find more details on the hows and whys of stakeholder involvement in the RBCS Digital Library.
]]>Hi Rex
I hope you are well. I wanted to ask you a question about test estimation. I am sure you have been asked many of these before, but the one I have is not really about the estimation techniques themselves (such as use of historical data, dev effort, etc.).
There is one area of test estimation which is always arguable, hard to estimate, and hard to explain to sponsors no matter how well prepared you are: the test analysis and design task, which is vague by definition. If we quickly decompose it into smaller pieces, we end up with the following simplified list of activities:
1. Analyse the test basis (if it exists).
2. Get ambiguities, inconsistencies and gaps in the test basis resolved.
3. Apply the test techniques to create the test cases.
While you can more or less quantify 1 and 3 in terms of effort (let’s assume we at least have something to work with in terms of test basis), the issue is obviously with 2, where we are dependent on many people (business analysts, system analysts, dev team, end users, etc.).
There are two obvious options we can choose from:
Option 1: Assume there will be no gaps or issues, or that they will be resolved immediately and all our questions will be answered without delay, and thus simply estimate 1 and 3. Of course, we will document these assumptions to highlight the risk if they prove false. The problem with this option is that we know straight away we will go over budget and will have to come back to the business sponsors and ask for more money. Nobody likes doing this, especially if we have to ask for an additional amount every single time our inadequate estimates deviate from reality. Moreover, on some projects the budget is fixed as soon as it has been confirmed.
Option 2: Estimate this slippage time, or time spent on requirements clarification, based on previous experience. The issue here is that this is ‘unexplained’ effort, to the extent that it can’t be justified by a statement such as “but we know there will be issues or something will not be ready”. A pretty plausible reaction in this case would be: “Hey, we have just two requirements here. Why the hell does it take two weeks and not two days to create the tests for these two?”
To me, getting the right questions raised, asked and answered is a part of test analysis, and this activity is extremely important as it prevents defects. To a certain extent this is a very informal static testing or QA activity which needs to be built into the process, but nobody is willing to pay for it explicitly. On the other hand, ethics does not allow you to simply ignore problems and test that buggy software is buggy. In the latter case, I normally still try to squeeze in the static test and get decision makers to accept the risk that problems with the requirements may arise very late.
I wanted to hear your recommendation on test analysis and design effort estimates and on test effort negotiation with business sponsors and project managers. It would also be great to hear your comments on both options, or perhaps on an option 3 if it exists.
Thanks
Stas Milev
ISTQB Certified Advanced Test Manager (CTAL)
Hi Stas–
A good question. What I would suggest is that the estimation for activities 1, 2, and 3 should be based on historical data. So, if you know that you have some average number of test cases per identified quality risk, per specified requirement, per supported configuration, etc., you should be able to estimate activities 1 and 3 based on the average number of hours of effort per test case. For activity 2, once again, if you have historical data on the average number of defects typically found per test basis document page, you should be able to estimate the number of defects you’ll find. If you know the average time from discovery to resolution of such defects, and the average amount of effort for each such defect, you can then estimate the delay and effort.
The metrics gathered about test basis defects could be used not only for estimation, but also for process improvement.
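As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The `History` fields, the `estimate` function, and every number in the example are hypothetical placeholders rather than real project data, and treating the delay as the sum of individual resolution times (defects resolved one after another) is a simplifying assumption that gives an upper bound.

```python
from dataclasses import dataclass


@dataclass
class History:
    """Historical averages. All values supplied here are placeholders."""
    tests_per_requirement: float    # average test cases per specified requirement
    hours_per_test: float           # average analysis and design effort per test case
    defects_per_page: float         # average test basis defects per document page
    effort_hours_per_defect: float  # average effort to get one such defect resolved
    days_to_resolve: float          # average days from discovery to resolution


def estimate(history: History, requirements: int, pages: int) -> dict:
    """Estimate activities 1 and 3 (analysis, design) and activity 2 (clarification)."""
    test_cases = requirements * history.tests_per_requirement
    expected_defects = pages * history.defects_per_page
    return {
        # Activities 1 and 3: effort scales with the expected number of test cases.
        "analysis_design_hours": test_cases * history.hours_per_test,
        # Activity 2: expected test basis defects and the effort to resolve them.
        "expected_basis_defects": expected_defects,
        "clarification_hours": expected_defects * history.effort_hours_per_defect,
        # Assumption: defects are resolved serially, so this is an upper bound.
        "max_delay_days": expected_defects * history.days_to_resolve,
    }


# Hypothetical example: two requirements described in a ten-page specification.
history = History(
    tests_per_requirement=4.0,
    hours_per_test=1.5,
    defects_per_page=0.5,
    effort_hours_per_defect=2.0,
    days_to_resolve=3.0,
)
result = estimate(history, requirements=2, pages=10)
```

With figures like these in hand, the "why two weeks and not two days" conversation can point to the organization's own track record rather than to a gut feeling.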
]]>I was reading some of your excellent webinars, especially the one about defect metrics. So my question is: Is there any benchmark data about defect severity, and also about reopened defects?
I would really appreciate it if you could guide me to where I can find this data.
Note: the purpose of this data is to measure and compare our results against the benchmark.
Tomas Gotes
Thanks for the question, Tomas. In terms of defect severity, unfortunately there are no standard, common definitions for severity in the industry yet. So, any metrics that showed aggregated data from multiple organizations would probably be rather questionable. That said, you might check out Capers Jones’ excellent book, The Economics of Software Quality, which, as I recall from reviewing it, had some interesting analysis here.
In terms of defect report re-open rates, our target during assessments is 5% or less. Some amount of defect report re-open seems inevitable, since test environments and data are generally more representative of production than are the developer’s environments. However, there is a significant risk of schedule delay, as well as significant inefficiency (due to an additional layer of rework), when the defect re-open rate gets too high.
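For teams that want to track this, a minimal Python sketch of the comparison might look like the following. The definition used here (re-opened reports divided by reports that were marked as fixed) is an assumption, since organizations count this differently; only the 5% target comes from the answer above, and the function names are hypothetical.

```python
TARGET_REOPEN_RATE = 0.05  # the 5% target mentioned above


def reopen_rate(reports_fixed: int, reports_reopened: int) -> float:
    """Fraction of fixed defect reports that were later re-opened (assumed definition)."""
    if reports_fixed <= 0:
        raise ValueError("reports_fixed must be positive")
    if not 0 <= reports_reopened <= reports_fixed:
        raise ValueError("reports_reopened must be between 0 and reports_fixed")
    return reports_reopened / reports_fixed


def meets_target(rate: float, target: float = TARGET_REOPEN_RATE) -> bool:
    """True when the re-open rate is at or below the target."""
    return rate <= target


# Hypothetical release: 200 reports marked fixed, 12 of them re-opened.
rate = reopen_rate(reports_fixed=200, reports_reopened=12)
on_target = meets_target(rate)
```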
]]>Dear Rex,
My name is Marcus Milanez and I’m a software developer who lives in São Paulo, Brasil. I recently bought a copy of your “Testing Metrics” ebook and found it really valuable for improving my understanding on testing processes, as well as hearing more words on the importance that metrics play on my daily tasks. Thanks for sharing all your studies and results, I appreciate it.
However, I still have some questions for which I couldn’t get any reasonable answers. Maybe the answers have been presented to me already, but my limited comprehension is failing to absorb them. In order to ask these questions correctly, I would like to add some general context, just to make sure that I’m using my words properly.
Forgetting a little bit about all the nitty-gritty details that are observed during the creation or maintenance of a software system, in general, in a regular software development process we have:
1) Customers require features that must be properly implemented
2) Developers along with specification teams, stakeholders and other parties, define the scope of that feature and write a couple of functional/acceptance tests that validate the analysis of the customer request
3) In a planned sprint, developers implement the feature and get feedback from the specs team, stakeholders and customers, to make sure that everything is fine.
4) The quality team verifies the quality of any version delivered by developers. Bugs are filed and later fixed by developers in a sprint.
I clearly understand that the list above is incomplete, could be better and is definitely subject to change, but I’m just trying to give an overall context. Now, by reading your book, I could clearly see all the value that metrics and proper testing strategies add to a software project, and I’m glad I was introduced to things like DDE and all those sample charts; that information is really gold. While reading the book, though, I tried hard to use the same logic that you presented in your DDE formula to somehow measure the value of unit tests created by developers during development time. The thing is that I couldn’t reach any conclusion, because it looks like the value added by unit tests can’t be measured in the same way that you presented in your book. My reasons for that are:
1) It looks like unit tests are not making verifications in the same sense that exploratory or automated tests are; rather, we regularly use them to get something built according to a set of premises, and later on, this set of premises can be verified.
2) Developers don’t usually count, or indicate, bugs found during development time. I don’t even think that this would make any sense.
3) They are not testing a use case, they are testing smaller pieces of it. Although these smaller pieces can directly affect a use case, it looks like these observations are quite different from those made by an automated functional/acceptance test.
So, my question is: can we measure the value/cost benefit added by unit tests, using metrics that indicate that they are definitely effective in a software project, or is the value of unit tests not related at all to the number of bugs filed by a validation team, for example? Can the value added (or not) by unit tests be measured, qualified and improved? Is it possible to observe unit tests, alone, from a cost perspective?
I exercised this question on my own, and so far the following seem to be valid for me:
It looks like the main intention of unit tests is not to find bugs. Unit tests seem to be closer to a development tool (like an IDE is) than a tool for finding problems.
Since unit tests are part of the development effort, it is difficult to say how many hours have been spent writing unit tests versus writing production code.
Although projects with high unit test code coverage tend to have fewer bugs filed, coverage is not a guarantee of code free of use case bugs.
Still thinking about the number of bugs, I believe that unit tests tend to prevent run-time errors more than use case errors.
Apparently, the real value of unit tests comes in terms of maintenance, the ability to add or remove features without fear, and mainly as a means of communication with other developers working on the same code base.
If these items are valid, it is likely that unit tests prevent a given number of bugs from appearing during the validation phase. However, in order to assert that, we would need to collect all the metrics related to a given project that doesn’t have unit tests (like DDE, cost per bug fix, time per bug fix and others), properly implement unit tests for this project, and then get the very same metrics again for comparison. Is this rationale correct? If so, can we confidently say that unit tests, alone, were responsible for getting these numbers changed? Is there any published study that analyses these points?
Thank you so much for your attention and sorry for this long email. The thing is that I found your book extremely useful and I definitely want to improve my understanding of many aspects of my profession.
yours,
Marcus Milanez
Thanks for your kind words about my book. I’m glad it’s proving interesting and useful to you, and that it provoked such a good and reflective set of questions.
Before I get into a detailed response, let me set up the background for what I’m about to write:
With those assumptions in mind, here are my thoughts on the many good comments and points that you raise. Let me start by reviewing the development process:
So, let’s get back to metrics and measuring the value of unit testing, starting with not logging defects found by developers. This is a mistake. I agree that we don’t need to log failures that occur during step 1 above, because, as I said, we’re not really testing, we are developing the software. However, once we get into steps 2, 3, and 4, I believe those defects should be logged. Often when I say this, people have a horrified reaction and say, “Oh, no no no, we can’t have programmers interrupting themselves and breaking their flow to log defects in some cumbersome bug tracking tool.”
To which I reply: “I didn’t say they had to use a bug tracking tool. I said they should log the defects.” Here’s the distinction. I agree that there’s no need for a tracking tool here, because the bug is going to be immediately removed. Its lifecycle will start and end on the same day, and so all the state-based stuff that the tracking tool does is unnecessary. But we do need to log information about the defect, especially classification information, phase of introduction, etc., and we should use the same classifications as are used during formal testing. This information can be captured in a simple spreadsheet. That way, we can do analysis of the defects found during unit testing (step 2), code reviews (step 3), and unit regression testing (step 4).
What can this analysis do for us? A number of things. First, it can help developers and the broader organization learn how to write better code. Patterns in types of defects found during steps 2, 3, and 4 can show that training and mentoring is needed to reduce bad coding practices or habits that some developers might have.
Second, to address your main question, it can allow us to directly measure the value of unit tests. To do so, we need to do one more thing, which is to estimate the effort associated with unit testing and unit test defect removal as well as estimating the effort associated with other levels of testing (e.g., acceptance testing, system integration testing, etc.), removal of defects found in those levels of testing, and the effort associated with failures in production. This is actually easier than you might think, and sufficiently accurate numbers can be obtained by surveying the people involved.
The differences are often quite dramatic. One client found that a defect found and removed in unit testing took three person-hours of effort, while a defect found and removed in system testing took 18 hours, a defect found and removed in system integration testing took 37 hours, and a defect found and removed in production took 64 hours. So, for this client, each defect found in unit testing saved at least 15 hours of effort, and possibly as much as 61 hours of effort!
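The arithmetic behind those savings can be written out as a short Python sketch. The per-defect effort figures are the ones from the client example above; the function names and the count of 100 unit-test defects are hypothetical, used only to show how the range scales.

```python
# Average person-hours to find and remove one defect, from the client example above.
EFFORT_HOURS = {
    "unit testing": 3,
    "system testing": 18,
    "system integration testing": 37,
    "production": 64,
}


def hours_saved_per_unit_defect(effort: dict) -> dict:
    """Hours saved when a defect is caught in unit testing instead of a later phase."""
    base = effort["unit testing"]
    return {
        phase: hours - base
        for phase, hours in effort.items()
        if phase != "unit testing"
    }


def savings_range(unit_defects: int, effort: dict) -> tuple:
    """Lowest and highest total hours saved for a given count of unit-test defects."""
    saved = hours_saved_per_unit_defect(effort)
    return unit_defects * min(saved.values()), unit_defects * max(saved.values())


saved = hours_saved_per_unit_defect(EFFORT_HOURS)
# saved == {"system testing": 15, "system integration testing": 34, "production": 61}

# Hypothetical volume: 100 defects found and removed during unit testing.
low, high = savings_range(100, EFFORT_HOURS)  # (1500, 6100) person-hours
```

The low end assumes every one of those defects would otherwise have been caught in system testing, and the high end assumes every one would have escaped to production; the real figure lies somewhere in between.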
Notice that this approach allows us to avoid the “double blind study” approach that you mentioned earlier. We don’t have to find two almost-identical projects, where the only difference is whether unit testing happened or not. I expect that would prove impossible, so it’s good that we don’t need to do it.
Now, as you can see, we can measure part of the value delivered by unit testing, in terms of avoided downstream costs of failure. (You might want to read more about cost of quality, which is the technique used above; check out my article here.) However, I also agree with your earlier comments about unit testing having other objectives, such as easing the maintenance of code, and reducing the risk of breaking something when we do, as well as helping developers communicate about the code. (That last benefit is especially applicable if you follow step 3 mentioned above.) It is possible to develop metrics that allow you to measure these benefits as well, but, if the main objective of measuring the value of unit testing is to convince managers to continue to invest in it, then the “effort saved” metric mentioned above should be sufficient.
]]>Hi Rex,
I am Jason, and a few days ago I attended the webinar on Reviews. As promised, I have collated the list of questions based on the Study Preparation Guide for CTAL-TM that I purchased from RBCS recently. Please do not hesitate to contact me should you require further information from me. Below are the questions and my comments:
71. An organization follows a requirements-based test strategy for most of its projects. Which of the following is the best example of modifying the test approach for a project based on an understanding of risks?
A. Past performance issues lead to an increased effort on performance testing
B. Test estimation is based on the number of pages in the requirement specification.
C. Test execution is outsourced to a testing company based on a low-cost bid.
D. Unit test effort is limited to ensure early commencement of system test execution.
I did not really understand this question very well. How are performance issues in the past related to risk?
Jason, A is the right answer because we are using past defect information (in this case, performance defects) to assess the likelihood of particular types of problems.
84. You are managing a test effort that uses entirely reactive techniques, including a list of past bugs found in the field, a checklist of typical bugs for products using this technology, and exploratory testing based on tester experience. No written tests are developed prior to test execution, other than the list of bugs. Consider the following statements:
i Immune to the pesticide paradox
ii Repeatable for regression and confirmation testing
iii Useful in preventing bugs during system design
iv Cheap to maintain
v Makes no assumptions about tester skills
Which of the following is true about this particular testing strategy?
A. I and III are benefits of this strategy.
B. II and V are benefits of this strategy.
C. I and V are benefits of this strategy.
D. I and IV are benefits of this strategy.
Argument: I should not be part of the answer. “Exploratory testing based on tester experience” allows different test case combinations to be tested. If that is the case, why is it immune to the pesticide paradox? To my understanding, being immune to the pesticide paradox means having no effect on the pesticide paradox.
Since it is based on a list of past bugs found in the field and a checklist, I would consider II a benefit, as testers can run the regression tests based on these checklists.
The Foundation and Advanced syllabi are clear on the fact that detailed test cases–which do not exist here–are useful if repeating tests precisely for regression and confirmation testing purposes is needed. So, II cannot be a benefit.
The Foundation syllabus, in the section about general testing principles, says that the only way to overcome the pesticide paradox is to run different tests rather than repeating the exact same tests. Here, all tests would be subtly (or even dramatically) different if repeated, especially if testers with different experience are used for subsequent executions, so I is a benefit.
125. You manage a test team for a bank. Your test team uses two primary test strategies, checklist-based and dynamic. As its checklist, your team has a list of main areas in which the test team, in-house users, or bank customers have reported defects on past releases. For the dynamic testing, it employs members of the test team with experience in the bank branches and back-office to do exploratory testing.
Based on this information alone, which of the following is an improvement that you would expect from a STEP assessment?
A. Get involved earlier in the lifecycle
B. Analyse requirements specifications
C. Run only scripted tests.
D. Improve the office environment.
The answer given in the guide is B. Analyse requirements specifications. Could you please elaborate on the reason for this answer?
Yes. The reason is that, as mentioned in the Advanced syllabus, STEP is based on an assumption that requirements analysis and requirements-based testing will occur.
Looking forward to hearing from you soon. Thanks.
I hope this is helpful.
]]>I have been asked to document the benefits of early engagement of QA in the SDLC and am looking for any qualitative or quantitative information, articles, papers, or industry experts I can reference. Any advice or guidance you could provide would be greatly appreciated. Thanks.
Steven, I’d recommend that you pick up a copy of Capers Jones’ recent book, The Economics of Software Quality. This is an excellent resource for what you’re trying to accomplish. Anecdotally, I can mention that we did a study for a client where we found that the average cost of removing a defect in reviews was $37, while the cost to remove a defect in system testing was $3,700. Due to those relative costs, and the number of defects involved that escaped to system testing, this client was losing between $100,000,000 and $250,000,000, on a $1,000,000,000 annual IT budget.
]]>Hi Rex
Could you please help me with the following question? Is it right to say that the order of reviews according to their formality, from most formal to least formal, is: inspection, walkthrough, technical review and informal review? When I read the Foundation Level syllabus to take the exam (2005), it was very clear. Now, when I reread the Foundation Level syllabus in its 2010 version, it seems to me that it has changed to inspection, technical review, walkthrough and informal review.
Thanks
Regards,
Patricia Osorio Aristizabal
Patricia, the latest version of the Foundation syllabus is 2011, but I think the text is much the same as the 2010 version. I have checked some of the previous versions of the syllabus, and can’t find any explicit mention of the spectrum of formality.
I believe this idea of the spectrum of formality (informal, technical review, walkthrough, inspection) comes from IEEE 1028. In that standard, metrics and review-based process improvement are not specified for the technical review, and they are for walkthroughs and inspections. So, this makes the technical review less formal. The inspection is more formal than a walkthrough due to the separation of moderator and author.
This leads to an interesting question: Does this spectrum of formality actually matter in the real world? In my experience, it really doesn’t. Here’s why I say that. Most companies don’t do reviews, or at least don’t do them anywhere near as often and as thoroughly as they should. So, when I’m talking to a client about doing reviews, I don’t get into the issue of what level of formality. Instead, my focus is on motivating them to start doing more reviews and doing them better. I can’t remember working with any clients where the main problem they had with reviews was that they weren’t using the right level of formality.
The other reason this doesn’t matter much is because of the naming issue. People use the terms “review,” “technical review,” “inspection,” “JAD session,” “walkthrough,” and more, and whenever I hear those terms I ask people to tell me, specifically, what such an event is, who is involved, what the process is, and–if people mention more than one type of review–what the differences are. I very rarely get clear answers to those questions, which tells me that any particular session where people sit down to discuss a work product could have any one of a dozen or so different names attached to it.
Personally I don’t see this as a problem to worry about. I’m more worried about whether my clients are doing reviews, regularly and with good benefit, than with what they call it.
]]>Hi Rex
First, let me thank you for the huge effort you put in writing the books and educating people on various testing topics – the materials are just excellent. I have been a test manager for 7 years already and still find a lot of useful things.
Thanks very much, I’m glad they are useful to you.
I hope you don’t mind if I ask you a question related to the weighted failure metric you mentioned in your Advanced Test Manager book. You did not focus too much on this in your book and just mentioned that it measures technical risk and the likelihood of finding problems. As such, can you please expand a bit more on how to analyse this metric, and on its value and meaning?
The weighted failure can be calculated both on a per-test basis and a per-test suite basis. (A test suite is a logical collection of test cases, such as a functional test suite, a performance test suite, etc.) The weighted failure counts the number of bugs found (either by test or across all the tests in the test suite), but each bug report is weighted based on the priority and severity of the bug. In other words, a test suite that finds a moderate number of high priority, high severity bugs will probably score higher than a test suite that finds a large number of low priority, low severity bugs.
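To make the idea concrete, here is a minimal Python sketch of one possible weighting. The answer above does not give the formula, so the scheme used here is an assumption: severity and priority each run from 1 (highest) to 5 (lowest), and each bug report contributes 1 / (severity × priority). The names `bug_weight` and `weighted_failure` and the two example suites are hypothetical.

```python
from typing import Iterable, Tuple

# A bug report reduced to (severity, priority), each from 1 (highest) to 5 (lowest).
Bug = Tuple[int, int]


def bug_weight(severity: int, priority: int) -> float:
    """Assumed weighting: more severe, higher priority bugs get larger weights."""
    if not (1 <= severity <= 5 and 1 <= priority <= 5):
        raise ValueError("severity and priority must be between 1 and 5")
    return 1.0 / (severity * priority)


def weighted_failure(bugs: Iterable[Bug]) -> float:
    """Sum of the weights of all bugs found by one test or one test suite."""
    return sum(bug_weight(severity, priority) for severity, priority in bugs)


# Hypothetical suites: a few critical bugs versus many minor ones.
performance_suite = [(1, 1), (1, 1), (1, 1)]  # 3 bugs, each weighted 1.0
cosmetic_suite = [(4, 5)] * 10                # 10 bugs, each weighted 0.05

# The smaller set of critical bugs scores higher than the larger set of minor ones.
scores_higher = weighted_failure(performance_suite) > weighted_failure(cosmetic_suite)
```

Under this assumed scheme, three high severity, high priority bugs outweigh ten low severity, low priority ones, which matches the behavior described above.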
Probably the best way to learn more about this metric is to download and experiment with an Excel test tracking spreadsheet with weighted failure included. Feel free to work with this one a bit, and I think the concept will make more sense.
Many thanks
Stas Milev
ISTQB Certified Advanced Test Manager
You’re welcome. I hope this is useful.
]]>Hi Team,
I have a query with respect to the levels and types of testing and how they are carried out. Can the development team execute all the system integration test cases as part of their unit test cases? If so, the testing team will later re-execute the same test cases and will be left with no defects to report.
Regards
krishnachaitanya
In general, Krishna, this would be neither possible nor desirable. It’s not possible because properly designed system integration testing will focus on interoperability and other emergent behaviors (e.g., performance, security, reliability) that do not manifest themselves or are indeed not even testable at a unit level. It’s not desirable because different levels of testing should focus on covering different things.
Basically, the test levels should function as a sequence of filters. If you wanted to filter water, you wouldn’t use five identical filters to do that, but rather would use a sequence of different filters, each designed to catch particular types of impurities. In the case of testing, each test level is designed to cover certain aspects of the system, to mitigate certain types of quality risks, and to catch certain types of bugs.
]]>