To answer this question, we set up an experiment to see whether people could tell the difference between four types of water: filtered tap water, Fiji, Zephyrhills, and a generic brand purchased at 7-11.
Each person was given a sample of the four waters at the beginning of the test, and told which one was which, so they knew how each water tasted. At any time during the test, they were allowed to go back to the samples and re-taste them.
After tasting each sample, they were given 12 unmarked cups of water and asked to identify each one based on its taste and smell. Each of the four water brands was provided three times in the study (12 cups total; see image below).
The correct answer for each cup, along with the answers from each of the three testers, is displayed below in Table 1.
Table 1. Correct and Chosen Answers for Water Test

| Cup # | Actual | Tester #1 | Tester #2 | Tester #3 | % Correct |
| --- | --- | --- | --- | --- | --- |
| Overall | | 42% (4) | 33% (3) | 42% (4) | 8% (1) |
Having each brand show up more than once allows us to test how repeatable each tester is. In other words, if one tester correctly chooses the Fiji water the first time, but chooses it incorrectly the other two times, then it shows that the first selection may have been more of a lucky guess, rather than strong evidence that the tester could differentiate between the water.
In order to apply statistical analysis to this experiment, we used Minitab’s Attribute Agreement Analysis test. For those of you not familiar with this technique, it is a method for determining how well different people can select the correct answer from a list of choices.
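The core of the technique is simple: each appraiser's picks are compared cup-by-cup against the known standard. As a rough sketch of that comparison (using made-up responses, not the actual data from this test), the "percent matched" figures can be computed like this:

```python
# Sketch of "each appraiser vs standard" percent matched.
# The responses below are hypothetical, NOT the actual study data.
standard = ["Fiji", "Tap", "Generic", "Zephyrhills"] * 3  # 12 cups

responses = {
    "Tester1": ["Fiji", "Generic", "Tap", "Zephyrhills"] * 3,
    "Tester2": ["Tap", "Tap", "Generic", "Fiji"] * 3,
    "Tester3": ["Fiji", "Tap", "Zephyrhills", "Generic"] * 3,
}

for tester, picks in responses.items():
    matched = sum(p == s for p, s in zip(picks, standard))
    print(f"{tester}: {matched}/{len(standard)} = {matched / len(standard):.1%}")

# "All appraisers vs standard": cups where every tester matched the answer
all_matched = sum(
    all(r[i] == standard[i] for r in responses.values())
    for i in range(len(standard))
)
print(f"All testers matched: {all_matched}/{len(standard)}")
```

Minitab reports exactly these counts (per appraiser and for all appraisers together), plus confidence intervals and kappa statistics on top of them.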
Here is the Minitab analysis of the results, summarized to highlight the key points:
```
Attribute Agreement Analysis for Tester1, Tester2, Tester3

Each Appraiser vs Standard

Assessment Agreement

Appraiser  # Inspected  # Matched  Percent         95% CI
Tester1             12          4    33.33   (9.92, 65.11)
Tester2             12          3    25.00   (5.49, 57.19)
Tester3             12          4    33.33   (9.92, 65.11)

# Matched: Appraiser's assessment across trials agrees with the known standard.

All Appraisers vs Standard

Assessment Agreement

# Inspected  # Matched  Percent         95% CI
         12          1     8.33   (0.21, 38.48)

# Matched: All appraisers' assessments agree with the known standard.
```
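The output doesn't say how those confidence intervals are computed, but they are consistent with an exact (Clopper-Pearson) binomial interval on the matched proportion. A minimal standard-library sketch, solving for the interval endpoints by bisection:

```python
# Sketch of an exact (Clopper-Pearson) binomial confidence interval,
# computed with the standard library only, via bisection.
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) CI for a proportion of k successes in n."""
    def solve(f):  # bisection for the root of a decreasing function on [0, 1]
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

# Tester1 matched 4 of 12 cups:
lo, hi = clopper_pearson(4, 12)
print(f"95% CI for 4/12: ({lo:.2%}, {hi:.2%})")  # matches the (9.92, 65.11) above
```

Note how wide the interval is: with only 12 cups, a tester who matched 4 of them could plausibly have a true match rate anywhere from about 10% to 65%.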
```
Fleiss' Kappa Statistics

Response         Kappa  SE Kappa         Z  P(vs > 0)
Fiji         -0.093322  0.166667  -0.55993     0.7122
Generic       0.259259  0.166667   1.55556     0.0599
Tap           0.323197  0.166667   1.93918     0.0262
Zephyrhills  -0.217105  0.166667  -1.30263     0.9036
Overall       0.066972  0.096912   0.69106     0.2448

* NOTE * Single trial within each appraiser. No percentage of assessment
agreement within appraiser is plotted.
```
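For intuition, kappa compares observed agreement to the agreement expected by chance. Below is a sketch of the standard Fleiss' kappa formula on a made-up table of rating counts; note that Minitab's "vs standard" kappas are computed per response against the known answers, so this generic version won't reproduce the exact values above.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for an N-subjects x k-categories table of rating
    counts; each row must sum to the same number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters

    # Mean per-subject observed agreement
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects

    # Chance agreement from the overall category proportions
    p_exp = sum((sum(col) / total) ** 2 for col in zip(*counts))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical example: 4 cups, 3 raters, 2 brands (not the study data)
ratings = [
    [3, 0],  # all three raters chose brand A
    [0, 3],  # all three chose brand B
    [2, 1],  # split decisions
    [1, 2],
]
print(f"kappa = {fleiss_kappa(ratings):.3f}")
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.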
To summarize the analysis above, the key numbers are the kappa values. A kappa value greater than 0.7 is generally considered acceptable, meaning the testers can reliably distinguish that brand from the rest. None of the brands has a kappa above 0.7, so with an overall kappa of 0.067 we conclude that the testers were not able to tell the brands of water apart. In fact, since several of the values are close to zero, the testers did about as well as they would have by guessing (random chance). Fiji (-0.093) and Zephyrhills (-0.217) actually came in below zero, meaning those selections were worse than random chance, so the testers would have done better by simply guessing.

Bottom line: stop buying bottled water. Just reuse your water bottles by filling them with filtered tap water (reusing disposable bottles is not recommended for long-term use). Not only will this help your own pocketbook, but you'll also help the environment by preventing the creation of new bottles and reducing the transportation costs of getting bottles to your local store.
Conclusion: So how is this study applicable to your company? Most processes collect some kind of data, and typically codes are assigned to designate the type of transaction, the type of defect, or some other reason. Without validating people's ability to classify items into the right buckets, the codes may be used incorrectly, leaving people misinformed about what is really going on in the process.
Let's say you are collecting data on reasons for late payments from your customers. You generate a report that shows the Top 5 reasons for late payments:
| Reason for Late Payment | % of Late Payments |
| --- | --- |
| Problem with Service Provided | 25% |
| No Reason Provided by Customer | 18% |
| Wrong Information on Invoice | 13% |
| Wrong Amount on Invoice | 5% |
Naturally, you would start working on the “Missing Paperwork” category, but that assumes you have a good measurement system that codes these late payments into the correct defect category. The only way to know is to perform an Attribute Agreement Analysis. If it does not pass (poor kappa values), then you must conclude that the defect codes are not accurate and must be clarified further in order to get a “true” picture of which issue to focus on.
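One lightweight way to see which codes need clearer definitions is to break agreement down per defect code: have a few coders re-classify the same sample of transactions against a known standard and compare. The coder names, code abbreviations, and labels below are entirely hypothetical:

```python
# Per-code agreement check on defect codes (all data here is made up).
# Codes: MP = Missing Paperwork, WI = Wrong Information on Invoice,
#        SP = Service Problem, NR = No Reason, WA = Wrong Amount
from collections import defaultdict

standard = ["MP", "MP", "WI", "SP", "NR", "WI", "MP", "WA", "SP", "WI"]
coder_a  = ["MP", "WI", "WI", "SP", "NR", "WI", "MP", "WA", "SP", "MP"]
coder_b  = ["WI", "MP", "WI", "SP", "NR", "MP", "MP", "WA", "NR", "WI"]

# For each code in the standard, how many of its items did each coder
# label correctly?
per_code = defaultdict(lambda: {"total": 0, "a": 0, "b": 0})
for s, a, b in zip(standard, coder_a, coder_b):
    per_code[s]["total"] += 1
    per_code[s]["a"] += (a == s)
    per_code[s]["b"] += (b == s)

for code, c in sorted(per_code.items()):
    print(f"{code}: coder A {c['a']}/{c['total']}, coder B {c['b']}/{c['total']}")
```

Codes that both coders miss frequently (here, the two coders confuse MP and WI with each other) are the ones whose definitions need to be tightened before you trust the Top 5 report.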
Let's assume that your coding criteria are clarified for your people and the data is cleaned up using these criteria. Now let's look at the Top 5 issues…
| Reason for Late Payment | % of Late Payments |
| --- | --- |
| Wrong Information on Invoice | 42% |
| Problem with Service Provided | 15% |
| No Reason Provided by Customer | 12% |
| Wrong Amount on Invoice | 5% |
As you can see, the order of reasons changed after the criteria were improved, so you can now correctly go out and investigate why there is “Wrong Information on Invoice” instead of the previous top problem, “Missing Paperwork.”
Attribute Agreement Analysis gives you confidence that your attribute (coding, pass/fail) data is accurate, so that you make good decisions and prioritize your efforts in the right direction.