Testing the Reliability of Content Analysis Data:

What is Involved and Why*
Klaus Krippendorff
What is Reliability?
In the most general terms, reliability is the extent to which data can be trusted to represent genuine rather than spurious phenomena. Sources of unreliability are many. Measuring instruments may malfunction, be influenced by irrelevant circumstances of their use, or be misread. Content analysts may disagree on the readings of a text. Coding instructions may not be clear. The definitions of categories may be ambiguous or may not seem applicable to what they are supposed to describe. Coders may get tired, become inattentive to important details, or be diversely prejudiced. Unreliable data can lead to wrong research results.
Especially where humans observe, read, analyze, describe, or code phenomena of interest, researchers need to assure themselves that the data that emerge from that process are trustworthy. Those interested in the results of empirical research expect assurances that the data that led to them were not biased. Moreover, as a requirement of publication, respectable journals demand evidence that the data underlying published findings are reliable indeed.
In the social sciences, two compatible concepts of reliability are in use.
•    From the perspective of measurement theory, which models itself on how mechanical measuring instruments function, reliability means that a method of generating data is free of influences by circumstances that are extraneous to processes of observation, description, or measurement. Here, reliability tests provide researchers with the assurance that their data are not the result of spurious causes.
•    From the perspective of interpretation theory, reliability means that the members of a scientific community agree on talking about the same phenomena, that their data are about something agreeably real, not fictional. Measurement theory assures the same, albeit implicitly. Unlike measurement theory, however, interpretation theory acknowledges that researchers may have diverse backgrounds, interests, and theoretical perspectives, which lead them to interpret data differently. Plausible differences in interpretations are not considered evidence of unreliability. But when data are taken as evidence of phenomena that are independent of a researcher’s involvement, for example, historical events, mass media effects, or statistical facts, unreliability becomes manifest in the inability to triangulate diverse claims, ultimately in irreconcilable differences among researchers as to what their data mean. Data that lead one researcher to regard them as evidence for “A” and another as evidence for “not A”—without explanations for why they see them that way—erode an interpretive community’s trust in them.
Both conceptions of reliability involve demonstrating agreement: in the first instance, concurrence among the results of independently working measuring instruments, researchers, readers, or coders who convert the same set of phenomena into data; and in the second instance, consistency among independent researchers’ claims concerning what their data mean.

What to Attend to when Testing Reliability?
Reliability of either kind is established by demonstrating agreement among data making efforts by different means (measuring instruments, observers, or coders) or by triangulating several researchers’ claims concerning what given data suggest. Following are five conceptual issues that content analysts need to consider when testing or evaluating reliability:
Reproducible Coding Instructions
The key to reliable content analyses is reproducible coding instructions. All phenomena afford multiple interpretations. Texts typically support alternative interpretations or readings. Content analysts, however, tend to be interested in only a few, not all. When several coders are employed in generating comparable data, especially large volumes and/or over some time, they need to focus their attention on what is to be studied. Coding instructions are intended to do just this. They must delineate the phenomena of interest and define the recording units to be described in analyzable terms, a common data language, the categories relevant to the research project, and their organization into a system of separate variables.
Coding instructions must be understandable to their users. Beyond that, in content analysis they serve three purposes: (a) They operationalize or spell out the procedures for coders to connect their observations or readings to the formal terms of an intended analysis. (b) After data have been generated accordingly, they provide researchers with the ability to link each individual datum and the whole data set to the raw or no-longer-present phenomena of interest. And (c) they enable other researchers to reproduce the data making effort or add to existing data. In content analysis, reliability tests establish the reproducibility of the coding instructions elsewhere, at different times, employing different coders who work under diverse conditions, none of which should influence the data that these coding instructions are intended to generate.
The importance of good coding instructions cannot be overestimated. Typically, their development undergoes several iterations: initial formulation; application to a small sample of data; tests of their reliability on all variables; interviews with coders to access the conceptions that cause disagreements; reformulation, making the instructions more specific and coder-friendly; and so on, until the instructions are reliable. Coders may also need training. For data making to be reproducible elsewhere, training schedules and manuals need to be communicable together with the coding instructions.
Appropriate Reliability Data
Content analysts are well advised not to confuse three things: the universe of phenomena of their ultimate research interest; the sample selected for studying these phenomena, that is, the data to be analyzed in place of that universe; and the reliability data generated to assess the reliability of that sample of data.
Reliability data are visualizable as a coder-by-units table containing the categories of any one variable (Krippendorff, 2004a:221ff). Its recording units—the set of distinct phenomena that coders are instructed to categorize, scale, or describe—must be representative of the data whose reliability is in question (not necessarily of the larger population of phenomena of ultimate interest). Additionally, the coders, at least two but ideally many, must be typical if not representative of the population of potential coders whose qualifications content analysts need to stipulate.* Finally, the entries in the cells of a reliability data table must be independent of each other in two ways. (a) Coders must work separately (they may not consult each other on how they judge given units), and (b) recording units must be distinct, judged and described independent of each other, and hence countable.
Testing the reliability of coding instructions before using them to generate the data for a research project is essential. However, an initial test, even when performed on a sample of the units in the data, is not necessarily generalizable to all the data to be analyzed. The performance of coders may diverge over time, what the categories mean for them may drift, and new coders may enter the process. Reliability data, the sample used to measure agreement, must be representative of and sampled throughout the process of generating the data, especially in a larger project. Some researchers avoid the uncertainty of inferring the reliability of their data from the agreement found in a subset of them by duplicating the coding of all data and calculating the reliability for the whole data set. Where this is too costly, the minimum size of the reliability data may be determined by a table found in Krippendorff (2004a:240). Since reproducibility demands that coders be interchangeable, a variable number of coders may be employed in the process, coding different sets of recording units, provided there is enough duplication for inferring the reliability of the data in question.
An Agreement Measure with Valid Reliability Interpretations
Content analysts need to employ a suitable statistic, an agreement coefficient, one that is capable of measuring the agreements among the values or categories used to describe the given set of recording units. Such a coefficient must yield values on a scale with at least two points of meaningful reliability interpretations: (a) Agreement without exceptions among all coders and on each recording unit, usually set to one and indicative of perfect reliability; and (b) chance agreement, the complete absence of a correlation between the categories used by all coders and the set of units recorded, usually set to zero and interpreted as the total absence of reliability. Valid agreement coefficients must register all conceivable sources of unreliability, including the proclivity of coders to interpret the given categories differently. The values they yield must also be comparable across different variables, with unequal numbers of categories and different levels of measurement (metrics).
For two coders, large sample sizes, and nominal data, Scott’s (1955) π (pi) satisfies these conditions and so does its generalization to many coders, Siegel and Castellan’s (1988:284-291) K. When data are ordered, nominal coefficients ignore the information in their metric (scale characteristic or level of measurement) and become deficient. Krippendorff’s (2004a:211-243) α (alpha) handles any number of coders; nominal, ordinal, interval, ratio, and other metrics; and in addition, missing data and small sample sizes. It also generalizes several other coefficients known for their reliability interpretations in specialized situations, including π (Hayes & Krippendorff, 2007).
Some content analysts have used statistics other than the two recommended here. In light of the foregoing, it is important to understand what they measure and where they fail. To start, there are the familiar correlation or association coefficients (for example, Pearson’s product moment correlation, Chi Square, and Cronbach’s (1951) alpha) and there are agreement coefficients. Correlation or association coefficients measure 1.000 when the categories provided by two coders are perfectly predictable from each other, e.g., in the case of interval data, when they occupy any regression line between two coders as variables. Predictability has little to do with agreement, however. Agreement coefficients, by contrast, measure 1.000 when all categories match without exception, e.g., when they occupy the 45° regression line exactly. Only measures of agreement can indicate when data are perfectly reliable; correlation and association statistics cannot, which makes them inappropriate for assessing the reliability of data.
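The difference between predictability and agreement can be illustrated with a minimal sketch. The two lists of interval-scale codes below are hypothetical, and the helper names `pearson_r` and `share_of_matches` are introduced here only for illustration. The second coder's values lie exactly on a regression line that is not the 45° line, so they are perfectly predictable from the first coder's values without ever matching them.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def share_of_matches(x, y):
    """Proportion of units on which both coders assigned the identical value."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical interval-scale codes assigned to five recording units.
coder_a = [1, 2, 3, 4, 5]
coder_b = [2 * a + 1 for a in coder_a]  # perfectly predictable from coder_a

r = pearson_r(coder_a, coder_b)
matches = share_of_matches(coder_a, coder_b)
print(f"correlation = {r:.3f}, share of matching values = {matches:.3f}")
```

In this constructed case the correlation is perfect although the two coders do not agree on a single unit, which is the reason a correlation coefficient cannot stand in for a measure of agreement.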
Regarding the zero point of the scale that agreement coefficients define, one can again distinguish two broad classes of coefficients, raw or %-agreement, including Osgood’s (1959:44) and Holsti’s (1969:140) measures, and chance-corrected agreement measures. Percent agreement is zero when the categories used by coders never match. Statistically, 0% agreement is almost as unexpected as 100% agreement. It signals a condition that the definition of reliability data explicitly excludes, the condition in which coders coordinate their coding choices by always selecting a category that the other does not. This condition can hardly occur when coders work separately and apply the same coding instruction to the same units of analysis. It follows that 0% agreement has no meaningful reliability interpretation. On the %-agreement scale, chance agreement occupies no definite point either. It can occupy any point between close to 0% and close to 100% and becomes progressively more difficult to achieve the more categories are available for coding.* Thus, %-agreement cannot indicate whether reliability is high or low. The convenience of its calculation, often cited as its advantage, does not compensate for the meaninglessness of its scale.
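Why chance agreement has no fixed place on the %-agreement scale can be seen in a short sketch. It rests on a simplifying assumption that is not spelled out above: two coders assign categories independently of the units, each drawing from the same category proportions, so that the expected share of matches is the sum of the squared proportions. The function name and the proportions are hypothetical.

```python
def expected_chance_agreement(proportions):
    """Expected share of matches when two coders pick categories blindly,
    both with the given category proportions (assumed to sum to one)."""
    return sum(p * p for p in proportions)


# Two equally likely categories: half of all units match by chance.
two_even = expected_chance_agreement([0.5, 0.5])

# Ten equally likely categories: chance matches become rare.
ten_even = expected_chance_agreement([0.1] * 10)

# Two categories, one of them dominant: chance matches are the rule.
two_skewed = expected_chance_agreement([0.95, 0.05])

for label, value in [("2 even", two_even), ("10 even", ten_even), ("2 skewed", two_skewed)]:
    print(f"{label:>9}: {value:.3f}")
```

Under these assumptions the same observed percentage, say 90%, would be far above chance with ten even categories and slightly below chance with the skewed pair, which is why the raw percentage alone says little about reliability.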
Reliability is absent when units of analysis are categorized blindly, for example, by throwing dice rather than describing a property of the phenomena to be coded, causing reliability data to be the product of chance. Chance-corrected agreement coefficients with meaningful reliability interpretations should indicate when the use of categories bears no relation to the phenomena being categorized, leaving researchers clueless as to what their data mean. However, here too, two concepts of chance must be distinguished.
Benini’s (1901) β (beta) and Cohen’s (1960) κ (kappa) define chance as the statistical independence of two coders’ use of categories, just as correlation and association statistics do. Under this condition, the categories used by one coder are not predictable from those used by the other, regardless of the coders’ proclivity to use categories differently. Scott’s π and Krippendorff’s α, by contrast, treat coders as interchangeable and define chance as the statistical independence of the set of phenomena (the recording units under consideration) and the categories collectively used to describe them. In other words, whereas the zero point of β and κ represents a relationship between two coders, the zero point of π and α represents a relationship between the data and the phenomena in place of which they are meant to stand. It follows that β and κ, by not responding to individual differences in coders’ proclivity of using the given categories, fail to account for disagreements due to this proclivity. This has the effect of deluding researchers about the reliability of their data by yielding higher agreement measures when coders disagree on the distribution of categories in the data and lower measures when they agree! Popularity of κ notwithstanding, Cohen’s kappa is simply unsuitable as a measure of the reliability of data.
Finally, how do π and α differ? Scott corrected the above-mentioned flaws of %-agreement by entering the %-agreement expected by chance into his definition of π, just as β and κ do, but with an inappropriate concept of chance. As chance-corrected %-agreement measures, β, κ, and π are all confined to the conditions under which %-agreement can be calculated, i.e., two coders, nominal data, and large sample sizes. Krippendorff’s α is not a mere correction of %-agreement. While α includes π as a special case, α measures disagreements instead and is, hence, not so limited. As already stated, it is applicable to any number of coders; acknowledges metrics other than nominal: ordinal, interval, ratio, and more; accepts missing data; and is sensitive to small sample sizes.
The foregoing evaluation of statistical indices is to caution against the uninformed application of so-called reliability coefficients. There is software that offers its users several such statistics without revealing what they measure and where they fail, encouraging the disingenuous practice of computing all of them and reporting those whose numerical results show their data in the most favorable light. Before accepting claims that a statistic measures the reliability of data, content analysts should critically examine its mathematical structure for conformity to the above requirements.
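The two concepts of chance can be compared in a small sketch for two coders and nominal data. It uses the standard textbook forms of both coefficients, (observed − expected) / (1 − expected), which differ only in the expected agreement: kappa multiplies each coder's own category proportions, pi squares the proportions pooled over both coders. The two data sets are hypothetical and constructed so that the observed agreement is 80% in both.

```python
from collections import Counter


def observed_agreement(pairs):
    """Share of units on which the two coders chose the same category."""
    return sum(a == b for a, b in pairs) / len(pairs)


def scott_pi(pairs):
    """Scott's pi: chance is based on category proportions pooled over coders."""
    n = len(pairs)
    pooled = Counter()
    for a, b in pairs:
        pooled[a] += 1
        pooled[b] += 1
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed_agreement(pairs) - expected) / (1 - expected)


def cohen_kappa(pairs):
    """Cohen's kappa: chance is based on each coder's own proportions."""
    n = len(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    expected = sum(first[c] * second[c] for c in first) / (n * n)
    return (observed_agreement(pairs) - expected) / (1 - expected)


# Hypothetical data: (coder 1, coder 2) for 100 units, 80 matches each.
same_distribution = ([("yes", "yes")] * 40 + [("yes", "no")] * 10
                     + [("no", "yes")] * 10 + [("no", "no")] * 40)
different_distribution = ([("yes", "yes")] * 40 + [("yes", "no")] * 20
                          + [("no", "no")] * 40)

for name, data in [("same", same_distribution), ("different", different_distribution)]:
    print(f"{name:>9}: pi = {scott_pi(data):.3f}, kappa = {cohen_kappa(data):.3f}")
```

In the second data set the coders disagree on how often "yes" occurs (60% versus 40%). Pi stays where it was, while kappa rises, rewarding exactly the kind of disagreement the paragraph above describes.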
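For readers who want to see how a disagreement-based coefficient accommodates several coders and missing values, the following is a minimal sketch of α for nominal data only, computed from a coincidence matrix as one minus the ratio of observed to expected disagreement. It is an illustration under stated assumptions, not a substitute for the procedures in Krippendorff (2004a): the function name, the input layout (one list of values per recording unit, `None` for a missing value), and the example data are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Alpha for nominal data.

    `units` is a list of recording units; each unit is the list of values
    the coders assigned to it, with None marking a missing value.
    """
    # Coincidence matrix: every ordered pair of values given to the same
    # unit by different coders counts 1 / (m - 1), where m is the number
    # of values available for that unit.
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit coded only once cannot be paired
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category and the number n of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal metric: any two different categories count as a disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1 - observed / expected


# Hypothetical binary codes by two coders for ten units.
coder_1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha_two = krippendorff_alpha_nominal([list(u) for u in zip(coder_1, coder_2)])

# Hypothetical codes by three coders with missing values.
alpha_missing = krippendorff_alpha_nominal([
    ["a", "a", "a"],
    ["b", "b", None],
    ["a", "b", "a"],
    ["b", None, None],  # coded once only, so it drops out
])
print(f"two coders: {alpha_two:.3f}, with missing values: {alpha_missing:.3f}")
```

Because the expected disagreement is divided by n − 1 rather than n, the sketch also reflects the sensitivity to small samples mentioned above. Other metrics would replace the simple "different categories" test with a metric-specific difference function.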
A Minimum Acceptable Level of Reliability
An acceptable level of agreement below which data have to be rejected as too unreliable must be chosen. Except for perfect agreement on all recording units, there is no magical number. The choice of a cutoff point should reflect the potential costs of drawing invalid conclusions from unreliable data. When human lives hang on the results of a content analysis, whether they inform a legal decision, lead to the use of a drug with dangerous side effects, or tip the scale from peace to war, decision criteria have to be set far higher than when a content analysis is intended to support mere scholarly explorations. To assure that the data under consideration are at least similarly interpretable by researchers, starting with the coders employed in generating the data, it is customary to require α ≥ .800. Only where tentative conclusions are deemed acceptable may an α ≥ .667 suffice (Krippendorff, 2004a:241).* Ideally, the cutoff point should be justified by examining the effects of unreliable data on the validity and seriousness of the conclusions drawn from them.
To ensure that reliability data are large enough to provide the needed assurance, the confidence intervals of the agreement measure should be consulted. Testing the null hypothesis, that agreement is not due to chance, is insufficient. Reliable data should be very far from chance, but not significantly deviate from perfect agreement. Therefore, the probability q that agreement could be below the required minimum provides a statistical decision criterion analogous to traditional significance tests (Krippendorff, 2004a:238).

Which Distinctions are to be Tested
Unless data are perfectly reliable, each distinction that matters should be tested for its reliability. Most agreement coefficients, including π and α, provide one measure for each variable and treat all of its categories alike. Depending on what a research project needs to show, assessing the reliability of data variable by variable may not always be sufficient.
•    When researchers intend to correlate content analysis variables with each other or with other variables, the common agreement measures for individual variables are appropriate.* Content analysts may use their data differently, however, and then need to tailor the agreement measures to ascertain the reliabilities that matter to how data are put to use.
•    When some distinctions are unimportant and subsequently ignored for analytical reasons, for example by lumping several categories into one, reliability should be tested not on the original but on the transformed data, as the latter are closer to what is being analyzed and needs to be reliable.
•    When individual categories matter, for example, when their frequencies are being compared, the reliability of these comparisons, i.e., each category against all others lumped into one, should be evaluated for each category.
•    When a system of several variables is intended to support a conclusion, for example, when these data enter a regression equation or multi-variate analysis in which variables work together and matter alike, the smallest agreement measured among them should be taken as the index of the reliability of the whole system. This rule might seem overly conservative. However, it conforms to the recommendation to drop from further consideration all variables that do not meet the minimum acceptable level of reliability.
For the same reasons, the averaging of several agreement measures, while tempting, can be seriously misleading. Averaging would allow the high reliabilities of easily coded clerical variables to overshadow the low reliabilities of the more difficult to code variables that tend to be of analytical importance. This can unwittingly mislead researchers to believe their data to be reliable when they are not. Average agreement coefficients of separately used variables should not be obtained or reported, and cannot serve as a decision criterion.
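The contrast between the smallest and the average agreement can be made concrete with a few lines. The variable names and α values below are invented for illustration; the two cutoffs are the customary levels of .800 and .667 cited earlier in this text.

```python
ACCEPTABLE = 0.800  # customary minimum
TENTATIVE = 0.667   # lower bound where tentative conclusions are acceptable


def judge(alpha):
    """Classify an agreement value against the two customary cutoffs."""
    if alpha >= ACCEPTABLE:
        return "acceptable"
    if alpha >= TENTATIVE:
        return "tentative conclusions only"
    return "too unreliable"


# Hypothetical alpha values: two clerical variables and one that is
# difficult to code but analytically important.
alphas = {"publication_date": 0.95, "section": 0.92, "framing": 0.60}

system_reliability = min(alphas.values())
misleading_average = sum(alphas.values()) / len(alphas)

print(f"smallest alpha {system_reliability:.3f}: {judge(system_reliability)}")
print(f"average alpha  {misleading_average:.3f}: {judge(misleading_average)}")
```

With these made-up numbers the average clears the customary cutoff although the one variable that carries the analysis does not, which is the situation the paragraph above warns against.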
As already suggested, pretesting the reliability of coding instructions before settling on their use is helpful, while testing the reliability of the whole data making process is decisive. However, after data are obtained, it is not impossible to improve their reliability by removing from them the distinctions that are found unreliable, for example, joining categories that are easily confused, transforming scale values with large systematic errors, or ignoring variables in subsequent analyses that do not meet acceptable reliability standards. Yet, resolving apparent disagreements by majority rule among three or more coders, by employing expert judges to decide on coder disagreements, or by similar means does not provide evidence of added reliability. Such practices may well make researchers feel more confident about their data, but without duplicating this very process and obtaining the agreements or disagreements observed between the duplicates, only the agreement measure that was last measured is interpretable as a valid index of the reliability of the analyzed data and needs to be reported as such (Krippendorff, 2004a:219).
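Joining categories that are easily confused, and then testing the transformed rather than the original data, might look as follows. This is a sketch for two coders and nominal data that uses Scott's pi in its standard form; the category labels, the data, and the helper names are hypothetical.

```python
from collections import Counter


def scott_pi(pairs):
    """Scott's pi for two coders and nominal categories."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    pooled = Counter()
    for a, b in pairs:
        pooled[a] += 1
        pooled[b] += 1
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)


def join_categories(pairs, mapping):
    """Recode both coders' values; categories not in `mapping` are kept."""
    return [(mapping.get(a, a), mapping.get(b, b)) for a, b in pairs]


# Hypothetical data: coders separate "pos" and "neg" well but confuse
# "neutral" with "mixed" half of the time.
original = ([("pos", "pos")] * 10 + [("neg", "neg")] * 10
            + [("neutral", "neutral")] * 5 + [("mixed", "mixed")] * 5
            + [("neutral", "mixed")] * 5 + [("mixed", "neutral")] * 5)

transformed = join_categories(original, {"neutral": "other", "mixed": "other"})

pi_before = scott_pi(original)
pi_after = scott_pi(transformed)
print(f"pi before joining: {pi_before:.3f}, after joining: {pi_after:.3f}")
```

The gain comes at the price of a distinction: after the transformation the data can no longer support any claim that depends on telling "neutral" from "mixed". The same recoding step, mapping every category but one to a single remainder, gives the category-against-all-others comparison mentioned in the list above.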

Conclusion
Finally, reliability must not be confused with validity. Validity is the attribute of propositions (measurements, research results, or theories) that are corroborated by independently obtained evidence. Content analyses can be validated, for example, when the reality constructions of the authors (sources) of analyzed texts concur with the findings, the effects on their readers (audiences) are as predicted, or the indices computed from them correlate with what the analysts claim they signify. Reliability, by contrast, is the attribute of data that do stand in place of phenomena that are distinct, unambiguous, and real; what they are cannot be divorced from how they are described. In short, validity concerns truth, reliability concerns trust.
Since an analysis of reliable data may well be mistaken, reliability cannot guarantee validity. Inasmuch as unreliable data contain spurious variation, errors in the process of their creation, their analysis has the potential of leading to invalid conclusions. In fact, for nominal data, (1–α) is the proportion of data that is unrelated to the phenomena that gave rise to them. This suggests being cautious about conclusions drawn from unreliable data. Unreliability can limit validity.
In the absence of validating evidence, reliability is the best safeguard against the likelihood of invalid research results.
For calculating , consult Krippendorff (2004a:211-256) or http://www.asc.upenn.edu/usr/krippendorff/webreliability2.pdf (accessed May 2007).

References:
Benini, R. (1901). Principii di Demografia. Firenze: G. Barbera. No. 29 of Manuali Barbera di Scienze Giuridiche Sociali e Politiche.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20: 37-46.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16: 297-334.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1): 77-89.
Holsti, O. R. (1969). Content Analysis For The Social Sciences and Humanities. Reading, MA: Addison-Wesley.
Krippendorff, K. (2004a). Content Analysis: An Introduction to Its Methodology. Second Edition. Thousand Oaks, CA: Sage.
Osgood, C. E. (1959). The representational model and relevant research methods. In I. de Sola Pool (Ed.), Trends in Content Analysis (pp. 33-88). Urbana: University of Illinois Press.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19: 321-325.
Siegel, S., & Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. Second Edition. Boston, MA: McGraw-Hill.

