Conceived and designed the experiments: IOE ARG RH. Performed the experiments: IOE ARG RH. Analyzed the data: TF AGL. Wrote the paper: TF AGL.
The authors have declared that no competing interests exist.
We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples.
We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples.
Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in clinical studies.
The problem of assessing agreement between two or more assessors, or raters, is ubiquitous in medical research. Some of the many examples can be found in the fields of radiology, epidemiology, diagnostic medicine and oncology
The problem can be split into two broad categories, according to the presence or absence of a ‘gold standard’, defined as an infallible method for determining the quantity of interest
In this paper we look at a particular case of the second category, in which a gold standard measure is not available. Uebersax and Grove distinguish three study designs:
1. The fixed panel design, in which each sample is rated by each rater.
2. The varying panel design, in which each sample is rated by a different set of raters. Raters are ‘anonymous’, in the sense that while it might be possible for a single rater to rate more than one sample, this event would either be unrecorded or not considered in the analysis.
3. The replicate measurement design, in which samples are rated on multiple occasions by each rater.
In such examples the calculation of simple summary statistics such as sensitivity and specificity is not possible, but there is a large literature on alternative measures such as the Kappa coefficient
We begin by reviewing existing methods for assessing inter-rater reliability for ordinal or categorical outcome variables when no gold standard measure exists. We then develop a new, intuitive summary statistic for a motivating example concerning the grading of breast cancer tumour samples, in which the number of raters is large and there is missing rating information (i.e. a rating is not available from each rater for each sample); our overall aim is to summarise the extent to which individual raters agree with the group of raters as a whole. We assess the suitability of this simple measure by comparing its results with those from a Bayesian latent trait model for an ordered categorical response, and conclude by summarising the usefulness of the two methods in the analysis of this particular type of agreement data.
Breast cancer is a heterogeneous disease, highly variable in shape, size and character. However, a substantial amount of useful prognostic information is available from the careful histopathological examination of routine breast carcinoma specimens
The Nottingham method, outlined in the table below, assigns a score to each of three histological features:
| Feature | Description | Score |
|---|---|---|
| Tubule formation | Majority of tumour (>75%) | 1 |
| Tubule formation | Moderate degree (10–75%) | 2 |
| Tubule formation | Little or none (<10%) | 3 |
| Nuclear pleomorphism | Small, regular uniform cells | 1 |
| Nuclear pleomorphism | Moderate increase in size and variability | 2 |
| Nuclear pleomorphism | Marked variation | 3 |
| Mitotic count | Dependent on microscope field area | 1–3 |
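The three component scores are conventionally summed to give an overall grade. As an illustration only, the sketch below uses the standard Elston–Ellis cut-points (total 3–5 → grade 1, 6–7 → grade 2, 8–9 → grade 3), which are not stated in this excerpt:

```python
def nottingham_grade(tubules: int, pleomorphism: int, mitoses: int) -> int:
    """Combine the three component scores (each 1-3) into an overall grade.

    The mapping of the total (3-5 -> grade 1, 6-7 -> grade 2, 8-9 -> grade 3)
    follows the standard Elston-Ellis convention; it is an assumption here,
    not taken from this excerpt.
    """
    for score in (tubules, pleomorphism, mitoses):
        if score not in (1, 2, 3):
            raise ValueError("each component score must be 1, 2 or 3")
    total = tubules + pleomorphism + mitoses
    if total <= 5:
        return 1
    elif total <= 7:
        return 2
    return 3
```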
The perceived poor reproducibility and consistency of grading systems has been improved by use of semi-objective scoring systems and adherence to written criteria such as those provided by the Nottingham method
Our data-set consists of grades provided by 732 pathologists (hereafter termed ‘raters’) for histological tissue sections from 52 breast cancer tumour samples (hereafter termed ‘samples’), circulated between 2001 and 2004 in eight twice-yearly batches. Not every rater was sent all of the samples: raters graded an average of 33 of the 52 samples (range 2 to 52, interquartile range 20 to 47). In the terminology of Uebersax and Grove, this is an onymous varying panel design.
1367 of the 25544 individual samples submitted to raters for grading (5%) were returned either ungraded or as ‘not assessable’. These instances have been removed from the data-set and are therefore treated as missing data in the same manner as samples that were not sent to raters. Each sample was graded by between 390 (53%) and 513 (70%) raters, which leaves around 36% of all sample–rater pairs ungraded and regarded as missing data. The primary aims of the project are to provide information concerning the extent of inter-rater agreement in assigning grades to samples, and to ascertain whether there is any evidence that some raters consistently give values different to the majority. This might be the case if, for example, raters were to interpret aspects of the grading scale and guidelines in different ways.
Observed marginal data from an illustrative selection of samples and raters are shown in the first five columns of
| Sample | Observed G1, n (%) | Observed G2, n (%) | Observed G3, n (%) | Ungraded | Simulated G1 (%) | Simulated G2 (%) | Simulated G3 (%) | Estimated μi (s.e.) | Estimated λi (s.e.) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 386 (93.2) | 28 (6.8) | 0 (0) | 318 | 93.0 | 6.7 | 0.2 | −5.1 (0.6) | 1.0 (0.2) |
| 6 | 326 (70.1) | 137 (29.5) | 2 (0.4) | 267 | 69.8 | 29.2 | 1.0 | −3.0 (0.1) | 1.0 (0.1) |
| 52 | 223 (43.4) | 285 (55.6) | 5 (1.0) | 219 | 43.2 | 56.0 | 0.8 | −1.8 (0.1) | 1.5 (0.1) |
| 39 | 183 (39.3) | 258 (55.3) | 25 (5.4) | 266 | 38.6 | 56.1 | 5.2 | −1.4 (0.1) | 0.9 (0.1) |
| 18 | 46 (10.1) | 393 (86.1) | 17 (3.7) | 276 | 10.4 | 85.5 | 4.1 | −0.3 (0.1) | 1.8 (0.1) |
| 46 | 77 (15.6) | 349 (70.6) | 68 (13.8) | 238 | 16.0 | 70.2 | 13.9 | −0.1 (0.1) | 1.0 (0.1) |
| 43 | 23 (4.8) | 376 (78.3) | 81 (16.9) | 252 | 5.5 | 77.4 | 17.1 | 0.6 (0.1) | 1.3 (0.1) |
| 48 | 6 (1.2) | 209 (42.1) | 282 (56.7) | 235 | 1.2 | 41.6 | 57.2 | 2.3 (0.1) | 1.1 (0.1) |
| 8 | 1 (0.2) | 161 (34.4) | 306 (65.4) | 264 | 0.6 | 33.6 | 65.8 | 2.7 (0.1) | 1.2 (0.1) |
| 13 | 0 (0) | 4 (0.9) | 454 (99.1) | 274 | 0 | 1.0 | 99.0 | 6.7 (1.5) | 1.2 (0.4) |
Grades (G1–G3) assigned to a selection of ten breast tumour samples by 732 pathologists, with simulated results and parameter estimates from the Bayesian latent trait model.
| Rater | Observed G1, n (%) | Observed G2, n (%) | Observed G3, n (%) | Ungraded | Agreement score | Simulated no. of samples agreeing with majority (s.d.) | Estimated b12 (s.e.) | Estimated b23 (s.e.) |
|---|---|---|---|---|---|---|---|---|
| 156 | 20 (65) | 8 (26) | 3 (10) | 21 | 0.41 | 27.9 (3.3) | 0.8 (0.4) | 5.1 (0.5) |
| 273 | 22 (48) | 11 (24) | 13 (28) | 6 | 0.64 | 39.2 (2.9) | −0.6 (0.3) | 2.6 (0.5) |
| 275 | 18 (40) | 7 (16) | 20 (44) | 7 | 0.73 | 41.3 (2.7) | −1.4 (0.3) | 1.1 (0.4) |
| 137 | 20 (39) | 13 (25) | 18 (35) | 1 | 0.76 | 41.7 (2.6) | −1.2 (0.3) | 2.0 (0.4) |
| 247 | 5 (11) | 28 (62) | 12 (27) | 7 | 0.68 | 41.0 (2.4) | −3.5 (0.4) | 2.9 (0.5) |
| 500 | 14 (27) | 21 (40) | 17 (33) | 0 | 0.76 | 43.2 (2.5) | −2.1 (0.4) | 2.2 (0.4) |
| 335 | 7 (23) | 10 (33) | 13 (43) | 22 | 0.72 | 42.7 (2.6) | −2.0 (0.4) | 1.8 (0.5) |
| 617 | 13 (26) | 13 (26) | 24 (48) | 2 | 0.73 | 41.5 (2.8) | −2.3 (0.4) | 0.8 (0.4) |
| 521 | 1 (6) | 4 (25) | 11 (69) | 36 | 0.65 | 38.8 (3.6) | −3.3 (0.7) | 0.5 (0.6) |
| 143 | 0 (0) | 11 (55) | 9 (45) | 32 | 0.50 | 35.7 (3.3) | −5.0 (0.6) | 0.4 (0.5) |
Grades (G1–G3) assigned by a selection of ten pathologists to 52 breast cancer tumour samples, with estimated agreement scores, and simulated results and parameter estimates from the Bayesian latent trait model.
In this section we discuss existing methods for analysing inter-rater agreement data, and describe two methods that we use to analyse the breast cancer tumour data.
One summary statistic, whose roots lie in the psychology literature, is used particularly often in papers reporting inter-rater agreement with a categorical outcome: the Kappa coefficient
The rationale for the Kappa coefficient and other similar measures of agreement is that they are chance-corrected, in the sense that they attempt to allow for the fact that for discrete or ordinal outcomes there will be a non-zero probability
Given
There are a large number of papers both advocating and criticising the use of the Kappa coefficient for assessing inter-rater agreement. Briefly, the main criticisms are that its interpretation is often based on somewhat arbitrary guideline values, leading to problems of interpretation; that it is heavily dependent on observed marginal proportions and thus the case-mix of the samples used; that it can be severely misleading in degenerate cases in which one or more of the outcome categories is uncommon; and that it lacks natural extensions when there is more than one outcome of interest or when multiple raters are used
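For reference, the standard two-rater form of the coefficient, κ = (p_o − p_e)/(1 − p_e), can be computed directly from a contingency table, as in this minimal sketch (the multi-rater extensions discussed above differ):

```python
import numpy as np

def cohens_kappa(table) -> float:
    """Chance-corrected agreement for two raters from a K x K contingency table.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreements (diagonal mass) and p_e the proportion of agreements expected
    by chance from the two raters' marginal rating distributions.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n                              # observed agreement
    p_e = (table.sum(axis=1) / n) @ (table.sum(axis=0) / n)  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```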
Latent trait and latent class modelling have become increasingly popular in recent years for analysing inter-rater agreement data. Summaries are provided by Langeheine and Rost
In latent class modelling, samples are regarded as belonging to exactly one of
Although latent trait models have received some criticism because the underlying trait variable lies on an arbitrary, uninterpretable scale
One major class of models that has been used for agreement data is that of log-linear models for categorical data, as described by Agresti
Other summary statistics that have been proposed include Yule's
We use the breast cancer tumour data to demonstrate and compare two methods for analysing agreement data with a large number of raters and an onymous varying panel design. Our proposed methods are designed to reflect the extent to which the distribution of ratings provided by individual raters agrees with that provided by all raters.
An easily-computed, intuitive summary statistic is a simple agreement score
Let
Then for any rater
Importantly, neither the mean nor the variance of the agreement score depends on the number of raters who rate each sample, which enables a fair comparison of agreement scores between raters to be made in the presence of incomplete rating data. In practice the
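The score is described later in the text as the probability that a rater agrees with another randomly chosen rater on a randomly chosen sample. Since the exact formula is not reproduced in this excerpt, the following sketch assumes that reading: for each rater, average, over the samples they rated, the proportion of the other raters of each sample who assigned the same grade (missing ratings coded as 0):

```python
import numpy as np

def agreement_scores(grades: np.ndarray) -> np.ndarray:
    """Agreement score for each rater from a samples x raters grade matrix.

    grades[i, j] is the grade (1-3) rater j gave sample i, or 0 if missing.
    Each rater's score is the average, over the samples they rated, of the
    proportion of the other raters of that sample who gave the same grade.
    This is a sketch of the description in the text, not the paper's
    verbatim formula.
    """
    n_samples, n_raters = grades.shape
    scores = np.full(n_raters, np.nan)
    for j in range(n_raters):
        props = []
        for i in range(n_samples):
            g = grades[i, j]
            if g == 0:
                continue  # rater j did not grade sample i
            others = grades[i, np.arange(n_raters) != j]
            others = others[others != 0]
            if len(others) > 0:
                props.append(np.mean(others == g))
        if props:
            scores[j] = np.mean(props)
    return scores
```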
1. Fix
2. Select
3. For samples
4. Estimate the agreement score based on the simulated grades.
5. Repeat steps 2–4.
We can then estimate the distribution function of the agreement score based on a large number of replications for each
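These resampling steps might be sketched as below, under the assumption (not fully spelled out in this excerpt) that simulated grades are drawn from each sample's observed marginal grade distribution; the function name `simulate_null_scores` is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_null_scores(grades: np.ndarray, m: int, n_rep: int = 1000) -> np.ndarray:
    """Null distribution of the agreement score for a hypothetical rater
    who grades m randomly chosen samples.

    Each replicate: choose m samples at random, draw one grade per sample
    from that sample's observed marginal grade distribution, and score the
    simulated grades against the observed raters' grades. A sketch of
    steps 1-5 in the text; zeros in `grades` denote missing ratings.
    """
    n_samples, n_raters = grades.shape
    out = np.empty(n_rep)
    for r in range(n_rep):
        chosen = rng.choice(n_samples, size=m, replace=False)
        props = []
        for i in chosen:
            observed = grades[i][grades[i] != 0]   # grades given to sample i
            g = rng.choice(observed)               # draw from marginal distribution
            props.append(np.mean(observed == g))   # agreement with observed raters
        out[r] = np.mean(props)
    return out
```

Percentiles of the replicated scores give the confidence envelopes plotted against the number of samples rated.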
Using a Bayesian formulation of the problem enables relevant parameters to be estimated without recourse to maximising the likelihood function directly
We think of the categorical response variable as representing an underlying latent scale (cf. the ‘Bones’ example in
This can be represented by a cumulative logit model of the form
We choose priors for the parameters as follows. We give the average of the two boundaries
We give the tumour severity parameters
In order for the model to be fitted, certain conditions must hold. Consider a bipartite graph with nodes representing samples and raters, in which edges connect raters to the samples they saw. The graph must be connected in order for the parameters to be identifiable and to enable reasonable comparison between the grade boundaries of different raters. The graph for this data-set is 2-connected, thus ensuring parameter identifiability.
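The connectivity condition can be checked directly with a breadth-first search over the bipartite sample–rater graph; this is a generic sketch, not code from the paper:

```python
from collections import deque

def rating_graph_connected(ratings) -> bool:
    """Check whether the bipartite sample-rater graph is connected.

    `ratings` is an iterable of (sample, rater) pairs recording which rater
    graded which sample. Model parameters are identifiable only if this
    graph is connected; we test that with a breadth-first search over
    ('s', sample) and ('r', rater) nodes.
    """
    adj = {}
    for s, r in ratings:
        adj.setdefault(('s', s), set()).add(('r', r))
        adj.setdefault(('r', r), set()).add(('s', s))
    if not adj:
        return True  # empty graph: vacuously connected
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(adj)
```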
Finally, we can obtain new, simulated, sets of rater/tumour observations by repeatedly sampling from the fitted Bayesian model. For each set of simulations, we record the number of raters assigning grades 1, 2 and 3 to each tumour and the majority grade for each tumour. Using data from 1250 simulations, we estimate the probability that a given rater would agree with the majority for a given tumour for each rater/tumour pair. The simulated data allow the estimation of two marginal probabilities,
The model was fitted in WinBUGS 1.4.2
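Sampling rater–tumour observations from a fitted model of this kind might be sketched as follows. The cumulative logit form logit P(Y_ij ≤ k) = λ_i(b_jk − μ_i) is an assumption chosen to be consistent with the parameters reported in the tables (sample severity μ_i, clarity λ_i, rater boundaries b12 < b23), not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_grades(mu, lam, b12, b23):
    """Draw one grade (1-3) per sample-rater pair from a cumulative logit
    latent trait model of the assumed form
        logit P(Y_ij <= k) = lambda_i * (b_jk - mu_i).
    """
    mu, lam = np.asarray(mu)[:, None], np.asarray(lam)[:, None]
    p1 = expit(lam * (np.asarray(b12)[None, :] - mu))  # P(grade <= 1)
    p2 = expit(lam * (np.asarray(b23)[None, :] - mu))  # P(grade <= 2)
    u = rng.random(p1.shape)
    return np.where(u < p1, 1, np.where(u < p2, 2, 3))

def prob_agree_majority(mu, lam, b12, b23, n_sim=200):
    """Estimate, for each rater, the probability of agreeing with the
    simulated majority grade, averaged over samples and replications."""
    agree = np.zeros(len(b12))
    for _ in range(n_sim):
        g = simulate_grades(mu, lam, b12, b23)
        counts = np.stack([(g == k).sum(axis=1) for k in (1, 2, 3)])
        majority = counts.argmax(axis=0) + 1
        agree += (g == majority[:, None]).mean(axis=0)
    return agree / n_sim
```

With, say, μ = −5.1 and λ = 1.0 for sample 1 and boundaries near b12 ≈ −2, b23 ≈ 2, this form reproduces the heavy grade-1 majority shown in the sample table, which is why it is offered as a plausible sketch.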
Calculated values of the agreement score amongst the 732 raters range from 0.35 to 0.87 (mean 0.72). The scores from ten raters are shown in
Plot of estimated agreement score against number of samples rated, with confidence envelopes within which 95% and 99% of raters would be expected to lie if all raters were equally proficient.
| Rating of rater 156 | Consensus G1 | Consensus G2 | Consensus G3 |
|---|---|---|---|
| G1 | 8 | 12 | 0 |
| G2 | 0 | 1 | 7 |
| G3 | 0 | 0 | 3 |
Comparison of the 31 breast cancer tumour sample grades (G1–G3) given by the single rater 156 with the consensus grade.
The simulation results are summarised in
The main body of the plot is a heatmap showing the probability that raters (columns) agree with the consensus grade for each sample (rows). Raters are ordered in terms of estimated probability of agreeing with the majority, and samples in terms of estimated latent severity. The right-hand panel shows the expected distribution of assigned grades for each sample, and the bottom panel shows, for each rater, the marginal probability of agreeing with the consensus grade.
Although the estimated values of the grade boundary parameters
Scatter-plot of estimated grade boundaries from the Bayesian latent trait model, in which the estimated grade boundary
Plot of the raters' ranks, estimated from the Bayesian latent trait model, with 95% credible limits.
Both the agreement score method and the Bayesian latent trait model indicate heterogeneity between raters for our data-set. We hypothesised that the agreement score method might give an unduly optimistic assessment of rater performance for raters who had seen a subset of samples that were relatively easy to grade. Therefore, in order to compare the two methods of analysis, we consider the estimated difference in raters' ranks from the two methods. We plot this against the estimate, averaged over the samples seen by each rater, of the
Plot of differences in raters' estimated ranks for the two methods against a measure of the difficulty of the samples seen by each rater, grouped by deciles. This measure of difficulty is calculated as the average, taken over the samples seen by each rater, of the estimated values of
We have developed and compared two methods for inter-rater agreement analysis of data in which there is no gold standard, a large number of onymous raters, and incomplete rating information.
The agreement score is a simple, non-chance-corrected statistic that is easily calculated and can be used to provide evidence as to whether there are raters whose behaviour is discrepant compared with that of the majority. It can therefore be regarded as a measure of the relative agreement between an individual rater and a population of raters. For our data, we found strong evidence that certain raters agreed with the majority less often than would be expected by chance. There was no evidence of raters who performed better than chance, perhaps unsurprisingly: it is easier to envisage reasons why a single rater might record an unusually low score than an unusually high one.
Although the agreement score is dependent on the case-mix of the samples used in a particular study, it has a straightforward interpretation as the probability that a given rater will agree with another randomly-chosen rater on a randomly-chosen sample, and can be displayed graphically in a way that avoids misleading rankings. This may be a useful tool in the preliminary analysis of data of this type, and can be used to identify potentially discrepant raters as a first step in determining possible reasons why they may differ from the majority.
We have demonstrated, using the Bayesian latent trait model, a way by which to estimate both the performance of raters, via estimated grade boundary parameters, and the marginal distributions of ratings given to each sample. The agreement score for particular raters may be misleading for raters who, by chance or otherwise, have only rated a selection of samples that are unusually easy or difficult to classify. In particular, we have shown that raters who rated ‘easy’ samples tend to have unjustifiably high values of the agreement score. We therefore believe that the latent trait method is of particular value if there is missing rating information and the number of raters is large, in which case the probability that some raters will see an unusually difficult or easy set of samples is increased.
In future work the method might also be developed to relax the assumption that missing rating information is uninformative, i.e. to test whether there was any preference on the part of the raters over which samples they chose to rate. For our example, this might occur if the pattern of missing data were related to the grade. The 5% of samples that were not rated most often occurred in groups of consecutively-numbered samples in single batches sent to certain individual raters, which in our opinion suggests that deliberate preferential rating is unlikely. In other extensions of the work, the method might be adapted for use with multivariate outcomes (e.g. to analyse the three components that constitute the grade), for ongoing rater assessment, and to deal with changes in rater behaviour or agreement over time (e.g. rater learning, or to check the impact of new grading guidelines). We do not anticipate that our proposed methods, designed for the case in which the number of raters is large, will be useful or even viable in small studies: much previous work has focussed on methods of analysis with fewer than ten raters (e.g.
Latent trait models have attracted some criticism because of the lack of interpretability of model parameters, owing to the arbitrary choice of latent scale
Simulation enables ranks of raters, with plausible confidence limits, to be estimated, and these could in principle be reported back to individuals. The wide confidence limits in our example, however, illustrate the great difficulty of estimating ranks precisely; even with greater precision, the practical value of knowing one's rank would be limited. Conversely, knowledge of the location of one's grade boundaries relative to other pathologists would be of potential interest, and these estimates require much less computation than the other results.
In the context of breast cancer tumour grading, our data show substantial variation between individual pathologists in the way in which grades are assigned to samples. This finding is broadly consistent with the existing literature: for example, Meyer et al. suggest that this is because ‘the level of agreement achievable is limited by the subjectivity of grading criteria’
Code used in the statistical analysis
(0.02 MB TXT)
AGL thanks colleagues at the CR-UK Cambridge Research Institute for helpful comments and suggestions. We thank Sue Moss, Derek Coleman and Sandhya Kodikara (Institute of Cancer Research Cancer Screening Evaluation Unit, Cotswold Road, Sutton, Surrey, UK) for provision of the data files and supporting information and Julietta Patnick, Director of NHS Cancer Screening Programs for her encouragement and support.