The authors have declared that no competing interests exist.
Conceived and designed the experiments: PBC JLL DSJ. Performed the experiments: PBC. Analyzed the data: PBC. Contributed reagents/materials/analysis tools: PBC DSJ JLL. Wrote the paper: PBC JLL DSJ. Designed the software used in analysis: PBC.
Ecologists often use multiple observer transect surveys to census animal populations. In addition to animal counts, these surveys produce sequences of detections and non-detections for each observer. When combined with additional data (i.e. covariates such as distance from the transect line), these sequences provide the information needed to estimate absolute abundance when detectability on the transect line is less than one. Although existing analysis approaches for such data have proven extremely useful, they have some limitations. For instance, it is difficult to extrapolate from observed areas to unobserved areas unless a rigorous sampling design is adhered to; it is also difficult to share information across spatial and temporal domains or to accommodate habitat-abundance relationships. In this paper, we introduce a hierarchical modeling framework for multiple observer line transects that removes these limitations. In particular, abundance intensities can be modeled as a function of habitat covariates, making it easier to extrapolate to unsampled areas. Our approach relies on a complete data representation of the state space, where unobserved animals and their covariates are modeled using a reversible jump Markov chain Monte Carlo algorithm. Observer detections are modeled via a bivariate normal distribution on the probit scale, with dependence induced by a distance-dependent correlation parameter. We illustrate performance of our approach using simulated data and data from a known population of golf tees. In both cases, we show that our hierarchical modeling approach yields accurate inference about abundance and related parameters. In addition, we obtain accurate inference about population-level covariates (e.g. group size). We recommend that ecologists consider using hierarchical models when analyzing multiple-observer transect data, especially when it is difficult to rigorously follow pre-specified sampling designs.
We provide a new R package, hierarchicalDS, to facilitate the building and fitting of these models.
Transect surveys are often used to sample animal populations and are a central component of many inventory and monitoring programs. In such surveys, an observer travels along a set of lines or visits a finite collection of points, recording all animals they encounter within a fixed distance of the line (or point). If all animals within this strip are encountered, researchers can make inferences about abundance over a larger area by employing standard design-based sampling protocols
Distance sampling is one potential avenue for correcting for incomplete detection of animals in fixed area polygons. In its canonical form (e.g.
where
When animals are detectable from the air, from vessels at sea, or by other means (e.g. avian auditory counts), distance sampling provides a way to correct for imperfect detection in animal surveys without having to physically capture and mark animals. Correcting for imperfect detection is necessary when estimating absolute abundance, and is also viewed by many as an essential component of trend estimation because trends in detectability are typically confounded with trends in abundance unless detectability is explicitly accounted for
Researchers have extended conventional distance sampling to account for a variety of complications that arise in real life sampling scenarios. Several studies have utilized multiple observers to relax the assumption of complete detectability on the transect line
Several authors have recently proposed using hierarchical, Bayesian models in place of likelihood or moment-based estimators to analyze distance sampling data
Individual nodes indicate a parameter or vector of parameters, and arrows represent conditional dependence. Notation is defined in
Thus far, attempts to analyze line transect data with hierarchical models have focused on single observer data. In this paper, we develop hierarchical models for double observer data that permit habitat covariates to influence abundance intensity, while simultaneously modeling effects of covariates on detection probability. Since double observers are employed, these models allow for
Our modeling approach is applicable to sampling programs for a variety of taxa; here, we focus on describing a generalized hierarchical modeling framework, developing user-friendly software, and demonstrating the viability of our approach. After describing our proposed model, we use a small simulation study to verify that it provides reasonable inference about abundance for multiple species with different habitat preferences. Finally, we analyze data from a known population of golf tees that were sampled via a double observer distance sampling protocol. Golf tee clusters varied by the number of tees in each cluster, by color, and by level of exposure, allowing us to fit models that expressed detection probability as a function of covariates and to estimate posterior distributions for these covariates. In contrast to most population surveys, the true population is known for this dataset, which provides a verifiable test of our modeling framework.
Parameter | Definition |
 | Total animal abundance in the study area |
 | Number of groups of animals located in area |
 | The log of abundance intensity in area |
 | Precision of log of abundance intensity; used to impart overdispersion relative to the Poisson distribution |
 | Abundance intensity in area |
 | Parameters of the linear predictor describing variation in the log of abundance intensity as a function of habitat covariates |
 | The value of the |
 | Parameters describing the distribution of individual covariates at the population level |
 | Parameters of the linear predictor describing variation in the probit of detection probability as a function of observer and individual covariates |
 | Parameter describing increasing correlation between |
 | |
 | Bernoulli response variable for whether the |
 | Number of groups observed by at least one observer during transect |
 | Number of observers present when sampling transect |
 | The value of the |
 | Design matrix associated with habitat model |
 | Design matrix associated with the detection model for the |
 | Label for grid cell |
 | The area of grid cell |
Parameters and data used in the hierarchical model for distance data.
We propose a hierarchical model for distance sampling data consisting of several conceptually distinct components (
If conducting a Bayesian analysis, the posterior distribution is then proportional to
Given samples from the posterior, one can make posterior predictions of total abundance, so we might include another component
True abundance is indicated in red, with posterior means and estimated 95% credible intervals for abundance indicated by circles and brackets, respectively. Panel (A) gives results for the simulation with linearly increasing abundance, while panel (B) gives results for the simulation with a quadratic relationship between abundance and a habitat covariate.
The data collected in multiple observer transect surveys consist of a collection of binary observations,
Each symbol represents a different group of golf tees, with dark symbols representing yellow tees and gray symbols representing green tees. Groups that were observed by at least one observer are indicated by solid symbols, while open symbols indicate groups that were never observed. Squares represent tee groups that were exposed above surrounding grass, while triangles represent unexposed groups. Group sizes are indicated by the proportional size of each symbol, with the smallest symbols representing groups of 1 animal, and the largest symbols representing a group of 8 individuals. Transect lines are represented by solid black lines, with dotted lines giving survey area boundaries and demarcating the areas surveyed by each transect. The red line serves as the strata boundary (points north comprise the northern stratum).
Suppose for the moment that we also knew the total number of groups present in the area associated with transect
Bar plots representing the probability mass for group size in the golf tee experiment. Empirical distributions correspond to the actual distribution of group size used in the experiment, while posterior distributions represent estimated posterior predictive distributions obtained after analyzing data with our hierarchical model.
Conditional on
where
and with continuous distance data,
In both cases,
Kernel density estimates of marginal posterior distributions are indicated in black, with true values used to simulate data indicated by red, vertical lines. Parameters indexed by “Cov” give covariate parameters, “Det” give detection parameters, “Hab” give habitat parameters, and “N” gives abundance. The first panel (“cor”) gives an estimate of the observer dependence parameter. Species specific parameters are indexed by “sp1” (for species one) or “sp2” (species two).
Detection functions for each species are based on mean group sizes for each species (4 and 2, respectively), and are shown for observer 2 (who had an intermediate detection ability).
Like the (more popular) logit link function, the probit link function provides a transformation from
With multiple observers, conditioning on one value of
where
Some covariates thought to influence detection probability are collected at the transect level (e.g. survey conditions), and are therefore known for all potential groups of animals (observed and unobserved). However, for covariates associated with individual groups of animals (e.g., distance, group size, species), an underlying model is needed to link the observed covariates (for observed groups) to unobserved covariates (for unobserved groups). We model these individual covariates as having arisen from a parametric distribution, possibly with overdispersion. In general, let the covariate
where
where
Let
Here,
The model as written focuses on abundance of
For areas that are unsurveyed (that is,
while for the zero-truncated overdispersed Poisson model,
Posterior predictions of total abundance are then calculated as
Bayesian analysis requires specification of prior distributions for
We gave the
As suggested previously, the primary challenge in implementing a complete data model for multiple observer transect surveys was in jointly sampling local abundance and individual covariates. We chose to implement a reversible-jump algorithm (RJMCMC) to sample abundance at the transect level, in a manner similar to Durban and Elston
Specification of the complete data model starts with specifying an integer
Addition and deletion steps consist of increasing or decreasing the value of
Propose a new value for
Accept proposal with probability
Here,
if
This formulation, including the integral and specification of variance as 1.0, corresponds to the inverse probit link function
The Metropolis ratio,
where
Following the addition/deletion step, the next step in our RJMCMC estimation scheme is to resample individual covariates. For each such covariate, there are two categories of values to update: (1) covariates for groups of animals that are in the population but were never observed, and (2) covariates for latent groups not currently belonging to the population. Letting
for
For “pseudo-groups”
Estimation of remaining model parameters (conditional on a set level of abundance) proceeded by cyclical sampling of model parameters from their full conditional distributions
We developed generalized computing code to conduct MCMC estimation, which we implemented in the R programming environment
Kernel density estimates of posterior distributions are in black, while true values are represented by red vertical lines, and estimates from a conventional mark-recapture distance sampling analysis (see Laake and Borchers
We first used simulation to verify that our modeling approach provided reasonable estimates of abundance and related parameters. In particular, we generated a double-observer distance sampling dataset for two species with different habitat preferences and covariate values, but with a common detection function. For the first species, expected abundance increased linearly with an arbitrary covariate (here, transect number); expected abundance of the second species had a quadratic relationship with the covariate (
The top panel gives detection probability curves for the set of covariates that maximize observer dependence (observer = 2, group size = 1, exposure = 0, species = “green”). “Individual” specifies detection probability for observer 2 only; “Conditional” gives the probability of detection for observer 2 given that the group was detected by observer 1; “Duplicate” gives the probability of detection by both observers; “Pooled” gives the probability of detection by at least one observer. The bottom panel represents dependence, as summarized by the parameter
A total of 25 transects were simulated, each of which had two observers assigned. In all cases, observers were picked randomly from a pool of three total observers with different underlying detection probabilities. The assumed detection function was made to be a function of observer, distance, species, and group size. Correlation between observers was modeled as a linearly increasing function of distance on the probit scale, with a maximum value of 0.5 at the farthest observable distance.
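The dependence structure described here can be made concrete with a short numerical sketch. The code below is illustrative only and is not taken from hierarchicalDS: it assumes each observer detects a group when a latent score (linear predictor plus standard normal error) exceeds zero, that the two errors are bivariate normal, and that their correlation rises linearly from zero on the transect line to 0.5 at the farthest observable distance. The function names, the linear predictors, and the truncation distance of 4 are hypothetical. It returns the individual, conditional, duplicate, and pooled detection probabilities for a pair of observers.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, variance fixed at 1 (probit scale)


def rho_at(distance, max_distance, rho_max=0.5):
    """Correlation rising linearly from 0 on the line to rho_max (assumed form)."""
    return rho_max * distance / max_distance


def duplicate_prob(mu1, mu2, rho, n_steps=4000):
    """P(both observers detect) when observer i detects if mu_i + e_i > 0.

    (e1, e2) is standard bivariate normal with correlation rho, |rho| < 1.
    Uses e2 | e1 ~ Normal(rho * e1, 1 - rho**2) and a trapezoid rule over e1.
    """
    lo = -mu1                   # observer 1 detects only when e1 > -mu1
    hi = max(lo, 0.0) + 10.0    # the normal density is negligible beyond here
    step = (hi - lo) / n_steps
    cond_sd = math.sqrt(1.0 - rho * rho)

    def integrand(e1):
        return STD_NORMAL.pdf(e1) * STD_NORMAL.cdf((mu2 + rho * e1) / cond_sd)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for k in range(1, n_steps):
        total += integrand(lo + k * step)
    return total * step


def detection_summary(mu1, mu2, rho):
    """Four detection probabilities, with observer 2 as the focal observer."""
    p1 = STD_NORMAL.cdf(mu1)
    p2 = STD_NORMAL.cdf(mu2)
    both = duplicate_prob(mu1, mu2, rho)
    return {
        "individual": p2,            # observer 2 alone
        "conditional": both / p1,    # observer 2 given detection by observer 1
        "duplicate": both,           # both observers
        "pooled": p1 + p2 - both,    # at least one observer
    }


if __name__ == "__main__":
    max_distance = 4.0  # hypothetical truncation distance
    for distance in (0.0, 2.0, 4.0):
        # Hypothetical linear predictors that decline with distance.
        mu1 = 1.0 - 0.4 * distance
        mu2 = 0.8 - 0.4 * distance
        rho = rho_at(distance, max_distance)
        print(distance, detection_summary(mu1, mu2, rho))
```

With zero correlation the duplicate probability reduces to the product of the two individual probabilities; positive correlation raises the duplicate probability and lowers the pooled probability, which is why ignoring dependence would overstate how many groups are detected by at least one observer.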
Modeling construct | Choice | Alternatives | Advantages |
 | Bivariate normal distribution with correlation as a function of distance | Individual random effect | Not needed when there is one observer |
 | Probit | Logit, complementary log-log | Simplifies Bayesian computation through the Albert and Chib algorithm |
 | Complete data likelihood/data augmentation | Observed data likelihood | Eases explicit conditioning, simplifies likelihood computations, and enables extensions such as species misidentification |
 | Reversible Jump MCMC (RJMCMC) | Fixed dimension Bayesian inference (e.g. using an occupancy-like setup) | Straightforward implementation for non-uniform spatial support (e.g. unequal transect lengths) |
Choices, alternative(s), and advantages of the modeling choices we made when analyzing double observer line transect data.
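To illustrate why the probit choice in the table simplifies computation, the following sketch shows the latent-variable draw at the core of an Albert and Chib style sampler for the simplest case of a single observer. It is a simplified illustration under our own assumptions, not the multivariate update used for multiple observers: the latent score is Normal(mu, 1), truncated to be positive when the group was detected and non-positive otherwise. The function name `draw_latent` and the value mu = 0.3 are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used for the probit link


def draw_latent(mu, detected, rng):
    """Draw z ~ Normal(mu, 1), truncated to z > 0 if detected, else z <= 0.

    Sampling uses the inverse-CDF method on the relevant tail.
    """
    p_zero = STD_NORMAL.cdf(-mu)            # P(z <= 0) before truncation
    u = rng.random()
    if detected:
        p = p_zero + u * (1.0 - p_zero)     # uniform over the upper tail
    else:
        p = u * p_zero                      # uniform over the lower tail
    p = min(max(p, 1e-12), 1.0 - 1e-12)     # keep inv_cdf away from 0 and 1
    return mu + STD_NORMAL.inv_cdf(p)


if __name__ == "__main__":
    rng = random.Random(42)
    mu = 0.3  # hypothetical linear predictor on the probit scale
    print([round(draw_latent(mu, True, rng), 3) for _ in range(3)])
    print([round(draw_latent(mu, False, rng), 3) for _ in range(3)])
```

Given the latent scores, the regression parameters of the detection model have a conjugate normal full conditional, which is the computational convenience the table refers to. The clamping of `p` is a numerical guard chosen for this sketch and would need more care for extreme values of `mu`.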
Using the true functional form for detection and habitat models, we sampled the posterior distribution with two Markov chains of length 270,000 with random starting values, recording posterior values from one out of every 20 iterations to save disk space. Convergence was determined by examining trace plots and other standard convergence diagnostics
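As a sketch of the bookkeeping implied by this sampling scheme, the code below thins a chain by keeping one of every 20 iterations (so a chain of 270,000 iterations yields 13,500 stored values) and computes a potential scale reduction factor across chains. The text does not name the diagnostics that were used; the Gelman-Rubin statistic is our assumption of one common choice, and the function names are hypothetical.

```python
import random
from statistics import mean, variance


def thin(chain, every=20):
    """Keep one value out of every `every` iterations."""
    return chain[every - 1::every]


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)      # mean within-chain variance
    between = n * variance(chain_means)             # between-chain variance
    pooled_var = (n - 1) / n * within + between / n
    return (pooled_var / within) ** 0.5


if __name__ == "__main__":
    print(len(thin(list(range(270000)), every=20)))  # stored values per chain
    rng = random.Random(0)
    # Two hypothetical chains that sample the same distribution.
    chain_a = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    chain_b = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    print(gelman_rubin([chain_a, chain_b]))
```

Values of the statistic close to 1 are consistent with chains that have mixed over the same distribution; values well above 1 indicate that the chains disagree and that longer runs or better tuning are needed.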
To further test our estimation approach, we analyzed data from an experimental survey of golf tees collected at the University of St. Andrews in 1999
The locations of 108 groups of green golf tees and 142 groups of yellow golf tees were randomly assigned over a landscape with two spatial strata (
Eleven 8-m-wide transects were used to sample the population of golf tees, with eight independent observers traversing each transect. Transects varied in length but together covered the entire study area. We attempted to model these data in a similar manner to Laake and Borchers
Using the double observer distance data, we attempted to estimate abundance of each “species” of tee (here, green and yellow) using our hierarchical probit formulation. We specified separate models for abundance intensity (
where
We sampled the posterior distribution corresponding to the golf tee data with two Markov chains with different starting values. After an initial pilot run of 1,000 iterations to adjust MCMC tuning parameters to desired ranges, each chain was run for 100,000 iterations. Inspection of trace plots and other standard MCMC diagnostics suggested that convergence to a stationary distribution was obtained almost immediately; as such, we pooled the final 90,000 iterations of each chain for inference.
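The post-processing described in this paragraph can be sketched as follows: keep the final 90,000 iterations of each 100,000-iteration chain, pool them, and summarize a parameter by its posterior mean and an equal-tailed 95% credible interval. This is a minimal illustration with hypothetical function names; the index-based quantile rule is one simple choice among several, and the placeholder chains stand in for real MCMC output.

```python
def pool_after_burn_in(chains, keep_last):
    """Concatenate the last `keep_last` iterations of every chain."""
    pooled = []
    for chain in chains:
        pooled.extend(chain[-keep_last:])
    return pooled


def summarize(samples, level=0.95):
    """Posterior mean and equal-tailed credible interval from pooled samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[round(tail * (n - 1))]
    upper = ordered[round((1.0 - tail) * (n - 1))]
    return sum(ordered) / n, (lower, upper)


if __name__ == "__main__":
    # Placeholder chains; real chains would hold posterior draws of abundance.
    chains = [list(range(100000)), list(range(100000))]
    pooled = pool_after_burn_in(chains, keep_last=90000)
    print(len(pooled))
    print(summarize(pooled))
```

Pooling is only appropriate once the chains appear to have converged to the same distribution, which is why the diagnostics are inspected before the chains are combined.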
Estimated abundances mirrored truth in each transect (
For comparison with LB, we focus inference on the number of groups of animals (noting that posterior distributions for absolute abundance are also readily available). The posterior distribution for abundance of golf tees had a mean of 226 groups and 95% credible interval of (204, 251). By contrast, LB produced an estimate of 252, which was much closer to the true population size of 250. However, as LB note, this is somewhat accidental, as estimates of the number of groups in each color and exposure class differed substantially from true values. The hierarchical approach does better in this context, producing estimates that are as good as or better than those generated by LB (
Likelihoods for conventional mark-recapture distance sampling (MRDS) estimators are often written as a function of several different types of detection functions
As suggested by Royle
Double-observer transect data are widely used to estimate abundance of animal populations. Although previously available estimators (notably, Horvitz-Thompson-like estimators [HT];
An alternative approach to extrapolating abundance over a large spatial domain is to use a multi-stage statistical procedure, where the outputs from the first stage of modeling (e.g. density estimates) are used as inputs (data) for a second round of modeling (e.g. using a spatial model with habitat covariates)
We have presented a general framework for hierarchical analysis of double observer transect data that avoids many of these difficulties, obtaining posterior distributions for all parameters from a single analysis. In particular, abundance intensity can be made a function of habitat covariates, so that extrapolation to unsampled areas is straightforward (assuming that covariates are known for these areas). Further, precision of abundance estimates should be better than that of HT estimators whenever an explanatory habitat covariate can be identified. Modeling data from multiple observers allows us to relax the assumption of 100% detection on the transect line. Observer dependence can be accommodated via a bivariate normal distribution on the probit scale, helping to account for an increase in detection heterogeneity as a function of distance. To our knowledge, this is the first attempt at constructing a hierarchical model for double observer transect data.
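The extrapolation step can be sketched under one plausible reading of the abundance model, which we state here as an assumption rather than as the exact specification: the log of abundance intensity in a cell is normally distributed around a linear predictor of habitat covariates with precision tau, and the expected number of groups is the cell area times the exponentiated log intensity. The function names, covariate values, and posterior draws below are hypothetical, and the sketch returns expected group counts only (it omits the Poisson sampling step of a full posterior prediction).

```python
import math


def expected_groups(area, covariates, beta, tau):
    """Expected number of groups in one cell.

    Assumes nu ~ Normal(x'beta, 1/tau) and intensity = area * exp(nu), so the
    expectation uses the lognormal mean exp(x'beta + 1 / (2 * tau)).
    """
    linear_predictor = sum(x * b for x, b in zip(covariates, beta))
    return area * math.exp(linear_predictor + 0.5 / tau)


def predict_totals(cells, posterior_draws):
    """One expected total over unsampled cells for each posterior draw."""
    totals = []
    for beta, tau in posterior_draws:
        totals.append(sum(expected_groups(a, x, beta, tau) for a, x in cells))
    return totals


if __name__ == "__main__":
    # Hypothetical unsampled cells: (area, [intercept, habitat covariate]).
    cells = [(1.0, [1.0, 0.2]), (1.5, [1.0, 0.8])]
    # Hypothetical posterior draws of (beta, tau).
    draws = [([1.0, 0.5], 4.0), ([1.1, 0.4], 5.0)]
    print(predict_totals(cells, draws))
```

Because each posterior draw yields its own total, the spread of the returned values carries parameter uncertainty into the prediction for unsampled areas, provided covariates are known there.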
Our model performed well in estimating abundance of two simulated populations whose abundance intensities were linearly and quadratically related to a hypothesized habitat covariate. Admittedly, we supplied the estimation model with the correct functional form for habitat relationships, a convenience typically not possible in real world estimation scenarios. Unfortunately, we know of no universally accepted method for conducting model selection among alternate functional forms for habitat-density relationships when using an RJMCMC approach to estimation. For instance, the popular deviance information criterion (DIC;
Our model also performed well when estimating the abundance of a known population of golf tees, in some cases outperforming conventional MRDS HT estimators. The estimates from both approaches (hierarchical, MRDS) tended to underestimate the number of golf tees that were visually obstructed; however, we suspect that this was largely due to some groups of tees being virtually undetectable. As such, this should not be seen as a failure of our proposed method, but as an artifact of the particular dataset. It is well known that transect data alone will produce negatively biased estimates if some subset of the population is unavailable for detection
We made a number of modeling choices that differ from the way in which line transect data are typically analyzed. Some of these choices, together with our rationale, are listed in
Although we are convinced that our approach is valuable for making predictions in unsampled areas, there is clearly need for more research in this area. By virtue of its hierarchical structure, our approach can easily be extended to incorporate spatial autocorrelation in abundance. For instance, Schmidt et al.
We are also interested in extending our hierarchical framework to model partial observation and misclassification of species. In multi-species transect surveys, this is a real issue, as multiple observers often record species as unknown or have conflicting records. Currently available estimation approaches are incapable of handling such conflicts. Our data augmentation framework is clearly capable of treating true species as a latent (unobserved) variable, with misclassification introduced in the observation component of the model; however, parameter identification under such a scenario deserves further investigation.
We strongly encourage ecologists interested in abundance and species-habitat relationships to consider hierarchical modeling for estimation, especially when it is infeasible to conduct standard design-based inference. When surveys are replicated across time and space, hierarchical models provide demonstrable advantages over design-based modeling approaches, as information can be shared across temporal and spatial domains
We thank D. Borchers for providing us with the golf tee data, and J. Moore and B. McClintock for comments on a previous draft of this paper. Views expressed are those of the authors and do not necessarily represent findings or policy of any government agency.