The authors have declared that no competing interests exist.
Conceived and designed the experiments: RB CS SA. Performed the experiments: SA PVDS. Analyzed the data: SA. Contributed reagents/materials/analysis tools: PVDS CS. Wrote the paper: SA RB CS PVDS.
Objective identification and description of mimicked calls is a primary component of any study on avian vocal mimicry, but few studies have adopted a quantitative approach. We used spectral feature representations commonly used in human speech analysis, in combination with various distance metrics, to distinguish between mimicked and non-mimicked calls of the greater racket-tailed drongo, Dicrurus paradiseus.
The fundamental step in any study of vocal mimicry is to distinguish between mimicked calls and species-specific calls in an objective manner. This is usually done by listening to available sound libraries of a number of different species and identifying model species based on human psychophysical, often qualitative, perceptions of similarities between calls. This is commonly backed by visual inspection of spectrograms.
A more quantitative way of defining mimicry is to compare spectral features of the mimicked calls with those of the putative model calls using various statistical measures such as Multivariate Analysis of Variance (MANOVA).
Mimicry, by definition, implies call similarity, both structural and perceptual, since perceptual similarity must have a structural basis. It is therefore important to assess structural similarity between calls in studies of vocal mimicry. In our study we consider a call to be a mimicked one if it is more similar to the call of a model species than to: (i) other calls of its own species, and (ii) calls of a large number of other species with which it shares the same habitat. Recently, Igic and Magrath
In both birds and humans, sounds are produced during expiration by the flow of air through the vocal system. Even though the vocal organ in birds is structurally distinct from that of humans, acoustic output in both can be described by the ‘source-filter’ model.
The greater racket-tailed drongo,
All recordings of the greater racket-tailed drongo and putative model species were made at the Biligiri Rangaswamy Temple Tiger Reserve (77°–77°16′E, and 11°47′–12°09′N), a 540 sq. km area of forests at the junction of the Western and Eastern Ghats in southern India. Audio recordings were made at a sampling rate of 48 kHz using a portable Marantz PMD 671 digital recorder and a Sennheiser ME 66 directional microphone. Spectrograms of the recordings were generated in RAVEN.
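For readers without access to RAVEN, a comparable spectrogram can be computed with open-source tooling. The sketch below uses SciPy on a synthetic tone; the 3 kHz test signal and window settings are illustrative choices, not parameters from this study:

```python
import numpy as np
from scipy.signal import spectrogram

# 48 kHz sampling rate, matching the recordings; the frequency axis
# therefore extends to the 24 kHz Nyquist limit.
sr = 48000
t = np.arange(int(0.5 * sr)) / sr
call = np.sin(2 * np.pi * 3000 * t)  # synthetic 3 kHz tone standing in for a call

# Short-time power spectrogram: 512-sample windows, 50% overlap.
f, times, Sxx = spectrogram(call, fs=sr, nperseg=512, noverlap=256)

# Log (dB) scaling, as typically used for display.
log_Sxx = 10 * np.log10(Sxx + 1e-12)
```

The resulting `Sxx` matrix (frequency bins by time frames) is what a spectrogram image renders as grey-scale intensity.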
Samira Agnihotri (SA) has worked on the calls and songs of bird species in this area for eight years and identified mimicked calls in the recordings as a trained listener of bird calls. The phrase ‘mimicked call’ thus refers to notes/calls in the racket-tailed drongo's repertoire classified as mimicked based on the aural and visual perception of a highly trained listener (SA). This classification is used as the reference against which other classification results are compared in this manuscript. Examples of mimicked calls as classified by SA are shown in
All necessary permits were obtained for the described field studies from the Karnataka State Forest Department.
We asked 105 volunteers to assess similarity between mimicked and model calls by showing them spectrogram images of these calls. Typically, cross-validation by humans involves psychophysical methods in which one or more individuals are first trained to recognise and distinguish between sounds, or are already familiar with them, and are then asked to identify similarity by listening to a range of sounds in an experimental setup. Because inter-individual variability is likely to be high in untrained listeners, we tried to reduce it by opting for visual inspection of spectrograms, which are accurate time-frequency representations of these sounds.
We created spectrogram libraries of four types:
(i) Mimicked calls (21);
(ii) Putative heterospecific model calls (“models”) (21);
(iii) Putative heterospecific non-model calls (“other species”) (20);
(iv) Racket-tailed drongo putative non-mimicked calls (“species-specific”) (20).
Spectrograms of the sounds in these libraries were saved as JPEG files at a uniform resolution (3.75 s/line; 24 kHz/line) and numbers on the axes were removed digitally. We randomly picked 3 notes each from the “mimicked calls” (see
Species | Call type | Spectral signature
Banded bay cuckoo | Call | FM
Black-rumped flameback | Call | Trill
Common hawk-cuckoo | Call | FM
Common tailorbird | Call | FM
Crested serpent eagle | Call | FM
Crested treeswift | Call | HR
Green bee-eater | Call | Trill
Jungle babbler | Alarm | HR
Large-billed crow | Call | NB-Trill
Loten's sunbird | Call | BB
Oriental honey buzzard | Call | FM
Oriental honey buzzard | Courtship | FM
Oriental white-eye | Call | NB-Trill
Plum-headed parakeet | Call | HR
Red spurfowl | Call | HR
Rufous treepie | Call | FM
Rufous treepie | Alarm | BB
Shikra | Call | HR
White-breasted kingfisher | Call | Trill
Yellow-browed bulbul | Call | FM
Bonnet macaque | Alarm | HR
- files that were misclassified in the human assessment.
*- files that were misclassified by the computer-based method.
During the experiment, a person was shown the test spectrogram (‘mimicked’) and asked to identify and rank the two most similar spectrograms from the remaining 10 in the set. Each set was shown to a different person. All subjects were students from electrical engineering and biology departments who were familiar with signal processing methods and spectrograms but naive to the purpose of this study. No time limit was set for the task, but most individuals completed it within five minutes.
All participants gave informed verbal consent to participate in the study. Their names and institutional affiliations were recorded with their consent but are kept confidential. Approval for this component of the study, including the consent procedure, was obtained from the Institutional Human Ethics Committee of the Indian Institute of Science, Bangalore (IHEC No. 16/2013).
We used the following methods commonly used in human speech analysis to extract spectral feature vectors and calculate similarity between mimicked and putative model calls.
The critical requirement for a feature vector is that it be a smooth representation of the underlying short-time spectrum of the signal. The following representations, which have proved successful in speech recognition, are relevant to the problem at hand:
The log power spectrum thus obtained is subjected to a discrete cosine transform (DCT), which results in what is known as the mel cepstrum. Typically, only the first few coefficients are significant and, following standard practice in speech/speaker recognition experiments, the first 13 coefficients are used to constitute a mel-frequency cepstral coefficient (MFCC) feature vector. In addition to the
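The MFCC pipeline outlined above (power spectrum, mel filterbank, log compression, DCT, first 13 coefficients retained) can be sketched as follows; the filterbank size and frame length are illustrative defaults, not parameters reported in this study:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    """Map frequency in Hz onto the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, 0 to sr/2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """13 MFCCs for one frame: windowed power spectrum -> mel
    filterbank energies -> log -> DCT, keeping the first n_ceps."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    log_mel = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    return dct(log_mel, type=2, norm='ortho')[:n_ceps]
```

Frame-wise vectors computed this way can then be averaged or concatenated per call before the distance computation.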
The key steps in PLP are critical-band smoothing of the short-time spectrum, resampling of the smoothed spectrum at approximately 1-Bark intervals, pre-emphasis by an equal-loudness compensation, and compression of the resulting spectrum to simulate the intensity-loudness power law. The resulting spectrum has been shown to be consistent with many known results in acoustic signal perception. A feature-vector parametrization of the spectrum is obtained by all-pole modeling (the Levinson-Durbin recursion) and subsequent conversion to the cepstral domain.
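The last two steps named above can be sketched in isolation: the Levinson-Durbin recursion solves the Yule-Walker equations for the all-pole coefficients, which are then converted to cepstral coefficients by the standard LPC-to-cepstrum recursion. This is a generic sketch of those two steps only, not the full PLP front end (the Bark-scale smoothing, equal-loudness, and power-law stages are omitted):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations by the Levinson-Durbin recursion.
    r: autocorrelation sequence r[0]..r[order].
    Returns all-pole coefficients a (with a[0] = 1) and the final
    prediction error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients of the all-pole model 1/A(z) via the
    standard LPC-to-cepstrum recursion."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a first-order process with autocorrelation r = (1, 0.5, 0.25), the recursion recovers the expected coefficient a1 = -0.5.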
We used these feature vectors in combination with the following distance metrics to calculate similarity between mimicked and putative model calls:
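As an illustration, the Euclidean distance, cosine similarity, and Pearson correlation between two feature vectors, together with a ranking of library calls by distance to a query, could look like the following; the library dictionary and call labels are hypothetical:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two feature vectors (smaller = more similar)."""
    return float(np.linalg.norm(x - y))

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (1 = identical direction)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def rank_matches(query, library):
    """Rank library entries by Euclidean distance to the query,
    closest first; `library` maps call names to feature vectors."""
    dists = {name: euclidean(query, vec) for name, vec in library.items()}
    return sorted(dists, key=dists.get)
```

A mimicked call would then be scored by where its putative model lands in the returned ranking.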
We first created a library of 357 sound files that were segmented to include only one representative of each call. These included 63 mimicked calls, 84 heterospecific calls (including putative models) of 61 species and 210 racket-tailed drongo putative non-mimicked species-specific calls. Frequencies of the calls in these files ranged from 500 Hz to 8 kHz, spanning most of the frequency range of bird vocalisations. The notes were broadly classified on the basis of their spectral signatures (
Trill (Black-rumped flameback,
We used the results from this large data set to select one spectral feature extraction method and two distance metrics that gave the maximum number of correct matches. We selected the two similarity indices based on the total number of first-ranked correct matches made by the various indices. We also examined the variance in rank assignment within the first three ranks across calls (incorrect matches were scored as 4, the lowest rank, for this calculation) for all the indices.
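The rank-variance calculation described above (matches scored by their rank 1-3, misses scored as 4, the lowest rank) might be implemented along these lines; the function name and input convention are illustrative:

```python
import numpy as np

def rank_variance(ranks, miss_score=4):
    """Variance of the rank of the correct match across calls.
    `ranks` holds 1-3 for correct matches within the first three
    ranks and None for incorrect matches, which are scored as
    `miss_score` (the lowest rank)."""
    scored = [miss_score if (r is None or r > 3) else r for r in ranks]
    return float(np.var(scored, ddof=1))
```

Similar variances across methods then indicate comparable consistency in rank assignment.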
We then repeated the analysis using these selected methods on a smaller subset of the sound files, which was identical to the subset used for the human assessment. The design for this analysis was also identical to the design for the human assessment, i.e. each of the 21 mimicked calls was tested 5 times. Each comparison was against a randomly picked set of 10 other calls (3 “mimicked”, 3 “species specific”, and 3 “other species” plus the putative model call).
We used a 2-sample test for equality of proportions with Yates' continuity correction to compare the total proportion of correct matches obtained by each method. All statistical tests were performed in R 2.13.2.
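R's prop.test with Yates' correction corresponds to a chi-squared test, with continuity correction, on the 2x2 table of correct and incorrect matches. A SciPy equivalent can be sketched as follows; any counts passed in below are illustrative, not the study's:

```python
import numpy as np
from scipy.stats import chi2_contingency

def compare_proportions(hits_a, n_a, hits_b, n_b):
    """2-sample test for equality of proportions with Yates'
    continuity correction, done as a chi-squared test on the 2x2
    table of hits/misses (matches R's prop.test with correct=TRUE)."""
    table = np.array([[hits_a, n_a - hits_a],
                      [hits_b, n_b - hits_b]])
    chi2, p, dof, expected = chi2_contingency(table, correction=True)
    return chi2, p
```

Identical proportions yield a p-value near 1, while strongly different proportions yield a very small one.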
We examined how humans identified mimicry at three different levels. First, at the broadest level, 77 of the 105 people tested (73.3%) correctly matched a mimicked call to its model as their first choice. This accuracy increased to 82.9% when we included second-ranked calls in the criterion for a correct match. Second, on a call-by-call basis, 7 of the 21 mimicked calls tested (33.3%) were ranked as most similar to their putative model in all 5 trials (i.e. by five different people) in which they were presented (
There was a significant increase in accuracy when we compared the correct matches in the first rank with and without an 80% threshold (
In the first part of our analysis with a sound library of 357 files, 14 of the 27 mimicked calls (51.85%) tested showed the putative model call as the closest match (out of 356 files) across all three feature extraction methods. If we included the calls where the model was ranked as the second closest match (out of 356), the total number of files matched correctly increased to 17 (63%). The PLPCC feature vector performed best, giving 12 of these 17 correct matches. PLPCCs also performed well across three categories of call types (5/7 FM calls, all 3 BB calls and 3/4 HR calls). The Jaccard index, the correlation coefficient and the cosine similarity indices showed the maximum number of first-ranked correct matches across feature extraction methods (
RASTA-PLPCC | MFCC | LSF
Rank 1 | Rank 2 | Rank 3 | Rank 1 | Rank 2 | Rank 3 | Rank 1 | Rank 2 | Rank 3
9 | 3 | 2 | 5 | 1 | 1 | 5 | 1 | -
9 | 3 | 2 | 5 | 1 | 1 | 5 | 1 | -
9 | 3 | 2 | 5 | 1 | 1 | 5 | 1 | -
9 | 2 | 2 | 2 | 1 | 1 | 3 | 1 | 2
10 | 3 | 2 | 2 | 1 | - | 2 | 2 | -
RASTA-PLPCC (14) | MFCC (7) | LSF (7)
0.58 | 0.62 | 1.29
0.58 | 0.62 | 1.29
0.58 | 0.62 | 1.29
1.02 | 1.9 | 1.48
0.57 | 2.14 | 1.95
Similar values indicate consistency in assigning ranks.
When we performed the analysis on the smaller sound library, 11 of the 21 files tested (52.38%) using the PLPCC were ranked as most similar to their putative model in all 5 trials. If we included correct matches in the second rank, the total percentage of correct matches increased to 71.43%. In accordance with our analysis of the human assessment, we also applied a threshold of 80% accuracy per call, i.e. correct matches in at least four of the five trials. By this criterion, 15 of the 21 mimicked notes tested (71.4%) were matched correctly to their models in the first rank, and this increased to 76% when we included second-ranked correct matches. The increase in the total number of correct matches with the 80% threshold was not significant for the first rank, nor for the first and second ranks combined (
Five mimicked calls were not matched to their putative model in either the first or the second rank by the computational algorithms (
Both methods showed similar results in terms of the total proportion of mimicked calls that were matched correctly to their putative models. Although the automated method performed slightly better than humans, the difference was not significant (Rank 1:
This is the first study of vocal mimicry in the racket-tailed drongo that has attempted to quantify mimicry using an objective approach. This is also the first study in which MFCCs and PLPCCs have been applied to describe vocal mimicry in a bird species. Overall, both computer-based methods and human assessment showed similar results (in terms of proportion of calls) in matching a mimic to its model.
These results are similar to those obtained in previous studies on automatic recognition of bird calls, which used feature vectors from human speech analysis to examine the calls of a relatively large number of species
In our study, we did not have any training data sets, and the PLPCC gave the maximum number of correct matches irrespective of spectrogram category (FM, BB, HR, NB-Trill, and Trill), i.e. there was no structural similarity among the types of calls correctly classified by the PLPCC. The best matching performance in our study was obtained with the RASTA-PLPCC feature vectors, and this performance did not vary significantly with the distance metric employed. The MFCC and LSF feature representations performed nearly identically within the first two ranks, and both were significantly inferior to RASTA-PLPCC. LSFs are a discrete representation of the spectrum and hence more sensitive to minor perturbations of the spectrum, or to additive noise, than MFCCs/PLPCCs, which are smoother representations of the short-time spectrum.
It is, however, surprising that MFCC feature representations and dynamic features, which have become the de facto standard in speech/speaker recognition applications, did not fare well in the mimicry-to-model call-matching task. RASTA-PLPCC with the Euclidean distance metric emerged as the top performer in terms of the total number of correct matches. Among the three feature representations considered, it is also RASTA-PLPCC that comes closest to modelling the behaviour of the peripheral auditory system using suitable engineering approximations. Given that the ground truth in the call-matching task was provided by a trained human listener, it is probably not surprising that RASTA-PLPCC outperforms the other two representations.
Our results also corroborate recent work on vocal mimicry in the brown thornbill, where human identification and classification of mimicked notes did not differ from that done by computer-based methods
In this paper, we examined 21 racket-tailed drongo calls, comprising mimicry of 19 different species, and compared them with 61 other calls (including the putative models, 20 non-model species calls, and 20 racket-tailed drongo species-specific calls). In our study there was 81% overlap between the sounds judged as mimicry by the trained human assessor and by the untrained human assessors, and 76% overlap between the trained human assessor and the automated method. If we pool the results of the automated methods and the human assessment, we obtain a combined accuracy of 95%. This indicates that humans were able to match certain calls that the computational algorithms could not, and highlights the fact that cross-validation by humans is still a necessary component of studies involving automated procedures for identifying and classifying sounds. This could be especially relevant when calls vary greatly in signal-to-noise ratio, a problem that remains a significant hurdle for automated sound recognition in the field
Quantitative methods to study vocal mimicry provide a useful tool in identifying and establishing variations in mimicry repertoires across mimicking individuals. They can also be used to examine the accuracy of imitation between models and their mimics
We thank all the volunteers for the human assessment task (students at the Electrical Engineering and Electrical Communication Engineering departments, the Centre for Ecological Sciences at the Indian Institute of Science and students of the WCS-NCBS Masters course in Wildlife Biology & Conservation). Permission to work at the B.R.T. Tiger Reserve was given by the Karnataka State Forest Department.