The authors have declared that no competing interests exist.
Conceived and designed the experiments: SAG ASC SRG. Performed the experiments: SAG ASC SRG. Analyzed the data: SAG ASC SRG. Contributed reagents/materials/analysis tools: SAG ASC SRG. Wrote the paper: SAG ASC SRG.
Teleconferencing as a setting for scientific peer review is an attractive option for funding agencies, given the substantial environmental and cost savings. Despite this, there is a paucity of published data validating teleconference-based peer review compared to the face-to-face process.
Our aim was to conduct a retrospective analysis of scientific peer review data to investigate whether review setting has an effect on review process and outcome measures.
We analyzed reviewer scoring data from a research program that had recently modified the review setting from face-to-face to a teleconference format with minimal changes to the overall review procedures. This analysis included approximately 1600 applications over a 4-year period: two years of face-to-face panel meetings compared to two years of teleconference meetings. The average overall scientific merit scores, score distribution, standard deviations and reviewer inter-rater reliability statistics were measured, as well as reviewer demographics and length of time discussing applications.
The data indicate that few differences are evident between face-to-face and teleconference settings with regard to average overall scientific merit score, scoring distribution, standard deviation, reviewer demographics or inter-rater reliability. However, some difference was found in the discussion time.
These findings suggest that most review outcome measures are unaffected by review setting, which would support the trend of using teleconference reviews rather than face-to-face meetings. However, further studies are needed to assess any correlations among discussion time, application funding and the productivity of funded research projects.
The American Institute of Biological Sciences (AIBS) has been providing peer review services to the biological research community since 1963
Overall, existing data in the psychology and team performance literature seem to indicate that the performance of teams is affected by technologically mediated communication (teleconferences, chat, email, etc.); however, the extent depends on the type of technology, the type of task in which the team is engaged, and whether the team is ad hoc (temporary) or established (appointed members serving regularly over a prescribed period of time)
AIBS has coordinated the scientific peer review of thousands of applications for one specific program (PrX) in support of a federal agency for over a decade, revealing interesting and informative data on the peer review process. Importantly, one aspect of the peer review process for PrX that evolved over the years is the review setting; reviews that were conducted via face-to-face meetings have transitioned to teleconference review meetings. Aside from the change in review setting, most of the AIBS review process for PrX has remained fairly constant. Reviewers have consistently used the same scoring scale, the same rules regarding conflict of interest and the same discussion format. Reviewers have consistently provided specific evaluative information to the client and specific feedback to investigators in much the same format. Therefore, PrX represents an appropriate opportunity for a retrospective study to observe some of the output metrics of the peer review process and examine whether they are affected by the change in review setting from face-to-face to teleconference meetings.
Funding for this program is appropriated to support research in a wide variety of topic areas (more than 80 different areas over the last 13 years). Topic areas have included, but are not limited to: vision, drug abuse, nutrition, blood-related cancer, kidney disease, autoimmune diseases, malaria, tuberculosis, osteoporosis, arthritis, and autism research. AIBS has provided independent, objective scientific peer review services for this program since its inception in 1999, reviewing over 6,000 applications. While several funding mechanisms have been employed, the most consistently used mechanism has been an NIH R01-like mechanism, which funds studies over 3-year periods in amounts up to $750,000 in direct costs. AIBS derived the data for this analysis from the review of applications submitted to this mechanism in the years 2009–2012. It should be noted that in both 2010 and 2012, a pre-application cull was used in which only a subset of investigators were invited to submit a full application, thus reducing the number of full applications for those years. The success rate of full applications was 4.6%, 9.3%, 8.9% and 10.1% for 2009–2012, respectively.
AIBS staff members recruit subject matter experts to review applications submitted to specific topic areas, choosing reviewers with areas of expertise closely matching the research applications under review. Reviewers are vetted for real and perceived conflicts of interest. They are required to sign a non-disclosure agreement to maintain the confidentiality of the review. Each review panel consists of 7–12 subject matter experts (including a chairperson) and, in recent years, one or more consumer reviewers. Consumer reviewers are full voting members who have direct experience with diseases relevant to the scientific topic areas being evaluated by the peer review panels. All panel members receive online and face-to-face (when applicable) orientations describing the AIBS peer review process.
Once panel members are recruited, review materials are disseminated and panel members begin reading and evaluating applications. Reviewers have access to all the applications but are only responsible for providing written comments for a subset of applications that closely matches their specific subject matter expertise; each application has at least two assigned reviewers. Reviewers score assigned applications using specific review criteria. Each application is given an overall scientific merit score. The overall scientific merit scale is from 1.0 to 5.0 (where 1 is highest scientific merit and 5 is lowest scientific merit). In recent years, reviewers have used the AIBS online evaluation system (SCORES; trademark pending), which allows for the capture of the initial evaluations and scores as well as online conferral among reviewers when needed.
For face-to-face review meetings, participants travel to the meeting destination (usually a hotel) for a one- or two-day meeting, depending on the size and number of applications to be reviewed. No travel is necessary for teleconference reviewers. During the peer review meeting (either face-to-face or teleconference), assigned reviewers present their critiques of the strengths and weaknesses for each application using specific review criteria. The discussion is then open to the panel. AIBS staff and the panel chairperson ensure that each application is reviewed using a consistent process and that overall scientific merit scores reflect what is written in each critique. AIBS staff also ensure that all applications receive a thorough and equitable discussion and that all panel members' concerns are noted. After panel discussion is completed, each panel member submits their final score (using the online AIBS SCORES system) on each application. The electronic score sheets are then locked, time-stamped and the final scores are recorded. The panel then moves on to the next application until all have been discussed and scored. Afterward, an overall summary paragraph of the panel's evaluation of the application's strengths and weaknesses and panel recommendations is created by the assigned reviewers for each application and then approved by the panel chairperson.
Written critiques and summary statements are edited by AIBS staff to ensure scientific accuracy and clarity. The final deliverable to the funding agency for each application consists of the overall summary statement, the average of the panelists' scores (also referred to as the overall score [OS]) and the assigned reviewers' critiques (anonymized). Panel members are then surveyed regarding the quality of review, the review procedures, interactions with AIBS staff, etc., to ensure continuous process improvement for AIBS and its clients.
Until 2011, the peer review for PrX was conducted through face-to-face peer review panels with only occasional teleconference reviewers for applications requiring specialized expertise. In 2011 and 2012, all applications were reviewed via teleconference panels.
In this analysis we compare reviewer scoring behavior for approximately 1,600 applications over a 4-year period, using data from two years of face-to-face panel meetings and two years of teleconference meetings. The average overall score (AOS) for applications reviewed each year is recorded. It should be noted that the OS for any given application is the average of the individual reviewers' scores, and the AOS is the average of the OSs of all applications from all panels for that year. The AOSs are compared over time, along with average standard deviations and reviewer inter-rater reliability (IRR) measures. The average discussion time per application was recorded per panel and then averaged over all panels for that year. The reviewer demographics were also recorded and analyzed over time. Where applicable, one-way ANOVA was applied with post-hoc Scheffé tests to compare data sets for each year and provide measures of statistical significance. Finally, some results of a reviewer survey gathering reviewers' assessments of the quality of the review process are also provided.
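The score aggregation and the year-to-year comparison described above can be illustrated with a minimal sketch. The function names and the example scores are hypothetical and are not taken from the study data; the ANOVA shown is the textbook one-way F statistic and is not necessarily the exact procedure or software used for this analysis.

```python
from statistics import mean

def overall_score(reviewer_scores):
    """OS: mean of the panel members' final scores for one application."""
    return mean(reviewer_scores)

def average_overall_score(applications):
    """AOS: mean of the OSs of all applications reviewed in one year."""
    return mean(overall_score(scores) for scores in applications)

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores (1.0 = highest merit, 5.0 = lowest); not study data.
year_a = [[1.5, 2.0, 2.5], [3.0, 3.5, 4.0], [2.0, 2.0, 2.0]]
year_b = [[2.5, 3.0, 3.5], [1.0, 1.5, 2.0], [4.0, 4.0, 4.0]]

os_a = [overall_score(s) for s in year_a]  # [2.0, 3.5, 2.0]
os_b = [overall_score(s) for s in year_b]  # [3.0, 1.5, 4.0]
print(average_overall_score(year_a))       # 2.5
print(one_way_anova([os_a, os_b]))
```

The F statistic would then be compared with the F distribution on the returned degrees of freedom to obtain a p-value.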
It should be noted that, although the specific scores/details of individual applications must be kept confidential, the data sets collected for this study will be anonymized and made available upon request.
Although it is common for standing review panels to develop a corporate memory (particularly with regard to funding level cut-offs) and potentially “chase the pay line” through their scoring, PrX had no standing panels over this timeframe, and roughly 50% of reviewers were new from one year to the next
Average score comparison between 2009, 2010 (face-to-face) and 2011, 2012 (teleconference) reviews. The total numbers of applications reviewed were 669, 291, 347, and 297 for 2009, 2010, 2011, and 2012, respectively.
In addition, the distribution of OSs of all applications in each year was compared (
Overall score (OS) distribution for all applications from 2009, 2010 (face-to-face) and from 2011, 2012 (teleconference) peer reviews.
The standard deviations of peer review panel member scores for each application have also been recorded. From 1999 to the present, average standard deviations for PrX have remained relatively stable from year to year, ranging from 0.18 to 0.27, which is consistent with standard deviations observed in NIH peer review panels
Average standard deviation of individual reviewer merit scores per application, comparing 2009, 2010 (face-to-face) and 2011, 2012 (teleconference) reviews.
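As an illustration of this metric, the sketch below computes the standard deviation of panel members' scores for each application and averages it over applications. It assumes the sample standard deviation (n − 1 denominator); the source does not state which form was used. The function name and scores are hypothetical.

```python
from statistics import mean, stdev

def average_score_sd(applications):
    """Mean of the per-application sample SDs of panel members' scores."""
    return mean(stdev(scores) for scores in applications)

# Hypothetical panel scores for three applications; not study data.
panel = [[2.0, 2.0, 2.0], [1.0, 2.0, 3.0], [3.0, 3.5, 4.0]]
print(average_score_sd(panel))  # 0.5 (per-application SDs are 0.0, 1.0 and 0.5)
```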
The ICC is plotted for 2009–2010 (face-to-face) and 2011–2012 (teleconference). The ICC is stable, ranging from 0.84 to 0.87 (p<0.01 for all years) with a standard error of approximately 0.06 over all years (
Intraclass correlation for 2009, 2010 (face-to-face) and 2011, 2012 (teleconference) reviews (p<0.01 for all years).
There is a small level of variation in the ICC from 2009 to 2012, which is less than the calculated error, and there is no obvious trend observed over time for either the ICC or the IRR. These data suggest that the teleconference review setting does not contribute to the contentiousness of panel decisions and does not drive decisions toward or away from consensus.
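For readers unfamiliar with the statistic, the following is a minimal sketch of one common form of the intraclass correlation, the one-way random-effects ICC, for a balanced design in which every application receives the same number of scores. This is an assumption made for illustration: the ICC model actually used in this analysis is not specified here, and real panels vary in size. The function name and data are hypothetical.

```python
from statistics import mean

def icc_one_way(ratings):
    """One-way random-effects ICC for a balanced design.

    ratings: list of applications, each a list of k reviewer scores.
    Returns (single-rater ICC, average-of-k-raters ICC).
    """
    n = len(ratings)
    k = len(ratings[0])
    if any(len(row) != k for row in ratings):
        raise ValueError("balanced design assumed: equal scores per application")
    grand = mean(x for row in ratings for x in row)
    # Mean squares between applications and within applications.
    ms_between = k * sum((mean(row) - grand) ** 2 for row in ratings) / (n - 1)
    ms_within = sum((x - mean(row)) ** 2 for row in ratings for x in row) / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

# Hypothetical scores: three applications, two reviewers each; not study data.
single, average = icc_one_way([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(round(single, 3), round(average, 4))  # 0.882 0.9375
```

Values near 1 indicate that most of the score variance lies between applications rather than between reviewers of the same application.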
The average time spent discussing each application was calculated for each year, and then a comparison was plotted in
Average discussion time per application over all panels for face-to-face (2009–2010) and teleconference (2011–2012) years. In 2009, there was an average of 28 applications per panel, whereas in 2010, 2011 and 2012, there were averages of 18, 16, and 20 applications, respectively, per panel.
However, there does appear to be a statistically significant difference in average discussion time between face-to-face and teleconference reviews (F[3,61] = 14.54; p<0.001), specifically between 2010 versus 2011 (mean difference = 9.97; p<0.001, CI: 5.56, 14.37) and 2010 versus 2012 (mean difference = 7.43; p<0.001, CI: 2.63, 12.24). This difference may be in part due to the review setting, as reviewers are “captive” at face-to-face meetings. This physical restriction may lend itself to extended peripheral discussions. In contrast, discussion during teleconference reviews may be more focused and efficient, as reviewers have a reduced level of interaction and can quickly reengage in their daily activities once the teleconference has ended. It should be noted that no correlation was observed between average panel scores and discussion time (data not shown).
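The pairwise comparisons reported above follow an omnibus ANOVA with a post-hoc Scheffé test. A minimal sketch of the standard Scheffé pairwise statistic is given below. The function name, group labels and discussion times are hypothetical, and the critical F value is assumed to be supplied by the caller (for example, from an F table); this is not the software or data used in the study.

```python
from itertools import combinations
from statistics import mean

def scheffe_pairs(groups, f_critical):
    """Scheffe post-hoc comparison of every pair of group means.

    groups: dict mapping a label to a list of observations.
    f_critical: critical value of F(k-1, N-k) at the chosen alpha.
    Returns {(a, b): (mean difference, Scheffe F, significant?)}.
    """
    k = len(groups)
    n_total = sum(len(v) for v in groups.values())
    # Pooled within-group mean square from the one-way ANOVA.
    ms_within = sum(
        (x - mean(v)) ** 2 for v in groups.values() for x in v
    ) / (n_total - k)
    results = {}
    for a, b in combinations(groups, 2):
        diff = mean(groups[a]) - mean(groups[b])
        pooled = ms_within * (1 / len(groups[a]) + 1 / len(groups[b]))
        f_s = diff ** 2 / (pooled * (k - 1))
        results[(a, b)] = (diff, f_s, f_s > f_critical)
    return results

# Hypothetical per-panel discussion times in minutes; not study data.
times = {
    "f2f": [28, 30, 32],
    "tele_1": [19, 20, 21],
    "tele_2": [21, 22, 23],
}
# 5.14 is the tabulated 5% critical value of F(2, 6).
for pair, (diff, f_s, significant) in scheffe_pairs(times, 5.14).items():
    print(pair, diff, round(f_s, 2), significant)
```

In this toy example the face-to-face group differs from both teleconference groups, while the two teleconference groups do not differ from each other.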
Video teleconferencing has been suggested as a virtual hybrid of face-to-face and teleconference meetings. In order to explore this mechanism for grant application review, AIBS piloted two video teleconference panels in 2011 (with 6 applications reviewed by one panel and 9 applications reviewed by the other). We observed average application discussion times of 15 and 17 minutes per application, respectively, which are lower than the overall 2011 and 2012 teleconference averages (19 and 22 minutes/application, respectively), suggesting that video teleconferencing does not avoid the loss in discussion time seen in teleconferences. This observation is consistent with the literature
AIBS staff have also tracked review panel member demographics over time for this program, to get a sense of whether review setting has a significant effect on reviewer recruitment. In
Relative proportions of reviewer degrees for each year (MD, MD/PhD, and PhD).
Relative proportion of reviewers in terms of seniority for each year. The senior academic level grouping included full professors, chairs, deans, and/or directors, the intermediate level grouping included associate professors, and the junior level grouping included assistant professors or equivalents.
AIBS routinely surveys reviewers for quality control purposes and to identify any areas in which improvement can be made. In light of the change in setting from face-to-face to teleconference, we polled reviewers to see if they felt the setting had any influence on the outcome of the review. Survey data from 2012 (N = 90) assessed peer review quality using a questionnaire with answers scored on a scale of 1.0 to 5.0 (where 5 is the best and 1 is the worst).
For the question “To what extent did you find the panel discussions fair and thorough?” an average answer of 4.5 was returned, with 98% of reviewers scoring above a 3.0 (98% of surveyed reviewers answered this question). This question was asked in a 2008 survey of this program (when a face-to-face review setting was employed), which also resulted in an average answer of 4.5.
For the question “Thinking of your past experiences with in-person, on-site review panels, to what extent did the teleconference review panel format achieve a thorough review of each application?” an average score of 4.0 was returned, with 77% of reviewers scoring above a 3.0 (81% of surveyed reviewers answered this question).
These questions relate to the thoroughness and equity of the review discussions themselves. This issue is of importance particularly to the funding agency and to the applicants
In addition to surveying the reviewers about the review process, AIBS also receives feedback from the funding agency. Although the feedback was positive and no change in review quality from the switch from face-to-face to teleconference reviews was noted, AIBS is currently in the process of acquiring more quantitative survey data on this topic.
The PrX program serves as an interesting and informative case study of the effects of review setting on the metrics of the peer review process. The data indicate that little difference is found in most of the review metrics between face-to-face and teleconference settings, which is consistent with group performance literature
There was some difference noted in terms of discussion time; teleconferencing and, in a much smaller sample, video teleconferencing had shorter discussion times when compared with face-to-face review data. As mentioned above, the captivity of reviewers and the opportunity for interactions at a one- or two-day face-to-face panel meeting likely lend themselves to prolonged peripheral discussion, which may not occur in the focused teleconference format. It has been hypothesized that while task-oriented focus is increased in a teleconference setting, there may be lower member engagement in group activities
The importance of potentially decreased peripheral discussion and member engagement for the final scoring decisions of teleconference reviews, as well as for the ultimate productivity of the research funded through this process, is unclear. We have found no correlation of either final application scores or standard deviations with discussion time (data not shown). A few studies have attempted to examine the effects of discussion on reliability and scoring decisions; however, the results are mixed, and more studies need to be conducted before any clear conclusion can be drawn
We would like to thank Dr. David Irwin and Ms. Amy Elmore for their help in editing this manuscript. In addition, we appreciate the discussions we have had regarding the statistical measures with Dr. Charles DiMaggio of Columbia University. Finally, we would like to thank the SPARS staff for their excellent work in implementing these reviews for over a decade.