The authors have declared that no competing interests exist.
Constructed the algorithmic approach and performed the secondary structure predictions: RB SJL. Conceived and designed the experiments: WRH RB IS. Performed the experiments: IS SH. Analyzed the data: SJL RB WRH. Wrote the paper: SJL WRH RB SH IS.
The CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats – CRISPR-associated proteins) system provides adaptive immunity in archaea and bacteria. A hallmark of CRISPR-Cas is the involvement of short crRNAs that guide associated proteins in the destruction of invading DNA or RNA. We present three fundamentally distinct processing pathways in the cyanobacterium
The RNA-based prokaryotic defense mechanism involves (i) an array of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR), made up of a leader, frequently palindromic repeated sequences with unique spacers located in-between, and (ii) a defining set of CRISPR-associated (Cas) proteins (see general reviews
Currently, at least 45 families of Cas proteins have been identified
Endoribonucleases from various CRISPR-Cas systems (types I and III) are key players in crRNA maturation and are known to cleave repeats at distinct sequence and structure motifs leaving an 8 nt 5′ repeat handle (5′ tag): Cas6
CRISPR-Cas systems are not only highly diverse across species, but a single organism, such as the model cyanobacterium
Vital to a successful CRISPR-based defense is the expression and accurate processing of mature crRNAs; by analyzing different aspects of the expression and maturation of crRNA, we determined that these three CRISPR loci in
The plasmid pSYSA of
Several
CRISPR | Subtype | Repeat sequence | TSS | Leader length (nt) | No. of spacers | Spacer length (nt) | Identical spacer-repeat units |
1 | I-D | | 16097 | 213 | 49 | 31–41 | 10–11, 30/31–32/33, 40–41 |
2 | III | | 68374 | 125 | 56 | 34–46 | 6/7–8/9, 37/38–39/40 |
3 | III | | 90104 | 1 | 38 | 35–47 | none |
For each CRISPR, the repeat sequence and single occurring mutant variants near the 3′ end are given in brackets (the widely conserved
Three potential Cas6 endoribonuclease genes are located on pSYSA:
According to the previously published sequence
A transcriptome analysis revealed an extremely high level of CRISPR-derived RNA transcripts, especially in comparison to other loci on the pSYSA plasmid (
(A) depicts the read coverage in log-scale (grey track) across the entire plasmid and the locations of the CRISPRs 1 to 3 are annotated as blue bars. All three CRISPRs are the most abundantly expressed loci on the plasmid. (B–D) show the expression profiles for CRISPRs 1 to 3, in that order. The reads have been filtered to reduce noise and the grey tracks in (B–D) depict their coverage profiles in log-scale. The numbers in the square brackets represent the absolute read number range; CRISPR3 is clearly the most abundantly expressed in comparison to the other two. The repeats are marked below by blue squares with their occurrence number. Due to the consecutive duplications of repeat-spacer units in CRISPR1 and CRISPR2 (
In
To define exact transcript boundaries of the CRISPR arrays, we mapped the TSS from the data in Mitschke
In agreement with their characterization as distinct types of CRISPR-Cas systems, processing intermediates and mature crRNAs of different characteristic lengths were observed (
The number of reads (y-axis) starting (red) or ending (black) at a position relative to the closest repeat (x-axis) across an entire CRISPR locus illustrates the CRISPR maturation products (for RNA-seq dataset A). The repeat sequence is indicated in the pink+red, the 5′ crRNA tag in the red, and the relative position in the spacer in the yellow rectangles, respectively (x-axis). One repeat-spacer unit is framed by the thick cyan square (due to different spacer lengths, the mode is illustrated). The green arrows correspond to the most abundant reads, i.e. the processed mature crRNAs or intermediate products. Although the spacers differ in length, the ruler mechanism is clearly visible, as the mature crRNA is trimmed to a fixed length. We identified the location of the accumulating reads by giving the percentage of reads in the respective read-length category that map to the illustrated location (square brackets). For CRISPR3, the first cleavage site is in the spacer (not in the repeat), supported by two observations: (i) reads only end at the cleavage site in the spacer, not in the repeat, and (ii) there is no accumulating RNA species that spans across the cleavage site in the spacer, whereas the 72 nt intermediate spans across the 13 nt cleavage site. CRISPR1 and CRISPR2 display only single cleavage sites and crRNAs are subsequently trimmed to their final length. CRISPR1 and CRISPR3 both have a second, less abundant mature crRNA transcript, which is exactly 6 nt shorter, whereas CRISPR2 only has one accumulating product. Note: Reads that appear 1–3 nt shorter are due to unknown read ends because of the poly(A) tails in the RNA-seq protocol.
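The start/end profile described in this caption can be approximated with a short script. The following is a minimal sketch, not the pipeline used for the figure: it assumes that "closest repeat" means the repeat whose start coordinate is nearest to the read end in question, and all coordinates, as well as the function names `closest_repeat_offset` and `end_profiles`, are invented for illustration.

```python
from collections import Counter


def closest_repeat_offset(position, repeat_starts):
    """Offset of a position from the start of the nearest repeat (assumed definition)."""
    nearest = min(repeat_starts, key=lambda start: abs(position - start))
    return position - nearest


def end_profiles(reads, repeat_starts):
    """Count read 5' starts and 3' ends by their offset to the closest repeat."""
    starts, ends = Counter(), Counter()
    for read_start, read_end in reads:
        starts[closest_repeat_offset(read_start, repeat_starts)] += 1
        ends[closest_repeat_offset(read_end, repeat_starts)] += 1
    return starts, ends


# Invented coordinates: 37 nt repeats whose starts lie 72 nt apart.
repeat_starts = [100, 172, 244]
# Each read is a (start, end) pair in the same coordinate system.
reads = [(129, 165), (201, 237), (201, 240), (129, 165)]

starts, ends = end_profiles(reads, repeat_starts)
print("most common start offset:", starts.most_common(1))
print("most common end offset:", ends.most_common(1))
```

Peaks in `starts` and `ends` would then correspond to candidate cleavage or trimming sites. Ties between two equally distant repeats are resolved in favor of the first one in the list, which is a simplification.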
(A) Read frequencies (y-axis) for each CRISPR locus, computed from RNA-seq dataset A. Read lengths are given on the x-axis, whereby it is important to note that the poly(A) tails of the RNA-seq protocol obscure read ends such that lengths of reads ending in A’s cannot be determined exactly. (B) Northern hybridization using a synthetic oligonucleotide probe against spacer 1 of CRISPR1 (C1S1), spacer 6 of CRISPR2 (C2S6) and spacer 2 of CRISPR3 (C3S2) to identify bands of accumulating transcript species. (C) Specific read frequencies for spacer 1 of CRISPR1 (C1S1), spacer 6 of CRISPR2 (C2S6) and spacer 2 of CRISPR3 (C3S2), for which the hybridization pattern is depicted in panel (B). The atypical accumulation of a transcript of 25 nt for spacer 6 of CRISPR2 is highlighted by the arrow. Mature crRNAs are indicated by asterisks (all panels).
To further characterize the accumulating transcripts for each CRISPR locus, we calculated (from the filtered sets
(A) Knock-out mutants of genes
The effect of its knock-out mutation (Δ) on the accumulation of CRISPR3-derived crRNAs, compared to wildtype (WT), is shown, together with the complementation by expressing a FLAG-tagged version of Cmr2 from a replicating plasmid
Despite the varying lengths of the spacers, all crRNAs accumulated to these fixed characteristic lengths, which further supports the ruler mechanism published for the Csm and Cmr systems
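The idea of a ruler mechanism, in which the final crRNA length is measured from the 5′ end regardless of spacer length, can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a description of the actual enzymology: the function `mature_crrna` is hypothetical, the sequences are invented, and only the 8 nt tag and the 37 nt mature length are taken from the values reported here for CRISPR2.

```python
def mature_crrna(upstream_repeat, spacer, downstream_repeat, tag_len, mature_len):
    """Toy ruler model: keep the 5' tag, then cut at a fixed distance from the 5' end."""
    # The 5' tag is the last tag_len nucleotides of the upstream repeat.
    tag = upstream_repeat[-tag_len:]
    # Sequence available downstream of the cleavage site in the upstream repeat.
    precursor = tag + spacer + downstream_repeat
    # 3' trimming to a fixed total length, independent of the spacer length.
    return precursor[:mature_len]


# Invented sequences, for illustration only (not the pSYSA repeats or spacers).
REPEAT = "GCAUCGAUGCAUCGAUGCAUCGAUGCAUCGAUGCAUC"
short_spacer = "A" * 34
long_spacer = "C" * 46

for spacer in (short_spacer, long_spacer):
    crrna = mature_crrna(REPEAT, spacer, REPEAT, tag_len=8, mature_len=37)
    print(len(spacer), len(crrna), crrna[:8])
```

Both spacers yield a product of the same length that begins with the same 8 nt tag, which is the pattern expected under a ruler mechanism.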
In summary, while the described processing patterns shared features with previously published ones, the detailed evidence suggests distinctly different pathways.
An important protein involved in processing CRISPR precursor transcripts is the Cas6 endoribonuclease, as demonstrated in
In contrast to CRISPR1 and CRISPR2, the accumulation and processing of CRISPR3-derived crRNAs was not affected by the knock-out mutation of either
We observed vast differences in the processed crRNA abundances across the CRISPR arrays (note that the log-scale reduces the visible differences in
The degradation of mature crRNAs correlates with spacer ensemble energies with a Pearson’s correlation coefficient r = 0.56 and p = 0.00025 (RNA-seq dataset B). Depicted is the CRISPR3 locus on the chromosome of
The general practice in the search for the functional CRISPR repeat structure is to compute the minimum free energy (MFE) structure of a single repeat sequence. The repeat is not transcribed as a single unit, however, but is located on a transcript in the context of other spacers and repeats. These flanking sequences can have a vast impact on the actual structure so that sub-optimal repeat structures could be preferred over the MFE structure. Although the MFE prediction is frequently correct due to highly stable stem-loop structures with many GC base-pairs, for example in
(A) The two most stable structure candidates; the MFE structure is in magenta. (B) The base-pair probability matrix, as computed by RNAfold
The black wedges indicate cleavage sites derived from the RNA-seq data and the yellow circles mark the 5′ repeat sequence tag of the mature crRNAs. The 5′ tags for CRISPR1 and CRISPR2 had the frequently published length of 8 nt. CRISPR3 was cleaved twice, first at the end of the spacer and second in the middle of the repeat leaving a novel-length 13 nt tag.
Our combined experimental and computational results describe three CRISPR-Cas systems on plasmid pSYSA, each with an independent and unique set of associated proteins and a distinct processing pathway. Recently, it was shown that diverse defense systems are frequently clustered in prokaryotic genomes
The maturation of crRNAs and precursor processing are essential to the function of the CRISPR-Cas system
The three CRISPR-Cas systems were named CRISPR1-3 and are associated with distinct sets of Cas proteins, classified as subtype I-D for CRISPR1 and type III for CRISPR2 and CRISPR3; the latter two could not be classified into specific subtypes
High-throughput transcriptomics and molecular assays illustrated that transcripts from all CRISPR arrays were highly abundant, especially in comparison to other loci on the pSYSA plasmid. Mapping of transcription start sites gave rise to transcribed leaders for CRISPR1 and CRISPR2, but the TSS for CRISPR3 was only one nucleotide upstream of the first repeat. It is unknown whether this lack of a leader could affect new spacer acquisition; however, the array was evidently processed.
A more detailed analysis determined the length and locations of accumulating transcripts, identifying possible mature crRNA sequences, which were disproportionately abundant and thus clearly visible: 39 and 45 nt for CRISPR1, 37 nt for CRISPR2, and 42 and 48 nt for CRISPR3. In agreement with our results, the accumulation of two distinct crRNA species with 6 nt difference and their incorporation into the protein complex was shown previously, where the longer species was also the more dominant
In addition, the most frequent 5′ and 3′ read end mapping locations gave a detailed insight into cleavage sites and processing patterns and especially highlighted the fact that the crRNAs from each locus must have been generated by distinct pathways. CRISPR1 had many accumulating transcript species, all shorter than one repeat-spacer unit, indicating a possible step-wise trimming mechanism from the 3′ end (arising from the cleavage site in the downstream repeat), whereas CRISPR2 and CRISPR3 crRNA maturation seemed to be independent of a downstream repeat. CRISPR3 showed a double-cleavage mechanism where the first cleavage occurred in the spacer (or at the 5′ end of the repeat); the second cleavage in the repeat generated a crRNA 5′ tag of an unusual 13 nt. In contrast, CRISPR1 and CRISPR2 displayed single repeat cleavages generating the usual 8 nt tag
In spite of transcripts arising from single TSSs, mature crRNAs accumulated to significantly different abundances, implying differences in their stabilities. Our computational analysis of CRISPR3 transcript accumulation indicated that spacers forming more stable structures are linked to higher degradation rates of the crRNA sequence. A similar observation has recently been reported for the crRNAs derived from CRISPR locus C in
Some Cas endoribonuclease proteins are known to bind to a hairpin motif in the repeat
Recently, it was demonstrated that a hairpin structure is important for Cas6-dependent processing in the type III CRISPR/Cas system of
In past work, the Cas6 endoribonuclease has been identified as the main player in the CRISPR RNA processing pathways in different organisms
Despite the fact that CRISPR3-derived RNA accumulated to high quantities and was evidently processed, its maturation was Cas6 independent: None of the three identified Cas6 homologs had an effect on CRISPR3 transcript accumulation. Given that Cas6 sequences are highly diverse, and are sometimes found as single genes detached from other Cas or Cmr gene cassettes, we searched for additional, possibly host-encoded,
To address the possibility that CRISPR3 is not functional, we searched among related cyanobacteria for a strain with a system closely related to CRISPR3. We found such a system in
Consequently, we searched for further factors affecting CRISPR3 crRNA accumulation by
For standard experiments, liquid cultures of
For each transformation 10 ml of
The triparental mating was used to conjugate
To analyse gene functions, selected
100 ml of
8 µg of total RNA per lane were separated on 10% polyacrylamide-urea gels and electroblotted on Hybond-N+ membranes from Amersham. Membranes were prehybridized for at least 30 min at 45°C with hybridization buffer (50% deionized formamide, 7% SDS, 250 mM NaCl, 120 mM NaPi buffer, pH 7.2) in glass tubes under continuous rotation. For northern hybridization, synthetic oligonucleotide probes (
The cDNA libraries for both datasets were prepared by vertis Biotechnologie, Germany (
For the RNA-seq analysis in dataset A, equal amounts of total RNA from cultures subjected to 10 different conditions (see ‘Culture media and growth conditions’) were mixed and rRNA was depleted using the MICROBExpress kit (Ambion). The RNA sample was fragmented with ultrasound (4 pulses of 30 s at 4°C) and then treated with phosphatase. Afterwards, the RNA fragments were re-phosphorylated using T4 polynucleotide kinase and then 3′ poly(A)-tailed using poly(A) polymerase, which was followed by ligation with an RNA adapter to the 5′-phosphate of the RNA fragments. First-strand cDNA synthesis was performed using an oligo(dT)-adapter primer and M-MLV reverse transcriptase. The resulting cDNA was PCR-amplified by 11 cycles to about 20–30 ng/µl using a high fidelity DNA polymerase and primers designed for TruSeq sequencing according to the instructions of Illumina. The cDNA was purified using the Agencourt AMPure XP kit (Beckman Coulter Genomics), analyzed by capillary electrophoresis and size-fractionated for the fraction <450 bp by elution from agarose gels. The cDNA pool was sequenced on an Illumina HiSeq 2000 machine yielding 33,357,164 reads of length 100 nt.
For the RNA-seq data of dataset B, the preparation and analysis of cDNA libraries on a Roche FLX (454) sequencer was previously described as (−) population
Mapping dataset A: Using the FASTQC analysis tool, we observed an increasingly poor sequencing quality towards read ends in this dataset, possibly due to the poly(A) tails and subsequent adapter sequences (see
Mapping of the smaller Dataset B was described in reference
To gain a more accurate picture of the CRISPR array expression, we filtered the original reads to reduce noise. The bulk of noise arises from short sequence reads that cover only the repeat regions and are therefore incorrectly mapped to all repeat instances, obscuring the coverage profiles. Thus we selected reads that mapped with a read quality of 1, an edit distance of 2, were located on the forward strand, and had a unique match. Due to the duplications in CRISPR1 and 2, we also allowed reads for these loci that mapped to two locations. This filtering delivered a cleaned up picture, but did not considerably change the original coverage profiles. These filtered reads are depicted in
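The filtering criteria above can be expressed as a simple predicate over mapped reads. The following is a minimal sketch, not the code used for the analysis: the `MappedRead` record and its field names are hypothetical, and reading "a read quality of 1" as a minimum and "an edit distance of 2" as a maximum is an assumption.

```python
from collections import namedtuple

# Hypothetical record; the field names are illustrative, not a mapper's output format.
MappedRead = namedtuple("MappedRead", "locus quality edit_distance strand n_hits")

# Loci with duplicated repeat-spacer units, for which two mapping locations are tolerated.
DUPLICATED_LOCI = {"CRISPR1", "CRISPR2"}


def keep_read(read, min_quality=1, max_edit_distance=2):
    """Return True if the read passes the noise filter (thresholds are assumptions)."""
    max_hits = 2 if read.locus in DUPLICATED_LOCI else 1
    return (
        read.quality >= min_quality
        and read.edit_distance <= max_edit_distance
        and read.strand == "+"          # forward strand only
        and read.n_hits <= max_hits     # unique, or two hits in duplicated loci
    )


# Invented example reads.
reads = [
    MappedRead("CRISPR1", 1, 0, "+", 2),  # two hits, allowed in CRISPR1
    MappedRead("CRISPR3", 1, 1, "+", 2),  # two hits, not allowed in CRISPR3
    MappedRead("CRISPR2", 1, 3, "+", 1),  # edit distance above the threshold
    MappedRead("CRISPR3", 1, 2, "-", 1),  # reverse strand
]
filtered = [r for r in reads if keep_read(r)]
print(len(filtered), "of", len(reads), "reads kept")
```

Keeping the per-locus hit limit in one place makes the exception for the duplicated units in CRISPR1 and CRISPR2 explicit.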
Let
The RNA structure ensemble energies of each spacer were calculated by RNAfold
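The relationship between spacer ensemble energies and crRNA degradation is summarized in this work by a Pearson correlation coefficient. As a minimal sketch of that statistic only, the function below computes Pearson's r with the standard library. The input values are invented toy numbers, the variable names are hypothetical, and the sign and size of the resulting coefficient say nothing about the reported result, since the degradation measure used here is not reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: one ensemble energy and one degradation score per spacer.
toy_energy = [-1.2, -3.5, -0.4, -6.1, -2.8]
toy_degradation = [0.10, 0.35, 0.05, 0.50, 0.30]
r = pearson_r(toy_energy, toy_degradation)
print(round(r, 2))
```

A p-value, as reported alongside r, would additionally require a significance test, which is omitted here.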
To explore the RNA-seq results and display structural properties we used the Integrative Genomics Viewer (IGV) version 2.0.3
All sequence analyses were done using the publicly available sequence in RefSeq (
We followed the procedure described below to produce more accurate structure predictions of repeats that also include the context sequence of the array.
The most probable repeat structure candidates were determined using RNAfold
To determine the influence of the context sequence on each repeat sequence location, we predicted the structure of the entire CRISPR array. Because CRISPR arrays are long and the sequence context is unknown owing to the intermediate processing steps, we used the local folding approach RNAplfold
Subsequently, the sub-matrices for each repeat instance were averaged to form an average dotplot for the repeat structure (see
The candidate from (1) with the highest structure accuracy in the average dotplot from step (3) (see
Dotplots are read as matrices. Each cell in the top triangle represents a base-pair probability for base i and base j in the bordering sequence. The dimension of each dot is given by the square root of its respective base-pair probability. The bottom triangle represents the base-pairs of the minimum free energy structure, where the dimensions are equal to 1. The average dotplot differs only in the fact that the dots in the upper triangle represent average base-pair probabilities for all sequence occurrences.
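The averaging of per-repeat sub-matrices into one dotplot can be sketched as an element-wise mean over the upper triangles. This is an illustrative sketch rather than the implementation used here: the function names `average_dotplot` and `dot_side_length` are hypothetical, the matrices are invented, and interpreting the "dimension" of a dot as its side length is an assumption.

```python
import math


def average_dotplot(matrices):
    """Element-wise mean of equally sized base-pair probability matrices."""
    if not matrices:
        raise ValueError("need at least one matrix")
    n = len(matrices[0])
    avg = [[0.0] * n for _ in range(n)]
    for m in matrices:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be n x n")
        for i in range(n):
            for j in range(i + 1, n):  # upper triangle only (i < j)
                avg[i][j] += m[i][j] / len(matrices)
    return avg


def dot_side_length(probability):
    """Side length of a dot drawn with the square root of its probability."""
    return math.sqrt(probability)


# Invented 4 x 4 sub-matrices for two instances of the same repeat.
instance_a = [[0, 0, 0, 0.8], [0, 0, 0.6, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
instance_b = [[0, 0, 0, 0.4], [0, 0, 0.2, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

avg = average_dotplot([instance_a, instance_b])
print(avg[0][3], dot_side_length(avg[0][3]))
```

Because every repeat instance has the same length, the sub-matrices align position by position, so the mean at cell (i, j) is the average probability that bases i and j of the repeat pair across all occurrences.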
The utilised software FASTQC and FASTX are unpublished and have been developed by the Bioinformatics Group at the Babraham Institute in Cambridge, UK, and the Computational Biology Service Unit from Cornell University in Ithaca, New York, USA, which is partially funded by Microsoft Corporation, respectively. We would also like to thank Elena Conti and Christian Benda for helpful discussion on possible Cmr2 functions, and Daniel Maticzka, Steve Hoffmann, and Christian Otto for sharing their expertise in the mapping of large-scale sequencing reads.