CAR Help Page

CAR (short for "Contig Assembly using Rearrangements") is an efficient and accurate tool for assembling the contigs of a prokaryotic draft genome based on a reference genome. Given a set of contigs in multi-FASTA format and a reference genome in FASTA format, CAR outputs a list of scaffolds, each of which is a group of contigs whose relative placement and orientation have been determined.

1. Input of CAR

CAR provides a user interface (refer to Figure 1) that is intuitive and easy to operate. It takes as input a set of contigs of a draft chromosome in multi-FASTA format and a reference chromosome in FASTA format. For convenience, the user can choose one of the examples (1) we prepared in advance for running CAR, or submit a job by following the steps below.

  1. Upload the file containing the contig nucleotide sequences of a draft chromosome in multi-FASTA format (2).
  2. Upload the file containing the nucleotide sequence of a reference chromosome in FASTA format (3), where the reference chromosome should be a completely assembled (finished) chromosome.
  3. Choose the data type, either "nucleotides" or "translated amino acids" (4), that is used to identify conserved genetic markers between the draft and reference chromosomes. If "nucleotides" is chosen, NUCmer is run on the input nucleotide sequences. If "translated amino acids" is chosen, PROmer is run on the six-frame amino acid translations of the input nucleotide sequences. Both NUCmer and PROmer are part of the MUMmer 3.0 system.
  4. Optionally, check the email box and enter an email address (5) to run CAR in batch mode. The user is then notified of the output by email when the submitted job finishes. We recommend batch mode when the uploaded draft and reference chromosomes are large, because identifying their conserved genetic markers can be time-consuming.
  5. Click the "Run CAR" button (6) to run CAR, or click the "Reset" button (7) to reset all of the above settings.
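
The choice made in step 3 amounts to selecting which MUMmer program is run on the two input files. The following Python sketch only builds a command line and does not execute it. The function name, file names and output prefix are hypothetical, and the options that CAR itself passes to NUCmer or PROmer are not documented on this page. The sketch relies only on the standard MUMmer usage in which the reference file precedes the query file and `--prefix` names the output files.

```python
# Map each data type offered in the CAR interface to a MUMmer program.
# The mapping follows the help text; everything else is an assumption.
PROGRAM_FOR_DATA_TYPE = {
    "nucleotides": "nucmer",             # alignment of nucleotide sequences
    "translated amino acids": "promer",  # alignment of six-frame translations
}


def build_marker_command(data_type, reference_fasta, contigs_fasta,
                         prefix="car_markers"):
    """Return the MUMmer command (as an argument list) for a data type.

    `prefix` is a hypothetical output name, not one used by CAR.
    """
    if data_type not in PROGRAM_FOR_DATA_TYPE:
        raise ValueError(f"unknown data type: {data_type!r}")
    program = PROGRAM_FOR_DATA_TYPE[data_type]
    # MUMmer expects the reference first and the query (contigs) second.
    return [program, f"--prefix={prefix}", reference_fasta, contigs_fasta]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_marker_command("nucleotides", "reference.fasta", "contigs.fasta"))
```

The returned list could be handed to `subprocess.run` on a machine where MUMmer is installed.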


Figure 1: The user interface of CAR.

2. Output of CAR

In the output page (see Figure 2 for an example), CAR first shows the input data and the user-specified parameter (8), including the nucleotide sequences of the input draft and reference chromosomes, a dot plot of the two before contig assembly (9), and the user-specified parameter for identifying conserved genetic markers. In the dot plot (see Figure 3 for an example), the contigs of the draft chromosome are plotted on the y-axis and the sequence of the reference chromosome on the x-axis. Forward matches are displayed in red and reverse matches in blue. Next, CAR shows the contig assembly result for the draft chromosome based on the reference chromosome (10), including the total running time (11), a set of scaffolds (12) (click here for an example) and its corresponding multi-FASTA file (13), a dot plot of the assembled draft chromosome against the reference chromosome (14) (see Figure 4 for an example), and a comparison of the dot plots before and after contig assembly (15).


Figure 2: An example of the CAR output page for a draft chromosome run with nucleotide sequences.


Figure 3: The dot plot of draft and reference chromosomes before contig assembly.


Figure 4: The dot plot of assembled draft and reference chromosomes after contig assembly.

3. CAR Results of Examples

In the following table, we provide the results of CAR on some prokaryotic draft chromosomes. For these examples, CAR finished each assembly job in roughly 15 to 94 seconds.

Table 1: Results of CAR on some prokaryotic draft chromosomes.
CAR result     Running time
Example 1      15.14 sec
Example 2      16.91 sec
Example 3      14.96 sec
Example 4      16.28 sec
Example 5      30.94 sec
Example 6      93.55 sec

4. Genomes Used in the Testing Dataset

To validate CAR, we tested it on a real dataset (click here for download) composed of several prokaryotic genomes and compared its accuracy to that of eight other reference-based contig assembly tools, namely Projector2 (van Hijum et al., 2005), OSLay (Richter et al., 2007), ABACAS (Assefa et al., 2009), Mauve Aligner (Rissman et al., 2009), fillScaffolds (Munoz et al., 2010), r2cat (Husemann et al., 2010), CONTIGuator (Galardini et al., 2011) and SIS (Dias et al., 2012). This dataset, which was used in the SIS study by Dias et al. (2012), contains 19 draft genomes of phylogenetically diverse prokaryotes that can be downloaded from the GenBank of NCBI. Four of these 19 genomes have two chromosomes and the others have only one, giving a total of 23 chromosomes in the dataset. Each of these 23 chromosomes was processed separately by each contig assembly tool. For the draft of each query chromosome, we used the 20 closest other genomes (excluding the query chromosomal genome itself) as different reference genomes. These were also selected by Dias et al. (2012) from the complete prokaryotic genomes at the GenBank of NCBI according to their phylogenetic distances from the query chromosomal genome. The purpose of choosing the 20 closest other genomes, instead of only the closest one, is to understand how the accuracy of a contig assembly tool changes with increasing evolutionary distance between the query and the possible reference genomes.

5. Genome Rearrangements

In studies of genome rearrangements (Fertin et al., 2009), a genetic marker (i.e., DNA sequence, gene or syntenic segment) is usually represented by a signed integer, in which the sign indicates the transcriptional direction (or strandedness) of the corresponding genetic marker.

5.1. Reversals

Reversals, also called inversions, affect a block of consecutive genetic markers on a chromosome by reversing the order and flipping the signs of the corresponding integers. For example, the reversal depicted in Figure 5 rearranges a chromosomal segment of two genetic markers (1, -3) into (3, -1).


Figure 5: A reversal rearranges the chromosomal segment (1, -3) into (3, -1).
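
The reversal operation can be sketched in a few lines of Python. This is an illustrative sketch, not CAR's implementation: the function name `reversal` and the 0-based, inclusive index convention are assumptions made here. A chromosome is modelled as a list of signed integers, as described at the start of this section.

```python
def reversal(markers, start, end):
    """Reverse the block markers[start..end] (inclusive) and flip its signs.

    `markers` is a list of signed integers; the input list is not modified.
    """
    if not 0 <= start <= end < len(markers):
        raise ValueError("block indices are out of range")
    # Reverse the order of the block and negate every marker in it.
    block = [-m for m in reversed(markers[start:end + 1])]
    return markers[:start] + block + markers[end + 1:]


if __name__ == "__main__":
    # The segment from Figure 5: (1, -3) becomes (3, -1).
    print(reversal([1, -3], 0, 1))  # [3, -1]
```

Applying the same reversal twice restores the original segment, since both the order and the signs are flipped back.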

5.2. Transpositions

Transpositions affect a chromosome by moving a block of its consecutive genetic markers to another location, as illustrated in Figure 6, or equivalently by exchanging two adjacent and non-overlapping blocks of consecutive genetic markers on the chromosome.


Figure 6: A transposition moves the block of a single genetic marker 1 into a new location between genetic markers -3 and 2 or, equivalently, exchanges two adjacent and non-overlapping blocks (1) and (-3). This is a special case of a block-interchange.
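
A transposition, viewed as an exchange of two adjacent blocks, can be sketched in Python as follows. The function name `transposition` and the half-open index convention are assumptions made for illustration. Because the figure is not reproduced in this text, the four-marker chromosome (-4, 1, -3, 2) used below is an assumed example that merely contains the markers named in the caption.

```python
def transposition(markers, i, j, k):
    """Exchange the adjacent blocks markers[i:j] and markers[j:k].

    Indices are 0-based and half-open, with i < j < k.
    The input list is not modified.
    """
    if not 0 <= i < j < k <= len(markers):
        raise ValueError("need 0 <= i < j < k <= len(markers)")
    # The second block moves in front of the first; signs are unchanged.
    return markers[:i] + markers[j:k] + markers[i:j] + markers[k:]


if __name__ == "__main__":
    # Assumed chromosome; exchanging blocks (1) and (-3) places
    # marker 1 between -3 and 2.
    print(transposition([-4, 1, -3, 2], 1, 2, 3))  # [-4, -3, 1, 2]
```

Unlike a reversal, a transposition changes only the positions of the markers and leaves their signs untouched.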

5.3. Block-Interchanges (Generalized Transpositions)

Block-interchanges are generalized transpositions that exchange two non-overlapping, but not necessarily adjacent, blocks of consecutive genetic markers on a chromosome. Therefore, the rearrangement exemplified in Figure 6 is both a transposition and a block-interchange, whereas the one shown in Figure 7 is a block-interchange but not a transposition.


Figure 7: A block-interchange (also called generalized transposition) exchanges two non-adjacent blocks (-4) and (2). Notice that this is not a transposition.
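
The following Python sketch illustrates a block-interchange under the same list-of-signed-integers model. The function name `block_interchange`, the half-open indices, and the chromosome (-4, 1, -3, 2) are assumptions made for illustration, since the figure itself is not reproduced in this text.

```python
def block_interchange(markers, i, j, k, l):
    """Exchange the non-overlapping blocks markers[i:j] and markers[k:l].

    Indices are 0-based and half-open, with i < j <= k < l.
    When j == k the two blocks are adjacent and the operation is an
    ordinary transposition. The input list is not modified.
    """
    if not 0 <= i < j <= k < l <= len(markers):
        raise ValueError("need 0 <= i < j <= k < l <= len(markers)")
    # Swap the two blocks; the markers between them keep their place.
    return (markers[:i] + markers[k:l] + markers[j:k]
            + markers[i:j] + markers[l:])


if __name__ == "__main__":
    # Assumed chromosome; the non-adjacent blocks (-4) and (2) are exchanged.
    print(block_interchange([-4, 1, -3, 2], 0, 1, 3, 4))  # [2, 1, -3, -4]
```

Calling `block_interchange([-4, 1, -3, 2], 1, 2, 2, 3)` exchanges the adjacent blocks (1) and (-3), which shows how a transposition arises as the special case in which the two blocks touch.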

6. FASTA and Multi-FASTA Formats of Sequences

A sequence in FASTA format starts with a single-line description, followed by lines of sequence data. The description line starts with a greater-than symbol (">"), which is usually followed by the sequence identifier and a description. An example of a FASTA-format sequence is given as follows.

>sequence
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACC
CTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTG

A multi-FASTA file consists of multiple FASTA-format sequences. The following is an example.

>sequence1
CAAAACGAAATCGAGCGCCATCTGCGGGCCCCGATTACCGACGCTCTCAGCCGCCGACTCGGACATCAGA
>sequence2
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCGACGACACTACCGTGCCGCCTTCCGA
>sequence3
AAATCCTGCTACCACATCGCCAGACACCACAACCGACAACGACGAGATTGATGACAGCGCTGCGGCACGG
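
To make the two formats concrete, the following Python sketch reads FASTA or multi-FASTA text into a list of (description, sequence) pairs. It is a minimal illustration, not the parser used by CAR; the function name `parse_fasta` is an assumption, and the short sequences in the usage example are made up for brevity.

```python
def parse_fasta(text):
    """Parse FASTA or multi-FASTA text into (description, sequence) pairs.

    Sequence data spread over several lines is joined into one string.
    """
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">sequence1\nCAAAAC\nGAAATC\n>sequence2\nTCCAAC\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence))
```

A single-sequence FASTA file yields a list with one record, so the same function covers both the contig file and the reference file.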

7. Contact Information

8. References

  1. S. Assefa, T.M. Keane, T.D. Otto, C. Newbold and M. Berriman. (2009) ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics, 25, 1968-1969.
  2. Z. Dias, U. Dias and J.C. Setubal. (2012) SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics, 13, 96.
  3. G. Fertin, A. Labarre, I. Rusu, E. Tannier and S. Vialette. (2009) Combinatorics of Genome Rearrangements, The MIT Press, Cambridge.
  4. M. Galardini, E.G. Biondi, M. Bazzicalupo and A. Mengoni. (2011) CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code for Biology and Medicine, 6, 11.
  5. P. Husemann and J. Stoye. (2010) r2cat: synteny plots and comparative assembly. Bioinformatics, 26, 570-571.
  6. C.L. Li, K.T. Chen and C.L. Lu. (2013) Assembling contigs in draft genomes using reversals and block-interchanges. In Tang, H. and Jiang, T. (eds.), Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-SEQ), Beijing, China.
  7. C.L. Li, K.T. Chen and C.L. Lu. (2013) Assembling contigs in draft genomes using reversals and block-interchanges. BMC Bioinformatics, 14 Suppl 5, S9.
  8. A. Munoz, C.F. Zheng, Q.A. Zhu, V.A. Albert, S. Rounsley and D. Sankoff. (2010) Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinformatics, 11, 304.
  9. D.C. Richter, S.C. Schuster and D.H. Huson. (2007) OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics, 23, 1573-1579.
  10. A.I. Rissman, B. Mau, B.S. Biehl, A.E. Darling, J.D. Glasner and N.T. Perna. (2009) Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics, 25, 2071-2073.
  11. S.A.F.T. van Hijum, A.L. Zomer, O.P. Kuipers and J. Kok. (2005) Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Research, 33, W560-W566.