

Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees.

We find that the shortest unique substrings in Caenorhabditis elegans, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of Staphylococcus aureus strain MSSA476 compared to four other staphylococcal genomes.

We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at.

Sequence comparison is traditionally carried out using alignments. The alignment procedure ensures that only homologous positions are compared, and the corresponding algorithms form the classical core of bioinformatics. Once a sequence alignment has been computed, it can be used to determine, for example, signature oligonucleotides or unique genomic regions among a group of closely related organisms. Perhaps surprisingly, the applications of alignments just mentioned (signature oligos and detection of unique genomic regions) do not necessarily involve an alignment step. Since the computation of alignments tends to take time proportional to the product of the lengths of the sampled sequences, elimination of this step often leads to dramatic increases in the speed of sequence analysis algorithms.

Our method of alignment-free sequence comparison is based on the idea of "shortest unique substrings", that is, the shortest substrings of a sequence which are not found elsewhere. Consider for example the sequence S = ACCG. It contains ten substrings, of which the following eight are unique: A, G, AC, CC, CG, ACC, CCG and ACCG. The shortest of these are A and G. Such global shortest unique substrings can occur anywhere in S. In contrast, we define local shortest unique substrings to be tied to a specific position in S. More formally, we determine for every position i in S the length x of the substring S[i..i+x-1] such that it is unique while S[i..i+x-2] is not.
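The notions of global and local shortest unique substrings can be made concrete with a small brute-force sketch in Python. This is an illustration only: it enumerates all substrings directly rather than using a generalized suffix tree, so it is practical only for very short sequences, and it is not a description of how shustring or shulen work internally. The function names `substring_counts`, `local_shustring_lengths` and `global_shustrings` are hypothetical and chosen for this example.

```python
from collections import Counter


def substring_counts(s):
    """Count every (possibly overlapping) occurrence of each substring of s."""
    counts = Counter()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[s[i:j]] += 1
    return counts


def local_shustring_lengths(s):
    """For each start position i, return the length x of the shortest
    substring starting at i that occurs exactly once in s.
    None marks positions where every substring starting there is repeated."""
    counts = substring_counts(s)
    n = len(s)
    lengths = []
    for i in range(n):
        length = None
        for x in range(1, n - i + 1):
            if counts[s[i:i + x]] == 1:
                length = x
                break
        lengths.append(length)
    return lengths


def global_shustrings(s):
    """Return the unique substrings of minimal length, sorted alphabetically."""
    counts = substring_counts(s)
    unique = [sub for sub, c in counts.items() if c == 1]
    if not unique:
        return []
    shortest = min(len(sub) for sub in unique)
    return sorted(sub for sub in unique if len(sub) == shortest)


if __name__ == "__main__":
    print(local_shustring_lengths("ACCG"))  # [1, 2, 2, 1]
    print(global_shustrings("ACCG"))        # ['A', 'G']
```

For S = ACCG the two C positions need a substring of length 2 (CC and CG) because the single nucleotide C occurs twice, whereas A and G are already unique at length 1. A position can also have no unique substring at all, for example the last position of AA, which is why the sketch uses `None` there.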
