What's SEVENS


SEVENS is a database of 7-TMR (seven-transmembrane helix receptor) candidates predicted from various genome sequences by combining sequence search, motif and domain assignment, transmembrane helix prediction, and gene quality refinement. The system is intended to detect multi-exon sequences and remote homologues that cannot be detected with conventional sequence search tools alone.

Contents

  • Content Search
  • Retrieve 7-TMR candidate sequences by the "AND" combination of (1) keyword in nr.aa database search results, (2) chromosome number, (3) data level, (4) predicted exon number, (5) gene length, (6) protein length, (7) E-value of sequence search against Swiss-Prot, nr.aa, or UniGene, (8) PROSITE motifs, (9) Pfam domains, (10) novel or not, and (11) family.

    After selection by some of these criteria, the 7-TMR candidates appear in the chromosomal viewer and in the gene lists, which link to the detailed analysis for each gene.


    #Notice: Except for the human genome, the genomic positions were obtained from the annotation of the NCBI genome resources, which does not include exon-intron positions. Therefore, for the other species you cannot search by predicted exon number or by E-value against "UniGene" sequences.


  • News
  • Release Information and news concerning updates of analysis.


  • What's SEVENS
  • Introduction to the SEVENS database and its usage.


  • Statistics
  • Release information of data statistics.


    How we found 7-TMR sequences.

    Candidate 7-TMR genes were collected from 9 eukaryotic genomes by using an automated gene discovery system. This system has two stages: (1) the gene prediction stage and (2) the gene screening stage. In the case of human, the amino acid sequences of the genes were predicted at stage (1) from the human genome sequence at NCBI (Build 34) and then subjected to stage (2). The gene sequences of the other species were downloaded from the "Genomic Biology" category of the NCBI web site (http://www.ncbi.nlm.nih.gov/) and subjected to stage (2).

    1) Gene prediction stage (i.e., translation of genomic sequences into amino acid sequences).
    2) Gene screening stage, in which 7-TMR candidates are selected by assessing genes with sequence search, motif and domain assignment, and transmembrane helix (TMH) prediction.

    (1) Gene prediction stage:

    Genomic sequences were obtained from the Human Genome Resources of the NCBI. To maximize the number of gene candidates, we generated three kinds of sequence sets:

    (a) "6f-sequences", which were all possible combinations of initiation and stop codons in the 6 reading frames, with the rule of using the most upstream ATG possible.
    (b) "ALN-sequences", which were aligned with known protein sequences by ALN.
    (c) "GD-sequences", which were generated by GeneDecoder.
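
    As an illustration of set (a), the following minimal sketch enumerates open reading frames in all 6 reading frames and starts each one at the most upstream ATG after the previous stop codon. It is not the code used for SEVENS: the function names, the standard stop codon set, and the `min_codons` parameter are assumptions made for this example, and translation into amino acids is omitted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # assumed: standard stop codons


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) of ORFs in one forward frame; end excludes the stop codon."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "ATG" and start is None:
            start = i  # most upstream ATG since the last stop codon
        elif codon in STOPS:
            if start is not None:
                yield start, i
            start = None


def six_frame_orfs(dna, min_codons=1):
    """Collect (strand, frame, nucleotide sequence) for ORFs in all 6 frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for s, e in orfs_in_frame(seq, frame):
                if (e - s) // 3 >= min_codons:
                    results.append((strand, frame, seq[s:e]))
    return results


# Toy input: one ORF on the forward strand, frame 2, with an internal ATG.
print(six_frame_orfs("CCATGAAAATGTTTTAGCC"))
```

    The internal ATG does not start a second ORF, which reflects the "most upstream ATG" rule.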

    Candidate sequences selected by the above process still contain the following redundancies. (1) Perfect matches or overlaps at the same genomic position (chromosome number, relative position on the genome). These originate from two independent sequence predictions: the 6-frame translation and the prediction by GeneDecoder. We regarded them as the same gene and corrected the double count accordingly. (2) Multiple sequence copies at different genomic positions. We regarded them as different genes. (3) Separate sequence fragments linked by a known protein sequence. These originate from erroneous predictions by the gene-finding programs. We merged them using the linker sequence.
    These redundancies were detected by the following clustering method for each level. First, Smith-Waterman sequence alignment was applied to the candidate sequences in an all-against-all fashion. Sequences were then linked together only when they matched over more than 50 amino acids with more than 95% identity and shared the same chromosome number and overlapping genomic positions. If the chromosome number was unknown for either or both sequences, they were linked only when the identity exceeded 99%. After computing the transitive closure of the links, each of the known human 7-TMR sequences from Swiss-Prot was aligned against all the candidate sequences, and all clusters that matched it over more than 50 amino acids with more than 99% identity were merged. Finally, in each cluster, the longest sequence was selected as the representative.
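
    The linking rule and the transitive closure described above can be sketched with a union-find structure. This is a hedged illustration only: the record layout (`chrom`, `start`, `end`, `length`), the hit tuple format, and the choice to require more than 50 aligned residues in the unknown-chromosome case as well are assumptions, and the Swiss-Prot merging step is left out.

```python
def overlaps(a, b):
    """True if two genomic intervals share at least one position."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]


def should_link(a, b, aln_len, identity):
    """Apply the linking thresholds to one pairwise alignment hit."""
    if aln_len <= 50:  # assumed to apply in both cases
        return False
    if a["chrom"] is None or b["chrom"] is None:
        return identity > 99.0
    return identity > 95.0 and a["chrom"] == b["chrom"] and overlaps(a, b)


def cluster(seqs, hits):
    """Return {representative id: sorted member ids} after transitive closure."""
    parent = {sid: sid for sid in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, aln_len, identity in hits:
        if should_link(seqs[a], seqs[b], aln_len, identity):
            parent[find(a)] = find(b)

    groups = {}
    for sid in seqs:
        groups.setdefault(find(sid), []).append(sid)
    # The longest sequence represents each cluster.
    return {max(m, key=lambda s: seqs[s]["length"]): sorted(m) for m in groups.values()}


# Hypothetical records and hits: (id_a, id_b, aligned length, % identity).
seqs = {
    "s1": {"chrom": "1", "start": 100, "end": 400, "length": 100},
    "s2": {"chrom": "1", "start": 300, "end": 700, "length": 130},
    "s3": {"chrom": "1", "start": 650, "end": 900, "length": 80},
    "s4": {"chrom": "2", "start": 100, "end": 400, "length": 100},
    "s5": {"chrom": None, "start": None, "end": None, "length": 60},
}
hits = [("s1", "s2", 90, 98.0), ("s2", "s3", 60, 96.5),
        ("s1", "s4", 100, 100.0), ("s5", "s4", 55, 99.5)]
print(cluster(seqs, hits))
```

    In this toy input s1 and s3 never hit each other directly, yet they end up in one cluster through s2, which is the effect of the transitive closure.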


  • ALN
  • Using a new convention for encoding a DNA sequence into a series of 23 possible letters, a dynamic programming algorithm ('aln', written in ANSI C) was developed to align a DNA sequence with a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as standard protein sequence alignment allowing for long gaps. The objective function also takes account of frame shift errors, coding potentials, and translation initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction, measured in terms of a correlation coefficient, was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. (Gotoh, O., Bioinformatics 16(3), 190-202 (2000).)


  • GeneDecoder
  • A gene-finding system based on a hidden Markov model (HMM) (Asai, K., Itou, K., Ueno, Y. & Yada, T., Pacific Symposium on Biocomputing 98, pp. 228-239 (PSB98, 1998)). This system allows multiple inputs: not only sequence information but also homology scores and other data may be integrated for prediction. The prediction accuracy was evaluated with "genesets98" (http://bioinformatics.weizmann.ac.Il/databases/genesets/Human/). Without using homology scores, the sensitivity was 83% and the specificity was 74% for the detection of gene position.


    (2) Gene screening stage:

    Each analysis tool was first assessed to determine two threshold settings, best specificity and best sensitivity, with a reference dataset: 7-TMR sequences and non-7-TMR sequences in the Swiss-Prot database. The best specificity threshold is intended to achieve, when applied to the reference dataset, almost 100% specificity with the minimum number of false negatives. The best sensitivity threshold, on the other hand, is intended to achieve almost 100% sensitivity with the minimum number of false positives.
    Using the thresholds shown in Table 1, we selected those 7-TMR candidates that showed significant sequence similarity or contained characteristic motifs and domains, and transmembrane helices.
    For the human genome in particular, four confidence levels of the datasets were defined by combining the best specificity and best sensitivity thresholds. Level A data, expected to show the best specificity, were obtained by combining the candidate sequences given by the best specificity thresholds of the sequence similarity search and the motif and domain assignments. To discover remote 7-TMR homologues, we combined the candidates from the three levels of TMH prediction thresholds (see Table 1) with the sequences obtained by the best sensitivity thresholds of the sequence search and the motif and domain assignments; the resulting level D data are expected to show the best sensitivity.
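
    The idea of the two threshold settings can be sketched as follows, assuming that each tool reports an E value and that a sequence is called 7-TMR when its E value is below the threshold. The function names, the candidate threshold grid, and the toy E values are hypothetical and are not the SEVENS reference data; the selection rule (rank by specificity first, then sensitivity, and vice versa) is one possible reading of the description above.

```python
def sens_spec(pos_evalues, neg_evalues, threshold):
    """Sensitivity and specificity when 'E < threshold' is called 7-TMR."""
    tp = sum(e < threshold for e in pos_evalues)
    tn = sum(e >= threshold for e in neg_evalues)
    return tp / len(pos_evalues), tn / len(neg_evalues)


def pick_thresholds(pos_evalues, neg_evalues, candidates):
    """Return (best-specificity threshold, best-sensitivity threshold)."""
    scored = [(t, *sens_spec(pos_evalues, neg_evalues, t)) for t in candidates]
    # Best specificity: highest specificity, ties broken by fewer false negatives.
    best_spec = max(scored, key=lambda r: (r[2], r[1]))
    # Best sensitivity: highest sensitivity, ties broken by fewer false positives.
    best_sens = max(scored, key=lambda r: (r[1], r[2]))
    return best_spec[0], best_sens[0]


# Toy E values for known 7-TMR (positive) and non-7-TMR (negative) sequences.
positives = [1e-120, 1e-90, 1e-40, 1e-5]
negatives = [1e-35, 1e-3, 0.5, 5.0]
print(pick_thresholds(positives, negatives, [1e-50, 1e-20, 1e-2]))
```

    With these toy values the strictest threshold gives no false positives, while the loosest one recovers every positive at the cost of two false positives.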


    Table 1. Thresholds used for 7-TMR discovery.

                                   Level A             Level B                       Level C                         Level D
                                   (Best specificity)                                                                (Best sensitivity)
    Sequence search with BLASTP    E < 10^-80          E < 10^-30                    E < 10^-30                      E < 10^-30
    Domain assignment with Pfam    E < 10^-10          E < 1.0                       E < 1.0                         E < 1.0
    Motif assignment with PROSITE  Not used            Match                         Match                           Match
    TMH prediction                 Not used            TMwindows(7) AND Hirokawa(7)  TMwindows(7) AND Hirokawa(6-8)  TMwindows(7) OR Hirokawa(7)
    Sensitivity                    99.4%               99.8%                         99.9%                           99.9%
    Specificity                    96.6%               70.0%                         48.4%                           20.0%

    Thresholds of the programs are shown.

    Using BLASTP (Altschul, S. F., et al., Nucleic Acids Res. 25, 3389-3402 (1997)), known 7-TMR sequences were searched against the reference dataset, and the sensitivity and specificity of E values were computed for discriminating correct pairs.
    Using HMMER (Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. & Sonnhammer, E. L., Nucleic Acids Res. 28, 263-266 (2000)), 7-TMR-specific hidden Markov models (Pfam domains) were assigned to the reference sequences, and the sensitivity and specificity of E values were computed for correct assignment.
    Since PROSITE patterns are written as regular expressions, we determined a P value for each pattern, calculated as the product of the residue frequencies in the Swiss-Prot database; the sensitivity and specificity of P values were computed for correct assignment.
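
    A minimal sketch of such a P value is given below. It multiplies one probability per pattern position; summing the frequencies of the residues allowed at a position such as [ST], and treating x as probability 1, are assumptions of this example rather than details stated above. The uniform frequency table is a placeholder, not the Swiss-Prot composition, and only a small subset of the PROSITE syntax is handled.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder background: uniform frequencies, NOT the real Swiss-Prot values.
UNIFORM = {aa: 1 / 20 for aa in AMINO_ACIDS}

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")


def pattern_p_value(pattern, freqs=UNIFORM):
    """Probability that a random sequence position starts a match to the pattern."""
    p = 1.0
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element}")
        core, repeat = m.group(1), int(m.group(2) or 1)
        if core == "x":                 # any residue
            prob = 1.0
        elif core.startswith("["):      # one of the listed residues
            prob = sum(freqs[aa] for aa in core[1:-1])
        elif core.startswith("{"):      # any residue except those listed
            prob = 1.0 - sum(freqs[aa] for aa in core[1:-1])
        else:                           # a single fixed residue
            prob = freqs[core]
        p *= prob ** repeat
    return p


# Example pattern in PROSITE syntax: N, not P, S or T, not P.
print(pattern_p_value("N-{P}-[ST]-{P}."))
```

    A smaller P value means the pattern is less likely to match by chance, so a cutoff on P plays the same role here as an E value cutoff does for the other tools.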
    For TMH prediction we used TMwindows, our original program, along with the method of Hirokawa et al. We treated the results as 7-TMR outputs when the predicted helix number fell between n and m. Here we used the n-m ranges 7-7, 6-8, 5-9, and 4-10 and combined the sequences obtained from each range of the two programs. For example, the descriptor {TMwindows(7) OR Hirokawa(6-8)} takes the union ("OR") of the sequences within range 7-7 obtained by TMwindows and the sequences within range 6-8 obtained by Hirokawa's method.
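
    Such descriptors can be read as set operations on sequence identifiers. The sketch below assumes, for simplicity, that each program reports a single helix count per sequence; the identifiers, counts, and function names are hypothetical.

```python
def in_range(helix_counts, lo, hi):
    """IDs of sequences whose predicted helix count lies within lo..hi."""
    return {sid for sid, n in helix_counts.items() if lo <= n <= hi}


def combine(op, a, b):
    """Combine two ID sets with 'OR' (union) or 'AND' (intersection)."""
    if op == "OR":
        return a | b
    if op == "AND":
        return a & b
    raise ValueError(f"unknown operator: {op}")


# Hypothetical predicted helix counts per sequence for the two programs.
tmwindows = {"g1": 7, "g2": 6, "g3": 7, "g4": 9}
hirokawa = {"g1": 7, "g2": 8, "g3": 5, "g4": 7}

# {TMwindows(7) OR Hirokawa(6-8)}
loose = combine("OR", in_range(tmwindows, 7, 7), in_range(hirokawa, 6, 8))
# {TMwindows(7) AND Hirokawa(7)}
strict = combine("AND", in_range(tmwindows, 7, 7), in_range(hirokawa, 7, 7))
print(sorted(loose), sorted(strict))
```

    The "AND" descriptor keeps only sequences on which both programs agree, while the "OR" descriptor keeps a sequence if either program places it in its range.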


  • Hirokawa Method

  • A useful tool for secondary structure prediction of membrane proteins from a protein sequence. The basic idea of prediction in this system is based on the physicochemical properties of amino acid sequences such as hydrophobicity and charges. The system deals with three types of prediction: discrimination of membrane proteins from soluble ones, prediction of the existence of transmembrane helices and determination of transmembrane helical regions.
    (Hirokawa, T., Boon-Chieng, S. & Mitaku, S. Bioinformatics.14,378-379 (1998).)


  • TMwindows

  • Predicts transmembrane helices by the following procedures.
    (1) It assigns the Engelman-Steitz-Goldman hydropathy index (Annual Review of Biophysics and Biophysical Chemistry 15, 321-353 (1986)) to the amino acid sequence and calculates the average hydrophobicity within a pre-determined window. This index was selected, after comparing all indices in the AAindex database (Protein Eng. 9, 27-36 (1996)), as the most powerful for discriminating membrane proteins from others using total average hydrophobicity.
    (2) The window size is varied from 19 to 27, and if the average hydrophobicity within a window exceeds 2.5, the region is regarded as a transmembrane helix. The total numbers of helices computed for the different window sizes give the range of the predicted helix number.
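
    The two steps can be sketched as a sliding-window scan. This is not the TMwindows source: merging consecutive above-threshold windows into one helix is an assumption, and the two-letter scale in the usage example is a placeholder, not the Engelman-Steitz-Goldman index, which would be passed in as `scale` in practice.

```python
def count_helices(seq, scale, window, threshold=2.5):
    """Count runs of consecutive windows whose mean hydropathy exceeds threshold."""
    values = [scale[aa] for aa in seq]
    helices, inside = 0, False
    for i in range(len(values) - window + 1):
        above = sum(values[i:i + window]) / window > threshold
        if above and not inside:
            helices += 1  # a new run of hydrophobic windows begins
        inside = above
    return helices


def helix_count_range(seq, scale, windows=range(19, 28), threshold=2.5):
    """Return (min, max) helix count over window sizes 19 to 27."""
    counts = [count_helices(seq, scale, w, threshold) for w in windows]
    return min(counts), max(counts)


# Placeholder scale and toy sequence: two hydrophobic stretches of 25 residues.
toy_scale = {"L": 3.0, "K": -3.0}
toy_seq = "K" * 10 + "L" * 25 + "K" * 10 + "L" * 25 + "K" * 10
print(helix_count_range(toy_seq, toy_scale))
```

    A sequence would then be assigned to a range such as 7-7 or 6-8 according to the helix counts returned over the window sizes.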


    (3) Databases used for analysis:

  • Human Genome Resources NCBI build#34.3
  • Drosophila melanogaster Genome Resources NCBI release of May 2004
  • Caenorhabditis elegans Genome Resources NCBI release of May 2004
  • Plasmodium falciparum Genome Resources NCBI release of May 2004
  • Encephalitozoon cuniculi Genome Resources NCBI release of May 2004
  • Arabidopsis thaliana Genome Resources NCBI release of May 2004
  • Oryza sativa Genome Resources TIGR release of May 2004
  • Saccharomyces cerevisiae Genome Resources NCBI release of May 2004
  • Schizosaccharomyces pombe Genome Resources NCBI release of May 2004
  • Swiss-Prot ver. 43
  • PROSITE release 18.28
  • Pfam release 13.0
  • GPCRDB release 8.0
  • nr.aa release of Jun. 23, 2004
  • UniGene build#70

  • Comments or questions to m-suwa@aist.go.jp
    Last revised on 2005/05/17.