Welcome to Development and Maintenance of Rice Knowledge Management System

Bioinformatics Key Words
 
 

  • General Definition

    Chromosome

    The structure in the cell nucleus that contains all of the cellular DNA together with a number of proteins that compact and package the DNA.


    Gene

    Genes are working subunits of DNA. DNA is a vast chemical information database that carries the complete set of instructions for making all the proteins a cell will ever need. Each gene contains a particular set of instructions, usually coding for a particular protein.


    Central Dogma

    The original postulate that genetic information can be transferred only from nucleic acid to nucleic acid and from nucleic acid to protein, that is, from DNA to DNA, from DNA to RNA, and from RNA to protein (although information transfer from RNA to DNA was not excluded and is now known to occur [reverse transcription]). Transfer of genetic information from protein to nucleic acid never occurs.


    DNA

    The genetic material of most living organisms, which is a major constituent of the chromosomes within the cell nucleus and plays a central role in the determination of hereditary characteristics by controlling protein synthesis in cells. DNA is a nucleic acid composed of two chains of nucleotides in which the sugar is deoxyribose and the bases are adenine, cytosine, guanine, and thymine (compare RNA). The two chains are wound round each other and linked together by hydrogen bonds between specific complementary bases to form a spiral ladder-shaped molecule.


    DNA Replication

    DNA replication, the basis for biological inheritance, is a fundamental process occurring in all living organisms to copy their DNA. This process is "semiconservative" in that each strand of the original double-stranded DNA molecule serves as a template for the reproduction of the complementary strand. Hence, following DNA replication, two identical DNA molecules have been produced from a single double-stranded DNA molecule. Cellular proofreading and error-checking mechanisms ensure near-perfect fidelity for DNA replication.


    Transcription

    Transcription, or RNA synthesis, is the process of creating an equivalent RNA copy of a sequence of DNA. Both RNA and DNA are nucleic acids, which use base pairs of nucleotides as a complementary language that can be converted back and forth from DNA to RNA in the presence of the correct enzymes. During transcription, a DNA sequence is read by RNA polymerase, which produces a complementary, antiparallel RNA strand. As opposed to DNA replication, transcription results in an RNA complement that includes uracil (U) in all instances where thymine (T) would have occurred in a DNA complement.
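
    The base-pairing rule described above can be sketched in a few lines of Python. This is only an illustration: the function name transcribe and the convention of writing both strands 5'->3' are assumptions made for the example, not part of the original glossary.

```python
# Map each DNA template base to its complementary RNA base.
# Note that adenine in the template pairs with uracil (U), not thymine.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand: str) -> str:
    """Return the RNA that is complementary and antiparallel to a DNA template.

    Both the template and the returned RNA are written 5'->3' (an assumption
    of this sketch).
    """
    template = template_strand.upper()
    # Walk the template backwards so the complementary RNA also reads 5'->3'.
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in reversed(template))


if __name__ == "__main__":
    print(transcribe("TACGGT"))
```

    The lookup raises KeyError on any character other than A, C, G, or T, which is a deliberate choice here so that ambiguous input is not silently transcribed.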


    RNA

    Ribonucleic acid (RNA) is a biologically important type of molecule that consists of a long chain of nucleotide units. Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate. RNA is very similar to DNA, but differs in a few important structural details: in the cell, RNA is usually single-stranded, while DNA is usually double-stranded; RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom); and RNA has the base uracil rather than thymine that is present in DNA.
    RNA molecules play crucial roles in protein synthesis and other cell activities:

    Messenger RNA (mRNA) is a type of RNA that reflects the exact nucleotide sequence of the genetically active DNA. mRNA carries the "message" of the DNA to the cytoplasm of cells where protein is made in amino acid sequences specified by the mRNA.

    Transfer RNA (tRNA) is a short-chain type of RNA present in cells. There is at least one variety of tRNA for each of the 20 amino acids. Each variety combines with a specific amino acid and carries it along (transfers it), leading to the formation of protein with a specific amino acid arrangement dictated by DNA.

    Ribosomal RNA (rRNA) is a component of ribosomes. Ribosomal RNA functions as a nonspecific site for making polypeptides.



    cDNA or complementary DNA

    DNA that is synthesized in the laboratory from a messenger RNA template.

    Translation

    A step in protein biosynthesis wherein the genetic code carried by mRNA is decoded to produce the specific sequence of amino acids in a polypeptide chain.


    Proteins

    Proteins (also known as polypeptides) are organic compounds made of amino acids arranged in a linear chain and folded into a globular form. The amino acids in a polymer are joined together by the peptide bonds between the carboxyl and amino groups of adjacent amino acid residues.


    Proteome

    The Proteome is the protein complement expressed by a genome. While the genome is static, the proteome continually changes in response to external and internal events.

    CDS

    The coding sequence or the portion of a nucleotide sequence that makes up the triplet codons that actually code for amino acids.

    Genetic Code

    The instructions in a gene that tell the cell how to make a specific protein. A, T, G, and C are the "letters" of the DNA code. They stand for the chemicals adenine, thymine, guanine, and cytosine, respectively, that make up the nucleotide bases of DNA. Each gene's code combines the four chemicals in various ways to spell out 3-letter "words" that specify which amino acid is needed at every step in making a protein.
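
    As a rough illustration of how 3-letter "words" map to amino acids, the sketch below builds the standard genetic code as a lookup table and reads a coding sequence codon by codon. The names CODON_TABLE and translate are hypothetical, and the sketch assumes the sequence is already in the correct reading frame.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    seq = coding_dna.upper()
    protein = []
    for start in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```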


    Expressed sequence tag or EST

    A short strand of DNA that is a part of a cDNA molecule and can act as identifier of a gene. Used in locating and mapping genes.

    Motif

    A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function. Some common types of motifs are made up of two or more alpha helices or beta sheets.

    Open Reading Frame (ORF)

    An open reading frame contains a series of codons (base triplets) coding for amino acids without any termination codons. An unidentified sequence has six potential reading frames: three on each strand.
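
    The six reading frames can be enumerated with a short sketch: three offsets on the given strand and three on its reverse complement. The frame labels ("+1" to "-3") and the function names are assumptions chosen for the example.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Split a sequence into codons for each of its six reading frames.

    Keys "+1".."+3" are frames on the given strand; "-1".."-3" are frames
    on the reverse complement. Incomplete trailing codons are dropped.
    """
    frames = {}
    for label, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [
                strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)
            ]
            frames[f"{label}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for name, codons in six_frames("ATGAAAT").items():
        print(name, codons)
```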

    Orthologous

    Homologous sequences in different species that diverged from a common ancestral gene through speciation. Orthologous genes may or may not have similar functions.

    Paralogous

    Homologous sequences within a single species that are the result of gene duplication.

    Transformation

    A process by which the genetic material carried by an individual cell is altered by incorporation of exogenous DNA into its genome.

    Single Nucleotide Polymorphism

    Single Nucleotide Polymorphisms (SNPs) are single base pair positions in genomic DNA at which normal individuals in a given population show different sequence alternatives (alleles), with the least frequent allele having an abundance of 1% or greater. SNPs occur once every 100 to 300 bases and are hence the most common genetic variations.

    Proteomics

    Proteomics aims at quantifying the expression levels of the complete protein complement (the proteome) in a cell at any given time. While proteomics research was initially focussed on two-dimensional gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes the function of large sets of proteins. It is thus often used as a synonym for functional genomics.

    Genomics, Functional Genomics, Structural Genomics

    The goal of Genomics is to determine the complete DNA sequence for all the genetic material contained in an organism's complete genome. Functional genomics (sometimes referred to as functional proteomics) aims at determining the function of the proteome (the protein complement encoded by an organism's entire genome). It expands the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion, using large-scale experimental methodologies combined with statistical analysis of the results. Structural Genomics is the systematic effort to gain a complete structural description of a defined set of molecules, ultimately for an organism’s entire proteome. Structural genomics projects apply X-ray crystallography and NMR spectroscopy in a high-throughput manner.

    Database

    A database (or data base) is a collection of data that is organized so that its contents can easily be accessed, managed, and modified by a computer. The most prevalent type of database is the relational database which organizes the data in tables; multiple relations can be mathematically defined between the rows and columns of each table to yield the desired information. An object-oriented database stores data in the form of objects which are organized in hierarchical classes that may inherit properties from classes higher in the tree structure.

    Databank

    In the biosciences, a databank (or data bank) is a structured set of raw data, most notably DNA sequences from sequencing projects (e.g., the EMBL and GenBank databases).

    NCBI National Center for Biotechnology Information

    Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease. NCBI is part of the NIH. http://www.ncbi.nlm.nih.gov

    EMBL (European Molecular Biology Laboratory)

    The main laboratory is in Heidelberg, Germany, with outstations in Hamburg, Germany; Grenoble, France (access to high-powered instruments for structure studies); and Hinxton, UK (bioinformatics). Supported by 14 European countries and Israel, EMBL shares data daily with DDBJ and GenBank. http://www.embl-heidelberg.de/

    DDBJ (DNA Data Bank of Japan)

    DNA Data Bank of Japan (DDBJ) is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. Because DDBJ exchanges the collected data with EMBL-Bank/EBI (European Bioinformatics Institute) and GenBank/NCBI (National Center for Biotechnology Information) on a daily basis, the three data banks share virtually the same data at any given time. The virtually unified database is called "the International Nucleotide Sequence Database (INSD)". DDBJ collects sequence data mainly from Japanese researchers, but also accepts data from, and issues accession numbers to, researchers in any other country.

    GenBank

    NCBI, US http://www.ncbi.nlm.nih.gov/Genbank/ The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences, mirrored at EMBL and DDBJ. As of early 2000, an estimated 2 million or more bases were deposited each day, and this growth was expected to accelerate. Begun in the 1980s by the DOE. Cross-references DDBJ and EMBL.

    PDB

    PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

    MMDB (Molecular Modeling Database)

    NCBI, US - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?Db=Structure  3D macromolecular structures, including proteins and polynucleotides. MMDB contains over 28,000 structures and is linked to the rest of the NCBI databases, including sequences, bibliographic citations, taxonomic classifications, and sequence and structure neighbors.

    Entrez

    NCBI http://www.ncbi.nlm.nih.gov/Entrez/ A retrieval system for searching several linked databases. It provides access to PubMed (MEDLINE); the nucleotide sequence database (GenBank); the protein sequence database; Structure: three-dimensional macromolecular structures; Genome: complete genome assemblies; PopSet: population study data sets; Taxonomy: organisms in GenBank; and OMIM: Online Mendelian Inheritance in Man.

    Draft sequence

    The sequence generated by the Human Genome Project (HGP) as of June 2000 that, while incomplete, offers a virtual road map to an estimated 95% of all human genes. Draft sequence data are mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known.

    Accession number (GenBank)

    The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or Protein ID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.
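
    The two accession formats named above (one letter plus five digits, or two letters plus six digits) can be checked with a regular expression. This sketch covers only those two patterns as described in this entry; other formats may exist, and the function name is hypothetical.

```python
import re

# One letter + five digits (e.g. M12345) or two letters + six digits
# (e.g. AC123456). Only the formats described in the entry are covered.
GENBANK_ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def looks_like_genbank_accession(text: str) -> bool:
    """Return True if text matches one of the two described formats."""
    return bool(GENBANK_ACCESSION.match(text.strip().upper()))


if __name__ == "__main__":
    for candidate in ("M12345", "AC123456", "M1234", "NT_123456"):
        print(candidate, looks_like_genbank_accession(candidate))
```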

    Accession number (RefSeq)

    This accession number is the unique identification number for a complete RefSeq sequence record. RefSeq accession numbers are written in the following format: two letters followed by an underscore and six digits (e.g., NT_123456). The first two letters of the RefSeq accession number indicate the type of sequence included in the record as described below:

      NT_123456 constructed genomic contigs
      NM_123456 mRNAs (actually the cDNA sequences constructed from mRNA)
      NP_123456 proteins
      NC_123456 chromosomes
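
    The prefix list above maps naturally onto a lookup table. The sketch below classifies an accession by its two-letter prefix; the function name, the "unknown" fallback, and the tolerance for a trailing ".version" suffix are assumptions of the example.

```python
# Prefix meanings exactly as listed in the glossary entry.
REFSEQ_TYPES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by a RefSeq accession prefix."""
    # Drop an optional ".version" suffix before inspecting the accession.
    base = accession.strip().split(".")[0]
    prefix, separator, digits = base.partition("_")
    if not separator or not digits.isdigit():
        raise ValueError(f"not in prefix_digits form: {accession!r}")
    return REFSEQ_TYPES.get(prefix.upper(), "unknown")


if __name__ == "__main__":
    print(refseq_record_type("NM_123456"))
```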

    Accession Number line (EMBL)

    The AC (Accession Number) line lists the accession numbers associated with this entry.

    Accession number line (Swiss-Prot)

    The AC (accession number) line lists the accession number(s) associated with an entry.

    Algorithm

    A series of steps defining a procedure or formula for solving a problem, that can be coded into a programming language and executed. Bioinformatics algorithms typically are used to process, store, analyse, visualise and make predictions from biological data.

    Alignment score

    The alignment score represents the likelihood that the described alignment is not random, providing an indication of its validity. It is calculated by totaling the scores for each matched pair of residues at each position in the alignment. Unmatched residues are charged the gap open penalty (the gap penalty for non-affine searches) or, where appropriate in an affine search, the gap extension penalty.
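
    A minimal sketch of this calculation for an already-aligned pair of sequences follows. The score values (match, mismatch, gap_open, gap_extend) are arbitrary placeholders rather than values from any particular tool, and the sketch assumes the convention that the first position of a gap is charged the open penalty and each later position the extension penalty; real programs differ on this detail.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length, using '-' for gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Continuing a gap in the same row costs the extension penalty;
            # starting a new gap costs the open penalty.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


if __name__ == "__main__":
    print(alignment_score("AC--GT", "ACTTGT"))
```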

    Alignment

    The result of a comparison of two or more gene or protein sequences in order to determine their degree of base or amino acid similarity. Sequence alignments are used to determine the similarity, homology, function or other degree of relatedness between two or more genes or gene products.

    Annotation

    A combination of comments, notations, references, and citations, either in free format or utilising a controlled vocabulary, that together describe all the experimental and inferred information about a gene or protein. Annotations can also be applied to the description of other biological systems. Batch, automated annotation of bulk biological sequence is one of the key uses of Bioinformatics tools.    

    Biochips

    Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.  

    Bit score

    The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
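
    The normalization can be sketched with the usual Karlin-Altschul form, S' = (lambda * S - ln K) / ln 2, where lambda and K are statistical parameters of the scoring system. Those two parameters are not given in this glossary, so the values passed in the example are hypothetical; the second function shows how an expected number of chance hits is commonly derived from a bit score and the search-space size.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the scoring-system parameters (lambda and K); the caller
    must supply them, since they depend on the matrix and gap costs used.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    s_prime = bit_score(raw_score=100, lam=0.267, k=0.041)
    print(round(s_prime, 1), expected_hits(s_prime, 300, 10_000_000))
```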

    Conservation

    When the substitution of one amino acid for another preserves the physicochemical properties of the original residue. For example, when a hydrophobic amino acid residue is replaced by another hydrophobic residue.

    Contig

    Group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome.

    Consensus

    A single sequence that represents, at each position, the variation found within the corresponding column of a multiple sequence alignment.
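
    One simple way to derive such a sequence is a majority vote per alignment column, sketched below. Majority rule is only one of several possible conventions (others use ambiguity codes or thresholds), and the function name is hypothetical.

```python
from collections import Counter


def consensus(aligned_rows):
    """Return the most common residue in each column of an alignment.

    Ties are resolved in favour of the residue seen first in the column.
    """
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("need at least one row, all of equal length")
    result = []
    for column in zip(*aligned_rows):
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "TCGT"]))
```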

    Dendrogram

    A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

    Dynamic Programming

    In general, dynamic programming is an algorithmic scheme for solving discrete optimization problems that have overlapping subproblems. In a dynamic programming algorithm, the definition of the function that is optimized is extended as the computation proceeds. The solution is constructed by progressing from simpler to more complex cases, thereby solving each subproblem before it is needed by any other subproblem. In particular, the algorithm for finding optimal alignments is an example of dynamic programming.
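
    The idea of solving each subproblem before it is needed can be seen in a compact global alignment scorer in the style of Needleman and Wunsch. The sketch below returns only the optimal score (not the alignment itself) and uses a simple linear gap penalty; the scoring values are placeholders chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b.

    Each table cell depends only on its upper, left, and upper-left
    neighbours, so one previous row is enough to fill the next one.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```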

    E value

    The number of different alignments with a score equal to or better than S that can be expected to occur simply by chance. Also referred to as the expectation value.

    FASTA format

    Sequence format that begins with a single-line description followed by lines of sequence data. This format can be used as query input when searching bioinformatic tools such as BLAST or ClustalW. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. Blank lines are not allowed in the middle of FASTA input. An example of a protein sequence in FASTA format is:

    >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
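
    Reading this format takes only a few lines of code. The sketch below is a deliberately small parser, not a replacement for an established library; the function name is hypothetical, and it is more lenient than the rule above in that it skips blank lines and surrounding whitespace instead of rejecting them.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # lenient: ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = ">seq1 example\nACGT\nTTGA\n>seq2\nMKV\n"
    for name, sequence in parse_fasta(sample):
        print(name, len(sequence))
```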

    Force-field

    In molecular dynamics and molecular mechanics calculations, the intra- and intermolecular interactions of a molecule are calculated from a simplified empirical parametrization called a force field. These include atom masses, charges, dihedral angles, improper angles, van-der-Waals and electrostatic interactions, etc.

    Gap

    A space introduced into an alignment to compensate for insertions or deletions in one sequence relative to another.


    Gene locus (pl. loci)

    The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.

    Gene name

    Official name assigned to a gene. According to the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee, it should be brief and describe the function of the gene.

    Gene Ontology

    A controlled vocabulary of terms relating to molecular function, biological process, or cellular components developed by the Gene Ontology Consortium. A controlled vocabulary allows scientists to use consistent terminology when describing the roles of genes and proteins in cells.

    Gene symbol

    Symbols for human genes are usually designated by scientists who discover the genes. The symbols are created using the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee. Gene symbols usually consist of no more than six upper case letters or combination of uppercase letters and Arabic numbers. Gene symbols should start with the first letters of the gene name. For example, the gene symbol for insulin is "INS." A gene symbol must be submitted to HUGO for approval before it can be considered an official gene symbol.

    Genome

    The genome is the gene complement of an organism. A genome sequence comprises the information of the entire genetic material of an organism.

    GI (GenBank)

    A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.

    Genetic algorithm

    A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.

    Global alignment

    When two nucleic acid or amino acid sequences are lined up along their entire length.

    Hidden Markov Model

    A Hidden Markov Model (HMM) is a general probabilistic model for sequences of symbols. In a Markov chain, the probability of each symbol depends only on the preceding one. Hidden Markov models are widely used in bioinformatics, most notably to replace sequence profiles in the calculation of sequence alignments.
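
    The Markov chain property mentioned above (each symbol depends only on the one before it) can be illustrated by computing the probability of a sequence under a plain first-order chain. This is not a full hidden Markov model, which would add unobserved states, and the probabilities used in the example are invented for illustration.

```python
import math


def markov_log_probability(sequence, initial, transition):
    """Natural-log probability of a sequence under a first-order chain.

    initial[x] is the probability of starting with symbol x, and
    transition[x][y] is the probability that y follows x.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    log_p = math.log(initial[sequence[0]])
    for previous, current in zip(sequence, sequence[1:]):
        log_p += math.log(transition[previous][current])
    return log_p


if __name__ == "__main__":
    # Hypothetical uniform model over the four DNA bases.
    bases = "ACGT"
    initial = {b: 0.25 for b in bases}
    transition = {b: {c: 0.25 for c in bases} for b in bases}
    print(markov_log_probability("ACG", initial, transition))
```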

    Homologous Proteins

    Two proteins with related folds and related sequences are called homologous. Commonly, homologous proteins are further divided into orthologous and paralogous proteins. While orthologous proteins evolved from a common ancestral gene, paralogous proteins were created by gene duplication.

    Homology

    Similarity in sequence that is based on descent from a common ancestor.


    Identity

    The extent to which two sequences are invariant.
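
    Identity is usually reported as a percentage computed from an alignment. The sketch below counts columns where both rows carry the same residue and divides by the alignment length, gaps included. The choice of denominator is an assumption of this example; other conventions divide by the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(row_a, row_b):
    """Percentage of alignment columns with identical residues.

    Rows must be aligned (equal length, '-' for gaps). Gap columns count
    toward the length but never as identical.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    identical = sum(
        1 for a, b in zip(row_a, row_b) if a == b and a != "-"
    )
    return 100.0 * identical / len(row_a)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
```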

    Ligand

    A small molecule noncovalently bonded to a larger macromolecule.

    Local alignment 

    The alignment of portions (rather than the entire sequence length) of two nucleic acid or amino acid sequences.

    Locus (pl. loci)


    The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.

    Masking

    The removal of repeated or low complexity regions from a sequence so that only the informative regions of the sequences are compared.

    Neural Network

    A neural network is a computer algorithm to solve non-linear optimisation problems. The algorithm was derived in analogy to the way the densely interconnected, parallel structure of the brain processes information.

    Ontology

    The word ontology has a long history in philosophy, in which it refers to the study of being as such. In information science, an ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships among them.

    Open reading frame (ORF)

    The sequence of DNA or RNA located between the start-code sequence (initiation codon) and the stop-code sequence (termination codon).

    Patent

    In genetics, conferring the right or title to genes, gene variations, or identifiable portions of sequenced genetic material to an individual or organization.

    Protein Folding Problem

    Proteins fold on a time scale from milliseconds to seconds. Starting from a random coil conformation, proteins can find their stable fold quickly although the number of possible conformations is astronomically high. The Protein Folding Problem is to predict the folding and the final structure of a protein solely from its sequence. The Protein Structure Prediction Problem refers to the combinatorial problem of calculating the three-dimensional structure of a protein from its sequence alone. It is one of the biggest challenges in structural bioinformatics.

    Protein ID (GenBank)

    The Protein ID is an identification number assigned to the amino acid sequence data included within a sequence record. This sequence identifier uses the accession.version format. Each protein ID is made up of three letters followed by five digits, a period, and a version number. For example, in a sequence record M12345, the Protein ID for the sequence translation could be AAA35650.1. If the protein sequence data changes in any way (even by just one amino acid), the version number in the Protein ID will be increased by an increment of one, while the accession number base remains constant. For example, AAA12345.1 would become AAA12345.2. Each amino acid sequence change also results in the assignment of a new GI number to the altered protein translation.

    PSI-BLAST

    Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search and is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile.

    P Value

    The probability of an alignment occurring with the score in question or better. The P value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.

    Query

    The input sequence (in FASTA format or as bare sequence data) or sequence identifier with which all the sequences in a database are compared during a BLAST search.

    Sequence Contig

    A contig consists of a set of gel readings from a sequencing project that are related to one another by overlap of their sequences. The gel readings of a contig can be combined to form a contiguous consensus sequence whose length is called the length of the contig.

    Sequence Profile

    A sequence profile represents certain features in a set of aligned sequences. In particular, it gives position-dependent weights for all 20 amino acids as well as for insertion and deletion events at any sequence position.

    Sequence tagged site or STS


    Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks for developing physical maps of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNAs.

    Similarity

    How related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation.

    Substitution Matrix

    A matrix containing values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.

    Threading

    Threading techniques try to match a target sequence on a library of known three-dimensional structures by "threading" the target sequence over the known coordinates. In this manner, threading tries to predict the three-dimensional structure starting from a given protein sequence. It is sometimes successful when comparisons based on sequences or sequence profiles alone fail because sequence similarity is too low.

    Turing Machine

    The Turing machine is one of the key abstractions used in modern computability theory. It is a mathematical model of a device that changes its internal state and reads from, writes on, and moves a potentially infinite tape, all in accordance with its present state. The model of the Turing machine played an important role in the conception of the modern digital computer.

    Version (GenBank)

    Similar to the Protein ID for protein sequences, the version is a nucleotide sequence identification number assigned to each GenBank sequence. The format for this sequence identifier is accession.version (e.g., M12345.1). Whenever the author of a particular sequence record changes the sequence data in any way (even if just a single nucleotide is altered), the version number will be increased by an increment of one, while the accession number base remains constant. For example, M12345.1 would become M12345.2. Each sequence change also results in the assignment of a new GI number. Whenever an individual searches an NCBI sequence database, only the most recent version of a record is retrieved.
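
    The accession.version rule can be expressed as a small helper that keeps the accession base and increments the version. The function name is hypothetical, and the sketch only manipulates the identifier string; it does not model the assignment of a new GI number.

```python
def bump_version(identifier: str) -> str:
    """Return the accession.version identifier with its version raised by one."""
    accession, separator, version = identifier.rpartition(".")
    if not separator or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return f"{accession}.{int(version) + 1}"


if __name__ == "__main__":
    print(bump_version("M12345.1"))
```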
 
     
 

This portal belongs to the Indian Agricultural Research Institute, Indian Council of Agricultural Research, an autonomous organization under the Department of Agricultural Research and Education, Ministry of Agriculture, Government of India. Copyright © 2010 IARI
Developed and maintained by: IARI, RKMP Team, Unit of Simulation and Informatics, IARI, New Delhi.