IARI - RKMP Associated Partners
Bioinformatics Key Words
General Definition
Chromosome
The structure in the cell nucleus that contains all of the cellular
DNA together with a number of proteins that compact and package the DNA.
Gene
Genes are working subunits of DNA. DNA is a vast chemical information database
that carries the complete set of instructions for making all the proteins a cell
will ever need. Each gene contains a particular set of instructions, usually
coding for a particular protein.
Central Dogma
The original postulate that genetic information can be transferred only from
nucleic acid to nucleic acid and from nucleic acid to protein, that is, from DNA
to DNA, from DNA to RNA, and from RNA to protein (although information transfer
from RNA to DNA was not excluded and is now known to occur [reverse
transcription]). Transfer of genetic information from protein to nucleic acid
never occurs.
DNA
The genetic material of most living organisms, which is a major constituent of
the chromosomes within the cell nucleus and plays a central role in the
determination of hereditary characteristics by controlling protein synthesis in
cells. DNA is a nucleic acid composed of two chains of nucleotides in which the
sugar is deoxyribose and the bases are adenine, cytosine, guanine, and thymine
(compare RNA). The two chains are wound round each other and linked together by
hydrogen bonds between specific complementary bases to form a spiral
ladder-shaped molecule.
DNA Replication
DNA replication, the basis for biological inheritance, is a fundamental process
occurring in all living organisms to copy their DNA. This process is
"semiconservative" in that each strand of the original double-stranded DNA
molecule serves as template for the reproduction of the complementary strand.
Hence, following DNA replication, two identical DNA molecules have been produced
from a single double-stranded DNA molecule. Cellular proofreading and
error-checking mechanisms ensure near-perfect fidelity for DNA replication.
Transcription
Transcription, or RNA synthesis, is the process of creating an equivalent RNA
copy of a sequence of DNA. Both RNA and DNA are nucleic acids, which use base
pairs of nucleotides as a complementary language that can be converted back and
forth from DNA to RNA in the presence of the correct enzymes. During
transcription, a DNA sequence is read by RNA polymerase, which produces a
complementary, antiparallel RNA strand. As opposed to DNA replication,
transcription results in an RNA complement that includes uracil (U) in all
instances where thymine (T) would have occurred in a DNA complement.
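The base-pairing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of RNA polymerase: the function name `transcribe` and the convention that the input is the template strand written 5' to 3' are assumptions made for the example.

```python
# Pairing used when an RNA strand is built on a DNA template:
# A pairs with U (not T), T with A, G with C, C with G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_5to3: str) -> str:
    """Return the RNA strand (written 5'->3') that is complementary and
    antiparallel to a DNA template strand (also written 5'->3')."""
    template = template_5to3.upper()
    # Reading the template backwards makes the product antiparallel;
    # each base is then replaced by its RNA partner.
    return "".join(DNA_TO_RNA[base] for base in reversed(template))


print(transcribe("TACGGT"))
```

Every position where a DNA complement would carry T carries U in the result.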
RNA
Ribonucleic acid (RNA) is a biologically important type of molecule that
consists of a long chain of nucleotide units. Each nucleotide consists of a
nitrogenous base, a ribose sugar, and a phosphate. RNA is very similar to DNA,
but differs in a few important structural details: in the cell, RNA is usually
single-stranded, while DNA is usually double-stranded; RNA nucleotides contain
ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen
atom); and RNA has the base uracil rather than thymine that is present in DNA.
RNAs play crucial roles in protein synthesis and other cell activities:
Messenger RNA (mRNA) is a type of RNA that reflects the exact nucleotide
sequence of the genetically active DNA. mRNA carries the "message" of the DNA to
the cytoplasm of cells where protein is made in amino acid sequences specified
by the mRNA.
Transfer RNA (tRNA) is a short-chain type of RNA present in cells. There are at
least 20 varieties of tRNA, one or more for each of the 20 amino acids. Each
variety combines with a specific amino acid and carries it along (transfers it),
leading to the formation of protein with a specific amino acid arrangement
dictated by DNA.
Ribosomal RNA (rRNA) is a component of ribosomes. Ribosomal RNA functions as a
nonspecific site for making polypeptides.
cDNA or complementary DNA
DNA that is synthesized in the laboratory from a messenger RNA template.
Translation
A step in protein biosynthesis wherein the genetic code carried by mRNA is
decoded to produce the specific sequence of amino acids in a polypeptide chain.
Proteins
Proteins (also known as polypeptides) are organic compounds made of amino acids
arranged in a linear chain and folded into a globular form. The amino acids in a
polymer are joined together by the peptide bonds between the carboxyl and amino
groups of adjacent amino acid residues.
Proteome
The Proteome is the protein complement expressed by a genome. While the genome
is static, the proteome continually changes in response to external and internal
events.
CDS
The coding sequence or the portion of a nucleotide sequence that makes up the
triplet codons that actually code for amino acids.
Genetic Code
The instructions in a gene that tell the cell how to make a specific protein. A,
T, G, and C are the "letters" of the DNA code. They stand for the chemicals
adenine, thymine, guanine, and cytosine, respectively, that make up the
nucleotide bases of DNA. Each gene's code combines the four chemicals in various
ways to spell out 3-letter "words" that specify which amino acid is needed at
every step in making a protein.
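As a rough sketch of how the 3-letter "words" are read, the Python below builds the standard genetic code table and translates a coding sequence codon by codon. The names `CODON_TABLE` and `translate` are chosen for this example, and stopping at the first stop codon is an assumption about the desired behaviour.

```python
# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA or mRNA letters) into one-letter
    amino acid codes, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept mRNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):  # step through whole codons only
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))
```

With four letters and three positions there are 64 codons, which is why several codons map to the same amino acid.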
Expressed sequence tag or EST
A short strand of DNA that is a part of a cDNA molecule and can act as
identifier of a gene. Used in locating and mapping genes.
Motif
A discrete portion of a protein assumed to fold independently of the rest of the
protein and possessing its own function. Some common types of motifs are made up
of two or more alpha helices or beta sheets.
Open Reading Frame (ORF)
An open reading frame contains a series of codons (base triplets) coding for
amino acids without any termination codons. There are six potential reading
frames of an unidentified sequence: three on each strand.
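The six reading frames can be enumerated directly: three offsets on the given strand and three on its reverse complement. The sketch below is illustrative only; the frame labels (`+1` to `-3`) and function names are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Split a DNA sequence into codons for each of its six reading frames."""
    forward = dna.upper()
    frames = {}
    for strand_name, strand in (("+", forward), ("-", reverse_complement(forward))):
        for offset in range(3):
            # Drop trailing bases that do not fill a complete codon.
            usable = len(strand) - (len(strand) - offset) % 3
            frames[f"{strand_name}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, usable, 3)
            ]
    return frames


for name, codons in six_frames("ATGAAATAG").items():
    print(name, codons)
```

A frame is a candidate open reading frame only over stretches that contain no termination codon (TAA, TAG, or TGA).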
Orthologous
Homologous sequences in different species that result from a common ancestral
gene during speciation. Orthologous genes may or may not have similar functions.
Paralogous
Homologous sequences within a single species that are the result of gene
duplication.
Transformation
A process by which the genetic material carried by an individual cell is altered
by incorporation of exogenous DNA into its genome.
Single Nucleotide Polymorphism
Single Nucleotide Polymorphisms (SNPs) are single base pair positions in genomic
DNA at which normal individuals in a given population show different sequence
alternatives (alleles) with the least frequent allele having an abundance of 1%
or greater. SNPs occur once every 100 to 300 bases and are hence the most common
genetic variations.
Proteomics
Proteomics aims at quantifying the expression levels of the complete protein
complement (the proteome) in a cell at any given time. While proteomics research
was initially focussed on two-dimensional gel electrophoresis for protein
separation and identification, proteomics now refers to any procedure that
characterizes the function of large sets of proteins. It is thus often used as a
synonym for functional genomics.
Genomics, Functional Genomics, Structural Genomics
The goal of Genomics is to determine the complete DNA sequence for all the
genetic material contained in an organism's complete genome. Functional genomics
(sometimes referred to as functional proteomics) aims at determining the function
of the proteome (the protein complement encoded by an organism's entire genome).
It expands the scope of biological investigation from studying single genes or
proteins to studying all genes or proteins at once in a systematic fashion,
using large-scale experimental methodologies combined with statistical analysis
of the results. Structural Genomics is the systematic effort to gain a complete
structural description of a defined set of molecules, ultimately for an
organism’s entire proteome. Structural genomics projects apply X-ray
crystallography and NMR spectroscopy in a high-throughput manner.
Database
A database (or data base) is a collection of data that is organized so that its
contents can easily be accessed, managed, and modified by a computer. The most
prevalent type of database is the relational database which organizes the data
in tables; multiple relations can be mathematically defined between the rows and
columns of each table to yield the desired information. An object-oriented
database stores data in the form of objects which are organized in hierarchical
classes that may inherit properties from classes higher in the tree structure.
Databank
In the biosciences, a databank (or data bank) is a structured set of raw data,
most notably DNA sequences from sequencing projects (e.g., the EMBL and GenBank
databases).
NCBI (National Center for Biotechnology Information)
Established in 1988 as a national resource for molecular biology information,
NCBI creates public databases, conducts research in computational biology,
develops software tools for analyzing genome data, and disseminates biomedical
information - all for the better understanding of molecular processes affecting
human health and disease. Part of NIH.
http://www.ncbi.nlm.nih.gov
EMBL (European Molecular Biology Laboratory)
The main laboratory is in Heidelberg, Germany, with outstations in Hamburg,
Germany; Grenoble, France (access to high-powered instruments for structure
studies); and Hinxton, UK (bioinformatics). Supported by 14 European countries
and Israel, it shares data daily with DDBJ and GenBank.
http://www.embl-heidelberg.de/
DDBJ (DNA Data Bank of Japan)
DNA Data Bank of Japan (DDBJ) is the sole nucleotide sequence data bank in Asia,
which is officially certified to collect nucleotide sequences from researchers
and to issue the internationally recognized accession number to data submitters.
Because DDBJ exchanges the collected data with EMBL-Bank/EBI (European
Bioinformatics Institute) and GenBank/NCBI (National Center for Biotechnology
Information) on a daily basis, the three data banks share virtually the same
data at any given time. The virtually unified database is called "the
International Nucleotide Sequence Database (INSD)". DDBJ collects sequence data
mainly from Japanese researchers, but also accepts data from, and issues
accession numbers to, researchers in any other country.
GenBank
NCBI, US. http://www.ncbi.nlm.nih.gov/Genbank/
The NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences, mirrored at EMBL and DDBJ. As of early 2000, an
estimated 2 million or more bases were deposited here each day, a rate that was
expected to accelerate. Begun in the 1980s by the DOE. See also DDBJ and EMBL.
PDB
PDB is an archive of experimentally determined three-dimensional structures
(Brookhaven National Laboratory or EBI). Domain families represented in SMART
and in the PDB are annotated as being of known structure; links are provided in
SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a
variety of sequence-based and structure-based tools, whereas MMDB provides
access to literature information and structure similarities.
MMDB (Molecular Modeling Database)
NCBI, US. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?Db=Structure
A database of 3D macromolecular structures, including proteins and
polynucleotides. MMDB contains over 28,000 structures and is linked to the rest
of the NCBI databases, including sequences, bibliographic citations, taxonomic
classifications, and sequence and structure neighbors.
Entrez
NCBI. http://www.ncbi.nlm.nih.gov/Entrez/
A retrieval system for searching several linked databases. It provides access
to:
PubMed (Medline)
Nucleotide: sequence database (GenBank)
Protein: sequence database
Structure: three-dimensional macromolecular structures
Genome: complete genome assemblies
PopSet: population study data sets
Taxonomy: organisms in GenBank
OMIM: Online Mendelian Inheritance in Man
Draft sequence
The sequence generated by the Human Genome Project (HGP) as of June 2000 that,
while incomplete, offers a virtual road map to an estimated 95% of all human
genes. Draft sequence data are mostly in the form of 10,000 base pair-sized
fragments whose approximate chromosomal locations are known.
Accession number (GenBank)
The accession number is the unique identifier assigned to the entire sequence
record when the record is submitted to GenBank. The GenBank accession number is
a combination of letters and numbers that are usually in the format of one
letter followed by five digits (e.g., M12345) or two letters followed by six
digits (e.g., AC123456). The accession number for a particular record will not
change even if the author submits a request to change some of the information in
the record. Take note that an accession number is a unique identifier for a
complete sequence record, while a Sequence Identifier, such as a Version, GI, or
Protein ID, is an identification number assigned just to the sequence data. The
NCBI Entrez System is searchable by accession number using the Accession [ACCN]
search field.
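The two usual formats described above translate naturally into a regular expression. The sketch below checks only those two shapes (one letter plus five digits, or two letters plus six digits); other accession formats may exist and are not covered, and the function name is hypothetical.

```python
import re

# One uppercase letter + five digits, or two uppercase letters + six digits.
GENBANK_ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_genbank_accession(text: str) -> bool:
    """Return True if text has one of the two usual accession shapes."""
    return GENBANK_ACCESSION.fullmatch(text) is not None


print(looks_like_genbank_accession("M12345"))    # True
print(looks_like_genbank_accession("M12345.1"))  # False
```

The second call returns False because `M12345.1` is a Version identifier rather than a bare accession number.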
Accession number (RefSeq)
This accession number is the unique identification number for a complete RefSeq
sequence record. RefSeq accession numbers are written in the following format:
two letters followed by an underscore and six digits (e.g., NT_123456). The
first two letters of the RefSeq accession number indicate the type of sequence
included in the record as described below:
NT_123456 constructed genomic contigs
NM_123456 mRNAs (actually the cDNA sequences constructed from mRNA)
NP_123456 proteins
NC_123456 chromosomes
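The prefix table above can be turned into a small lookup. This sketch covers only the four prefixes and the six-digit format listed in this entry; the names `REFSEQ_TYPES` and `refseq_type` are made up for the example.

```python
import re

# Prefix meanings exactly as listed in the glossary entry.
REFSEQ_TYPES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_(\d{6})")


def refseq_type(accession: str):
    """Return the sequence type for a RefSeq-style accession, or None if the
    text does not match the format or the prefix is not in the table."""
    match = REFSEQ_PATTERN.fullmatch(accession)
    if match is None:
        return None
    return REFSEQ_TYPES.get(match.group(1))


print(refseq_type("NM_123456"))
```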
Accession Number line (EMBL)
The AC (Accession Number) line lists the accession numbers associated with this
entry.
Accession number line (Swiss-Prot)
The AC (accession number) line lists the accession number(s) associated with an
entry.
Algorithm
A series of steps defining a procedure or formula for solving a problem, that
can be coded into a programming language and executed. Bioinformatics algorithms
typically are used to process, store, analyse, visualise and make predictions
from biological data.
Alignment score
The alignment score represents the likelihood that the described alignment is
not random, providing an indication of its validity. It is calculated by
totaling the scores for each matched pair of residues at each position in the
alignment. Unmatched residues are charged the gap open penalty (the single gap
penalty in non-affine searches) or, when an affine search is running and the
residue extends an existing gap, the gap extension penalty.
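A minimal sketch of this bookkeeping for an already-aligned pair of rows is shown below. The match, mismatch, and gap values are illustrative assumptions (real searches take residue scores from a substitution matrix), and the convention that the first gap position pays the open penalty while later positions pay the extension penalty is one of several in use.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned rows in which '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    open_gap_row = None  # which row ("a" or "b") holds the gap being extended
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # First position of a gap pays gap_open, later ones gap_extend.
            score += gap_extend if open_gap_row == gap_row else gap_open
            open_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            open_gap_row = None
    return score


print(alignment_score("ACGTT", "A--TT"))
```

Setting `gap_extend` equal to `gap_open` gives the non-affine case, where every unmatched residue costs the same.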
Alignment
The result of a comparison of two or more gene or protein sequences in order to
determine their degree of base or amino acid similarity. Sequence alignments are
used to determine the similarity, homology, function or other degree of
relatedness between two or more genes or gene products.
Annotation
A combination of comments, notations, references, and citations, either in free
format or utilising a controlled vocabulary, that together describe all the
experimental and inferred information about a gene or protein. Annotations can
also be applied to the description of other biological systems. Batch, automated
annotation of bulk biological sequence is one of the key uses of Bioinformatics
tools.
Biochips
Miniaturized arrays of large numbers of molecular substrates, often
oligonucleotides, in a defined pattern. They are also called DNA microarrays and
microchips.
Bit score
The value S' is derived from the raw alignment score S in which the statistical
properties of the scoring system used have been taken into account. Because bit
scores have been normalized with respect to the scoring system, they can be used
to compare alignment scores from different searches.
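The normalization is commonly written as S' = (lambda * S - ln K) / ln 2, where lambda and K are the Karlin-Altschul statistical parameters of the scoring system. Those symbols are not defined in this glossary, so treat the sketch below as an assumption-laden illustration; the parameter values in the usage line are placeholders, not values reported by any particular search.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the statistical parameters (lambda, K) of the scoring
    system that produced the raw score.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters, for illustration only.
print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because lambda and K absorb the scale of the scoring system, two bit scores can be compared even when the underlying raw scores came from different matrices.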
Conservation
When the substitution of one amino acid for another preserves the
physicochemical properties of the original residue. For example, when a
hydrophobic amino acid residue is replaced by another hydrophobic residue.
Contig
Group of cloned (copied) pieces of DNA representing overlapping regions of a
particular chromosome.
Consensus
A single sequence that represents, at each successive position, the variation
found within corresponding columns of a multiple sequence alignment.
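One simple way to derive such a sequence is a majority vote per column, sketched below. This is only one of several consensus rules (others use thresholds or ambiguity codes); the function name and the tie-breaking behaviour are assumptions of the example.

```python
from collections import Counter


def consensus(aligned_rows):
    """Return the most frequent symbol in each column of an alignment.

    Ties go to the symbol seen first in the column.
    """
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all rows must be non-empty and of equal length")
    result = []
    for column in zip(*aligned_rows):
        symbol, _count = Counter(column).most_common(1)[0]
        result.append(symbol)
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGT"]))
```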
Dendrogram
A form of a tree that lists the compared objects (e.g., sequences or genes in a
microarray analysis) in a vertical order and joins related ones by levels of
branches extending to one side of the list.
Dynamic Programming
In general, dynamic programming is an algorithmic scheme for solving discrete
optimization problems that have overlapping subproblems. In a dynamic
programming algorithm, the definition of the function that is optimized is
extended as the computation proceeds. The solution is constructed by progressing
from simpler to more complex cases, thereby solving each subproblem before it is
needed by any other subproblem. In particular, the algorithm for finding optimal
alignments is an example of dynamic programming.
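As a concrete instance, the sketch below fills the dynamic programming table for the optimal global alignment score of two sequences (the Needleman-Wunsch recurrence with a linear gap penalty). The scoring values are illustrative assumptions; each cell is computed only from cells already solved, which is the "simpler to more complex" progression described above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Base cases: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell reuses three previously solved subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory grow with the product of the two sequence lengths.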
E value
The number of different alignments with a score equal to or better than S that
can be expected to occur simply by chance. Also referred to as the expectation
value.
FASTA format
Sequence format that begins with a single-line description followed by lines of
sequence data. This format can be used as query input when searching
bioinformatic tools such as BLAST or ClustalW. The description line is
distinguished from the sequence data by a greater-than (">") symbol in the first
column. It is recommended that all lines of text be shorter than 80 characters
in length. Blank lines are not allowed in the middle of FASTA input. An example
of a protein sequence in FASTA format is:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
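Reading this format takes only a few lines of code. The parser below is a minimal sketch: the function name is hypothetical, and it silently skips blank lines, which is more lenient than the rule stated above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # lenient: tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the ">" marker
        elif header is not None:
            chunks.append(line)  # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
print(parse_fasta(example))
```

Sequence lines belonging to one record are joined, so the line length used for display does not affect the parsed sequence.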
Force-field
In molecular dynamics and molecular mechanics calculations, the intra- and
intermolecular interactions of a molecule are calculated from a simplified
empirical parametrization called a force field. The parameters include atom
masses, charges, dihedral angles, improper angles, and terms for van der Waals
and electrostatic interactions.
Gap
A space introduced into an alignment to compensate for insertions or deletions
in one sequence relative to another.
Gene locus (pl. loci)
The position on a chromosome of a gene or other chromosome marker; also, the DNA
at that position. The use of locus is sometimes restricted to mean expressed DNA
regions.
Gene name
Official name assigned to a gene. According to the Guidelines for Human Gene
Nomenclature developed by the HUGO Gene Nomenclature Committee, it should be
brief and describe the function of the gene.
Gene Ontology
A controlled vocabulary of terms relating to molecular function, biological
process, or cellular components developed by the Gene Ontology Consortium. A
controlled vocabulary allows scientists to use consistent terminology when
describing the roles of genes and proteins in cells.
Gene symbol
Symbols for human genes are usually designated by scientists who discover the
genes. The symbols are created using the Guidelines for Human Gene Nomenclature
developed by the HUGO Gene Nomenclature Committee. Gene symbols usually consist
of no more than six upper case letters or combination of uppercase letters and
Arabic numbers. Gene symbols should start with the first letters of the gene
name. For example, the gene symbol for insulin is "INS." A gene symbol must be
submitted to HUGO for approval before it can be considered an official gene
symbol.
Genome
The genome is the gene complement of an organism. A genome sequence comprises
the information of the entire genetic material of an organism.
GI (GenBank)
A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a
nucleotide sequence or protein translation. Each GI is a numeric value of one or
more digits. The protein translation and the nucleotide sequence contained in
the same record will each be assigned different GI numbers. Every time the
sequence data for a particular record is changed, its version number increases
and it receives a new GI. However, while each new version number is based upon
the previous version number, a new GI for an altered sequence may be completely
different from the previous GI. For example, in the GenBank record M12345, the
original GI might be 7654321, but after a change in the sequence is submitted,
the new GI for the changed sequence could be 10529376. Individuals can search
for nucleotide sequences and protein translations by GI using the UID search
field in the NCBI sequence databases.
Genetic algorithm
A kind of search algorithm that was inspired by the principles of evolution. A
population of initial solutions is encoded and the algorithm searches through
these by applying a pre-defined fitness measurement to each solution, selecting
those with the highest fitness for reproduction. New solutions can be generated
during this phase by crossover and mutation operations, defined in the encoded
solutions.
Global alignment
When two nucleic acid or amino acid sequences are lined up along their entire
length.
Hidden Markov Model
A Hidden Markov Model (HMM) is a general probabilistic model for sequences of
symbols. In a Markov chain, the probability of each symbol depends only on the
preceding one. Hidden Markov models are widely used in bioinformatics, most
notably to replace sequence profiles in the calculation of sequence alignments.
Homologous Proteins
Two proteins with related folds and related sequences are called homologous.
Commonly, homologous proteins are further divided into orthologous and
paralogous proteins. While orthologous proteins evolved from a common ancestral
gene, paralogous proteins were created by gene duplication.
Homology
Similarity in sequence that is based on descent from a common ancestor.
Identity
The extent to which two sequences are invariant.
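Identity is usually reported as a percentage over an alignment. The sketch below counts identical residues over the columns where neither row has a gap; that choice of denominator is an assumption, since tools differ on whether gapped columns are counted.

```python
def percent_identity(row_a, row_b):
    """Percent of ungapped aligned columns in which both rows agree."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("ACGT", "ACGA"))
```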
Ligand
A small molecule noncovalently bonded to a larger macromolecule.
Local alignment
The alignment of portions (rather than the entire sequence length) of two
nucleic acid or amino acid sequences.
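A common way to find the best-scoring local alignment is the Smith-Waterman recurrence, sketched below with illustrative scoring values that are assumptions of the example. Clamping every cell at zero lets an alignment start and end anywhere, which is what restricts it to portions of the sequences.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    cols = len(b) + 1
    previous = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * cols
        for j in range(1, cols):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option discards a poor prefix and starts a fresh alignment.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Only two rows of the table are kept here because the sketch returns the score alone; recovering the aligned segments would require storing the full table for traceback.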
Locus (pl. loci)
The position on a chromosome of a gene or other chromosome marker; also, the DNA
at that position. The use of locus is sometimes restricted to mean expressed DNA
regions.
Masking
The removal of repeated or low-complexity regions from a sequence so that they
do not produce spurious matches when sequences are compared.
Neural Network
A neural network is a computer algorithm to solve non-linear optimisation
problems. The algorithm was derived in analogy to the way the densely
interconnected, parallel structure of the brain processes information.
Ontology
The word ontology has a long history in philosophy, in which it refers to the
study of being as such. In information science, an ontology is an explicit
formal specification of how to represent the objects, concepts and other
entities that are assumed to exist in some area of interest and the
relationships among them.
Open reading frame (ORF)
The sequence of DNA or RNA located between the start-code sequence (initiation
codon) and the stop-code sequence (termination codon).
Patent
In genetics, conferring the right or title to genes, gene variations, or
identifiable portions of sequenced genetic material to an individual or
organization.
Protein Folding Problem
Proteins fold on a time scale from milliseconds to seconds. Starting from a random coil
conformation, proteins can find their stable fold quickly although the number of
possible conformations is astronomically high. The Protein Folding Problem is to
predict the folding and the final structure of a protein solely from its
sequence. The Protein Structure Prediction Problem refers to the combinatorial
problem to calculate the three-dimensional structure of a protein from its
sequence alone. It is one of the biggest challenges in structural
bioinformatics.
Protein ID (GenBank)
The Protein ID is an identification number assigned to the amino acid sequence
data included within a sequence record. This sequence identifier uses the
accession.version format. Each protein ID is made up of three letters followed
by five digits, a period, and a version number. For example, in a sequence
record M12345, the Protein ID for the sequence translation could be AAA35650.1.
If the protein sequence data changes in any way (even by just one amino acid),
the version number in the Protein ID will be increased by an increment of one,
while the accession number base remains constant. For example, AAA12345.1 would
become AAA12345.2. Each amino acid sequence change also results in the
assignment of a new GI number to the altered protein translation.
PSI-BLAST
Position-Specific Iterative BLAST. An iterative search using the BLAST
algorithm. A profile is built after the initial search, which is then used in
subsequent searches. The process may be repeated, if desired, with new sequences
found in each cycle used to refine the profile.
P Value
The probability of an alignment occurring with the score in question or better.
The p value is calculated by relating the observed alignment score, S, to the
expected distribution of HSP scores from comparisons of random sequences of the
same length and composition as the query to the database. The most highly
significant P values will be those close to 0. P values and E values are
different ways of representing the significance of the alignment.
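Under the usual assumption that the number of chance hits follows a Poisson distribution, the two measures are related by P = 1 - exp(-E), and E can be estimated from a bit score S' as E = m * n * 2^(-S'), where m and n are the query and database lengths. These formulas and symbol names are not given in this glossary, so the sketch below should be read as an illustration under those assumptions.

```python
import math


def expect_value(bits: float, query_length: int, database_length: int) -> float:
    """Estimate E from a bit score and the size of the search space."""
    return query_length * database_length * 2.0 ** (-bits)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, given E expected chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


e = expect_value(10, query_length=100, database_length=1024)
print(e, p_value(e))
```

For small values the two are nearly equal (P is approximately E), which is why very significant hits show almost the same number under either measure.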
Query
The input sequence (in FASTA format or as bare sequence data) or sequence
identifier with which all the sequences in a database are compared during a
BLAST search.
Sequence Contig
A contig consists of a set of gel readings from a sequencing project that are
related to one another by overlap of their sequences. The gel readings of a
contig can be combined to form a contiguous consensus sequence whose length is
called the length of the contig.
Sequence Profile
A sequence profile represents certain features in a set of aligned sequences. In
particular, it gives position-dependent weights for all 20 amino acids as well
as for insertion and deletion events at any sequence position.
Sequence tagged site or STS
Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the
human genome and whose location and base sequence are known. Detectable by
polymerase chain reaction, STSs are useful for localizing and orienting the
mapping and sequence data reported from many different laboratories and serve as
landmarks for developing physical maps of the human genome. Expressed sequence
tags (ESTs) are STSs derived from cDNAs.
Similarity
How related one nucleotide or protein sequence is to another. The extent of
similarity between two sequences is based on the percent of sequence identity
and/or conservation.
Substitution Matrix
A matrix containing values proportional to the probability that amino acid i
mutates into amino acid j for all pairs of amino acids. Such matrices are
constructed by assembling a large and diverse sample of verified pairwise
alignments of amino acids. If the sample is large enough to be statistically
significant, the resulting matrices should reflect the true probabilities of
mutations occurring through a period of evolution.
Threading
Threading techniques try to match a target sequence on a library of known
three-dimensional structures by "threading" the target sequence over the known
coordinates. In this manner, threading tries to predict the three-dimensional
structure starting from a given protein sequence. It is sometimes successful
when comparisons based on sequences or sequence profiles alone fail because
sequence similarity is too low.
Turing Machine
The Turing machine is one of the key abstractions used in modern computability
theory. It is a mathematical model of a device that changes its internal state
and reads from, writes on, and moves a potentially infinite tape, all in
accordance with its present state. The model of the Turing machine played an
important role in the conception of the modern digital computer.
Version (GenBank)
Similar to the Protein ID for protein sequences, the version is a nucleotide
sequence identification number assigned to each GenBank sequence. The format for
this sequence identifier is accession.version (e.g., M12345.1). Whenever the
author of a particular sequence record changes the sequence data in any way
(even if just a single nucleotide is altered), the version number will be
increased by an increment of one, while the accession number base remains
constant. For example, M12345.1 would become M12345.2. Each sequence change also
results in the assignment of a new GI number. Whenever an individual searches an
NCBI sequence database, only the most recent version of a record is retrieved.
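The accession.version format is easy to take apart in code. The helper names below are hypothetical, and the pattern is a loose sketch that accepts any run of letters (optionally followed by an underscore) and digits before the period rather than validating a specific accession format.

```python
import re

# accession (letters, optional underscore, digits) + "." + version number
VERSION_PATTERN = re.compile(r"([A-Z]+_?\d+)\.(\d+)")


def split_version(identifier: str):
    """Split 'M12345.2' into ('M12345', 2)."""
    match = VERSION_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return match.group(1), int(match.group(2))


def next_version(identifier: str) -> str:
    """Return the identifier a record would carry after one sequence change."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(next_version("M12345.1"))
```

The accession base is untouched by `next_version`, mirroring the rule that only the number after the period changes when the sequence is revised.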
This portal belongs to the Indian Agricultural Research Institute, Indian
Council of Agricultural Research, an autonomous organization under the
Department of Agricultural Research and Education, Ministry of Agriculture,
Government of India. Copyright © 2010 IARI
Developed and maintained by: IARI, RKMP Team, Unit of Simulation and
Informatics, IARI, New Delhi.