
Information biology?

Bioinformatics (BT)

Chinese name: bioinformatics English name: Bioinformatics

Definition 1: An interdisciplinary subject that integrates the theories and methods of computer science, information technology and mathematics to study biological information. It includes the research, archiving, display, processing and simulation of biological data, the processing of genetic and physical maps, nucleotide and amino acid sequence analysis, new gene discovery and protein structure prediction.

Discipline: biochemistry and molecular biology (first-level discipline); General Introduction (second-level discipline)

Definition 2: The discipline of using computer technology and information technology to develop new algorithms and statistical methods, analyze biological experimental data, determine the biological significance contained in the data, develop new data analysis tools, and realize the acquisition and management of various information.

Discipline: cell biology (first-level discipline); General Introduction (second-level discipline)

Definition 3: The discipline of using computer technology and information technology to develop new algorithms and statistical methods, analyze biological experimental data, determine the biological significance contained in the data, develop new data analysis tools, and realize the acquisition and management of various information.

Discipline: genetics (first-level discipline); General Introduction (second-level discipline). This content was approved and published by the National Committee for the Examination and Approval of Scientific and Technical Terminology.

Bioinformatics is the discipline that studies the collection, processing, storage, dissemination, analysis and interpretation of biological information. It combines biology, computer science and information technology to uncover the biological meaning hidden in large volumes of complex biological data.

Main research direction

Bioinformatics has formed many research directions in just over ten years. The following briefly introduces some major research hotspots.

1. Sequence alignment

The basic problem of sequence alignment is to measure the similarity or dissimilarity of two or more symbol sequences. From a biological point of view, this covers several tasks: reconstructing the complete DNA sequence from overlapping sequence fragments; determining physical and genetic maps from probe data obtained under various experimental conditions; storing, traversing and comparing DNA sequences in databases; comparing the similarity of two or more sequences; searching databases for related sequences and subsequences; finding recurring patterns of nucleotides; and identifying the informative components of protein and DNA sequences. Sequence alignment takes into account the biological characteristics of DNA sequences, such as local insertions and deletions (together called indels) and substitutions. The objective function is either the minimum weighted sum of distances or the maximum sum of similarities over the set of mutations between the sequences. Alignment methods include global alignment and local alignment, combined with gap penalties. Dynamic programming is commonly used to align two sequences. It works well for short sequences but not for massive genomic sequences (human DNA is on the order of 10^9 bp), where even an algorithm of linear complexity is hard to run in practice. Heuristic methods are therefore needed.
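As an illustration of the dynamic programming approach to global alignment, here is a minimal sketch of the Needleman-Wunsch algorithm with a linear gap penalty. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not values taken from this article.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two lengths, which is why this approach suits short sequences rather than whole genomes.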

2. Comparison and prediction of protein structure

The basic problem is to compare the similarity or dissimilarity of the spatial structures of two or more protein molecules. Protein structure and function are closely related, and it is generally believed that proteins with similar functions have similar structures. A protein is a long chain of amino acids, ranging in length from about 50 to 1,000-3,000 amino acids (aa). Proteins have many functions: they act as enzymes, store and transport substances, transmit signals, serve as antibodies, and so on. The amino acid sequence inherently determines the three-dimensional structure of a protein. Proteins are generally described as having four levels of structure. Protein structure is studied and predicted in order to understand biological function and find docking targets for drugs in medicine, to improve crops through genetic engineering in agriculture, and to use enzymes for synthesis in industry. Structures are compared directly because the three-dimensional structure of a protein is more stable in evolution than its primary structure and carries more information than the amino acid sequence. Three-dimensional structure research rests on the premise that an amino acid sequence corresponds one-to-one to a three-dimensional structure (which is not necessarily true); physically, this can be explained by energy minimization. The structure of an unknown protein is predicted by observing and summarizing the structural regularities of proteins with known structures. Homology modeling and threading both fall into this category. Homology modeling is used for proteins that are highly similar to a known structure (more than 30% identical amino acids), while threading is used to compare the structures of different proteins within an evolutionary family. However, the current state of protein structure prediction is still far from meeting practical needs.
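The 30% identity figure for homology modeling can be checked on a pair of already-aligned sequences. The sketch below is a hypothetical helper: it assumes gaps are written as "-" and computes identity over aligned (non-gap) positions only. Other definitions of percent identity use a different denominator, so treat this as one convention among several.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def is_homology_modeling_candidate(aligned_a, aligned_b, threshold=30.0):
    """Apply the 'more than 30% identical amino acids' rule of thumb."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    print(percent_identity("MKTAY", "MKSAY"))
```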

3. Gene recognition and non-coding region analysis

The basic problem of gene recognition is to correctly identify the extent and exact position of genes in a given genome sequence. Non-coding regions include introns, which are generally discarded before the protein is formed. However, experiments show that if the non-coding regions are removed, gene replication cannot be completed. Evidently the genetic language of a DNA sequence resides not only in the coding regions but also in the non-coding sequence. At present there is no general guiding method for analyzing non-coding DNA. In the human genome, not all sequence is coding, that is, serves as a template for some protein; the coding part accounts for only 3-5% of the total human genome sequence. Searching such a large sequence by hand is clearly unthinkable. Methods for detecting coding regions include measuring codon frequencies in coding regions, first- and second-order Markov chains, ORFs (open reading frames), promoter recognition, HMMs (hidden Markov models), GENSCAN, and spliced alignment.
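Of the methods listed, the ORF scan is the simplest to show in code. The following is a minimal sketch that looks for open reading frames on the forward strand only, assuming the standard start codon ATG and stop codons TAA, TAG and TGA. The function name and the min_codons parameter are illustrative; real gene finders also scan the reverse strand and combine ORFs with statistical signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ORFs in the three forward reading frames.

    Returns a list of (start, end, frame) with end exclusive and the stop
    codon included. min_codons counts codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon within this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(seq):
        print(frame, seq[start:end])
```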

4. Molecular Evolution and Comparative Genomics

Molecular evolution uses the similarities and differences of the same gene sequence in different species to study the evolution of organisms and build evolutionary trees. Both DNA sequences and the amino acid sequences they encode can be used, and even structural comparison of the related proteins, on the premise that similar species are genetically similar. Comparison reveals what is shared between species and what differs. Early studies typically used external traits such as size, skin color and number of limbs as the basis for inferring evolution. In recent years, with the genomes of many model organisms sequenced, molecular evolution can be studied at the level of the whole genome. When matching genes across species, three cases generally have to be handled: orthologs, genes in different species with the same function; paralogs, homologous genes with different functions; and xenologs, genes transferred between organisms by other means, such as genes injected by a virus. The common approach in this field is to construct a phylogenetic tree, using character-based methods (characters being the bases or amino acids at specific positions in DNA or protein sequences), distance-based methods (using alignment scores), and traditional clustering methods such as UPGMA.
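UPGMA builds a tree by repeatedly merging the two closest clusters and averaging distances, weighted by cluster size. The sketch below is a minimal pure-Python version. The input format (a dict keyed by frozenset pairs) and the nested-tuple output are assumptions made for the example, and the distances in the usage are invented for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns the tree as nested 2-tuples of labels.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    # Hypothetical pairwise distances between four species.
    pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
             ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
    distances = {frozenset(k): v for k, v in pairs.items()}
    print(upgma(["A", "B", "C", "D"], distances))
```

UPGMA assumes a constant rate of evolution along all branches, so it is a clustering baseline rather than a general-purpose phylogenetic method.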

5. Sequence contig assembly

With current sequencing technology, each reaction can read only about 500 base pairs or slightly more. When, for example, the shotgun method is used to sequence human genes, a large number of short sequences must be assembled into contigs. Joining them step by step into longer contigs until the complete sequence is obtained is called contig assembly. From an algorithmic point of view, contig assembly is an NP-complete problem.
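Because exact assembly is NP-complete, practical assemblers rely on heuristics. The sketch below shows one of the simplest, a greedy merge that repeatedly joins the two fragments with the longest suffix-prefix overlap. The function names and the min_len threshold are assumptions for illustration; it ignores sequencing errors, reverse complements, repeats and fragments contained inside other fragments.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the rest are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGA"]))
```

The greedy choice is fast but not guaranteed to find the shortest common superstring, which is the optimization problem behind the NP-completeness result.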

6. The origin of the genetic code

Studies of the genetic code have generally held that the relationship between codons and amino acids arose from an accidental event in the history of evolution and has remained fixed since the common ancestor of modern organisms. In contrast to this "frozen" theory, three other theories have been proposed to explain the genetic code: selective optimization, chemical, and historical. The completion of genome sequencing for many organisms provides new material for studying the origin of the genetic code and for testing these theories.

7. Structure-based drug design

One of the purposes of human genome research is to understand the structure, function and interactions of the roughly 100,000 kinds of protein in the human body and their relationship to various human diseases, and to seek methods of treatment and prevention, including drug therapy. Drug design based on the structures of biological macromolecules and small molecules is an extremely important research field in bioinformatics. To inhibit the activity of certain enzymes or proteins, inhibitor molecules can be designed on the computer as candidate drugs, starting from the known tertiary structure of the protein and using molecular alignment algorithms. The aim of this field is to discover new gene-based drugs, which have great economic value.

8. Modeling and simulation of biological system

With the development of large-scale experimental technology and the accumulation of data, studying and analyzing biological systems at the global, system level and revealing the laws of their development has become another research hotspot of the post-genome era: systems biology. Its current research topics include simulation of biological systems (Curr Opin Rheumatol, 2007, 463-70), system stability analysis (Nonlinear Dynamics Psychol Life Sci, 2007, 413-33) and system robustness analysis (Ernst Schering Res Found Workshop, 2007, 69-83). Modeling languages, represented by SBML (Bioinformatics, 2007, 1297-8), have developed rapidly. Many models have been built with Boolean networks (PLoS Comput Biol, 2007, e163), differential equations (Mol Biol Cell, 2004, 3841-. In 2007, 3262-92) and discrete dynamic event systems (Bioinformatics, 2007, 336-43), borrowing modeling methods from physical systems such as circuits, and many studies have tried to tackle system complexity through macroscopic ideas such as information flow, entropy and energy flow (Anal Quant Cytol Histol, 2007, 296-308). Establishing a theoretical model of a biological system will, of course, take a long time. Although the volume of experimental observations is growing rapidly, the data required to identify a biological system model far exceed what can currently be produced. For time-series microarray data, for example, the number of sampling points is too small for traditional time-series modeling methods, and the huge cost of experiments is currently the main obstacle to system modeling. Methods for describing and modeling systems also need pioneering development.
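A Boolean network is among the simplest of these modeling approaches: each gene is either on or off, and its next state is a logical function of the current states. The sketch below uses a synchronous update scheme and a made-up three-gene network (A, B, C) whose rules are purely hypothetical and chosen only to show how a trajectory falls into a cycle.

```python
def step(state, rules):
    """Synchronous update: every node is computed from the previous state."""
    return {node: rule(state) for node, rule in rules.items()}


def simulate(state, rules, max_steps=50):
    """Iterate until a state repeats or max_steps is reached.

    Returns (trajectory, next_state), where next_state is the state that
    follows the last entry of the trajectory.
    """
    trajectory = []
    while state not in trajectory and len(trajectory) < max_steps:
        trajectory.append(state)
        state = step(state, rules)
    return trajectory, state


# Hypothetical regulatory rules, for illustration only.
RULES = {
    "A": lambda s: not s["C"],        # C represses A
    "B": lambda s: s["A"],            # A activates B
    "C": lambda s: s["A"] and s["B"], # A and B together activate C
}

if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    trajectory, repeat = simulate(start, RULES)
    for t, s in enumerate(trajectory):
        print(t, s)
    print("returns to step", trajectory.index(repeat))
```

With n nodes there are only 2^n possible states, so every trajectory must eventually enter a fixed point or a cycle; these attractors are what Boolean network analyses usually study.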

9. Research on Bioinformatics Technology and Methods

Bioinformatics is more than a simple collation of biological knowledge or a simple application of mathematics, physics, information science and other disciplines. Massive data and complex backgrounds have driven the rapid development of machine learning, statistical data analysis and system description within bioinformatics. The huge amount of computation, complex noise patterns and massive time-varying data pose great difficulties for traditional statistical analysis and call for more flexible data analysis techniques, such as nonparametric statistics (BMC Bioinformatics, 2007, 339) and cluster analysis (Qual Life Res, 2007, 1655-63). Analyzing high-dimensional data requires techniques that compress the feature space, such as partial least squares (PLS). When developing algorithms, their time and space complexity must be considered carefully, and technologies such as parallel computing and grid computing should be used to make the algorithms feasible in practice.

10. Biological images

Why do people who are not related by blood sometimes look so alike? An appearance is made up of points, and the more points coincide, the more alike two people look. Why do the points of two unrelated people coincide? What is the biological basis? Are their genes similar? I do not know, and I hope experts can answer.

11. Others

Examples include gene expression profile analysis, metabolic network analysis, gene chip design and proteomics data analysis, which have gradually become important new research fields in bioinformatics. In terms of disciplines, those derived from bioinformatics, including structural genomics, functional genomics, comparative genomics, proteomics, pharmacogenomics, traditional Chinese medicine genomics, oncology, molecular epidemiology and environmental genomics, have become important research approaches in systems biology. Current developments make it clear that genomics has entered the post-genome era. We have likewise become clearly aware of the ways in which machine learning and mathematics, both closely tied to bioinformatics, can mislead.

Bioinformatics and Machine Learning

Large-scale biological information poses new problems and challenges for data mining and calls for new ideas. Traditional computer algorithms can still be applied to biological data analysis, but they are increasingly ill-suited to sequence analysis, because biological systems are inherently complex and there is no complete theory of the organization of life at the molecular level. Simon once defined learning as a change in a system that makes it more effective the next time it does the same work. The purpose of machine learning is to derive the corresponding theory automatically from data, using methods such as inference, model fitting and learning from samples. It is especially suited to areas that lack a general theory, have noisy patterns and involve large data sets. Machine learning has therefore become a feasible approach that complements conventional methods, and it makes it possible to extract useful knowledge from massive biological information by computer. Analysis of multi-dimensional vector data plays an increasingly important role, and the large gene databases now being processed need automatic identification and annotation by computer to avoid time-consuming, labor-intensive manual work. The early scientific method of observation and hypothesis, relying on human perception alone, can no longer cope with large data volumes, fast data acquisition and the need for objective analysis. The combination of bioinformatics and machine learning is therefore inevitable. The most basic theoretical framework of machine learning is built on probability. In a sense it is a continuation of statistical model fitting, and its purpose is to extract useful information. Machine learning is closely related to pattern recognition and statistical inference. Learning methods include data clustering, neural network classifiers and nonlinear regression. Hidden Markov models are also widely used to predict gene structure in DNA.

Current research focuses include: 1) Observing and exploring interesting phenomena. The current focus of ML research here is how to visualize and mine high-dimensional vector data. The usual approach is to reduce the data to a low-dimensional space, using methods such as conventional principal component analysis (PCA), kernel principal component analysis (KPCA), independent component analysis and locally linear embedding. 2) Generating hypotheses and formal models to explain the phenomena [6]. Most clustering methods can be regarded as fitting vector data to a mixture of simple distributions. Clustering has been used in microarray data analysis in bioinformatics, and in cancer type classification machine learning is also used to derive explanations of phenomena from gene databases. Machine learning accelerates the progress of bioinformatics but also brings problems of its own. First, most machine learning methods assume that the data conform to a relatively fixed model, whereas the structure of real data is usually variable, especially in bioinformatics. A set of general methods is therefore needed that can find the internal structure of a data set without relying on an assumed data structure. Second, machine learning methods such as neural networks and hidden Markov models often operate as a "black box", and the internal mechanism by which they arrive at a particular solution remains unclear.
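As a concrete instance of the clustering methods mentioned above, here is a minimal k-means sketch in pure Python. K-means can be viewed as a hard-assignment limit of fitting a mixture of equal, spherical Gaussians. The function name, the deterministic initialization (first k points as centroids) and the toy data are assumptions for illustration; real expression data would need normalization and a more careful initialization.

```python
def kmeans(points, k, iterations=100):
    """Cluster points (tuples of floats) into k groups.

    Returns (assignment, centroids). Initial centroids are simply the first
    k points, an assumption made here to keep the example deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    # Toy two-dimensional "expression profiles" forming two obvious groups.
    data = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
    labels, centers = kmeans(data, k=2)
    print(labels)
    print(centers)
```

The example also illustrates the limitation noted in the text: the method presupposes a fixed model (k compact, roughly spherical groups) and will still return k clusters when the data have no such structure.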