DNA Data Mining Sequence Report

Introduction

The international effort called the Human Genome Project is rapidly determining the complete DNA sequences of all 24 human chromosomes. The chromosomes of a number of other organisms are also being sequenced in full. The DNA component of a chromosome is a long linear molecule composed of strings of the four nucleotides (A, C, T, G), the information-bearing chemical units. Along the chromosomes, coding sequences (exons), whose information encodes protein structures, are interspersed with non-coding sequences (introns). The general flow of information consists of transcription of the coding DNA sequence into mRNA, which is then translated into proteins in the cell. This process underlies normal cellular functions as diverse as development into multicellular organisms, organ development, and the immune system, as well as abnormal functions such as cancer and birth defects.

Discussion

In this paper we describe data exploration techniques designed to classify DNA sequences. Several visualization and data mining techniques were used to validate existing methods, and to attempt to discover new ones, for distinguishing coding DNA sequences (exons) from non-coding DNA sequences (introns). The goal of the data mining was to see whether some other, possibly non-linear, combination of the fundamental position-dependent DNA nucleotide frequency values could be a better predictor than the average mutual information (AMI) [6]. We tried many different classification techniques, including rule-based classifiers and neural networks. We also used visualization of both the original data and the results of the data mining to help verify patterns and to understand the distinction between the different types of data and classifications. In particular, the visualization helped us develop refinements to neural network classifiers, which have accuracies as high as any known method. In the conclusion, we discuss the interactions between visualization and data mining and suggest an integrated approach.

The current approach for finding genes (protein-coding sequences) is both experimental and computational. Any small increase in the accuracy of computer classification reduces the experimental work required and can therefore result in substantial time and cost savings. In this paper we describe our experience in harnessing data exploration techniques to classify DNA sequences.

To use visualization and data mining techniques to develop new methods for distinguishing coding DNA sequences (exons) from non-coding DNA sequences (introns), it is necessary to represent symbolic DNA sequences by numbers or vectors. Fickett et al. [4] demonstrated that the proper choice of this representation is as important as the later processing of the numbers by neural nets or other classification schemes. Our choice of representation was guided by the recent discovery that a non-linear correlation statistic for DNA sequences, called the average mutual information (AMI) [6], is capable of distinguishing coding from non-coding DNA sequences in all taxonomic classes, ranging from the simplest to the most complex organisms. Mathematically, the AMI is a non-linear function based on the vector of 12 frequencies p_i^k by which the nucleotide i = a, c, ...
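As a rough illustration of turning a symbolic sequence into numbers, the sketch below computes a vector of 12 position-dependent nucleotide frequencies. It assumes that k indexes the position within a period of three (the codon positions 0, 1, 2), which gives 4 x 3 = 12 values; the text above is cut off before it defines k, so this reading is an assumption and not necessarily the authors' definition. The second function is the textbook mutual information between nucleotides a fixed distance apart, included only to show what a non-linear correlation statistic over such frequencies can look like; it is not claimed to be the AMI of [6]. The names position_frequencies and mutual_information are hypothetical.

```python
from collections import Counter
from math import log2

NUCLEOTIDES = "ACGT"


def position_frequencies(seq, period=3):
    """Return {(nucleotide, k): frequency} for k = 0 .. period-1.

    Assumption: k is the sequence index modulo `period` (codon position
    when period == 3). Symbols other than A, C, G, T are skipped.
    """
    counts = Counter()
    totals = [0] * period
    for idx, base in enumerate(seq.upper()):
        if base in NUCLEOTIDES:
            k = idx % period
            counts[(base, k)] += 1
            totals[k] += 1
    return {
        (base, k): (counts[(base, k)] / totals[k] if totals[k] else 0.0)
        for k in range(period)
        for base in NUCLEOTIDES
    }


def mutual_information(seq, distance):
    """Textbook mutual information (in bits) between nucleotides that are
    `distance` positions apart. Illustrative only, not the AMI of [6]."""
    seq = seq.upper()
    pairs = [
        (a, b)
        for a, b in zip(seq, seq[distance:])
        if a in NUCLEOTIDES and b in NUCLEOTIDES
    ]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over observed pairs.
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


if __name__ == "__main__":
    freqs = position_frequencies("ATGGCCATGAAATGA")
    for (base, k), value in sorted(freqs.items(), key=lambda item: (item[0][1], item[0][0])):
        print(f"p_{base}^{k} = {value:.3f}")
```

A classifier would take the 12 values as its input vector; the interesting question raised above is whether some combination of them other than the AMI separates exons from introns better.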
