Data Mining Assignment

Introduction

Data mining is the process of finding patterns in information contained in large databases. It is a research area at the intersection of several disciplines, including statistics, databases, pattern recognition and AI, visualization, optimization, and high-performance and parallel computing. With the success and widespread use of database systems, the role of the database has expanded from being a reliable data store to being a decision support system (DSS). This is reflected in the growth of data warehouses that consolidate transactional and distributed databases. Examples of applications of data mining techniques include: fraud detection in banking and telecommunications; marketing; scientific data analysis involving cataloging objects of interest in large data sets (e.g. sky objects in a survey, volcanoes on Venus, atmospheric events in remote sensing data); problem diagnosis in manufacturing, medicine, or networking; and so forth. The techniques are particularly relevant in settings where data is plentiful and the processes generating it are poorly understood.

Human-driven data analysis and exploration, while effective in low-dimensional small data settings, breaks down in the presence of high dimensionality and massive data sets (Tukey, 1975). It is common for modern databases to contain thousands of dimensions (fields) per record (row). Such data sets pose fundamental problems that transcend query execution and optimization. The fundamental problem is query formulation: how do we provide data access when a user cannot specify the target set exactly (as the database query language SQL requires)? Typical DSS queries are very difficult to state; e.g. which records are likely to represent fraud in credit card, banking, or telecommunications transactions? Which records are most similar to records in table A but dissimilar to those in table B? How many clusters (or segments, i.e. groups of related records) are in a database and how are they characterized? Data mining techniques allow computer-driven exploration of the data, hence admitting a more abstract model of interaction than SQL permits.
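
To make the "similar to table A but dissimilar to table B" query concrete, the following is a minimal sketch in Python rather than a definitive method. It assumes that records are purely numeric, that each table can be summarized by its centroid, and that Euclidean distance is a reasonable similarity measure. The sample data and the names `table_a`, `table_b`, `candidates`, and `rank_like_a_unlike_b` are hypothetical and chosen only for illustration.

```python
from math import dist

# Hypothetical numeric records; in practice these would be rows from tables.
table_a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
table_b = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
candidates = {"r1": (1.1, 1.0), "r2": (3.0, 3.0), "r3": (5.1, 5.0)}


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def rank_like_a_unlike_b(records, a_rows, b_rows):
    """Order record keys from 'most like A, least like B' to the opposite.

    Assumed scoring rule: distance to B's centroid minus distance to A's
    centroid, so a larger score means closer to A and farther from B.
    """
    center_a, center_b = centroid(a_rows), centroid(b_rows)

    def score(vector):
        return dist(vector, center_b) - dist(vector, center_a)

    return sorted(records, key=lambda key: score(records[key]), reverse=True)


ranking = rank_like_a_unlike_b(candidates, table_a, table_b)
print(ranking)
```

The user supplies only example tables, not an exact predicate, which is the kind of interaction that is awkward to express as a single SQL condition.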

Data mining techniques are fundamentally data reduction and visualization techniques. As the number of dimensions grows, the number of ways of choosing combinations for dimensionality reduction explodes. For an analyst exploring models, it is infeasible to go through the various ways of projecting the dimensions or selecting the right subsets of the data (reduction along columns and rows). Furthermore, a projection to lower dimensions could render an easy discrimination problem extremely difficult by eliminating important distinctions. An effective means to visualize data would be to employ data mining algorithms to perform the appropriate reductions, allowing an analyst to find patterns or models which may otherwise remain hidden in the high-dimensional space. For example, a clustering algorithm could select a subset of the data embedded in a high-dimensional space and determine a few dimensions to distinguish it from the rest of the data or from other clusters.
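
The point about projections can be illustrated with a small sketch, assuming a toy two-dimensional data set and a Fisher-style separation score (squared difference of group means divided by the summed variances). This is one possible way to pick a distinguishing dimension, not the specific algorithm the text has in mind. The data and the names `group_a`, `group_b`, `separation`, and `score_dimensions` are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical toy records (x, y). The two groups overlap completely on x
# and differ only on y, so projecting onto x discards the useful distinction.
group_a = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.0), (4.0, 0.3)]
group_b = [(1.0, 5.1), (2.0, 5.0), (3.0, 5.2), (4.0, 4.9)]


def separation(a_values, b_values):
    """Fisher-style score: squared gap between means over summed variances."""
    gap = (mean(a_values) - mean(b_values)) ** 2
    spread = pvariance(a_values) + pvariance(b_values)
    if spread == 0:
        # Degenerate case: no spread at all within either group.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def score_dimensions(a_rows, b_rows):
    """Return one separation score per dimension (column)."""
    n_dims = len(a_rows[0])
    return [
        separation([row[d] for row in a_rows], [row[d] for row in b_rows])
        for d in range(n_dims)
    ]


scores = score_dimensions(group_a, group_b)
best_dim = max(range(len(scores)), key=lambda d: scores[d])
print(scores, best_dim)
```

In this toy case the first dimension scores zero and the second scores very high, so an automated reduction would keep the second dimension, whereas an analyst who happened to project onto the first would see no structure at all.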

Analysis

Adopting definitions given in Fayyad et al. (1996b, Chapter 1): Knowledge discovery in databases (KDD) is the process of identifying valid, novel, potentially useful, and ultimately understandable structure in ...
