Data Discretization in Data Mining

In this tutorial, we will learn about data discretization in data mining, the techniques used to perform it, and why discretization is important. By Palkesh Jain Last updated : April 17, 2023

What is Data Discretization?

Data discretization is a method of translating the values of a continuous attribute into a finite set of intervals with minimal information loss. It facilitates data reduction by substituting interval labels for individual numeric values. For example, values of an attribute such as 'age' may be replaced by interval labels such as (0-10, 11-20, ...) or by conceptual labels such as (kid, youth, adult, senior). Data discretization can be divided into two forms: supervised discretization, in which the class information is used, and unsupervised discretization, in which it is not. Either form can proceed in one of two directions, a 'top-down splitting strategy' or a 'bottom-up merging strategy.'

Many real-world data mining tasks involve continuous attributes, yet many current data mining techniques handle such attributes poorly. Even when an algorithm can manage a continuous attribute directly, its output often improves considerably if the continuous values are replaced by their quantized counterparts. Data discretization translates continuous data into intervals and then assigns each value a single representative value or label from its interval; measuring time in interval units rather than exact timestamps is a familiar example. The intervals of the discretized domain need not contain every original value, but they must induce an ordering on the attribute's domain. As a result, discretization can significantly increase the consistency of the discovered knowledge and decrease the running time of data mining tasks such as association rule discovery, classification, and prediction. The improvement is gradual for domains with only a few continuous attributes and grows as the number of continuous attributes increases.
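As a small illustration, the Python sketch below replaces numeric age values with conceptual interval labels of the kind described above; the cut-off ages and label names are illustrative assumptions, not fixed rules.

    # A minimal sketch of discretization: numeric ages are replaced by
    # conceptual labels. Bin edges and label names are illustrative assumptions.
    ages = [4, 15, 23, 37, 45, 62, 71]

    def age_to_label(age):
        """Map a numeric age onto a conceptual interval label."""
        if age <= 10:
            return "kid"
        elif age <= 20:
            return "youth"
        elif age <= 60:
            return "adult"
        else:
            return "senior"

    labels = [age_to_label(a) for a in ages]
    print(labels)  # ['kid', 'youth', 'adult', 'adult', 'adult', 'senior', 'senior']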

Top-Down Discretization

If the process starts by first finding one or a few points (called split points or cut points) to divide the entire range of attribute values, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.

Bottom-Up Discretization

If the process starts by considering all of the continuous values as potential split points and then removes some of them by merging neighboring values into intervals, it is called bottom-up discretization or merging.

Discretization can be performed recursively on an attribute to produce a hierarchical partitioning of the attribute's values, known as a concept hierarchy.

Data discretization can be applied to the data to be converted using the methods discussed below.

Binning

This approach can be used for data discretization and, further, for the creation of a concept hierarchy. The values observed for an attribute are grouped into a number of equal-width or equal-frequency bins, and the values in each bin are then smoothed, for example by replacing them with the bin mean or bin median. Applying the method recursively generates a concept hierarchy. Binning is an unsupervised discretization technique, as it does not use any class information.
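The sketch below shows unsupervised equal-width binning with smoothing by bin means; the sample values and the choice of three bins are illustrative assumptions.

    import numpy as np

    # Equal-width binning with smoothing by bin means.
    # Sample values and the number of bins are illustrative assumptions.
    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
    n_bins = 3

    # 1. Build equal-width bin edges over the observed range.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)

    # 2. Assign each value to a bin (clip so the maximum falls in the last bin).
    bin_ids = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

    # 3. Smooth by replacing each value with the mean of its bin
    #    (assumes every bin is non-empty, which holds for this sample).
    bin_means = np.array([values[bin_ids == b].mean() for b in range(n_bins)])
    smoothed = bin_means[bin_ids]

    print(edges)     # bin boundaries
    print(bin_ids)   # which bin each value falls into
    print(smoothed)  # values replaced by their bin means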

Histogram Analysis

A histogram partitions an attribute's observed values into disjoint subsets, often called buckets or bins; the buckets may be of equal width or equal frequency.
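As a short example, the sketch below partitions a small sample into equal-frequency buckets using quantile boundaries; the data and the bucket count are assumptions made for demonstration.

    import numpy as np

    # Histogram-based discretization with equal-frequency buckets:
    # each bucket holds roughly the same number of values.
    values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], dtype=float)
    n_buckets = 4

    # Bucket boundaries at the 0%, 25%, 50%, 75% and 100% quantiles.
    boundaries = np.quantile(values, np.linspace(0, 1, n_buckets + 1))

    # Assign each value to a bucket (clip so the maximum falls in the last bucket).
    bucket_ids = np.clip(np.digitize(values, boundaries) - 1, 0, n_buckets - 1)

    print(boundaries)
    print(bucket_ids)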

Cluster Analysis

Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numeric attribute A by partitioning the values of A into clusters or groups.

Each initial cluster or partition can be further decomposed into several subclusters, forming a lower level of the hierarchy.
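A minimal sketch of clustering-based discretization is shown below, using scikit-learn's KMeans on a one-dimensional attribute; the sample values and the choice of three clusters are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    # Discretization by clustering: values of attribute A are partitioned into
    # clusters, and each value is replaced by its cluster label (or centroid).
    values = np.array([2, 3, 4, 20, 22, 25, 60, 62, 65], dtype=float).reshape(-1, 1)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
    labels = kmeans.labels_                      # discrete interval (cluster) ids
    centroids = kmeans.cluster_centers_.ravel()  # representative value per cluster

    print(labels)
    print(centroids[labels])  # each value replaced by its cluster centroid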

Data Discretization Using Decision Tree Analysis

Data discretization by decision tree analysis uses a top-down splitting approach and is a supervised procedure. To discretize a numeric attribute, the split point that yields the lowest class entropy is selected, and the same splitting criterion is then applied recursively to each resulting interval, breaking the attribute's range into several disjoint intervals.
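The sketch below illustrates the entropy-based splitting criterion on a toy attribute/class pair (both invented for illustration): it evaluates candidate split points and returns the one with the lowest weighted class entropy. Applying it recursively to each resulting interval would produce the disjoint intervals described above.

    import numpy as np

    def entropy(y):
        """Entropy of a vector of class labels."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(x, y):
        """Return the candidate split point with the lowest weighted class entropy."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        candidates = (x[:-1] + x[1:]) / 2.0          # midpoints between sorted values
        best = None
        for c in candidates:
            left, right = y[x <= c], y[x > c]
            w_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or w_ent < best[1]:
                best = (c, w_ent)
        return best

    # Toy data: an 'age' attribute and a class label (illustrative assumptions).
    ages = np.array([12, 18, 25, 31, 40, 52, 60, 70], dtype=float)
    buys = np.array(["no", "no", "yes", "yes", "yes", "yes", "no", "no"])
    print(best_split(ages, buys))  # (split point, weighted entropy)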

Data Discretization Using Correlation Analysis

In discretization by correlation analysis (for example, the chi-square-based ChiMerge method), the most similar neighbouring intervals are found and then merged, recursively, to form larger intervals, following a bottom-up strategy. It is a supervised procedure, since class information is used to judge how similar adjacent intervals are.
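A commonly cited method of this kind is ChiMerge. The compact sketch below follows that chi-square merging idea; the sample data, class labels, and the stopping condition of three final intervals are illustrative assumptions rather than a standard recipe.

    import numpy as np

    def chi2(a, b):
        """Chi-square statistic between the class-count vectors of two intervals."""
        table = np.array([a, b], dtype=float)
        total = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0)
        return terms.sum()

    def chimerge(x, y, max_intervals=3):
        classes = np.unique(y)
        # Start with one interval per distinct value: (lower bound, class counts).
        intervals = [(v, np.array([(y[x == v] == c).sum() for c in classes]))
                     for v in np.unique(x)]
        while len(intervals) > max_intervals:
            stats = [chi2(intervals[i][1], intervals[i + 1][1])
                     for i in range(len(intervals) - 1)]
            i = int(np.argmin(stats))          # most similar neighbouring pair
            merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
            intervals[i:i + 2] = [merged]      # merge the pair into one interval
        return [lo for lo, _ in intervals]     # lower bounds of the final intervals

    ages = np.array([1, 5, 7, 12, 15, 22, 30, 35, 46, 60])
    cls = np.array(["a", "a", "a", "b", "b", "b", "a", "a", "b", "b"])
    print(chimerge(ages, cls))  # lower boundaries of the merged intervals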

Concept Hierarchy Generation for Nominal Data

A nominal attribute is one that has a finite number of distinct values with no ordering among them; for example, job-category, age-category, geographic-region, and item-category are nominal attributes. A concept hierarchy for nominal data can be formed by grouping a set of related attributes, such as street, city, state, and country, into successive levels of generality.

A concept hierarchy organizes the data into multiple layers. It can be generated by specifying a partial or total ordering among the attributes, and this can be done at the schema level.
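As a small illustration, the sketch below builds a two-level concept hierarchy (city -> state -> country) for a nominal location attribute; all of the place names and groupings are illustrative assumptions.

    # A minimal sketch of a concept hierarchy for a nominal attribute, built by
    # grouping lower-level values under higher-level ones (city -> state -> country).
    # The place names and groupings below are illustrative assumptions.
    city_to_state = {"Chicago": "Illinois", "Urbana": "Illinois", "Vancouver": "British Columbia"}
    state_to_country = {"Illinois": "USA", "British Columbia": "Canada"}

    def roll_up(city, level="country"):
        """Generalize a city to the requested level of the hierarchy."""
        state = city_to_state[city]
        return state if level == "state" else state_to_country[state]

    print(roll_up("Urbana"))               # USA
    print(roll_up("Vancouver", "state"))   # British Columbia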

Why is Discretization Important?

Continuous data effectively has an infinite number of degrees of freedom (DoF), which creates mathematical difficulties. Data scientists therefore rely on discretization for a variety of purposes.

  • Feature interpretation - Because of the infinite degrees of freedom, a continuous feature has a lower probability of correlating with the target variable and may interact with it in a complex, non-linear way, which makes the feature harder to interpret. After a variable has been discretized, groups that correspond to the target can be observed directly.
  • Signal-to-noise ratio - When we discretize a feature, we fit its values into bins and reduce the influence of minor fluctuations in the data, which are often just noise. Discretization thus acts as a "smoothing" method: fluctuations are smoothed within each bin, reducing the noise in the results.



