# Data Exploration in Data Mining

**Data Mining | Data Exploration**: In this tutorial, we are going to learn about data exploration, covering the **Definition of Data Exploration**, **Statistical Description of Data**, **Concept of Data Visualization**, and **Various Techniques of Data Visualization**.

Submitted by Harshita Jain, on November 26, 2019

In the previous article, we learnt about Data Mining along with its advantages, disadvantages and various applications. Now, let us go deeper into data mining and look at the steps by which the data is handled. Let us start with **Data Exploration**. This article includes,

- Definition of Data Exploration
- Statistical Description of Data
- Concept of Data Visualization
- Various techniques of Data Visualization

### 1) Definition of Data Exploration

**Data exploration** is the process of gathering information about a target data set or field in order to understand its characteristics. These characteristics can include the size or quantity of the data, the completeness of the data, the correctness of the data, and possible relationships among data elements or files/tables in the data.

**Data exploration** is usually conducted using a combination of automatic and manual activities. Automatic activities can include data profiling, data visualization, or tabular reports that give the analyst an initial view into the data and an understanding of its key characteristics. This is usually followed by manual drill-down or filtering of the data to spot anomalies or patterns identified through the automatic actions.

**Data exploration** can also require manual scripting and queries into the data (e.g. using languages like SQL or R), or using spreadsheets or similar tools to look at the data. All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and at defining basic descriptive information (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding of the data is in place, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements, and defining relevant relationships across datasets. This process is also known as determining data quality.
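The automatic profiling step described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample data, the `profile` function, and the rule that an empty field counts as a missing value are all assumptions made for this example, not part of any particular tool.

```python
import csv
import io

# Hypothetical sample data; an empty field stands for a missing value.
SAMPLE_CSV = """id,age,city
1,34,Pune
2,,Delhi
3,29,
4,41,Mumbai
"""

def profile(rows):
    """Return the row count and, per column, the share of non-empty values."""
    rows = list(rows)
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    completeness = {}
    for col in columns:
        # Count the rows in which this column holds a non-blank value.
        filled = sum(1 for r in rows if (r[col] or "").strip() != "")
        completeness[col] = filled / total
    return {"rows": total, "completeness": completeness}

report = profile(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("rows:", report["rows"])
for col, share in report["completeness"].items():
    print(f"{col}: {share:.0%} complete")
```

A report like this gives the size and completeness of the data at a glance. Checking correctness and relationships between tables would usually need further, data-specific checks.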

### 2) Statistical Description of Data

Statistics plays an important role in all fields. It helps in collecting data in any field and in analyzing that data using statistical techniques. Statistics starts with the "collection" of data, and the calculations made on the collected data lead to predictions and conclusions about it.

Various methods of statistics include,

**2.1) Measure of Central Tendency**

In statistics, a central tendency may be referred to as the center or location of a distribution. Measures of central tendency are often called averages. The most common measures of central tendency are,

- **Arithmetic mean**: It refers to the sum of all numerical values divided by the total number of numerical values.
- **Median**: It refers to the midpoint of the data after arranging the data in ascending order.
- **Mode**: It refers to the most frequently occurring number in the data.
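As a small illustration, the three measures can be computed with Python's standard `statistics` module. The sample values are hypothetical. Note that for an even number of values the module takes the median as the average of the two middle values.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle of the sorted values
mode = statistics.mode(values)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```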

**2.2) Measure of Dispersion**

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. It describes how much the data values vary from one another and gives a clear idea of the distribution of the data. A measure of dispersion shows the homogeneity or the heterogeneity of the distribution of the observations. Common examples of measures of statistical dispersion are,

- **Range**: It refers to the difference between the highest value and the lowest value.
- **Variance**: It refers to the sum of the squared deviations from the sample mean, divided by one less than the sample size.
- **Standard Deviation**: It refers to the square root of the variance.
- **Interquartile Range**: The IQR is a measure of variability based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are called the first, second, and third quartiles, and they are denoted by Q1, Q2, and Q3. The IQR is the difference between Q3 and Q1.
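These measures can be sketched with Python's standard `statistics` module (Python 3.8 or later for `quantiles`). The sample values are hypothetical. Quartiles can be computed under several conventions, so the Q1 and Q3 values produced here by the module's default method may differ slightly from those of other tools.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

data_range = max(values) - min(values)  # highest value minus lowest value
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # square root of the sample variance

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                           # interquartile range

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```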

**2.3) Measure of Skewness and Kurtosis**

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A data set is symmetric if it looks the same to the left and right of the center point.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails or outliers. Data sets with low kurtosis tend to have light tails or a lack of outliers. A uniform distribution would be an extreme case of light tails.
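The article does not give formulas for these two measures, so the sketch below assumes one common choice: the moment-based (population) definitions, where skewness is the third central moment divided by the second central moment raised to the power 1.5, and excess kurtosis is the fourth central moment divided by the square of the second, minus 3 (so that a normal distribution scores 0). The function names and sample values are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric data set."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    return m4 / m2 ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]          # hypothetical, evenly spread values
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large value on the right

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

The evenly spread sample has zero skewness and a negative excess kurtosis, which matches the light-tailed, uniform-like case described above. The sample with one large value has positive skewness.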

### 3) Concept of Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Visualization is an increasingly key tool to make sense of the trillions of rows of data generated every day.

Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting the useful information. In the world of big data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

### 4) Various Techniques of Data Visualization

**4.1) Common general types of data visualization**

- Charts
- Tables
- Graphs
- Maps
- Infographics
- Dashboards

**4.2) More specific examples of methods to visualize data**

- Area Chart
- Bar Chart
- Box-and-whisker Plots
- Bubble Cloud
- Bullet Graph
- Cartogram
- Circle View
- Dot Distribution Map
- Gantt Chart
- Heat Map
- Highlight Table
- Histogram
- Matrix
- Network
- Polar Area
- Radial Tree
- Scatter Plot (2D or 3D)
- Streamgraph
- Text Tables
- Timeline
- Treemap
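To make one of the listed methods concrete, here is a minimal sketch of a histogram rendered as text, using only the Python standard library. The `histogram` function, the equal-width binning rule, and the sample values are assumptions made for illustration. A plotting library would normally be used to draw the chart itself.

```python
def histogram(values, bins):
    """Split the value range into equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical sample
edges, counts = histogram(values, bins=4)

# One row per bin: lower edge, upper edge, and one '#' per counted value.
for i, count in enumerate(counts):
    print(f"{edges[i]:4.1f} to {edges[i + 1]:4.1f} | {'#' * count}")
```

Even in this plain form, the output shows where most values cluster and that a single value sits far to the right of the rest, which is the kind of pattern and outlier a histogram is meant to reveal.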
