Big Data Analytics Tutorial

Big Data Analytics MCQs

Big Data Types, Their Advantages and Disadvantages

Big Data Types: In this tutorial, we will learn about the various big data types, their advantages and disadvantages. By IncludeHelp Last updated : June 08, 2023

Big Data has categorized majorly into three categories -

Structured data
Unstructured data
Semi-structured data

Some other categories of data are as follows -

Time-Series Data
Spatial Data
Streaming Data
Graph Data

This tutorial demonstrates all the mentioned categories of data with their advantages, disadvantages, and applications.

In the context of big data, there are different categories of data. These types are based on various characteristics of the data, such as its structure, volume, velocity, and variety. Here are some commonly recognized types of big data:

1. Structured Data

Structured data refers to data that has a predefined format and is organized in a fixed schema. It stores in RDBMS form and can be processed easily. Hence, it means highly organized information that can be saved and recovered using a set of algorithms. For example, the employee table in a company's database will be structured so that employee information, job titles, remuneration, etc. can be presented in an organized manner. Below are the examples of applications that stores and retrieve structured data.

Examples - spreadsheets or SQL databases.

The below table represents sales and discount information for different products; which shows the structure form of data

Product	Sum of Sales	Sum of Discount
Bike Tyres	72	4%
Car & Bike Care	118	3%
Car Body Covers	117	4%
Car Media Players	560	12%
Car Speakers	211	4%
Tyre	250	3%

Structured data plays a crucial role in big data analytics. It refers to data that is organized and stored in a fixed format. This type of data is highly organized, and each data element is assigned a specific data type and stored in predefined fields.

Advantages of Structured Data

Efficient storage and processing: Structured data is stored in a predefined format, allowing for efficient storage and retrieval. Databases optimized for structured data, such as relational databases, enable fast querying and processing of large volumes of data.
Data integrity and consistency: Structured data enforces data integrity and consistency through predefined schemas. The schema defines the structure of the data, including data types, constraints, and relationships. This ensures that the data is accurate and reliable, facilitating meaningful analysis.
Easily used by machine learning (ML) algorithms: The precise and organised structure of structured data facilitates the manipulation and querying of machine learning (ML) data.
Easy integration and interoperability: Structured data can be easily integrated with other structured datasets or systems. It follows a consistent format, making it compatible with various data processing and analytics tools. This enables organizations to combine and analyze data from multiple sources, gaining deeper insights and making informed decisions.
Easily used by business users: Structured data does not necessitate an in-depth comprehension of the various categories of data and their functions. Users can access and interpret the data if they have basic knowledge of the data.
Standardized analysis techniques: Structured data allows for the use of standardized analysis techniques, such as SQL queries, reporting, and data visualization. These techniques have been widely adopted and well-established, making it easier for analysts and data scientists to derive insights from structured datasets.
Regulatory compliance: Many industries have regulations and compliance requirements that mandate structured data formats. By using structured data in big data analytics, organizations can ensure compliance with these regulations, maintain data governance, and provide auditable records.
Accessible by more tools: Structured data is organised data, it is categorised as quantitative data which can be used and analyze by different tools.

Disadvantages of Structured Data

Limited usage: Data stored in a predefine structure such as tables and databases can only be used for its intended purpose due to this feature it limits its flexibility and applicability.
Fixed structure: This fixed structure makes it challenging to accommodate changes or modifications to the data schema. Adding new data fields or altering the structure requires careful planning and may involve significant effort.
Inability to capture unstructured data: Structured data struggles to handle unstructured or semi-structured data, such as text documents, social media posts, and multimedia content. Unstructured data often contains valuable insights that structured data alone cannot capture.
Limited storage options: Mostly, structured data is saved in a relational database with predefine schema, "data warehouses." Because of this, when data needs to change, all organized data needs to be updated, which takes a lot of time and resources.
Difficult to manage variability in data: Generally, structured data have a consistent and standardized format. When dealing with diverse data sources that may have variations in data types, formats, or quality, integrating and normalizing the data can be challenging.

2. Unstructured Data

Unstructured data refers to data that does not have a predefined structure or format. The formal structural principles of data models are not followed by Unstructured Data, which is a completely different type of data. Unstructured data includes text documents, social media posts, emails, audio and video files, sensor data, and more.

Figure: Unstructured Data

Normal data tools and methods can't be used to process and analyze unstructured data, which is usually called qualitative data. It is often challenging to analyze using traditional methods because it lacks a uniform structure. Unstructured data is best handled in non-relational (NoSQL) systems because it doesn't have a set data model. Another way to handle disorganized data is to keep it in its raw form in "data lakes."

Advantages of Unstructured Data

Native format: Keeps in native format, and doesn't change until needed. Its adaptability increases file formats and enables data experts to prepare and analyze the data they need.
Flexibility and adaptability: Unstructured data is versatile and can be easily captured and stored without the need for predefined schemas or data models, allowing organizations to adapt and evolve their data collection strategies as needed.
Fast accumulation rates: Data may be acquired fast and readily because there is no need to predefine it.
Rich and diverse information: Contextual information in different contexts such as subjective opinions, sentiments is not captured in structured data formats. Unstructured data allows for storing and analyzing complex topics or situations.
Real-world representation: By analyzing unstructured data; companies can get insights such as customer preferences, behavior, and trends which help them to make more informed decisions.
Data enrichment: Unstructured data can be enriching to make fruitful decisions. For example, adding sentiment analysis scores to customer reviews can provide a more comprehensive view of customer satisfaction.
Data lake storage: Allows for huge storage and pay-as-you-go pricing, lowering expenses and facilitating scaling.
Data-driven decision-making: Analyzing unstructured data allows organizations to make data-driven decisions based on information. Organizations can gain a more complete understanding of their business environment, market trends, and customer needs.

Disadvantages of Unstructured Data

Complexity: Managing and understanding unstructured data can be complex and require specialized tools and techniques.
Requires expertise: Due to the non-formatted nature of unstructured data, data experts are required to expertisation to prepare and analyze it.
Difficult data extraction: Extracting relevant information from unstructured data can be challenging.
Incompatibility: Unstructured data may not be compatible with traditional databases or analytical tools designed for structured data. Converting unstructured data into a structured format can be time-consuming and may result in a loss of information or context.
Limited data quality control: Unstructured data may contain errors, inconsistencies, or inaccuracies, making it difficult to ensure data quality.
Analysis complexity: Analyzing unstructured data requires specialized skills and advanced analytics techniques.
Specialized tools: Unstructured data requires specialized tools to work on it.

3. Semi-structured Data

Semi-structured data refers to data that does not have a rigid structure like traditional relational databases; it's data that is neither completely structured nor unstructured. This includes the characteristics of both types of data.

In semi-structured data, information is organized in a way that allows for easy processing and retrieval, but the structure is not as strict as in a relational database. It typically includes tags, labels, or other markers that provide some level of organization or hierarchy. This allows for flexibility in storing and analyzing data, making it suitable for various applications.

Examples of Semi-structured Data

Some common examples of semi-structured data include:

XML (eXtensible Markup Language): XML is a popular format for representing semi-structured data. It uses tags to define elements and attributes to provide additional information about the data.
JSON (JavaScript Object Notation): JSON is a lightweight data-interchange format that is commonly used for web APIs and data storage. It organizes data into key-value pairs and nested structures.
Log files: Log files generated by systems, applications, or devices often contain semi-structured data. They typically include timestamps, event descriptions, and other relevant information.
Avro: Avro is an RPC framework and data serialization that was originally designed to use Apache Hadoop. Avro serializes data in a compact, binary format using JSON schemas that can be sent to any app or program for deserialization.
ORC: Optimised Row Columnar (ORC) is a semi-structured data format that was created to accomplish more efficient data compression and to improve the performance of reading, writing, and processing data in comparison to earlier Hive formats.
HTML (Hypertext Markup Language): Although primarily used for defining the structure of web pages, HTML can also be considered semi-structured data due to its tag-based organization.

Types of Semi-Structured Data

Images/Videos: When we take a picture with our phone, the picture and information about when and where it was taken are saved in the album. After that, we might change the name of the picture or put it in a new group.
Emails: Emails are automatically put into the Inbox, Spam, or Outbox folders based on the sender, recipient, subject, and date. The data in the mail is unstructured, but we can find it using keywords.
Social Media Platforms: On different social media platforms; comments, content, and likes are semi-structured. Twitter Tweets, Instagram images/videos, and YouTube contents are semi-structured data.
Machine-Generated Semi-Structured Data: The weather, traffic, satellite pictures, and video are some of examples of semi-structured data.

All the above-mentioned examples can be considered sources of semi-structured data. Semi-structured data can be processed and analyzed using various techniques. For example, XML and JSON data can be parsed and transformed into structured formats for easier analysis. Additionally, technologies like NoSQL databases and data lakes are often used to store and process semi-structured data due to their flexible schemas and scalability.

Advantages of Semi-structured Data

Able to manage different data types and formats: Semi-structured data allows for more flexibility compared to structured data.as it doesn't follow a preset framework. Its flexible schema enables the inclusion of new data elements without requiring a complete overhaul of the existing data structure.
Scalability: Semi-structured data can handle a wide variety of data types and formats. It is well-suited for handling diverse and heterogeneous data sources, making it scalable for managing large and complex datasets.
Reduced data integration efforts: It enables the combination of data from various sources, even if they have different structures, by leveraging common attributes or key-value pairs.
Schema evolution: Semi-structured data accommodates schema evolution without disrupting existing data.
Simplified data exchange: Semi-structured data formats like JSON and XML are widely used for data exchange due to their human-readable and self-descriptive nature.
Analytical flexibility: Semi-structured data provides analytical flexibility, allowing users to explore and analyze data in different ways. This empowers data analysts and data scientists to extract insights and discover patterns that may not be captured in a predefined schema.
Data enrichment: Semi-structured data can be enriched with additional metadata or annotations to enhance its meaning and context. This enrichment can improve data quality, provide better searchability, and support advanced analytics.
Highly storable and portable: Compared to unstructured data, semi-structured data is significantly easier to store and transport. Data portability is the ease with which data can be transferred, accessed, shared, and organized. It becomes relatively simple to transfer data from one network location to another.

Disadvantages of Semi-structured Data

Lack of data integrity: Without a predefined schema or structure, it becomes challenging to ensure the accuracy and validity of the data.
Limited querying and analysis capabilities: Analyzing and querying Semi-structured data can be more complex and time-consuming, requiring specialized techniques such as XPath or JSON querying.
Data integration challenges: Semi-structured data requires additional effort to map and transform the data into a unified format.
Lack of data consistency: Without a rigid structure, Semi-structured data may lack consistency as different data sources use different data formats which lead to inconsistencies that can hinder data analysis and interpretation.
The complexity of data validation and cleaning: Due to the lack of predefined rules or schemas, validating and cleaning Semi-structured data can be challenging.
Increased storage requirements: Without a predefined structure, additional metadata or tags may be required to maintain the organization and meaning of the data.
Difficulty in maintaining data quality: With the absence of strict rules and constraints, ensuring data quality in Semi-structured data becomes more challenging.

Some other categories of data are as follows –

4. Time-Series Data

Data from repetitive measurements over time constitute a time series. If you plotted the points on a graph, time would always be one of the axes.

For instance, a metric could refer to the amount of inventory sold from one day to the next in a store. In this data format, each data point is associated with a specific timestamp or time period.

Examples of time series data:

Stock prices
Rainfall measurements
Annual retail sales
Heartbeats per minute
Monthly subscribers etc.

Time-series data can exhibit various patterns and properties, such as trends, seasonality, cyclicality, and irregular fluctuations. Understanding and analyzing these patterns can provide valuable insights into the underlying processes, help in making predictions, and support decision-making.

Advantages of Time-Series Data

Trend Analysis: Time series data allows for the detection and analysis of trends and patterns over time.
Cleaning data: This enables to identification the true "signal" in a data set by eliminating noise and outliers.
Seasonality Detection: By analyzing the data we can identify and account for seasonal variations, enabling better forecasting and decision-making.
Understanding data: The models utilized in time series analysis aid in the interpretation of the data's actual significance.
Forecasting: By studying historical patterns and relationships within the data, we can build models to predict future values and trends.
Anomaly Detection: Time series data can be useful in detecting anomalies or outliers.
Correlation and Causality Analysis: By analyzing the correlations between different time series, we can uncover dependencies; identify leading or lagging indicators, and gain insights into the driving factors behind observed patterns.
Time-dependent Modeling: This enables model development that explicitly captures the time-dependent nature of the data which allows us for accurate forecasting and prediction such as autoregressive integrated moving average (ARIMA).

Disadvantages of Time-Series Data

Data Quality and Completeness: Time series data can be prone to missing values, outliers, or errors.
Large-scale time series data: Analyzing and interpreting large-scale time series data is challenging, as it requires managing datasets, and patterns, and extracting relevant information.
Sensitivity to Outliers: Outliers, or extreme values, can significantly impact time series analysis.
Computational Complexity: Analyzing time series data, particularly with advanced techniques like machine learning algorithms, can be computationally intensive.
Causality and Interpretability: Establishing causal relationships from time series data alone can be difficult, as other factors and external influences may be involved.

5. Spatial Data

Spatial data also known as geospatial data or geographic information is a type of data that references a specific geographical location. Spatial data is made up of points, lines, polygons, and other geographic and geometric data primitives. These can be tracked by location, saved with an item as information, or used by a communication system to find end-user devices.

Types of Spatial Data

Geometric data: Geometric data is a type of physical data that is mapped on a flat surface in two dimensions. The geometry data on floor plans is an example of this. Google Maps is another simplest example of geometric data which gives correct directions.
Example: Google Map
Geographic data: Geographic data is a type of data that is plotted on a map of a sphere. This shows the latitude and longitude of an object or place on the map.
Example: A global positioning system

Advantages of Spatial Data

Data Visualization: Spatial data provides a visual representation of data, allowing users to observe and analyze information in a geospatial context.
Geographic Context: It allows users to consider the location-specific factors and variables that may influence outcomes or behaviors.
Data Integration: It integrates and analyzes different data sources, such as demographic information, environmental data, economic indicators, and infrastructure data.
Geospatial Analysis: Spatial data analysis techniques, such as spatial clustering, spatial interpolation, and spatial regression, enable advanced analysis and modeling.
Location-Based Services: Spatial data forms the foundation for location-based services (LBS) such as navigation systems, geolocation services, and proximity-based marketing.
Environmental Analysis: It helps in environmental analysis by mapping and tracking various factors, such as land use, vegetation cover, air quality, water resources etc.
Scientific Research: It explores different disciplines, including geology, astronomy, archaeology, ecology, and climatology.

Disadvantages of Spatial Data

Complexity: Spatial data requires specialized tools, software, and skills to manipulate and analyze effectively. It is often complex.
Data Quality Issues: Spatial data can suffer from quality issues such incompleteness, and inconsistencies.
Data Acquisition Costs: Gathering and processing spatial data may be expansive related to data collection, storage, maintenance, and skilled personnel.
Privacy and Security Concerns: Privacy and security concerns arise when handling such data, as unauthorized access or misuse.
Resource Intensiveness: Performing complex spatial operations or running large-scale analyses may require substantial computational power, storage, and processing.

6. Data Streaming

Data streaming continually transmits, ingests, and processes data. Data streaming is essential for real-time analytics. Processing and analyzing streaming data as it arrives is its value.

Streaming data makes it possible to handle parts of this data in real-time or very close to real-time. Two of the most popular ways to use data streams are:

Streaming media (video)
Real-time analytics

Examples of Data Streaming

Followings are some examples of data streaming

Social media feeds
Multiplayer video games
Ride-sharing

Advantages of Data Streaming

Real-time or near-real-time processing: Data streaming enables the processing of in real-time.
Continuous data flow: Streaming allows for a continuous flow of data rather than processing data in batches.
Event-driven architecture: This allows for the development of more responsive and reactive applications that can automatically trigger actions based on specific events, such as sending alerts, notifications, or initiating workflows.
Identifying patterns: It is well suited to identify patterns over time.
Handling of data volumes: Sorting and saving pieces of data will be useful in the long run.
Handling of data volumes: Sorting and saving pieces of data will be useful in long run.

Disadvantages of Data Streaming

Streaming Data is Very Complex
Difficult to integrate and accessing streaming data

7. Graph Data

Graph data is used to model and analyze complex systems that involve interconnected entities, such as social networks, transportation networks, biological networks, recommendation systems, and more. In data science, graph data is often analyzed using graph algorithms and techniques to extract meaningful information and gain insights. These algorithms can help with tasks such as finding the shortest path between two nodes, identifying communities or clusters within the graph, detecting anomalies or outliers, and performing network analysis.

Graph databases are specifically designed to store and process graph data efficiently. They provide a flexible and scalable way to handle large-scale graph datasets and perform complex graph queries.

Advantages of Graph Data

Visual Representation: Graph data represents and stores complex relationships between entities.
Flexible Structure: Graph data is inherently flexible and can handle various types of data and relationships.
Contextual Insights: By exploring the relationships between entities, we can uncover patterns, discover hidden connections, and gain a deeper understanding of the data.
Real-world Applications: Graph data finds extensive applications in various domains.
Data Integration: Graph data enables the integration of heterogeneous data sources.

Disadvantages of Graph Data

Large Storage: Graph databases typically require additional storage space compared to relational databases.
Complex data modeling: Designing the graph schema and defining relationships between entities is more complex.
Performance degradation with deep traversal: Deep traversal involves multiple levels of connections can result in performance degradation and time-consuming.