Big Data Comprehensive Training (Practical)

The Big Data foundation course gives you an understanding of Big Data, the potential data sources that can be used to solve real business problems, and an overview of data mining and its tools.

  • 4-Day Workshop
  • Completion Certificate awarded by GKK

  • Please contact us directly for more details


Module 1: Big Data – History, Overview, and Characteristics

Big Data Definition
Big Data Benefits
Big Data Characteristics

Big Data Technologies – Overview

Big Data Success Stories

Big Data – Privacy and Ethics

Privacy – Compliance
Privacy – Challenges
Privacy – Approach

Big Data Projects

Who Should Be Involved?
What Is Involved?

Module 2: Big Data Sources

2.1 Enterprise Data Sources

Enterprise Systems
Data Warehouses
Unstructured Data – Introduction
Unstructured Data – Metadata

2.2 Social Media Data Sources

Facebook – Introduction
Facebook – Public Feed API
Facebook – Keyword Insights API
Facebook – Graph API
Twitter – Introduction
Twitter – Streaming APIs
Twitter – REST APIs
Other Social Media

2.3 Public Data Sources

Regulatory Bodies


Module 3: Data Mining – Concepts and Tools

3.1 Data Mining – Introduction

Types of Data Mining – Overview
Types of Data Mining – Classification
Types of Data Mining – Association
Types of Data Mining – Clustering

3.2 Data Mining – Tools

Modules of Weka Applications
KNIME – Example
R Language


Module 4: The Hadoop Distributed File System (HDFS)

4.1 Hadoop Fundamentals

Main Components of Hadoop
Additional Components of Hadoop

4.2 The Hadoop Distributed File System (HDFS)

Overview of HDFS
Launching HDFS in Pseudo-Distributed Mode
Core HDFS Services
Installing and Configuring HDFS
HDFS Commands
HDFS Safe Mode
Checkpointing HDFS
Federated and High Availability HDFS
Running a Fully-Distributed HDFS Cluster with Docker

4.3 MapReduce with Hadoop

MapReduce from the Linux Command Line
Scaling MapReduce on a Cluster
Introducing Apache Hadoop
Overview of YARN
Launching YARN in Pseudo-Distributed Mode
Demonstration of the Hadoop Streaming API
Demonstration of MapReduce with Java

Module 5: Apache

5.1 Introduction to Apache Spark

Why Spark?
Spark Architecture
Spark Drivers and Executors
Spark on YARN
Spark and the Hive Metastore
Structured APIs, DataFrames, and Datasets
The Core API and Resilient Distributed Datasets (RDDs)
Overview of Functional Programming
MapReduce with Python

5.2 Apache Hive

Hive as a Data Warehouse
Hive Architecture
Understanding the Hive Metastore and HCatalog
Interacting with Hive using the Beeline Interface
Creating Hive Tables
Loading Text Data Files into Hive
Exploring the Hive Query Language
Partitions and Buckets
Built-in and Aggregation Functions
Invoking MapReduce Scripts from Hive
Common File Formats for Big Data Processing
Creating Avro and Parquet Files with Hive
Creating Hive Tables from Pig
Accessing Hive Tables with the Spark SQL Shell

5.3 Persisting Data with Apache HBase

Features and Use Cases
HBase Architecture
The Data Model
Command Line Shell
Schema Creation
Considerations for Row Key Design

5.4 Apache Storm

Processing Real-Time Streaming Data
Storm Architecture: Nimbus, Supervisors, and ZooKeeper
Application Design: Topologies, Spouts, and Bolts


Module 6: Data Modelling with Document Databases

6.1 MongoDB Fundamentals

Sharding and Replication
MongoDB Ecosystem – Languages and Drivers
MongoDB Ecosystem – Hadoop Integration
MongoDB Ecosystem – Tools

6.2 Install and Configure

How to Install and Configure

6.3 Document Databases

Document Design Considerations

6.4 Data Modelling with Document Databases

Twitter Sentiment Analysis
Twitter Sentiment Analysis – Algorithm
Network Log Analysis
Network Log Analysis – Algorithm

What are the course objectives?

  • What are the prerequisites?

    All trainees should have the following:

    i) Required knowledge
    Familiarity with an imperative programming language such as C
    Knowledge of SQL queries

    ii) Hardware requirements
    Minimum laptop configuration:
    Memory (RAM): 8 GB
    Free disk space: 30 GB
    CPU: 4 cores

    iii) Software requirements
    Windows or macOS
    Oracle VirtualBox (

Who should take the course?

  • Software developers
  • IT managers
  • Service management professionals
  • Technology Managers

Who is your trainer for the program?


We offer the following options:

  • Cash
  • HRDF Claimable
  • Maybank Ezpay (Up to 24 months @ 0% Interest)
  • CIMB Easy Pay (Up to 12 months @ 0% Interest)
  • Cash Installment (Case by case basis)

Futureproof Yourself With Us!
