Hadoop and Spark

32 Hours

TDXBI-104

Course outline

The course delivers the key concepts and expertise participants need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. How to employ Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala. Learning about the challenges faced by Hadoop developers. Participants learn to identify which tool is the right one to use in a given situation, and will gain hands-on experience in developing using those tools.

Upcoming meetings

There are no upcoming meetings for this course.
Contact us to schedule this course, which will be customized specifically for your organization.
info@hackerupro.com
Download Full Syllabus

Modules

  • Problems with Traditional Large-scale Systems
  • The Hadoop EcoSystem
  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN
  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2
  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases
  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching
  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression
  • Partitioning Overview
  • Partitioning in Impala and Hive
  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations
  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala & Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging
  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
  • RDD Lineage
  • Caching Overview
  • Distributed Persistence
  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means
  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL with Impala

Prerequisites

  • 01 Basic knowledge of database concepts and development environments

Target Audience

  • Software developers
  • Software architects
  • Professionals willing to start working with Big Data

Contact us

    Skip to content