What is big data?
Big data refers to volumes of data too large to store and process with traditional tools, such as the data collected by services like Facebook. Commonly cited figures give a sense of the scale: around 250 million photos are uploaded to Facebook per day, and every 60 seconds 510,000 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded. About 500 million people watch videos on Facebook each day. Facebook therefore has to store and organize this data so that it can be shown to end users in the form they want.
What is Hadoop?
Hadoop is an open-source framework from Apache that is used to store, process, and analyze very large volumes of data. It is designed for offline (batch) processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others. It is written in Java.
What is Spark?
Spark is a cluster computing platform designed to be fast and general purpose. It is a distributed data processing engine that can run on top of Hadoop, using HDFS for storage and YARN for resource management.
Who can study Big data?

There is no strict prerequisite for learning Hadoop. However, if you want to become an expert in Hadoop and build a strong career, you should have at least a basic knowledge of Java and Linux. Anyone with the following high-level skills can study big data:

  • Programming
  • Data analysis
  • Data mining or machine learning
  • Statistical analysis software
Where to study Big data?

We recommend that you visit us before you go elsewhere. We are the best big data training institute in Coimbatore. Our practitioners will train you in line with today's industry trends.

What are the components of Hadoop?

The main components of Hadoop are:

  • Storage unit: HDFS (NameNode, DataNode)
  • Processing framework: YARN (ResourceManager, NodeManager)

What are HDFS and YARN?

HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It stores different kinds of data as blocks in a distributed environment. It follows a master/slave topology: the NameNode is the master that keeps the file system metadata, and the DataNodes are the slaves that store the blocks.
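
As a rough illustration of block-based storage, the sketch below works out how a file would be cut into fixed-size blocks. It assumes a block size of 128 MB (the default in recent Hadoop versions, configurable through dfs.blocksize). The function name split_into_blocks is hypothetical and is not part of any HDFS API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        blocks.append(remainder)
    return blocks


blocks = split_into_blocks(300 * MB)
print(len(blocks), [b // MB for b in blocks])  # 3 [128, 128, 44]
```

In a real cluster each of these blocks would also be replicated across several DataNodes; the sketch leaves replication out.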

YARN (Yet Another Resource Negotiator) is the resource management and processing framework in Hadoop. It manages cluster resources and provides an execution environment for processes.

What is distributed cache and what are its benefits?

Distributed Cache is a facility provided by the Hadoop MapReduce framework to cache files that a job needs. Once a file is cached for a specific job, Hadoop copies it to each DataNode where map and reduce tasks are executing, so that the tasks can read it locally. Your code can then read the cached file and populate any collection (such as an array or hash map) from it.

Benefits of using distributed cache are:

  • It distributes simple, read-only text or data files as well as more complex types such as JARs and archives. Archives are un-archived on the slave nodes.
  • It tracks the modification timestamps of the cached files, so the files must not be modified while a job is running.
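
The usual pattern, loading a small cached file into a hash map once and then using it for every record, can be sketched in plain Python. This is only an illustration of the idea and does not use the Hadoop API: the in-memory file, the "code,name" format, and the function names load_lookup and map_record are all assumptions made for the example.

```python
import io

# Stand-in for a small read-only lookup file that the framework has
# already copied to the node running the task (hypothetical content).
CACHED_FILE = io.StringIO("IN,India\nUS,United States\n")


def load_lookup(cache_file):
    """Read 'code,name' lines into a dict; done once per task, not per record."""
    lookup = {}
    for line in cache_file:
        line = line.strip()
        if not line:
            continue
        code, name = line.split(",", 1)
        lookup[code] = name
    return lookup


def map_record(record, lookup):
    """Replace the country code in a 'user,code' record with the full name."""
    user, code = record.split(",")
    return user, lookup.get(code, "unknown")


lookup = load_lookup(CACHED_FILE)
results = [map_record(r, lookup) for r in ["asha,IN", "bob,US", "chen,CN"]]
print(results)
```

Because every task holds its own copy of the lookup table, the join needs no extra data movement between nodes.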

Why are both Spark and Hadoop needed?

Spark is often called a cluster computing engine or simply an execution engine. It borrows many concepts from Hadoop MapReduce, and the two work well together. Running Spark with HDFS and YARN gives good performance and simplifies work distribution on the cluster: HDFS acts as the storage engine for huge volumes of data, and Spark acts as the processing engine, with in-memory and more efficient data processing.

HDFS: It is used as the storage engine for Spark as well as Hadoop.

YARN: It is a framework for managing the cluster using a pluggable scheduler.

Beyond MapReduce: With Spark you can run MapReduce-style algorithms as well as other higher-level operators, for instance map(), filter(), reduceByKey(), and groupByKey().
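
To show what these operators do, here is a plain-Python sketch of a word count built from map, filter, and a reduce-by-key step. It runs on a single machine and does not use Spark; the helper reduce_by_key is a hypothetical stand-in that only mimics the behaviour of Spark's reduceByKey().

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (single-machine stand-in)."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined


words = "spark runs on hadoop and spark reads hdfs".split()

# filter(): keep only words longer than three characters.
long_words = filter(lambda w: len(w) > 3, words)
# map(): turn every word into a (word, 1) pair.
pairs = map(lambda w: (w, 1), long_words)
# reduceByKey-style step: add up the ones for each word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)
```

In Spark the same chain would run in parallel over the partitions of a distributed dataset rather than over a local list.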

What limits the maximum size of a partition?

The maximum size of a partition is ultimately limited by the available memory of an executor.

What is Shuffling?

Shuffling is the process of repartitioning (redistributing) data across partitions. It may move data across JVMs, or even across the network when data is redistributed among executors. Because this is expensive, avoid shuffling wherever possible: think about ways to leverage existing partitions, and use partial aggregation to reduce data transfer.
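
The effect of partial aggregation can be seen in a small single-machine sketch. It compares how many records would cross the shuffle boundary when each partition sends its raw records with how many cross it when each partition first sums its values per key. The data and function names are made up for the illustration, and nothing here calls Spark.

```python
from collections import Counter

# Two input partitions of (key, value) pairs, as if held by two executors.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def shuffle_without_combine(parts):
    """Every record is sent across the shuffle boundary unchanged."""
    return [record for part in parts for record in part]


def shuffle_with_combine(parts):
    """Each partition sums per key first, so it sends at most one record per key."""
    shuffled = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled


def final_sum(records):
    """Reduce side: add up whatever arrived for each key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


raw = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions)
print(len(raw), len(combined))                   # 7 4
print(final_sum(raw) == final_sum(combined))     # True
```

Both paths give the same totals, but the combined path moves fewer records. The saving grows with the number of repeated keys per partition.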

What is checkpointing?

Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a reliable distributed file system (such as HDFS) or a local file system. Reliable RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.

You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory, and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.
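
What "truncating the lineage" means can be shown with a toy model. The ToyRDD class below is hypothetical and much simpler than Spark: it keeps data in memory instead of writing to a checkpoint directory, and it checkpoints immediately rather than when the next job runs. It only illustrates how saving the data allows the references to parent datasets to be dropped.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived from its parent."""

    def __init__(self, data=None, parent=None, func=None):
        self._data = data      # materialized records, if any
        self.parent = parent   # parent dataset in the lineage graph
        self.func = func       # transformation applied to the parent's records

    def map(self, func):
        """Return a new dataset that lazily applies func to every record."""
        return ToyRDD(parent=self, func=lambda rows: [func(r) for r in rows])

    def collect(self):
        """Return the records, recomputing from the parent if not materialized."""
        if self._data is not None:
            return list(self._data)
        return self.func(self.parent.collect())

    def lineage_depth(self):
        """Number of ancestors that would have to be recomputed."""
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self):
        """Save the records and drop the reference to the parent."""
        self._data = self.collect()
        self.parent = None
        self.func = None


base = ToyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 10)
depth_before = derived.lineage_depth()
derived.checkpoint()
depth_after = derived.lineage_depth()
print(depth_before, depth_after, derived.collect())  # 2 0 [20, 30, 40]
```

After the checkpoint, the dataset can be read back without recomputing any of its ancestors, which is the point of truncating a long lineage.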

Define Spark architecture

Spark uses a master/worker architecture. A driver talks to a single coordinator, called the master, which manages the workers on which executors run. The driver and the executors run in their own Java processes.

What is the purpose of Driver in Spark Architecture?

A Spark driver is the process that creates and owns an instance of SparkContext. It is the part of your Spark application that runs the main method, in which the SparkContext instance is created.

  • The driver splits a Spark application into tasks and schedules them to run on executors.
  • The driver is where the task scheduler lives; it spawns tasks across workers.
  • The driver coordinates the workers and the overall execution of tasks.
Can you define the purpose of master in Spark architecture?

A master is a running Spark instance that connects to a cluster manager for resources. The master acquires cluster nodes to run executors.

What are the workers?

Workers (or slaves) are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized (marshalled) tasks and runs them in a thread pool.
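
The division of labour between a driver and its workers can be imitated on one machine with a thread pool. In this sketch the "driver" code cuts the input into one task per partition and the pool plays the role of an executor running those tasks. The names split_into_tasks and run_task, the partition count, and the pool size are assumptions for the example, and no Spark code is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(data, num_partitions):
    """Driver side: cut the input into at most num_partitions slices (tasks)."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: the work done for one partition (here, a sum of squares)."""
    return sum(x * x for x in partition)


data = list(range(10))
tasks = split_into_tasks(data, num_partitions=3)

# The thread pool stands in for an executor that runs tasks concurrently.
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_results = list(executor.map(run_task, tasks))

# The driver combines the partial results into the final answer.
total = sum(partial_results)
print(tasks, partial_results, total)
```

In a real cluster the tasks would be serialized and sent to executors on other machines, but the split, run, and combine steps follow the same outline.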

What is speculative execution of tasks?

Stragglers are tasks that run slower than most of the other tasks in a job. They are the candidates for speculative execution.

Speculative execution of tasks is a health-check procedure that looks for tasks to be speculated, i.e. tasks running slower in a stage than the median of all successfully completed tasks in a task set. Such slow tasks are re-launched on another worker. The original slow task is not stopped; a new copy runs in parallel.
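
The median-based check can be sketched as follows. This is a simplified model and not Spark's scheduler code: the function name find_speculatable and the task ids are made up, and the multiplier of 1.5 is only an illustrative value (Spark has a comparable setting, spark.speculation.multiplier).

```python
from statistics import median


def find_speculatable(completed, running, multiplier=1.5):
    """Return ids of running tasks slower than multiplier x the median
    duration of the tasks that have already completed successfully."""
    if not completed:
        return []  # nothing to compare against yet
    threshold = multiplier * median(completed)
    return [task_id for task_id, elapsed in running.items() if elapsed > threshold]


completed = [10.0, 12.0, 11.0, 9.0]             # seconds, finished tasks
running = {"task-4": 14.0, "task-5": 40.0}      # seconds elapsed so far

slow = find_speculatable(completed, running)
print(slow)  # ['task-5']
```

Here the median is 10.5 seconds, so the threshold is 15.75 seconds and only task-5 would get a second copy launched on another worker.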

Which cluster managers can be used with Spark?

Spark can run on Apache Mesos, Hadoop YARN, and its own standalone cluster manager. It also has a local mode, in which the driver and executors run in the same JVM on a single node.
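
The cluster manager is normally chosen through the master URL given to the application. The helper below is a hypothetical sketch (it is not part of Spark) that maps the common master URL forms to the options listed above; the host names in the examples are placeholders.

```python
def describe_master(master_url):
    """Name the cluster manager that a Spark master URL refers to."""
    if master_url == "local" or master_url.startswith("local["):
        return "local mode"
    if master_url.startswith("spark://"):
        return "Spark standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn":
        return "Hadoop YARN"
    return "unknown"


for url in ["local[*]", "spark://host:7077", "mesos://host:5050", "yarn"]:
    print(url, "->", describe_master(url))
```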
