Lesson Posted on 26/11/2020 Learn Big Data
CheckPointing Process - Hadoop
Silvia Priya
Experienced datawarehouse professional for 10 years. Certified Big data-Hadoop and Python Trainer. I...
CHECK POINTING
The checkpointing process is one of the vital activities in Hadoop. The NameNode stores the metadata information on its hard disk.
We all know that metadata is the heart of the distributed file system; if it is lost, we cannot access any file inside the file system.
The metadata is physically stored on the machine in the form of two files:
1. FSIMAGE - A snapshot of the file system at a point in time.
2. EDITS FILE - Contains every transaction (creation, deletion, moving, renaming, copying, etc. of files) in the file system.
With HA (High Availability) in Hadoop v2, a backup of the NameNode's metadata is kept on another machine called the Standby NameNode. Since different clients access the metadata very frequently while reading files, it is better to keep it in RAM rather than only on the hard disk, so that it can be accessed faster.
But wait... what happens if the machine goes down? We lose everything in RAM. Hence, taking a backup of the data held in RAM is a viable option.
FSIMAGE0 -- Represents the fsimage file at a particular time
FSIMAGE1 -- Represents the copy of the FSIMAGE0 file, taken as a backup.
Let's imagine the backup has to be taken every 6 hours. If something goes wrong in the cluster and the machine goes down before the backup is taken, i.e. before the 6 hours elapse, we end up losing the latest fsimage file.
To overcome this problem, a dedicated system is added to the cluster exclusively for safeguarding the metadata efficiently, and that process is called the Checkpointing Process.
Let's understand the process step by step.
STEP 1
A copy of the metadata (the fsimage and edits files) is taken from the NameNode and placed inside the Secondary NameNode (SNN).
STEP 2
Once the copy is placed in the SNN, the edits file, which captures every single transaction happening in the file system, is merged with the fsimage file (the snapshot of the file system). The combined result gives the updated, latest state of the file system.
STEP 3
The latest merged Fsimage will be moved to the NN's metadata location.
STEP 4
During the merge itself, transactions keep happening: some files may be deleted, created or copied. Those details are recorded in a new file called Edits.new, because the original Edits file is already opened/utilized for the copy into the SNN (remember the deadlock principle).
STEP 5
Now the Edits.new file becomes the latest Edits file, and the merged fsimage becomes the current fsimage file. This process is repeated at a specific interval.
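The "specific interval" mentioned above is configurable. As a rough sketch, assuming a standard Hadoop 2.x setup (the default values shown are only fall-backs; verify them against your own hdfs-site.xml), the relevant settings can be read from any JVM that has the Hadoop client on its classpath, for example inside spark-shell:
// Minimal sketch: reading the checkpoint-related settings (assumes Hadoop client jars on the classpath).
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
conf.addResource("hdfs-site.xml")   // pick up the cluster settings if the file is on the classpath
// Seconds between two consecutive checkpoints (fall-back value used if the property is not set).
val period = conf.get("dfs.namenode.checkpoint.period", "3600")
// Number of un-checkpointed transactions that also triggers a checkpoint.
val txns = conf.get("dfs.namenode.checkpoint.txns", "1000000")
println("checkpoint period = " + period + " s, checkpoint txns = " + txns)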
So now, no more manual backups are needed to save the metadata in the NN for failover scenarios.
Will see more details and programs in the upcoming lessons.
Thank you!!
Lesson Posted on 01/05/2020 Learn Big Data
Loading Hive tables as a Parquet file
Silvia Priya
Experienced datawarehouse professional for 10 years. Certified Big data-Hadoop and Python Trainer. I...
Hive tables are very important when it comes to Hadoop and Spark, as both can integrate with and process tables in Hive.
Let's see how we can create a Hive table that internally stores its records in the Parquet format.
Storing a Hive table as a Parquet file with Snappy compression in the traditional Hive shell
create table transaction(no int,tdate string,userno int,amt int,pro string,city string,pay string) row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/online/hive/transactions' into table transaction;
create table tran_snappy(no int,tdate string,userno int,amt int,pro string,city string,pay string) stored as parquet tblproperties('parquet.compression' = 'SNAPPY');
insert into table tran_snappy select * from transaction;
Storing a Hive table as a Parquet file with Snappy compression in Spark SQL
1. Import the HiveContext in the spark shell and create the Hive table:
scala> import org.apache.spark.sql.hive.HiveContext
scala> val sqlContext = new HiveContext(sc)
scala> sqlContext.sql("create table transaction(no int,tdate string,userno int,amt int,pro string,city string,pay string) row format delimited fields terminated by ','")
2. Load the created table:
scala> sqlContext.sql("load data local inpath '/home/cloudera/online/hive/transactions' into table transaction")
3. Create a Snappy-compressed Parquet table:
scala> sqlContext.sql("create table tran_snappy(no int,tdate string,userno int,amt int,pro string,city string,pay string) stored as parquet tblproperties('parquet.compression' = 'SNAPPY')")
4. Insert into the Parquet table from the table created in step 1:
scala> val records_tran = sqlContext.sql("select * from transaction")
scala> records_tran.insertInto("tran_snappy")
Now the records are inserted into the Snappy-compressed Hive table. Go to the /user/hive/warehouse directory to check whether the Parquet files have been generated for the corresponding table.
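As a side note, the same load can also be done through the DataFrame API instead of raw SQL. Here is a minimal sketch, assuming the same spark-shell session and the transaction table created above; the target table name tran_snappy_df is only an illustrative choice, and the compression is set through the standard Spark configuration property:
// Minimal sketch: writing the transaction table as Snappy-compressed Parquet via the DataFrame API.
// Assumes the HiveContext (sqlContext) and the 'transaction' table created in the steps above.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
val df = sqlContext.sql("select * from transaction")
df.write.format("parquet").saveAsTable("tran_snappy_df")   // illustrative target table name
This avoids writing the DDL for the target table by hand, since Spark derives its schema from the DataFrame.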
Lesson Posted on 14/09/2019 Learn Big Data
Joshua Charles
I have 10+ years of IT experience and involved responsibilities such as Production/Application Support...
Fully equipped Big Data lab for training and practice. Users can practise Big Data, data science and machine learning technologies. The lab is accessible over the internet, so you can learn from anywhere. Kindly contact me for activation and subscription.
Lesson Posted on 29/03/2019 Learn Big Data
Silvia Priya
Experienced datawarehouse professional for 10 years. Certified Big data-Hadoop and Python Trainer. I...
Hello Big Data Enthusiast,
Many of you would have heard the term "Big Data" buzzing everywhere and wondered what it could be.
Ok, let's sort out things with an example.
Imagine you have a machine with 8 GB of storage, and you want to store 12 GB of data from a client and perform some analytics on it. Think of the possible ways in which you could store that data.
1. Extend your hard disk capacity to around 15 GB or beyond for successful storage.
2. Hire a cloud service and upload the data to the cloud for analysis; but if the client doesn't want to upload the data to the cloud for whatever reason, this option is ruled out.
3. Upload the data into a distributed file system, after analysing its pros and cons.
True, you can follow any one of the above mentioned cases.
This 12 GB of data, which is beyond the machine's storage capacity, is what we actually call BIG DATA.
This BIG DATA can be in any format/type, such as structured, unstructured or semi-structured.
Structured - RDBMS data (table data with proper rows, columns, keys, etc.)
Unstructured - Images, pictures, videos, etc.
Semi-structured - Files in formats such as HTML, XML, etc.
A dataset can be big data to you and not necessarily big data to another person.
Seems confusing? Going back to the previous example, if you have a machine with 20 GB of storage capacity, you can conveniently store the 12 GB, and it is not big data for you in the first place.
Hope you have now climbed a little way up the mountain of Big Data!
Big data can also be defined in another way, if it satisfies the criteria below.
Volume
If the size of the data you are planning to analyse is much bigger than the capacity of your machine, then call it big data.
Velocity
If the rate or speed at which data enters your machine increases exponentially with respect to time, then call it big data.
Variety
The data could be in any format, such as structured, unstructured or semi-structured, as we have seen previously.
Veracity
The data that we are going to process can contain some uncertain information or incorrect data.
Value
The data should make some sense to the business; that is, we should be able to derive some analysis out of it.
Now, putting all of the above information together, you can take your first step into learning Big Data.
We will see more insights about this on our next lesson.
Thank you!!
Answered on 23/12/2018 Learn Big Data
Hemanth Reddy
Hi Priyanka,
Spark has many components, such as Spark SQL, MLlib, etc.
Spark SQL is the component that works like Hive within Spark. You just need to open the Spark SQL console with the 'spark-sql' command on the Linux box, where you can run the same SQL. Spark SQL is much faster than Hive, so we usually go with Spark SQL rather than Hive. I think I have answered your question; if you still have any queries, you can reach out to me directly.
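As a quick illustration of the point above (the table and column names here are only hypothetical placeholders), the same HiveQL runs unchanged through Spark SQL, either from the spark-sql shell or from a HiveContext inside spark-shell:
// Minimal sketch: running plain HiveQL through Spark SQL (assumes a spark-shell built with Hive support).
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
// 'transaction' and its columns are hypothetical; replace them with your own Hive table.
val result = sqlContext.sql("select city, sum(amt) as total from transaction group by city")
result.show()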