07 - HBase Database

7.1 What is HBase

HBase is a database system built on top of the Hadoop environment. It is a distributed, column-oriented database designed for real-time read/write access to very large datasets.

Installation

First, we need to download the HBase tar file from the Apache website, copy it to our file system, and unpack it with the following command

  • % tar xzf hbase-x.y.z.tar.gz

As with the other Hadoop installables, we first need to tell HBase where Java is located on our file system. Make sure the JAVA_HOME environment variable is set to point to a suitable Java installation. Otherwise, we need to set the Java installation that HBase uses by editing the file

  • conf/hbase-env.sh

For convenience, we can add the HBase binary directory to the command-line path by issuing the following commands

  • % export HBASE_HOME=/home/hbase/hbase-x.y.z
  • % export PATH=$PATH:$HBASE_HOME/bin

Once HBase is installed successfully, a few simple commands let us make sure everything is working.

To start an HBase instance - % start-hbase.sh

To administer an HBase instance - % hbase shell

Running hbase with no arguments lists its various options:

% hbase

shell - runs the HBase shell

master - runs an HBase HMaster node

regionserver - runs an HBase HRegionServer node

zookeeper - runs a ZooKeeper server

rest - runs an HBase REST server

thrift - runs an HBase Thrift server

avro - runs an HBase Avro server

migrate - upgrades an hbase.rootdir

hbck - runs the HBase 'fsck' tool

7.2 Understand the Basics of HBase

HBase tables have a structure superficially similar to an RDBMS: horizontal rows and vertical columns, where the intersection of a row and a column is the cell in which an individual data value resides. In HBase, a table's column families must be specified up front as part of the table schema definition. However, new column family members (columns) can be added as required; for example, a new column can be added at any time as part of an update. Although we previously described HBase as a column-oriented database, it is more precise to describe it as a column-family-oriented database, because storage specifications and tunings are done at the column-family level.
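
As a brief illustration, here is a minimal sketch of defining a table with one column family up front, assuming the HBase 2.x Java Admin API; the table and family names are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class CreateTableExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // The column family 'info' is fixed in the schema up front;
                // individual columns inside it are created at write time.
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("tableA"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                        .build());
            }
        }
    }

Note that the individual columns inside 'info' need no declaration; they come into existence the first time data is written to them.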

In HBase, tables are automatically partitioned horizontally into regions. Each region is made up of a subset of a table's rows. Initially a table comprises a single region, but as the region grows, once it crosses a fixed size threshold it splits at a row boundary into two new regions of approximately equal size. Note that until this first split happens, all data loading goes against the single server hosting the original region. Regions are distributed over an HBase cluster. In this way, a table that is too big for any single server to handle is carried by a cluster of servers, with each node hosting a subset of the table's total data.
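
Since all loading hits a single server until the first split, a table that will receive heavy initial writes can be pre-split at creation time. A minimal sketch, assuming the same HBase 2.x Admin API as above and illustrative split keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTableExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // Row-key boundaries at which the table is split up front.
                byte[][] splitKeys = {
                    Bytes.toBytes("g"),   // region 2 starts here
                    Bytes.toBytes("n"),   // region 3 starts here
                    Bytes.toBytes("t")    // region 4 starts here
                };
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("tableB"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                        .build(),
                    splitKeys);  // table starts with four regions, not one
            }
        }
    }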

HBase is modeled on a master-slave architecture, just like HDFS and MapReduce in the Hadoop framework. The HBase master is responsible for bootstrapping, assigning regions to registered regionservers, and recovering from regionserver failures.

HBase depends on ZooKeeper, and by default it manages a ZooKeeper instance itself. The following diagram explains how the communication takes place among the HBase master, the ZooKeeper cluster, the regionservers, and HDFS.
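
One practical consequence of this arrangement is that client code bootstraps through the ZooKeeper quorum rather than through the master. A minimal sketch of pointing a client at the quorum; the host names and port are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ClientBootstrapExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Clients locate the cluster via the ZooKeeper quorum,
            // not by contacting the master directly.
            conf.set("hbase.zookeeper.quorum",
                     "zk1.example.com,zk2.example.com,zk3.example.com");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            try (Connection connection = ConnectionFactory.createConnection(conf)) {
                System.out.println("Connected: " + !connection.isClosed());
            }
        }
    }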

7.3 HBase vs. RDBMS

HBase and an RDBMS are two very different databases; they differ in how they store data and in their implementation. Compared to the traditional RDBMS approach, which is row-oriented, HBase is column-oriented. An RDBMS is not built to support very large-scale data, while HBase is. Many RDBMS vendors offer add-on solutions such as replication and partitioning to support large datasets, but these force queries through too many joins and query performance slows; reads generally remain fine, but writes become too slow. In the end you will find that these add-ons are complicated to install and maintain, and that you are compromising some of the best features of an RDBMS: joins, complex queries, triggers, views, and foreign-key constraints are all severely compromised.

For small to medium sized volumes, RDBMS-based database solutions are good; here an RDBMS provides ease of use, flexibility, and a powerful feature set. However, if you need to scale up to read, write, and process very large datasets, you will find that the RDBMS cannot handle it properly. Scaling an RDBMS effectively forces you to break Codd's rules and loosen the ACID restrictions, and ultimately you end up losing most of the desirable properties of a relational database management system.

Here are the features of HBase that make it different:

  • In HBase, rows are stored sequentially by row key, as are the columns within each row, so there is no real index as such. A useful consequence is that insert performance is independent of table size (a range scan that exploits this ordering is sketched after this list).
  • HBase provides automatic partitioning: as the data in a table grows, it is automatically split into regions and distributed across all available nodes.
  • Scaling is nearly transparent: when we add a node, point it at the existing cluster, and run a regionserver, regions rebalance automatically and load is spread evenly.
  • HBase can run on commodity hardware, which makes operating it comparatively cheap.
  • HBase is highly fault tolerant. It runs on many nodes, so each individual node is relatively insignificant and there is no need to worry about individual node failures.
  • HBase supports batch processing, providing fully parallel, distributed data processing.
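
To illustrate the row-key ordering, here is a minimal range-scan sketch, assuming the HBase 2.x Java client; the table, family, qualifier, and row keys are illustrative. Because rows are stored sorted by key, this scan needs no index:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("tableA"));
                 ResultScanner scanner = table.getScanner(
                     new Scan()
                         .withStartRow(Bytes.toBytes("row100"))     // inclusive
                         .withStopRow(Bytes.toBytes("row200")))) {  // exclusive
                // Rows come back in sorted row-key order.
                for (Result result : scanner) {
                    byte[] value = result.getValue(
                        Bytes.toBytes("info"), Bytes.toBytes("name"));
                    System.out.println(Bytes.toString(result.getRow())
                        + " => " + Bytes.toString(value));
                }
            }
        }
    }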

Ultimately, if you are really worried about the growing size of your database, consider switching from an RDBMS to HBase.

7.4 HBase - A Practical Approach

As we know, HBase is a database used to read and write data. Let us look at some practical aspects of the same:

  • In the shell, we define our table as follows:

hbase(main):036:0> create 'tableA', {NAME => 'info', VERSIONS => 1}

  • HBase has various methods of loading data into tables. The most straightforward methods are:
    • To use the TableOutputFormat class from a MapReduce job
    • To use the normal client APIs

However, these may not always be the most efficient methods to work with.
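
As a sketch of the second route, the normal client API, here is a minimal single-row write, again assuming the HBase 2.x Java client and the illustrative 'tableA'/'info' names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("tableA"))) {
                Put put = new Put(Bytes.toBytes("row1"));    // row key
                put.addColumn(Bytes.toBytes("info"),         // column family
                              Bytes.toBytes("name"),         // column qualifier
                              Bytes.toBytes("Alice"));       // cell value
                table.put(put);  // one RPC per put; batch a List<Put> for volume
            }
        }
    }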

The bulk load feature uses a MapReduce job to output table data in HBase's internal data format and then directly loads the generated StoreFiles into a running cluster. Bulk loading uses less CPU and consumes less network bandwidth than simply using the HBase API.

  • If you have a plain file and want to load it into HBase, here are the different options:
  • Use tools in HBase like importtsv and completebulkload
  • We can write a Pig script, something like:

       TableA = LOAD '/mytext.txt' USING PigStorage(',') AS (name:chararray, id:long);

       STORE TableA INTO 'hbase://mydata'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:intdata');

Here the first field (name) becomes the HBase row key, and the remaining field (id) is stored in the column mycf:intdata.

  • Create a program using the HBase API
  • Prepare a MapReduce job
  • Write a Hive program to bulk load data into HBase
  • HBase data files are known as HFiles. Note that a bulk load bypasses the HBase API and writes the data directly in the HFile format. Always remember that HBase can also be used as a data source.
  • If you understand HBase's architecture, you know that tables are split into a number of regions. To work correctly, HFileOutputFormat must be configured so that each HFile it generates fits into a single region. To achieve this, Hadoop's TotalOrderPartitioner is used to partition the map output into the key ranges of the regions in the HBase table. The configureIncrementalLoad() function, which is part of HFileOutputFormat, automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.
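
A minimal driver sketch of this setup, assuming the HBase 2.x mapreduce package (where the class is named HFileOutputFormat2); the input path, output path, table, and column names are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Turns hypothetical "rowkey,value" text lines into Puts.
        public static class LineMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split(",", 2);
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes(parts[1]));
                context.write(
                    new ImmutableBytesWritable(Bytes.toBytes(parts[0])), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-bulk-load");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(LineMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/mytext.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("tableA"));
                 RegionLocator locator =
                     connection.getRegionLocator(table.getName())) {
                // Wires in a TotalOrderPartitioner keyed on the table's current
                // region boundaries so that each HFile fits a single region.
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            }

            // The HFiles written under /tmp/hfiles can then be moved into the
            // running cluster with the completebulkload tool mentioned above.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }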
