
Know the Hive

Big data involves processing massive amounts of diverse information and delivering insights rapidly—often summed up by the four V's: volume, variety, velocity, and veracity. Data scientists and analysts need dedicated tools to help turn this raw information into actionable content, a potentially overwhelming task. Fortunately, some effective tools exist to make the task easier.

Hadoop is one of the most popular software frameworks designed to process and store Big Data. Hive, in turn, is a tool designed for use with Hadoop. This article explains Hive's role in big data and covers topics such as Hive architecture and optimization techniques.

Let’s start by understanding what Hive is in Hadoop.

What is Hive in Hadoop?

No one can better explain what Hive in Hadoop is than the creators of Hive themselves: "The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage."

In other words, Hive is an open-source system that processes structured data in Hadoop, sitting on top of it to summarize Big Data and make analysis and queries easier.
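As a concrete illustration of "projecting structure onto data already in storage," here is a minimal HiveQL sketch; the table name, columns, and HDFS path are hypothetical, and comma-delimited text files are assumed to already exist in that directory:

  -- Project a schema onto files that already live in HDFS (hypothetical path and columns)
  CREATE EXTERNAL TABLE page_views (
    user_id   STRING,
    page_url  STRING,
    view_time TIMESTAMP
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/raw/page_views';

  -- Query the files with familiar SQL syntax; Hive turns this into distributed jobs
  SELECT page_url, COUNT(*) AS views
  FROM page_views
  GROUP BY page_url;

Because the table is EXTERNAL, dropping it removes only the metadata held in the metastore; the files in HDFS stay exactly where they are.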

Now that we have covered what Hive in Hadoop is, let's look at its architecture and features.

Architecture of Hive

The major components of Hive and the way they interact with Hadoop are described below:

  • User Interface (UI) – As the name suggests, the user interface is how users interact with Hive. It lets them submit queries and other operations to the system. The Hive web UI, the Hive command line, and Hive HDInsight (on Windows Server) are all supported.
  • Hive Server – Also referred to as the Apache Thrift Server, it accepts requests from different clients and passes them on to the Hive driver.
  • Driver – The driver receives the user's queries once they have passed through the interface. It implements the concept of session handles and exposes execute and fetch APIs modeled on JDBC/ODBC interfaces.
  • Compiler – The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It then generates an execution plan with the help of table and partition metadata looked up from the metastore (see the EXPLAIN sketch after this list).
  • Metastore – The metastore stores the structural information for the tables and partitions in the warehouse, including column and column-type details, the serializers and deserializers needed to read and write data, and the locations of the corresponding HDFS files where the data is stored. Hive uses a relational database server to hold this schema and metadata: databases, tables, columns and their data types, and the HDFS mapping.
  • Execution Engine – The execution engine carries out the execution plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between these stages and runs each stage on the appropriate system components.
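To see the DAG of stages that the compiler produces, you can prefix a query with EXPLAIN. A minimal sketch, reusing the hypothetical page_views table from the earlier example:

  -- Ask the compiler for the execution plan instead of running the query
  EXPLAIN
  SELECT page_url, COUNT(*) AS views
  FROM page_views
  GROUP BY page_url;

The output lists the plan as a set of stages (for example, a map-reduce stage followed by a fetch stage) along with their dependencies, which is what the execution engine then walks through when the query actually runs.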
The job execution flow in Hive with Hadoop proceeds step by step as follows:
  • Step-1: Execute Query – A Hive interface such as the command line or web UI sends the query to the driver for execution. The UI calls the driver's execute interface over JDBC or ODBC.
  • Step-2: Get Plan – The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan.
  • Step-3: Get Metadata – The compiler sends a metadata request to the metastore.
  • Step-4: Send Metadata – The metastore sends the requested metadata back to the compiler.
  • Step-5: Send Plan – After checking the requirements, the compiler returns the finished execution plan to the driver.
  • Step-6: Execute Plan – The driver sends the execution plan to the execution engine, which submits the job and performs any required DFS (metadata) operations.
  • Step-7: Fetch Results – The UI asks the driver to fetch the query results.
  • Step-8: Send Results – The driver requests the results from the execution engine; once the execution engine has retrieved them from the data nodes, it returns them to the driver, which passes them back to the user interface (UI).

Hive's Features

These are Hive's chief characteristics:

  • Hive is designed for querying and managing only structured data stored in tables
  • Hive is scalable, fast, and uses familiar concepts
  • Schema gets stored in a database, while processed data goes into a Hadoop Distributed File System (HDFS)
  • Tables and databases get created first; then data gets loaded into the proper tables
  • Hive supports file formats such as ORC, SEQUENCEFILE, RCFILE (Record Columnar File), and TEXTFILE
  • Hive uses an SQL-inspired language, sparing the user from dealing with the complexity of MapReduce programming. It makes learning more accessible by relying on familiar relational-database concepts such as columns, tables, rows, and schemas
  • The most significant difference between the Hive Query Language (HQL) and SQL is that Hive executes queries on Hadoop's infrastructure instead of on a traditional database
  • Since Hadoop's programming works on flat files, Hive uses directory structures to "partition" data, improving performance on specific queries
  • Hive supports partitions and buckets for fast and simple data retrieval
  • Hive supports custom user-defined functions (UDF) for tasks like data cleansing and filtering. Hive UDFs can be defined according to programmers' requirements
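As a hedged sketch of that last point, this is roughly how a custom UDF is registered and used from HiveQL; the JAR path, class name, and function name are hypothetical placeholders, and the Java class implementing the UDF is assumed to exist already:

  -- Make the jar containing the UDF class available to the session (hypothetical path)
  ADD JAR /opt/udfs/text-cleaning-udfs.jar;

  -- Register the class under a name that queries can call (hypothetical class)
  CREATE TEMPORARY FUNCTION clean_text AS 'com.example.hive.udf.CleanText';

  -- Use it like any built-in function, for example during data cleansing
  SELECT clean_text(page_url) FROM page_views;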

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

  • Hive supports Online Analytical Processing (OLAP), but not Online Transaction Processing (OLTP).
  • It has limited support for subqueries.
  • It has a high latency.
  • By default, Hive tables don’t support delete or update operations (transactional ACID tables are an exception).

How Does Data Flow in Hive?

  1. The data analyst executes a query with the User Interface (UI).
  2. The driver interacts with the query compiler to retrieve the plan, which consists of the query execution process and metadata information. The driver also parses the query to check syntax and requirements.
  3. The compiler creates the job plan to be executed and sends a metadata request to the metastore.
  4. The metastore sends the metadata information back to the compiler.
  5. The compiler relays the proposed query execution plan to the driver.
  6. The driver sends the execution plans to the execution engine.
  7. The execution engine (EE) processes the query by acting as a bridge between Hive and Hadoop. The job runs as a MapReduce process: the execution engine sends the job to the JobTracker, which assigns it to TaskTrackers running on the data nodes. While this is happening, the execution engine executes metadata operations with the metastore.
  8. The results are retrieved from the data nodes.
  9. The results are sent to the execution engine, which, in turn, sends the results back to the driver and the front end (UI).

Since we have gone on at length about what Hive is, we should also touch on what Hive is not:

  • Hive isn't a language for row-level updates and real-time queries
  • Hive isn't a relational database
  • Hive isn't designed for Online Transaction Processing

Now that we have looked at what Hive is, let's learn about Hive modes.

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

  • Local mode
  • Map-reduce mode

Use Local mode when:

  • Hadoop is installed under the pseudo mode, possessing only one data node
  • The data size is smaller and limited to a single local machine
  • Users expect faster processing because the local machine contains smaller datasets.

Use Map Reduce mode when:

  • Hadoop has multiple data nodes, and the data is distributed across these different nodes
  • Users must deal with larger data sets

MapReduce is Hive's default mode.
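Which mode a query runs in can also be influenced from the Hive session itself. A minimal sketch, assuming a reasonably recent Hive release; the property names are standard Hive settings, and the threshold values are illustrative rather than recommendations:

  -- Let Hive automatically run small queries locally instead of submitting a MapReduce job
  SET hive.exec.mode.local.auto=true;
  -- Stay local only while the input is small (illustrative limits: 128 MB, at most 4 files)
  SET hive.exec.mode.local.auto.inputbytes.max=134217728;
  SET hive.exec.mode.local.auto.input.files.max=4;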

Hive and Hadoop on AWS

Amazon Elastic MapReduce (EMR) is a managed service that lets you use big data processing frameworks such as Spark, Presto, HBase, and, yes, Hadoop to analyze and process large data sets. Hive, in turn, runs on top of Hadoop clusters and can be used to query data residing in Amazon EMR clusters using a SQL-like language.
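On EMR, the data being queried often sits in Amazon S3 rather than in HDFS. A hedged sketch of what that looks like in HiveQL; the bucket name, prefix, and columns are hypothetical:

  -- Project a schema onto objects already stored in an S3 bucket (hypothetical location)
  CREATE EXTERNAL TABLE clickstream (
    event_time TIMESTAMP,
    user_id    STRING,
    url        STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://my-example-bucket/clickstream/';

  SELECT user_id, COUNT(*) AS events
  FROM clickstream
  GROUP BY user_id;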

Hive and IBM Db2 Big SQL

Data analysts can query Hive transactional (ACID) tables straight from Db2 Big SQL, although Db2 Big SQL can only see compacted data in the transactional table. Data modification statement results won’t be seen by any queries generated in Db2 Big SQL until you perform a compaction operation, which places data in a base directory.
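Compaction can be triggered from Hive itself. A minimal sketch, assuming a hypothetical transactional table named orders_acid; compaction can also run automatically in the background when the cluster is configured for it:

  -- Request a major compaction so the accumulated delta files are merged into a new base directory
  ALTER TABLE orders_acid COMPACT 'major';

  -- Check the status of queued and running compactions
  SHOW COMPACTIONS;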

Hive vs. Relational Databases

A relational database, or RDBMS, stores data in a structured format of rows and columns known as tables. Hive, on the other hand, is a data warehousing system that offers data analysis and querying on top of Hadoop.

Here’s a handy chart that illustrates the differences at a glance:

Relational Database                     Hive
Maintains a database                    Maintains a data warehouse
Fixed schema                            Varied schema
Sparse tables                           Dense tables
Doesn’t support partitioning            Supports automatic partitioning
Stores normalized data                  Stores both normalized and denormalized data
Uses SQL (Structured Query Language)    Uses HQL (Hive Query Language)

Hive Optimization Techniques

Data analysts who want to optimize their Hive queries and make them run faster in their clusters should consider the following hacks:

  • Partition your data to reduce read time within your directory, or else all the data will get read (see the sketch after this list)
  • Use appropriate file formats such as the Optimized Row Columnar (ORC) to increase query performance. ORC reduces the original data size by up to 75 percent
  • Divide table sets into more manageable parts by employing bucketing
  • Improve aggregations, filters, scans, and joins by vectorizing your queries. Perform these functions in batches of 1024 rows at once, rather than one at a time
  • Create a separate index table that functions as a quick reference for the original table.
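Here is a hedged sketch pulling several of the techniques above together: a table that is partitioned, bucketed, and stored as ORC, plus the session setting that turns on vectorized execution. The table, column, and partition names are hypothetical, and the bucket count is purely illustrative:

  -- Process rows in batches of 1024 instead of one at a time
  SET hive.vectorized.execution.enabled=true;

  -- Partition by date so queries filtering on dt read only the matching directories,
  -- bucket by user_id to split each partition into manageable chunks,
  -- and store the data as ORC to shrink it and speed up scans
  CREATE TABLE page_views_optimized (
    user_id   STRING,
    page_url  STRING,
    view_time TIMESTAMP
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

With this layout, a query that filters on dt touches only the matching partition directories rather than scanning the whole table.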
