Apache Hive is a data warehouse tool that uses SQL-like commands to query and analyze data on top of the Hadoop ecosystem. Hive reduces the programmer's task of writing complex Hadoop MapReduce programs by providing a mechanism for structuring query statements in the Hive Query Language (HQL), which is similar to SQL.
You don't have to worry about writing complex MapReduce programs to process data, since the Hive compiler converts HQL queries into MapReduce jobs behind the scenes.
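As a minimal sketch of what this saves you, the classic word-count job, which takes dozens of lines of MapReduce code, reduces to a single HQL statement. The `docs` table and its `line` column below are hypothetical:

```sql
-- Hypothetical table holding raw text lines, one per row.
CREATE TABLE docs (line STRING);

-- A classic word count in one statement; Hive compiles this
-- into the equivalent MapReduce job automatically.
SELECT word, COUNT(*) AS word_count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) words
GROUP BY word;
```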
Hive use cases
- Data encapsulation
- Running ad hoc (on-demand) queries
- Analyzing large volumes of data
HQL queries execute in the Hadoop environment, while SQL statements execute against a traditional database.
Hive supports partitioning, which makes it easy to retrieve data when you execute a query: only the partitions relevant to the query are scanned.
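As a hedged illustration, the hypothetical `sales` table below is partitioned by country, so a query that filters on the partition column reads only the matching partition's directory in HDFS:

```sql
-- Hypothetical table, partitioned by the country column.
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING);

-- Static partitioning: load rows into one named partition.
INSERT INTO TABLE sales PARTITION (country = 'KE')
VALUES (1, 250.0), (2, 99.5);

-- Filtering on the partition column prunes all other partitions.
SELECT order_id, amount
FROM sales
WHERE country = 'KE';
```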
Apache Hive supports Data Manipulation Language (DML) and Data Definition Language (DDL) statements, as well as User Defined Functions (UDFs).
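A brief sketch of each of these, using a hypothetical `employees` table; the built-in `upper` function stands in for a UDF here:

```sql
-- DDL: define the table's structure.
CREATE TABLE employees (id INT, name STRING, salary DOUBLE);

-- DML: load and query data.
INSERT INTO TABLE employees VALUES (1, 'alice', 75000.0);
SELECT * FROM employees;

-- UDFs: built-in functions (or ones you register yourself)
-- are called directly inside queries.
SELECT upper(name) FROM employees;
```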
Difference between the Hive distributed database and a relational database
Hive has additional functionality not found in a traditional relational database, which makes it practical to process data volumes measured in petabytes. Because queries run in parallel across the cluster, analyzing such large datasets remains fast and efficient.
A relational database uses schema-on-write: a user creates a table with a fixed schema, and data is validated against that schema as it is written. Operations such as insert, update, and delete are used to modify the database.
Hive, by contrast, uses schema-on-read: the schema is applied when data is read, so row-level operations such as update and other modifications were traditionally not supported.
Newer versions of Hive (0.14 and later) added ACID support, including UPDATE and DELETE statements that let you modify rows in a table, as sketched below.
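A hedged sketch of row-level modification, assuming the cluster has Hive transactions enabled; ACID tables must be stored as ORC, and older releases also require bucketing. The `accounts` table is hypothetical:

```sql
-- Transactional (ACID) table: ORC storage is required, and older
-- Hive releases also require bucketing (CLUSTERED BY).
CREATE TABLE accounts (id INT, balance DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE accounts VALUES (1, 100.0), (2, 50.0);

-- Row-level modifications work only on transactional tables.
UPDATE accounts SET balance = balance + 25.0 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```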
Hadoop Hive Architecture
- Hive clients
Apache Hive client applications can be written in C++, Java, Python, and other languages; users can write their applications in the language of their choice.
Hive supports several types of client applications for issuing queries. These include:
- Thrift clients: The Apache Hive server is built on Apache Thrift, so it can receive requests from any language with Thrift support.
- JDBC clients: Java applications connect to Apache Hive through the JDBC driver.
- ODBC clients: The ODBC driver connects applications that support the ODBC protocol to Apache Hive. Like the JDBC driver, it uses Thrift to communicate with the Hive server.
- Hive services
This component offers several services for query operations. These include:
- Web Interface: Provides a web-based GUI for executing Hive queries and commands.
- CLI: The Command Line Interface acts as the default shell for Hive, allowing you to execute queries and commands directly.
- Hive Server: Also known as the Thrift server, since it is built on Apache Thrift. The server allows different clients to submit requests to the Hive system and retrieve the final results.
- Hive Driver: This driver receives queries sent by Hive clients through Thrift, ODBC, JDBC, the CLI, or the web UI.
The received queries are passed to the compiler, where parsing, type checking, and semantic analysis are performed with the help of the schema in the metastore.
After compilation, the optimizer generates an optimized logical plan in the form of a directed acyclic graph (DAG).
The execution engine then executes the plan's tasks in the order of their dependencies using Hadoop.
- Metastore: Acts as the central repository in the Hive architecture, storing metadata for Hive tables and their partitions; the metastore itself is typically backed by a relational database. Clients can access the stored metadata through the metastore service API.
- Processing and resource management
The MapReduce framework is used to execute queries internally.
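As a hedged note, newer Hive releases can also run queries on Tez or Spark where those engines are installed; the engine is selected with a standard configuration property:

```sql
-- MapReduce is the default execution engine.
SET hive.execution.engine=mr;
-- Alternatives, where available on the cluster:
-- SET hive.execution.engine=tez;
-- SET hive.execution.engine=spark;
```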
- Distributed storage
Hive is built on top of the Hadoop ecosystem and uses the Hadoop Distributed File System (HDFS) as its distributed storage. The table data itself lives in HDFS, while the structural details of that data are stored in the metastore database.
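To see this split in practice, the standard `DESCRIBE FORMATTED` command prints what the metastore knows about a table, including the HDFS directory that holds its data (the table name reuses the hypothetical example from earlier):

```sql
-- Prints the schema plus metastore-held details such as the
-- table's HDFS location, storage format, and owner.
DESCRIBE FORMATTED employees;
```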
How a query executes within Apache Hive
- A query is submitted through the user interface (UI) to the driver.
- The driver sends a query to the compiler in order to generate an execution plan.
- The compiler submits a request to the metastore to get metadata details of the job to be executed.
- The compiler receives the metadata details from the metastore.
- The compiler uses the received metadata to validate the query, generates the execution plan, and sends it back to the driver.
- The generated plan is sent to the execution engine.
- The execution engine submits each stage of the plan's operator tree to the appropriate component, acting as the bridge between the Hive system and the Hadoop ecosystem. It contacts the NameNode and then the relevant DataNodes to reach the stored data.
The engine fetches the actual data from the DataNodes and the file metadata from the NameNode in the Hadoop ecosystem.
In the case of DDL statements, the execution engine communicates with the metastore to carry out those operations.
The execution engine then runs the query's tasks on top of the Hadoop infrastructure.
- The driver reads the results from the temporary files and passes them to the user interface.
- The results are sent from the DataNodes back to the driver and on to the UI.
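You can inspect the plan this pipeline produces for any statement with the standard `EXPLAIN` command; the query below reuses the hypothetical `sales` table from earlier:

```sql
-- Shows the stage DAG (the dependency-ordered plan) that the
-- execution engine will run for this query.
EXPLAIN
SELECT country, SUM(amount) AS total
FROM sales
GROUP BY country;
```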
Advantages of using Apache Hive
- Suitable for people with little programming knowledge, since you don't have to write complex code against the MapReduce framework.
- The Hive query language is extensible and scalable. HQL can cope with growing volumes of data from multiple sources without affecting the performance of the Hive Hadoop system.
- During query execution, Apache Hive reduces the time needed for semantic checks, because the metadata is stored in an RDBMS-backed metastore.
Structured data in the Apache Hive environment is stored in the Hadoop Distributed File System (HDFS), while the MapReduce framework executes the queries.
Hive provides users with a flexible query language to easily query and process structured data, and it offers additional data-processing features compared to a plain RDBMS.