Hive Interview Questions and Answers for 2024

Compilation of Hive Interview Questions and Answers for freshers and experienced that are most likely to be asked in Hadoop job interviews in 2024.

Get access to all Big Data Careers Projects View all Big Data Careers Projects

Hive Interview Questions and Answers for 2024

Last Updated: 11 Apr 2024 | BY ProjectPro

Preparing for a Hadoop job interview then this list of most commonly asked Hive Interview questions and answers will help you ace your hadoop job interview.These Hive Interview questions and answers are formulated just to make candidates familiar with the nature of questions that are likely to be asked in a Hadoop job interview on the subject of Hive.

Build a big data pipeline with AWS Quicksight, Druid, and Hive

Downloadable solution code | Explanatory videos | Tech Support

Start Project

Hadoop Hive Interview Questions and Answers
Scenario based or Real-Time Interview Questions on Hadoop Hive
Other Interview Questions on Hadoop Hive

Hadoop Hive Interview Questions and Answers

1) What is the difference between Pig and Hive ?

**Pig vs Hive**
Criteria	Pig	Hive
Type of Data	Apache Pig is usually used for semi structured data.	Used for Structured Data
Schema	Schema is optional.	Hive requires a well-defined Schema.
Language	It is a procedural data flow language.	Follows SQL Dialect and is a declarative language.
Purpose	Mainly used for programming.	It is mainly used for reporting.
General Usage	Usually used on the client side of the hadoop cluster.	Usually used on the server side of the hadoop cluster.
Coding Style	Verbose	More like SQL

For a detailed answer on the difference between Pig and Hive, refer this link -

/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosystem/79

Ace Your Next Job Interview with Mock Interviews from Experts to Improve Your Skills and Boost Confidence!

2) What is the difference between HBase and Hive ?

Hive vs HBase
HBase	Hive
HBase does not allow execution of SQL queries.	Hive allows execution of most SQL queries.
HBase runs on top of HDFS.	Hive runs on top of Hadoop MapReduce.
HBase is a NoSQL database.	Hive is a datawarehouse framework.
Supports record level insert, updated and delete operations.	Does not support record level insert, update and delete.

Get FREE Access to Data Analytics Example Codes for Data Cleaning, Data Munging, and Data Visualization

2) I do not need the index created in the first question anymore. How can I delete the above index named index_bonuspay?

DROP INDEX index_bonuspay ON employee;

Test Your Practical Hadoop Knowledge

3) Can you list few commonly used Hive services?

Command Line Interface (cli)
Hive Web Interface (hwi)
HiveServer (hiveserver)
Printing the contents of an RC file using the tool rcfilecat.
Jar
Metastore

Here's what valued users are saying about ProjectPro

ProjectPro is a unique platform and helps many people in the industry to solve real-life problems with a step-by-step walkthrough of projects. A platform with some fantastic resources to gain hands-on experience and prepare for job interviews. I would highly recommend this platform to anyone...

Anand Kumpatla

Sr Data Scientist @ Doubleslash Software Solutions Pvt Ltd

Having worked in the field of Data Science, I wanted to explore how I can implement projects in other domains, So I thought of connecting with ProjectPro. A project that helped me absorb this topic was "Credit Risk Modelling". To understand other domains, it is important to wear a thinking cap and...

Gautam Vermani

Data Consultant at Confidential

Not sure what you are looking for?

View All Projects

4) Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?

Hive 0.13.0 and above version support SHOW TRANSACTIONS command that helps administrators monitor various hive transactions.

FREE eBook on 250 Hadoop Interview Questions and Answers

5) What is the use of Hcatalog?

Hcatalog can be used to share data structures with external systems. Hcatalog provides access to hive metastore to users of other tools on Hadoop so that they can read and write data to hive’s data warehouse.

6) Write a query to rename a table Student to Student_New.

Alter Table Student RENAME to Student_New

**question**

7) Where is table data stored in Apache Hive by default?

hdfs: //namenode_server/user/hive/warehouse

8) Explain the difference between partitioning and bucketing.

Partitioning and Bucketing of tables is done to improve the query performance. Partitioning helps execute queries faster, only if the partitioning scheme has some common range filtering i.e. either by timestamp ranges, by location, etc. Bucketing does not work by default.
Partitioning helps eliminate data when used in WHERE clause. Bucketing helps organize data inside the partition into multiple files so that same set of data will always be written in the same bucket. Bucketing helps in joining various columns.
In partitioning technique, a partition is created for every unique value of the column and there could be a situation where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
Basically, a bucket is a file in Hive whereas partition is a directory.

9) Explain about the different types of partitioning in Hive?

Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.

Based on how data is loaded into the table, requirements for data and the format in which data is produced at source- static or dynamic partition can be chosen. In dynamic partitions the complete data in the file is read and is partitioned through a MapReduce job based into the tables based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.

When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.

Prepare for Your Next Big Data Job Interview with Kafka Interview Questions and Answers

New Projects

10) When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?

When running Hive in embedded mode, it creates a local metastore. When you run the query, it first checks whether a metastore already exists or not. The property javax.jdo.option.ConnectionURL defined in the hive-site.xml has a default value jdbc: derby: databaseName=metastore_db; create=true.

The value implies that embedded derby will be used as the Hive metastore and the location of the metastore is metastore_db which will be created only if it does not exist already. The location metastore_db is a relative location so when you run queries from different directories it gets created at all places from wherever you launch hive. This property can be altered in the hive-site.xml file to an absolute path so that it can be used from that particular location instead of creating multiple metastore_db subdirectory multiple times.

11) How will you read and write HDFS files in Hive?

i) TextInputFormat- This class is used to read data in plain text file format.

ii) HiveIgnoreKeyTextOutputFormat- This class is used to write data in plain text file format.

iii) SequenceFileInputFormat- This class is used to read data in hadoop SequenceFile format.

iv) SequenceFileOutputFormat- This class is used to write data in hadoop SequenceFile format.

12) What are the components of a Hive query processor?

Query processor in Apache Hive converts the SQL to a graph of MapReduce jobs with the execution time framework so that the jobs can be executed in the order of dependencies. The various components of a query processor are-

Parser
Semantic Analyser
Type Checking
Logical Plan Generation
Optimizer
Physical Plan Generation
Execution Engine
Operators
UDF’s and UDAF’s.

Get More Practice, More Big Data and Analytics Projects, and More guidance.Fast-Track Your Career Transition with ProjectPro

13) Differentiate between describe and describe extended.

Describe database/schema- This query displays the name of the database, the root location on the file system and comments if any.

Describe extended database/schema- Gives the details of the database or schema in a detailed manner.

14) Is it possible to overwrite Hadoop MapReduce configuration in Hive?

Yes, hadoop MapReduce configuration can be overwritten by changing the hive conf settings file.

15) I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?

Hive allows execution of UNIX commands with the use of exclamatory (!) symbol. Just use the ! Symbol before the command to be executed at the hive prompt. To see the present working directory in UNIX from hive run !pwd at the hive prompt.

16) What is the use of explode in Hive?

Explode in Hive is used to convert complex data types into desired table formats. explode UDTF basically emits all the elements in an array into multiple rows.

17) Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.

SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.

ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a single reducer.

DISTRUBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same distribute by columns will go to the same reducer.

CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY where each of the N reducers gets non overlapping range of data which is then sorted by those ranges at the respective reducers.

18) Difference between HBase and Hive.

HBase is a NoSQL database whereas Hive is a data warehouse framework to process Hadoop jobs.
HBase runs on top of HDFS whereas Hive runs on top of Hadoop MapReduce.

19) Write a hive query to view all the databases whose name begins with “db”

SHOW DATABASES LIKE ‘db.*’

20) How can you prevent a large job from running for a long time?

This can be achieved by setting the MapReduce jobs to execute in strict mode set hive.mapred.mode=strict;

The strict mode ensures that the queries on partitioned tables cannot execute without defining a WHERE clause.

**question**

21) What is a Hive Metastore?

Hive Metastore is a central repository that stores metadata in external database.

22) Are multiline comments supported in Hive?

23) What is ObjectInspector functionality?

ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.

24) Explain about the different types of join in Hive.

HiveQL has 4 different types of joins –

JOIN- Similar to Outer Join in SQL

FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.

LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.

RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.

25) How can you configure remote metastore mode in Hive?

To configure metastore in Hive, hive-site.xml file has to be configured with the below property –

hive.metastore.uris

thrift: //node1 (or IP Address):9083

IP address and port of the metastore host

26) Is it possible to change the default location of Managed Tables in Hive, if so how?

Yes, we can change the default location of Managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value to the LOCATION keyword.

27) How data transfer happens from HDFS to Hive?

If data is already present in HDFS then the user need not LOAD DATA that moves the files to the /user/hive/warehouse/. So the user just has to define the table using the keyword external that creates the table definition in the hive metastore.

Create external table table_name (

id int,

myfields string

)

location '/my/location/in/hdfs';

28) In case of embedded Hive, can the same metastore be used by multiple users?

We cannot use metastore in sharing mode. It is suggested to use standalone real database like PostGreSQL and MySQL.

29) The partition of hive table has been modified to point to a new directory location. Do I have to move the data to the new location or the data will be moved automatically to the new location?

Changing the point of partition will not move the data to the new location. It has to be moved manually to the new location from the old one.

30) What will be the output of cast (‘XYZ’ as INT)?

It will return a NULL value.

Build Industry Ready Apache Hive Projects to Land a Top Data Gig

31) What are the different components of a Hive architecture?

Hive Architecture consists of a –

User Interface – UI component of the Hive architecture calls the execute interface to the driver.
Driver create a session handle to the query and sends the query to the compiler to generate an execution plan for it.
Metastore - Sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
Compiler- Compiler generates the execution plan which is a DAG of stages where each stage is either a metadata operation, a map or reduce job or an operation on HDFS.
Execute Engine- Execution engine is responsible for submitting each of these stages to the relevant components by managing the dependencies between the various stages in the execution plan generated by the compiler.

32) What happens on executing the below query? After executing the below query, if you modify the column –how will the changes be tracked?

Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)

AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

The query creates an index named index_bonuspay which points to the bonus column in the employee table. Whenever the value of bonus is modified it will be stored using an index value.

33) What is the default database provided by Hive for Metastore ?

Derby is the default database.

34) Is it possible to compress json in Hive external table ?

Yes, you need to gzip your files and put them as is (*.gz) into the table location.

35) What is the difference between local and remote metastore?

A local metastore runs on the same JVM (Java Virtual Machine) in which the Hive service is running. It connects to a database running in a separate JVM that can either be on the same machine or a remote machine.

In a remote metastore, the service runs on its JVM differently from the JVM on which the Hive service runs. Other processes can communicate with the server containing the metastore using Thrift Network APIs. There can be more than one metastore server to provide more availability.

36) Why does Hive not support storage of metadata information in HDFS?

Hive stores the metadata in RDBMS rather than HDFS. RDBMS is chosen primarily for low latency since reads and writes to the HDFS are time-consuming. The storage for metastore must be optimized for online transactions involving random accesses and updates. HDFS is optimized for sequential scans and not for random access, and hence it is not a suitable choice for the storage of metadata.

37) What is the difference between external tables and managed tables in Hive?

Managed tables	External tables
Managed tables are stored by default in the Hive warehouse in HDFS. The default storage location of managed tables can be modified.	When an external table is created, the location from where the data is taken from the HDFS has to be mentioned. These tables take data from other locations on the HDFS.
When managed tables are dropped, their metadata and table data are both permanently deleted.	When external tables are dropped, only the metadata associated with that particular table is deleted from Hive. The tables take data from other locations on the HDFS and this data remains untouched.
Hive has full ownership of the data on managed tables.	Hive does not have any ownership of the data in External tables, it only has ownership of the metadata that it creates for that table when the table is created using data from an external location.
Support Query results caching in cases where the result of one query can be cached for use in another query.	No support for query result caching.
ACID (atomicity, consistency, isolation, durability) transaction property is supported.	There is no support for ACID transaction property.
Support ARCHIVE, UNARCHIVE, TRUNCATE, MERGE, CONCATENATE commands.	No support for ARCHIVE, UNARCHIVE, TRUNCATE, MERGE, CONCATENATE commands.

38) What is the purpose of partitioning in Hive?

Partitioning in Hive refers to the process of dividing the table into smaller parts based on the values of a particular column, such as date, city, age or country. Partitioning is done so that data can be stored in slices, enabling the query response to become faster. Since Hadoop is used to handle large amounts of data, it is important to find ways to access the data using the best approach.

39) Mention some limitations of Hive.

Apache Hive does not support update and delete operations on the rows of table data.
The latency for Hive queries is pretty high. Hadoop was built for long sequential scans, and Hive is built on Hadoop and has a high latency.
Hive cannot be used for real-time data querying since it takes quite some time to give the results.
Apache Hive does not support subqueries.
Hive does not support online transaction processing (OLTP) and only supports online analytical processing (OLAP). This is because Hive does not support update or deletion at the row level, and it is not suitable for real-time processing. Hive is ideal for processes where the response time of the query is not highly interactive.

Get confident to build end-to-end projects

Access to a curated library of 250+ end-to-end industry projects with solution code, videos and tech support.

Request a demo

40) How can you change the data type of a column in Hive?

You can use the following command in Hive to change the data type of a column name.

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

E.g. for a table named “Students” with a column “age”. To change the data type of the “age” column from String to int, we would use the following command -

ALTER TABLE Students CHANGE age age int;

41) Can Hive be used in OLTP systems?

Hive cannot be used to support online transaction processing (OLTP), but can only be used for online analytical processing (OLAP). Hive does not support update or delete at the row level and is not suitable for real-time query processing.

42) What is the purpose of archiving tables in Hive?

Hadoop archives are used as an approach to reduce the number of files in partitions. Apache Hive has built-in support for converting files on existing partitions to a Hadoop Archive. In this manner, a partition that would have once consisted of hundreds of files can be allowed to occupy just around 3 files. This number can vary depending on the settings. The trade-off in such a case is that the queries may end up having a higher latency, caused by the additional overhead of reading from Hadoop Archive. Archiving does not compress the files, but works similar to the Linux ‘tar’ command.

43) Is there any method to delete the DBPROPERTY in Hive?

There is no way to delete or “unset” the DBPROPERTY in Hive.

44) How can you list all the databases which end with ‘e’ in Hive?

The following command can be used:

SHOW DATABASES LIKE ‘*e’;

45) Differentiate between local mode and MapReduce mode in Hive.

Hive can operate in Local mode or the MapReduce mode depending on the size of data nodes in Hadoop.

Local mode is to be used if the data size is smaller and can be limited to a single local machine. Processing of smaller data sets will be faster using the local mode.
Hive may run in local mode if the following conditions are met:
The local mode can be used in cases where the total input size of the job is lower than the value specified by hive.exec.mode.local.auto.inputbytes.max, which is 128MB by default
the local mode is applicable if the value of the total number of map tasks is less than hive.exec.mode.local.auto.tasks.max, which is 4 by default
The total number of reduce tasks is 1 or 0.

MapReduce mode is the default mode that is used. If the Hadoop system has multiple data nodes, with the data distributed across the nodes, then MapReduce mode is used. MapReduce mode is more suitable for large data sets and better performance can be achieved through this mode.

46) What is the purpose of ORC tables in Hive?

In Hive, the Optimized Row Columnar (ORC) file format was designed to provide a highly efficient way to store Hive data. It overcomes the limitations which are present in other Hive file formats. ORC files improve performance for reads, writes and data processing. ORC file format supports Hive types including datetime, decimal and complex types, including struct, list, map and union, which are not supported by the RC file format. The ORC file format supports lightweight indexes to be stored within the file which allow skipping to row groups that do not meet a certain filtering. Separate RecordReaders in ORC files allow concurrent reads from the same file. ORC files generate a single file as the output of each task, and hence reduce the load on the NameNode. A high level of compression is supported by the ORC file format.

47) What is STREAMTABLE in Hive?

In the map/reduce stage in a query involving a join, the table to be streamed can be specified by using a hint via the keyword STREAMTABLE.

E.g. Consider a table x with over 1.5 billion records and a table y with a few thousand records.

SELECT /*+ STREAMTABLE(x) */ x.val, y.val FROM x JOIN y ON x.key = y.key

This command tells Hive to Stream table ‘x’. Without the use of the STREAMTABLE keyword, Hive normally streams the right-most table and buffers the other tables. In this case, since y is the rightmost table, y would be streamed and x would be buffered. Buffering more than 1.5 billion records would cause the query to fail resulting in a Java-Heap space exception.

In such a case, the STREAMTABLE hint is given to Hive to allow query optimization so that Hive streams the table mentioned after the STREAMTABLE keyword and not the rightmost table after the join.

Scenario based or Real-Time Interview Questions on Hadoop Hive

How will you optimize Hive performance?

There are various ways to run Hive queries faster -

Using Apache Tez execution engine
Using vectorization
Using ORCFILE
Do cost based query optimization.

Will the reducer work or not if you use “Limit 1” in any HiveQL query?
Why you should choose Hive instead of Hadoop MapReduce?
I create a table which contains transaction details of customers for the year 2018.
CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
I have inserted 60K tuples in this table and now want to know the total revenue that has been generated for each month. However, Hive takes too much time to process this query. List all the steps that you would follow to solve this problem.
There is a Python application that connects to Hive database for extracting data, creating sub tables for data processing, drops temporary tables, etc. 90% of the processing is done through hive queries which are generated from python code and are sent to hive server for execution.Assume that there are 100K rows , would it be faster to fetch 100K rows to python itself into a list of tuples and mimic the join or filter operations hive performs and avoid the executuon of 20-50 queries run against hive or you should look into hive query optimization techniques ? Which one is performance efficient ?

SQL	HiveQL of Apache Hive
It is a domain-specific language which is used to manage data in relationship database management systems (RDBMS) and to handle structured data.	HiveQL is a query language used by Hive to analyze and query structured data in the Hadoop ecosystem.
SQL these data types: Integral, floating-point, fixed-point, text and binary strings, temporal.	HiveQL supports the following data types: Boolean, integral, floating-point, fixed-point, text and binary strings, temporal, array, map, struct.
Multi Table inserts are not supported.	Multi Table inserts are supported.
SQL does not support MapReduce.	HiveQL supports MapReduce.
SQL is ideal if it is required to update or modify the data frequently as it supports update and delete queries for the table data.	Hive is ideal to analyse large datasets, and not ideal in cases where the data has to be modified or updated as update and delete is not supported on a row level.
The view can be updated in SQL.	Views cannot be updated in HiveQL.
Subqueries can be used with multiple clauses.	Subqueries are only supported with FROM, WHERE and HAVING clauses.

Hive Interview Questions and Answers for 2024

Table of Contents

Hadoop Hive Interview Questions and Answers

Pig vs Hive

Criteria

Pig

Hive

HBase

Hive

Test Your Practical Hadoop Knowledge

Here's what valued users are saying about ProjectPro

FREE eBook on 250 Hadoop Interview Questions and Answers

Scenario based or Real-Time Interview Questions on Hadoop Hive

Other Interview Questions on Hadoop Hive

Posts Related to Hadoop Interview Questions-

About the Author