In this blog, we discuss how to install the Cloudera Hive ODBC driver on a Linux (RHEL) EC2 instance.
Apache Hive is a data warehouse tool built on top of Apache Hadoop for data query and analysis. Hive provides a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. It enables reading, writing, and managing large datasets in distributed storage. Hive queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark. It also provides easy, familiar batch processing for Apache Hadoop.
A few key features of Hive:
- Provides a SQL-like interface
- Shared Data Structure
- Faster Batch Processing
A few common use cases of Hive:
- ETL
- Data Mining
- Data Preparation
Follow the steps below to install the Cloudera Hive ODBC driver on the AWS platform.
Step 1: Download the Software from https://www.cloudera.com/downloads/connectors/hive/odbc/2-5-25.html
Downloaded file name: ClouderaHiveODBC-2.5.25.1020-1.el7.x86_64.rpm
Step 2: Log in to the EC2 instance as root and create the directory below
mkdir -p /tmp/cloudera-hive && cd /tmp/cloudera-hive
Step 3: Copy the RPM package into the above path
aws s3 cp s3://test-bucket/Hive/ClouderaHiveODBC-2.5.25.1020-1.el7.x86_64.rpm .
Step 4: Install the package
yum --nogpgcheck localinstall ClouderaHiveODBC-2.5.25.1020-1.el7.x86_64.rpm
Step 5: Verify the package
yum list | grep ClouderaHiveODBC
Step 6: Create the directory if it does not exist
mkdir -p /opt/odbc_path/client/ODBC_64
Step 7: Go to the directory and create two ODBC files (odbc.ini and odbcinst.ini) inside it
cd /opt/odbc_path/client/ODBC_64
Create an odbc.ini file with the contents below

[ODBC]
#QEWSD=2458358
InstallDir=/opt/teradata/client/ODBC_64
Trace=no
Pooling=yes

[ODBC Data Sources]
Teradata ODBC DSN=Teradata Database ODBC Driver 16.20

[Teradata ODBC DSN]
Description=Teradata Database ODBC Driver 16.20

[testdsn]
Driver=/opt/teradata/client/ODBC_64/lib/tdataodbc_sb64.so
DBCName=<Your DB End Point>
MechanismName=LDAP
Username=<User Name>
Password=<Password>
Database=<DB Name>
AccountString=
CharacterSet=ASCII
DatasourceDNSEntries=
DateTimeFormat=AAA
DefaultDatabase=
DontUseHelpDatabase=0
DontUseTitles=1
EnableExtendedStmtInfo=1
EnableReadAhead=1
IgnoreODBCSearchPattern=0
LogErrorEvents=0
LoginTimeout=20
MaxRespSize=65536
MaxSingleLOBBytes=0
MaxTotalLOBBytesPerRow=0
MechanismName=
NoScan=0
PrintOption=N
retryOnEINTR=1
ReturnGeneratedKeys=N
SessionMode=System Default
SplOption=Y
TABLEQUALIFIER=0
TCPNoDelay=1
TdmstPortNumber=1025
UPTMode=Not set
USE2XAPPCUSTOMCATALOGMODE=0
UseDataEncryption=0
UseDateDataForTimeStampParams=0
USEINTEGRATEDSECURITY=0
UseSequentialRetrievalOnly=0
UseXViews=0

[ODBC]
Trace = 1
TraceFile =

[ODBC Data Sources]
Cloudera Hive 32-bit=Cloudera ODBC Driver for Apache Hive 32-bit
Cloudera Hive 64-bit=Cloudera ODBC Driver for Apache Hive 64-bit

[Cloudera Hive 32-bit]
Description=Cloudera ODBC Driver for Apache Hive (32-bit) DSN
Driver=/opt/cloudera/hiveodbc/lib/32/libclouderahiveodbc32.so
HOST=[HOST]
PORT=[PORT]
Schema=default
ServiceDiscoveryMode=0
ZKNamespace=
HiveServerType=2
AuthMech=2
ThriftTransport=1
UseNativeQuery=0
UID=
KrbHostFQDN=_HOST
KrbServiceName=hive
KrbRealm=
SSL=0
TwoWaySSL=0
ClientCert=
ClientPrivateKey=
ClientPrivateKeyPassword=

[Hive]
Description=Cloudera ODBC Driver for Apache Hive (64-bit) DSN
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST=
PORT=
Schema=
ServiceDiscoveryMode=0
ZKNamespace=
HiveServerType=2
AuthMech=3
ThriftTransport=1
UseNativeQuery=0
UID=
PWD=
KrbHostFQDN=_HOST
KrbServiceName=hive
KrbRealm=
SSL=0
TwoWaySSL=0
ClientCert=
ClientPrivateKey=
ClientPrivateKeyPassword=
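Before moving on, it can help to confirm that the DSN you plan to use has its connection details filled in, since the [Hive] section above ships with HOST and PORT left blank. The following is a minimal sketch, not part of the Cloudera tooling: the helper name blank_settings, the sample text, and the port value 10000 are assumptions made for illustration. It relies only on Python's standard configparser module.

```python
import configparser

# Trimmed-down sample for illustration; the PORT value is an assumed example.
SAMPLE_ODBC_INI = """\
[Hive]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST=
PORT=10000
"""


def blank_settings(ini_text, dsn, required=("Driver", "HOST", "PORT")):
    """Return the required keys that are missing or empty in the DSN section."""
    # strict=False tolerates repeated sections and keys, which odbc.ini may contain.
    # interpolation=None keeps any '%' characters in values from being expanded.
    parser = configparser.ConfigParser(
        strict=False, delimiters=("=",), interpolation=None
    )
    parser.optionxform = str  # preserve the case of key names
    parser.read_string(ini_text)
    if not parser.has_section(dsn):
        raise KeyError(f"DSN section [{dsn}] not found")
    return [key for key in required if not parser.get(dsn, key, fallback="").strip()]


if __name__ == "__main__":
    print(blank_settings(SAMPLE_ODBC_INI, "Hive"))
```

To check the real file, read /opt/odbc_path/client/ODBC_64/odbc.ini into a string and pass it in place of the sample. Any key returned in the list still needs a value, either in the file or in the connection string.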
Create an odbcinst.ini file with the contents below

[ODBC Drivers]
Teradata Database ODBC Driver 16.20=Installed

[Teradata Database ODBC Driver 16.20]
Description=Teradata Database ODBC Driver 16.20
Driver=/opt/teradata/client/ODBC_64/lib/tdataodbc_sb64.so

[ODBC Drivers]
Cloudera ODBC Driver for Apache Hive 32-bit=Installed
Cloudera ODBC Driver for Apache Hive 64-bit=Installed

[Cloudera ODBC Driver for Apache Hive 32-bit]
Description=Cloudera ODBC Driver for Apache Hive (32-bit)
Driver=/opt/cloudera/hiveodbc/lib/32/libclouderahiveodbc32.so

[Cloudera ODBC Driver for Apache Hive 64-bit]
Description=Cloudera ODBC Driver for Apache Hive (64-bit)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
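A quick way to sanity-check odbcinst.ini is to list every driver marked as Installed together with the library path from its own section. The sketch below is one possible approach using Python's standard configparser module; the function name registered_drivers and the sample text are hypothetical and used only for illustration.

```python
import configparser

# Trimmed-down sample for illustration, based on the 64-bit entries above.
SAMPLE_ODBCINST_INI = """\
[ODBC Drivers]
Cloudera ODBC Driver for Apache Hive 64-bit=Installed

[Cloudera ODBC Driver for Apache Hive 64-bit]
Description=Cloudera ODBC Driver for Apache Hive (64-bit)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
"""


def registered_drivers(ini_text):
    """Map each driver marked Installed to its Driver path (None if not found)."""
    # strict=False merges a repeated [ODBC Drivers] section instead of raising.
    parser = configparser.ConfigParser(
        strict=False, delimiters=("=",), interpolation=None
    )
    parser.optionxform = str  # driver names are case-sensitive section names
    parser.read_string(ini_text)
    drivers = {}
    if not parser.has_section("ODBC Drivers"):
        return drivers
    for name, state in parser["ODBC Drivers"].items():
        if state.strip().lower() == "installed":
            # fallback=None also covers a driver with no section of its own
            drivers[name] = parser.get(name, "Driver", fallback=None)
    return drivers


if __name__ == "__main__":
    for name, path in registered_drivers(SAMPLE_ODBCINST_INI).items():
        print(f"{name}: {path}")
```

A driver that maps to None is registered but has no usable section, which usually points to a typo in the section name.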
Step 8: Set the environment variables
export ODBCINI=/opt/odbc_path/client/ODBC_64/odbc.ini
export ODBCINSTINI=/opt/odbc_path/client/ODBC_64/odbcinst.ini
You can also add these two lines to your /etc/profile file so that the environment variables are set on every login.
Now you are all set to test connectivity using the pyodbc Python package to connect to Hive EDL. Below is an example of connecting to Hive:
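If a connection later fails with a "data source name not found" style error, the first thing worth checking is whether these two variables are visible to the process and point to real files. Here is a small sketch of such a check; the function name check_odbc_env and its status strings are assumptions made for this example.

```python
import os


def check_odbc_env(environ=os.environ):
    """Report 'ok', 'unset', or 'missing file' for each ODBC variable."""
    status = {}
    for var in ("ODBCINI", "ODBCINSTINI"):
        path = environ.get(var)
        if not path:
            status[var] = "unset"
        elif not os.path.isfile(path):
            status[var] = "missing file"
        else:
            status[var] = "ok"
    return status


if __name__ == "__main__":
    for var, state in check_odbc_env().items():
        print(f"{var}: {state}")
```

Run it from the same shell or service account that will run your Python job, because variables exported in one session are not visible in another.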
import pyodbc

# Disable ODBC connection pooling; set this before the first connection is opened
pyodbc.pooling = False

# "Hive" is the DSN defined in odbc.ini; replace the placeholder values with your own
conn_str = "DSN=Hive;HOST=Hostname;UID=User_ID;PWD=Password;PORT=Port_No"
con = pyodbc.connect(conn_str, autocommit=True)

I hope this blog helps you install the Cloudera Hive ODBC driver on the AWS platform. Please like and comment if you have any questions about this blog.
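The connection string above can also be assembled by a small helper, which keeps the key names in one place. This is only a sketch: build_conn_str and connect_to_hive are hypothetical names, and the host, port, and credentials in the usage line are made-up values. It does not escape values that contain a semicolon, so treat it as a starting point.

```python
def build_conn_str(dsn, host, port, uid, pwd):
    """Assemble a semicolon-separated ODBC connection string."""
    parts = {"DSN": dsn, "HOST": host, "PORT": port, "UID": uid, "PWD": pwd}
    return ";".join(f"{key}={value}" for key, value in parts.items())


def connect_to_hive(conn_str):
    """Open a connection with pooling disabled and autocommit enabled."""
    # Imported here so the string builder works even where pyodbc is not installed.
    import pyodbc

    pyodbc.pooling = False  # must be set before the first connection is opened
    return pyodbc.connect(conn_str, autocommit=True)


if __name__ == "__main__":
    # Example values only; substitute your own endpoint and credentials.
    print(build_conn_str("Hive", "hive.example.com", 10000, "alice", "secret"))
```

In a real job, read the password from an environment variable or a secrets store instead of hard-coding it in the script.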