Thursday, December 3, 2015

Creating a standalone Hive Metastore (not in a Hadoop cluster)

When benchmarking the Presto database on top of S3 files, I found out that I had to install a Hive metastore instance.

[Image: a lonely bee]

(A standalone bee)

I didn't need HiveServer, MapReduce, or a Hadoop cluster. So how do you do that?

Here are the steps:


  1. Install a Hive metastore repository - an instance of one of the databases that the Hive metastore works with (MySQL, PostgreSQL, MS SQL Server, Oracle... check the documentation)
  2. Install Java
  3. Download vanilla Hadoop from http://hadoop.apache.org/releases.html and unpack it on the Hive metastore instance (let's say you unpacked it to /apps/hadoop-2.6.2)
  4. Set environment variables:
    1. export HADOOP_PREFIX=/apps/hadoop-2.6.2
    2. export HADOOP_USER_CLASSPATH_FIRST=true
  5. Download Hive from http://www.apache.org/dyn/closer.cgi/hive/ and unpack it on your instance
  6. Create a schema (user) for the Hive user and build the Hive schema in the Hive metastore repository database using the bundled Hive scripts (a worked MySQL example appears after this list). The sample script for MySQL:
    1. /apps/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql
  7. Configure hive-site.xml with the right values (a full sample appears after this list) for:
    1. ConnectionURL (jdbc:mysql://localhost:3666/hive for example)
    2. ConnectionDriverName
    3. ConnectionUserName (the created database user)
    4. ConnectionPassword (the created database user's password)
    5. hive.metastore.warehouse.dir - set it to a local path (file:///home/presto/ for example)
  8. Copy the JDBC driver jar for your metastore repository database into the Hive classpath (ojdbc6 for Oracle, mysql-connector-java for MySQL, and so on)
  9. Start the Hive metastore: /apps/apache-hive-1.2.1-bin/bin/hive --service metastore
  10. For accessing S3:
    1. Copy these jars to the classpath:
      1. aws-java-sdk-1.6.6.jar (http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.6.6)
      2. hadoop-aws-2.6.0.jar (http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar)
    2. You can specify these parameters in the Hadoop core-site.xml:
      1. fs.s3.awsAccessKeyId
      2. fs.s3.awsSecretAccessKey
      3. <property>
           <name>fs.s3n.impl</name>
           <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
        </property>
    3. For secured access to S3, use an S3A connection in your URL (a sample core-site.xml appears after this list):
      1. add fs.s3a.connection.ssl.enabled to hadoop_home/etc/hadoop/core-site.xml
      2. you also need to set these parameters for S3 access in the Hadoop core-site.xml file:
        1. fs.s3a.secret.key
        2. fs.s3a.access.key
      3. Unfortunately, there is currently no support for temporary S3 credentials
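
As a reference for step 6, here is a minimal sketch for MySQL - the database name, user name, and password are placeholders, so pick your own:

   -- in a MySQL admin session: create the repository schema and its user
   CREATE DATABASE hive;
   CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
   GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';

   # from the shell: build the Hive schema with the bundled script
   mysql -u hive -phivepassword hive < /apps/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql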

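A minimal hive-site.xml for step 7 would then look roughly like this (the JDBC URL, user, and password are placeholders matching the sketch above):

   <configuration>
     <!-- JDBC connection to the metastore repository database -->
     <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:mysql://localhost:3666/hive</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
       <value>com.mysql.jdbc.Driver</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionUserName</name>
       <value>hive</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionPassword</name>
       <value>hivepassword</value>
     </property>
     <!-- local warehouse path - no HDFS required -->
     <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>file:///home/presto/</value>
     </property>
   </configuration>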

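And for step 10.3, the S3A entries in hadoop_home/etc/hadoop/core-site.xml would look roughly like this (the keys are placeholders):

   <configuration>
     <!-- static S3 credentials - as noted above, temporary credentials are not supported -->
     <property>
       <name>fs.s3a.access.key</name>
       <value>YOUR_ACCESS_KEY</value>
     </property>
     <property>
       <name>fs.s3a.secret.key</name>
       <value>YOUR_SECRET_KEY</value>
     </property>
     <!-- encrypt the connection to S3 -->
     <property>
       <name>fs.s3a.connection.ssl.enabled</name>
       <value>true</value>
     </property>
   </configuration>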

Finally, when running Presto, we will use the Thrift address and port of the Hive metastore service.
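
For example, a minimal Presto catalog file (etc/catalog/hive.properties on the coordinator and workers) could look like this - the host name is a placeholder, and 9083 is the metastore's default Thrift port:

   # tell Presto's Hive connector where the standalone metastore lives
   connector.name=hive-hadoop2
   hive.metastore.uri=thrift://metastore-host:9083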

If you run EMR from time to time, you can also use that external metastore repository, according to the AWS documentation.

That's it. No need for additional Hadoop libraries or settings.
Good luck!
