Thursday, December 3, 2015

Creating a standalone Hive Metastore (not in a Hadoop cluster)

When benchmarking the Presto database on top of S3 files, I found out that I had to install a Hive metastore instance.

[Image: a lonely bee]

(A standalone bee)

I didn't need HiveServer, MapReduce, or a Hadoop cluster. So how do you do that?

Here are the steps:


  1. Install a Hive metastore repository - an instance of one of the databases that the Hive metastore works with (MySQL, PostgreSQL, MS SQL Server, Oracle... check the documentation)
  2. Install Java
  3. Download vanilla Hadoop from http://hadoop.apache.org/releases.html and unpack it on the Hive metastore instance (let's say you unpacked it to /apps/hadoop-2.6.2)
  4. Set environment variables:
    1. export HADOOP_PREFIX=/apps/hadoop-2.6.2
    2. export HADOOP_USER_CLASSPATH_FIRST=true
  5. Download Hive from http://www.apache.org/dyn/closer.cgi/hive/ and unpack it on your instance
  6. Create a schema (user) for the Hive user and build the Hive schema in the Hive metastore repository database using the bundled Hive scripts (a worked MySQL example appears after this list). The sample script for MySQL:
    1. /apps/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql
  7. Configure hive-site.xml with the right values (a full sample appears after this list) for:
    1. ConnectionURL (jdbc:mysql://localhost:3666/hive for example)
    2. ConnectionDriverName
    3. ConnectionUserName (the created database user)
    4. ConnectionPassword (the created database user's password)
    5. hive.metastore.warehouse.dir - set it to a local path (file:///home/presto/ for example)
  8. Copy the JDBC driver jar for your metastore repository database into the Hive classpath (ojdbc6 for Oracle, mysql-connector-java for MySQL, and so on)
  9. Start the Hive metastore: /apps/apache-hive-1.2.1-bin/bin/hive --service metastore
  10. For accessing S3:
    1. Copy these jars to the classpath:
      1. aws-java-sdk-1.6.6.jar (http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.6.6)
      2. hadoop-aws-2.6.0.jar (http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar)
    2. You can specify these parameters in the Hadoop core-site.xml:
      1. fs.s3.awsAccessKeyId
      2. fs.s3.awsSecretAccessKey
      3. <property>
           <name>fs.s3n.impl</name>
           <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
        </property>
    3. For secured access to S3, use an S3A connection in your URL (a sample core-site.xml appears after this list):
      1. add fs.s3a.connection.ssl.enabled to hadoop_home/etc/hadoop/core-site.xml
      2. you also need to set these parameters for S3 access in the Hadoop core-site.xml file:
        1. fs.s3a.secret.key
        2. fs.s3a.access.key
      3. Unfortunately, there is currently no support for temporary S3 credentials
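
As a reference for step 6, here is a minimal sketch for MySQL - the database name, user name, and password are placeholders, so pick your own:

   -- in a MySQL admin session: create the repository schema and its user
   CREATE DATABASE hive;
   CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
   GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';

   # from the shell: build the Hive schema with the bundled script
   mysql -u hive -phivepassword hive < /apps/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql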

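A minimal hive-site.xml for step 7 would then look roughly like this (the JDBC URL, user, and password are placeholders matching the sketch above):

   <configuration>
     <!-- JDBC connection to the metastore repository database -->
     <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:mysql://localhost:3666/hive</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
       <value>com.mysql.jdbc.Driver</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionUserName</name>
       <value>hive</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionPassword</name>
       <value>hivepassword</value>
     </property>
     <!-- local warehouse path - no HDFS required -->
     <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>file:///home/presto/</value>
     </property>
   </configuration>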

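And for step 10.3, the S3A entries in hadoop_home/etc/hadoop/core-site.xml would look roughly like this (the keys are placeholders):

   <configuration>
     <!-- static S3 credentials - as noted above, temporary credentials are not supported -->
     <property>
       <name>fs.s3a.access.key</name>
       <value>YOUR_ACCESS_KEY</value>
     </property>
     <property>
       <name>fs.s3a.secret.key</name>
       <value>YOUR_SECRET_KEY</value>
     </property>
     <!-- encrypt the connection to S3 -->
     <property>
       <name>fs.s3a.connection.ssl.enabled</name>
       <value>true</value>
     </property>
   </configuration>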

Finally, when running Presto, we will use the Thrift address and port of the Hive metastore service.
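
For example, a minimal Presto catalog file (etc/catalog/hive.properties on the coordinator and workers) could look like this - the host name is a placeholder, and 9083 is the metastore's default Thrift port:

   # tell Presto's Hive connector where the standalone metastore lives
   connector.name=hive-hadoop2
   hive.metastore.uri=thrift://metastore-host:9083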

If you run EMR from time to time, you can also use that external metastore repository, according to the AWS documentation.

That's it. No need for additional Hadoop libraries or settings.
Good luck!
