Thursday, December 3, 2015

Creating a standalone Hive Metastore (not in a Hadoop cluster)

While benchmarking the Presto database on top of S3 files, I found out that I had to install a Hive metastore instance.

(A standalone bee)

I didn't need HiveServer, MapReduce, or a Hadoop cluster. So how do you do that?

Here are the steps:


  1. Install a Hive metastore repository - an instance of one of the databases that the Hive metastore works with (MySQL, PostgreSQL, MS SQL Server, Oracle... check the documentation)
  2. Install Java
  3. Download vanilla Hadoop (http://hadoop.apache.org/releases.html) and unpack it on the Hive metastore instance (let's say you unpacked it to /apps/hadoop-2.6.2)
  4. Set environment variables :
    1. export HADOOP_PREFIX=/apps/hadoop-2.6.2
    2. export HADOOP_USER_CLASSPATH_FIRST=true
  5. Download Hive (http://www.apache.org/dyn/closer.cgi/hive/) and unpack it on your instance
  6. Create a schema (user) for the Hive user and build the Hive schema in the Hive metastore repository database using the bundled Hive scripts (a sample script for MySQL):
    1. /apps/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql
  7. Configure hive-site.xml with the right parameters (a sample is sketched after these steps) for:
    1. javax.jdo.option.ConnectionURL (jdbc:mysql://localhost:3666/hive, for example)
    2. javax.jdo.option.ConnectionDriverName
    3. javax.jdo.option.ConnectionUserName (the created database user)
    4. javax.jdo.option.ConnectionPassword (the created database user's password)
    5. hive.metastore.warehouse.dir - set it to a local path (file:///home/presto/ for example)
  8. Copy the JDBC driver jar required for connecting to the metastore repository into the Hive classpath (ojdbc6 for Oracle, mysql-connector-java for MySQL, and so on)
  9. Start the Hive metastore: /apps/apache-hive-1.2.1-bin/bin/hive --service metastore
  10. For accessing S3:
    1. Copy these jars to the classpath:
      1. aws-java-sdk-1.6.6.jar (http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.6.6)
      2. hadoop-aws-2.6.0.jar (http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar)
    2. You can specify these parameters in the Hadoop core-site.xml (the fs.s3n.* keys match the s3n filesystem registered below):
      1. fs.s3n.awsAccessKeyId
      2. fs.s3n.awsSecretAccessKey
      3. <property>
           <name>fs.s3n.impl</name>
           <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
        </property>
    3. For secured access to S3, use an s3a:// URL (the relevant properties appear in the sketch after these steps):
      1. Add fs.s3a.connection.ssl.enabled to hadoop_home/etc/hadoop/core-site.xml
      2. You also need to set these parameters for S3 access in the Hadoop core-site.xml file:
        1. fs.s3a.secret.key
        2. fs.s3a.access.key
      3. Unfortunately, there is currently no support for temporary S3 credentials
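
For reference, here is a minimal hive-site.xml sketch covering steps 7 and 10. This is only a sketch: it assumes MySQL as the metastore repository, and the port, credentials, paths and keys are placeholders you will need to adapt (the fs.s3a.* properties can also live in core-site.xml instead):

<configuration>
  <!-- Metastore repository connection (step 7) - placeholder values -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3666/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file:///home/presto/</value>
  </property>
  <!-- Secured S3 access via s3a (step 10) -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
  </property>
</configuration>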



Finally, when running Presto, we point it at the Thrift address and port of the Hive metastore service.
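
On the Presto side, that means a Hive connector catalog file along these lines (for example etc/catalog/hive.properties on the Presto nodes); the metastore host and the S3 keys below are placeholders, and 9083 is the default metastore Thrift port:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
# Optional: S3 credentials for the Presto Hive connector (placeholders)
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY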

If you run EMR from time to time, you can also use that external metastore repository, according to the AWS documentation.

That's it. No need for additional Hadoop libraries or settings.
Good luck!

Tuesday, December 1, 2015

Secured ELK (including filebeat)

It is not so difficult to start an ELK topology: an application server that forwards logs to a Logstash server, and a Logstash server that sends the logs, in the right schema, to Elasticsearch.

Then, it's also quite straightforward to start an Elasticsearch cluster with Kibana for some visualizations.

The problem started for us when we wanted to secure the whole pipeline - from the app server to the client through Logstash, Elasticsearch and Kibana.

We also got to know Filebeat, and we almost immediately fell in love.

I will explain here, step by step, the whole installation and deployment process that we went through. All the configuration files are attached. Enjoy.

General Architecture:

(Architecture: Filebeat agents on the application servers → Logstash → Elasticsearch and S3, with Kibana on top of Elasticsearch)

  1. Components:
    1. Filebeat - an evolution of the old logstash-forwarder; a light and easy-to-use tool for shipping data from log files to Logstash.
    2. Logstash - a ready-to-use tool for sending log data to Elasticsearch. It supports many other inputs and outputs, and manipulations of its input log records. I will show the integration with Filebeat as input and Elasticsearch and S3 as outputs.
    3. Elasticsearch - our back end for storing and indexing the logs.
    4. Kibana - the visualization tool on top of Elasticsearch.
  2. Filebeat agent installation (talking to Logstash over TLS)
    1. At the time of the project, the newest version of Filebeat (1.0.0-rc1) had a bug when sending data to Logstash: it automatically creates the "@timestamp" field, which is also created by Logstash, and that makes it fail. We worked with the 1.0.0-beta4 version (the first one) of Filebeat.
    2. Configuration file (a sketch appears after these steps)
      1. Take a look at the Logstash output section.
      2. We want Filebeat to trust the Logstash server. If the certificate that Logstash uses was signed by some CA, you can give Filebeat that CA certificate via the "certificate_authorities" setting.
      3. If you didn't use a signed certificate for the Logstash server, you can point Filebeat at the Logstash server's own certificate (I'll explain later how to create it), as demonstrated in the configuration file.
      4. Keep in mind that we don't use Java keystores for the Filebeat-Logstash connection.
    3. Starting command:   filebeat_location/filebeat -v  -c filebeat_location/filebeat.yml
  3. Logstash (HTTPS communication to Elasticsearch)
    1. Download and unpack
    2. Generate a certificate:
      1. openssl req -new -text -out server.req
      2. When asked for the Common Name, enter the server's IP: Common Name (eg, your name or your server's hostname) []: 10.184.2.232
      3. openssl rsa -in privkey.pem -out server.key
      4. rm privkey.pem
      5. Sign the req file with the CA certificate
    3. Logstash configuration (a sketch appears after these steps)
      1. The input is configured to get data from filebeat
      2. In the filter and grok scope:
        1. We create the JSON document out of the "message" field that we get from Filebeat.
        2. We duplicate the message into two fields: an analyzed one ("msg_tokenized") that we can search, and a not_analyzed one ("msg") that keeps the whole message for visualizations - that's important for Elasticsearch later on. I will explain below how to set an Elasticsearch template for that.
        3. We create a new field ("file_name") with the name of the source log file.
        4. We remove the "logtime" field.
        5. We remove the "message" field.
      3. Output scope:
        1.  S3 and Elasticsearch outputs
        2. For SSL between Logstash and Elasticsearch, you can point the cacert setting under the elasticsearch output at the CA certificate.
        3. If we specify a new index name in the Logstash elasticsearch output, the index gets created automatically. We decided to create a new index for each log file, and then built an abstraction layer over all the application logs in Kibana, with queries on top of a couple of indices.
        4. The S3 bucket path currently cannot be dynamic.
    4. Elasticsearch (open for HTTPS only)
      1. Download Elasticsearch, Shield, and the Shield license plugin
      2. Install the two plugins for Shield:
        1. bin/plugin install file:///path/to/file/license-2.0.0.zip
        2. bin/plugin install file:///path/to/file/shield-2.0.0.zip
      3. Create Certificate for elasticsearch
        1. openssl req -new -text -out server.req
        2. openssl rsa -in privkey.pem -out server.key
        3. rm privkey.pem
        4. Sign the req file (with the organisation's CA) and save the signed certificate as signed.cer
        5. Create a Java keystore with the signed certificate and the private key (2 steps)
          1. openssl pkcs12 -export -name myservercert  -in signed.cer -inkey server.key -out keystore.p12
          2. keytool -importkeystore -destkeystore node01.jks -srckeystore keystore.p12 -srcstoretype pkcs12 -alias myservercert
        6. If you are not using a signing CA, you can use this command to create the keystore: keytool -genkey -alias node01_s -keystore node01.jks -keyalg RSA -keysize 2048 -validity 712 -ext san=ip:10.184.2.238
      4. Configuration file (a sketch appears after these steps)
        1. Take a look at the Shield security options. We are using the local Java keystore that holds our signed certificate and the private key.
      5. Start Elasticsearch: /apps/elasticsearch-2.0.0/bin/elasticsearch
      6. Add user:  elasticsearch/bin/esusers useradd alog -p alog123 -r admin
      7. Add a template to Elasticsearch - we want the message in all the indices created by Logstash to be available both as an analyzed field for search and as a not_analyzed "msg" field for visualization. The template takes care of the not_analyzed mapping.
        1. curl -XPUT https://alog:alog123@elastic-server:9200/_template/template_1 -d '
           {
               "template" : "alog-*",
               "mappings" : {
                   "_default_" : {
                       "properties" : {
                           "msg" : { "type" : "string", "index" : "not_analyzed" }
                       }
                   }
               }
           }'
        2. This template applies to all indices whose names start with "alog-".
    5. Kibana (HTTPS communication to Elasticsearch, users log on via HTTPS)
        1. Download and unpack
        2. Start command: bin/kibana
        3. Configuration (a sketch appears after these steps)
          1. To enable HTTPS connections for users, we create another certificate/key pair with the openssl tool and set the certificate and key in the configuration.
          2. We set elasticsearch.username and elasticsearch.password for Shield authentication.
          3. We didn't need to supply Kibana with the Elasticsearch certificate because we used a CA-signed certificate.
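
To make the configuration files above more concrete, here is a minimal filebeat.yml sketch for the Filebeat 1.0.0-beta4 agent from step 2. The log path, Logstash address and certificate path are placeholders, and the exact keys may differ slightly between the 1.0.0 beta releases:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/myapp/*.log       # placeholder path to the application logs
      input_type: log
output:
  logstash:
    hosts: ["logstash-host:5044"]    # placeholder Logstash address and port
    tls:
      # CA certificate (or the Logstash server's own certificate) that Filebeat should trust
      certificate_authorities: ["/etc/pki/tls/certs/logstash.crt"]

And a Logstash pipeline sketch in the spirit of step 3. The grok pattern, index name, certificate paths and S3 settings are assumptions for illustration; adapt them to your own log format and environment:

input {
  beats {
    port => 5044
    ssl  => true
    ssl_certificate => "/etc/pki/tls/certs/logstash.crt"
    ssl_key         => "/etc/pki/tls/private/logstash.key"
  }
}

filter {
  # Assumed log layout: "<ISO8601 timestamp> <rest of the message>" - adapt to your logs
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:logtime} %{GREEDYDATA:msg}" }
  }
  # Extract the file name from the full path that Filebeat ships in the "source" field
  grok {
    match => { "source" => "%{GREEDYDATA}/%{GREEDYDATA:file_name}" }
  }
  mutate {
    # "msg_tokenized" stays analyzed for search; "msg" is mapped not_analyzed by the template above
    add_field    => { "msg_tokenized" => "%{msg}" }
    remove_field => [ "logtime", "message" ]
  }
}

output {
  elasticsearch {
    hosts    => ["elastic-server:9200"]
    ssl      => true
    cacert   => "/etc/pki/tls/certs/ca.crt"
    user     => "alog"
    password => "alog123"
    index    => "alog-myapp-%{+YYYY.MM.dd}"
  }
  s3 {
    access_key_id     => "YOUR_ACCESS_KEY"
    secret_access_key => "YOUR_SECRET_KEY"
    bucket            => "my-log-bucket"
    size_file         => 10485760     # rotate the temporary file at ~10 MB
    time_file         => 15           # or every 15 minutes
  }
}

Similarly, a sketch of the Shield-related settings in elasticsearch.yml (step 4) and of kibana.yml (step 5). The keystore path, passwords and hostnames are placeholders, and the exact setting names should be double-checked against the Shield 2.0 and Kibana 4.x documentation:

# elasticsearch.yml - Shield TLS settings (placeholder values)
shield.ssl.keystore.path: /apps/elasticsearch-2.0.0/config/node01.jks
shield.ssl.keystore.password: keystore_password
shield.transport.ssl: true
shield.http.ssl: true

# kibana.yml - HTTPS for browser users, authenticated HTTPS to Elasticsearch
server.ssl.cert: /etc/pki/tls/certs/kibana.crt
server.ssl.key: /etc/pki/tls/private/kibana.key
elasticsearch.url: "https://elastic-server:9200"
elasticsearch.username: "alog"
elasticsearch.password: "alog123"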

And that's it :)
I hope that all of the above helps you get through some of the obstacles that we encountered, whether with the SSL setup, with grok, or with anything else.

Good Luck !