I would like to share some insights from the YARN (Hadoop 2.5) migration we have gone through.
We are using CDH 5.2.1, with Impala and MapReduce as the main workloads on the cluster.
Don't take this as a technical guide, but as a technical experience with some interesting points worth sharing.
- Let The Dogs Out: The cluster is a multi-tenant cluster used for MapReduce and Cloudera Impala. Before migrating to YARN everything worked great: we allocated static resources to Impala (X GB of RAM per daemon, no CPU limit) and y mappers and z reducers per MapReduce TaskTracker, while setting the maximum Java heap size for mappers and reducers. We had enough memory for both Impala and the MapReduce jobs, and no problem with the CPU either: we set 14 mappers and 14 reducers per 32-core node. The map and reduce slots were always full, so not much CPU was left for the Impala process, but the good thing is that Impala didn't really care and always had enough CPU. The charts showed that Impala used about 1/10 of the CPU compared to the MapReduce jobs. We said that moving to YARN is like letting the dogs out - we didn't know whether they would bite each other or play with each other.
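For reference, the static allocation described above boils down to a few MRv1 TaskTracker properties (Impala's memory cap was set separately through its own daemon memory limit). The snippet below is only an illustration, with the heap size as a placeholder rather than our real value:

    <!-- mapred-site.xml (MRv1): static slot allocation per TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>14</value>   <!-- 14 map slots per 32-core node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>14</value>   <!-- 14 reduce slots per 32-core node -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2g</value>   <!-- placeholder max heap per mapper/reducer -->
    </property>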
- The Dogs Bite Each Other: After migrating, there were X available vcores for all applications (Impala and MapReduce) on the cluster. The MapReduce jobs behaved as expected and asked for 1 vcore per mapper/reducer. The problem was that Impala asked for lots of vcores - much more than we expected, and much more than the minimum share we had configured per user in the Fair Scheduler. A simple query on a 100K-row Parquet table with statistics asked for X/2 vcores (half of the cluster capacity). We run around 2 queries per second, which adds up to a huge number of vcores. It all resulted in 50% of Impala queries failing because of the 5-minute timeout while waiting for resources. On the other hand, important MapReduce jobs didn't get many vcores and spent a lot of time waiting for resources. We saw users running important MapReduce jobs get 1/10 of the vcores given to a simple Impala user who decided to run some queries. That is an undesirable situation that YARN brought.
- An Impala CPU is not a MapReduce CPU: Why does Impala ask for so many vcores? Cloudera points to a lack of table statistics. In addition, according to Nimrod Parasol, Impala opens 1 thread per disk. We have 360 disks on the cluster, so every query was asking for 360 vcores from YARN, which is not acceptable on a 720-vcore cluster. You can take this as an immature implementation of Llama - the mediator between Impala and YARN. We also see that when all the CPU slots are used by mappers, the load average is extremely high, but when they are used by Impala queries, the load average is quite low. The conclusion is that Impala uses the CPU in a lighter way than MapReduce. So why should we allocate the same vcores to both MapReduce and Impala, and let MapReduce jobs wait for a CPU that is being held by Impala, when the same CPU could have served both at the same time?
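On the statistics point, collecting table and column stats in Impala is a single statement per table; a small example (the table name is made up):

    -- Gather table and column statistics so the planner can size its resource requests
    COMPUTE STATS my_parquet_table;
    -- Check that row counts and column stats are now populated
    SHOW TABLE STATS my_parquet_table;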
- Solutions? Let's say we have 1,000 vcores on the cluster. We were willing to set 1,000 vcores for MapReduce, because setting it higher would result in a high load average on the servers (the mappers are usually CPU intensive). We were also willing to set 10,000 vcores for Impala, because we know that when it asks for 100 vcores it is probably using far fewer physical CPUs. That allocation would hand Impala virtual CPUs, but that's fine: before YARN we gave all the CPUs to MapReduce and Impala still did great. The problem is that YARN and Llama wouldn't let us set up hierarchical queues this way - upper limits for MapReduce and for Impala, with further allocations for applicative users inside each pool (in order to use hierarchical queues, each user has to set the queue manually, e.g. set "mapreduceQueue.userA"); a sketch of what we had in mind follows below.
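For illustration, this is roughly the hierarchical Fair Scheduler layout we were after (queue names and numbers are hypothetical), with each job then having to target a leaf queue explicitly, e.g. -Dmapreduce.job.queuename=mapreduce.userA:

    <!-- fair-scheduler.xml: hypothetical hierarchical layout, not a working recommendation -->
    <allocations>
      <queue name="mapreduce">
        <maxResources>1000000 mb, 1000 vcores</maxResources>  <!-- cap MapReduce at real CPU capacity -->
        <queue name="userA">
          <minResources>100000 mb, 100 vcores</minResources>  <!-- guaranteed share for an applicative user -->
        </queue>
      </queue>
      <queue name="impala">
        <maxResources>1000000 mb, 10000 vcores</maxResources>  <!-- allow "virtual" vcores for Impala -->
      </queue>
    </allocations>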
- Eventually, we gave up on managing Impala resources with Llama and YARN, and kept YARN for MapReduce 2 alone.
Two more things you should be aware of:
- High load average on the cluster: another thing we saw is that after moving to MapReduce 2, the servers' load average got higher by almost 100% (250 with YARN, 150 without). We think the reason is that more mappers run simultaneously (because we now only set the number of containers per NodeManager, not the number of mappers and reducers separately), compared to when we set y mappers and z reducers per server, and mappers are usually heavier than reducers.
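Under YARN, the number of concurrent containers per node falls out of the NodeManager resources and the per-container requests rather than fixed map/reduce slots; a minimal sketch with placeholder values (not our production settings):

    <!-- yarn-site.xml: per-NodeManager resources (placeholders) -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>28</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>98304</value>
    </property>

    <!-- mapred-site.xml: per-container requests; together these bound concurrent mappers -->
    <property>
      <name>mapreduce.map.cpu.vcores</name>
      <value>1</value>
    </property>
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>3072</value>
    </property>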
- Hive on Oozie issue: After migrating to YARN, Hive steps in Oozie workflows failed with "hive-site.xml (permission denied)". The message has lots of references on the web, but nothing related to YARN. Eventually the solution was to create a new sharelib that contains the YARN jars, cancel the Oozie sharelib mapping parameter (which pointed to a custom sharelib directory), and use the default sharelib path: the latest /user/oozie/share/lib/lib_<yyyymmdd> directory.
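For completeness, the CDH-style commands to rebuild the sharelib from the YARN-based tarball look roughly like the sketch below (filesystem URI, hostnames and tarball path are placeholders, and the exact steps may differ by version):

    # Recreate the Oozie sharelib from the YARN-based tarball (paths/URI are placeholders)
    oozie-setup sharelib create -fs hdfs://namenode:8020 \
        -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz

    # Make the running Oozie server pick up the new lib_<timestamp> directory
    # (supported in newer Oozie versions; otherwise restart the Oozie server)
    oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate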
Good luck everybody.