Big Data and related stuff.
Things that took me hours and days to implement, and that will hopefully take you less time.
Friday, May 22, 2015
Migrating to YARN on a MapReduce and Impala Cluster
Hey, I would like to share with you some insights from the YARN 2.5 migration we have gone through. We are using CDH 5.2.1, with Impala and MapReduce as the main workloads on the cluster. Don't take this as a technical guide, but as a technical experience with some interesting points worth sharing.
Let The Dogs Out: The cluster is a multi-tenant cluster used for MapReduce and Cloudera Impala. Before migrating to YARN, everything worked great: we allocated static resources to Impala (X GB of RAM per daemon, no limit on CPU) and y mappers and z reducers for each MapReduce TaskTracker, while setting a max Java heap size for mappers and reducers. We had enough memory for both Impala and MapReduce jobs, and no problem with the CPU either: we set 14 mappers and 14 reducers per 32-core node. The map and reduce slots were always full, so not much CPU was left for the Impala process, but the good thing is that Impala didn't really care and always had enough CPU. The charts showed that Impala used about 1/10 of the CPU compared to the MapReduce jobs. We said that moving to YARN is like letting the dogs out - we didn't know whether they would bite each other or play with each other.
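For reference, the static pre-YARN allocation described above lives in the MRv1 slot properties. A sketch of the relevant mapred-site.xml entries (the 14/14 slot counts are ours; the heap size is an illustrative placeholder, not our actual value):

```xml
<!-- mapred-site.xml (MRv1): static slot allocation per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>14</value> <!-- 14 map slots per 32-core node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>14</value> <!-- 14 reduce slots per 32-core node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1g</value> <!-- illustrative: max Java heap per mapper/reducer -->
</property>
```

With slots fixed per node, Impala simply consumed whatever CPU the full slots left over - which, as noted above, was enough.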
The Dogs Bite Each Other: After migrating, there were X available vcores for all applications (Impala and MapReduce) on the cluster. The MapReduce jobs behaved as expected and asked for 1 vcore per mapper/reducer. The problem was that Impala asked for lots of vcores - much more than we expected, and much higher than the minimum allocation of the fair scheduler we had set for users. A simple query on a 100K-row Parquet table with statistics asked for X/2 vcores (half of the cluster capacity). We have around 2 queries per second, which requires a huge amount of vcores. It all resulted in 50% of Impala queries failing because of the 5-minute timeout while waiting for resources. On the other hand, important MapReduce jobs didn't get many vcores and spent a lot of time waiting for resources. We saw that users running important MapReduce jobs got 1/10 of the vcores that a simple Impala user got just by deciding to run some queries. That is an undesirable situation.
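The per-user minimum we mention is configured in the fair scheduler's allocation file. A minimal sketch of what such an entry looks like in fair-scheduler.xml (the queue name and resource values here are hypothetical, not our production settings):

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="userA">
    <!-- hypothetical guarantee: 40 GB and 20 vcores reserved for this user -->
    <minResources>40960 mb,20vcores</minResources>
  </queue>
</allocations>
```

The trouble was that a single Impala query could dwarf any reasonable per-user minimum like this.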
An Impala CPU is not a MapReduce CPU: Why does Impala ask for so many vcores? Cloudera talks about a lack of table statistics. In addition, according to Nimrod Parasol, Impala opens one thread per disk. We have 360 disks in the cluster, so every query was asking for 360 vcores from YARN, which is not acceptable in a cluster with 720 vcores in total. You can take this as an immature implementation of LLAMA - the mediator between YARN and Impala. We saw that when all the CPU slots are used by mappers, the load average is extremely high, but when they are used by Impala queries, the load average is quite low. The conclusion is that Impala uses the CPU more lightly than MapReduce does. So why should we allocate the same vcores to both MapReduce and Impala, and let MapReduce jobs wait for a CPU that is held by Impala, when both could have been served at the same time?
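To make the mismatch concrete, here is the back-of-the-envelope arithmetic from the numbers above (360 disks, 720 total vcores, roughly 2 queries per second) as a small Python sketch:

```python
# Back-of-the-envelope: Impala's per-query vcore demand vs. cluster capacity.
disks = 360               # disks in the cluster -> ~1 Impala thread per disk
total_vcores = 720        # total vcores in the cluster
vcores_per_query = disks  # each query asks YARN for one vcore per disk

# How many such queries fit in the cluster at once?
concurrent_queries = total_vcores // vcores_per_query
print(concurrent_queries)  # -> 2

# At ~2 new queries per second, demand outruns capacity immediately,
# so queries queue up and hit the 5-minute resource-wait timeout.
queries_per_second = 2
demand_per_minute = queries_per_second * 60 * vcores_per_query
print(demand_per_minute)   # -> 43200
```

Two concurrent queries saturate the whole cluster on paper, while the physical CPUs stay mostly idle - which is exactly the "virtual" part of vcores working against us.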
Solutions? Let's say we have 1,000 vcores in the cluster. We were willing to set 1,000 vcores for MapReduce, because setting it higher would result in a high load average on the servers (the mappers are usually CPU intensive). We were also willing to set 10,000 vcores for Impala, because we know that when it asks for 100 vcores it is probably using far fewer physical CPUs. That allocation would create a situation where we give virtual CPUs to Impala, but that's OK, because before YARN we gave all the CPUs to MapReduce and Impala still did great. The problem is that YARN and LLAMA wouldn't let us set hierarchical queues - upper limits for MapReduce and Impala, with finer-grained allocations for applicative users inside each pool. (In order to use hierarchical queues, each user has to set the queue manually, e.g. set "mapreduceQueue.userA".)
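What we wanted, roughly, was a hierarchical fair-scheduler layout like the sketch below: an upper limit per workload, with sub-queues for applicative users underneath. All queue names and numbers here are hypothetical, illustrating the idea rather than a working LLAMA setup:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- cap MapReduce at the cluster's ~1,000 real vcores -->
  <queue name="mapreduce">
    <maxResources>1000000 mb,1000vcores</maxResources>
    <queue name="userA"/>
    <queue name="userB"/>
  </queue>
  <!-- give Impala a deliberately inflated vcore cap, since its "vcores" are lighter -->
  <queue name="impala">
    <maxResources>1000000 mb,10000vcores</maxResources>
  </queue>
</allocations>
```

Jobs would then have to be submitted with something like mapreduce.job.queuename=mapreduce.userA - which is exactly the manual per-user step that made this impractical for us.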
Eventually, we gave up on managing Impala resources with LLAMA and YARN, and stayed with MapReduce 2 (YARN) by itself.
Two more things you should be aware of:
High load average on the cluster: another thing we saw is that after moving to MapReduce 2, the servers' load average got significantly higher (250 with YARN, 150 without). We think the reason is that more mappers run simultaneously (because we now only set the number of containers per node, not separate numbers of mappers and reducers), compared to when we set y mappers and z reducers per server - and mappers are usually heavier than reducers.
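Under YARN, the per-node knob is the NodeManager's total container resources rather than separate map/reduce slot counts, so nothing stops a node from filling up entirely with mappers. A sketch of the relevant yarn-site.xml properties (the values are illustrative, not our production settings):

```xml
<!-- yarn-site.xml: total resources per NodeManager; YARN decides the mapper/reducer mix -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>28</value> <!-- illustrative: roughly the old 14 map + 14 reduce slots -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>65536</value> <!-- illustrative: 64 GB available for containers -->
</property>
```

With 28 undifferentiated vcores, all 28 containers can be heavy mappers at once - hence the higher load average.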
An Oozie issue: after migrating to YARN, Hive steps in Oozie workflows failed with "hive-site.xml (permission denied)". The message has lots of references on the web, but nothing related to YARN. Eventually the solution was to create a new sharelib that contains the YARN jars, cancel the Oozie sharelib mapping parameter (which pointed to a custom sharelib directory), and use the default sharelib path: the latest /user/oozie/share/lib/lib_yyyymmdd directory.
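For reference, the sharelib override we removed is set in oozie-site.xml. We believe the relevant setting is Oozie's system libpath property (check your oozie-site.xml for whichever sharelib property is overridden in your setup); restoring the default makes Oozie resolve the latest lib_* directory on its own:

```xml
<!-- oozie-site.xml: revert to Oozie's default sharelib location -->
<property>
  <name>oozie.service.WorkflowAppService.system.libpath</name>
  <value>/user/${user.name}/share/lib</value>
</property>
```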
Bottom line: I don't find Impala mature enough to work alongside MapReduce under YARN. A better scheduler with richer queue options (hierarchical pools), or a better LLAMA implementation (so that Impala asks for the vcores it really needs), is required.