50 Real Time Scenario (Problems & Solutions)


Troubleshooting Installation and Upgrade Problems
=================================================

General Advice
==============
• If you are having problems, check the logs in the logs directory to see if there are any Hadoop errors or Java exceptions.
• Logs are named by the machine and the job they carry out in the cluster, and this can help you figure out which part of your configuration is giving you trouble.
• Even if you were very careful, the problem is probably with your configuration. Try running the grep example from the Quick Start. If it doesn't run, you need to check your configuration.
• If you can't get it to work on a real cluster, try it on a single-node cluster first.
• Sometimes it just takes time and sweat to make complex systems run, but it never hurts to ask for help, so please ask the TA and your fellow students ASAP if you are having trouble making Hadoop run.
1) How to utilize hive buckets in spark?
Answer: Bucketing is an optimization technique that uses buckets (and bucketing columns) to
determine data partitioning and avoid data shuffle.
Bucketing is enabled by default. Spark SQL uses the spark.sql.sources.bucketing.enabled
configuration property to control whether bucketing should be enabled and used for query
optimization or not.
Bucketing is used exclusively in FileSourceScanExec physical operator (when it is requested for
the input RDD and to determine the partitioning and ordering of the output).
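As a quick check, you can confirm the property for a session and inspect a bucketed table's metadata from the command line. This is a minimal sketch: the table name sales_bucketed is hypothetical, and DESCRIBE FORMATTED reports the number of buckets and the bucket columns for a bucketed table.
spark-sql --conf spark.sql.sources.bucketing.enabled=true \
  -e "DESCRIBE FORMATTED sales_bucketed"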
2) How does hive work with Spark?
Answer: Hive on Spark means Hive uses Spark as its execution engine instead of MapReduce; the
data itself still lives in HDFS.
The reason people prefer Spark over MapReduce is that it processes data largely in memory, so
Hive jobs run much faster there. Plus, it moves programmers toward using a common engine if
your company runs predominantly Spark.
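For example, assuming Hive on Spark is installed and configured on the cluster, you can switch the execution engine for a session (the table name here is hypothetical):
hive -e "set hive.execution.engine=spark; select count(*) from web_logs;"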
15. What happens if you get a 'Connection refused' Java exception when you type hadoop fsck /?
Answer: If you get a 'Connection refused' Java exception when you run hadoop fsck /, it usually
means that the NameNode is not running on your VM.
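To confirm, check whether the NameNode process is up and which address the client is trying to reach; a quick sketch:
jps | grep -i NameNode
hdfs getconf -confKey fs.defaultFS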
3) Suppose there is file of size 514 MB stored in HDFS (Hadoop 2.x) using default block
size configuration and default replication factor. Then, how many blocks will be created
in total and what will be the size of each block?
Answer: Default block size in Hadoop 2.x is 128 MB. So, a file of size 514 MB will be divided into
5 blocks (514 MB/128 MB) where the first four blocks will be of 128 MB and the last block will
be of 2 MB only. Since we are using the default replication factor, i.e. 3, each block will be
replicated thrice. Therefore, we will have 15 blocks in total where 12 blocks will be of size 128
MB each and 3 blocks of size 2 MB each.
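You can verify the block breakdown of a file that is already in HDFS with fsck; a minimal sketch, where the file path is hypothetical:
hadoop fsck /user/edureka/bigfile.dat -files -blocks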
4) How to copy a file into HDFS with a different block size to that of existing block size
configuration?
Answer: You should start the answer with the command for changing the block size and then,
you should explain the whole procedure with an example. This is how you should answer this
question:
Yes, one can copy a file into HDFS with a different block size by using '-Ddfs.blocksize=block_size',
where the block_size is specified in bytes.
Let me explain it with an example: suppose I want to copy a file called test.txt of size, say, 120
MB into HDFS, and I want the block size for this file to be 32 MB (33554432 bytes) instead
of the default (128 MB). So, I would issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs
Now, I can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Else, I can also use the NameNode web UI for seeing the HDFS directory.
5) Failed to start server reported by cloudera-manager-installer.bin
"Failed to start server" reported by cloudera-manager-installer.bin.
/var/log/cloudera-scm-server/cloudera-scm-server.log contains a message beginning Caused by:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver...
Possible Reasons
You might have SELinux enabled.
Possible Solutions
Disable SELinux by running sudo setenforce 0 on the Cloudera Manager Server host. To disable
it permanently,
edit /etc/selinux/config.
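A minimal sketch on the Cloudera Manager Server host (verify /etc/selinux/config after editing):
$ sudo setenforce 0
$ sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
$ sestatus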
6) Cloudera Manager Server fails to start with MySQL
Cloudera Manager Server fails to start and the Server is configured to use a MySQL database to
store information about service configuration.
Possible Reasons
Tables might be configured with the MyISAM engine. The Server does not start if its tables are
configured with the MyISAM engine, and an error such as the following appears in the log file:
Tables ... have unsupported engine type .... InnoDB is required.
Possible Solutions
Make sure that the InnoDB engine is configured, not the MyISAM engine. To check what engine
your tables are using, run the following command from the MySQL shell: mysql> show table status;
(a conversion example follows the notes below). For more information, see the MySQL Database
section of the Cloudera Manager documentation.
• It is important that the datadir directory, which, by default, is /var/lib/mysql, is on a partition
that has sufficient free space.
• Cloudera Manager installation fails if GTID-based replication is enabled in MySQL.
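A minimal sketch of checking and converting an offending table from the MySQL shell; the database name (cmf here) and the table name are hypothetical, so substitute your Cloudera Manager database and the tables flagged in the error:
mysql> SHOW TABLE STATUS FROM cmf;
mysql> ALTER TABLE cmf.SOME_TABLE ENGINE = InnoDB;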
7) Agents fail to connect to Server
Agents fail to connect to Server. You get an Error 113 ('No route to host') in /var/log/cloudera-
scm-agent/cloudera-scm-agent.log.
Possible Reasons
You might have SELinux or iptables enabled.
Possible Solutions
Check /var/log/cloudera-scm-server/cloudera-scm-server.log on the Server host and
/var/log/cloudera-scm-agent/cloudera-scm-agent.log on the Agent hosts. Disable SELinux and
iptables.
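For example, on the affected hosts (a sketch; the commands vary by distribution, these match RHEL/CentOS 6):
$ sudo setenforce 0
$ sudo service iptables stop
$ sudo chkconfig iptables off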
8) Cluster hosts do not appear
Some cluster hosts do not appear when you click Find Hosts in the install or update wizard.
Possible Reasons
You may have network connectivity problems.
Possible Solutions
• Make sure all cluster hosts have SSH port 22 open.
• Check other common causes of loss of connectivity such as firewalls and interference from
SELinux.
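A quick way to test the connectivity described above from the Cloudera Manager Server host; the hostname is hypothetical:
$ nc -zv cluster-host-01.example.com 22
$ ssh cluster-host-01.example.com 'hostname -f'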
9) Databases fail to start.
Activity Monitor, Reports Manager, or Service Monitor databases fail to start.
Possible Reasons
MySQL binlog format problem.
Possible Solutions
Set binlog_format=mixed in /etc/my.cnf.
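For example, the entry belongs in the [mysqld] section of /etc/my.cnf, followed by a MySQL restart (a sketch; the service name may be mysql or mysqld depending on the platform):
[mysqld]
binlog_format = mixed

$ sudo service mysql restart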
10) Cannot start services after upgrade
You have upgraded the Cloudera Manager Server, but now cannot start services.
Possible Reasons
You may have mismatched versions of the Cloudera Manager Server and Agents.
Possible Solutions
Make sure you have upgraded the Cloudera Manager Agents on all hosts. (The previous version
of the Agents will
heartbeat with the new version of the Server, but you cannot start HDFS and MapReduce with
this combination.)
11) Activity Monitor displays a status of BAD
The Activity Monitor displays a status of BAD in the Cloudera Manager Admin Console. The log
file contains the following
message: ERROR 1436 (HY000): Thread stack overrun: 7808 bytes used of a 131072 byte stack,
and 128000 bytes needed.
Use 'mysqld -O thread_stack=#' to specify a bigger stack.
Possible Reasons
The MySQL thread stack is too small.
Possible Solutions
1. Update the thread_stack value in my.cnf to 256KB. The my.cnf file is normally located in /etc
or /etc/mysql.
2. Restart the mysql service: $ sudo service mysql restart
3. Restart Activity Monitor.
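A sketch of the corresponding my.cnf entry and restart (the file is commonly /etc/my.cnf or /etc/mysql/my.cnf):
[mysqld]
thread_stack = 256K

$ sudo service mysql restart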
12) Activity Monitor fails to start
The Activity Monitor fails to start. Logs contain the error read-committed isolation not safe for
the statement binlog format.
Possible Reasons
The binlog_format is not set to mixed.
Possible Solutions
Modify the my.cnf file to include binlog_format=mixed (see scenario 9 above) and restart MySQL.
13) Attempts to reinstall lower version of Cloudera Manager fail
Attempts to reinstall lower versions of CDH or Cloudera Manager using yum fails.
Possible Reasons
It is possible to install, uninstall, and reinstall CDH and Cloudera Manager. In certain cases, this
does not complete as expected. If you install Cloudera Manager 5 and CDH 5, then uninstall
Cloudera Manager and CDH, and then attempt
to install CDH 4 and Cloudera Manager 4, incorrect cached information may result in the
installation of an incompatible version of the Oracle JDK.
Possible Solutions
Clear information in the yum cache:
1. Connect to the CDH host.
2. Execute either of the following commands:
$ yum --enablerepo='*' clean all
or
$ rm -rf /var/cache/yum/cloudera*
3. After clearing the cache, proceed with installation.
14) Create Hive Metastore Database Tables command fails
The Create Hive Metastore Database Tables command fails due to a problem with an escape
string.
Possible Reasons
PostgreSQL versions 9.0 and higher require special configuration for Hive because of a backward-
incompatible change in the default value of the standard_conforming_strings property. Versions
before 9.0 defaulted to off, but starting with version 9.0 the default is on.
Possible Solutions
As the administrator user, use the following command to turn standard_conforming_strings off:
ALTER DATABASE <hive_db_name> SET standard_conforming_strings = off;
15) HDFS Data Nodes fail to start
After upgrading to CDH 5, HDFS DataNodes fail to start with exception:
Exception in secureMainjava.lang.RuntimeException: Cannot start datanode because the
configured max locked memory size (dfs.datanode.max.locked.memory) of 4294967296 bytes
is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.
Possible Reasons
HDFS caching, which is enabled by default in CDH 5, requires new memlock functionality from
Cloudera Manager Agents.
Possible Solutions
Do the following:
1. Stop all CDH and managed services.
2. On all hosts with Cloudera Manager Agents, hard restart the Agents. Before performing this
step, ensure you understand the semantics of the hard_restart command by reading Hard
Stopping and Restarting Agents.
• Packages
– RHEL-compatible 7 and higher:
$ sudo service cloudera-scm-agent next_stop_hard
$ sudo service cloudera-scm-agent restart
– All other Linux distributions:
sudo service cloudera-scm-agent hard_restart
• Tarballs
– To stop the Cloudera Manager Agent, run this command on each Agent host:
– RHEL-compatible 7 and higher:
$ sudo tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
$ sudo tarball_root/etc/init.d/cloudera-scm-agent restart
– All other Linux distributions:
$ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
– If you are running single user mode, start Cloudera Manager Agent using the user account you
chose.
For example, to run the Cloudera Manager Agent as cloudera-scm, you have the following
options:
– Run the following command:
– RHEL-compatible 7 and higher:
$ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
$ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent restart
– All other Linux distributions:
$ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent hard_restart
– Edit the configuration files so the script internally changes the user, and then run the script as
root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
export CMF_SUDO_CMD=" "
2. Change the user and group in tarball_root/etc/init.d/cloudera-scm-agent to the
user you want the Agent to run as. For example, to run as cloudera-scm, change the user
and group as follows:
USER=cloudera-scm
GROUP=cloudera-scm
3. Run the Agent script as root:
• RHEL-compatible 7 and higher:
$ sudo tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
$ sudo tarball_root/etc/init.d/cloudera-scm-agent restart
• All other Linux distributions:
$ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
3. Start all services.
16) Error: “Error in dispatcher thread java.util.concurrent.RejectedExecutionException"
when running heavy load of job from YARN Resource Manager
Problem Description:
The YARN Resource Manager (RM) with HA configured is failing when experiencing heavy loads
of jobs. Even the standby RM is crashing; both the standby RM and the previously active RM
fail.
The following error is displayed in the Resource Manager log at the moment of shutdown:

1. 2018-10-23 18:50:42,552 FATAL event.AsyncDispatcher


(AsyncDispatcher.java:dispatch(190)) -
2. Error in dispatcher thread
3. java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@5407c4c8 rejected from
4. java.util.concurrent.ThreadPoolExecutor@74d60fd0[Terminated, pool size = 14147,
active threads = 0, queued tasks = 0, completed tasks = 32283]
5. at
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.ja
va:2063)
6. at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
7. at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
8. at
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
9. at
org.apache.hadoop.registry.server.services.RegistryAdminService.submit(RegistryAdminServic
e.java:176)
10. at
org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.purgeRecordsAsyn
c(RMRegistryOperationsService.java:200)
11. at
org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.purgeRecordsAsyn
c(RMRegistryOperationsService.java:170)
12. at
org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.onContainerFinish
ed(RMRegistryOperationsService.java:146)
13. at
org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService.handleAppAttemp
tEvent(RMRegistryService.java:156)
14. at
org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService$AppEventHandler
.handle(RMRegistryService.java:188)
15. at
org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService$AppEventHandler
.handle(RMRegistryService.java:182)
16. at
org.apache.hadoop.yarn.event.AsyncDispatcher$MultiListenerHandler.handle(AsyncDispatcher.
java:279)
17. at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
18. at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
19. at java.lang.Thread.run(Thread.java:748)
20. 2018-10-23 18:50:42,552 INFO capacity.ParentQueue
(ParentQueue.java:assignContainers(475)) -
21. assignedContainer queue=root usedCapacity=0.78571427
absoluteUsedCapacity=0.78571427
22. used=<memory:3914240, vCores:1076> cluster=<memory:4981760, vCores:2318>
23. 2018-10-23 18:50:42,559 INFO rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(422)) -
24. container_e173_1540320252022_0085_02_002570 Container Transitioned from
ALLOCATED to ACQUIRED
25. 2018-10-23 18:50:42,559 INFO rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(422)) -
26. container_e173_1540320252022_0085_02_002571 Container Transitioned from
ALLOCATED to ACQUIRED
27. 2018-10-23 21:49:24,484 INFO resourcemanager.ResourceManager
(LogAdapter.java:info(45)) - STARTUP_MSG:
28. /************************************************************
29. STARTUP_MSG: Starting ResourceManager
30. STARTUP_MSG: user = yarn
31. STARTUP_MSG: host = ustsmascmsp920.prod/10.86.128.54
32. STARTUP_MSG: args = []
33. STARTUP_MSG: version = 2.7.3.2.6.1.0-129
Possible Reasons:
Resource Manager has to purge the records under Zookeeper for every container that
completes.
While doing this, it scans almost all znodes from the root path. An increased number of znodes
leads to ZooKeeper client session drops and causes the AsyncDispatcher queue to get
overwhelmed.
The Resource Manager might also shut down due to a race condition.
Possible Solution:
This issue is resolved in HDP-2.6.5. For versions prior to HDP-2.6.5, do the following to disable
the Resource Manager registry:
1. Log into the Ambari UI.
2. Click the YARN service.
3. Click Config > Advanced tab.
4. Expand the Advanced yarn-site section.
5. Set hadoop.registry.rm.enabled to false.
6. Restart all affected components (the equivalent yarn-site entry is sketched below).
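If you manage the configuration outside Ambari, the same setting corresponds to this yarn-site.xml property (a sketch):
<property>
<name>hadoop.registry.rm.enabled</name>
<value>false</value>
</property>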
17) Error:"(auth: KERBEROS) is not authorized for protocol interface" occurs on Node
Manager logs as the Node Manager goes down even after restarting
Problem Description:
The following error occurs in Node Manager logs as the Node Manager goes down even after
restarting:
1. Caused by:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationEx
ception):
2. User nm/********ore.local@AI*****.LOCAL (auth:KERBEROS) is not authorized for
protocol interface
3. org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by
4. nm/******@*****.LOCAL
5. at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
6. at org.apache.hadoop.ipc.Client.call(Client.java:1498)
7. at org.apache.hadoop.ipc.Client.call(Client.java:1398)
8. at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
9. at com.sun.proxy.$Proxy87.registerNodeManager(Unknown Source)
10. at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeM
anager
11. (ResourceTrackerPBClientImpl.java:68)
12. ... 16 more
13. 2018-05-04 12:05:32,356 INFO nodemanager.NodeManager (LogAdapter.java:info(45))
- SHUTDOWN_MSG:
14. SHUTDOWN_MSG: Shutting down NodeManager at <nodemanager_host/10.##.3.###
Possible Reasons:
This issue occurs when the Node Manager address is resolved incorrectly. The authorization
fails in a Kerberized cluster. It results in Node Manager registration failing, as the Resource
Manager does not accept requests from an unauthorized host.
Possible Solution:
To resolve this issue, do the following:
1. If a DNS server is being used, the network team should ensure the Node Manager address is
resolved correctly.
2. If a hosts file is being used, ensure the Resource Manager host has its /etc/hosts file updated
with all the Node Manager hostnames mapped properly to their IP addresses (a quick check is
sketched below).
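A quick sketch of verifying resolution from the Resource Manager host; the hostname is hypothetical:
getent hosts nodemanager01.example.com
grep nodemanager01 /etc/hosts
hostname -f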
18) ERROR: “Logs not available for attempt_1542239646200_2941_m_000000_0.
Aggregation may not be complete, check back later or try the node manager at :45454"
when running Hive jobs
Problem Description:
Container job history is not available on Resource Manager UI while running Hive jobs.
Resource Manager UI displays the following error:
1. Logs not available for attempt_1542239646200_2941_m_000000_0. Aggregation may
not be complete,
2. Check back later or try the node manager at <NODEMANAGER_HOST>:45454

The log aggregation status for the application ID on Resource Manager UI is:
1. Log Aggregation Status TIME_OUT

The Node Manager log where the YARN container is running, displays the following error:
1. 2018-11-14 11:49:46,745 ERROR filecontroller.LogAggregationFileController
(LogAggregationFileController.java:run(363))
2. - Failed to setup application log directory for application_1520644472079_1567017
3.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitExcep
tion$MaxDirectoryItemsExceededException):
4. The directory item limit of /apps/opt/hdp/logs/app-logs/pcjaapp/logs-ifile is
exceeded: limit=1048576 items=1048576
5. at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:113
2)
6. at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1177)
7. at
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.unprotectedMkdir(FSDirMkdirOp.java:
237)
8. at
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.createSingleDirectory(FSDirMkdirOp.j
ava:191)
9. at
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.createChildrenDirectories(FSDirMkdir
Op.java:166)
10. at
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:97)
11. at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4181)
12. at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.ja
va:1109)
13. at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs
14. (ClientNamenodeProtocolServerSideTranslatorPB.java:645)
15. at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
ol$2.callBlockingMethod
16. (ClientNamenodeProtocolProtos.java)
17. at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngin
e.java:640)
18. at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
19. at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
20. at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
21. at java.security.AccessController.doPrivileged(Native Method)
22. at javax.security.auth.Subject.doAs(Subject.java:422)
23. at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
24. at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
Possible Reasons:
This issue occurs when the log aggregation directory is full. By default, the local Node Manager
log directory is /hadoop/yarn/log; this can be checked in the YARN configuration under the
yarn.nodemanager.log-dirs property.
Possible Solution:
To resolve this issue, do the following:
1. Log into Ambari UI.
2. Click on YARN service > Configs > Advanced tab.
3. Scroll down to Node Manager section.
4. Check the path value of the yarn.nodemanager.log-dirs property.
5. Go to the Node Manager host and clean older entries in the path described in the property.
6. Resubmit the job and check the container log in the Resource Manager UI (see the sketch
below for checking the HDFS directory item limit from the error).
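Because the Node Manager error above is a MaxDirectoryItemsExceededException on the aggregated-log directory in HDFS, it can also help to compare the directory's item count against the NameNode limit; a sketch with a hypothetical path:
hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items
hdfs dfs -count /app-logs/someuser/logs-ifile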
19) Error: "status": 500, "message": "Server Error" when opening or downloading files from HDFS
encrypted folders through Files View
Problem Description
When opening or downloading files from encrypted folders through Files View, it fails with the
following error:
{
"status": 500,
"message": "Server Error"
}
Possible Reasons:
The DataNode gets the token only for the hdfs user. The REST call from WebHdfsFileSystem
(DataNode) to Ranger KMS uses the 'hdfs' proxy. Hence, the proxy user settings for hdfs must be
added in Ranger KMS.
Possible Solution:
To resolve this issue, add the following properties in the custom kms-site and restart the Ranger
KMS service:
hadoop.kms.proxyuser.hdfs.groups=*
hadoop.kms.proxyuser.hdfs.hosts=*
hadoop.kms.proxyuser.hdfs.users=*
20) You get an error that your cluster is in "safe mode"
Possible Reasons:
Your cluster enters safe mode when it hasn't been able to verify that all the data nodes
necessary to replicate your data are up and responding.
Check the documentation to learn more about safe mode.
Possible Solution:
1. First, wait a minute or two and then retry your command. If you just started your cluster, it's
possible that it isn't fully initialized yet.
2. If waiting a few minutes didn't help and you still get a "safe mode" error, check your logs to
see if any of your data nodes didn't start correctly (either they have Java exceptions in their
logs or they have messages stating that they are unable to contact some other node in your
cluster).
If this is the case you need to resolve the configuration issue (or possibly pick some new nodes)
before you can continue.
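You can check safe mode status and, once the DataNodes are confirmed healthy, leave it manually; a quick sketch:
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave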
21) You get a NoRouteToHostException in your logs or in stderr output from a command.
Possible Reasons:
One of your nodes cannot be reached correctly. This may be a firewall issue, so you should
report it to me.
Possible Solution:
The only workaround is to pick a new node to replace the unreachable one.
Currently, I think that creusa is unreachable, but all other Linux boxes should be okay. None of
the Macs will currently work in a cluster.
22) You get an error that "remote host identification has changed" when you try to ssh to
localhost.
Possible Reasons:
You have moved your single node cluster from one machine in the Berry Patch to another.
The name localhost thus points to a new machine, and your ssh client thinks it might be a
man-in-the-middle attack.
Possible Solution:
- You can tell ssh to skip checking the host key for localhost.
- You do this by setting NoHostAuthenticationForLocalhost to yes in ~/.ssh/config.
- You can accomplish this with the following command:
echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
23) Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir,
but you get an error message when you try to put files into the HDFS (e.g., when you run a
command like bin/hadoop dfs -put).
Possible Reasons:
Creating directories is only a function of the NameNode, so your DataNode is not exercised until
you actually want to put some bytes into a file.
If you are sure that the DataNode is started, then it could be that your DataNodes are out of disk
space.
Possible Solution:
- Go to the HDFS info web page (open your web browser and go to
http://namenode:dfs_info_port where namenode is the hostname of
your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the
QuickStart on your personal computer,
then this URL will be http://localhost:50070).
- Once at that page click on the number where it tells you how many DataNodes you have to look
at a list of the DataNodes in your cluster.
- If it says you have used 100% of your space, then you need to free up room on local disk(s) of
the DataNode(s).
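You can also check DataNode capacity and usage from the command line, using the same bin/hadoop form as the other examples in this section:
bin/hadoop dfsadmin -report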
24) You try to run the grep example from the QuickStart but you get an error message
like this:
java.io.IOException: Not a file:
hdfs://localhost:9000/user/ross/input/conf
Possible Reasons:
You may have created a directory inside the input directory in the HDFS.
For example, this might happen if you run bin/hadoop dfs -put conf input twice in a row (this
would create a subdirectory in input... why?).
Possible Solution:
The easiest way to get the example running is to just start over and recreate the input directory:
bin/hadoop dfs -rmr input
bin/hadoop dfs -put conf input
25) Your DataNodes won't start, and you see something like this in logs/*datanode*:
Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
Possible Reasons:
Your Hadoop namespaceID became corrupted. Unfortunately, the easiest thing to do is to
reformat the HDFS.
Possible Solution:
You need to do something like this:
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
26) When you try the grep example in the QuickStart, you get an error like the following:
org.apache.hadoop.mapred.InvalidInputException:
Input path doesnt exist : /user/ross/input
Possible Reasons:
You haven't created an input directory containing one or more text files.
Possible Solution:
bin/hadoop dfs -put conf input
27) When you try the grep example in the QuickStart, you get an error like the following:
org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory /user/ross/output already exists
Possible Reasons:
You might have already run the example once, creating an output directory. Hadoop doesn't like
to overwrite files.
Possible Solution:
Remove the output directory before rerunning the example:
bin/hadoop dfs -rmr output
Alternatively you can change the output directory of the grep example, something like this:
bin/hadoop jar hadoop-*-examples.jar \
grep input output2 'dfs[a-z.]+'
28) You can run Hadoop jobs written in Java (like the grep example), but your Hadoop
Streaming jobs (such as the Python example that fetches web page titles) won't work.
Possible Reasons:
You might have given only a relative path to the mapper and reducer programs.
The tutorial originally just specified relative paths, but absolute paths are required if you are
running in a real cluster.
Possible Solution:
Use absolute paths like this from the tutorial:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
-mapper $HOME/proj/hadoop/multifetch.py \
-reducer $HOME/proj/hadoop/reducer.py \
-input urls/* \
-output titles
29) Hive Server2 crashes / exits unexpectedly
Possible Reasons:
It was an OOM (Out of Memory) error; the JVM heap size was insufficient for HiveServer2.
Error message location: the HiveServer2 stdout/stderr logs (HS2 > Process > Logs > stdout shows
the OOM error).
Possible Solution:
We changed the heap size of HiveServer2 in the Hive configuration (heap size).
Note: for 2-10 concurrent connections, keep the heap size at 4-10 GB; in our case it was 8 GB and
we raised it to 12 GB to resolve the OOM.
30) Hive Server2 crashes with "No Error Message" (it is tough to find out when there is no message)
Error Description: the query suddenly exits from the Hue browser with no error message and no
logs.
Possible Reasons:
Security issue (authorization).
Possible Solution:
After digging in at a granular level, we figured out that it was a security issue: it happened only
when one Hive user had created a table and another Hive user tried to access it without having
permission on the first user's table. When the other user tried to access that table, no error
message was issued in the Hue browser; the Hive query simply exited without any error message,
and no logs related to the incident were shown.
31) Sqoop stuck at the end
Possible Reasons:
It was a weird issue while using Sqoop to pull data from MySQL to HDFS. There were 3 jobs
running with a total of 25k records; the job hung at 24k records with the status still showing as
running, and with no logs and no updates.
Possible Solution:
We increased the Java heap by raising mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb by 1-2 GB in the YARN configuration (or in mapred-site.xml and
yarn-site.xml).
32) Sqoop connectivity issue: "The Network Adapter could not establish the connection"
Description: it was a firewall issue, so I was unable to complete importing the tables.
Possible Reasons:
It was a firewall issue: the port on which MySQL runs was blocked on the gateway node.
(Optional point: you can add what issue you faced after an upgrade. In our case, when upgrading
we had stopped all cluster services and asked the network team to block the MySQL port as a
prerequisite, and it was still blocked after the upgrade because the MySQL port had not been
reopened on the gateway node.)
Possible Solution:
We opened the MySQL port on the gateway node, and the connectivity issue was resolved.
33) Upgrade issue: we faced a situation where the Active NameNode (ANN) and the Standby
NameNode (SBNN) both went into the standby state
Possible Reasons:
One faulty JournalNode (JN).
Description: the ANN and SBNN are connected to the Quorum JournalNodes (QJNs). In our case,
one of the JNs had become faulty: the edit logs were not syncing properly from the ANN to the
quorum, and as a result both NameNodes went into standby.
Possible Solution:
We checked the faulty JN and found corrupted metadata. We moved its edit logs somewhere else,
and since we had 3 JournalNodes, we copied the edit logs from a working node onto the faulty
node and restarted. After that, the ANN came back up as the active NameNode and the SBNN
remained the standby.
34) Missing and under-replicated blocks
Possible Reasons:
HDFS is unable to find some blocks, or some blocks were missed while writing (when writing to
that node, the other nodes may have been busy and so unable to take the write).
Error message: 2 missing blocks in the cluster.
Possible Solution:
There are several ways to solve and avoid this problem (see the sketch below):
1) If you have under-replicated data, HDFS should automatically replicate the blocks to other
data nodes to match the replication factor.
2) If it is not replicating on its own, run the balancer.
3) You can also set the replication factor on a specific file which is under-replicated.
Monitor: Chart Builder and the NameNode UI.
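A minimal sketch of the corresponding commands; the file path is hypothetical:
hdfs fsck / -list-corruptfileblocks
hdfs dfs -setrep -w 3 /data/under_replicated_file
hdfs balancer -threshold 10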
35) Job-related issue: a job issued by a user is failing
Possible Reasons:
The job is failing because the user has crossed the maximum limit of the assigned quota.
Description: when a user exceeds the allocated (space) quota, it cannot do any further write
activity.
Possible Solution:
- The user should delete old data to make room for new data or files.
- Increase the user's space quota with setSpaceQuota (see the sketch below).
To check the space quota: hdfs dfs -count -q -h /user/*
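For example, to check and then raise a single user's quota (the user directory and size are hypothetical; setting quotas requires HDFS superuser privileges):
hdfs dfs -count -q -h /user/someuser
hdfs dfsadmin -setSpaceQuota 500g /user/someuser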
36) Job stuck in the ACCEPTED state
Possible Reasons:
1) Lack of sufficient resources to launch the job. E.g., suppose a submitted job requires a 56 GB
container and you don't have enough resources; in this case the job is stuck and will not start, and
you have to wait until sufficient memory is freed up over time to launch your job.
2) The job is stuck in the ACCEPTED state because you have limited the YARN maximum-number-
of-applications property in the Fair Scheduler.
Possible Solution:
1) Increase resources, wait for resources to free up, or kill a job that is less important or idle.
2) Increase the value of the maximum-number-of-applications parameter.
37) Kerberos issue
Description: after enabling Kerberos, the HDFS DataNode service was not starting up. The
DataNodes were unable to authenticate themselves because all services also require service
principals, unlike user principals.
Possible Reasons:
The DataNode was unable to start due to missing principals in the keytab.
RCA: check the logs under /var/log.
Possible Solution:
We regenerated the missing principals in the Security options.
Keytab file name: dn.service.keytab
Keytab location: /var/run/cloudera-scm-agent/process/
Error message: Kerberos error messages about which principal is not present in the Kerberos
database.
38) Issues with Generate Credentials with Active Directory
Description: Role is missing Kerberos key tab; it means the Generate Credentials command
failed.
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)
Possible Reasons:
The Domain Controller specified is incorrect or LDAPS has not been enabled for it.
Possible Solution:
Verify the KDC configuration by going to the Cloudera Manager Admin Console and navigating to
Administration > Settings > Kerberos.
Also check that LDAPS is enabled for Active Directory.
39) Issues with Generate Credentials with Active Directory
Description: Role is missing Kerberos keytab, it means the Generate Credentials command
failed.
ldap_add: Insufficient access (50)
Possible Reasons:
The Active Directory account you are using for Cloudera Manager does not have permissions to
create other accounts.
Possible Solution:
Use the Delegate Control wizard to grant permission to the Cloudera Manager account to create
other accounts.
You can also login to Active Directory as the Cloudera Manager user to check that it can create
other accounts in your Organizational Unit.
40) Issues with Generate Credentials with MIT
Description: kadmin: Cannot resolve network address for admin server in requested realm
while initializing kadmin interface.
Possible Reasons:
The hostname for the KDC server is incorrect.
Possible Solution:
Check the kdc field for your default realm in krb5.conf and make sure the hostname is correct.
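For reference, the relevant entry in /etc/krb5.conf looks like this (a sketch; the realm and hostname are hypothetical):
[realms]
  EXAMPLE.COM = {
    kdc = kdc01.example.com
    admin_server = kdc01.example.com
  }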
41) Description: A user must have a valid Kerberos ticket to interact with a secure
Hadoop cluster.
Running any Hadoop command (such as hadoop fs -ls) will fail if you do not have a valid
Kerberos ticket in your credentials cache.
If you do not have a valid ticket, you will receive an error such as:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed
on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Possible Solution:
You can examine the Kerberos tickets currently in your credentials cache by running the klist
command.
You can obtain a ticket by running the kinit command and either specifying a keytab file
containing credentials, or entering the password for your principal.
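A minimal sketch of both checks; the principal and keytab path are hypothetical:
klist
kinit user@EXAMPLE.COM
kinit -kt /etc/security/keytabs/user.keytab user@EXAMPLE.COM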
42) A cluster fails to run jobs after security is enabled.
Description: A cluster that was previously configured to not use security may fail to run jobs
for certain users on certain TaskTrackers
(MRv1) or Node Managers (YARN) after security is enabled due to the following sequence of
events:
1. A cluster is at some point in time configured without security enabled.
2. A user X runs some jobs on the cluster, which creates a local user directory on each
TaskTracker or NodeManager.
3. Security is enabled on the cluster.
4. User X tries to run jobs on the cluster, and the local user directory on (potentially a subset of)
the TaskTrackers
or NodeManagers is owned by the wrong user or has overly-permissive permissions.
Possible Solution:
Delete the mapred.local.dir or yarn.nodemanager.local-dirs directories for that user across the
cluster.
43) The NameNode starts but clients cannot connect to it and error message contains
enctype code 18.
Description:
The NameNode keytab file does not have an AES256 entry, but client tickets do contain an
AES256 entry. The NameNode
starts but clients cannot connect to it. The error message does not refer to "AES256", but does
contain an enctype
code "18".
Possible Solution:
Make sure the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File" is
installed or remove
aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file.
44) Users are unable to obtain credentials when running Hadoop jobs or commands.
Description:
This error occurs because the ticket message is too large for the default UDP protocol. An error
message similar to the
following may be displayed:
13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Fail to create credential.
(63) - No service creds)]
Possible Solution:
Force Kerberos to use TCP instead of UDP by adding the following parameter to libdefaults in
the krb5.conf file
on the client(s) where the problem is occurring.
[libdefaults]
udp_preference_limit = 1
If you choose to manage krb5.conf through Cloudera Manager, this will automatically get added
to krb5.conf.
45) Cloudera Manager cluster services fail to start
Possible Reason:
The encryption types might not match between your KDC and the krb5.conf files on your hosts.
Possible Solution:
If you are using AES-256, follow the instructions in "Step 2: If You are Using AES-256 Encryption,
Install the JCE Policy File" to deploy the JCE policy file on all hosts.
• If the version of the JCE policy files does not match the version of Java installed on a node, then
services will not
start. This is because the cryptographic signatures of the JCE policy files cannot be verified if the
wrong version is
installed. For example, if a DataNode does not start, you will see the following error in the logs,
showing that verification of the cryptographic signature within the JCE policy files failed:
Exception in secureMain
java.lang.ExceptionInInitializerError
at javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:324)
at javax.crypto.KeyGenerator.<init>(KeyGenerator.java:157)
Caused by: java.lang.SecurityException: The jurisdiction policy files are not signed by
a trusted signer!
at javax.crypto.JarVerifier.verifyPolicySigned(JarVerifier.java:289)
at javax.crypto.JceSecurity.loadPolicies(JceSecurity.java:316)
at javax.crypto.JceSecurity.setupJurisdictionPolicies(JceSecurity.java:261)
Solution: Download the correct JCE policy files for the version of Java you are running (Java 6 or
Java 7).
Download and unpack the zip file, then copy the two JAR files to the $JAVA_HOME/jre/lib/security
directory on each node within the cluster.
46) Retrieval of encryption keys fails
Description
You see the following error when trying to list encryption keys
user1@example-sles-4:~> hadoop key list
Cannot list keys for KeyProvider: KMSClientProvider[https:
//example-sles-2.example.com:16000/kms/v1/]: Retrieval of all keys failed.
Possible Solution:
Make sure your truststore has been updated with the relevant certificate(s), such as the Key
Trustee server certificate.
47) DistCp between unencrypted and encrypted locations fails
Description
By default, DistCp compares checksums provided by the filesystem to verify that data was
successfully copied to the
destination. However, when copying between unencrypted and encrypted locations, the
filesystem checksums will
not match since the underlying block data is different.
Possible Solution:
Specify the -skipcrccheck and -update distcp flags to avoid verifying checksums.
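For example (a sketch; the source and destination paths are hypothetical):
hadoop distcp -update -skipcrccheck /unencrypted/source /encryption_zone/destination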
48) (CDH 5.6 and lower) Cannot move encrypted files to trash
Description
In CDH 5.6 and lower, with HDFS encryption enabled, you cannot move encrypted files or
directories to the trash directory.
Possible Solution:
To remove encrypted files/directories, use the following command with the -skipTrash flag
specified to bypass trash:
hadoop fs -rm -r -skipTrash /testdir
49) NameNode - KMS communication fails after long periods of inactivity
Description
Encrypted files and encryption zones cannot be created if a long period of time (by default, 20
hours) has passed since
the last time the KMS and NameNode communicated.
Possible Solution:
For lower CDH 5 releases, there are two possible workarounds to this issue:
• You can increase the KMS authentication token validity period to a very high number. Since
the default value is 10 hours, this bug will only be encountered after 20 hours of no
communication between the NameNode and the KMS. Add the following property to the
kms-site.xml Safety Valve:
<property>
<name>hadoop.kms.authentication.token.validity</name>
<value>SOME VERY HIGH NUMBER</value>
</property>
• You can switch the KMS signature secret provider to the string secret provider by adding the
following property
to the kms-site.xml Safety Valve:
<property>
<name>hadoop.kms.authentication.signature.secret</name>
<value>SOME VERY SECRET STRING</value>
</property>
50) What if my namenode is down and the standby namenode is also not coming up;
what can be the issue?
Answer: Standby namenode and journal node configurations were in a corrupted state,
so that when the cluster tried to switch to the standby, you encountered the error that
you reported.
Initially, we have to put the primary namenode into safe mode and save the namespace
with the following commands:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
su - hdfs -c "hdfs namenode -bootstrapStandby -force"
This was to make sure that the namenode was in a consistent state before we attempted
to restart the HDFS components one last time, to make sure all processes started cleanly
and that HDFS would automatically leave safe mode.
OR
1. Put Active NN in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter
2. Do a save namespace operation on Active NN
sudo -u hdfs hdfs dfsadmin -saveNamespace
3. Leave Safemode
sudo -u hdfs hdfs dfsadmin -safemode leave
4. Login to Standby NN
5. Run the below command on the Standby namenode to get the latest fsimage that we saved in
the above steps.
sudo -u hdfs hdfs namenode -bootstrapStandby -force
