FAQ

Environment

A fully distributed Hadoop cluster is used:

192.168.2.241 hadoop01 
192.168.2.242 hadoop02
192.168.2.243 hadoop03

Decommissioning a datanode

Add the following to /etc/hadoop/conf/hdfs-site.xml:


<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/hosts-exclude</value>
</property>

Add the node to be decommissioned to /etc/hadoop/conf/hosts-exclude:

hadoop03

Refresh the Hadoop node configuration:

hdfs dfsadmin -refreshNodes

Check the report; hadoop03 has been decommissioned and only two live datanodes remain:

[hadoop@hadoop01 ~]$ hdfs dfsadmin -report
Configured Capacity: 72955723776 (67.95 GB)
Present Capacity: 33702436598 (31.39 GB)
DFS Remaining: 32456507392 (30.23 GB)
DFS Used: 1245929206 (1.16 GB)
DFS Used%: 3.70%
Replicated Blocks:
Under replicated blocks: 412
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 412
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.2.241:9866 (hadoop01)
Hostname: hadoop01
Decommission Status : Normal
Configured Capacity: 36477861888 (33.97 GB)
DFS Used: 657819584 (627.35 MB)
Non DFS Used: 22519461952 (20.97 GB)
DFS Remaining: 13300580352 (12.39 GB)
DFS Used%: 1.80%
DFS Remaining%: 36.46%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Wed May 25 05:03:23 EDT 2022
Last Block Report: Wed May 25 04:20:17 EDT 2022
Num of Blocks: 871


Name: 192.168.2.242:9866 (hadoop02)
Hostname: hadoop02
Decommission Status : Normal
Configured Capacity: 36477861888 (33.97 GB)
DFS Used: 588109622 (560.87 MB)
Non DFS Used: 16733825226 (15.58 GB)
DFS Remaining: 19155927040 (17.84 GB)
DFS Used%: 1.61%
DFS Remaining%: 52.51%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Wed May 25 05:03:23 EDT 2022
Last Block Report: Wed May 25 04:20:17 EDT 2022
Num of Blocks: 599
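To verify decommissioning without reading the whole report, the `Name` / `Decommission Status` pairs can be pulled out with awk. A minimal sketch, run here against a pasted sample of the report above; on a live cluster you would pipe `hdfs dfsadmin -report` in directly. During decommissioning the status reads "Decommission in progress" before the node drops out of the live list.

```shell
# Sample of `hdfs dfsadmin -report` output (values taken from the report above);
# on a real cluster, replace the heredoc with: report=$(hdfs dfsadmin -report)
report=$(cat <<'EOF'
Name: 192.168.2.241:9866 (hadoop01)
Decommission Status : Normal
Name: 192.168.2.242:9866 (hadoop02)
Decommission Status : Normal
EOF
)
# Print "host -> status" for every datanode in the report
summary=$(printf '%s\n' "$report" |
  awk -F': ' '/^Name:/ {host=$2} /^Decommission Status/ {print host " -> " $2}')
printf '%s\n' "$summary"
```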

To bring the node back online, first remove hadoop03 from /etc/hadoop/conf/hosts-exclude (a host listed in both the include and exclude files stays decommissioned), then add the following to /etc/hadoop/conf/hdfs-site.xml:

<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/hosts</value>
</property>

Write all datanode hostnames to /etc/hadoop/conf/hosts:

hadoop01
hadoop02
hadoop03

Refresh the Hadoop node configuration:

hdfs dfsadmin -refreshNodes

Manually start the datanode on hadoop03:

hdfs --daemon start datanode

A disk fails on a datanode

  1. On the failed node, remove the failed disk's mount point from the dfs.datanode.data.dir property in /etc/hadoop/conf/hdfs-site.xml;

  2. On the failed node, remove the failed disk's mount point from the yarn.nodemanager.local-dirs property in /etc/hadoop/conf/yarn-site.xml;

  3. Restart the DataNode and NodeManager services on that node.
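The edits in steps 1 and 2 both amount to deleting one entry from a comma-separated directory list. A minimal sketch of that edit, with made-up paths for illustration:

```shell
# Hypothetical dfs.datanode.data.dir value; /data2/dfs sits on the failed disk
dirs="/data1/dfs,/data2/dfs,/data3/dfs"
failed="/data2/dfs"
# Split on commas, drop the exact failed entry, and re-join
new_dirs=$(printf '%s' "$dirs" | tr ',' '\n' | grep -vx "$failed" | paste -sd, -)
echo "$new_dirs"   # /data1/dfs,/data3/dfs
```

The same pattern applies to yarn.nodemanager.local-dirs; after updating both files, restart the services as in step 3.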

Hadoop enters safe mode

  1. If Hadoop started and verified normally, just wait a while; the namenode leaves safe mode automatically once enough block reports have come in.

Or leave it manually:

hdfs dfsadmin -safemode leave
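Rather than leaving safe mode by hand, a script can poll `hdfs dfsadmin -safemode get`, whose output is the line `Safe mode is ON` or `Safe mode is OFF`. A sketch of the check, run here against sample strings rather than a live namenode:

```shell
# Returns success while safe mode is reported as ON.
# On a real cluster the argument would be: "$(hdfs dfsadmin -safemode get)"
in_safe_mode() {
  printf '%s\n' "$1" | grep -q 'ON$'
}

in_safe_mode "Safe mode is ON"  && echo "still waiting"
in_safe_mode "Safe mode is OFF" || echo "safe mode is over"
```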

Kerberos debugging

KRB5_TRACE=/dev/stdout

A successful run looks like this:

[root@test-152 keytabs]#  KRB5_TRACE=/dev/stdout kinit -kt test.keytab test
[3737241] 1731657026.743873: Getting initial credentials for [email protected]
[3737241] 1731657026.743874: Looked up etypes in keytab: aes256-cts, aes128-cts
[3737241] 1731657026.743876: Sending unauthenticated request
[3737241] 1731657026.743877: Sending request (175 bytes) to example.COM
[3737241] 1731657026.743878: Resolving hostname test-152
[3737241] 1731657026.743879: Sending initial UDP request to dgram 172.20.1.152:88
[3737241] 1731657026.743880: Received answer (692 bytes) from dgram 172.20.1.152:88
[3737241] 1731657026.743881: Sending DNS URI query for _kerberos.example.COM.
[3737241] 1731657026.743882: No URI records found
[3737241] 1731657026.743883: Sending DNS SRV query for _kerberos-master._udp.example.COM.
[3737241] 1731657026.743884: Sending DNS SRV query for _kerberos-master._tcp.example.COM.
[3737241] 1731657026.743885: No SRV records found
[3737241] 1731657026.743886: Response was not from master KDC
[3737241] 1731657026.743887: Processing preauth types: PA-ETYPE-INFO2 (19)
[3737241] 1731657026.743888: Selected etype info: etype aes256-cts, salt "example.COMtest", params ""
[3737241] 1731657026.743889: Produced preauth for next request: (empty)
[3737241] 1731657026.743890: Getting AS key, salt "example.COMtest", params ""
[3737241] 1731657026.743891: Retrieving [email protected] from FILE:test.keytab (vno 0, enctype aes256-cts) with result: 0/Success
[3737241] 1731657026.743892: AS key obtained from gak_fct: aes256-cts/C03C
[3737241] 1731657026.743893: Decrypted AS reply; session key is: aes256-cts/0610
[3737241] 1731657026.743894: FAST negotiation: available
[3737241] 1731657026.743895: Initializing FILE:/tmp/krb5cc_0 with default princ [email protected]
[3737241] 1731657026.743896: Storing [email protected] -> krbtgt/[email protected] in FILE:/tmp/krb5cc_0
[3737241] 1731657026.743897: Storing config in FILE:/tmp/krb5cc_0 for krbtgt/[email protected]: fast_avail: yes
[3737241] 1731657026.743898: Storing [email protected] -> krb5_ccache_conf_data/fast_avail/krbtgt\/example.COM\@example.COM@X-CACHECONF: in FILE:/tmp/krb5cc_0

[root@test-152 keytabs]# KRB5_TRACE=/dev/stdout klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: [email protected]

Valid starting Expires Service principal
11/15/2024 15:50:26 11/16/2024 15:50:26 krbtgt/[email protected]
[root@test-152 keytabs]# KRB5_TRACE=/dev/stdout kdestroy
[3737246] 1731657059.396333: Destroying ccache FILE:/tmp/krb5cc_0
[root@test-152 keytabs]#
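When scanning a long trace, the line to look for is `Decrypted AS reply` — it means the keytab key matched and the AS exchange succeeded; with a wrong key, kinit typically fails before this point with a preauthentication error. A small grep sketch over a line captured from the trace above:

```shell
# One line from the successful kinit trace above
trace='[3737241] 1731657026.743893: Decrypted AS reply; session key is: aes256-cts/0610'
# In practice: KRB5_TRACE=/dev/stdout kinit -kt test.keytab test | grep -q 'Decrypted AS reply'
if printf '%s\n' "$trace" | grep -q 'Decrypted AS reply'; then
  result="AS exchange succeeded"
fi
echo "$result"
```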

Viewing YARN logs

yarn application -list                # alias: yarn app -list
yarn app -list -appStates ALL         # or: yarn application -list -appStates FINISHED,FAILED,KILLED
yarn application -status <application_id>
yarn logs -applicationId <application_id>
yarn logs -applicationId <application_id> -containerId <container_id> > container_logs.txt


yarn queue -status default
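The application ID needed by the `-status` and `logs` commands can be scripted out of the `-list` output, which prints one row per application containing the ID. A sketch against a made-up sample row (the real listing also has header lines, which the grep simply skips):

```shell
# Hypothetical single row of `yarn application -list` output
row='application_1653465789000_0001  wordcount  MAPREDUCE  hadoop  default  RUNNING'
app_id=$(printf '%s\n' "$row" | grep -o 'application_[0-9_]*' | head -n1)
echo "$app_id"   # application_1653465789000_0001
# then: yarn logs -applicationId "$app_id"
```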