Hive Deployment

Environment

A fully distributed Hadoop cluster is used; the services run as the hadoop user.

192.168.2.241 hadoop01 # hive db
192.168.2.242 hadoop02 # hive client
192.168.2.243 hadoop03 # mysql db

For MySQL installation and user grants, see the Hadoop deployment notes.

Hive Installation

Official downloads: https://www.apache.org/dyn/closer.cgi/hive/

Run on hadoop01 and hadoop02:

wget --no-check-certificate https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
mkdir /opt/bigdata/hive
tar -zxvf apache-hive-*.tar.gz -C /opt/bigdata/hive
cd /opt/bigdata/hive
ln -s apache-hive-* current

chown -R hadoop:hadoop /opt/bigdata/hive

cat >/etc/profile.d/hive_env.sh<<-eof
export HIVE_HOME=/opt/bigdata/hive/current
export PATH=\$PATH:\$HIVE_HOME/bin
eof

source /etc/profile
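
A quick sanity check that the environment took effect:

echo $HIVE_HOME
hive --version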

Hive Configuration

/opt/bigdata/hive/current/conf/hive-site.xml

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.2.243:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>

    <!-- for MySQL 8, use com.mysql.cj.jdbc.Driver instead -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop01:9083</value>
    </property>

    <!-- host that hiveserver2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop01</value>
    </property>

    <!-- port that hiveserver2 listens on -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
</configuration>

The MySQL JDBC driver (mysql-connector-java) is required.

Download: https://dev.mysql.com/downloads/connector/j/

Copy the jar into /opt/bigdata/hive/current/lib on both machines, hadoop01 (Hive DB) and hadoop02 (Hive client).
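
For example (the jar name below is illustrative; use whichever version you downloaded):

cp mysql-connector-java-8.0.28.jar /opt/bigdata/hive/current/lib/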

Usage

hadoop01 (Hive DB)

schematool -dbType mysql -initSchema # initialize the Hive metastore schema; creates the database and tables in MySQL
nohup hive --service metastore 1> /opt/bigdata/hive/current/metastore.log 2> /opt/bigdata/hive/current/metastore_err.log & # start the metastore service
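
Before moving on to the client, it can be worth confirming the metastore is actually up (a quick sketch; the metastore runs as a RunJar process and listens on the port from hive.metastore.uris):

jps | grep RunJar
ss -tnlp | grep 9083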

hadoop02 (Hive client)

Run:

$ hive
which: no hbase in (/opt/bigdata/java/current/bin:/opt/bigdata/java/current/bin:/opt/bigdata/java/default/bin:/opt/bigdata/java/current/bin:/opt/bigdata/java/current/bin:/opt/bigdata/java/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/bigdata/hadoop/current/bin:/opt/bigdata/hadoop/current/sbin:/root/bin:/bin:/opt/bigdata/hive/current/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/bigdata/hive/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/hadoop-3.3.2/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = be148795-b9a0-4a33-abce-b13042eb5d35

Logging initialized using configuration in jar:file:/opt/bigdata/hive/apache-hive-3.1.3-bin/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 0dc1e75e-3121-486e-ada9-81b7a693a979
hive> show databases;
OK
default
Time taken: 0.374 seconds, Fetched: 1 row(s)
hive> use default;
OK
Time taken: 0.03 seconds
hive> show tables;
OK
Time taken: 0.032 seconds
hive> exit;

Accessing Hive over JDBC

hiveserver2 impersonates the connecting user, so the hadoop user must be allowed to act as a proxy. This requires modifying /etc/hadoop/conf/core-site.xml.

Add:

<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>

Then restart HDFS, YARN, and the other affected services; after that, JDBC connections will work.
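
A minimal restart sketch, assuming the standard Hadoop sbin scripts are on the hadoop user's PATH:

stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh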

Beeline access requires hiveserver2 to be running:

nohup hive --service hiveserver2 1> /opt/bigdata/hive/current/hiveserver.log 2> /opt/bigdata/hive/current/hiveserver.err &
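
hiveserver2 can take a minute or two to initialize; before connecting, it may help to confirm port 10000 is listening and watch the log (a sketch):

ss -tnlp | grep 10000
tail -f /opt/bigdata/hive/current/hiveserver.log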

Connect with beeline:

$ bin/beeline -u jdbc:hive2://hadoop01:10000 -n hadoop

.......
Connecting to jdbc:hive2://hadoop01:10000
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://hadoop01:10000> show databases;

.............

Switching Hive to the Tez Engine

Building

The current Tez release is not compatible with Hadoop 3.3.2, so it has to be built manually.

Maven: https://maven.apache.org/download.cgi

Tez: https://tez.apache.org/

GitHub: https://github.com/apache/tez

Set up the environment first (the build downloads dependencies from the internet, so it is best to build on a separate machine and copy the finished packages over):

mkdir /opt/bigdata/maven
wget https://dlcdn.apache.org/maven/maven-3/3.8.5/binaries/apache-maven-3.8.5-bin.tar.gz --no-check-certificate

tar -zxvf apache-maven-3.8.5-bin.tar.gz -C /opt/bigdata/maven
cd /opt/bigdata/maven
ln -s apache-maven-3.8.5 current

cat >/etc/profile.d/maven_env.sh<<-eof
export MAVEN_HOME=/opt/bigdata/maven/current
export PATH=\$PATH:\$MAVEN_HOME/bin
eof
source /etc/profile
mvn -v

Compiling

protobuf-2.5.0 is required:

$ wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
$ tar zxvf protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ./configure
$ make
$ make install
$ protoc --version
libprotoc 2.5.0
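
If protoc fails with a shared-library error right after make install, the linker cache probably needs refreshing first:

sudo ldconfig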

The source also needs a few modifications:

wget https://github.com/apache/tez/archive/refs/tags/rel/release-0.10.1.tar.gz

tar zxvf release-0.10.1.tar.gz
cd tez-rel-release-0.10.1

Make two changes in pom.xml:

<hadoop.version>3.3.2</hadoop.version>

<!-- <module>tez-ui</module> -->
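
If you would rather script these two edits than make them by hand, a sketch assuming GNU sed (verify the result afterwards):

sed -i 's|<hadoop.version>.*</hadoop.version>|<hadoop.version>3.3.2</hadoop.version>|' pom.xml
sed -i 's|<module>tez-ui</module>|<!-- <module>tez-ui</module> -->|' pom.xml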

tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java

Comment out:
//import com.google.protobuf.ByteString;

and replace:
.setIdentifier(ByteString.copyFrom(jobToken.getIdentifier()))
.setPassword(ByteString.copyFrom(jobToken.getPassword()))
with:
.setIdentifier(TokenProto.getDefaultInstance().getIdentifier().copyFrom(jobToken.getIdentifier()))
.setPassword(TokenProto.getDefaultInstance().getPassword().copyFrom(jobToken.getPassword()))

tez-plugins/tez-aux-services/findbugs-exclude.xml

<FindBugsFilter>
    <Match>
        <Class name="org.apache.tez.auxservices.ShuffleHandler"/>
        <Method name="recordJobShuffleInfo"/>
        <Bug pattern="RV_RETURN_VALUE_IGNORED_NO_SIDE_EFFECT"/>
    </Match>
</FindBugsFilter>

After the changes, build:

mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true

Then copy the finished artifacts to the Hadoop cluster nodes:

tez-dist/target/tez-0.10.1-minimal.tar.gz
tez-dist/target/tez-0.10.1.tar.gz
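
One way to ship both tarballs to the cluster nodes (hostnames from the environment section; the destination path is illustrative):

for h in hadoop01 hadoop02 hadoop03; do
  scp tez-dist/target/tez-0.10.1-minimal.tar.gz tez-dist/target/tez-0.10.1.tar.gz $h:~/
done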

Install and Configure (run on all nodes)

Install

mkdir -p /opt/bigdata/tez/tez-0.10.1
tar zxvf tez-0.10.1-minimal.tar.gz -C /opt/bigdata/tez/tez-0.10.1
cd /opt/bigdata/tez
ln -s tez-0.10.1 current

chown -R hadoop:hadoop /opt/bigdata/tez

Edit /opt/bigdata/hive/current/conf/hive-env.sh:

export HIVE_HOME=/opt/bigdata/hive/current
export TEZ_HOME=/opt/bigdata/tez/current
export TEZ_JARS=""
for jar in `ls $TEZ_HOME |grep jar`; do
export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${TEZ_JARS}

Edit /opt/bigdata/hive/current/conf/hive-site.xml and add:

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>

Create /opt/bigdata/hive/current/conf/tez-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/tez/tez-0.10.1,${fs.defaultFS}/tez/tez-0.10.1/lib</value>
    </property>
    <property>
        <name>tez.lib.uris.classpath</name>
        <value>${fs.defaultFS}/tez/tez-0.10.1,${fs.defaultFS}/tez/tez-0.10.1/lib</value>
    </property>
    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>true</value>
    </property>
    <property>
        <name>tez.am.resource.memory.mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>tez.am.resource.cpu.vcores</name>
        <value>2</value>
    </property>
</configuration>

Upload the Files to HDFS (run on one node only)

mkdir ~/tez-0.10.1
tar zxvf tez-0.10.1.tar.gz -C ~/tez-0.10.1
cd ~
hadoop fs -mkdir /tez
hadoop fs -put tez-0.10.1 /tez
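
Then confirm the layout matches what tez.lib.uris expects:

hadoop fs -ls /tez/tez-0.10.1 | head
hadoop fs -ls /tez/tez-0.10.1/lib | head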

Validation

Generate a test file:

cat > /tmp/demo.txt <<-EOF
Linux Unix windows
hadoop Linux spark
hive hadoop Unix
MapReduce hadoop Linux hive
windows hadoop spark
EOF

Restart the metastore and hiveserver2, then run the test:

$ beeline -u jdbc:hive2://hadoop01:10000 -n hadoop
0: jdbc:hive2://hadoop01:10000> create table ad1 (id string, ip string,pt string) partitioned by (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
0: jdbc:hive2://hadoop01:10000> LOAD DATA LOCAL INPATH '/tmp/demo.txt' OVERWRITE INTO TABLE ad1 PARTITION (dt='2022-04-22');
0: jdbc:hive2://hadoop01:10000> set hive.execution.engine=mr;
No rows affected (0.029 seconds)
0: jdbc:hive2://hadoop01:10000> select count(1) from ad1;
+------+
| _c0 |
+------+
| 10 |
+------+
1 row selected (15.729 seconds)
0: jdbc:hive2://hadoop01:10000> set hive.execution.engine=tez;
No rows affected (0.003 seconds)
0: jdbc:hive2://hadoop01:10000> select count(1) from ad1;
+------+
| _c0 |
+------+
| 10 |
+------+
1 row selected (3.494 seconds)

As you can see, MR took 15.7 s while Tez took 3.4 s.

Problems Encountered

0: jdbc:hive2://hadoop01:10000> select count(1) from ad1;
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex's TaskResource is beyond the cluster container capability,Vertex=vertex_1652942203850_0014_1_00 [Map 1], Requested TaskResource=<memory:10240, vCores:1>, Cluster MaxContainerCapability=<memory:8192, vCores:4> (state=08S01,code=2)

Hive on Tez jobs fail because the requested memory exceeds what the cluster allows.

yarn.nodemanager.resource.memory-mb on the YARN side is set too low; either raise it, or lower hive.tez.container.size.
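
The permanent fix goes in yarn-site.xml on the YARN nodes, followed by a restart; the values below are illustrative and should match the machines' actual RAM (the MaxContainerCapability in the error comes from yarn.scheduler.maximum-allocation-mb):

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
</property>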

For now, change hive.tez.container.size temporarily to test:

0: jdbc:hive2://hadoop01:10000> set hive.tez.container.size=1024;
No rows affected (0.025 seconds)

Log Files

Hive logs to /tmp/<user>/hive.log by default (here /tmp/root/hive.log); the relevant defaults in hive-log4j2.properties are:

property.hive.log.level = INFO
property.hive.root.logger = DRFA
property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
property.hive.log.file = hive.log
property.hive.perflogger.log.level = INFO
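
These defaults come from the bundled template; to move the logs out of /tmp, one approach is to materialize the template and edit the directory (the target path below is just an example):

cd /opt/bigdata/hive/current/conf
cp hive-log4j2.properties.template hive-log4j2.properties
# then set: property.hive.log.dir = /opt/bigdata/hive/logs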