Crawling Baidu Baike entries and writing them to a database

This script starts from one Baidu Baike entry, randomly picks a keyword link on that page, then repeats the same step from the chosen keyword's entry, saving each entry to the database along the way.
If an entry has no other keyword links, the crawler backtracks to the previous keyword. One known issue: if two entries each link only to each other, the crawler gets stuck bouncing between them (a possible workaround is sketched after the code).
In the database you can make the url column unique, so an entry that is visited again is not inserted twice.
For the database side, pymysql has to be installed first; a plain pip install will do. The random module ships with Python.
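The insert statement used below assumes a `urls` table with an auto-increment `id` plus `urlname` and `urlhref` columns; the original post does not show the schema, so here is a minimal sketch of creating it with pymysql, with a unique index on `urlhref` to enforce the uniqueness mentioned above (the column types and lengths are my assumption, adjust as needed):

import pymysql.cursors

connection = pymysql.connect(host='localhost',
                             user='root',
                             password='password',
                             db='baikeurl',
                             charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute(
            'create table if not exists `urls` ('
            '`id` int auto_increment primary key,'
            '`urlname` varchar(255) not null,'
            '`urlhref` varchar(255) not null,'
            'unique key (`urlhref`)'  # rejects a url that is already stored
            ')'
        )
    connection.commit()
finally:
    connection.close()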

The crawler code is as follows:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

import pymysql.cursors  # database driver


base_url = "https://baike.baidu.com"
his = ["/item/%E5%8F%B2%E8%AE%B0"]

for i in range(1000):
    # the history stores percent-encoded Chinese paths
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')

    print(i, soup.find('h1').get_text(), ' url: ', url)

    # find valid sub-entry urls: /item/ links whose path is percent-encoded
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found, backtrack to the previous keyword
        his.pop()

    # connect to the database
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='password',
                                 db='baikeurl',
                                 charset='utf8mb4',
                                 )

    try:
        # get a session cursor
        with connection.cursor() as cursor:
            # build the SQL statement
            sql = 'insert into `urls`(`urlname`,`urlhref`) values (%s,%s)'
            # execute it
            cursor.execute(sql, (soup.find('h1').get_text(), url))
            # commit
            connection.commit()
    except:
        # silently skip failed inserts, e.g. a duplicate url
        pass
    finally:
        connection.close()
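As noted at the top, two entries that link only to each other will bounce the crawler back and forth forever. One way to break the cycle, which is my own sketch rather than part of the original script, is to track every href already pushed onto the history and only sample from the unseen ones:

import random

def pick_next(sub_urls, his, visited):
    # keep only hrefs that have never been visited before
    candidates = [a['href'] for a in sub_urls if a['href'] not in visited]
    if candidates:
        nxt = random.choice(candidates)
        visited.add(nxt)
        his.append(nxt)
    elif len(his) > 1:
        # dead end, or everything here already seen: backtrack
        his.pop()

In the crawler, initialize visited = set(his) before the loop and call pick_next(sub_urls, his, visited) in place of the if/else block above.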

Reading from the database

import pymysql.cursors

connection = pymysql.connect(host='localhost',
                             user='root',
                             password='password',
                             db='baikeurl',
                             charset='utf8mb4',
                             )
try:
    # get a session cursor
    with connection.cursor() as cursor:
        # the query statement
        sql = 'select `urlname`, `urlhref` from `urls` where `id` is not null'
        # execute returns the number of rows matched
        count = cursor.execute(sql)
        print(count)

        # result = cursor.fetchall()
        # print(result)

finally:
    connection.close()
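Note that cursor.execute only returns the number of matched rows; to read the rows themselves, uncomment the fetchall lines above, or iterate over the result as in this small sketch against the same table:

import pymysql.cursors

connection = pymysql.connect(host='localhost',
                             user='root',
                             password='password',
                             db='baikeurl',
                             charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute('select `urlname`, `urlhref` from `urls`')
        # each row comes back as a (urlname, urlhref) tuple
        for urlname, urlhref in cursor.fetchall():
            print(urlname, urlhref)
finally:
    connection.close()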