An Automatic Novel-Scraping Script in Python
Recently I suddenly remembered a novel I had left unfinished back in high school, and on a whim went looking for it online. Whether it was downloaded txt files or dedicated novel-reading apps, everything either carried ads or was missing chapters, which made for a poor reading experience. To fix that I took the do-it-yourself route, and this scraping script was born.
Key technologies (a minimal sketch of how they fit together follows this list):
- BeautifulSoup4: parses the HTML tags
- Requests: issues the HTTP requests
- Python3
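For anyone new to this stack, here is a minimal sketch of the fetch-then-parse pattern the script is built on (https://example.com is a placeholder, not the actual target site):

import requests
from bs4 import BeautifulSoup

# Download a page and parse its HTML with the lxml parser.
html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'lxml')

# find_all returns every matching tag; here, all links on the page.
for link in soup.find_all('a'):
    print(link.get('href'), link.string)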
Steps to use the script:
- Install BeautifulSoup4:
pip3 install beautifulsoup4
- Install requests:
pip3 install requests
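- Install lxml (not listed in the original write-up, but the code below calls BeautifulSoup(result, 'lxml'), so the lxml parser must be available):
pip3 install lxml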
- Save the following code as book.py:
import re
import sys
import time

import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# The site is requested with verify=False, so silence the TLS warnings.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


class BookSpider():
    '''Scrape novels from 顶点小说网 (www.dingdiann.com).'''

    def __init__(self):
        self.headers = {
            'Host': 'www.dingdiann.com',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
        }
        self.chapter_url_list = list()
        self.chapter_name_list = list()
        self.book_info = dict()

    def get_book_url(self, book_name, author_name):
        '''Search for the book and return its detail-page URL.'''
        url = 'https://www.dingdiann.com/searchbook.php'
        data = {
            'keyword': book_name
        }
        result = requests.get(url, headers=self.headers, params=data, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        book_name_list = soup.find_all(name='span', attrs={'class': 's2'})
        book_author_list = soup.find_all(name='span', attrs={'class': 's4'})
        # The first entry of each column is the table header, not a result.
        book_name_list.pop(0)
        book_author_list.pop(0)
        # Group the search results by author: {author: [(title, url), ...]}.
        for candidate_name in book_name_list:
            book_info_list = list()
            name = str(candidate_name.a.string)
            book_url = str(candidate_name.a.get('href'))
            book_info_tuple = (name, book_url)
            book_info_list.append(book_info_tuple)
            author = str(book_author_list[0].string)
            if author in self.book_info.keys():
                self.book_info[author].append(book_info_tuple)
                book_author_list.pop(0)
            else:
                self.book_info[author] = book_info_list
                book_author_list.pop(0)
        # A membership test instead of direct indexing avoids a KeyError
        # when the author does not appear in the search results at all.
        if author_name in self.book_info:
            for info in self.book_info[author_name]:
                if info[0] == book_name:
                    url = info[1]
                    print('Book found: ' + book_name + ' by ' + author_name)
                    print('Download will start in 3 s...')
                    time.sleep(3)
                    return url
        else:
            print('Sorry, the book was not found. Please check the title and author name.')

    def get_book_info(self, url):
        '''Collect the chapter names and chapter URLs of the book.'''
        all_url = 'https://www.dingdiann.com' + url
        result = requests.get(all_url, headers=self.headers, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        div = soup.find_all(id='list')[0]
        chapter_list = div.dl.contents
        for text in chapter_list:
            text = str(text)
            content = re.findall('<a href="' + url + '(.*?)" style="">(.*?)</a>.*?', text)
            if content:
                chapter_url = all_url + content[0][0]
                chapter_name = content[0][1]
                self.chapter_url_list.append(chapter_url)
                self.chapter_name_list.append(chapter_name)
        # The page shows the 12 most recent chapters above the full table
        # of contents; drop those duplicates.
        for i in range(12):
            self.chapter_url_list.pop(0)
            self.chapter_name_list.pop(0)

    def get_chapter_content(self, name, url):
        '''Fetch the text of a single chapter.'''
        try:
            result = requests.get(url, headers=self.headers, verify=False).text
        except requests.RequestException:
            print(name + ' failed to download.')
            return False
        else:
            soup = BeautifulSoup(result, 'lxml')
            div = soup.find_all(id='content')[0]
            div = str(div)
            result = re.findall('<div id="content">(.*?)<script>', div, re.S)[0].strip()
            result = re.sub('<br/>', '\n', result)
            return result

    def save_book(self, book_name):
        '''Save the novel to disk, retrying each chapter until it succeeds.'''
        for chapter_name in self.chapter_name_list:
            while True:
                chapter_content = self.get_chapter_content(chapter_name, self.chapter_url_list[0])
                if chapter_content:
                    with open(book_name + '.txt', 'a') as f:
                        f.write(chapter_name)
                        f.write('\n')
                        f.write(chapter_content)
                        f.write('\n')
                    self.chapter_url_list.pop(0)
                    print(chapter_name + ' downloaded.')
                    break

    def run(self, book_name, url):
        self.get_book_info(url)
        self.save_book(book_name)


def main(book_name, author_name):
    book = BookSpider()
    url = book.get_book_url(book_name, author_name)
    if url:  # get_book_url returns None when the search fails
        book.run(book_name, url)


if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
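The class can also be driven from Python directly instead of the command line; a minimal sketch, assuming the listing above was saved as book.py (this mirrors what main() does):

from book import BookSpider

spider = BookSpider()
detail_url = spider.get_book_url('天珠变', '唐家三少')
if detail_url:  # None means the search found nothing
    spider.run('天珠变', detail_url)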
- Usage: the script takes two arguments, the first being the novel's title and the second the author's name; the collected content is then saved locally. For example:
python3 book.py 天珠变 唐家三少
Enjoy!
Disclaimer
The novel data collected by this script comes from 顶点小说网 (https://www.dingdiann.com/). The script only performs data collection and does not sell anything. My thanks to the site's administrators for their generosity: I crawled the site many times and my IP was never banned. Please support the official, licensed releases.