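Because Python 3 makes such good glue, this old man still... I have fallen under a curse cast by the high-level mage Guido van Rossum: the Snake-Cut Rope (not really).

What you get for showing up undisguised

Here I use the requests library rather than the built-in urllib.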

import requests
r = requests.get('https://www.kawashiros.club/index.html')

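You may run into `ConnectionResetError: [WinError 10054]` (an existing connection was forcibly closed by the remote host), and sometimes several retries in a row do not fix it.

Perhaps the first hurdle for a crawler is a service like Cloudflare.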
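When a response does come back (a 403, say, rather than a reset connection), its headers can hint at whether Cloudflare sits in front of the site. The sketch below is only a heuristic: it assumes the site keeps the `Server: cloudflare` or `CF-RAY` headers that Cloudflare normally adds, and the helper name `looks_like_cloudflare` is made up for this example.

```python
def looks_like_cloudflare(headers):
    """Return True if the response headers suggest the site is served via Cloudflare."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").lower()
    return server == "cloudflare" or "cf-ray" in lowered


# Hypothetical header sets, made up for illustration.
print(looks_like_cloudflare({"Server": "cloudflare", "CF-RAY": "0000000000000000-XXX"}))
print(looks_like_cloudflare({"Server": "nginx"}))
```

With requests, the headers of a response `r` would be passed as `looks_like_cloudflare(r.headers)`.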

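Workaround: restart the router or modem (change the IP address)

If this is just for use at home, the method suits people whose home connection has no fixed public IP. When the modem restarts, the ISP assigns a new IP address. In other words, it does not work on a VPS, and it does not work if you have a public IP assigned.

It is also only a temporary fix for the awkward situation where your crawler hit the site too often from your home LAN and the server blacklisted you.

You can also try turning on your phone's hotspot and using mobile data.

Workaround: User-Agent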

import requests
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"}
# The keyword argument is `headers`, not `header`
r = requests.get('https://www.kawashiros.club/index.html', headers=header)

In general, spoofing the user agent is enough to fool Cloudflare and slip through. It is worth preparing several UAs.

import random

user_agent = [
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]


# Pick one of the prepared user agents at random
def random_ua():
    return random.choice(user_agent)

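For simple projects I do not recommend the fake-useragent library: it needs network access, and it is overkill.

Workaround: avoid simultaneous requests, and retry a few times on failure

Even with spoofed User-Agent headers prepared, you will sometimes still get a Connection Reset or a 403 Access Denied. First of all, avoid firing several requests at once, which can be judged an attack; add a suitable interval between them.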
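A minimal sketch of what "a suitable interval" could look like: requests are sent one after another, with a short randomised pause in between. The function name `fetch_sequentially`, the one-second base delay and the half-second jitter are assumptions chosen for illustration, not values any site documents.

```python
import random
import time


def fetch_sequentially(urls, fetch, base_delay=1.0, jitter=0.5, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause between requests; the jitter avoids a perfectly regular rhythm.
            sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results
```

Here `fetch` is any callable that takes a URL, for example `lambda url: requests.get(url, headers=header).text`. Passing `sleep` as a parameter only makes the helper easy to try out without actually waiting.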

Sometimes these Connection Resets also come down to luck, so betting everything on a single request is too risky.

import random
from time import sleep

import requests

user_agent = ['Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10']
# The remaining spoofed UAs are omitted here


def random_ua():
    return random.choice(user_agent)


# Send the request
# trial: number of retries already made
def send(trial: int = 0):
    try:
        sleep(1)
        r = requests.get('https://www.kawashiros.club/index.html', headers={"User-Agent": random_ua()})
        # Turn 4xx/5xx answers (e.g. 403) into exceptions so they are retried too
        r.raise_for_status()
    except Exception as e:
        print(str(e))
        # Three retries allowed
        if trial < 3:
            print("Retrying...")
            return send(trial + 1)
        else:
            print("Failed")
            return -1
    else:
        return r.text


print(send())
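The recursive `send` above can also be written as a loop that waits a little longer after each failure. This is a sketch of that variant rather than the code used in this post; the helper name `with_retries` and the doubling delay are assumptions.

```python
import time


def with_retries(action, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run action(); on failure retry up to `retries` more times, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception as error:
            if attempt == retries:
                raise  # out of retries: let the caller see the last error
            wait = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {wait:.0f}s")
            sleep(wait)
```

It would be used as `with_retries(lambda: requests.get(url, headers={"User-Agent": random_ua()}))`. Unlike `send`, it re-raises the last error instead of returning -1, so the caller decides what a final failure means.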

Workaround: use a proxy

This one seems to cost real money; try it only when you have no other choice.
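For reference, requests takes a proxy through its `proxies` argument, a mapping from URL scheme to proxy address. The sketch below only builds that mapping; the helper `build_proxies` is hypothetical and `127.0.0.1:8080` is a placeholder, not a working proxy.

```python
def build_proxies(host, port, scheme="http"):
    """Build the mapping that requests accepts through its `proxies` argument."""
    address = f"{scheme}://{host}:{port}"
    # The same proxy is used for both plain HTTP and HTTPS traffic.
    return {"http": address, "https": address}


# 127.0.0.1:8080 is a placeholder; substitute the address your proxy provider gives you.
proxies = build_proxies("127.0.0.1", 8080)
print(proxies)
# With requests installed, the mapping would be passed like this:
#   requests.get(url, headers=header, proxies=proxies, timeout=10)
```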

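The happy summer-vacation hours of keeping up with anime always get spoiled by certain scummy people. On Bilibili that is nothing out of the ordinary. Bilibili may dominate anime in China right now, but as the saying goes, "when the forest is big enough, it has every kind of bird." On top of that, a lot of anime that used to be free has now gone behind a paywall... So I spent half a day entering my watch list into Bangumi by hand, and I plan to make it my new base. From now on I watch anime on several platforms and keep the record on a single one. Isn't that lovely? It is more hassle, sure, but at least it beats hanging yourself on a single tree.

Scraping  the  pages

Naturally, hexo-bilibili-bangumi is no longer an option, and I could not find a similar plugin for hexo, so there was nothing for it but to roll my own. The developer documentation for the official Bangumi API has, for some reason, been neglected for two years, and I did not know where to start with it either, so scraping the pages it is.

During testing I ran straight into exactly the problems described above.

Got it working!

I hit all sorts of problems along the way, but in the end I got it working.

Source code

ver.202107120100: uses multithreading to fetch the anime covers, efficiency++;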
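The script below starts each cover download with `_thread.start_new_thread` and does not wait for the threads to finish. One possible alternative, not what this version does, is a small thread pool from `concurrent.futures` that blocks until every download has completed. In this sketch `download_all` is a hypothetical helper and `download` stands in for a function such as `picsave`.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download, max_workers=4):
    """Run download(url) for every URL on a small thread pool and wait for all of them."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(download, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:
                # One failed cover should not abort the others.
                results[url] = error
    return results
```

Keeping `max_workers` small also limits how many requests hit the image server at the same time.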

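It originally lived on my pitiful potato of a server, but because FRP kept dropping the connection, I simply changed it to work like hexo-bilibili-bangumi: before the pages are rendered and deployed, it scrapes on the spot and writes a fresh JSON file into a directory of the blog site, and git carries it off to the faraway hosting service.

Things to change:

  • output: the output directory;
  • username: your Bangumi username (string);
  • siteRootURL: "https://bangumi.tv/", leave it unchanged;
  • SaveLog: where the logs are saved (string).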
#!/usr/bin/python3
__version__ = 202107120100
##
## GetBangumi - a crawler for Bangumi (番组计划)
##
## (C) 2021 非科学のカッパ,License under MIT
##

##
## The MIT License (MIT)
## Copyright 2021 非科学のカッパ
##
## Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
##
## The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
##
## THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
##
import requests
from io import BytesIO
from lxml import etree
import sys
import os
import traceback
import json
import _thread
import signal
import time
from PIL import Image
import base64
import random
import platform

# Configuration -----------------------------
# Output directory for the JSON files
__output__: str = sys.path[0]
# Username
__username__: str = "bkryofu"
# Root URL of the main site
__siteRootURL__: str = "https://bangumi.tv"
# Log directory
__SaveLog__: str = os.path.join(sys.path[0], 'getbangumilog')
# -------------------------------------------

# "Macros" for the log levels (surely)
INFO = 0
WARN = 1
ERR = 2


# Signal handling
def signal_handler(signum, frame):
    log("Received Signal:" + str(signum) + ", Exit", WARN)
    os._exit(0)


ostype = platform.system()

# Register only the signals that exist on the current platform
if ostype == "Linux":
    signal.signal(signal.SIGHUP, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGQUIT, signal_handler)
    signal.signal(signal.SIGALRM, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGCONT, signal_handler)
elif ostype == "Windows":
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGABRT, signal_handler)
    signal.signal(signal.SIGFPE, signal_handler)
    signal.signal(signal.SIGILL, signal_handler)
    signal.signal(signal.SIGSEGV, signal_handler)
# Logging
# info: message text
# stat: log level (INFO, WARN or ERR)
def log(info: str, stat: int = 0):
    try:
        time.sleep(1)
        tmpTimeStruct = time.localtime(time.time())
        # [date, time], without zero padding
        timestat = [
            str(tmpTimeStruct.tm_year)
            + "-"
            + str(tmpTimeStruct.tm_mon)
            + "-"
            + str(tmpTimeStruct.tm_mday),
            str(tmpTimeStruct.tm_hour)
            + ":"
            + str(tmpTimeStruct.tm_min)
            + ":"
            + str(tmpTimeStruct.tm_sec),
        ]

        sign = ["[i]", "<!>", "(x)"]

        # ANSI colours for console output: green, yellow, red
        consOutMark = ["\033[;32m", "\033[;33m", "\033[;31m"]
        print(
            "["
            + timestat[0]
            + " "
            + timestat[1]
            + "]"
            + consOutMark[stat]
            + sign[stat]
            + info
            + "\033[;0m"
        )

        # One log file per day; `with` makes sure the file is closed again
        logpath = os.path.join(__SaveLog__, "LOG-" + timestat[0] + ".log")
        with open(logpath, "a+", encoding="utf-8") as logfile:
            logfile.write(
                "[" + timestat[0] + " " + timestat[1] + "]" + sign[stat] + info + "\n"
            )
    except Exception:
        print("Cannot write the log! Please check the log path and its write permission.")
        sys.exit(-1)


# bgmtype[ptype]: wish list, watched, watching, on hold, dropped
bgmtype = ["wish", "collect", "do", "on_hold", "dropped"]

# Desktop user agents
user_agent_pc = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    # Firefox
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    # Opera
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    # QQ Browser
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
    # Sogou Browser
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    # 360 Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    # UC Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]
# Mobile user agents
user_agent_phone = [
    # iPhone
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # iPad
    'Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # Android
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # QQ Browser for Android
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # Android Opera Mobile
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    # Android Pad Moto Xoom
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
]


def get_user_agent_pc():
    return random.choice(user_agent_pc)


def get_user_agent_phone():
    return random.choice(user_agent_phone)

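# getpage: fetch one page of the list
# user: username
# ptype: category (index into bgmtype)
# pager: page number
# retryturn: number of retries already made
# Returns the page text, 0 for a 404, or -1 for any other HTTP error
def getpage(user: str, ptype: int, pager: int, retryturn: int = 0):
    # Built outside the try block so the error handlers can always refer to it
    url = (
        __siteRootURL__
        + "/anime/list/"
        + user
        + "/"
        + bgmtype[ptype]
        + "?page="
        + str(pager)
    )
    try:
        r = requests.get(
            url,
            headers={
                "User-Agent": get_user_agent_pc()
            },
        )
        # requests only raises HTTPError when asked to
        r.raise_for_status()

    except requests.HTTPError as e:
        status = e.response.status_code
        if status == 404:
            return 0
        else:
            log("Fetch: " + url + " - failed: HTTP status code: " + str(status), ERR)
            return -1
    except Exception as e:
        log("Unknown error: " + str(e), ERR)
        if retryturn >= 10:
            log("Results not updated!", WARN)
            sys.exit(-1)
        else:
            log("Retrying: girls are now praying...")
            return getpage(user, ptype, pager, retryturn=retryturn + 1)
    else:
        log(
            "Fetched  user: "
            + user
            + " category: "
            + bgmtype[ptype]
            + " page "
            + str(pager)
            + " - done"
        )
        r.encoding = "utf-8"
        text = r.text
        r.close()
        return text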


# Parse a page
# ptype: index into bgmtype
# pager: page number
def parse_page(ptype: int, pager: int):
    text = getpage(__username__, ptype, pager)
    # An error occurred while fetching the page
    if not isinstance(text, str) and text <= -1:
        log("Results not updated!", WARN)
        sys.exit(text)
    # Empty page
    elif not isinstance(text, str) and text == 0:
        log("Empty page", WARN)
        return []
    else:
        # Parse the page
        log(
            "Parsing page  user: "
            + __username__
            + " category: "
            + bgmtype[ptype]
            + " page "
            + str(pager)
            + " - started"
        )

    try:
        html = etree.HTML(text)
        bgmul = html.xpath('//ul[@id="browserItemList"]')
        l = etree.tostring(bgmul[0], encoding="utf-8")
        bglhtml = etree.HTML(l.decode("utf-8")).xpath("//li")
    # Raised when the fetched page has no list element with id 'browserItemList'
    except IndexError:
        log("Parse error: list not found!", WARN)
        return []
    except Exception as e:
        log("Unknown error: " + str(e), ERR)
        log("Results not updated!", WARN)
        sys.exit(-1)

    log(
        "Parsing page  user: "
        + __username__
        + " category: "
        + bgmtype[ptype]
        + " page "
        + str(pager)
        + " - done"
    )

    result: list = []
    for a in range(len(bglhtml)):

        # Re-parse each <li> on its own so the XPath queries below only see this entry
        r = etree.tostring(bglhtml[a], encoding="utf-8").decode("utf-8")
        rh = etree.HTML(r)

        # getfh: simple handling for elements that are missing from the page
        # elementOfXpath: XPath of the element
        # attribute: attribute name; if empty, the element's text is returned
        def getfh(elementOfXpath: str, attribute: str = ""):
            try:
                node = rh.xpath(elementOfXpath)[0]
            except IndexError:
                return ""
            if len(attribute) == 0:
                value = node.text
            else:
                value = node.get(attribute)
            # .text and .get() may return None; normalise that to ""
            return value or ""

        info_tip = getfh('//li/div[@class="inner"]/p[@class="info tip"]').replace(
            " ", ""
        )

        # Strip the extra whitespace from info_tip
        info_tip = info_tip.replace("\n", "")
        info_tip = info_tip.replace("/", " / ")

        # Save the cover image as WebP
        def picsave(url: str, retryturn: int = 0):
            # Download the image
            try:
                log("Saving image " + url + " - started")
                filename = url.split('/')[-1].split('.')[0] + '.webp'
                r = requests.get(url, headers={"User-Agent": get_user_agent_phone()})
                img = Image.open(BytesIO(r.content))
                r.close()
                imgsavepath = os.path.join(__output__, "images", "bangumi", filename)
                img.save(imgsavepath)

            except Exception as e:
                log("Unknown error: " + str(e), ERR)
                if retryturn >= 10:
                    log("Failed to save image " + url + "!", WARN)
                else:
                    log("Retrying: girls are now praying...")
                    return picsave(url, retryturn=retryturn + 1)

            else:
                log("Saving image " + url + " - done")

        coverurl = 'https:' + getfh('//li/a/span[@class="image"]/img', 'src')
        # Download the cover in a separate thread
        _thread.start_new_thread(picsave, (coverurl,))
        coversrc = 'https://img.bkryofu.xyz/bangumi/' + coverurl.split('/')[-1].split('.')[0] + '.webp'

        bgm: dict = {
            "title": getfh('//li/div[@class="inner"]/h3/a[@class="l"]'),
            "subtitle": getfh('//li/div[@class="inner"]/h3/small[@class="grey"]'),
            "cover": coversrc,
            "info_tip": info_tip,
            "collect": {
                "date": getfh(
                    '//li/div[@class="inner"]/p[@class="collectInfo"]/span[@class="tip_j"]'
                ),
                "tags": getfh(
                    '//li/div[@class="inner"]/p[@class="collectInfo"]/span[@class="tip"]'
                ),
            },
            "url": __siteRootURL__ + getfh("//li/a", "href"),
        }

        result.append(bgm)

    return result


if __name__ == "__main__":
    log("GetBangumi Version " + str(__version__))
    log("Girls are now praying...")

    # All scraped data
    alldat: dict = {}

    # Collect every category page by page until an empty page comes back
    for a in range(len(bgmtype)):
        alldat[bgmtype[a]] = []
        page = 1
        endpage: bool = False
        while not endpage:
            r = parse_page(a, page)
            page += 1
            if len(r) == 0:
                endpage = True
                log(
                    "Finished scraping  user: " + __username__ + " category: " + bgmtype[a] + " pages: " + str(page - 1)
                )
            else:
                alldat[bgmtype[a]].append(r)

    # Date without zero padding, e.g. 2021-7-12. Stripping every '0' from
    # strftime output would turn month 10 into 1, so use the struct fields.
    now = time.localtime()
    alldat['update_time'] = (
        str(now.tm_year) + "-" + str(now.tm_mon) + "-" + str(now.tm_mday)
        + time.strftime(' %H:%M:%S', now)
    )

    log("Writing list to " + os.path.join(__output__, "bangumi.json"))
    try:
        outpath = os.path.join(__output__, "bangumi.json")
        # Write as a JSON file
        with open(outpath, "w", encoding="utf-8") as f:
            json.dump(alldat, f, ensure_ascii=False)
        with open(outpath, 'rb') as f:
            bgmjson = json.loads(f.read().decode('utf-8'))

        # Per category: [number of pages, number of items on the last page]
        bgmpageinfo: dict = {}
        for name in bgmtype:
            pages = bgmjson[name]
            # An empty category has no last page to measure
            bgmpageinfo[name] = [len(pages), len(pages[-1]) if pages else 0]

        pageinfopath = os.path.join(__output__, "bgmpage.json")
        with open(pageinfopath, 'w', encoding='utf-8') as f:
            json.dump(bgmpageinfo, f, ensure_ascii=False)
        log("Writing page info to " + pageinfopath)
    except Exception as e:
        log("Unknown error: " + str(e), ERR)
        log("Results not updated!", WARN)
        sys.exit(-1)
    else:
        log("Done")

Each run generates bangumi.json and bgmpage.json.

Rough structure of bangumi.json:

Rough content of bgmpage.json:
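Going by the script above, bangumi.json maps each category to a list of pages, each page being a list of entries, plus an `update_time` string, while bgmpage.json maps each category to `[number of pages, items on the last page]`. The sketch below rebuilds the second shape from the first. All values in `sample` are placeholders made up for illustration, the helper `page_info` is hypothetical, and reporting an empty category as `[0, 0]` is a choice of this sketch.

```python
import json


def page_info(alldat, categories):
    """Summarise each category as [number of pages, items on the last page]."""
    info = {}
    for name in categories:
        pages = alldat.get(name, [])
        info[name] = [len(pages), len(pages[-1]) if pages else 0]
    return info


# Placeholder data: the field names follow the script, the values are made up.
sample = {
    "wish": [[
        {
            "title": "Example title",
            "subtitle": "Example subtitle",
            "cover": "https://example.com/cover.webp",
            "info_tip": "12 episodes / 2021",
            "collect": {"date": "2021-7-12", "tags": "example"},
            "url": "https://bangumi.tv/subject/0",
        }
    ]],
    "collect": [],
    "update_time": "2021-7-12 01:00:00",
}

print(json.dumps(page_info(sample, ["wish", "collect"])))
# prints {"wish": [1, 1], "collect": [0, 0]}
```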

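bgmpage.json exists because the iframe tag needs it: the iframe is resized automatically according to the number of items, and the pager is driven by it as well. For details, see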