Preface
2020 got off to a rough start: the sudden death of a basketball icon and the rampage of the novel coronavirus cast a heavy shadow over what should have been a festive New Year. The government urged everyone to cut down on travel and stay home as their contribution to the country, so the long-imagined life of lying in bed with snacks and WiFi finally came true. But after lying around long enough, boredom sets in; some friends in my Moments feed have gone so far as to post step-by-step photos of how they peel little mandarin oranges, which is painful to watch. To kill some time, I figured I'd scrape the latest nationwide pneumonia infection data, do a little analysis, and draw a few charts to see how the outbreak is actually developing.
Collection Steps
Friends in my circle have all been sharing the Dingxiang Doctor (DXY) page https://3g.dxy.cn/newh5/view/pneumonia lately, so let's start from that URL.
First open the URL in a browser, hit F12 with practiced ease, and look at the requests:

Are there any asynchronous AJAX requests? No — apparently the developer didn't expose a separate API. So how does the data get onto the page? Opening the main page request and checking its response, we find the following:

It turns out the developer renders the JSON straight into the page with JS, which makes things even simpler: send a plain GET request, parse the script tag to pull out the content, and save the JSON to a database. With the approach settled, let's define the table structure:
```sql
create table sars_real_num(
    parent_id varchar(20),
    name varchar(20),
    confirmedCount integer,
    deadCount integer,
    suspectedCount integer,
    curedCount integer,
    update_time datetime
);
```
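For reference, each element of the parsed getAreaStat array has roughly this shape. The field names are exactly the ones the scraper below reads; the numbers here are invented placeholders, not real figures:

```python
# Illustrative only: field names match what the code below consumes,
# the values are placeholders.
province_example = {
    "provinceName": "湖北省",
    "confirmedCount": 0,
    "suspectedCount": 0,
    "curedCount": 0,
    "deadCount": 0,
    "cities": [
        {"cityName": "武汉", "confirmedCount": 0,
         "suspectedCount": 0, "curedCount": 0, "deadCount": 0},
    ],
}
```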
Then straight to the code:
run.py
```python
import requests
import json
import datetime
from pymysql import connect
from bs4 import BeautifulSoup

# Fetch the page. Note: headers was missing from the first draft of this
# script; a browser-like User-Agent keeps the request from being rejected.
url = 'https://3g.dxy.cn/newh5/view/pneumonia'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
}
result = requests.get(url, headers=headers)
update_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# The data lives as JSON inside the <script id="getAreaStat"> tag;
# slice off the JS wrapper around the array and parse what's left.
result.encoding = 'utf-8'
content = result.text
soup = BeautifulSoup(content, 'lxml')
script_tags = soup.find_all(id='getAreaStat')
r = script_tags[0].text[27:-11]
province_list = json.loads(r)

# Database connection (credentials redacted)
host = '*.*.*.*'
port = 3306
database = 'sars'
user = 'sars'
password = 'sars'
charset = 'utf8'
connection = connect(host=host, port=port, database=database,
                     user=user, password=password, charset=charset)
cursor = connection.cursor()
insert_sql = ('insert into sars_real_num(name, parent_id, confirmedCount, deadCount, '
              'suspectedCount, curedCount, update_time) '
              'values(%s, %s, %s, %s, %s, %s, %s)')

# One row per province, then one row per city with the province as parent
for province in province_list:
    params = [
        province['provinceName'],
        None,  # provinces have no parent; None becomes SQL NULL
        province['confirmedCount'],
        province['deadCount'],
        province['suspectedCount'],
        province['curedCount'],
        update_time,
    ]
    cursor.execute(insert_sql, params)
    for city in province['cities']:
        params = [
            city['cityName'],
            province['provinceName'],
            city['confirmedCount'],
            city['deadCount'],
            city['suspectedCount'],
            city['curedCount'],
            update_time,
        ]
        cursor.execute(insert_sql, params)
connection.commit()

cursor.close()
connection.close()
```
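One fragile spot: the hard-coded [27:-11] slice stops working the moment DXY changes the JS wrapper around the array. A slightly more tolerant sketch (still assuming the script body contains exactly one top-level JSON array) grabs everything between the first '[' and the last ']':

```python
import re
import json

def extract_area_stat(script_text):
    # Assumes a wrapper like "try { window.getAreaStat = [...] }catch(e){}"
    # and pulls out the widest [...] span instead of using fixed offsets.
    match = re.search(r'\[.*\]', script_text, re.S)
    if match is None:
        raise ValueError('getAreaStat payload not found')
    return json.loads(match.group(0))

# usage: province_list = extract_area_stat(script_tags[0].text)
```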
To keep the data fresh, we can use the schedule module to set up a timed task that collects once an hour:
get_sars_real_num.py
```python
import requests
import json
import datetime
import schedule
import time
from pymysql import connect
from bs4 import BeautifulSoup


def run():
    # Fetch the page with browser-like headers
    url = 'https://3g.dxy.cn/newh5/view/pneumonia'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
                  'image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding': 'gzip, deflate, br',
    }
    result = requests.get(url, headers=headers)
    update_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    # Extract the embedded JSON from the getAreaStat script tag
    result.encoding = 'utf-8'
    content = result.text
    soup = BeautifulSoup(content, 'lxml')
    script_tags = soup.find_all(id='getAreaStat')
    r = script_tags[0].text[27:-11]
    province_list = json.loads(r)

    # Database connection (credentials redacted)
    host = '*.*.*.*'
    port = 3306
    database = 'sars'
    user = 'sars'
    password = 'sars'
    charset = 'utf8'
    connection = connect(host=host, port=port, database=database,
                         user=user, password=password, charset=charset)
    cursor = connection.cursor()
    insert_sql = ('insert into sars_real_num(name, parent_id, confirmedCount, deadCount, '
                  'suspectedCount, curedCount, update_time) '
                  'values(%s, %s, %s, %s, %s, %s, %s)')

    # One row per province, then one row per city with the province as parent
    for province in province_list:
        params = [
            province['provinceName'],
            None,  # provinces have no parent; None becomes SQL NULL
            province['confirmedCount'],
            province['deadCount'],
            province['suspectedCount'],
            province['curedCount'],
            update_time,
        ]
        cursor.execute(insert_sql, params)
        for city in province['cities']:
            params = [
                city['cityName'],
                province['provinceName'],
                city['confirmedCount'],
                city['deadCount'],
                city['suspectedCount'],
                city['curedCount'],
                update_time,
            ]
            cursor.execute(insert_sql, params)
    connection.commit()
    cursor.close()
    connection.close()


if __name__ == "__main__":
    schedule.every(1).hours.do(run)
    while True:
        schedule.run_pending()
        time.sleep(1)
```
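One caveat with this loop: if a single fetch or database write raises, the exception propagates out of schedule.run_pending() and kills the whole process, so the hourly collection silently stops. A small wrapper (my addition, not part of the original script) logs the failure and lets the next run proceed:

```python
import traceback

def safe_run():
    try:
        run()
    except Exception:
        # Log the error and swallow it so the next scheduled run still fires
        traceback.print_exc()

# register safe_run instead of run:
# schedule.every(1).hours.do(safe_run)
```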
Set it up as a background job on a Linux server so the code keeps running on its own:
```bash
nohup python3 get_sars_real_num.py >> get_sars_real_num.log 2>&1 &
```
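To confirm the job survives the logout and to watch its output, the usual checks work:

```bash
ps -ef | grep get_sars_real_num   # is the process still alive?
tail -f get_sars_real_num.log     # follow what it prints
```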
In the backend database we can watch the data trickle in:
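If you want to spot-check what has landed without waiting for the charts, a query like this against the sars_real_num table defined above is enough:

```sql
select name, confirmedCount, deadCount, curedCount, update_time
from sars_real_num
order by update_time desc
limit 10;
```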

Simple Data Analysis
As a typical case, let's take Hubei province's data and chart it. For plotting we use matplotlib, Python's classic charting library; interested readers can learn its usage from the official site, so I won't belabor it here.
The overall flow: query the data -> plot it.
analysis.py
```python
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
from pymysql import connect

# Use a Chinese-capable font so axis labels and the legend render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (14.0, 10.0)

# Pull every snapshot recorded for Hubei province
host = '60.205.181.33'
port = 3306
database = 'sars'
user = 'sars'
password = 'sars'
charset = 'utf8'
connection = connect(host=host, port=port, database=database,
                     user=user, password=password, charset=charset)
cursor = connection.cursor()
select_sql = 'select * from sars_real_num where name = %s'
params = ['湖北省']  # Hubei province
count = cursor.execute(select_sql, params)  # execute returns the row count
results = cursor.fetchall()

# Split the rows into one time series per metric
confirmedCount = []
deadCount = []
suspectedCount = []
curedCount = []
update_time = []
for result in results:
    update_time.append(result[-1].strftime('%Y-%m-%d %H:%M:%S'))
    confirmedCount.append(result[2])
    deadCount.append(result[3])
    suspectedCount.append(result[4])
    curedCount.append(result[5])

# Plot the confirmed-case curve
plt.plot(update_time, confirmedCount, linewidth=3, linestyle='-.', label='确诊人数')  # confirmed
plt.title("湖北省肺炎确诊情况统计图", fontsize=20)  # Hubei confirmed-case chart
plt.tick_params(axis='both', labelsize=15)
plt.xticks(update_time, rotation=90)
y_major_locator = MultipleLocator(200)  # one y tick every 200 cases
ax = plt.gca()
ax.yaxis.set_major_locator(y_major_locator)
plt.ylim(0, 5000)
plt.legend()
plt.show()
```
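One practical wrinkle: every hourly snapshot becomes an x-axis label, so the axis grows unreadable as rows pile up. A small adjustment (step=6 is an arbitrary choice of mine) labels only every Nth timestamp; integer positions work here because matplotlib places categorical strings at positions 0..n-1:

```python
# Label only every 6th snapshot to keep the x axis readable
step = 6
positions = range(0, len(update_time), step)
labels = [update_time[i] for i in positions]
plt.xticks(positions, labels, rotation=90)
```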

```python
# Deaths, suspected, and cured cases share one chart on a finer y scale
plt.plot(update_time, deadCount, linewidth=3, linestyle='-.', label='死亡人数')       # deaths
plt.plot(update_time, suspectedCount, linewidth=3, linestyle='-.', label='疑似人数')  # suspected
plt.plot(update_time, curedCount, linewidth=3, linestyle='-.', label='治愈人数')      # cured
plt.title("湖北省肺炎死亡治愈情况统计图", fontsize=20)  # Hubei deaths/suspected/cured chart
plt.tick_params(axis='both', labelsize=15)
plt.xticks(update_time, rotation=90)
y_major_locator = MultipleLocator(10)  # one y tick every 10 people
ax = plt.gca()
ax.yaxis.set_major_locator(y_major_locator)
plt.ylim(0, 200)
plt.legend()
plt.show()
```
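Both snippets end in plt.show(), which needs a display; if you instead run the analysis on the headless server where the collector lives, writing the figure to a file works (the filename here is my placeholder):

```python
# Replace plt.show() with a file write when no display is attached
plt.savefig('hubei_trend.png', dpi=150, bbox_inches='tight')
plt.close()
```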

Conclusion
The DXY feed appears to update around 7 a.m., and Hubei's overall infection and death counts are still climbing steadily toward what looks like an outbreak peak. In these special times, travel cautiously, wear a mask when you go out, and help slow the spread; staying healthy is what matters most. I hope the country gets through this soon. This was just an idle post to pass the time, so please go easy on me. That's all from me — back to lying in bed~