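Step 1: Inspect the page source. This was a dead end: the source contains no download link for the audio. Some sites do expose one (Lizhi FM, for example), in which case you can parse it directly; if not, move on to the next step.

Step 2: Chrome DevTools. Open Chrome's developer tools and go through the network requests and responses triggered when the audio is played, one by one. The two screenshots above show the GET URL I captured and the JSON it returned. The response does contain three audio addresses; each can be pasted into the browser and played directly, and listening confirms it is the audio we are after.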

HTML parsing

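Looking at the JSON path, you can see a numeric ID in it. Comparing it against the page source reveals the pattern: every MP3 has its own sound_id. Extracting the sound_id gives the matching .json URL, and the audio file can then be downloaded from the address in that .json.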
<a class="title" href="/4228109/sound/64686514/" hashlink title="毛宗岗评_0-读三国志法-1">
<a class="title" href="/4228109/sound/64689648/" hashlink title="毛宗岗评_0-读三国志法-2">
<a class="title" href="/4228109/sound/64695831/" hashlink title="毛宗岗评_0-读三国志法-3">
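As a minimal sketch of this extraction step, assuming the album page still uses anchors of the form shown above, the sound_id and title can be pulled out with a regular expression and turned into .json URLs. The names `sample_html`, `ANCHOR_RE` and `extract_sounds` are illustrative and do not come from the original script.

```python
import re

# Sample markup copied from the anchors above; a real run would use the album page's HTML.
sample_html = '''
<a class="title" href="/4228109/sound/64686514/" hashlink title="毛宗岗评_0-读三国志法-1">
<a class="title" href="/4228109/sound/64689648/" hashlink title="毛宗岗评_0-读三国志法-2">
<a class="title" href="/4228109/sound/64695831/" hashlink title="毛宗岗评_0-读三国志法-3">
'''

# Two capture groups: the numeric sound_id and the title attribute.
ANCHOR_RE = re.compile(r'<a class="title" href="/\d+/sound/(\d+)/" hashlink title="(.*?)">')


def extract_sounds(html):
    """Return a list of (sound_id, title) tuples found in the page HTML."""
    return ANCHOR_RE.findall(html)


sounds = extract_sounds(sample_html)
sound_ids = [sound_id for sound_id, _ in sounds]
# Build the per-track JSON URL from each sound_id.
json_urls = ['http://www.ximalaya.com/tracks/' + sound_id + '.json' for sound_id in sound_ids]
```

Matching `\d+` for the user and track IDs keeps the pattern from being tied to a single album owner, which is a design choice of this sketch rather than something the page guarantees.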
# Import the requests module
import requests
# Import regular expressions
import re

# Get past basic anti-scraping checks by sending a browser User-Agent
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

# sound_ids taken from the three anchors shown above
sound_ids = ['64686514', '64689648', '64695831']


def get_find_url(text):
    # Match the audio URL
    reg = '"play_path_64":"(.*?)"'
    sound_url = re.findall(reg, text)  # list of audio URLs
    # print(sound_url)  # print the list of audio URLs
    reg = '"title":"(.*?)",'
    title_url = re.findall(reg, text)  # title_url is a list; title_url[0] is a str
    list_tuple = [(sound_url[0], title_url[0])]  # note the shape: a list holding one tuple
    return list_tuple


for i in sound_ids:
    # JSON URL of each track
    url = 'http://www.ximalaya.com/tracks/' + str(i) + '.json'
    html = requests.get(url, headers=header)  # fetch the JSON text
    # print(html.text)

    # Unpack the audio URL and the title
    for url_finall, title in get_find_url(html.text):
        # Decode the \uXXXX escapes, then drop a trailing '?',
        # otherwise open() can fail (for example on Windows)
        title_name = title.encode('utf-8').decode('unicode_escape')
        if title_name[-1] == '?':
            title_name = title_name[0:-1]
        print(title_name)
        m4a = requests.get(url_finall)
        # The last 4 characters of the URL, i.e. '.m4a', serve as the extension
        m4a_name = url_finall[-4:]
        # Save the audio to disk
        with open(title_name + m4a_name, 'wb') as f:
            f.write(m4a.content)
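One of the JSON paths is http://www.ximalaya.com/tracks/64689648.json. Printing with print(sound_url) outputs [('64686514', 'http://audio.xmcdn.com/group36/M00/74/9C/wKgJTVpFmu6Tq1DtAL4sx4J66PE455.m4a')]. Note that a list of (ID, URL) tuples like this comes from a pattern with two capture groups; the single-group "play_path_64" pattern returns a plain list of URL strings.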
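Since the response is JSON, the same two fields can also be read with the standard `json` module instead of regular expressions. The following is only a sketch: `sample_body` is a hypothetical, trimmed-down payload holding just the `title` and `play_path_64` keys that the script reads, and `parse_track` is an illustrative helper name.

```python
import json

# Hypothetical, trimmed-down response body: only the two fields the script reads.
# json.dumps escapes non-ASCII text as \uXXXX, similar to what the regex version has to decode.
sample_body = json.dumps({
    "title": "毛宗岗评_0-读三国志法-1",
    "play_path_64": "http://audio.xmcdn.com/group36/M00/74/9C/wKgJTVpFmu6Tq1DtAL4sx4J66PE455.m4a",
})


def parse_track(body):
    """Return (audio_url, title) from a track JSON document."""
    track = json.loads(body)  # json.loads also decodes the \uXXXX escapes in the title
    return track["play_path_64"], track["title"]


audio_url, title = parse_track(sample_body)
extension = audio_url[-4:]  # same slicing as the script: the last 4 characters of the URL
```

With `requests`, `response.json()` performs the same decoding on a real response, which removes the need for the `encode('utf-8').decode('unicode_escape')` step.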

Optimization 2

###2019005-01
import requests
import re
import time

# Fetch the page source, sending a browser User-Agent to get past basic anti-scraping checks
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
#html = requests.get('http://www.ximalaya.com/4228109/album/268522/',headers=header)
html = requests.get('http://www.ximalaya.com/83432108/album/8475135/',headers=header)
# print(html.text)
reg = '<a class="title" href="/83432108/sound/(.*?)/" hashlink title="(.*?)">'
name_url = re.findall(reg,html.text)
#print(type(name_url)) ###<class 'list'>
for sound_id,title in name_url:
    # print(sound_id,title)

    # JSON URL of each track
    json_url = 'http://www.ximalaya.com/tracks/'+str(sound_id)+'.json'
    # print(json_url)

    # Fetch the JSON and match the audio URL with a regex
    result = requests.get(json_url,headers=header)
    reg1 = '"play_path_64":"(.*?)"'

    # List of matched audio URLs
    sound_url = re.findall(reg1,result.text)
    # Print the list of audio URLs
    # print(sound_url) #['http://audio.xmcdn.com/group37/M05/7B/EB/wKgJoFroFaKRys2VANDG6MbiohU865.m4a']
    # print(title,sound_url[0])

    # Download the audio and save it under the track title
    data = requests.get(sound_url[0])
    with open(title+'.m4a','wb') as f:
        f.write(data.content)
    print('Download finished:',title)
    # Pause between tracks to avoid hammering the server
    time.sleep(1)
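print(name_url) shows that findall returns a list whose items look like ('85929831', 'No.88 文化与创新'). Each item lines up with the groups in the corresponding reg, and the same holds for sound_url, which contains whatever (.*?) matched. Is this a general rule?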
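This is the documented behaviour of `re.findall`: with no capture group it returns the whole matches, with one group it returns the text of that group, and with several groups it returns one tuple per match. A small sketch on a made-up sample string (the second ID and title are invented for illustration):

```python
import re

# Made-up sample text; only the first id/title pair appears in the notes above.
text = 'id=85929831 title=No.88; id=12345678 title=No.89;'

# No capture group: each item is the whole match.
whole = re.findall(r'id=\d+', text)
# ['id=85929831', 'id=12345678']

# One capture group: each item is the text of that group.
one_group = re.findall(r'id=(\d+)', text)
# ['85929831', '12345678']

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'id=(\d+) title=(.*?);', text)
# [('85929831', 'No.88'), ('12345678', 'No.89')]
```

This is why name_url can be unpacked as `for sound_id, title in name_url`, while sound_url has to be indexed with `sound_url[0]`.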

20180608

The site has changed how sound_id appears, so a new way of extracting the ID is needed. On inspection, I found where the ID is hidden in the HTML:

20180620 New program for downloading MP3s from Ximalaya


import re
import requests
import time

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
#html = requests.get('http://www.ximalaya.com/shangye/8475135/p4/',headers=header)
html = requests.get('https://www.ximalaya.com/renwen/3475911/p5/',headers=header)
# Group 1 ends right before ' href=', so it would include the closing quote of the
# title attribute; that quote is stripped further down. Group 2 is the sound_id.
reg = r'<a title="(.*?) href="/renwen/.*?/(.*?)">'
name_url = re.findall(reg,html.text)
for title,sound_id in name_url:
    json_url = 'http://www.ximalaya.com/tracks/'+sound_id+'.json'
    # Skip matches whose captured ID is too short to be a sound_id
    if len(sound_id) > 3:
        result = requests.get(json_url,headers=header)
        reg1 = '"play_path_64":"(.*?)"'
        sound_url = re.findall(reg1,result.text)
        # Drop a trailing '!', '?' or '"' before using the title as a file name
        if title[-1] == '!' or title[-1] == '?' or title[-1] == '"':
            title = title[0:-1]
        data = requests.get(sound_url[0])
        with open(title +'.m4a','wb') as f:
            f.write(data.content)
        print('Download finished:',title)
        time.sleep(1)
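The script strips only one trailing '!', '?' or '"' from the title before calling open(). A slightly more general sketch, assuming the goal is a file name that is also valid on Windows, replaces reserved characters wherever they occur. `safe_filename` is a hypothetical helper, not part of the original program.

```python
import re

# Characters that Windows does not allow in file names: \ / : * ? " < > |
_RESERVED = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement='_'):
    """Replace reserved characters, then trim surrounding spaces and trailing dots."""
    cleaned = _RESERVED.sub(replacement, title)
    cleaned = cleaned.strip().rstrip('. ')
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or 'untitled'
```

In the loop above it could be used as `open(safe_filename(title) + '.m4a', 'wb')`, so that a title with a '?' or '/' in the middle does not break the download.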