Writing a Super Simple Crawler in Python to Download All the Free Songs from tobu.io

Overview

This post includes a short tutorial as well as an introduction to the related techniques. The sections are:

  • Code
  • Analysis of the tobu.io download flow
  • Key module handling
  • An introduction to related Python knowledge

Dependencies and Environment

  • Python 3 REQUIRED (I used 3.8.1 64-bit)

    • pip install AdvancedHTMLParser: extremely convenient HTML parsing; most of the DOM methods you would use in JavaScript have a similar implementation here (a quick taste follows this list)
    • pip install requests: HTTP requests; used here for the streaming download, alongside the standard library's cookiejar and opener
  • Windows 10 Home Chinese, Windows Powershell OPTIONAL
    • The terminal used must support ANSI escape codes
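
A quick taste of AdvancedHTMLParser, as promised above. This is a minimal sketch; the HTML snippet is invented for illustration, and the calls shown (parseStr, getElementsByClassName, innerHTML) are the same ones the script relies on.

    import AdvancedHTMLParser

    # The API deliberately mirrors the JavaScript DOM
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr("<div class='title'>Hope</div>")
    print(parser.getElementsByClassName('title')[0].innerHTML)  # -> Hope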

Code

import urllib.request
import http.cookiejar
import AdvancedHTMLParser
import requests
from contextlib import closing

class Song:
    def __init__(self):
        pass

    def parseTag(self, tag: AdvancedHTMLParser.AdvancedTag):
        # Parse one <div class='track'> element into the song's fields
        parser = AdvancedHTMLParser.AdvancedHTMLParser()
        parser.parseStr(tag.innerHTML)
        self.url = 'https:' + tag.getAttribute('href')
        self.title = parser.getElementsByClassName('title')[0].innerHTML
        self.artist = parser.getElementsByClassName('artist')[0].innerHTML
        self.date = parser.getElementsByClassName('date')[0].innerHTML.strip()
        self.isFree = len(parser.getElementsByClassName('free')) > 0
        self.artwork = parser.getElementsByTagName('img')[0].getAttribute('src')

    def getSong(self):
        # Step 0: build an opener that keeps cookies between requests
        cookiejar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookiejar))
        accept = http.cookiejar.Cookie(version=0, name='accept', value='1', port=None, port_specified=False, domain='.tobu.io', domain_specified=True, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1681201372, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)
        headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
        # Step 1: visit the download page (sets track_dl), then forge accept = 1
        response = opener.open(self.url + '/download')
        cookiejar.set_cookie(accept)
        # Step 2: go to the download page with the right cookies
        with closing(requests.get(url=self.url + '/download/mp3', headers=headers, cookies=cookiejar, stream=True)) as response:
            chunk_size = 1024  # maximum size per chunk
            content_size = int(response.headers['content-length'])  # total body size
            self.title = response.headers['content-disposition'].split(sep='"')[1]
            file_path = 'download/' + self.title
            data_count = 0
            with open(file_path, "wb+") as file:
                for data in response.iter_content(chunk_size=chunk_size):
                    file.write(data)
                    data_count = data_count + len(data)
                    now_jd = (data_count / content_size) * 100  # progress percentage
                    print('\033[K[\033[33mWORK\033[0m]' + ' ' + 'Downloading Song:' + " %d%%(%d/%d) - %s" % (now_jd, data_count, content_size, file_path), end="\r")

    def __str__(self):
        return self.title + ' ' + self.artist + ' ' + self.date + ' ' + self.isFree.__str__() + ' ' + self.artwork + ' ' + self.url

class TobuGrab:
    url = 'https://tobu.io'
    prefix = 'https:'
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}

    def __init__(self):
        pass

    def getPageList(self):
        request = urllib.request.Request(url=self.url, headers=self.headers)
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
        parser = AdvancedHTMLParser.AdvancedHTMLParser()
        parser.parseStr(html)
        # getAllNodes includes the root node; keep only tags that carry an href
        tagList = parser.getElementsByClassName('pg').getAllNodes().filterCollection(lambda tag: type(tag.getAttribute('href')) == type(str()))
        self.pageList = map(lambda node: node.getAttribute('href'), tagList.getAllNodes())

    def getSongList(self):
        songList = list()
        for page in self.pageList:
            request = urllib.request.Request(url=self.prefix + page, headers=self.headers)
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf-8')
            parser = AdvancedHTMLParser.AdvancedHTMLParser()
            parser.parseStr(html)
            tagList = parser.getElementsByClassName('track')
            for tag in tagList:
                newSong = Song()
                newSong.parseTag(tag)
                songList.append(newSong)
        self.songList = songList

# Load list
x = TobuGrab()
x.getPageList()
x.getSongList()
print('[\033[36mINFO\033[0m]' + ' ' + 'List Loaded')
# Download songs in the list
i = 1  # index starts at 1
for song in x.songList[i-1::]:
    print('[\033[36mINFO\033[0m]' + ' ' + 'Loading Song:' + ' ' + song.title, end="\r")
    if song.isFree:
        try:
            song.getSong()
        except:
            print('[%03d]\033[K[\033[31mFAIL\033[0m]' % (i) + ' ' + 'Bad Song:' + ' ' + song.title + ' ')
        else:
            print('[%03d]\033[K[\033[32mSUCC\033[0m]' % (i) + ' ' + 'Downloaded Song:' + ' ' + song.title + ' ')
    else:
        print('[%03d]\033[K[\033[33mSKIP\033[0m]' % (i) + ' ' + 'Paid Song:' + ' ' + song.title + ' ')
    i += 1
  • The code quality is pretty awful, because:
    • The author's skills are lacking
    • It was written with a "works is good enough" attitude; it's only ~100 lines anyway, so comments and logic were slapped together
    • The exception handling is a mess; even Ctrl+C isn't handled separately, so the script simply cannot be stopped (a sketch of the fix follows this list)
    • No multithreading: web access blocks outright, only one song downloads at a time, never mind multithreaded downloads
    • The OOP is a mess: everything that could go in __init__ is hard-coded, and members are never initialized
  • It WORKs though; as long as there are songs to listen to, that's all that matters
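
On the Ctrl+C point: a bare except: catches BaseException, which includes KeyboardInterrupt, so the download loop swallows the interrupt and keeps going. A minimal sketch of the fix, not the script's actual structure:

    try:
        song.getSong()
    except KeyboardInterrupt:
        raise  # let Ctrl+C propagate so the loop actually stops
    except Exception:
        print('download failed')  # placeholder; everything else counts as a bad song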

Download Flow Analysis

tobu.io Songs

The site's songs mainly fall into the following categories:

  • Hopeless: 5 tracks, paid songs; no download link, only links to other music stores
  • Automatic: 83 of 84 free singles (MP3), on-site page, Dropbox storage
    • The 1 dropped has a filename with special characters, which is infuriating; it is handled manually as well
  • Automatic: 4 free albums (ZIP), on-site page, Dropbox storage
  • Manual: 1 free single (MP3), NCS page

tobu.io Flow

Only the retrieval flow for the on-site page + Dropbox storage category is analyzed here.

  • Enter the site and page through the song list
  • Visit https://tobu.io/<track>/download to view the track's details
  • A series of click events eventually reaches https://tobu.io/<track>/download/mp3; while handling that request, the server redirects to Dropbox
  • The response comes back from Dropbox and contains the file to be downloaded

tobu.io Mechanisms

To prevent direct-link downloads, the site takes the following countermeasures (a sketch of the full bypass follows this list):

  • Cookies
    • Setting cookies
      • Visiting https://tobu.io/<track>/download sets a track_dl = <trackID> cookie
      • Visiting https://tobu.io/<track>/download#open-dl runs a series of click events, during which the <btn class='unlock'> button sets an accept = 1 cookie
    • Checking cookies: visiting https://tobu.io/<track>/download/mp3 checks both cookies above
      • track_dl must match <track>
      • accept = 1 must be present
      • Neither cookie may be expired, but whether the expiry date is sensible is not checked (when adding the cookie yourself, you can set an absurdly distant expiry so it practically never expires)
  • Redirect
    • The redirect to Dropbox happens server-side, so a direct link cannot be obtained
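
Putting the two cookie rules together, the whole bypass fits in a few lines. Below is a minimal sketch using requests.Session (the track URL is a hypothetical placeholder; the script in the Code section uses http.cookiejar instead):

    import requests

    session = requests.Session()
    track_url = 'https://tobu.io/some-track'  # hypothetical track
    # Visiting /download makes the server set track_dl = <trackID>
    session.get(track_url + '/download')
    # Forge the accept = 1 cookie the 'unlock' button would normally set
    session.cookies.set('accept', '1', domain='.tobu.io', path='/')
    # With both cookies present, /download/mp3 redirects to Dropbox
    response = session.get(track_url + '/download/mp3', stream=True)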

Crawler Design and Key Module Analysis

Another helping of the terrible code!

  1. Visit https://tobu.io; the <div class='pg'> element yields every page of the song list
    A simple crawl. Note that the getAllNodes method returns a list that includes the root node; the root node has to be thrown away by checking the href attribute, keeping only the nodes that actually carry a page URL.

    def getPageList(self):
        request = urllib.request.Request(url=self.url, headers=self.headers)
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
        parser = AdvancedHTMLParser.AdvancedHTMLParser()
        parser.parseStr(html)
        tagList = parser.getElementsByClassName('pg').getAllNodes().filterCollection(lambda tag: type(tag.getAttribute('href')) == type(str()))
        self.pageList = map(lambda node: node.getAttribute('href'), tagList.getAllNodes())
  2. Visit each page, https://tobu.io/page/<pg_no>; each <div class='track'> element is one track
    A simple crawl; parsing the element is delegated to the Song class's parseTag method

    def getSongList(self):
        songList = list()
        for page in self.pageList:
            request = urllib.request.Request(url=self.prefix + page, headers=self.headers)
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf-8')
            parser = AdvancedHTMLParser.AdvancedHTMLParser()
            parser.parseStr(html)
            tagList = parser.getElementsByClassName('track')
            for tag in tagList:
                newSong = Song()
                newSong.parseTag(tag)
                songList.append(newSong)
        self.songList = songList
    1. Parse a track: what mainly matters is the element's href attribute, which yields the track's detail page; the rest is unused
      def parseTag(self, tag: AdvancedHTMLParser.AdvancedTag):
          parser = AdvancedHTMLParser.AdvancedHTMLParser()
          parser.parseStr(tag.innerHTML)
          self.url = 'https:' + tag.getAttribute('href')
          self.title = parser.getElementsByClassName('title')[0].innerHTML
          self.artist = parser.getElementsByClassName('artist')[0].innerHTML
          self.date = parser.getElementsByClassName('date')[0].innerHTML.strip()
          self.isFree = len(parser.getElementsByClassName('free')) > 0
          self.artwork = parser.getElementsByTagName('img')[0].getAttribute('src')
  3. For each track <track>, visit its download page https://tobu.io/<track>/download/ to pick up the required cookie

    response = opener.open(self.url + '/download')
  4. Add a forged accept=1 cookie to bypass the click events
    Note here: set the cookie's expires well into the future; if downloads are repeatedly refused, raise expires (a timestamp sketch follows the snippet). This cookie had not expired as of 2020-02-09.

    accept = http.cookiejar.Cookie(version=0, name='accept', value='1', port=None, port_specified=False, domain='.tobu.io', domain_specified=True, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1681201372, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)
    cookiejar.set_cookie(accept)
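
    For reference, expires is a Unix timestamp (1681201372 is 2023-04-11 UTC). Instead of hard-coding one, a far-future value can be computed:

        import time
        far_future = int(time.time()) + 10 * 365 * 24 * 3600  # roughly 10 years out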
  5. Carrying the cookies above, visit https://tobu.io/<track>/download/mp3 and await the good news (the filename handling is illustrated after the snippet)
    The streaming-download pattern references Mr. chen~'s cnblogs post

    with closing(requests.get(url=self.url + '/download/mp3', headers=headers, cookies=cookiejar, stream=True)) as response:
        chunk_size = 1024  # maximum size per chunk
        content_size = int(response.headers['content-length'])  # total body size
        self.title = response.headers['content-disposition'].split(sep='"')[1]
        file_path = 'download/' + self.title
        data_count = 0
        with open(file_path, "wb+") as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
                data_count = data_count + len(data)
                now_jd = (data_count / content_size) * 100  # progress percentage
                print('\033[K[\033[33mWORK\033[0m]' + ' ' + 'Downloading Song:' + " %d%%(%d/%d) - %s" % (now_jd, data_count, content_size, file_path), end="\r")
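
    The filename above is pulled out of the Content-Disposition header; splitting on '"' grabs the quoted name. A standalone illustration with a made-up header value:

        # Hypothetical header value, for illustration only
        header = 'attachment; filename="Tobu - Hope.mp3"'
        print(header.split(sep='"')[1])  # -> Tobu - Hope.mp3
        # A name containing characters invalid on Windows (e.g. '?') would make
        # open() fail -- that is the one track handled manually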

Introduction to Related Knowledge

Written for my own future reference; leaving it empty for now.