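Crawling My Zhihu Collections and Turning Them into a Website

xgnic · Posted on 2018-2-20 23:04:58

When I'm bored I tend to scroll through Zhihu, but there isn't much new content of value; what there is plenty of is a steady influx of marketing accounts, promotions, and certain Zhihu Lives. So I figured I might as well browse my own collections instead. Many excellent answers are forgotten soon after I read them and then sit quietly in a collection, never opened again. I don't bookmark often, but a few years of it adds up, so browsing them can fill a fair amount of time, and I can dress it up as "reviewing the old to learn the new". Even after the front-end redesign, Zhihu's collections still don't feel very convenient to use. So I rolled up my sleeves and built my own.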

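Result

I used a Python crawler to fetch all of my collections, with Flask as the back-end API and Vue.js for the front-end display, keeping the front end and back end separate. The result looks like this:

Desktop screenshot 1

Desktop screenshot 2

Desktop screenshot 3

Mobile screenshot

Crawler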

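At first I assumed the many open-source Zhihu crawlers on GitHub would save me a lot of trouble. After looking around, though, most of the highly starred ones were no longer maintained and Zhihu had been redesigned since; there were a few newer projects, but their features were incomplete. So I had to write my own. The requirement is simple and clear, after all: collect the contents of all my collections. (Python 3 is used.)
For this requirement the crawler logic is simple. On a machine you use regularly, Zhihu lets you log in by POSTing the username and password directly, with no CAPTCHA. A requests.Session keeps the login state across requests. The crawler walks through the collection-list pages following the ?page=num rule in the URL, parses out the URL of every collection, then requests each collection in turn to get its list of questions and answers and parses out the relevant fields. Because there isn't much data, everything is simply saved as JSON files for convenience. For the same reason, single-threaded crawling with the requests library is enough.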
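The ?page=num walk described above can be sketched roughly as follows. This is a minimal illustration rather than the crawler itself: crawl_all_pages, fake_fetch, FAKE_SITE and the example.com URLs are made-up names and data, and the network request plus HTML parsing is replaced by a dictionary lookup so the loop can run on its own.

```python
from typing import Callable, List


def crawl_all_pages(base_url: str, fetch_items: Callable[[str], List[str]]) -> List[str]:
    """Request base_url?page=1, ?page=2, ... until a page yields no items."""
    collected = []
    page_num = 1
    while True:
        items = fetch_items('%s?page=%d' % (base_url, page_num))
        if not items:  # an empty page means we have walked past the last one
            break
        collected.extend(items)
        page_num += 1
    return collected


# Hypothetical stand-in for "request the page, then parse it".
FAKE_SITE = {
    'https://example.com/collections?page=1': ['collection A', 'collection B'],
    'https://example.com/collections?page=2': ['collection C'],
}


def fake_fetch(url: str) -> List[str]:
    return FAKE_SITE.get(url, [])


all_items = crawl_all_pages('https://example.com/collections', fake_fetch)
print(all_items)
```

Stopping at the first empty page avoids having to know the page count in advance, at the cost of one extra request per list.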
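The crawler code below produces two JSON files. 知乎收藏文章.json holds every collection together with the metadata of the Q&A entries under it; url_answer.json holds the answer text for every question. With this split, the front end can fetch the former first and then, when a particular answer is to be read, asynchronously request the latter for just that answer.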
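To make the split concrete, here is a small sketch of the two structures. The field names follow the crawler code, but every value is invented sample data, and find_answer is a hypothetical helper that only mimics the lookup the back end performs later.

```python
import json

# Shape of 知乎收藏文章.json: a list of collections, each with its Q&A metadata.
collections = [
    {
        'ctitle': 'Example collection',
        'clist': [
            {
                'title': 'Example question',
                'question_url': '/question/1',
                'answer_vote': '42',
                'answer_url': '/question/1/answer/2',
                'author': 'someone',
                'author_url': '/people/someone',
                'author_img': 'https://example.com/avatar.jpg',
            },
        ],
    },
]

# Shape of url_answer.json: answer URL (leading slash removed) -> answer HTML.
url_answer = {'question/1/answer/2': '<p>Example answer body</p>'}


def find_answer(answer_url: str) -> str:
    """Look up an answer body by its URL, dropping the leading slash as the crawler does."""
    return url_answer[answer_url[1:]]


first = collections[0]['clist'][0]
body = find_answer(first['answer_url'])
print(first['title'], '->', body)

# Both structures are plain lists and dicts, so they survive a JSON round trip.
assert json.loads(json.dumps(collections)) == collections
```

Keeping the bulky answer text out of the first file means the overview page only has to download titles, vote counts and avatars.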
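The requests_cache library is used as well. With only two lines of code, a run that restarts after an unexpected interruption pulls already-requested pages straight from the cache database, which saves time and spares me from writing my own handling for failed requests.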

import os
import json
from bs4 import BeautifulSoup
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# See http://stackoverflow.com/questio ... being-made-in-pytho
import requests_cache
requests_cache.install_cache('demo_cache')

Cookie_FilePlace = r'.'
Default_Header = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36",
    'Host': "www.zhihu.com",
    'Origin': "http://www.zhihu.com",
    'Pragma': "no-cache",
    'Referer': "http://www.zhihu.com/",
}
Zhihu_URL = 'https://www.zhihu.com'
Login_URL = Zhihu_URL + '/login/email'
Profile_URL = 'https://www.zhihu.com/settings/profile'
Collection_URL = 'https://www.zhihu.com/collection/%d'
Cookie_Name = 'cookies.json'

os.chdir(Cookie_FilePlace)

r = requests.Session()

#--------------------Prepare--------------------------------#
r.headers.update(Default_Header)
# Reuse cookies saved by an earlier successful login, if any.
if os.path.isfile(Cookie_Name):
    with open(Cookie_Name, 'r') as f:
        cookies = json.load(f)
    r.cookies.update(cookies)


def login(r):
    print('====== zhihu login =====')
    email = input('email: ')
    password = input("password: ")
    print('====== logging.... =====')
    data = {'email': email, 'password': password, 'remember_me': 'true'}
    value = r.post(Login_URL, data=data).json()
    print('====== result:', value['r'], '-', value['msg'])
    if int(value['r']) == 0:
        # Login succeeded: save the cookies for the next run.
        with open(Cookie_Name, 'w') as f:
            json.dump(r.cookies.get_dict(), f)


def isLogin(r):
    # The profile page redirects when the session is not logged in.
    url = Profile_URL
    value = r.get(url, allow_redirects=False, verify=False)
    status_code = int(value.status_code)
    if status_code == 301 or status_code == 302:
        print("未登录")
        return False
    elif status_code == 200:
        return True
    else:
        print(u"网络故障")
        return False


if not isLogin(r):
    login(r)

#---------------------------------------------------------------------#
# A separate dict mapping each answer URL to its answer text, so the back end
# can serve answers through its API. Filled in getQaDictListFromOneCollection.
url_answer_dict = {}


#-----------------------get collections-------------------------------#
def getCollectionsList():
    collections_list = []
    content = r.get(Profile_URL).content
    soup = BeautifulSoup(content, 'lxml')
    own_collections_url = 'http://' + soup.select('#js-url-preview')[0].text + '/collections'
    page_num = 0
    while True:
        page_num += 1
        url = own_collections_url + '?page=%d' % page_num
        content = r.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        data = soup.select_one('#data').attrs['data-state']
        collections_dict_raw = json.loads(data)['entities']['favlists'].values()
        if not collections_dict_raw:  # if len(collections_dict_raw) == 0:
            break
        for i in collections_dict_raw:
            # print(i['id'],' -- ', i['title'])
            collections_list.append({
                'title': i['title'],
                'url': Collection_URL % i['id'],
            })
    print('====== prepare Collections Done =====')
    return collections_list


#-------------------------
def getQaDictListFromOneCollection(collection_url='https://www.zhihu.com/collection/71534108'):
    qa_dict_list = []
    page_num = 0
    while True:
        page_num += 1
        url = collection_url + '?page=%d' % page_num
        content = r.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        titles = soup.select('.zm-item-title a')  # .text ; ['href']
        if len(titles) == 0:
            break
        votes = soup.select('.js-vote-count')  # .text
        answer_urls = soup.select('.toggle-expand')  # ['href']
        answers = soup.select('textarea')  # .text
        authors = soup.select('.author-link-line .author-link')  # .text ; ['href']
        for title, vote, answer_url, answer, author \
                in zip(titles, votes, answer_urls, answers, authors):
            author_img = getAuthorImage(author['href'])
            qa_dict_list.append({
                'title': title.text,
                'question_url': title['href'],
                'answer_vote': vote.text,
                'answer_url': answer_url['href'],
                # 'answer': answer.text,
                'author': author.text,
                'author_url': author['href'],
                'author_img': author_img,
            })
            # Key is the answer URL without its leading slash.
            url_answer_dict[answer_url['href'][1:]] = answer.text
            # print(title.text, ' - ', author.text)
    return qa_dict_list


def getAuthorImage(author_url):
    url = Zhihu_URL + author_url
    content = r.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    return soup.select_one('.AuthorInfo-avatar')['src']


def getAllQaDictList():
    '''The final result must be nested lists and dicts so the front end can parse it.'''
    all_qa_dict_list = []
    collections_list = getCollectionsList()
    for collection in collections_list:
        all_qa_dict_list.append({
            'ctitle': collection['title'],
            'clist': getQaDictListFromOneCollection(collection['url'])
        })
        print('====== getQa from %s Done =====' % collection['title'])
    return all_qa_dict_list


with open(u'知乎收藏文章.json', 'w', encoding='utf-8') as f:
    json.dump(getAllQaDictList(), f)

with open(u'url_answer.json', 'w', encoding='utf-8') as f:
    json.dump(url_answer_dict, f)

#---------------------utils------------------------------#
# with open('1.html', 'w', encoding='utf-8') as f:
#     f.write(soup.prettify())

# import os
# Cookie_FilePlace = r'.'
# os.chdir(Cookie_FilePlace)
# import json
# dict_ = {}
# with open(u'知乎收藏文章.json', 'r', encoding='utf-8') as f:
#     dict_ = json.load(f)

Front end

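The front-end requirements are modest: a single page, clean and good-looking, that makes it easy for me to find and browse questions and answers. On top of that, for someone like me who is hopeless at HTML and CSS and has to google even the JS for iterating over a list, it has to be simple to work with. I chose the Vue.js front-end framework (because it is simple; I didn't use webpack either).
The front end moves fast, and the sheer number of frameworks and tools is overwhelming. From my own experience, the first thing is not to be afraid: frameworks and tools exist to help us solve problems, that is, they let us develop more simply and more quickly. Many effective frameworks and tools aren't expensive to learn; once you have the basics, plus some open-source code, you can solve a lot of problems conveniently. Beyond that, collecting good tools really is an essential skill. Everyone faces similar difficulties, so someone may well have built a tool that addresses exactly your pain point.
The basic layout of the site comes from one of Bootstrap's basic templates, which saved a lot of trouble. Vue.js's component model lets me easily use all kinds of open-source UI components, snapping them together like building blocks to form the page. On awesome-vue I found iView, a UI framework that suits my taste and is simple to use. For now it only supports Vue 1.x, but since my application is simple the difference doesn't matter much, so iView it is.
The HTML code follows. It uses vue-resource to request data asynchronously and sync it to the page. For ease of development it uses JSONP cross-origin requests directly. Treat the code quality as a reference only. The template strings inside the components are hard to read; you can copy one out, remove the surrounding single quotes and the escaping of the inner single quotes, and view it with an HTML beautifier. Writing them this way was a stopgap.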

<!DOCTYPE html>
<html lang="zh-CN">
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>知乎个人收藏</title>
  <link rel="stylesheet" href="https://cdn.bootcss.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <link rel="stylesheet" href="http://v3.bootcss.com/examples/jumbotron-narrow/jumbotron-narrow.css">
  <link rel="stylesheet" type="text/css" href="http://unpkg.com/iview/dist/styles/iview.css">
</head>
<body>
  <div id="app">
    <div class="container">
      <div class="header clearfix">
        <h3 class="text-muted">知乎个人收藏</h3>
      </div>
      <div class="jumbotron">
        <h1>栏目总览</h1>
        <p class="lead">{{ description }}</p>
        <my-carousel></my-carousel>
      </div>
      <div class="row marketing">
        <div class="col-lg-6">
          <my-card :collection="collection" v-for="collection in left"></my-card>
        </div>
        <div class="col-lg-6">
          <my-card :collection="collection" v-for="collection in right"></my-card>
        </div>
      </div>
      <i-button @click="showLeave" long>That's all!</i-button>
      <Modal :visible.sync="visible" :title="modalTitle">
        {{ modalMessage }}
        <div v-html="rawHtml" id="inner-content"></div>
      </Modal>
      <footer class="footer">
        <p>© 2017 treelake.</p>
      </footer>
    </div>
  </div>
  <script type="text/javascript" src="http://v1.vuejs.org/js/vue.min.js"></script>
  <script src="https://cdn.jsdelivr.net/vue.resource/1.2.0/vue-resource.min.js"></script>
  <script type="text/javascript" src="http://unpkg.com/iview/dist/iview.min.js"></script>
  <script>
    Vue.component('my-carousel', {
      template: '<div class="showimage"><Carousel arrow="never" autoplay><Carousel-item><img src="https://n2-s.mafengwo.net/fl_progressive,q_mini/s10/M00/74/B6/wKgBZ1irpQ-Afw_uAAepw3nE8w884.jpeg"></Carousel-item><Carousel-item><img src="https://c4-q.mafengwo.net/s10/M0 ... gr2%2Finterlace%2F1"></Carousel-item></Carousel></div>'
    })

    Vue.component('my-ul', {
      template: '<ul id="list"><li v-for="item in items | limitBy limitNum limitFrom"><Badge :count="item.answer_vote" overflow-count="9999"> <a @click="simpleContent(item)" class="author-badge" :style="{ background: \'url(\'+ item.author_img +\') no-repeat\', backgroundSize:\'cover\'}"></a></Badge> <a :href=" \'https://www.zhihu.com\' + item.answer_url" target="_blank" style="font-size: 10px"> {{ item.title }}</a><a :href=" \'https://www.zhihu.com\' + item.question_url" target="_blank"><Icon type="chatbubbles"></Icon></a><hr> </li></ul>',
      props: ['items'],
      methods: {
        // Advance the visible window of limitNum items, wrapping to the start.
        changeLimit() {
          if (this.limitFrom > this.items.length - this.limitNum) {
            this.limitFrom = 0;
          } else {
            this.limitFrom += this.limitNum;
          }
          if (this.limitFrom == this.items.length) {
            this.limitFrom = 0
          }
          console.log(this.limitFrom)
        },
        simpleContent(msg) {
          // $dispatch() sends an event that bubbles up the parent chain
          this.$dispatch('child-msg', msg)
        },
      },
      data() {
        return {
          limitNum: 5,
          limitFrom: 0,
        }
      },
      events: {
        'parent-msg': function () {
          this.changeLimit()
        }
      },
    })

    Vue.component('my-card', {
      template: '<Card style="width:auto; margin-bottom:15px" ><p slot="title"><Icon type="ios-pricetags"></Icon>{{ collection.ctitle }}</p><a v-if="collection.clist.length>5" slot="extra" @click="notify"><Icon type="ios-loop-strong"></Icon>换一换</a> <my-ul :items="collection.clist"></my-ul> </Card>',
      props: ['collection'],
      methods: {
        notify: function () {
          // $broadcast() sends an event down to all descendants
          this.$broadcast('parent-msg')
        }
      }
    })

    var shuju, answer;
    new Vue({
      el: '#app',
      data: {
        description: '',
        visible: false,
        // ctitle: '',
        allqa: [],
        collection: {
          'clist': [],
          'ctitle': '',
        },
        left: [],
        right: [],
        modalMessage: '旧时光回忆完毕！',
        modalTitle: 'Welcome!',
        rawHtml: '<a href="https://treeinlake.github.io"> treelake </a>'
      },
      methods: {
        show() {
          this.visible = true;
        },
        showLeave() {
          this.rawHtml = '';
          this.modalMessage = '旧时光回忆完毕！';
          this.show();
        }
      },
      events: {
        'child-msg': function (msg) {
          // For single-file testing use http://localhost:5000/find
          this.$http.jsonp('/find' + msg.answer_url, {}, {
            headers: {},
            emulateJSON: true
          }).then(function (response) {
            // success callback
            answer = response.data;
            this.rawHtml = answer.answer;
          }, function (response) {
            // error callback
            console.log(response);
          });
          this.modalMessage = '';
          this.modalTitle = msg.title;
          this.show();
        }
      },
      ready: function () {
        // For single-file testing use http://localhost:5000/collections/
        this.$http.jsonp('/collections', {}, {
          headers: {},
          emulateJSON: true
        }).then(function (response) {
          // success callback
          shuju = response.data
          for (var i in shuju) {
            // shuju is an array, so read ctitle from each element
            this.description += (shuju[i].ctitle + ' ');
            // console.log(shuju)
          }
          // this.ctitle = shuju[0].ctitle
          // this.collection = shuju[0]
          this.allqa = shuju
          // Split the collections into two columns.
          var half = parseInt(shuju.length / 2) + 1
          this.left = shuju.slice(0, half)
          this.right = shuju.slice(half, shuju.length)
          console.log(this.collection)
        }, function (response) {
          // error callback
          console.log(response);
        });
      }
    })
  </script>
  <style>
    #list {
      padding: 10px
    }
    #list li {
      margin-bottom: 10px;
      padding-bottom: 10px;
    }
    .jumbotron img {
      width: 100%;
    }
    .author-badge {
      width: 38px;
      height: 38px;
      border-radius: 6px;
      display: inline-block;
    }
    #inner-content img {
      width: 100%;
    }
  </style>
</body>
</html>

Back end
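The back end mainly provides the API and uses Flask, which is clean and easy to use. Returning JSONP needs one more layer of wrapping, but the open-source world delivers: I found the Flask-Jsonpify library, which handles it in a single call. The main logic is to load the previously crawled data from local files and then serve it through the API. The /find/<path:answer_url> route looks up an answer's text by the answer's URL.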
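For readers unfamiliar with JSONP, the extra layer of wrapping amounts to roughly the following. This is a standard-library sketch of the idea only, not Flask-Jsonpify's implementation: to_jsonp, the callback name handle and the sample qa_dict entry are all assumptions made for illustration, and the library's exact output format may differ.

```python
import json


def to_jsonp(payload, callback=None):
    """Return a JSONP body if a callback name is given, else plain JSON."""
    body = json.dumps(payload)
    if callback:
        # JSONP: the JSON is wrapped in a call to a function the page defines.
        return '%s(%s);' % (callback, body)
    return body


# Hypothetical sample entry in the same shape as url_answer.json.
qa_dict = {'question/1/answer/2': '<p>Example answer body</p>'}


def find_answer(answer_url, callback=None):
    """Mimic the /find/<path:answer_url> lookup without a web framework."""
    return to_jsonp({'answer': qa_dict[answer_url]}, callback)


print(find_answer('question/1/answer/2', callback='handle'))
```

Because the response is loaded by the browser as a script, the page can receive data from another origin, which is why it was convenient during development.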
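Finally, I wanted Flask to serve the HTML file at the root path so that the site can be used on a phone by visiting the machine's IP address directly. To keep Flask's own template rendering from clashing with Vue.js's template syntax, the original HTML file is returned as-is, bypassing Flask's template rendering.
The server code follows. Put it in the same directory as the two files above and, once crawling has finished, start the service with python xxx.py. The code reads the page from zhihuCollection.html, so save the HTML above under that name.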

# -*- coding: utf-8 -*-
from flask import Flask
import json
from flask_jsonpify import jsonpify

app = Flask(__name__)

# Load the crawled data once at startup.
collections = []
with open(u'知乎收藏文章.json', 'r', encoding='utf-8') as f:
    collections = json.load(f)

qa_dict = {}
with open('url_answer.json', 'r', encoding='utf-8') as f:
    qa_dict = json.load(f)
# print(qa_dict['question/31116099/answer/116025931'])

# Read the page as plain text so Flask's template engine never touches it.
index_html = ''
with open('zhihuCollection.html', 'r', encoding='utf-8') as f:
    index_html = f.read()


@app.route('/')
def index():
    return index_html


@app.route('/collections')
def collectionsApi():
    return jsonpify(collections)


# The path converter lets answer_url contain slashes; see http://flask.pocoo.org/snippets/76/
@app.route('/find/<path:answer_url>')
def answersApi(answer_url):
    # Return the answer text stored under the given answer URL.
    return jsonpify({'answer': qa_dict[answer_url]})


@app.route('/test')
def test():
    # Dump the whole answer dict, for debugging.
    return jsonpify(qa_dict)


if __name__ == '__main__':
    app.run(host='0.0.0.0')

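Author: treelake
Link: https://www.jianshu.com/p/e1f039c8d945
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.

Article source: https://www.jianshu.com/p/e1f039c8d945?winzoom=1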
