Contents
1. Requests library exercises
1.1 Searching keywords with Baidu / 360
1.2 Crawling an image and saving it locally
2. Information extraction for web crawlers: the Beautiful Soup library
2.1 Installing Beautiful Soup
2.2 Fetching page source with Beautiful Soup
2.3 Beautiful Soup usage pattern
2.4 Basic elements of Beautiful Soup
Beautiful Soup parsers
Basic elements of the BeautifulSoup class
2.5 Traversing HTML content with bs4
Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of the tag tree
Summary
2.6 Formatted HTML output with bs4
import requests
kv={'wd':'Python'}
r=requests.get("http://www.baidu.com/s",params=kv)
r.status_code
>>>200
r.request.url
>>>'http://www.baidu.com/s?wd=Python'
print(r.request.url)
>>>http://www.baidu.com/s?wd=Python
print(r.text[1000:2000])
When a request returns a very large page, printing all of r.text can make IDLE unresponsive, so it is best to slice it down to a limited range.
Always account for everything that could possibly go wrong.
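The advice to account for everything that could go wrong can be sketched as a defensive fetch helper. `safe_get` is a hypothetical name, not part of Requests; it adds the timeout and status check that the snippets above omit:

```python
import requests

def safe_get(url, params=None, timeout=10):
    """Fetch a URL defensively, returning the page text or None on failure."""
    try:
        r = requests.get(url, params=params, timeout=timeout)
        r.raise_for_status()               # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding   # guess the encoding from the content
        return r.text
    except requests.RequestException:      # covers timeouts, DNS errors, HTTP errors
        return None

# a host under the reserved .invalid TLD can never resolve, so this returns None
print(safe_get("http://example.invalid/"))
```

With this wrapper, the bare `try/except` blocks elsewhere in these notes collapse to a single `None` check.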
import requests
import os

root = 'E://pictures//'
url = 'https://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp03/1Z9211U233AA-0.jpg'
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url=url)
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")
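The save path above is built by concatenating `root` with the last URL segment; a small helper makes that rule explicit and uses `os.path.join` instead of string concatenation (`local_path` is a hypothetical name introduced here for illustration):

```python
import os

def local_path(root, url):
    """Derive a local save path from the last '/'-separated segment of a URL."""
    name = url.split('/')[-1]  # e.g. '1Z9211U233AA-0.jpg'
    return os.path.join(root, name)

url = 'https://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp03/1Z9211U233AA-0.jpg'
print(local_path('E:/pictures', url))
```

Note that for URLs with query strings a more robust version would parse the URL first; splitting on '/' happens to work here because the query value itself ends in the filename.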
pip install beautifulsoup4
A library for parsing HTML and XML documents.
from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the HTML parser that lets bs4 interpret the markup
print(soup.prettify())  # print the parsed source
Success: Beautiful Soup parses the demo page.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.title
>>> <title>This is a python demo page</title>
tag = soup.a  # returns only the first <a> tag
tag
>>> <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
soup.a.name
>>> 'a'
soup.a.parent.name
>>> 'p'
soup.a.parent.parent.name
>>> 'body'
tag = soup.a
tag.attrs
>>> {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
tag.attrs['href']
>>> 'http://www.icourse163.org/course/BIT-268001'
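The basic element types the outline lists (Tag, name, attrs, NavigableString, Comment) can all be seen on a small inline document, with no network access needed; the HTML string here is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>Hello</b><!-- a comment --></p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p
print(tag.name)      # the tag's name: 'p'
print(tag.attrs)     # attributes as a dict; 'class' values come back as a list
print(tag.b.string)  # the NavigableString inside <b>: 'Hello'

# HTML comments are parsed into a special string subclass, Comment
comment = tag.contents[1]
print(type(comment).__name__)
```

Because `Comment` is a subclass of `NavigableString`, code that extracts text should check the type when comments must be excluded.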
Downward traversal of the tag tree
soup.head.contents
>>> [<title>This is a python demo page</title>]
soup.body.contents
>>> ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
len(soup.body.contents)
>>> 5
soup.body.contents[1]
>>> <p class="title"><b>The demo python introduces several python courses.</b></p>
# the children can also be traversed with a loop
for child in soup.body.children:
    print(child)
Upward traversal of the tag tree
for parent in soup.a.parents:
    if parent is None:  # traversing upward reaches soup itself, whose own parent is empty, hence the check
        print(parent)
    else:
        print(parent.name)
>>>
p
body
html
[document]
Sibling traversal of the tag tree
soup.a.next_sibling
>>> ' and '
soup.a.next_sibling.next_sibling
>>> <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
for sibling in soup.a.previous_siblings:
    print(sibling)
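As the output above shows, sibling traversal yields bare strings (like ' and ') as well as tags. A common pattern is to keep only the tag siblings by checking the type; the two-link snippet below is a made-up stand-in for the demo page:

```python
from bs4 import BeautifulSoup, Tag

html = ('<p><a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# NavigableStrings are skipped; only Tag siblings survive the filter
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
print([t['id'] for t in tag_siblings])  # ['link2']
```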
print(soup.a.prettify())
>>> <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
soup.a.prettify()
>>> '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n</a>\n'