Web Crawler Notes 3

Published: 2022-06-26. Site: 脚本宝典

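I. Requests library practice
1. Searching a keyword on Baidu and 360
2. Downloading an image and saving it locally
II. Information extraction for web crawlers: learning the Beautiful Soup library
1. Installing Beautiful Soup
2. Fetching page source with Beautiful Soup
3. Basic Beautiful Soup usage pattern
4. Basic elements of Beautiful Soup
Beautiful Soup parsers
Basic elements of the BeautifulSoup class
5. Traversing HTML content with the bs4 library
Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of the tag tree
Summary
6. Formatted HTML output with the bs4 library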

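I. Requests library practice

raise_for_status(): does nothing when the response status code is 200; for a 4xx or 5xx status it raises requests.HTTPError.
Call it on every response before processing it, to confirm that the page was actually fetched.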
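The check described above can be wrapped in a small helper. This is a minimal sketch rather than code from the original notes: the name fetch_text and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails.

    fetch_text and the default timeout are illustrative assumptions.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx statuses
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs and HTTPError
        return None
```

Returning None keeps the calling code simple: a caller can test the result once instead of repeating the try/except around every request.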

1. Searching a keyword on Baidu and 360

Baidu keyword search: http://www.baidu.com/s?wd=keyword
360 keyword search: http://www.so.com/s?q=keyword

>>> import requests
>>> kv = {'wd': 'Python'}
>>> r = requests.get("http://www.baidu.com/s", params=kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> print(r.request.url)
http://www.baidu.com/s?wd=Python
>>> print(r.text[1000:2000])

When a page returns a lot of content, printing all of r.text can make IDLE unresponsive, so print only a slice of it.
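The 360 search works the same way, with q as the parameter name in place of wd. The sketch below only prepares the request, so the encoded URL can be inspected without sending anything; the variable names are illustrative.

```python
import requests

# 360 search takes the keyword in the "q" parameter
kv = {'q': 'Python'}

# Prepare the request without sending it, to inspect the encoded URL
prepared = requests.Request('GET', 'http://www.so.com/s', params=kv).prepare()
print(prepared.url)  # http://www.so.com/s?q=Python

# Sending it works as with Baidu:
# r = requests.get('http://www.so.com/s', params=kv, timeout=10)
# r.raise_for_status()
# print(len(r.text))
```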

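2. Downloading an image and saving it locally

Handle every case that can occur: the target directory may not exist, the file may already be there, and the request may fail.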

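import requests
import os

root = 'E:/pictures/'
url = 'https://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp03/1Z9211U233AA-0.jpg'
# Use the text after the last '/' in the URL as the file name
path = root + url.split('/')[-1]
try:
    # Create the target directory if it is missing
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url=url)
        r.raise_for_status()  # treat 4xx/5xx responses as failures
        # r.content is the body as bytes; the with block closes the file
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved successfully")
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print("Download failed")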
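For larger files the same steps can be written as a reusable function that streams the body in chunks instead of holding it all in memory. This is a hedged sketch: save_image, the chunk size and the timeout are assumptions, not part of the original notes.

```python
import os
import requests


def save_image(url, root):
    """Download url into root; return the saved path, or None on failure.

    save_image is a hypothetical helper; chunk size and timeout are
    illustrative values.
    """
    path = os.path.join(root, url.split('/')[-1])
    try:
        os.makedirs(root, exist_ok=True)  # no error if root already exists
        if os.path.exists(path):
            return path  # already downloaded, skip the request
        r = requests.get(url, stream=True, timeout=10)
        r.raise_for_status()
        with open(path, 'wb') as f:
            # Write the body piece by piece
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        return path
    except (requests.RequestException, OSError):
        return None
```

A call such as save_image(url, 'E:/pictures/') would then replace the script body above.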

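II. Information extraction for web crawlers: learning the Beautiful Soup library

1. Installing Beautiful Soup

pip install beautifulsoup4

Beautiful Soup is a library for parsing HTML and XML documents.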

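2. Fetching page source with Beautiful Soup

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # html.parser is the HTML parser that turns the text into a tag tree
print(soup.prettify())  # print the parsed source with indentation

If the indented HTML is printed, Beautiful Soup has parsed the demo page successfully.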

3. Basic Beautiful Soup usage pattern

from bs4 import BeautifulSoup

soup = BeautifulSoup('data', 'html.parser')  # 'data' stands for the HTML text to parse
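The first argument can be an HTML string or an open file handle. A small sketch with an inline document follows; the markup and the file name demo.html are made up for illustration.

```python
from bs4 import BeautifulSoup

# Parse HTML held in a string
soup = BeautifulSoup('<p class="title">data</p>', 'html.parser')
print(soup.p.string)  # data

# Parse HTML from an open file (demo.html is a hypothetical local file)
# with open('demo.html', encoding='utf-8') as f:
#     soup2 = BeautifulSoup(f, 'html.parser')
```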

4. Basic elements of Beautiful Soup

Beautiful Soup parsers
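The parser is selected by the second argument. The bs4 documentation names 'html.parser' (built into Python), 'lxml' and 'xml' (both need pip install lxml) and 'html5lib' (needs pip install html5lib). The sketch below prefers lxml and falls back to the built-in parser; make_soup is a hypothetical helper name, not something from the notes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
print(soup.p.string)  # hello
```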

Basic elements of the BeautifulSoup class

>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a  # returns only the first <a> tag
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.parent.name
'p'
>>> soup.a.name
'a'
>>> soup.a.parent.parent.name
'body'
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
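The same attributes can be tried offline on an inline snippet. This is an illustrative sketch: the markup and the example.com URL are made up, and .string (the text inside a tag) is a documented bs4 attribute that the session above does not show.

```python
from bs4 import BeautifulSoup

html = ('<p class="course">Learn '
        '<a class="py1" href="http://example.com/basic" id="link1">Basic Python</a></p>')
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a                 # first <a> tag in the document
print(tag.name)              # a
print(tag.parent.name)       # p
print(tag.attrs['href'])     # http://example.com/basic
print(tag.attrs['class'])    # ['py1'] (class can hold several values, so a list)
print(tag.string)            # Basic Python
print(type(tag).__name__)    # Tag
```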

5. Traversing HTML content with the bs4 library

Downward traversal of the tag tree

>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

# A loop can be used to iterate over the children
>>> for child in soup.body.children:
...     print(child)
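The difference between direct children and all descendants is easier to see on a small inline document. A minimal sketch, assuming made-up markup; .descendants is a documented bs4 attribute that the session above does not use.

```python
from bs4 import BeautifulSoup, Tag

html = '<body><p><b>Title</b></p><p>Text <a href="#">link</a></p></body>'
soup = BeautifulSoup(html, 'html.parser')

# .contents is a list of the direct children
print(len(soup.body.contents))  # 2

# .children iterates over the same direct children
print([child.name for child in soup.body.children])  # ['p', 'p']

# .descendants walks every node below, depth-first, including text nodes;
# keep only Tag objects here
names = [node.name for node in soup.body.descendants if isinstance(node, Tag)]
print(names)  # ['p', 'b', 'p', 'a']
```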

Upward traversal of the tag tree

>>> for parent in soup.a.parents:
...     if parent is None:  # .parents ends at the soup object itself, whose own parent is None, hence this check
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
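The same walk can be collected into a list, which also shows where the chain ends. A short sketch on made-up inline markup:

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# .parent is the direct parent; .parents walks up to the BeautifulSoup object
print(soup.a.parent.name)                # p
print([p.name for p in soup.a.parents])  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the root, so its own parent is None
print(soup.parent)                       # None
```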

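Sibling traversal of the tag tree

Sibling traversal happens among nodes that share the same parent.
The next node it returns is not necessarily a tag; it can be a text node.

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>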

Iterating over preceding siblings (loop)

for sibling in soup.a.previous_siblings:
    print(sibling)
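Because siblings can be text nodes, a loop that only wants tags has to filter them. A minimal sketch on made-up markup, using isinstance with bs4's Tag class as one way to do that:

```python
from bs4 import BeautifulSoup, Tag

html = '<p><a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
print(repr(first.next_sibling))  # ' and '

# Keep only Tag siblings, skipping text nodes
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t['id'] for t in tag_siblings])  # ['link2']

# The first child has no previous sibling
print(first.previous_sibling)  # None
```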

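Summary

Downward traversal uses .contents and .children, upward traversal uses .parent and .parents, and sibling traversal uses .next_sibling and .previous_siblings.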

6. Formatted HTML output with the bs4 library

>>> print(soup.prettify())  # prints the whole document, output omitted here
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
>>> soup.a.prettify()
'<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n</a>\n'
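prettify() returns an ordinary string with one node per line, while str() on a tag gives the compact form. A small sketch on made-up markup that compares the two:

```python
from bs4 import BeautifulSoup

html = '<p><a href="#">Basic Python</a></p>'
soup = BeautifulSoup(html, 'html.parser')

pretty = soup.a.prettify()  # a str with one node per line
for line in pretty.splitlines():
    print(repr(line))

# str(tag) gives the compact form, without added line breaks
print(str(soup.a))  # <a href="#">Basic Python</a>
```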
