decode_contents returns the contents of a tag as a Unicode string.
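Introduction
Note that the result is the tag's inner markup. The element's own opening and closing tags are left out, but the tags of any nested child elements are kept. If the element contains only text, the result is just that text.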
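To make the distinction concrete, here is a minimal sketch comparing decode_contents() with str() and get_text() on the same element. The variable names (inner_html, outer_html, plain_text) are illustrative only and are not part of the original examples.

```python
from bs4 import BeautifulSoup

html_content = '<div id="content"><p>测试01</p><span>测试02</span></div>'
soup = BeautifulSoup(html_content, 'html.parser')
content_div = soup.select_one("#content")

# Illustrative names; only the three calls below come from Beautiful Soup.
inner_html = content_div.decode_contents()  # child markup, without the outer <div>
outer_html = str(content_div)               # whole element, including the outer <div>
plain_text = content_div.get_text()         # text only, every tag stripped

print(inner_html)
print(outer_html)
print(plain_text)
```

If only the text is needed, get_text() is the better fit; decode_contents() is for cases where the nested markup should be preserved.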
Example 1: no nesting
Code:
from bs4 import BeautifulSoup
html_content = '''
<div id="content" data="你好">测试01</div>
<div>测试03</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
content_div = soup.select_one("#content")
print(content_div.decode_contents())
Output:
测试01
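As a side note on the "Unicode string" point, the sketch below checks the return type and compares it with encode_contents(), which returns bytes instead. The names as_text and as_bytes are illustrative, and the comparison assumes the default UTF-8 output encoding.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="content" data="你好">测试01</div>', 'html.parser')
content_div = soup.select_one("#content")

as_text = content_div.decode_contents()   # str (Unicode)
as_bytes = content_div.encode_contents()  # bytes, UTF-8 unless another encoding is passed

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode('utf-8') == as_text)
```

The data attribute does not appear in either result, because it belongs to the outer tag rather than to its contents.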
Example 2: with nesting
Code:
from bs4 import BeautifulSoup
html_content = '''
<div id="content" data="你好">
<p>测试01</p>
<span>测试02</span>
</div>
<div>测试03</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
content_div = soup.select_one("#content")
print(content_div.decode_contents())
Output:
<p>测试01</p>
<span>测试02</span>
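The string returned for the nested example also carries the line breaks that sit between the tags in the source HTML, which is easy to miss in printed output. A minimal sketch, assuming the same markup as Example 2, that makes them visible with repr() and trims them with strip():

```python
from bs4 import BeautifulSoup

html_content = '''
<div id="content" data="你好">
<p>测试01</p>
<span>测试02</span>
</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
content_div = soup.select_one("#content")

inner_html = content_div.decode_contents()
print(repr(inner_html))    # shows the newline before <p> and after </span>
print(inner_html.strip())  # same markup without the surrounding whitespace
```

Whether to strip the result depends on how it is used; for re-embedding the markup elsewhere, the surrounding whitespace is usually harmless.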