IT Learning

【Python】Not getting all rows with BeautifulSoup

Overview

When scraping a table with Beautiful Soup, I ran into a problem where only some of the table's rows were returned instead of all of them. This post shows how I fixed it.

Environment

Python 3.7.3

Example of the problem

This is an example of the table I wanted to scrape.

Number  Name
1       Sato
2       Kato
3       Ito
4       Goto

I inspected the page's HTML with the browser's developer tools (F12 key).

<table class="test_table">
<thead>
<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
    <tr>
    <td>1</td>
    <td>Sato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>2</td>
    <td>Kato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>3</td>
    <td>Ito</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>4</td>
    <td>Goto</td>
    </tr>
</tbody>
</table>
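
One thing worth checking at this point: the F12 developer tools show the page after the browser has parsed and repaired it, so it can differ from the raw HTML that requests downloads. The following is a minimal sketch, using only the standard library, that counts the `<tr>` start tags in the raw text. The helper name `count_tr_tags` and the shortened `sample` string are made up for this illustration; in practice the input would be `r.text`.

```python
import re


def count_tr_tags(raw_html: str) -> int:
    """Count <tr> start tags in raw HTML text without using any HTML parser."""
    # Require whitespace or ">" after "<tr" so that tags such as <track> are not counted.
    return len(re.findall(r"<tr[\s>]", raw_html, flags=re.IGNORECASE))


# A shortened stand-in for the downloaded page (hypothetical data).
sample = (
    "<table><tbody><tr><td>1</td></tr></tbody>"
    "<tbody><tr><td>2</td></tr></tbody></table>"
)
print(count_tr_tags(sample))
```

If this count is already lower than the number of rows shown in the browser, the missing rows are probably not in the downloaded HTML at all (for example, they may be added by JavaScript), and changing the parser is unlikely to help. If the count is correct but Beautiful Soup returns fewer rows, the parser is the more likely cause.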

When I scraped the table with the following BeautifulSoup code, only some of the rows were returned, and I didn't know why.

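import requests
from bs4 import BeautifulSoup

# Placeholder request headers; the values actually used are not shown in this post.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('http://example.com', headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Take the first table whose class is "test_table".
table = soup.find_all('table', {'class': "test_table"})[0]

# Print every <tr> that the parser found inside the table.
rows = table.find_all('tr')
for row in rows:
    print(row)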

Result

<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
    <tr>
    <td>1</td>
    <td>Sato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>2</td>
    <td>Kato</td>
    </tr>
</tbody>

How to fix it

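I switched the Beautiful Soup parser from html.parser to lxml, and then it worked. Note that lxml is a separate package and has to be installed first, for example with pip install lxml.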

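import requests
from bs4 import BeautifulSoup

# Placeholder request headers; the values actually used are not shown in this post.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('http://example.com', headers=headers)
# The only change from the failing version: parse with lxml instead of html.parser.
soup = BeautifulSoup(r.content, "lxml")

# Take the first table whose class is "test_table".
table = soup.find_all('table', {'class': "test_table"})[0]

# Print every <tr> that the parser found inside the table.
rows = table.find_all('tr')
for row in rows:
    print(row)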
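
The code above prints whole `<tr>` elements. If only the cell text is needed, something like the following sketch might be a starting point. The function name `table_to_rows` and its `parser` argument are assumptions made for this example and are not part of Beautiful Soup. The default of "lxml" follows the fix described here and requires lxml to be installed.

```python
from bs4 import BeautifulSoup


def table_to_rows(html, parser="lxml"):
    """Return the first table with class "test_table" as a list of rows,
    where each row is a list of cell strings."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", class_="test_table")
    if table is None:
        # No matching table in this document.
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both collected.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows
```

Calling `table_to_rows(r.content)` on a page containing the example table should give a header row `['Number', 'Name']` followed by one list per data row, provided the parser finds every `<tr>`.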

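I don't fully understand the difference between lxml and html.parser, but the Beautiful Soup documentation notes that different parsers can build different trees from the same document when the HTML is not well formed. I will remember this fix in case I run into the same problem again.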
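
To see whether the parser really is the cause, the same HTML can be parsed with several parsers and the row counts compared. This is only a rough sketch: the function name `count_rows_per_parser` is hypothetical, and it assumes that lxml and html5lib may or may not be installed, recording `None` for any parser that Beautiful Soup cannot find.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> elements found},
    with None for parsers that are not installed."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts
```

For example, `count_rows_per_parser(r.content)` returns one count per parser. If the counts differ, the page's HTML is probably being interpreted differently by each parser, which would match the behaviour described in this post.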
