
【Python】Not getting all rows with BeautifulSoup

Overview

When scraping a table with Beautiful Soup, I ran into a problem where only some of the table's rows were returned instead of all of them. This post shows how I fixed it.

Environment

Python 3.7.3

Example of the problem

This is an example of the table you want to scrape.

Number  Name
1       Sato
2       Kato
3       Ito
4       Goto

I inspected the HTML in the web browser's developer tools by pressing the F12 key.

<table class="test_table">
<thead>
<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
    <tr>
    <td>1</td>
    <td>Sato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>2</td>
    <td>Kato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>3</td>
    <td>Ito</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>4</td>
    <td>Goto</td>
    </tr>
</tbody>
</table>

When I used the following code to scrape the table with BeautifulSoup, only some of the rows came back. I didn't know why.

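import requests
from bs4 import BeautifulSoup

# The original snippet used `headers` without defining it;
# this User-Agent is a placeholder value (assumption).
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get("http://example.com", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Take the first table with class "test_table"
table = soup.find_all("table", {"class": "test_table"})[0]

# Print every row that the parser found in the table
rows = table.find_all("tr")
for row in rows:
    print(row)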

Result

<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
    <tr>
    <td>1</td>
    <td>Sato</td>
    </tr>
</tbody>
<tbody>
    <tr>
    <td>2</td>
    <td>Kato</td>
    </tr>
</tbody>
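
One way to confirm that rows are being lost during parsing, rather than being absent from the response, is to compare the number of <tr> start tags in the raw text with the number of rows in the parsed tree. This is only a minimal sketch, assuming the HTML is available as a string. The function name raw_and_parsed_row_counts and the shortened sample markup are hypothetical and are not part of the original code.

```python
import re

from bs4 import BeautifulSoup


def raw_and_parsed_row_counts(html, parser="html.parser"):
    """Return (count of <tr> start tags in the raw text, count of rows in the parsed tree)."""
    # Count opening <tr> tags directly in the text, with or without attributes.
    raw = len(re.findall(r"<tr[\s>]", html, flags=re.IGNORECASE))
    # Count the <tr> elements that the chosen parser actually produced.
    soup = BeautifulSoup(html, parser)
    parsed = len(soup.find_all("tr"))
    return raw, parsed


# Shortened, well-formed sample markup (hypothetical), so both counts agree here.
sample = (
    "<table class='test_table'>"
    "<thead><tr><th>Number</th></tr></thead>"
    "<tbody><tr><td>1</td></tr></tbody>"
    "<tbody><tr><td>2</td></tr></tbody>"
    "</table>"
)

raw, parsed = raw_and_parsed_row_counts(sample)
print(raw, parsed)
```

If the first number is larger than the second for the real page, the rows are present in the response and are being dropped by the parser.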

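How to fix it

I switched the Beautiful Soup parser from html.parser to lxml, and after that all rows were returned. lxml is a separate package, so it has to be installed first (for example with pip install lxml).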

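import requests
from bs4 import BeautifulSoup

# The original snippet used `headers` without defining it;
# this User-Agent is a placeholder value (assumption).
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get("http://example.com", headers=headers)
# Only this line changes: use the lxml parser instead of html.parser
soup = BeautifulSoup(r.content, "lxml")

# Take the first table with class "test_table"
table = soup.find_all("table", {"class": "test_table"})[0]

# Print every row that the parser found in the table
rows = table.find_all("tr")
for row in rows:
    print(row)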

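I don't fully understand the difference between lxml and html.parser. The Beautiful Soup documentation does note that different parsers can build different trees from the same document when the markup is not well formed, which is probably what happened here. I will try switching parsers if I run into the same problem again.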
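
To see whether the choice of parser matters for a given page, the same markup can be parsed with each installed parser and the row counts compared. This is a rough sketch, not a definitive diagnosis. The names SAMPLE_HTML and rows_per_parser are hypothetical, and html5lib is included only as a third parser that Beautiful Soup supports. Parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The example table from this post (well formed, so every parser should agree).
SAMPLE_HTML = """
<table class="test_table">
<thead>
<tr><th>Number</th><th>Name</th></tr>
</thead>
<tbody><tr><td>1</td><td>Sato</td></tr></tbody>
<tbody><tr><td>2</td><td>Kato</td></tr></tbody>
<tbody><tr><td>3</td><td>Ito</td></tr></tbody>
<tbody><tr><td>4</td><td>Goto</td></tr></tbody>
</table>
"""


def rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> rows found, or None if not installed}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        table = soup.find("table", class_="test_table")
        counts[name] = len(table.find_all("tr")) if table else 0
    return counts


print(rows_per_parser(SAMPLE_HTML))
```

With the real page, r.content or r.text would be passed in place of SAMPLE_HTML. If the counts differ between parsers, the markup is likely malformed somewhere, and the parser that returns the expected number of rows is the one to use.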
