【Python】Not getting all rows with BeautifulSoup
Overview
When scraping a table with Beautiful Soup, I ran into a problem where only some of the rows were returned instead of all of them. This post shows how I fixed it.
Environment
Python 3.7.3
Example of the problem
This is an example of the table you want to scrape.
Number | Name |
1 | Sato |
2 | Kato |
3 | Ito |
4 | Goto |
I checked the HTML in the web browser's developer tools by pressing the F12 key.
<table class="test_table">
<thead>
<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Sato</td>
</tr>
</tbody>
<tbody>
<tr>
<td>2</td>
<td>Kato</td>
</tr>
</tbody>
<tbody>
<tr>
<td>3</td>
<td>Ito</td>
</tr>
</tbody>
<tbody>
<tr>
<td>4</td>
<td>Goto</td>
</tr>
</tbody>
</table>
When I scraped the table with Beautiful Soup using the following code, only some of the rows came back. I didn't know why.
import requests
from bs4 import BeautifulSoup

# Assumption: the original post does not show `headers`; this is a placeholder value.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('http://example.com', headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
# Take the first table whose class is "test_table"
table = soup.find_all('table', {'class': "test_table"})[0]
rows = table.find_all('tr')
for row in rows:
    print(row)
Result
<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Sato</td>
</tr>
</tbody>
<tbody>
<tr>
<td>2</td>
<td>Kato</td>
</tr>
</tbody>
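A rough way to confirm that rows really are missing is to compare the number of `<tr` tags in the raw HTML with the number of rows Beautiful Soup returned. The sketch below uses only the standard library. `count_tr_tags` and `SAMPLE_HTML` are hypothetical names made up for illustration, and the regex is only a crude count that assumes no `<tr` text appears inside comments or scripts.

```python
import re

# Hypothetical sample that mirrors the table above (shortened to two body rows).
SAMPLE_HTML = """
<table class="test_table">
<thead><tr><th>Number</th><th>Name</th></tr></thead>
<tbody><tr><td>1</td><td>Sato</td></tr></tbody>
<tbody><tr><td>2</td><td>Kato</td></tr></tbody>
</table>
"""


def count_tr_tags(html):
    """Roughly count opening <tr> tags in raw HTML text without parsing it."""
    # \b keeps "<tr" from matching longer tag names such as "<track".
    return len(re.findall(r"<tr\b", html, flags=re.IGNORECASE))


print(count_tr_tags(SAMPLE_HTML))
```

With the real page, you could pass `r.text` to this function and compare the number with `len(rows)`. If the raw count is larger, some rows are probably being lost or misplaced while the HTML is parsed.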
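How to fix it
I switched the Beautiful Soup parser from html.parser to lxml, and then it worked. Note that lxml is a separate package, so it has to be installed first (for example with pip install lxml).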
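import requests
from bs4 import BeautifulSoup

# Assumption: the original post does not show `headers`; this is a placeholder value.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('http://example.com', headers=headers)
# Only the parser name changes: "lxml" instead of "html.parser"
soup = BeautifulSoup(r.content, "lxml")
table = soup.find_all('table', {'class': "test_table"})[0]
rows = table.find_all('tr')
for row in rows:
    print(row)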
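To see whether the parser choice matters for a given page, the same HTML can be fed to several parsers and the row counts compared. This is a minimal sketch, not the code from this post: `count_rows_by_parser` and `SAMPLE_HTML` are hypothetical names, and any parser that is not installed is reported as `None`. On a well-formed sample like this one, every installed parser should agree. Differences tend to show up only with invalid markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample that mirrors the table above (shortened to two body rows).
SAMPLE_HTML = """
<table class="test_table">
<thead><tr><th>Number</th><th>Name</th></tr></thead>
<tbody><tr><td>1</td><td>Sato</td></tr></tbody>
<tbody><tr><td>2</td><td>Kato</td></tr></tbody>
</table>
"""


def count_rows_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> in the test_table, or None if unavailable}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The requested parser is not installed in this environment.
            counts[parser] = None
            continue
        table = soup.find("table", class_="test_table")
        counts[parser] = len(table.find_all("tr")) if table is not None else 0
    return counts


for name, count in count_rows_by_parser(SAMPLE_HTML).items():
    print(name, count)
```

For the real page, `r.content` would take the place of `SAMPLE_HTML`. A parser that reports fewer rows than the others is the one that builds a different tree from that page's markup.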
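I don't fully understand the difference between lxml and html.parser yet, but I will remember this fix in case I run into the same problem again. The Beautiful Soup documentation does note that different parsers can build different trees from invalid markup, which may be what happened here.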