【Python】Not getting all rows with BeautifulSoup
Overview
When scraping a table with Beautiful Soup, you may get only some of the table's rows instead of all of them. This post shows how to fix that.
Environment
Python 3.7.3
Example of the problem
This is an example of a table you want to scrape.
Number | Name |
1 | Sato |
2 | Kato |
3 | Ito |
4 | Goto |
I looked at the HTML in the web browser's developer tools, opened with the F12 key.
```html
<table class="test_table">
  <thead>
    <tr>
      <th>Number</th>
      <th>Name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Sato</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td>2</td>
      <td>Kato</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td>3</td>
      <td>Ito</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td>4</td>
      <td>Goto</td>
    </tr>
  </tbody>
</table>
```
When I scraped the table with Beautiful Soup using the following code, the result contained only some of the rows. I didn't know why.
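```python
import requests
from bs4 import BeautifulSoup

# Placeholder: the original post does not give the URL of the page
url = "https://example.com/"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# Take the first table with class "test_table" and collect its rows
table = soup.find_all("table", {"class": "test_table"})[0]
rows = table.find_all("tr")

for row in rows:
    print(row)
```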
Result
```html
<tr>
<th>Number</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Sato</td>
</tr>
</tbody>
<tbody>
<tr>
<td>2</td>
<td>Kato</td>
</tr>
</tbody>
```
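To narrow down why rows go missing, it can help to count the `<tr>` start tags in the raw response and compare that number with `len(rows)`. The following is a minimal sketch and was not part of the original investigation. It uses only the standard library's `html.parser.HTMLParser`; the names `TagCounter` and `count_start_tags` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by name without building a document tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_start_tags(html, tag):
    """Return how many times <tag> is opened in the given HTML text."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts.get(tag.lower(), 0)


# Illustrative sample: one header row and two data rows
sample_html = """
<table class="test_table">
  <thead><tr><th>Number</th><th>Name</th></tr></thead>
  <tbody><tr><td>1</td><td>Sato</td></tr></tbody>
  <tbody><tr><td>2</td><td>Kato</td></tr></tbody>
</table>
"""

if __name__ == "__main__":
    print(count_start_tags(sample_html, "tr"))
```

On a real page you would pass `r.text` instead of `sample_html`. If the raw count is higher than the number of rows Beautiful Soup returns for the table, that suggests the rows are present in the response but end up missing from, or outside of, the table while the tree is built.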
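How to fix it
I switched the Beautiful Soup parser from html.parser to lxml, and then it worked. Note that lxml is a separate package and has to be installed first, for example with pip install lxml.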
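```python
import requests
from bs4 import BeautifulSoup

# Placeholder: the original post does not give the URL of the page
url = "https://example.com/"
r = requests.get(url)

# Only the parser name changes: "lxml" instead of "html.parser"
soup = BeautifulSoup(r.content, "lxml")

# Take the first table with class "test_table" and collect its rows
table = soup.find_all("table", {"class": "test_table"})[0]
rows = table.find_all("tr")

for row in rows:
    print(row)
```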
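I don't fully understand the difference between lxml and html.parser, but I will remember this fix in case I run into the same problem again. The Beautiful Soup documentation does note that different parsers can build different trees from the same invalid markup, which may be what happened here.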
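A quick way to check whether the parser is the cause is to parse the same markup with each installed parser and compare the row counts. The following is a minimal sketch under stated assumptions: the helper name `count_rows_per_parser` and the sample markup are hypothetical, and lxml and html5lib are optional packages that are skipped when they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> in the first test_table}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The library behind this parser is not installed; skip it
            continue
        table = soup.find("table", class_="test_table")
        counts[parser] = len(table.find_all("tr")) if table else 0
    return counts


# Well-formed sample modelled on the table in this post
sample_html = """
<table class="test_table">
  <thead><tr><th>Number</th><th>Name</th></tr></thead>
  <tbody><tr><td>1</td><td>Sato</td></tr></tbody>
  <tbody><tr><td>2</td><td>Kato</td></tr></tbody>
  <tbody><tr><td>3</td><td>Ito</td></tr></tbody>
  <tbody><tr><td>4</td><td>Goto</td></tr></tbody>
</table>
"""

if __name__ == "__main__":
    for name, n in count_rows_per_parser(sample_html).items():
        print(name, n)
```

For this well-formed sample every installed parser should report the same count. On a real page you would pass `r.content` instead of `sample_html`; if the counts differ between parsers, the markup is probably invalid and the parsers are repairing it in different ways.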