python web scraping an unstructured table

You can use beautifulsoup to parse the HTML. For example:

import pandas as pd
from bs4 import BeautifulSoup


txt = '''<table class="table-detail">
            <tbody>
                <tr>
                    <td colspan="4" class="noborder">General Information
                    </td>
                </tr>
                <tr>
                    <th>Full name</th>
                    <td>
                        James Smith
                    </td>
                    <th>Year of birth</th>
                    <td>1992</td>
                </tr>
                <tr>
                    <th>Gender</th>
                    <td>Male</td>
                </tr>
                <tr>
                    <th>Place of birth</th>
                    <td>TTexas, USA</td>
                    <td>&nbsp;</td>
                    <td>&nbsp;</td>
                </tr>
                <tr>
                    <th>Address</th>
                    <td>Texas, USA</td>
                    <td>&nbsp;</td>
                    <td></td>
                </tr>'''


soup = BeautifulSoup(txt, 'html.parser')

row = {}
for h in soup.select('th:has(+td)'):
    row[h.text] = h.find_next('td').get_text(strip=True)

df = pd.DataFrame([row])
print(df)

Prints:

     Full name Year of birth Gender Place of birth     Address
0  James Smith          1992   Male    TTexas, USA  Texas, USA

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top