Python 3 – store a specific link from the output of soup.find_all in a variable

By "latest" I assume you mean the link with the most recent date. To find it, you need to capture both the URL and the date given for each link. The date can then be converted into a datetime object and added to a list together with its URL.

After all URLs are found, the list can be easily sorted into date order, with the newest first. The latest URL can then be used to download the img file.

For example:

from bs4 import BeautifulSoup
import requests
from datetime import datetime

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []       # will hold [datetime, link] pairs

for a in soup.find_all('a', href=True):
    link = a['href']

    # Only the download pages for .img files are of interest
    if link.endswith('.img.html'):
        # The date is in the first <em> element following the link
        date_text = a.find_next('em').get_text(strip=True)
        date_dt = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S %Z")
        urls.append([date_dt, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url

# Download the latest img file
url_img = base_url + latest.split('.html')[0]
filename = url_img.split('/')[-1]

with requests.get(url_img, stream=True, headers={'Referer' : base_url + latest}) as req_img:
    with open(filename, 'wb') as f_img:
        for chunk in req_img.iter_content(chunk_size=2**15): 
            f_img.write(chunk)

This approach has the advantage that it should still work if the naming or numbering scheme is changed. A Referer header is added so that the website returns the img file rather than the HTML for the download page.

This results in a download of 131,072 KB (128 MB).
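
If the Referer header is missing or rejected, the site may send back the download page instead of the image. As a minimal sketch (not part of the original answer), the response's Content-Type header could be checked before anything is written to disk. The helper name `is_html_response` is hypothetical, and the check assumes the site labels its HTML pages as `text/html`:

```python
def is_html_response(headers):
    """Return True if the Content-Type header indicates an HTML page."""
    content_type = headers.get("Content-Type", "")
    # Drop any parameters such as "; charset=utf-8" before comparing
    return content_type.split(";")[0].strip().lower() == "text/html"
```

With the download code above, this could be called as `is_html_response(req_img.headers)` straight after the request is made, raising an error instead of saving the file when it returns True.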


Note, if you prefer to ignore the date and just sort on the version number, use the following approach:

from bs4 import BeautifulSoup
import requests
import re

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []

for a in soup.find_all('a', href=True):
    link = a['href']

    if link.endswith('.img.html'):
        # Convert each group of digits to an int so that 10 sorts after 9
        version = [int(n) for n in re.findall(r'\d+', link)]
        urls.append([version, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url
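
One point to watch when sorting on version numbers: the digit groups returned by `re.findall` are strings, and strings compare character by character, so `'10'` sorts before `'9'`. Comparing them as integers avoids this. The following is a small offline sketch of the idea; the two links are made up for illustration and `version_key` is a hypothetical helper name:

```python
import re

def version_key(link):
    """Return every run of digits in the link as a list of ints."""
    return [int(n) for n in re.findall(r'\d+', link)]

# Hypothetical links, for illustration only
links = [
    "/gauguin/twrp-3.9.0_10-0-gauguin.img.html",
    "/gauguin/twrp-3.10.0_10-0-gauguin.img.html",
]

# Lists of ints compare element by element, so 3.10.0 beats 3.9.0
latest = max(links, key=version_key)
print(latest)
```

Using `max()` with a key also avoids sorting the whole list when only the newest entry is needed.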
