When I scrape the data, the HTML I get does not match what I see on the page

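Try the approach below using Python's requests library: it is simple, reliable, and fast, and it needs very little code. I found the API URL on the website itself by inspecting the Network tab of Google Chrome's developer tools. The listings appear to be loaded from this API after the initial HTML arrives, which would explain why the scraped HTML differs from what the browser shows.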

What the script below does:

  1. It takes the API URL and sends a POST request built from the query string parameters, the headers, and the form data. The dynamic parameters are the variables written in all caps.

  2. The form data is dynamic: you can pass any valid value in the parameters, and a new payload is built for every request to the site. (Important: do not change the value of the PAGE_NO parameter.)

  3. After receiving the response, the script parses the JSON data using json.loads.

  4. Finally, it iterates over the list of locations returned for each page and prints fields such as Address, Name, Agent Code, Map URL, and Phone. You can modify these fields as needed.
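Steps 1 and 2 come down to building a form-encoded request body. As a side note, here is a minimal sketch of that step on its own, assuming you would rather let `urllib.parse.urlencode` escape the values than concatenate strings by hand. The helper name `build_payload` and the `PREFIX` constant are hypothetical; the field names are copied from the script.

```python
from urllib.parse import urlencode

# Prefix shared by all form fields in the script (copied as-is from the site).
PREFIX = "_location_display_ortlet_"


def build_payload(page_no, location_type="", province="", service=""):
    """Return the URL-encoded form body for one page of results."""
    fields = {
        PREFIX + "page": page_no,
        PREFIX + "locationType": location_type,
        PREFIX + "province": province,
        PREFIX + "service": service,
    }
    # urlencode escapes spaces and special characters in the values.
    return urlencode(fields)


print(build_payload(1))
print(build_payload(2, province="Jawa Barat"))
```

The returned string can be passed as `data=` to `requests.post` in the same way as the hand-built payload. The province value here is only an illustration, not a value confirmed to be accepted by the site.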

    import json

    import requests


    def scrape_addresses():
        url = "https://bri.co.id/en/lokasi"  # API URL

        # Query string parameters of the API URL (required)
        querystring = {"p_p_id": "location_display_ortlet", "p_p_lifecycle": "2",
                       "p_p_state": "normal", "p_p_mode": "view",
                       "p_p_resource_id": "/location/ui/search",
                       "p_p_cacheability": "cacheLevelPage"}
        # Request headers and content type (required)
        headers = {
            'content-type': "application/x-www-form-urlencoded",
            'cache-control': "no-cache"
        }
        # Parameters used to build the form data
        # (change as needed, except PAGE_NO)
        PAGE_NO = 1
        LOCATION_TYPE = ''
        PROVINCE = ''
        SERVICE = ''

        while True:
            print('Creating new form data for page no : ' + str(PAGE_NO))
            # Request payload / form data (required)
            payload = ('_location_display_ortlet_page=' + str(PAGE_NO) +
                       '&_location_display_ortlet_locationType=' + LOCATION_TYPE +
                       '&_location_display_ortlet_province=' + PROVINCE +
                       '&_location_display_ortlet_service=' + SERVICE)

            # POST request to the API URL
            # (verify=False disables TLS certificate verification)
            response = requests.post(url, data=payload, headers=headers,
                                     params=querystring, verify=False)

            print('Created new form data going to fetch data...')

            result = json.loads(response.text)  # parse the JSON response

            # Stop when the response is empty or contains no more records
            if len(result) == 0 or not result.get('data'):
                break
            else:
                extracted_data = result['data']
                for data in extracted_data:
                    print('-' * 100)
                    print('Fetching data for : ', data['name'])
                    print('Address : ', data['address'])
                    print('Agent Code : ', data['agentCode'])
                    print('ID : ', data['id'])
                    print('Latitude : ', data['latitude'])
                    print('Longitude : ', data['longitude'])
                    print('Name : ', data['name'])
                    print('Opening Hours : ', data['openingHours'])
                    print('Phone : ', data['phone'])
                    print('Service Offered : ', data['serviceOffered'])
                    print('Type : ', data['type'])
                    print('Maps URL : ', data['urlMaps'])
                    print('-' * 100)
            PAGE_NO += 1  # move to the next page to scrape more data


    scrape_addresses()
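
If you want to keep the records instead of printing them, the paging loop can be separated from the output. The following is a rough sketch under a few assumptions: `collect_records`, `to_csv`, `fake_fetch`, and the `max_pages` cap are hypothetical names introduced here, and `fake_fetch` is only a stand-in for the real POST request so the example runs without network access. It assumes each page is a JSON object with a `data` list, as in the script above.

```python
import csv
import io

# Subset of the keys printed by the script; adjust to the fields you need.
FIELDS = ["name", "address", "phone", "urlMaps"]


def collect_records(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page has no 'data'."""
    records = []
    for page_no in range(1, max_pages + 1):
        result = fetch_page(page_no)
        rows = result.get("data") if isinstance(result, dict) else None
        if not rows:
            break  # empty or missing 'data' means there are no more pages
        records.extend(rows)
    return records


def to_csv(records, fields=FIELDS):
    """Serialize the selected fields of each record as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def fake_fetch(page_no):
    """Stand-in for the real request: one page of made-up data, then empty."""
    pages = {
        1: {"data": [{"id": 1, "name": "Branch A", "address": "Example St 1",
                      "phone": "000", "urlMaps": ""}]},
        2: {"data": []},
    }
    return pages.get(page_no, {})


records = collect_records(fake_fetch)
print(to_csv(records))
```

To use it against the site, replace `fake_fetch` with a function that performs the POST request for a given page number and returns the parsed JSON.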
    
