How to Use Selenium
What is Selenium?
- Selenium is a browser automation and testing framework that can interact with dynamically generated websites
- Unlike beautifulsoup4, which only parses static pages (or pages rendered server-side), Selenium can work on any website as long as you use a webdriver
- advantages
- extremely flexible, with many possibilities
- can interact with JavaScript-driven pages
- disadvantages
- quite slow and computationally expensive
- not the best choice for extremely large datasets if scraping data
Setting up Selenium
- first activate your environment
- Then
pip install selenium
- Next step is to install your webdrivers:
-
pip install webdrivermanager
webdrivermanager firefox chrome --linkpath /usr/local/bin
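Before going further, it can save some head-scratching to confirm the driver binary actually landed somewhere on your PATH. A minimal sketch (the function name and default are my own, not part of any library):

```python
import shutil

def driver_on_path(name="chromedriver"):
    # shutil.which returns the full path of an executable found on PATH, or None
    return shutil.which(name) is not None
```

If this returns False, revisit the webdrivermanager step or point Selenium at the driver binary's location explicitly.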
Using Selenium
Interacting with elements
- now let's try out the webdriver:
- in your terminal just enter
python
- in your python shell try the following:
from selenium import webdriver
wd = webdriver.Chrome()
- The above should have opened up a test browser, next enter the following:
wd.get('https://www.google.com')
- Now, to actually interact with a webpage, we will go to a simpler example:
wd.get('https://health-infobase.canada.ca/covid-19/covidtrends/?HR=1&mapOpen=false')
- Now inspect element in your browser, specifically the search bar and search button (results below; the first is the search bar, the second is the search button):
-
<input id="postalCode" type="text" name="locationInput" placeholder="e.g. K0A or Cold Lake" list="">

<button id="searchPostal" class="btn btn-info">
  <i class="fa fa-search" aria-hidden="true"></i><span class="wb-inv">Search</span>
</button>
- Now that we see the CSS id of both, we can interact with them:
-
input_elem = wd.find_element_by_id('postalCode')
search_button = wd.find_element_by_id('searchPostal')

# alternatively one can also use:
from selenium.webdriver.common.by import By
input_elem = wd.find_element(By.ID, 'postalCode')
search_button = wd.find_element(By.ID, 'searchPostal')

# To send a value to the search bar
input_elem.send_keys('M1P4P5')  # can be anything you want

# To click the search button
search_button.click()
- The above selects the first occurrence of those IDs in the webpage
-
After running the above, you should notice that the test browser entered the string into the search bar and then searched for that value
-
now what if we want to clear everything in the search bar (one useful instance is if you need to search multiple things and the search bar does not clear itself after a query):
-
input_elem.clear()
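When searching repeatedly, clearing and re-typing go together often enough that a tiny helper can keep a loop tidy. A sketch (fill_field is a hypothetical name of my own, not a Selenium method):

```python
def fill_field(elem, text):
    # clear whatever is currently in the field, then type the new value
    elem.clear()
    elem.send_keys(text)
```

It works with any object exposing Selenium's clear/send_keys element interface.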
- what if we have multiple elements?
-
# instead of find_element, use find_elements
input_elem = wd.find_elements_by_id('postalCode')
search_button = wd.find_elements_by_id('searchPostal')

# or, equivalently:
input_elem = wd.find_elements(By.ID, 'postalCode')
search_button = wd.find_elements(By.ID, 'searchPostal')
- what if the element we are interacting with does not have an id?
- Just replace id with the locator strategy you want to use
- complete list of options here
-
# simply replace id with the locator strategy you need
# (values below come from the HTML we inspected earlier)
search_button = wd.find_elements_by_class_name('btn')
search_button = wd.find_elements_by_xpath('//button[@id="searchPostal"]')
input_elem = wd.find_elements(By.TAG_NAME, 'input')
input_elem = wd.find_elements(By.NAME, 'locationInput')
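If you are unsure which locator will survive a page redesign, you can try several in order. A sketch of a hypothetical helper (find_first is not a Selenium API; it works with any driver object exposing find_element):

```python
def find_first(driver, locators):
    # locators is a list of (strategy, value) pairs,
    # e.g. [(By.ID, 'postalCode'), (By.NAME, 'locationInput')]
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception:  # Selenium raises NoSuchElementException when nothing matches
            continue
    return None
```

The first locator that matches wins; None means none of them did.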
-
you are also not limited to clicks and sending keys: you can drag and drop elements using Selenium's ActionChains class (from one element to another, or by an x and y offset; more about ActionChains can be read here), execute JavaScript in the browser using execute_script (a method on the webdriver object), and scroll down webpages (useful for scraping images or posts on infinitely scrolling websites like Facebook or Twitter)
- full list of possibilities here
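Scrolling in particular is easy to sketch with execute_script. A hedged example, assuming the page grows in height as new content loads (the function name and defaults are my own):

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_scrolls=50):
    # keep scrolling until the document height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```

The max_scrolls cap keeps the loop from running forever on a truly infinite feed.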
-
What if you want to wait for something to appear before you begin to scrape (for example, a table or an image loading)? You can use Selenium's wait functionality. The example below waits for the presence of an element before continuing:
-
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(wd, 10)  # gives the browser a 10 second limit before raising a TimeoutException
wait.until(EC.presence_of_element_located((By.ID, "health-region-row-hr1")))  # blocks until the element with the id "health-region-row-hr1" appears (within the 10 seconds)
-
waiting does not have to be until an element is located; you can also wait until an element is clickable, until text appears, or until it becomes visible
- full list of all possibilities here
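wait.until also accepts any callable that takes the driver and returns something truthy, so you can write your own conditions. A sketch (text_present is hypothetical; the built-in expected_conditions already cover most cases):

```python
def text_present(locator, text):
    # returns a condition usable as wait.until(text_present((By.ID, 'x'), 'loaded'))
    def _condition(driver):
        elem = driver.find_element(*locator)
        return text in elem.text
    return _condition
```

The wait re-invokes the callable until it returns truthy or the timeout elapses.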
-
when you are done interacting with all elements and want to end your session, simply run:
-
# if you only want to close the window that the webdriver is focusing on (for example an ad popup), run:
wd.close()

# if you want to terminate everything, i.e. tabs, pop-ups, the browser itself, etc., run:
wd.quit()
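If your scraping code can raise, it is easy to leave orphaned browser processes behind. One way to guarantee quit() always runs is to wrap the session in a context manager; a sketch (managed_driver is my own helper, not part of Selenium):

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    # factory is any zero-argument callable returning a webdriver, e.g. webdriver.Chrome
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # always runs, even if the body raises
```

Usage would look like: `with managed_driver(webdriver.Chrome) as wd: wd.get(...)`.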
Getting data
- now that we have interacted with the browser and gotten a response, we want to get some data out of that response
- We get the data similar to how we find the buttons and other such elements from the previous section
- first, take a look at the CSS element of the item containing the text; in our example it is:
-
<span class="main-title">City of Toronto Health Unit</span>
- we can see that the class is 'main-title', and from interacting with the website ourselves we know this is the first occurrence; since we want the first occurrence, we simply run what we have run before:
-
health_region_element = wd.find_element_by_class_name('main-title')
- to get the text from this, we simply run:
-
health_region = health_region_element.text
- you are not limited to text, however; you can extract much more, even screenshots of that specific element!
- full list here
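When pulling text from many elements at once (say, every 'main-title' match returned by find_elements), a small filter keeps empty matches out of your data. A sketch (texts is a hypothetical helper over whatever list of elements Selenium returns):

```python
def texts(elements):
    # keep only the non-empty, stripped text of each element
    return [e.text.strip() for e in elements if e.text and e.text.strip()]
```

Hidden elements often report empty text, so this filtering step is worth having.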
Selenium Options
- Your webdriver has many 'options' that can change how your scraping program runs; for instance, if you want to run it without a GUI, there is an option for running headlessly
- to add an option, you create an options object before creating the webdriver object
- the below example shows how to make a browser run headlessly
-
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
- there are more options however, such as adding extensions (unfortunately only available for chrome)
Putting it all together
- Using the same website as always for the example
- Let's say we have a list of postal codes as such:
['M1P4P5', 'M1S4C3', 'V6G3H5', 'R8A1C6', 'S7N3Y3', 'V0N1B4', 'S7N4X3', 'M1B2C3']
- If we were to scrape naively (i.e. just feed these through a loop, collecting text and clearing the textbox), we hit an issue: after searching a code in the Toronto health region and then one in a different region (whose health-region popup comes to the top), searching another Toronto code does not bring the City of Toronto popup back to the top. We therefore need to click the many close elements after each scrape
- We also want to run this headlessly
- We want to extract the health region result
- Before we begin extraction we need to wait for the page to finish loading, which we can see by looking for the presence of the health-region-row-hr1 id
- after we are done we want to close the entire browser
- we want to write our result to a JSON
- This leaves us with:
-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from tqdm import tqdm
import json

data = ['M1P4P5', 'M1S4C3', 'V6G3H5', 'R8A1C6', 'S7N3Y3', 'V0N1B4', 'S7N4X3', 'M1B2C3']

options = Options()
options.add_argument('--headless')  # sets the browser to be headless
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://health-infobase.canada.ca/covid-19/covidtrends/?HR=1&mapOpen=false')  # tells the browser to go to this page

wait = WebDriverWait(driver, 10)  # sets a 10 second limit on our wait criteria
wait.until(EC.presence_of_element_located((By.ID, "health-region-row-hr1")))  # tells the browser to wait for this element

input_elem = driver.find_element_by_id("postalCode")  # finds the search bar
button_elem = driver.find_element_by_id("searchPostal")  # finds the search button

unmapped_to_mapped = {}  # dictionary for saving the scraped data
unmapped_to_mapped['unmapped'] = []  # list for postal codes that fail to map

# tqdm is a library that shows a progress bar; it can wrap any iterable
for postal in tqdm(data):
    try:
        input_elem.send_keys(postal)  # sends the postal code to the search bar
        button_elem.click()  # clicks the button after entering the keys
        input_elem.clear()  # clears the search bar
        # finds the region based on the first 'main-title' occurrence, which
        # we know is the right one from exploring the website beforehand
        region = driver.find_element_by_class_name('main-title').text
        unmapped_to_mapped[postal] = region  # adds the result to our dictionary
        # 'close' is the class of the close buttons on the health region popups
        close_buttons = driver.find_elements_by_class_name('close')
        for button in close_buttons:
            try:
                button.click()  # close the popup (or popups) that come up
            except Exception:
                pass  # some 'close' elements are not clickable; skip them so the code does not break
    except Exception as e:
        print(e)
        unmapped_to_mapped['unmapped'].append(postal)  # records any unmappable postal codes

driver.quit()

# writes unmapped_to_mapped to a JSON file
with open("unmapped_to_mapped.json", "w") as file:
    json.dump(unmapped_to_mapped, file)
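Once the JSON is written, downstream code can reload it and, for example, report how many postal codes failed to map. A short sketch (load_results is my own wrapper around json.load, not anything from the script above):

```python
import json

def load_results(path="unmapped_to_mapped.json"):
    # reload the scraped postal-code-to-region mapping for downstream use
    with open(path) as f:
        return json.load(f)
```

For instance: `results = load_results(); print(len(results['unmapped']), 'postal codes could not be mapped')`.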