How to Use Selenium
What is Selenium?
- Selenium is a browser automation and testing framework that can interact with dynamically generated websites
- Unlike beautifulsoup4, which only parses static pages (or pages rendered server-side), Selenium can work on any website as long as you use a webdriver
- advantages
- extremely flexible, with many possibilities
- can interact with JavaScript-driven pages
- disadvantages
- quite slow and computationally expensive
- not the best choice for extremely large datasets if scraping data
Setting up Selenium
- first activate your environment
- Then
pip install selenium
- Next step is to install your webdrivers:
-
pip install webdrivermanager
webdrivermanager firefox chrome --linkpath /usr/local/bin
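Before going further, it can save some head-scratching to confirm the driver binary actually landed somewhere on your PATH. A minimal sketch (the function name and default are my own, not part of any library):

```python
import shutil

def driver_on_path(name="chromedriver"):
    # shutil.which returns the full path of an executable found on PATH, or None
    return shutil.which(name) is not None
```

If this returns False, revisit the webdrivermanager step or point Selenium at the driver binary's location explicitly.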
Using Selenium
Interacting with elements
- now let's try out the webdriver:
- in your terminal just enter
python
- in your python shell try the following:
from selenium import webdriver
wd = webdriver.Chrome()
- The above should have opened up a test browser, next enter the following:
wd.get('https://www.google.com')
- Now, to actually interact with a webpage, we will go to a simpler example:
wd.get('https://health-infobase.canada.ca/covid-19/covidtrends/?HR=1&mapOpen=false')
- Now inspect element in your browser, specifically the search bar and search button (results below; the first is the search bar, the second is the search button):
-
<input id="postalCode" type="text" name="locationInput" placeholder="e.g. K0A or Cold Lake" list="">

<button id="searchPostal" class="btn btn-info">
  <i class="fa fa-search" aria-hidden="true"></i><span class="wb-inv">Search</span>
</button>
- Now that we see the CSS id of both, we can interact with them:
-
input_elem = wd.find_element_by_id('postalCode')
search_button = wd.find_element_by_id('searchPostal')

# alternatively one can also use:
from selenium.webdriver.common.by import By
input_elem = wd.find_element(By.ID, 'postalCode')
search_button = wd.find_element(By.ID, 'searchPostal')

# To send a value to the search bar
input_elem.send_keys('M1P4P5')  # can be anything you want

# To click the search button
search_button.click()
- The above selects the first occurrence of those IDs in the webpage
-
After running the above, you should notice that the test browser entered the string into the search bar and then searched for that value
-
now what if we want to clear everything in the search bar (one useful instance is if you need to search multiple things and the search bar does not clear itself after a query):
-
input_elem.clear()
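When searching repeatedly, clearing and re-typing go together often enough that a tiny helper can keep a loop tidy. A sketch (fill_field is a hypothetical name of my own, not a Selenium method):

```python
def fill_field(elem, text):
    # clear whatever is currently in the field, then type the new value
    elem.clear()
    elem.send_keys(text)
```

It works with any object exposing Selenium's clear/send_keys element interface.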
- what if we have multiple elements?
-
# instead of find_element, use find_elements
input_elem = wd.find_elements_by_id('postalCode')
search_button = wd.find_elements_by_id('searchPostal')

# or, equivalently:
input_elem = wd.find_elements(By.ID, 'postalCode')
search_button = wd.find_elements(By.ID, 'searchPostal')
- what if the element we are interacting with does not have an id?
- Just replace id with the locator strategy you want to use
- complete list of options here
-
# simply replace id with the locator strategy you need
# (values below come from the HTML we inspected earlier)
search_button = wd.find_elements_by_class_name('btn')
search_button = wd.find_elements_by_xpath('//button[@id="searchPostal"]')
input_elem = wd.find_elements(By.TAG_NAME, 'input')
input_elem = wd.find_elements(By.NAME, 'locationInput')
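If you are unsure which locator will survive a page redesign, you can try several in order. A sketch of a hypothetical helper (find_first is not a Selenium API; it works with any driver object exposing find_element):

```python
def find_first(driver, locators):
    # locators is a list of (strategy, value) pairs,
    # e.g. [(By.ID, 'postalCode'), (By.NAME, 'locationInput')]
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception:  # Selenium raises NoSuchElementException when nothing matches
            continue
    return None
```

The first locator that matches wins; None means none of them did.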
-
you are also not limited to clicks and sending keys: you can drag and drop elements using Selenium's ActionChains class (from one element to another, or by an x and y offset; more about ActionChains can be read here), execute JavaScript in the browser using execute_script (a method on the webdriver object), and scroll down webpages (useful for scraping images or posts on infinitely scrolling websites like Facebook or Twitter)
- full list of possibilities here
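Scrolling in particular is easy to sketch with execute_script. A hedged example, assuming the page grows in height as new content loads (the function name and defaults are my own):

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_scrolls=50):
    # keep scrolling until the document height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```

The max_scrolls cap keeps the loop from running forever on a truly infinite feed.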
-
What if you want to wait for something to appear before you begin to scrape (for example, a table or an image loading)? You can use Selenium's wait functionality. The example below waits for the presence of an element before continuing:
-
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(wd, 10)  # gives the browser a 10 second limit before raising a TimeoutException
wait.until(EC.presence_of_element_located((By.ID, "health-region-row-hr1")))  # blocks until the element with the id "health-region-row-hr1" appears (within the 10 seconds)
-
waiting does not have to be until an element is located; you can also wait until an element is clickable, until text appears, or until it becomes visible
- full list of all possibilities here
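wait.until also accepts any callable that takes the driver and returns something truthy, so you can write your own conditions. A sketch (text_present is hypothetical; the built-in expected_conditions already cover most cases):

```python
def text_present(locator, text):
    # returns a condition usable as wait.until(text_present((By.ID, 'x'), 'loaded'))
    def _condition(driver):
        elem = driver.find_element(*locator)
        return text in elem.text
    return _condition
```

The wait re-invokes the callable until it returns truthy or the timeout elapses.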
-
when you are done interacting with all elements and want to end your session, simply run:
-
# if you only want to close the window that the webdriver is focusing on (for example an ad popup), run:
wd.close()

# if you want to terminate everything, i.e. tabs, pop-ups, the browser itself, etc., run:
wd.quit()
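If your scraping code can raise, it is easy to leave orphaned browser processes behind. One way to guarantee quit() always runs is to wrap the session in a context manager; a sketch (managed_driver is my own helper, not part of Selenium):

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    # factory is any zero-argument callable returning a webdriver, e.g. webdriver.Chrome
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # always runs, even if the body raises
```

Usage would look like: `with managed_driver(webdriver.Chrome) as wd: wd.get(...)`.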
Getting data
- now that we have interacted with the browser and gotten a response, we want to get some data out of that response
- We get the data similar to how we find the buttons and other such elements from the previous section
- first, take a look at the CSS element of the item containing the text; in our example it is:
-
<span class="main-title">City of Toronto Health Unit</span>
- we can see that the class is 'main-title', and from interacting with the website ourselves we know this is the first occurrence; since we want the first occurrence, we simply run what we have run before:
-
health_region_element = wd.find_element_by_class_name('main-title')
- to get the text from this, we simply run:
-
health_region = health_region_element.text
- you are not limited to text, however; you can extract much more, even screenshots of that specific element!
- full list here
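When pulling text from many elements at once (say, every 'main-title' match returned by find_elements), a small filter keeps empty matches out of your data. A sketch (texts is a hypothetical helper over whatever list of elements Selenium returns):

```python
def texts(elements):
    # keep only the non-empty, stripped text of each element
    return [e.text.strip() for e in elements if e.text and e.text.strip()]
```

Hidden elements often report empty text, so this filtering step is worth having.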
Selenium Options
- Your webdriver has many 'options' that can change how your scraping program runs; for instance, if you want to run it without a GUI, there is an option for running headlessly
- to add an option, you create an options object before creating the webdriver object
- the below example shows how to make a browser run headlessly
-
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
- there are more options however, such as adding extensions (unfortunately only available for chrome)
Putting it all together
- Using the same website as always for the example
- Let's say we have a list of postal codes as such:
['M1P4P5', 'M1S4C3', 'V6G3H5', 'R8A1C6', 'S7N3Y3', 'V0N1B4', 'S7N4X3', 'M1B2C3']
- If we were to scrape naively (i.e. just feed these through a loop, collecting text and clearing the textbox), we hit an issue: after searching a code in the Toronto health region and then one in a different region (whose health-region popup comes to the top), searching another Toronto code does not bring the City of Toronto popup back to the top. We therefore need to click the many close elements after each scrape
- We also want to run this headlessly
- We want to extract the health region result
- Before we begin extraction we need to wait for the page to finish loading, which we can see by looking for the presence of the health-region-row-hr1 id
- after we are done we want to close the entire browser
- we want to write our result to a JSON
- This leaves us with:
-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from tqdm import tqdm
import json

data = ['M1P4P5', 'M1S4C3', 'V6G3H5', 'R8A1C6', 'S7N3Y3', 'V0N1B4', 'S7N4X3', 'M1B2C3']

options = Options()
options.add_argument('--headless')  # sets the browser to be headless
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://health-infobase.canada.ca/covid-19/covidtrends/?HR=1&mapOpen=false')  # tells the browser to go to this page

wait = WebDriverWait(driver, 10)  # sets a 10 second limit on our wait criteria
wait.until(EC.presence_of_element_located((By.ID, "health-region-row-hr1")))  # tells the browser to wait for this element

input_elem = driver.find_element_by_id("postalCode")  # finds the search bar
button_elem = driver.find_element_by_id("searchPostal")  # finds the search button

unmapped_to_mapped = {}  # dictionary for saving the scraped data
unmapped_to_mapped['unmapped'] = []  # list for postal codes that fail to map

# tqdm is a library that shows a progress bar; it can wrap any iterable
for postal in tqdm(data):
    try:
        input_elem.send_keys(postal)  # sends the postal code to the search bar
        button_elem.click()  # clicks the button after entering the keys
        input_elem.clear()  # clears the search bar
        # finds the region based on the first 'main-title' occurrence, which
        # we know is the right one from exploring the website beforehand
        region = driver.find_element_by_class_name('main-title').text
        unmapped_to_mapped[postal] = region  # adds the result to our dictionary
        # 'close' is the class of the close buttons on the health region popups
        close_buttons = driver.find_elements_by_class_name('close')
        for button in close_buttons:
            try:
                button.click()  # close the popup (or popups) that come up
            except Exception:
                pass  # some 'close' elements are not clickable; skip them so the code does not break
    except Exception as e:
        print(e)
        unmapped_to_mapped['unmapped'].append(postal)  # records any unmappable postal codes

driver.quit()

# writes unmapped_to_mapped to a JSON file
with open("unmapped_to_mapped.json", "w") as file:
    json.dump(unmapped_to_mapped, file)
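Once the JSON is written, downstream code can reload it and, for example, report how many postal codes failed to map. A short sketch (load_results is my own wrapper around json.load, not anything from the script above):

```python
import json

def load_results(path="unmapped_to_mapped.json"):
    # reload the scraped postal-code-to-region mapping for downstream use
    with open(path) as f:
        return json.load(f)
```

For instance: `results = load_results(); print(len(results['unmapped']), 'postal codes could not be mapped')`.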