More Selenium 4 Goodies

After my previous post on Selenium 4 Relative Locators, I explored Selenium 4 further and found a few more goodies in the WebElement and WebDriver interfaces.

Element Screenshots

Yes, we can now capture a screenshot of an individual element or a group of elements. This is a very useful feature. I talked about capturing element screenshots in my Selenium Testing Tools Cookbook. However, the new feature added in Selenium 4 (alpha-3) is built in and much simpler.

The WebElement interface now supports the getScreenshotAs() method by extending the TakesScreenshot interface, so a screenshot of a single element can be captured.

This method accepts an OutputType argument, so screenshots can be captured as a FILE, as BYTES or as a BASE64 string.
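As a rough sketch of what the BASE64 option hands back: the string is a Base64-encoded PNG, which can be decoded with the JDK alone. In a real test the string would come from element.getScreenshotAs(OutputType.BASE64); here the hypothetical fakeScreenshot() helper stands in for it so the snippet runs without a browser. The class and method names are illustrative and not part of Selenium.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

class Base64ScreenshotSketch {

    // Decodes a Base64-encoded PNG string into an in-memory image.
    // The MIME decoder is used because it tolerates line breaks in the input.
    static BufferedImage decode(String base64Png) {
        byte[] bytes = Base64.getMimeDecoder().decode(base64Png);
        try {
            return ImageIO.read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for a real screenshot: a blank PNG encoded as Base64.
    static String fakeScreenshot(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(img, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }
}
```

The decoded BufferedImage can then be compared, cropped or written to disk with ImageIO, which is handy when the screenshot has to be embedded in a report instead of saved as a file.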

Let’s capture a screenshot of a link displayed on the Google Search home page:


// find the Images link on Google Search home page
WebElement imagesLink = driver.findElement(By.linkText("Images"));

// take a screenshot of the link element
File linkScr = imagesLink.getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(linkScr, new File("./target/linkScr.png"));

We can also capture a group of elements by taking a screenshot of the parent element. Here is a complete example capturing the Images link and the search box:

[Image: code listing for the complete example]

The new getRect() method

The new getRect() method is introduced in the WebElement interface and is essentially a combination of the earlier getSize() and getLocation() methods. Here’s the difference between the earlier methods and the new getRect() method, which returns a Rectangle object:

[Image: code listing comparing getSize() and getLocation() with getRect()]

New additions in WebDriver

In addition to the maximize() method, the browser window can now be made fullscreen by using the new fullscreen() method, called as driver.manage().window().fullscreen().


A new parentFrame() method is added for navigating between frames: driver.switchTo().parentFrame() moves the focus from the current frame to its parent.


I’m not really sure if this is a completely new feature (or maybe I’m too lazy to go through the changes), but we can now create a new empty tab or a new browser window by using the newWindow() method, for example driver.switchTo().newWindow(WindowType.TAB).


That’s it for now. I’ll deep dive into new Selenium Grid features in an upcoming post.

Closing note

These features are in an alpha release and subject to change in the future. Please use them with caution. You can find the complete code example from this post in my GitHub repo.

Setting up minimal Selenium Grid with Docker

Here’s a simple guide to setting up a minimal Selenium Grid with Docker. To run Docker on your machine you will need Docker Toolbox installed. The steps below were done on a Mac.

We will use the Hub and Node images from the Selenium project hosted on Docker Hub.

Next we need to create a docker-compose file describing how we want to run the Selenium Grid Hub and connect nodes to the Hub. In this example we will launch a multi-container setup with a Hub connected to Firefox and Chrome nodes:

If you don’t have Docker running, start the default Docker machine with the following command:

docker-machine start default

To connect to the Docker shell, run the following command:

docker-machine env

and then:

eval $(docker-machine env)

This will connect the terminal session to the Docker shell.

Finally, run the docker-compose command from the directory where the docker-compose.yml file is stored:

docker-compose up

This will pull the required images from Docker Hub and launch the Hub, followed by the Firefox and Chrome nodes, which register themselves with the Hub. Now we have a minimal Selenium Grid up and running, and we can point Selenium tests to this Grid for execution. In the next post we’ll see some advanced options and integration with Maven and Cloud tools.

Using Tesseract with Selenium WebDriver for checking text on images using OCR

Recently a team approached me looking for a solution to extract text from an image displayed on a web page and verify its contents as part of Selenium tests.

This post explains a solution that uses Tesseract and Tess4J along with Selenium to check text displayed on images.

Tesseract is a well-known open source OCR engine. It uses the Leptonica Image Processing Library. Tesseract supports a wide variety of image formats and converts them to text in over 60 languages.

Tesseract works on Linux, Windows and Mac OS X. Please refer to the Readme page for installation instructions.

This sample is built on Mac. You can install Tesseract on Mac using homebrew:

brew install tesseract

In addition to Tesseract (written in C++), we need a Java wrapper called Tess4J, which provides a JNA wrapper for the Tesseract OCR API.

Here is a sample page which has a barcode displayed as an image. We will extract the barcode number and assert its value.


Since I am using Maven for this project, I added the Tess4J dependency to my pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns=""


Here’s a JUnit test which navigates to the sample page and checks the number displayed on the barcode image:

package me.unmesh.selenium.ocr.example;

import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.By;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.*;

import net.sourceforge.tess4j.*;

import java.io.File;

/**
 * A demo test to verify text from an image using Tesseract OCR API
 * @author  upgundecha
 */
public class BarcodeTest {
    private WebDriver driver;

    @Before
    public void setUp() {
        driver = new FirefoxDriver();
        // navigate to the dummy page with a barcode image
    }

    @After
    public void tearDown() {
        // close the browser
        driver.quit();
    }

    @Test
    public void testBarcodeNumber() throws Exception {
        // get and capture the picture of the img element used to display the barcode image
        WebElement barcodeImage = driver.findElement(By.id("barcode"));
        File imageFile = WebElementExtender.captureElementPicture(barcodeImage);

        // get the Tesseract direct interface
        Tesseract instance = new Tesseract();

        // the doOCR method of Tesseract will retrieve the text
        // from the image captured by Selenium
        String result = instance.doOCR(imageFile);

        // check the result
        assertEquals("Application number did not match", "123-45678", result.trim());
    }
}

Instead of capturing a screenshot of the entire page using Selenium, I captured a screenshot of only the image element where the barcode is displayed on the page.

    <title>Barcode Sample</title>
        <td style="padding:10px; font-size:15px; font-family:Arial, Helvetica; text-align:center;">
          <p> Please write down your application id</p>
          <img id="barcode" src="barcode.png" />

The captured image is then passed to the doOCR() method of the Tesseract instance to retrieve the text.
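OCR output often carries a trailing newline or stray spaces, which is why the test above trims the result before asserting. A slightly more defensive variant, sketched here under the assumption that the expected value (a barcode number) never contains whitespace, removes all whitespace before comparing. OcrTextSketch and normalize() are hypothetical names and not part of Tess4J.

```java
class OcrTextSketch {

    // Removes every whitespace character (spaces, tabs, newlines) from raw OCR output.
    // Suitable for compact values such as a barcode number, not for free text.
    static String normalize(String rawOcrText) {
        return rawOcrText.replaceAll("\\s+", "");
    }
}
```

In the test this would be used as assertEquals("123-45678", OcrTextSketch.normalize(result)) in place of result.trim().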

To capture the image of a WebElement I used the captureElementPicture() method from the WebElementExtender class, which is described in my book Selenium Testing Tools Cookbook:

package me.unmesh.selenium.ocr.example;

import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;

import javax.imageio.ImageIO;

import org.openqa.selenium.OutputType;
import org.openqa.selenium.Point;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.internal.WrapsDriver;

/**
 * This class provides various additional helper methods on elements
 * @author upgundecha
 */
public class WebElementExtender {

    /**
     * Gets a picture of specific element displayed on the page
     * @param element The element
     * @return File
     * @throws Exception
     */
    public static File captureElementPicture(WebElement element)
            throws Exception {

        // get the WrapsDriver of the WebElement
        WrapsDriver wrapsDriver = (WrapsDriver) element;

        // get the entire screenshot from the driver of passed WebElement
        File screen = ((TakesScreenshot) wrapsDriver.getWrappedDriver())
                .getScreenshotAs(OutputType.FILE);

        // create an instance of buffered image from captured screenshot
        BufferedImage img = ImageIO.read(screen);

        // get the width and height of the WebElement using getSize()
        int width = element.getSize().getWidth();
        int height = element.getSize().getHeight();

        // create a rectangle using width and height
        Rectangle rect = new Rectangle(width, height);

        // get the location of WebElement in a Point.
        // this will provide X & Y co-ordinates of the WebElement
        Point p = element.getLocation();

        // create image for element using its location and size.
        // this will give image data specific to the WebElement
        BufferedImage dest = img.getSubimage(p.getX(), p.getY(), rect.width,
                rect.height);

        // write back the image data for element in File object
        ImageIO.write(dest, "png", screen);

        // return the File object containing image data
        return screen;
    }
}
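
The cropping step can be tried on its own, without a browser. The sketch below is not the code from the book: it is a hypothetical helper (ElementCropSketch.crop) that isolates the getSubimage() call and additionally clamps the rectangle to the screenshot bounds, on the assumption that an element may extend past the captured area, in which case getSubimage() would throw a RasterFormatException.

```java
import java.awt.image.BufferedImage;

class ElementCropSketch {

    // Crops the region (x, y, width, height) out of a full-page screenshot.
    // The region is clamped to the screenshot bounds so that getSubimage()
    // is always called with a rectangle that lies inside the image.
    static BufferedImage crop(BufferedImage page, int x, int y, int width, int height) {
        int left = Math.max(0, Math.min(x, page.getWidth() - 1));
        int top = Math.max(0, Math.min(y, page.getHeight() - 1));
        int w = Math.max(1, Math.min(width, page.getWidth() - left));
        int h = Math.max(1, Math.min(height, page.getHeight() - top));
        return page.getSubimage(left, top, w, h);
    }
}
```

With a 100x50 screenshot, crop(page, 10, 10, 30, 20) returns a 30x20 image, while a region starting at (90, 40) is cut down to the 10x10 pixels that actually exist.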

Tesseract is clean, fast and accurate for OCR testing needs. A similar approach can be followed for .NET using the Emgu library.

PageObject Generator Utility for Selenium WebDriver

Today I saw an interesting tweet in my Twitter stream about a Page Recorder utility developed by Dmitry Zhariy, which helps generate PageObjects for Selenium WebDriver tests. I could not resist getting hands-on with this tool and writing this post.

I had been toying with the idea of building such a utility, and someone has already done great work developing this cool tool. You can read the original blog post about the SWD Page Recorder utility here (translated into English).

This project is hosted on GitHub and licensed under The MIT License.

First Impressions

The SWD Page Recorder utility helps automation developers find and locate elements as well as create page objects through a nicely built user interface. You don’t need to juggle browsers and tools like Firebug or the developer tools in Google Chrome or IE to find and create locator strategies. This tool allows you to launch various types of browsers, navigate to a page, spy on elements, look at their attributes, and create and test locators. You can then use this information to generate page objects in various programming languages.

SWD Recorder can be used to test locators just like Selenium IDE on browsers like IE, Chrome and Safari.

The tool is still in beta and has some areas for improvement. Read the original blog post for more details. I played with the utility to create a page object with the following steps:

Launch the SwdPageRecorder application. In the main window, select and configure the browser that you want to use from the Browser Settings tab. There is also an option to connect to a RemoteWebDriver instance.

Select the desired browser and hit the Start button to start the browser instance. By default the utility opens a preset URL. You can change this by entering the desired URL in the Browser textbox above the Browser Settings tab and clicking the Go button.

It will navigate to the URL as shown in the screenshot below:

[Screenshot: Browser Settings]

Switch to the Locators tab and click the Start button in the In-Browser Web Element Explorer section. Now switch to the browser instance opened by SWD Page Recorder.

Focus on a desired element in the browser window and press Ctrl + Right click. This will open a popup window as shown in the screenshot below:

[Screenshot: Element Information]

Add the element by specifying a logical, descriptive name in the Code Identifier textbox and clicking the Add element button. In this example I will specify emailTextBox in the Code Identifier textbox.

Keep adding the elements needed for the test with the steps above. You can see the elements from the page added to the tree in the screenshot below:

[Screenshot: Login Page Elements]

You can also add elements manually or edit elements that are already added by using the WebElement section. Elements can be highlighted using the Highlight button to check that the locator information is sufficient or to debug the locator values.

Generating PageObject Code

Once you have captured all the elements needed for your PageObject, switch to the Source Code tab. This tab provides templates for generating PageObject code in various languages (C#, Java, Perl, Python, Ruby, etc.). Select the desired template and click the Generate button to generate the code. SWD Page Recorder generated the following code for the elements added from the Login page:

[Screenshot: PageObject Code]

You can either copy the code into your editor or save it to a file, and you are done!


Overall this utility worked pretty well. There are a few glitches which I hope will be gone once the beta is over, and there is scope for improvement in the overall usability of the tool. Along with PageObjects, I would also like to see the utility generate some sort of XML or properties file based UI map.

iOS Automation with Appium & Selenium

Note: This post is not up to date with the latest release of Appium. An update is coming soon.

Yesterday I saw a tweet about the Appium release from Sauce Labs and immediately started exploring it. This post summarizes my initial experience with Appium.

Appium is an open source tool/framework for automating iOS native and hybrid apps. It uses the WebDriver JSON wire protocol to drive iOS apps.

The Appium server is written in Node.js and talks to iOS using UIAutomation via Instruments. You can use the Selenium WebDriver API for writing tests, which talk to Appium via the JSON wire protocol to run the Selenium commands. This also gives you the advantage of writing tests in your language of preference.
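To make the protocol part a little more concrete: in the JSON wire protocol a session starts with a POST to the server's /session endpoint, with the requested capabilities sent as a JSON object under the desiredCapabilities key. RemoteWebDriver normally does this for you. The sketch below only builds such a request body with the JDK, to show its shape. The class name, the helper methods and the app path are hypothetical, and the capability values mirror the ones used in the test later in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class NewSessionPayloadSketch {

    // Escapes backslashes and double quotes so a value can sit inside a JSON string.
    // Control characters are not handled; this is a minimal illustration only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the body of a "new session" request: {"desiredCapabilities":{...}}
    static String build(Map<String, String> capabilities) {
        StringJoiner fields = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : capabilities.entrySet()) {
            fields.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
        }
        return "{\"desiredCapabilities\":" + fields + "}";
    }

    // Example capabilities; the app path is a hypothetical placeholder.
    static Map<String, String> exampleCapabilities() {
        Map<String, String> caps = new LinkedHashMap<>();
        caps.put("browserName", "iOS");
        caps.put("version", "6.1");
        caps.put("platform", "Mac");
        caps.put("app", "/path/to/BmiCalc.app");
        return caps;
    }
}
```

Calling NewSessionPayloadSketch.build(NewSessionPayloadSketch.exampleCapabilities()) yields a single JSON object whose desiredCapabilities member lists the four entries in insertion order.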


I found installing Appium quite easy on a local machine. You need Node.js installed before using Appium.

1. Install Node.js

2. Install WebDriver package for Node.js with the following command

sudo npm install wd

3. Install Appium with the following command

sudo npm install appium -g

4. Start the Appium server with the following command

appium &

Appium server will start at http://localhost:4723

Implementing test using Selenium WebDriver

I am using a sample BMI Calculator app developed with the native iOS SDK for this example.

[Screenshot: Bmi Calculator App]

Build the app (in this example, the BmiCalc app) using the xcodebuild command:

xcodebuild -sdk iphonesimulator6.1

I am using Maven to set up a Java project for this test, and here is the pom.xml with the required dependencies added. For this example I have used IntelliJ IDEA. For more information on using Maven for Selenium script development, refer to the bonus chapter Integration with other Tools from my Selenium Testing Tools Cookbook.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns=""


And here is the BmiCalcTest class:

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.By;

import java.io.File;
import java.net.URL;
import java.util.List;

import static org.junit.Assert.assertEquals;

public class BmiCalcTest {

    private WebDriver driver;

    @Before
    public void setUp() throws Exception {

        //Appium needs the path of app build
        //Set up the desired capabilities and pass the iOS SDK version and app path to Appium
        File app = new File("/Users/upgundecha/Desktop/AppExamples/BmiCalculator/build/Release-iphonesimulator/");
        DesiredCapabilities capabilities = new DesiredCapabilities();
        capabilities.setCapability(CapabilityType.BROWSER_NAME, "iOS");
        capabilities.setCapability(CapabilityType.VERSION, "6.1");
        capabilities.setCapability(CapabilityType.PLATFORM, "Mac");
        capabilities.setCapability("app", app.getAbsolutePath());

        //Create an instance of RemoteWebDriver and connect to the Appium server.
        //Appium will launch the BmiCalc App in iPhone Simulator using the configurations specified in Desired Capabilities
        driver = new RemoteWebDriver(new URL("http://localhost:4723/wd/hub"), capabilities);
    }

    @Test
    public void testBmiCalc() throws Exception {

        //iOS controls are accessed through WebElement class
        //Locate the Height & Weight textField by their accessibility labels using By.name()
        WebElement heightTextField = driver.findElement(By.name("Height"));

        WebElement weightTextField = driver.findElement(By.name("Weight"));

        //Locate and tap on Calculate button using the click() method
        WebElement calculateButton = driver.findElement(By.name("Calculate"));
        calculateButton.click();

        //Locate all the label elements using By.tagName()
        List<WebElement> labels = driver.findElements(By.tagName("staticText"));

        //Check the calculated Bmi and Category displayed on labels
        //Label with index 8 has value of the Bmi and index 9 has the value for category
    }

    @After
    public void tearDown() throws Exception {
        //Close the app and simulator
        driver.quit();
    }
}

I really liked using the Selenium WebDriver API for writing iOS tests with Appium. I can add iOS support to my existing Selenium framework with minimal changes. Appium presently supports locating elements using the tag name (i.e. the type of iOS control) and accessibility labels.

Running tests in Cloud

You can also run Appium with Sauce Labs Cloud.

Overall Appium is a great tool to start with.

Getting Started
Appium on GitHub
Samples
Wiki
Google Group (appium-discuss)