Not all pages are created equal. When web pages vary in layout or content, you can use branch judgment to scrape them conditionally. Here is how it works:
Question: When should you consider using branch judgment?
Answer: There are two main scenarios when branch judgment can be useful.
1) When you are only interested in getting data from certain pages with a specific tag, such as "New", "Hot selling", "On Sale" etc.
2) When data on the page is displayed in different forms, for example sometimes as text and other times as images.
In the example below, we want information on all the laptops that are on sale. Looking closely at the item detail page, we can use the on-sale icon as the condition to test for: if the element is found on the item page, we’ll capture the product information; otherwise, we’ll skip the product entirely.
Let’s see how it is done!
1) Build a loop to click each link on the list (see tutorial)
2) Use branch judgment to test for the condition: whether the on-sale element is present on the item page
- Switch to the Workflow Mode by toggling the button located on the upper right-hand side
- Drag a Branch Judgement action into the loop
- Click the branch on the left-hand side, select "Execute branch when: When the current page contains element"
- Enter the XPath of the on-sale element, ".//div[@class='pb-savings']", in the "Element XPath" text box (how to find the XPath).
- Click "Save"
- Click the branch on the right-hand side, select "Do not judge. Always execute the branch"
- Click "Save"
In Octoparse, you can set the condition to one of the following:
- Do not judge. Always execute the branch - when this option is selected, Octoparse will not judge at all and will proceed to execute the actions within the branch immediately. Only select this option for the branch on the right side.
- Execute branch when current page contains text - when selected, Octoparse will look for the designated text string within the current page.
- Execute branch when current page contains element - when selected, Octoparse will look for the designated element (according to the XPath filled in) within the current page.
- Execute branch when current loop item contains text - when selected, Octoparse will look for the designated text string within the current loop item.
- Execute branch when current loop item contains element - when selected, Octoparse will look for the designated element (according to the XPath filled in) within the current loop item.
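The difference between these conditions can be sketched outside Octoparse. The snippet below is only an illustration, not how Octoparse works internally: it assumes a small, well-formed stand-in page (hypothetical markup, apart from the "pb-savings" class used in this tutorial) and uses Python's standard library, whose XPath support is limited but covers the expression above. The helper names `contains_text` and `contains_element` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an item page.
# Real-world HTML is rarely valid XML and would need an HTML parser.
PAGE = """
<html><body>
  <div class="product">
    <h1>Laptop A</h1>
    <div class="pb-savings">Save $100</div>
  </div>
</body></html>
""".strip()

def contains_text(node, text):
    """True if the text appears anywhere inside the node."""
    return text in "".join(node.itertext())

def contains_element(node, xpath):
    """True if at least one element under the node matches the XPath."""
    return node.find(xpath) is not None

page = ET.fromstring(PAGE)                            # "current page"
loop_item = page.find(".//div[@class='product']")     # "current loop item"

print(contains_text(page, "Save"))                             # page contains text
print(contains_element(page, ".//div[@class='pb-savings']"))   # page contains element
print(contains_text(loop_item, "Laptop A"))                    # loop item contains text
print(contains_element(loop_item, ".//h2"))                    # loop item contains element
```

The same two checks serve both scopes; only the node they are applied to changes (the whole page or a single loop item).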
3) On a product item page that contains the on-sale element (select one from the loop), click the data fields you want to capture (learn how). Rename the fields if needed.
4) Drag the "Extract Data" action into the left branch
Octoparse is now configured to look for the element on each item page: if the element is found, it captures the desired data; otherwise, it skips the product.
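For readers who think in code, the configured workflow is roughly equivalent to the loop below. This is a minimal sketch under stated assumptions, not Octoparse's implementation: the two item pages, the field names, and the `scrape_on_sale` function are hypothetical, and the pages are written as well-formed XML so that the standard library can parse them. Only the XPath comes from the steps above.

```python
import xml.etree.ElementTree as ET

# Hypothetical item pages (assumed markup, kept well-formed for the XML parser).
PAGES = [
    "<html><body><h1>Laptop A</h1><span class='price'>$899</span>"
    "<div class='pb-savings'>Save $100</div></body></html>",
    "<html><body><h1>Laptop B</h1><span class='price'>$1,199</span></body></html>",
]

ON_SALE_XPATH = ".//div[@class='pb-savings']"  # XPath used in this tutorial

def scrape_on_sale(pages):
    """Collect name and price only from pages that contain the on-sale element."""
    rows = []
    for source in pages:                          # the loop over item pages
        page = ET.fromstring(source)
        if page.find(ON_SALE_XPATH) is not None:  # left branch: condition is met
            rows.append({
                "name": page.findtext(".//h1"),
                "price": page.findtext(".//span[@class='price']"),
            })
        # right branch: left blank, so the product is skipped
    return rows

print(scrape_on_sale(PAGES))
```

Only the first page contains the "pb-savings" element, so only that product ends up in the result.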
- If the condition tests whether an element is found, the XPath must match that element uniquely on the page; otherwise the judgment may fail.
- Octoparse goes through the branches from left to right by default. Always keep the condition you want to test in the left branch; if the left branch is set to "Do not judge", Octoparse will never reach the branch on the right, because "Do not judge" always evaluates to true.
- You can leave a branch blank if no data extraction is needed when the condition is not met.
- When data extraction actions are added to both branches, the number and the names of the data fields must be the same in both branches.
- You can use nested branch judgment to further refine the test.
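The left-to-right rule can be illustrated with a small sketch. It assumes, based on the note above, that the first branch whose condition holds is the one that runs; the function `run_branches` and the branch labels are hypothetical and exist only for this example.

```python
def run_branches(page_text, branches):
    """Return the label of the first branch whose condition holds, left to right."""
    for label, condition in branches:
        if condition(page_text):
            return label
    return None

def always(_):
    # Stands in for "Do not judge. Always execute the branch".
    return True

def on_sale(text):
    # Stands in for a "current page contains text" condition.
    return "On Sale" in text

correct = [("extract", on_sale), ("skip", always)]   # tested condition on the left
wrong = [("skip", always), ("extract", on_sale)]     # catch-all on the left

print(run_branches("Laptop A - On Sale", correct))
print(run_branches("Laptop A - On Sale", wrong))
```

With the catch-all on the left, the on-sale condition is never evaluated, which is why "Do not judge" belongs in the rightmost branch.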