Use Octoparse to Download Web Data Easily - User Guide
Monday, January 25, 2021
Octoparse is a modern visual web data extraction tool. Both experienced and inexperienced users will find it easy to extract information from websites in bulk. For most scraping tasks, no coding is needed.
Octoparse supports Windows XP, 7, 8, and 10. It works well for both static and dynamic websites, including pages that use Ajax. To export the data, you can choose from formats such as CSV, Excel, HTML, and TXT, or send it to a database (MySQL, SQL Server, and Oracle via API). Octoparse simulates human operation to interact with web pages.
Features such as filling out forms and entering a search term into a text box make extracting web data an easy process. You can run your extraction project either on your local machine (Local Extraction) or in the cloud (Cloud Extraction).
Some of our clients use Octoparse’s cloud service, which can extract and store large amounts of data to meet large-scale extraction needs.
The free and paid editions of Octoparse share some features. Paid editions allow users to extract large amounts of data around the clock using Octoparse’s cloud service. The prices of each plan can be viewed here.
Octoparse provides a visual operation pane that is user-friendly and straightforward. It simulates human web browsing behavior such as opening a web page, logging into an account, entering text, and pointing at and clicking web elements. Just click the information you want in the built-in browser and start the extraction, and you will get the structured data you need.
There are 2 extraction modes in Octoparse: Task Template and Advanced Mode. Getting started takes about half an hour, and people with programming experience will need less time to become familiar with the tool.
The most powerful feature of Octoparse is large-scale concurrent scraping based on distributed computing. After you upload your scraping project to the cloud, you can choose to perform the extraction concurrently on many cloud servers. If you need to scrape 10,000 web pages within a short time, the Octoparse cloud service is the best fit. The Standard Edition limits you to 10 cloud servers, though that still greatly speeds up data extraction. You can also set up a schedule for regular data extraction.
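To make the idea of concurrent extraction concrete, here is a minimal sketch of how 10,000 pages could be divided evenly among 10 servers. It is an illustration only, not a description of how Octoparse schedules work internally; the `partition` helper and the example.com URLs are hypothetical.

```python
def partition(urls, workers):
    """Split urls into `workers` near-equal batches, round-robin."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Slice i takes items i, i + workers, i + 2 * workers, ...
    return [urls[i::workers] for i in range(workers)]


# Hypothetical URL list; example.com is a placeholder domain.
urls = [f"https://example.com/page/{n}" for n in range(1, 10001)]

# Assumption: 10 workers, matching the Standard Edition limit mentioned above.
batches = partition(urls, 10)
print([len(batch) for batch in batches])
```

With an even split, each server handles one tenth of the pages, which is why adding servers shortens the total run time roughly in proportion.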
Advanced Mode provides a rich set of tools, including:
# RegEx Tool #
# XPath Tool #
# Database Auto Export Tool #
# API #
To improve the user experience, Octoparse provides a built-in RegEx generator. Refining scraped fields often requires regular expressions, and the RegEx Tool is well suited to both generating and verifying them.
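As an illustration of the kind of field refinement a regular expression performs, the sketch below pulls a numeric price out of free text. It uses Python's standard `re` module rather than the Octoparse RegEx Tool, and the sample strings and the `extract_price` helper are hypothetical.

```python
import re

# Hypothetical raw field values, as they might look straight after a scrape.
raw_prices = ["Price: $1,299.00 (incl. tax)", "Now only $45", "Sold out"]

# A dollar sign, then digits with optional commas, then an optional decimal part.
PRICE_RE = re.compile(r"\$(\d[\d,]*)(\.\d+)?")


def extract_price(text):
    """Return the first dollar amount in text as a float, or None if absent."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # drop thousands separators
    fraction = match.group(2) or ""
    return float(whole + fraction)


prices = [extract_price(value) for value in raw_prices]
print(prices)
```

Testing a pattern against a few real samples like this, including one that should not match, is the same verify step the RegEx Tool is meant to support.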
The Octoparse API makes it easy to connect your system to your extracted data in real time. You can either import the Octoparse data into your own database or use our API to request access to your own account’s data. Just configure the rule for your task, and Octoparse cloud servers will do the rest. Data is returned as XML.
To use the Octoparse Standard API, you will need to hold a Standard or Professional account with at least one runnable task set up. Documentation: http://dataapi.octoparse.com/help
To use the Octoparse Advanced API, you will need to hold a Professional account with at least one runnable task set up. Documentation: http://advancedapi.octoparse.com/help
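Because the API returns XML, the receiving side needs to turn that XML into records before loading them into a database. The sketch below shows one way to do this with Python's standard library. The payload is invented for illustration: the `rows`, `row`, `title`, and `price` element names are assumptions, not the actual Octoparse response schema, which is described in the documentation linked above.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: these element names are made up for this example
# and do not reflect the real Octoparse API response format.
SAMPLE_XML = """<rows>
  <row><title>Widget A</title><price>19.99</price></row>
  <row><title>Widget B</title><price>5.00</price></row>
</rows>"""


def rows_from_xml(xml_text):
    """Turn each <row> element into a dict mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in row}
        for row in root.iter("row")
    ]


records = rows_from_xml(SAMPLE_XML)
print(records)
```

Each resulting dict maps naturally onto one database row, so the same loop can feed an insert statement for whichever database you use.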
Does it ever drive you crazy that your IP address gets banned and you cannot access a website because you scrape it frequently? This happens especially when you extract data from business directories that apply strict anti-bot measures. Octoparse enables you to scrape these websites by rotating anonymous HTTP proxy servers. In Cloud Extraction, Octoparse uses many third-party proxies for automatic IP rotation. For Local Extraction, you can add a list of external proxy addresses manually and configure them for automatic rotation. Click here to learn how to include IP rotation in a scraping project.
IPs are rotated at the time interval you set. In this way, you can extract data from the website with less risk of getting your IP addresses banned.
Check out this video to learn how Octoparse helps you avoid being blacklisted or blocked when scraping websites.
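The idea of rotating through a proxy list at a fixed interval can be sketched as follows. This is a generic illustration, not Octoparse's implementation: the `ProxyRotator` class is hypothetical, and the proxy addresses come from a range reserved for documentation (203.0.113.0/24).

```python
import time


class ProxyRotator:
    """Pick a proxy from a fixed list, moving to the next one each interval."""

    def __init__(self, proxies, interval_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self._proxies = list(proxies)
        self._interval = interval_seconds
        self._clock = clock  # injectable so the rotation can be tested
        self._start = clock()

    def current(self):
        """Return the proxy that is active at this moment."""
        elapsed = self._clock() - self._start
        # Count whole intervals elapsed, wrapping around the end of the list.
        index = int(elapsed // self._interval) % len(self._proxies)
        return self._proxies[index]


# Placeholder addresses from a documentation-only IP range.
proxies = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
rotator = ProxyRotator(proxies, interval_seconds=30)
print(rotator.current())
```

A shorter interval spreads requests across more addresses in the same period, at the cost of cycling back to each proxy sooner.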
Author: The Octoparse Team