Data scraping, also known as screen scraping or frame grabbing, has been with us for some time.
“For some, it is about capturing (and some might say stealing) content, while for others, it’s a way of feeding output from one app as input to another,” said Greg Schulz, an analyst with StorageIO Group.
The technology came into prominence a few decades back as computing moved almost overnight from being the province of the few to being ubiquitous. That dramatic shift left data sitting on old, hard-to-access systems. Screen scrapers became popular as a way to interface with older apps that did not have export capabilities, capturing their output for use as input to the modern apps of the time.
Data scraping is a way of extracting data generated by another program. Its most common use is web scraping, whereby the scraper grabs information from a website.
While there are relatively benign applications, there is also a nefarious side. Tools exist to grab or steal protected content, including text, images and videos. Used this way, they can violate copyright and intellectual property (IP) laws.
Scraping of this kind is a way of getting around the fact that some companies try to keep their content from being downloaded or reused for unauthorized purposes. Perhaps they want users to register, subscribe or pay before they can gain full access to knowledge. Whatever the reason, these companies use access and permission controls, and decline to expose the data through an easily consumable API. Data scraping can circumvent such safeguards.
How data scraping is done
Web scraping is a fairly direct process when viewed at a high level. Code, generally in the form of a scraper bot, pulls the information. The bot sends a request to the website, parses the HTML document that comes back and converts the data it wants into a different format, such as a spreadsheet or JSON.
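A minimal sketch of the parse-and-convert steps, using only Python's standard library. The sample HTML, the LinkExtractor class and the scrape_links function are hypothetical names chosen for illustration. A real bot would first fetch the page over HTTP (for example with urllib.request.urlopen); here the document is supplied inline so the example runs without network access.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


def scrape_links(html):
    """Parse an HTML string and return its links as a JSON string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.links)


# Hypothetical page content standing in for a fetched response body.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/reports/q1">Q1 report</a></li>'
    '<li><a href="/reports/q2">Q2 <b>report</b></a></li>'
    '</ul>'
)

if __name__ == "__main__":
    print(scrape_links(SAMPLE_HTML))
```

The same pattern applies to prices, reviews or any other element: the parser picks out the tags of interest and the result is re-serialized in whatever format the downstream consumer expects.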
Over time, the game has grown more sophisticated. As scraper bots become successful, content protection strategies are beefed up to thwart their efforts. In turn, the bots respond by developing tactics to outmaneuver these new protection mechanisms — and so it goes.
For the scrapers, content may be obtained at little or no expense. Instead of having to write their own content, conduct research or collect customer reviews, for example, the scrapers may simply post material taken from other sites. They also avoid having to pay for certain reports and other documents.
Data scraping use cases
Content: Instead of writing original content, the operator of a scraper may replicate or repurpose what is on another site. Bots look for content that will improve a site's search engine optimization, for example.
Reviews: Sites such as Yelp and Airbnb go to great lengths to obtain customer reviews. Some scraper bots may capture such content and reproduce it on another site.
Pricing: Many vendors are leery about posting prices. If they post their prices publicly, competitors may undercut them. Hence, there is a specialized form of scraper that scours the web for price-related content.
Contacts: Marketing lives and dies on contacts. It needs good email addresses and phone numbers in order to accomplish its mission. Contact scrapers ransack websites for any contact data written in plain text. They go through employee directories, about us pages, contact pages, mailing lists and other locations.
Older applications: Some older apps are written in obscure computer languages, and their data isn't easy to access. Scraping tools are used to transform that data into a more manageable format.
“This offers a quick and easy way to add a GUI interface to an older app, until a rewrite, port, other modification could take place,” Schulz said.
Videos: Some videos on platforms like YouTube are built from scraped content. Scraped text is used for the voice-over, and images scraped from the same websites are used as the visuals.
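To illustrate why contact data written in plain text is so exposed, as described in the Contacts case above, here is a rough sketch of how little code is needed to pull email addresses out of a page's text. The function name, the sample text and the regular expression are assumptions made for this example; the pattern is a deliberate simplification and not a complete email address validator.

```python
import re

# Simplified pattern: local part, "@", domain, dot, alphabetic TLD.
# It will miss unusual but valid addresses and is for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    found = (match.lower() for match in EMAIL_PATTERN.findall(text))
    return list(dict.fromkeys(found))  # dict keys preserve insertion order


# Hypothetical text as it might appear on a contact page.
SAMPLE_TEXT = (
    "Reach sales at sales@example.com or support@example.org. "
    "Press enquiries: Sales@Example.com."
)

if __name__ == "__main__":
    print(extract_emails(SAMPLE_TEXT))
```

Site owners can run a check like this against their own pages to see what a contact scraper would find.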
Data scraping mitigation
How do you prevent scraping? One technique is limiting access rates. A human browsing a site does so at a certain rate. A bot can be several orders of magnitude faster. Therefore, one tactic is to limit the maximum number of requests made by one IP address in a given time period.
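A per-IP request limit can be sketched as a sliding-window counter. The RateLimiter class, its parameters and the example addresses are hypothetical; in practice this logic usually lives in a web server, proxy or firewall rather than in application code, and the thresholds depend on the site.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True if a request from ip at time now (seconds) is accepted."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = RateLimiter(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow("203.0.113.7", t))
```

Time is passed in explicitly so the behavior is easy to test; a live service would pass time.monotonic() instead.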
Another approach is to modify HTML markup regularly. Scrapers depend on stable element names, classes and page structure to locate data, so changing those elements hampers their efforts. Random shifts in content protection or code make it more complicated to extract the data.
Similarly, CAPTCHAs and other challenges can be used. This method poses questions that are simple for a human but tend to baffle the bots. Additionally, text embedded in images can be difficult to extract as it requires optical character recognition (OCR).
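One way to change markup regularly is to rename CSS classes on each build or deployment, so that a scraper written against last week's class names stops matching. The sketch below is an assumption about how that might be done, not a description of any particular product; rotate_class_names and the sample markup are hypothetical, and the regular expressions handle only simple double-quoted class attributes and plain class selectors.

```python
import re
import secrets


def rotate_class_names(html, css, class_names, token=None):
    """Rename the given classes consistently in an HTML and a CSS string."""
    token = token or secrets.token_hex(4)  # fresh random token per build
    mapping = {name: f"c{token}_{i}" for i, name in enumerate(class_names)}

    def swap_attr(match):
        # Rewrite each whitespace-separated class inside class="...".
        names = match.group(1).split()
        return 'class="' + " ".join(mapping.get(n, n) for n in names) + '"'

    def swap_selector(match):
        # Rewrite .name selectors; unknown names are left unchanged.
        return "." + mapping.get(match.group(1), match.group(1))

    new_html = re.sub(r'class="([^"]*)"', swap_attr, html)
    new_css = re.sub(r"\.([A-Za-z_][\w-]*)", swap_selector, css)
    return new_html, new_css, mapping


if __name__ == "__main__":
    page = '<span class="price bold">9.99</span><span class="sku">A1</span>'
    sheet = ".price { color: red; } .sku { display: none; }"
    print(rotate_class_names(page, sheet, ["price", "sku"]))
```

The page renders the same for human visitors because the HTML and the stylesheet change together, while a scraper looking for a "price" class no longer finds it.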
Data scraping tools
There are two sides to this market: tools that scrape and tools that protect against scraping.
These are some of the top providers:
- Nintex RPA
- Veryfi OCR API & SDK
- Astera ReportMiner
- Automate RPA