A lot of our products are built around collecting data. A useful technique for that is web scraping.
 
Web scraping is essentially a form of data mining. Pretty much any information that you can see on a web page can be retrieved (scraped).
 
We scrape a lot of data, and I am personally very knowledgeable about scraping because I have been doing it for years now.
 

Proxies

When you scrape, you need a proxy most of the time.
 
A proxy is (usually) a server that acts as an intermediary between you and the resources you request on the internet.
 
Proxies are typically used to browse the web anonymously.
 
When scraping, proxies are typically used to:
  • render some data that is only available to certain geographic locations or devices. You can make your requests seem like they are coming from a specific geographical region or device, which lets you see the content that the website displays for that location or device. For example, I might want to retrieve the price of a flight on a Brazilian site (see the sketch after this list).
  • bypass some limitations of the website you are browsing. Sometimes your machine’s IP could be blocked, and it will not be able to connect to the targeted site’s server.
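To make this concrete, here is a minimal sketch of sending a scraping request through a proxy with Python and the requests library. The proxy host, credentials, and target URL are placeholders I made up, not real endpoints; substitute whatever your provider gives you.

    import requests

    # Hypothetical proxy endpoint and credentials; replace with the values
    # your provider gives you. Many providers encode the exit country or
    # device type in the username, but check your provider's docs.
    proxy_url = "http://username:password@proxy.example.com:8080"

    proxies = {"http": proxy_url, "https": proxy_url}

    # The request leaves through the proxy, so the target site sees the
    # proxy's IP instead of yours.
    response = requests.get(
        "https://www.example.com.br/flights",  # placeholder target URL
        proxies=proxies,
        timeout=30,
    )
    print(response.status_code, len(response.text))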

Which proxies to use

There are hundreds of companies that offer proxies now. I’ve tested at least a dozen of them, and what I currently use is Luminati (aff link).
     
They are best in class when it comes to the quality of their proxies, their support, and their ease of implementation.
     
In fact, the feature that I like the most is the “Luminati Proxy Manager”. While some people use it to automate proxies with no code required, I use it for the custom settings I can apply on each machine I scrape from. You can set all sorts of things: auto-retry on specific status codes, timeouts or failures, automatic IP rotation based on your requirements, waterfall routing, and more.
     
Best of all, I can use regular expressions to filter specific URLs out of what I scrape and reduce traffic.
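As a rough illustration of how these two ideas fit together, here is a sketch that sends requests through a locally running proxy manager and uses a regular expression to skip URLs that are not worth fetching. The local port, the pattern, and the URLs are assumptions for the example, not Luminati’s actual configuration; check your own Proxy Manager setup for the real values.

    import re
    import requests

    # The Proxy Manager runs on your own machine and exposes a local HTTP
    # proxy port; 24000 is assumed here, use whatever port you configured.
    local_proxy = "http://127.0.0.1:24000"
    proxies = {"http": local_proxy, "https": local_proxy}

    # Only fetch URLs that match this pattern; everything else is skipped
    # before any proxy traffic is spent. The pattern is just an example.
    wanted = re.compile(r"/flights/|/prices/")

    urls = [
        "https://www.example.com/flights/GRU-LIS",
        "https://www.example.com/images/logo.png",  # filtered out
        "https://www.example.com/prices/today",
    ]

    for url in urls:
        if not wanted.search(url):
            continue  # saves bandwidth and proxy requests
        response = requests.get(url, proxies=proxies, timeout=30)
        print(url, response.status_code)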
     
If you are on a tight budget, they have shared IPs you can use (IPs used by multiple customers for an unlimited number of domains), but I usually use exclusive IPs, which guarantee that the IPs are only used by you for your target site.
     
Luminati also offers residential, static, and mobile IPs, so they have me covered for a large number of use cases.
     
Thanks for reading,
Mike Rubini
