Recently I found myself having to gather some metadata about a large number of URLs (title tag, text from the page, etc.).
This is typically very easy to do in Python using Requests and Beautiful Soup:
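A minimal sketch of that sequential approach might look like the following (the helper name, field names, and URLs here are placeholders, not the exact code):

import requests
from bs4 import BeautifulSoup

def get_page_metadata(url):
    # Fetch the page and parse it with Beautiful Soup.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Pull out the title tag and the visible text.
    title = soup.title.string if soup.title else None
    text = soup.get_text(separator=" ", strip=True)
    return {"url": url, "title": title, "text": text}

urls = ["https://example.com", "https://example.org"]  # placeholder URLs
results = [get_page_metadata(url) for url in urls]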
That does the job for most small tasks, but what if you have to grab info on hundreds of thousands, or even millions, of URLs? With that approach, each link is crawled one at a time, and the next one doesn't start until the previous one has finished. Why not crawl multiple links at once? That's where concurrency comes in.
There are a couple of widely used packages for making concurrent web requests with Python: grequests and requests-futures. Grequests was made by the same guy who made requests, but unfortunately looks to have been abandoned as of May 2015. Requests-futures takes advantage of the concurrent.futures package that was introduced in Python 3.2, but is also available as a backport for earlier versions (including Python 2.7) with a simple:
pip install futures
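For reference, requests-futures usage looks roughly like this (a sketch based on the package's documented FuturesSession API; the URLs are placeholders):

from requests_futures.sessions import FuturesSession

urls = ["https://example.com", "https://example.org"]  # placeholder URLs

# FuturesSession wraps requests in a thread pool; each .get() returns a Future.
session = FuturesSession(max_workers=10)
futures = [session.get(url, timeout=10) for url in urls]
# .result() blocks until that particular request has finished.
responses = [future.result() for future in futures]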
I tried both packages, but for my use case found them to be too limiting. Instead, I started with an example from the official Python docs and tailored it to my needs:
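Something along these lines, adapting the docs' ThreadPoolExecutor example to call the metadata function above (a sketch, not the exact code):

import concurrent.futures

urls = ["https://example.com", "https://example.org"]  # placeholder URLs
results = {}

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # Submit every URL up front; each submit() returns a Future immediately.
    future_to_url = {executor.submit(get_page_metadata, url): url for url in urls}
    # as_completed() yields futures as they finish, regardless of submission order.
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            results[url] = future.result()
        except Exception as exc:
            # Blanket catch: record the failure and keep crawling.
            results[url] = {"error": str(exc)}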
This runs the requests on multiple threads at once, with the number of threads configurable via the max_workers argument passed to the ThreadPoolExecutor object.
If getting all of the data is critically important to you, you'll probably want to handle specific exceptions (web requests can raise a number of different ones). For me, the goal was to get through as many URLs as possible as quickly as possible, with a reasonable (95%+) success rate, so I chose a blanket try/except.
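If you do need finer-grained handling, the blanket except in the loop above can be swapped for the specific exception classes Requests raises. For example (a hypothetical variation, with retry_later as an illustrative retry queue):

import requests

retry_later = []  # hypothetical queue of URLs worth a second attempt

for future in concurrent.futures.as_completed(future_to_url):
    url = future_to_url[future]
    try:
        results[url] = future.result()
    except requests.exceptions.Timeout:
        # Timeouts are often transient, so queue the URL for a retry.
        retry_later.append(url)
    except requests.exceptions.ConnectionError as exc:
        results[url] = {"error": "connection failed: %s" % exc}
    except Exception as exc:
        # Anything else still gets the blanket treatment.
        results[url] = {"error": str(exc)}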