Have you ever built a web scraper and wanted to download more than HTML files? In this post, we will discuss how to download images from the web using a byte stream.
The imports required
For this project, we will need the requests library to get the image file contents from a remote server and the os library to deal with file paths.
import requests
import os
How to structure file paths
When scraping images from the web, an issue of naming the images comes up. Three options first present themselves.
- Create a random string for the filename
- Strip the filename from the URL
- Recreate the original file path using the URL
Option 1 can often be a viable solution; however, you lose the original context of the image (i.e. the path and filename). Option 2 runs into the issue of multiple images sharing the same filename, so you have to deal with duplicates. Lastly, we have option 3. Option 3 solves the problems of options 1 and 2, although it places the scraped images in subfolders, which can make it harder to view multiple images at once.
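To make the trade-offs concrete, here is a minimal sketch of the three naming options. The function names and the example.com URLs are hypothetical, and the sketch uses urllib.parse from the standard library to split the URL, which is an assumption of this example rather than the approach used in the rest of the post.

```python
import os
import uuid
from urllib.parse import urlparse


def random_name(link):
    """Option 1: random filename that keeps only the URL's extension."""
    extension = os.path.splitext(urlparse(link).path)[1]
    return uuid.uuid4().hex + extension


def name_from_url(link):
    """Option 2: last path segment of the URL; different folders can collide."""
    return os.path.basename(urlparse(link).path)


def path_from_url(link):
    """Option 3: host name plus the full URL path, so the context is kept."""
    parts = urlparse(link)
    return parts.netloc + parts.path


# Hypothetical URLs: the same filename in two different folders
first = 'https://example.com/cats/photo.jpg'
second = 'https://example.com/dogs/photo.jpg'

print(name_from_url(first), name_from_url(second))  # the same name twice
print(path_from_url(first), path_from_url(second))  # two distinct paths
print(random_name(first))                           # different on every run
```

With option 2 the second image would overwrite the first (or need renaming), while option 3 keeps both apart at the cost of deeper folders.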
Option 3 can be implemented as follows:
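BASE_FOLDER = 'saves'
domain = 'https://<some domain>.com'
# link is the full image URL and must start with domain
# domain[8:] drops the leading 'https://' (8 characters)
filename = BASE_FOLDER + '/' + domain[8:] + link[len(domain):]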
All images are placed in a base folder and then in a first-level subfolder named after the domain (with https:// stripped). The rest of the link creates further subfolders and the final filename. To make sure the file can be saved, we use the os library to create any folders in the path that do not exist yet; with exist_ok=True, folders that already exist are left alone.
# make sure path exists
os.makedirs(os.path.dirname(filename), exist_ok=True)
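As a worked example of the path construction above, the following sketch runs the same slicing on a made-up link and creates the folders inside a temporary directory. The domain, the link, and the use of tempfile are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical values, used only for illustration
domain = 'https://example.com'
link = 'https://example.com/images/cats/tabby.jpg'

with tempfile.TemporaryDirectory() as tmp:
    base_folder = os.path.join(tmp, 'saves')
    # Same slicing as above: drop 'https://' from the domain,
    # then append the part of the link that follows the domain
    filename = base_folder + '/' + domain[8:] + link[len(domain):]
    relative = os.path.relpath(filename, tmp)

    # Create every missing folder; a second call is harmless with exist_ok=True
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    folder_created = os.path.isdir(os.path.dirname(filename))

print(relative)
print(folder_created)
```

Note that the slicing only works when the link really starts with the domain and the scheme is https://; a link using http:// would need a different offset.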
Retrieving the image data
To retrieve the image contents, we use the requests library.
# Grab the image content
img_data = requests.get(link).content
The content attribute contains the bytes needed to write the image file. This is the key part of downloading images in Python. Once we have the bytes, we can use the standard Python procedure to write a file using the wb access mode (i.e. write in binary mode).
# Write to file
with open(filename, 'wb') as f:
    f.write(img_data)
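To see why the binary mode matters, here is a small sketch that writes a few bytes in both modes. The 8-byte PNG signature stands in for real downloaded data, and the temporary directory and file name are assumptions for illustration.

```python
import os
import tempfile

# Stand-in for requests.get(link).content: the 8-byte PNG signature
img_data = b'\x89PNG\r\n\x1a\n'

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'example.png')

    # Text mode ('w') only accepts str, so writing bytes fails
    try:
        with open(filename, 'w') as f:
            f.write(img_data)
        text_mode_error = None
    except TypeError as error:
        text_mode_error = error

    # Binary mode ('wb') writes the bytes unchanged
    with open(filename, 'wb') as f:
        f.write(img_data)
    with open(filename, 'rb') as f:
        round_trip = f.read()

print(type(text_mode_error).__name__)
print(round_trip == img_data)
```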
Bringing it all together
import requests
import os

BASE_FOLDER = 'saves'

def save_image(link, domain=None, filename=None):
    """
    link = https://<some image URL>
    Saves in BASE_FOLDER unless filename is set
    """
    print('Downloading (img): ', link)
    # Build the file path based on the link
    if not filename:
        # The path is derived from domain, so one of the two is required
        if domain is None:
            raise ValueError('Pass either domain or filename')
        filename = BASE_FOLDER + '/' + domain[8:] + link[len(domain):]
    # Don't overwrite if the file already exists
    if os.path.exists(filename):
        return
    # make sure path exists
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    # Grab the image content
    img_data = requests.get(link).content
    # Write to file
    with open(filename, 'wb') as f:
        f.write(img_data)
We now have a concise method to handle our image downloads.
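The function above writes whatever the server returns, including error pages. A slightly more defensive download step might look like the sketch below. The name fetch_image_bytes, the timeout value, and the http_get parameter (a hook for trying the function without a network) are all assumptions of this example, not part of the code above.

```python
import requests


def fetch_image_bytes(link, timeout=10, http_get=requests.get):
    """Return the raw bytes behind link, or raise if the response is not usable.

    http_get is a hypothetical hook so the function can be tried offline.
    """
    response = http_get(link, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    # Servers normally label images with a Content-Type such as image/png
    content_type = response.headers.get('Content-Type', '')
    if not content_type.startswith('image/'):
        raise ValueError('Expected an image but got ' + repr(content_type))
    return response.content
```

Inside save_image, the line that grabs the image content could then call fetch_image_bytes(link) instead of using requests.get(link).content directly.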