Finding relevant images in a large set of files

I recently had to recover data (images, to be precise) from a 1 TB drive. I used photorec to find the data on the drive. It resulted in lots and lots of images spread across lots and lots of files. How do you find the relevant images within that amount of data? For initial filtering I focused on a single criterion: image dimensions. I wrote a little script to find all images, check their size, and symlink them into a folder hierarchy sorted by YEAR/MONTH/DAY. You invoke it by calling

./findimages.sh /some/directory

It will produce /some/directory/found with subfolders according to the date of each image. Note that you need the exiv2 binary to use this script.
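The script parses exiv2's plain summary output (running exiv2 on a file with no further arguments). The relevant lines look roughly like this; the values here are made up:

Image timestamp : 2017:01:15 12:34:56
Image size      : 4000 x 3000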

#!/bin/bash

export LC_ALL=C
DIR="${1}"
TARGET_DIR="${DIR}/found/"

symlink() {
	IMAGE="${1}"
	EXIV="${2}"
	# Extract the "Image timestamp" line, e.g. "2017:01:15 12:34:56"
	IMAGE_DATE=$(echo "${EXIV}" | grep "Image timestamp" | sed -e "s/Image timestamp\s*:\s*//")
	# Keep only the date part and turn "YYYY:MM:DD" into "YYYY/MM/DD"
	DATE_FOLDER=$(echo "${IMAGE_DATE}" | sed -e "s:\([0-9:]\+\) [0-9:]\+:\1:" | sed -e "s|:|/|g")
	mkdir -p "${TARGET_DIR}${DATE_FOLDER}"
	ln -s "$(realpath "${IMAGE}")" "${TARGET_DIR}${DATE_FOLDER}"
	sync
}

COUNT=$(find "${DIR}" -name '*.jpg' 2> /dev/null | wc -l)
CURRENT=1

find "${DIR}" -name '*.jpg' 2> /dev/null | while read -r IMAGE; do
	echo "${CURRENT}/${COUNT}"
	CURRENT=$((CURRENT + 1))
	EXIV="$(exiv2 "${IMAGE}" 2> /dev/null)"
	IMAGE_SIZE=$(echo "${EXIV}" | grep "Image size" | sed -e "s/Image size\s*:\s*//")
	if [[ ${IMAGE_SIZE} =~ ^([0-9]+)\ x\ ([0-9]+) ]]; then
		X=${BASH_REMATCH[1]}
		Y=${BASH_REMATCH[2]}
		# 786432 = 1024 * 768, i.e. skip anything below XGA resolution
		if [ $((X * Y)) -gt 786432 ]; then
			symlink "${IMAGE}" "${EXIV}"
		fi
	else
		# No parsable size; keep the image rather than risk losing it
		echo "${IMAGE}: ${IMAGE_SIZE} (${EXIV})"
		symlink "${IMAGE}" "${EXIV}"
	fi
done
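The dimension check can be tried in isolation. Here is a minimal sketch with the check wrapped in a hypothetical helper function, fed made-up "WIDTH x HEIGHT" strings instead of real exiv2 output:

```shell
#!/bin/bash
# Hypothetical helper: succeed if an image is big enough, given the
# "WIDTH x HEIGHT" string that exiv2 reports as "Image size"
big_enough() {
	if [[ ${1} =~ ^([0-9]+)\ x\ ([0-9]+) ]]; then
		# 786432 = 1024 * 768, i.e. skip anything below XGA resolution
		[ $((BASH_REMATCH[1] * BASH_REMATCH[2])) -gt 786432 ]
	else
		return 2
	fi
}

big_enough "4000 x 3000" && echo "keep"   # full-size camera photo
big_enough "160 x 120" || echo "skip"     # thumbnail-sized image
```

The threshold of 1024 x 768 worked well for separating camera photos from thumbnails and web graphics, but you may want to adjust it for your data.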

If you have suggestions on how to improve the script, or know of smarter alternatives, please let me know.

Copyright © christophbrill.de, 2002-2017.