Eric Bergman-Terrell's Blog

Python Programming Tip: Generate Hash for a File
February 14, 2021

I've written a file sync and verify program in Python. In order to verify that files are copied correctly from the source folder to the destination folder, the program generates hashes for each source and destination file, and compares the hashes, rather than the entire file contents. A hash, or message digest, is a relatively short string of bytes generated from a message. For my program, the messages are the contents of the files being verified after a sync.
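To make the idea concrete, here's a minimal sketch using the standard library's hashlib module. The two sample messages are made up for illustration. The points to notice are that the digest length is fixed, and that a one-character change in the message produces a different digest.

```python
import hashlib

# Two hypothetical messages that differ by a single character.
message_a = b'hello eric'
message_b = b'hello erik'

digest_a = hashlib.sha256(message_a).hexdigest()
digest_b = hashlib.sha256(message_b).hexdigest()

# A SHA256 digest is always 32 bytes (64 hex characters), whatever the message size.
print(len(digest_a), len(digest_b))

# Different messages produce different digests (barring a collision).
print(digest_a == digest_b)

# A much larger message still yields a digest of the same length.
large_digest = hashlib.sha256(b'x' * 1_000_000).hexdigest()
print(len(large_digest))
```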

Comparing hashes, rather than file contents, allows the program to generate the hashes in parallel, without thrashing disk drives. The program uses one thread to generate the hashes for files in the source folder, and another thread for the destination folder. If the source and destination folders are on different physical magnetic drives, each thread reads from its own drive, so there's no risk of thrashing (wasting time due to excessive seeking). If the drives are SSDs (solid-state drives), which don't seek, thrashing is a moot point.
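The one-thread-per-folder idea can be sketched roughly as follows. This is not the sync program's actual code: hash_file, hash_folder and find_mismatches are hypothetical names, and the sketch assumes both folders have the same relative layout. It uses ThreadPoolExecutor from the standard library with two workers, one per folder.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, block_size=65536):
    """Return the SHA256 hex digest of one file, read block by block."""
    hash_algorithm = hashlib.sha256()
    with open(path, 'rb') as file:
        for block in iter(lambda: file.read(block_size), b''):
            hash_algorithm.update(block)
    return hash_algorithm.hexdigest()


def hash_folder(root):
    """Hash every file under root, one after another; keys are paths relative to root."""
    hashes = {}
    for folder, _, file_names in os.walk(root):
        for file_name in file_names:
            full_path = os.path.join(folder, file_name)
            hashes[os.path.relpath(full_path, root)] = hash_file(full_path)
    return hashes


def find_mismatches(source_root, destination_root):
    """Hash both folders in parallel, one thread per folder, then compare."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        source_future = executor.submit(hash_folder, source_root)
        destination_future = executor.submit(hash_folder, destination_root)
        source_hashes = source_future.result()
        destination_hashes = destination_future.result()

    return sorted(path for path, file_hash in source_hashes.items()
                  if destination_hashes.get(path) != file_hash)


# Demo with two throwaway folders: one file matches, the other differs.
with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as destination:
    for root, second_content in ((source, b'original'), (destination, b'corrupted')):
        with open(os.path.join(root, 'same.txt'), 'wb') as file:
            file.write(b'hello eric')
        with open(os.path.join(root, 'different.txt'), 'wb') as file:
            file.write(second_content)

    mismatches = find_mismatches(source, destination)

print(mismatches)
```

Each thread only ever reads from its own folder, so two physical drives are never asked to serve interleaved reads from the same thread. Only the short digests are compared at the end.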

FileHash.create_file_hash (below) computes the hash. The Python with statement is used to open the file. It ensures that the file is closed after it's read, or after an exception is raised during reading. The with statement is a convenient and readable way to avoid leaking file handles. Use it!
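Here's a small sketch of that guarantee. The temporary file and the simulated RuntimeError are made up for the demonstration. The point is that the file object reports itself as closed even though the with block was exited by an exception.

```python
import os
import tempfile

# Create a throwaway file to read.
with tempfile.NamedTemporaryFile(delete=False) as temp_file:
    temp_file.write(b'hello eric')
    temp_path = temp_file.name

captured_file = None
try:
    with open(temp_path, 'rb') as file:
        captured_file = file
        file.read()
        # Simulate a failure part-way through processing.
        raise RuntimeError('simulated failure while reading')
except RuntimeError:
    pass

# The with statement closed the file even though an exception was raised.
print(captured_file.closed)

os.remove(temp_path)
```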

FileHash.create_file_hash reads the file, block by block, with the specified BLOCK_SIZE, until all content has been read. As each block is read, a hash algorithm object is updated. When all blocks have been read, the hash algorithm generates the hash with the hexdigest() method.

Your installed version of Python will support multiple hash algorithms, which differ in their performance characteristics and in the frequency of collisions. A collision occurs when a hash algorithm returns the same hash for multiple messages. Since the hashes are usually shorter than the messages, collisions are a possibility for any hash algorithm. For my program, I chose the SHA256 algorithm, which provides decent performance and a low enough risk of collisions. Note: If you are using hashing for security purposes, the risk of collisions can be very serious, and the hash algorithm must be selected with extreme care!
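To see what your own installation offers, something like the following sketch should work. It relies only on hashlib.algorithms_guaranteed, hashlib.new() and the digest_size attribute. The three algorithm names are just a sample chosen for comparison.

```python
import hashlib

# Algorithms that hashlib guarantees on every platform.
print(sorted(hashlib.algorithms_guaranteed))

# Digest sizes, in bytes, for a few common algorithms. A longer digest
# means more possible hash values, so accidental collisions are less likely.
digest_sizes = {name: hashlib.new(name).digest_size
                for name in ('sha1', 'sha256', 'sha512')}

for name, size in digest_sizes.items():
    print(name, size)
```

A longer digest is not automatically slower or faster on a given machine, so it's worth timing the candidates on your own files before choosing.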

Without further ado, here's the source code:

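import hashlib

from tenacity import retry, stop_after_attempt, wait_fixed

# app_globals and Constants come from the sync program's own modules (imports not shown).


def before_callback(retry_state):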
    if retry_state.attempt_number > 1:
        app_globals.log.print(f'***** FileHash.create_file_hash: attempt_number: {retry_state.attempt_number} file path: {retry_state.args[1]}')


ERROR_MARKER = '<<ERROR>>'


def return_error_marker(retry_state):
    app_globals.log.print(
        f'***** FileHash.create_file_hash: failing after retries. attempt_number: {retry_state.attempt_number} File path: {retry_state.args[1]}')
    return ERROR_MARKER


class FileHash:
    BLOCK_SIZE = 65536

    @retry(wait=wait_fixed(Constants.RETRY_WAIT), stop=stop_after_attempt(Constants.MAX_RETRIES),
           before=before_callback, retry_error_callback=return_error_marker)
    def create_file_hash(self, path):
        try:
            with open(path, 'rb') as file:
                hash_algorithm = FileHash._get_hash_algorithm()

                while True:
                    file_bytes = file.read(self.BLOCK_SIZE)

                    if len(file_bytes) == 0:
                        return hash_algorithm.hexdigest()

                    hash_algorithm.update(file_bytes)

        except OSError as os_error:
            app_globals.log.print(
                f'***** FileHash.create_file_hash OSError cannot read file: {path} error: {os_error} *****')
            # winerror only exists on Windows, so fall back to None elsewhere.
            app_globals.log.print(f'\terrno: {os_error.errno} winerror: {getattr(os_error, "winerror", None)} strerror: {os_error.strerror} filename: {os_error.filename}')

            raise

    @staticmethod
    def _get_hash_algorithm():
        return hashlib.sha256()

The above code leverages the tenacity library to retry file reads. My program needs retries as the files may reside on network drives, and reads may fail if the network is saturated. I covered retries previously, in this blog post.

Manual testing of the hashing code was easy. I ran the Linux sha256sum command on a file, and compared its output with the result of FileHash.create_file_hash() for the same file. Automating this test was trivial (see TestFileHash.test_hash_with_arbitrary_file):

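import hashlib
import os
import unittest

# BaseTest and FileHash come from the sync program's own modules (imports not shown).


class TestFileHash(BaseTest):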
    def test_hash(self):
        root = self.get_temp_folder()
        file_path = 'folders/destination/file1.txt'
        full_file_path = os.path.join(root, file_path)

        file_hash = FileHash().create_file_hash(full_file_path)

        hash_algorithm = hashlib.sha256()
        hash_algorithm.update(b'hello eric')
        expected_result = hash_algorithm.hexdigest()

        self.assertEqual(expected_result, file_hash)

    @unittest.skip('integration test')
    def test_hash_with_arbitrary_file(self):
        full_file_path = 'C:\\Temp\\11.flac'

        # From Linux sha256sum command
        expected_hash = 'f1d7f41a0621e3a5c92fc68e98f235c63ed80be9690885df86cb2024770d46e0'

        actual_hash = FileHash().create_file_hash(full_file_path)

        self.assertEqual(expected_hash, actual_hash)
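The tests above depend on BaseTest and on files that exist on my machine. As a rough, self-contained alternative, the sketch below writes its own temporary files and checks that block-by-block hashing agrees with hashing the whole content in one call. hash_file is a hypothetical stand-in for FileHash.create_file_hash with the retry logic left out, and TestHashFile is likewise an illustrative name.

```python
import hashlib
import os
import tempfile
import unittest

BLOCK_SIZE = 65536


def hash_file(path):
    """Hypothetical stand-in for FileHash.create_file_hash, without retries."""
    hash_algorithm = hashlib.sha256()
    with open(path, 'rb') as file:
        while True:
            file_bytes = file.read(BLOCK_SIZE)
            if len(file_bytes) == 0:
                return hash_algorithm.hexdigest()
            hash_algorithm.update(file_bytes)


class TestHashFile(unittest.TestCase):
    def _hash_of_temp_file(self, content):
        # Write content to a throwaway file and hash it block by block.
        with tempfile.TemporaryDirectory() as folder:
            path = os.path.join(folder, 'sample.bin')
            with open(path, 'wb') as file:
                file.write(content)
            return hash_file(path)

    def test_multi_block_file(self):
        # Two and a half blocks, so the read loop runs more than once.
        content = os.urandom(BLOCK_SIZE * 2 + BLOCK_SIZE // 2)
        self.assertEqual(hashlib.sha256(content).hexdigest(),
                         self._hash_of_temp_file(content))

    def test_empty_file(self):
        # An empty file never enters update(), but still has a valid digest.
        self.assertEqual(hashlib.sha256(b'').hexdigest(),
                         self._hash_of_temp_file(b''))
```

Saved to a file, this can be run with python -m unittest, and it needs no fixtures outside the temporary folder it creates.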

My collision rate is better than MD5, but not as good as SHA256.

Keywords: Python, Hash, Message Digest, Hash Algorithm, SSD, magnetic disk drives, thrashing, with statement, tenacity, retry, SHA256, MD5
