04 April 2024

Checking if two video files are the same aside from metadata

I wanted to make sure that two video files were identical apart from their metadata, so I wrote a script to check. The script hashes the audio and video data of each file, leaving out the header data, so matching hashes indicate matching streams.

Caveats:

  • Hash collisions are possible, so double-check the results manually before deleting anything
  • The script doesn't flag videos as duplicates if they show the same content but differ in resolution, bitrate, audio data, etc.
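
For the manual double check mentioned in the first caveat, one option is to compare the remuxed output of two candidate files byte for byte instead of trusting the hash. The following is a minimal sketch, not part of the original script: the names remux_command, read_chunks and outputs_identical are hypothetical, and the ffmpeg arguments simply mirror the ones used in the scripts in this post.

```python
import subprocess
from itertools import zip_longest


def remux_command(path):
    """Hypothetical helper: the ffmpeg call from this post as an argument list."""
    return ['ffmpeg', '-loglevel', 'error', '-i', path,
            '-fflags', '+bitexact', '-flags:v', '+bitexact', '-flags:a', '+bitexact',
            '-c', 'copy', '-f', 'matroska', '-']


def read_chunks(stream, size):
    """Yield blocks of `size` bytes from a buffered stream; the last may be shorter."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk


def outputs_identical(cmd_a, cmd_b, chunk_size=1 << 20):
    """Return True only if both commands succeed and write the same bytes to stdout."""
    with subprocess.Popen(cmd_a, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL) as proc_a:
        with subprocess.Popen(cmd_b, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL) as proc_b:
            chunks_a = read_chunks(proc_a.stdout, chunk_size)
            chunks_b = read_chunks(proc_b.stdout, chunk_size)
            # zip_longest pads the shorter stream with None, so a length
            # difference also counts as a mismatch.
            same = all(a == b for a, b in zip_longest(chunks_a, chunks_b))
            if not same:
                # Stop both processes early; the rest of the output is irrelevant.
                proc_a.kill()
                proc_b.kill()
    return same and proc_a.returncode == 0 and proc_b.returncode == 0
```

Assuming ffmpeg is installed, outputs_identical(remux_command('file1.m4v'), remux_command('file2.m4v')) would confirm or reject a pair that the hash comparison reported as duplicates.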


Requirements

  • ffmpeg
  • md5sum
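
Both tools need to be on the PATH before either script will work. As a small sketch (the function name missing_tools is an assumption made for illustration, not something from the scripts below), the requirements can be checked up front with the standard library:

```python
import shutil


def missing_tools(tools=('ffmpeg', 'md5sum')):
    """Return the names of the required tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print('Missing required tools: ' + ', '.join(missing))
```

Nothing is printed when both tools are available.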


Shell Version

Initial shell script to compare two files:

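#!/bin/sh
# Print a hash of the audio and video streams of two files for comparison.
file1="$1"
file2="$2"

# bitexact asks ffmpeg to write only platform-, build- and time-independent data
flags="-fflags +bitexact -flags:v +bitexact -flags:a +bitexact"

# Remux without re-encoding (-c copy) to Matroska on stdout and hash that stream.
# $flags is left unquoted on purpose so that it splits into separate arguments.
file1hash=$(ffmpeg -i "$file1" $flags -c copy -f matroska -loglevel error - | md5sum | cut -f1 -d" ")
file2hash=$(ffmpeg -i "$file2" $flags -c copy -f matroska -loglevel error - | md5sum | cut -f1 -d" ")

echo "$file1hash $file1"
echo "$file2hash $file2"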


Example Shell

$ ./diffvideo.sh file1.m4v file2.m4v
cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX file1.m4v
cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX file2.m4v


Python Version

Expanded Python script to compare more than two files:

#!/usr/bin/env python3

import argparse
import shlex
import subprocess


def setup_cli():
    parser = argparse.ArgumentParser(
        description='Group video files whose audio and video streams are identical.',
    )
    parser.add_argument('filenames', nargs='*', help='files to compare')
    return parser


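def check_files(filenames):
    # Maps each stream hash to the files that produced it. Files that ffmpeg
    # could not process are collected under 'not_a_video'.
    hashes = {'not_a_video': []}
    flags = '-fflags +bitexact -flags:v +bitexact -flags:a +bitexact'
    for file in filenames:
        # Build the argument list directly so that filenames containing
        # quotes or spaces reach ffmpeg unchanged.
        cmd = ['ffmpeg', '-i', file, *shlex.split(flags), '-c', 'copy', '-f', 'matroska', '-']
        ff_proc = subprocess.run(cmd, capture_output=True)
        if ff_proc.returncode != 0:
            hashes['not_a_video'].append(file)
            continue
        # Hash the remuxed stream with md5sum and keep only the digest.
        hash_proc = subprocess.run('md5sum', capture_output=True, input=ff_proc.stdout)
        filehash = hash_proc.stdout.decode().split()[0]
        hashes.setdefault(filehash, []).append(file)

    return hashes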


def print_results(hashes):
    not_a_video = hashes.pop('not_a_video')
    singles = {}
    dupes = {}
    for key, value in hashes.items():
        if len(value) > 1:
            dupes[key] = value
        else:
            singles[key] = value

    if not_a_video:
        print('\nNot a video:')
    for each in not_a_video:
        print(f'    {each}')

    if dupes:
        print('\nDuplicates found:')
    for key, value in dupes.items():
        print(f'    {key}')
        for v in value:
            print(f'        {v}')

    if singles:
        print('\nNo duplicates for these files:')
    for key, value in singles.items():
        print(f'    {value[0]}')


if __name__ == "__main__":
    parser = setup_cli()
    args = parser.parse_args()
    hashes = check_files(args.filenames)
    print_results(hashes)
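
The script above captures the whole remuxed stream in memory before passing it to md5sum, which can be a lot of data for large videos. A possible variation, offered as a sketch under assumptions rather than as the original method, is to hash ffmpeg's output in chunks with Python's hashlib. The names hash_command_output and remux_command are hypothetical; the ffmpeg arguments are the same as in the script above.

```python
import hashlib
import subprocess


def remux_command(path):
    """Hypothetical helper: the ffmpeg call from this post as an argument list."""
    return ['ffmpeg', '-i', path,
            '-fflags', '+bitexact', '-flags:v', '+bitexact', '-flags:a', '+bitexact',
            '-c', 'copy', '-f', 'matroska', '-']


def hash_command_output(cmd, algorithm='md5', chunk_size=1 << 20):
    """Run cmd and return the hex digest of its stdout, or None if it fails."""
    digest = hashlib.new(algorithm)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL) as proc:
        # Read fixed-size chunks until EOF so memory use stays bounded.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b''):
            digest.update(chunk)
    # Leaving the with block waits for the process, so returncode is set here.
    return digest.hexdigest() if proc.returncode == 0 else None
```

In check_files, filehash could then come from hash_command_output(remux_command(file)), with a None result treated like the 'not_a_video' case. Passing algorithm='sha256' would switch to a longer digest, which also drops the md5sum requirement.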


Example Python

$ ./diffvideo.py *

Not a video:
    test.txt
    file.docx

Duplicates found:
    cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX
        file1.m4v
        file2.m4v
    ab76XXXXXXXXXXXXXXXXXXXXXXXXXXXX
        file5.m4v
        file6.m4v

No duplicates for these files:
    file3.m4v
    file4.m4v


Appendix

Sources
