04 April 2024

Checking if two video files are the same aside from metadata

I wanted to confirm that two video files were identical apart from their metadata, so I wrote a script to help. It uses ffmpeg to copy the audio and video streams of each file without re-encoding them, then hashes the result so that differences in the header data do not affect the hash.

Caveats:

  • Hash collisions are possible, so double-check the results manually before deleting anything
  • The script does not flag videos as duplicates when they show the same content but differ in resolution, bitrate, audio data, etc.
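
One way to make the collision caveat less of a concern is to group files by a longer digest such as SHA-256 instead of MD5. The following is a minimal sketch, not part of the original scripts: the function name `group_by_sha256` is hypothetical, and it assumes the remuxed bytes for each file have already been captured (for example from ffmpeg's stdout).

```python
import hashlib
from collections import defaultdict


def group_by_sha256(payloads):
    """Group names whose byte payloads share a SHA-256 digest.

    `payloads` maps a file name to the bytes produced by remuxing it.
    The name and the input shape are assumptions made for this sketch.
    """
    groups = defaultdict(list)
    for name, data in payloads.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Stand-in byte strings; real input would be remuxed video data.
    demo = {
        "a.m4v": b"same stream",
        "b.m4v": b"same stream",
        "c.m4v": b"other stream",
    }
    for digest, names in group_by_sha256(demo).items():
        print(digest[:8], names)
```

Even with a stronger digest, a manual check before deleting anything is still a sensible precaution.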


Requirements

  • ffmpeg
  • md5sum


Shell Version

Initial shell script to compare two files:

#!/bin/sh
file1="$1"
file2="$2"

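# bitexact keeps ffmpeg from writing build- and time-dependent data into the output
flags="-fflags +bitexact -flags:v +bitexact -flags:a +bitexact"

# $flags is deliberately unquoted so the shell splits it into separate arguments
file1hash=$(ffmpeg -i "$file1" $flags -c copy -f matroska -loglevel error - | md5sum | cut -f1 -d" ")
file2hash=$(ffmpeg -i "$file2" $flags -c copy -f matroska -loglevel error - | md5sum | cut -f1 -d" ")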

echo "$file1hash $file1"
echo "$file2hash $file2"


Example Shell

$ ./diffvideo.sh file1.m4v file2.m4v
cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX file1.m4v
cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX file2.m4v
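
As an alternative to piping a Matroska remux into md5sum, ffmpeg has a `hash` muxer that prints a digest of the packets it receives. The sketch below is a hedged illustration rather than the method used in this post: the helper names are hypothetical, and the `ALGO=hexdigest` output format is an assumption worth verifying against your ffmpeg build. Digests produced this way will not match the ones from the shell script above, because the hashed bytes are different.

```python
import subprocess


def parse_hash_line(line):
    """Split a line such as 'SHA256=ab12...' into (algorithm, hex digest)."""
    algorithm, _, digest = line.strip().partition("=")
    return algorithm, digest


def ffmpeg_stream_hash(path, algorithm="sha256"):
    """Ask ffmpeg's hash muxer for a digest of the copied streams.

    Returns None if ffmpeg exits with an error (hypothetical helper).
    """
    cmd = [
        "ffmpeg", "-nostdin", "-loglevel", "error", "-i", path,
        "-c", "copy", "-f", "hash", "-hash", algorithm, "-",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    return parse_hash_line(proc.stdout)[1]


if __name__ == "__main__":
    # Parsing demo only; calling ffmpeg_stream_hash needs ffmpeg installed.
    print(parse_hash_line("SHA256=0123abcd\n"))
```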


Python Version

Expanded Python script to compare more than two files:

#!/usr/bin/env python3

import argparse
import subprocess


def setup_cli():
    parser = argparse.ArgumentParser(
        description='Group video files whose audio and video data are identical.',
    )
    parser.add_argument('filenames', nargs='*')
    return parser


def check_files(filenames):
    # Map each hash to the files that produced it; files that ffmpeg
    # cannot process are collected under 'not_a_video'.
    hashes = {'not_a_video': []}
    flags = ['-fflags', '+bitexact', '-flags:v', '+bitexact', '-flags:a', '+bitexact']
    for file in filenames:
        # Pass the arguments as a list so file names containing quotes
        # or spaces reach ffmpeg unchanged.
        cmd = ['ffmpeg', '-nostdin', '-i', file, *flags, '-c', 'copy', '-f', 'matroska', '-']
        ff_proc = subprocess.run(cmd, capture_output=True)
        if ff_proc.returncode != 0:
            hashes['not_a_video'].append(file)
            continue
        hash_proc = subprocess.run('md5sum', capture_output=True, input=ff_proc.stdout)
        filehash = hash_proc.stdout.decode().split()[0]
        hashes.setdefault(filehash, []).append(file)

    return hashes


def print_results(hashes):
    not_a_video = hashes.pop('not_a_video')
    singles = {}
    dupes = {}
    for key, value in hashes.items():
        if len(value) > 1:
            dupes[key] = value
        else:
            singles[key] = value

    if not_a_video:
        print('\nNot a video:')
    for each in not_a_video:
        print(f'    {each}')

    if dupes:
        print('\nDuplicates found:')
    for key, value in dupes.items():
        print(f'    {key}')
        for v in value:
            print(f'        {v}')

    if singles:
        print('\nNo duplicates for these files:')
    for key, value in singles.items():
        print(f'    {value[0]}')


if __name__ == "__main__":
    parser = setup_cli()
    args = parser.parse_args()
    hashes = check_files(args.filenames)
    print_results(hashes)


Example Python

$ ./diffvideo.py *

Not a video:
    test.txt
    file.docx

Duplicates found:
    cb31XXXXXXXXXXXXXXXXXXXXXXXXXXXX
        file1.m4v
        file2.m4v
    ab76XXXXXXXXXXXXXXXXXXXXXXXXXXXX
        file5.m4v
        file6.m4v

No duplicates for these files:
    file3.m4v
    file4.m4v
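
The Python version above captures ffmpeg's entire output in memory before hashing it, which can be a lot for large videos. A possible refinement, sketched here under assumptions, is to read the output in chunks and feed it to `hashlib` as it arrives. The helper names `hash_stream` and `hash_video` are hypothetical, as is the added `-nostdin` option. Because the hashed bytes are the same, the MD5 digest should match what md5sum reports for the same ffmpeg output.

```python
import hashlib
import io
import subprocess


def hash_stream(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_video(path):
    """Remux `path` with ffmpeg and hash the output as it is produced.

    Returns None if ffmpeg exits with an error.
    """
    cmd = [
        "ffmpeg", "-nostdin", "-loglevel", "error", "-i", path,
        "-fflags", "+bitexact", "-flags:v", "+bitexact", "-flags:a", "+bitexact",
        "-c", "copy", "-f", "matroska", "-",
    ]
    # Leaving the `with` block closes the pipe and waits for ffmpeg to exit.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL) as proc:
        filehash = hash_stream(proc.stdout)
    return filehash if proc.returncode == 0 else None


if __name__ == "__main__":
    # Demo on an in-memory stream; hash_video needs ffmpeg installed.
    print(hash_stream(io.BytesIO(b"example bytes"), chunk_size=4))
```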

