De-Duping Files With PowerShell


I often have to do little one-off tasks that could be done manually, but are almost always better done with a small utility application or script. I don’t have a very strong scripting background, so I’ve been trying to force myself to use (and therefore get more comfortable with) PowerShell whenever possible. This week I had a perfect little one-off task that PowerShell was a great fit for: de-duping a set of files.

I had a set of ~1000 files in a directory. They each had unique names, but some of them had duplicate contents. I needed to whittle that down to a set of completely unique files, removing any that were duplicates. I wasn’t able to reliably determine which files had duplicate contents based on file metadata like size or date modified. The approach I wanted to implement was to loop over all of the files and for each one:

  • Compute a hash of the file contents.
  • If I hadn’t yet seen that hash value, add it to a hash table along with the full name of the file.
  • If I had already seen that hash value, do nothing and go on to the next file.
  • After all files had been looked at, copy the files that were added to the hash table to another folder, which would represent the set of unique files.

Simple, right? This would be very easy for me to bang out in any .NET language, but I thought it was a good opportunity to use PowerShell and maybe learn something new.

The PowerShell code needed to iterate through a list of files in a folder is pretty straightforward, and my hash table needs were easily met by using the .NET Dictionary<TKey, TValue> type. The only tricky part was calculating the hash of each file’s contents for comparison purposes. Luckily for me the PowerShell Community Extensions (PSCX) has a Get-Hash cmdlet that makes this easy. Here’s the full script:

Import-Module pscx

$inputFolder = "C:\input"
$outputFolder = "C:\output"

$lookupTable = New-Object 'System.Collections.Generic.Dictionary[String,String]'

# Sorting by name makes it deterministic which copy of a duplicate is kept
Get-ChildItem $inputFolder | Sort-Object -Property "Name" | Foreach-Object {
    $currentFileHash = Get-Hash $_.FullName

    Write-Host $currentFileHash
    if($lookupTable.ContainsKey($currentFileHash)){
        return # Hash already seen: skip this file ('return' plays the role of 'continue' here)
    }
    $lookupTable.Add($currentFileHash, $_.FullName)
}

# Copy one file per distinct hash; the output folder is assumed to already exist
$lookupTable.GetEnumerator() | Foreach-Object {
    Copy-Item $_.Value $outputFolder
}
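
As an aside on the hashing step: if PSCX isn’t installed, the built-in Get-FileHash cmdlet (which, as far as I know, ships with PowerShell 4.0 and later) should be able to stand in for Get-Hash. The following is only a minimal sketch under that assumption, not the script I actually ran. The function name Get-UniqueFile is made up for illustration, and the -File switch assumes PowerShell 3.0 or later.

```powershell
# Sketch only: the same de-duping idea using the built-in Get-FileHash
# instead of PSCX's Get-Hash. Get-UniqueFile is a hypothetical helper name.
function Get-UniqueFile {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    $seen = New-Object 'System.Collections.Generic.Dictionary[String,String]'

    # -File skips subdirectories, which cannot be hashed
    Get-ChildItem -Path $Path -File | Sort-Object -Property Name | ForEach-Object {
        # Get-FileHash uses SHA256 by default; .Hash is the hex string
        $hash = (Get-FileHash -LiteralPath $_.FullName).Hash
        if (-not $seen.ContainsKey($hash)) {
            $seen.Add($hash, $_.FullName)
        }
    }

    # Emit the full path of the first file seen for each distinct hash
    $seen.Values
}
```

With that in place, something like `Get-UniqueFile -Path "C:\input" | Copy-Item -Destination "C:\output"` should do the copy step, again assuming the output folder already exists.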

The only part of the full script that made me scratch my head a bit was needing to use ‘return’ where I would normally use ‘continue’ in C#. I still don’t completely grok flow control statements in PowerShell; I always seem to get this wrong on the first try. My understanding is that Foreach-Object is a cmdlet rather than a loop statement, and it invokes the script block once for each item coming down the pipeline. Returning from that script block just ends processing of the current item, and control moves on to the next one.
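
To make the ‘return’ versus ‘continue’ distinction concrete, here is a small sketch that skips even numbers two ways. The variable names are just for illustration. As I understand it, ‘continue’ inside a Foreach-Object script block goes looking for an enclosing loop statement, and when there isn’t one it can stop the whole pipeline, which is why it isn’t used in the first half.

```powershell
# Inside ForEach-Object, 'return' ends the script block for the current item only,
# so the pipeline carries on with the next item.
$viaReturn = 1..5 | ForEach-Object {
    if ($_ % 2 -eq 0) { return }    # skip even numbers
    $_
}

# In the foreach statement (a real loop), 'continue' does the same job.
$viaContinue = foreach ($n in 1..5) {
    if ($n % 2 -eq 0) { continue }  # skip even numbers
    $n
}

# Both variables now hold the odd numbers 1, 3, 5
```

So the rule of thumb I’m settling on: ‘return’ inside Foreach-Object, ‘continue’ inside a foreach statement.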
