Windows: Find and eliminate Duplicate Files with PowerShell

We live in a big data world, which is both a blessing and a curse. Big data usually means a huge number of files, such as photos and videos, and ultimately a huge amount of storage space. Files are copied from location to location, accidentally or deliberately, without anyone considering that these duplicates consume more and more storage space. In this blog post we change that together: we search for duplicate files and then move them to a different storage location for further review.

The Goal

With my script in hand you can carry out the scenario described above. Make sure your computer runs Windows PowerShell 5.1 or PowerShell 7.

  1. Open PowerShell (Windows Key + X + A)
  2. Navigate to the script location and run the script. Enter the full path to the folder you want to search for duplicate files.
  3. A window pops up that lists the duplicate files by hash value. Select the files you want to move; all selected files will be moved to C:\Duplicates_<current date>.
  4. The selected duplicates are then moved to the new location, and a second window opens that shows the moved files for further review.

Which brings me to the code.

The Script

Here is the code for download.

find_duplicate_files.ps1

And here is the code in full length. Copy the code to your local computer and open it in PowerShell ISE, Visual Studio Code or an editor of your choice. Hit F5 (PowerShell ISE or VS Code).


# .SYNOPSIS
# find_duplicate_files.ps1 finds duplicate files based on hash values.

# .DESCRIPTION
# Prompts for a file path. Shows duplicate files for selection.
# Selected files will be moved to the new folder C:\Duplicates_Date for further review.

# .EXAMPLE
# Open PowerShell. Navigate to the file location. Type .\find_duplicate_files.ps1 OR
# Open PowerShell ISE. Open find_duplicate_files.ps1 and hit F5.

# .NOTES
# Author: Patrick Gruenauer | Microsoft MVP on PowerShell [2018-2020]
# Web: https://sid-500.com

############# Find Duplicate Files based on Hash Value ###############
''
$filepath = Read-Host 'Enter file path for searching duplicate files (e.g. C:\Temp, C:\)'

If (Test-Path $filepath) {
    ''
    Write-Warning 'Searching for duplicates ... Please wait ...'

    # Hash every file below $filepath and keep only hashes that occur more than once
    $duplicates = Get-ChildItem $filepath -File -Recurse -ErrorAction SilentlyContinue |
        Get-FileHash |
        Group-Object -Property Hash |
        Where-Object Count -GT 1

    If ($duplicates.count -lt 1) {
        Write-Warning 'No duplicates found.'
    }
    else {
        Write-Warning "Duplicates found."
        $result = foreach ($d in $duplicates) {
            $d.Group | Select-Object -Property Path, Hash
        }

        # Hyphens instead of slashes: a date separator of "/" would end up as a path separator
        $date = Get-Date -Format "MM-dd-yyyy"
        # Build the target folder once so that every message and command uses the same path
        $destination = "$env:SystemDrive\Duplicates_$date"

        $itemstomove = $result |
            Out-GridView -Title "Select files (CTRL for multiple) and press OK. Selected files will be moved to $destination" -PassThru

        If ($itemstomove) {
            New-Item -ItemType Directory -Path $destination -Force | Out-Null
            # -LiteralPath keeps file names with wildcard characters such as [ ] intact
            Move-Item -LiteralPath $itemstomove.Path -Destination $destination -Force
            ''
            Write-Warning "Mission accomplished. Selected files moved to $destination"

            Start-Process $destination
        }
        else {
            Write-Warning "Operation aborted. No files selected."
        }
    }
}
else {
    Write-Warning "Folder not found. Use full path to directory e.g. C:\photos\patrick"
}
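
The script lists every copy of a duplicated file, so selecting all rows in the window would move every copy, including the last remaining one. The following is a minimal sketch of a variant that returns only the redundant copies, that is, every file in a hash group except the first. The function name Get-RedundantCopy and the scratch folder are assumptions made for illustration; they are not part of the script above.

```powershell
# Hypothetical helper (not part of the original script): returns every copy
# of a duplicated file except the first one in each hash group.
function Get-RedundantCopy {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    Get-ChildItem -LiteralPath $Path -File -Recurse -ErrorAction SilentlyContinue |
        Get-FileHash |
        Group-Object -Property Hash |
        Where-Object Count -GT 1 |
        ForEach-Object {
            # Sort by path for a stable choice of the copy to keep, then skip that copy
            $_.Group | Sort-Object -Property Path | Select-Object -Skip 1 -Property Path, Hash
        }
}

# Demo: three identical files and one unique file in a scratch folder
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("dupdemo_" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demo | Out-Null
'same content'  | Set-Content -Path (Join-Path $demo 'a.txt')
'same content'  | Set-Content -Path (Join-Path $demo 'b.txt')
'same content'  | Set-Content -Path (Join-Path $demo 'c.txt')
'other content' | Set-Content -Path (Join-Path $demo 'd.txt')

# a.txt is kept as the first copy; b.txt and c.txt are reported as redundant
$redundant = @(Get-RedundantCopy -Path $demo)
$redundant

# Remove the scratch folder again
Remove-Item -LiteralPath $demo -Recurse -Force
```

The result can be piped to Out-GridView -PassThru in the same way as $result in the script. Which copy counts as the "first" is decided here by sorting on the path, which is only one possible rule; sorting by creation time would be another.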

Credits

Thanks to Kenward Bradley’s one-liner, which sparked the idea for this script. Here you go:

http://kenwardtown.com/2016/12/29/find-duplicate-files-with-powershell/

See also

Cool stuff? Take a look at my other scripts here:

Downloads Section sid-500.com

11 replies »

  1. There are a gazillion utilities meant to detect and delete duplicate files. What I’m looking for is more tricky : to detect partial duplicates, when a file fragment A exists inside a bigger file B, but (to compound the difficulty) the beginning of A is not necessarily the beginning of B. It happens in data recovery scenarios, especially with some types of video files which don’t have a single header structure (like MPG / VOB / MTS). What I could come up with so far is to :
    1) Extract a small string near the beginning of each unidentified file fragment into a text file with a Powershell script.
    For instance this extracts 20 bytes at offset 40000 :
    $offset = 40000
    $length = 20
    foreach ($file in gci *.mpg, *.vob, *.mts) {
    $buffer = [Byte[]]::new($length)
    $stream = [System.IO.FileStream]::new($file.FullName, 'Open', 'Read')
    $stream.Position = $offset
    $readSize = $stream.Read($buffer, 0, $length)
    $stream.Dispose()
    if ($readSize) {
    $hex = for ($i = 0; $i -lt $readSize; $i++) { '\x{0:X2}' -f $buffer[$i] }
    $hex = $hex -join ''
    $name = $file.FullName
    $size = $file.Length
    $ascii = [System.Text.Encoding]::Default.GetString($buffer)
    Add-Content -Path "G:\HGST 4To MPG-VOB-MTS logical search 40000 20.txt" -Value "$hex $name $size $offset"
    }
    }
    $buffer = $null
    I got help here :
    https://stackoverflow.com/questions/62238275/quickly-copy-short-string-from-binary-file-as-hex-to-text-file-in-loop-part
    2) Then, load this list into WinHex as a list of search terms and run a “simultaneous search” in “logical” mode (meaning : it analyses a given volume on a file-by-file basis, and reports the logical offset where the string was found in each individual file).
    3) Then, based on the search report, for each group of identified files matching one of these extracted strings, compare checksums between the whole file fragment and the corresponding segment of the bigger file (I do this with Powershell using a small CLI tool called “dsfo” — it could be done with Powershell alone as this article demonstrates, but it works well and makes for smaller scripts), and delete the fragment if indeed there’s a complete match.

    But that method is quite complicated and tedious, and another difficulty is that, whatever offset value I choose, there are always hundreds of strings (out of a few thousand files) which are not specific enough to yield only relevant matches (for instance “00 00 00 00 …” or “FF FF FF FF…”, or even more complex strings which happen to be present in many unrelated files). So I was wondering if there were more streamlined and efficient ways of performing that kind of task. I ran a Web search with « search duplicates within files », and this article appeared on the first page of results, none of which are actually relevant since they all deal with complete / perfect duplicates, which, again, is quite trivial. WinHex has a “block-wise hashing and matching” feature, which would seem like it could do what I want, but it creates a hash database for every single sector of each input file (it can’t be set to a bigger block value), requiring a huge amount of space just to store that (an MD5 hash written as hex takes 32 bytes, so building the hash database for 500GB of input files requires about 30GB, and actually twice that amount since it first creates a temporary file); and then the result of the analysis is useless for that purpose as, contrary to the “simultaneous search”, it doesn’t report logical offsets, it only reports a list of all sectors from input files indexed in the hash database which were found at physical offset X on the analysed volume. So, back to square one. é_è

  2. Hi Patrick,
    your script is perfect. I wonder: what if I want it to show all the duplicates except the first copy of each file? I.e. if a file exists 3 times, the tool shows only 2 lines to be removed, and if it exists only twice, it shows only one line to be removed. This would make it much easier to select all lines and remove them at once.
    Also, any tips for adding a select-all feature to the pop-up window?
    Kind regards.

  3. Hey Patrick
    What if you found 1000 duplicates? What would be the fastest way to select all the duplicates and copy them out without having to ctrl + click?

  4. I found a minor error in New-Item: just before -Path there is this `, and there are more in the script.
    Then there is the date format: there is a y missing in MM/dd/yyy, and I’m using mm-dd-yyyy, but that’s fine.
    Now it works after removing the `.
    Thx.

  5. Like the idea, but why C:? Make it ask for the path where it will copy the duplicates. I’ve tried to change it to D: but that fails.

      • Okay, but using the system drive is a no-go for me, so how do I change it to use a data drive (D:, E: or whatever)? That would solve it for me.

      • Tried using the default script settings but with a different path, and it found 10 files. Super. But after hitting OK, the script fails with:

        New-Item : A positional parameter cannot be found that accepts argument ‘C:\Duplicates_05-12-2020’.
        At C:\Service\Scripts\find_duplicate_files.ps1:53 char:1
        + New-Item -ItemType Directory `-Path $env:SystemDrive\Duplicates_$date …
        is it me or…
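
The error quoted in this reply appears to come from a backtick that is no longer the last character on its line, for example after copying the script from the web page. In that case the backtick escapes the following character, `-Path is passed as a plain argument instead of a parameter name, and the folder path is left over as an extra positional argument. A minimal sketch of one way to avoid line-continuation backticks is splatting. The variable names $destination and $newItemParams are illustrative assumptions, and the demo writes to the temp folder rather than to C:\ so that it can be run safely.

```powershell
# Splatting: collect the parameters in a hashtable and pass it with @ instead of $.
# $destination and $newItemParams are illustrative names, not taken from the script.
$date        = Get-Date -Format 'MM-dd-yyyy'
$destination = Join-Path ([System.IO.Path]::GetTempPath()) "Duplicates_$date"

$newItemParams = @{
    ItemType = 'Directory'
    Path     = $destination
    Force    = $true      # switch parameters are set with $true
}

# Equivalent to: New-Item -ItemType Directory -Path $destination -Force
$folder = New-Item @newItemParams
$folder.FullName
```

To use a different target, such as a folder on a data drive, only the value assigned to $destination has to change, provided the later Move-Item and Start-Process calls use the same variable.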
