PowerShell

Windows: Find and eliminate Duplicate Files with PowerShell

We are living in a big data world, which is both a blessing and a curse. Big data usually means a huge number of files, such as photos and videos, and ultimately a huge amount of storage space. Files are accidentally or deliberately copied from location to location without considering that these duplicates consume more and more storage space. I want to change that with you in this blog post. We will search for duplicate files and then move them to a different storage location for further review.

The Goal

With my script in hand, you can run through the scenario described below. Make sure your computer runs Windows PowerShell 5.1 or PowerShell 7.

  1. Open PowerShell (Windows Key + X + A)
  2. Navigate to the script location and run it. Enter the full path to the folder you want to check. This folder is our target for searching for duplicate files (see the example run after this list).
  3. A window will pop up for selecting duplicate files based on their hash value. All selected files will be moved to C:\Duplicates_CurrentDate.
  4. Afterwards the duplicate files are moved to the new location, and a new window appears that shows the moved files for further review.
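
Here is roughly what a run looks like from the console; the script location (C:\Scripts) and the search path are just example values:

cd C:\Scripts
.\find_duplicate_files.ps1
Enter file path for searching duplicate files (e.g. C:\Temp, C:\): C:\Users\patrick\Pictures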

Which brings me to the code.

The Script

Here is the code for download.

find_duplicate_files.ps1

And here is the code in full. Copy it to your local computer and open it in PowerShell ISE, Visual Studio Code or an editor of your choice. Hit F5 to run it (PowerShell ISE or VS Code).


<#
.SYNOPSIS
find_duplicate_files.ps1 finds duplicate files based on hash values.

.DESCRIPTION
Prompts for a file path. Shows duplicate files for selection.
Selected files are moved to a new folder C:\Duplicates_Date for further review.

.EXAMPLE
Open PowerShell. Navigate to the file location. Type .\find_duplicate_files.ps1 OR
Open PowerShell ISE. Open find_duplicate_files.ps1 and hit F5.

.NOTES
Author: Patrick Gruenauer | Microsoft MVP on PowerShell [2018-2020]
Web: https://sid-500.com
#>
############# Find Duplicate Files based on Hash Value ###############
''
$filepath = Read-Host 'Enter file path for searching duplicate files (e.g. C:\Temp, C:\)'

If (Test-Path $filepath) {
    ''
    Write-Warning 'Searching for duplicates ... Please wait ...'

    # Hash all files below the given path, group by hash and keep only groups with more than one file
    $duplicates = Get-ChildItem $filepath -File -Recurse -ErrorAction SilentlyContinue |
        Get-FileHash |
        Group-Object -Property Hash |
        Where-Object Count -GT 1

    If ($duplicates.Count -lt 1) {
        Write-Warning 'No duplicates found.'
        Break
    }
    else {
        Write-Warning 'Duplicates found.'
        $result = foreach ($d in $duplicates) {
            $d.Group | Select-Object -Property Path, Hash
        }

        # Use a date format without slashes so it can be part of a folder name
        $date = Get-Date -Format 'MM-dd-yyyy'
        $itemstomove = $result |
            Out-GridView -Title "Select files (CTRL for multiple) and press OK. Selected files will be moved to C:\Duplicates_$date" -PassThru

        If ($itemstomove) {
            New-Item -ItemType Directory -Path "$env:SystemDrive\Duplicates_$date" -Force
            Move-Item $itemstomove.Path -Destination "$env:SystemDrive\Duplicates_$date" -Force
            ''
            Write-Warning "Mission accomplished. Selected files moved to C:\Duplicates_$date"
            Start-Process "C:\Duplicates_$date"
        }
        else {
            Write-Warning 'Operation aborted. No files selected.'
        }
    }
}
else {
    Write-Warning 'Folder not found. Use full path to directory e.g. C:\photos\patrick'
}

Credits

Thanks to Kenward Bradley’s one-liner, which sparked the idea for me to write this script. Here you go:

http://kenwardtown.com/2016/12/29/find-duplicate-files-with-powershell/
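
For reference, the core idea behind such a one-liner is to hash all files, group them by hash and keep only the groups with more than one member. A minimal sketch (not necessarily Kenward's exact code; C:\Temp stands in for whatever folder you want to check):

Get-ChildItem C:\Temp -File -Recurse |
    Get-FileHash |
    Group-Object -Property Hash |
    Where-Object Count -gt 1 |
    ForEach-Object { $_.Group.Path }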

See also

Cool stuff? Take a look at my other scripts here:

Downloads Section sid-500.com

19 replies »

  1. Here are some changes I’ve made that people may like.

    1. Modified Line 26 from "Get-FileHash |" to "Select-Object -Property FullName, Length, @{ Name = 'Hash'; Expression = { (Get-FileHash $_.FullName).Hash } } |" to add the file size to the output.
    2. Modified Line 42 from "MM/dd/yyy" to "MM-dd-yyyy", as the "/" character is not allowed in file names.
    3. Added "$DupPath = Read-Host 'Enter file path to move duplicate files (e.g. C:\Temp\, D:\)'" to prompt for the output directory location on Line 43, and changed all "$env:SystemDrive" to "$DupPath".


    • Oh, forgot to say: modify Line 41 from "$d.Group | Select-Object -Property Path, Hash" to "$d.Group | Select-Object -Property FullName, Hash, Length" so that the Length column shows in the output, and also change Line 55 from "Move-Item $itemstomove.Path" to "Move-Item $itemstomove.FullName", as there is no longer a Path property.
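
Taken together, those modifications would look roughly like this. This is only a sketch pieced together from the comments above and has not been tested against the original script; $DupPath is the prompted destination folder:

$DupPath = Read-Host 'Enter file path to move duplicate files (e.g. C:\Temp\, D:\)'

$duplicates = Get-ChildItem $filepath -File -Recurse -ErrorAction SilentlyContinue |
    Select-Object -Property FullName, Length,
        @{ Name = 'Hash'; Expression = { (Get-FileHash $_.FullName).Hash } } |
    Group-Object -Property Hash |
    Where-Object Count -GT 1

$result = foreach ($d in $duplicates) {
    # Length now shows the file size next to each duplicate in the grid view
    $d.Group | Select-Object -Property FullName, Hash, Length
}

$date = Get-Date -Format 'MM-dd-yyyy'
# ... selection via Out-GridView as in the original script, then:
# New-Item -ItemType Directory -Path "$DupPath\Duplicates_$date" -Force
# Move-Item $itemstomove.FullName -Destination "$DupPath\Duplicates_$date" -Force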


  2. There are a gazillion utilities meant to detect and delete duplicate files. What I’m looking for is trickier: detecting partial duplicates, where a file fragment A exists inside a bigger file B, but (to compound the difficulty) the beginning of A is not necessarily the beginning of B. This happens in data recovery scenarios, especially with some types of video files which don’t have a single header structure (like MPG / VOB / MTS). What I could come up with so far is to:
    1) Extract a small string near the beginning of each unidentified file fragment into a text file with a PowerShell script.
    For instance, this extracts 20 bytes at offset 40000:
    $offset = 40000
    $length = 20
    foreach ($file in gci *.mpg, *.vob, *.mts) {
        $buffer = [Byte[]]::new($length)
        $stream = [System.IO.FileStream]::new($file.FullName, 'Open', 'Read')
        $stream.Position = $offset
        $readSize = $stream.Read($buffer, 0, $length)
        $stream.Dispose()
        if ($readSize) {
            $hex = for ($i = 0; $i -lt $readSize; $i++) { '\x{0:X2}' -f $buffer[$i] }
            $hex = $hex -join ''
            $name = $file.FullName
            $size = $file.Length
            $ascii = [System.Text.Encoding]::Default.GetString($buffer)
            Add-Content -Path "G:\HGST 4To MPG-VOB-MTS logical search 40000 20.txt" -Value "$hex $name $size $offset"
        }
    }
    $buffer = $null
    I got help here:
    https://stackoverflow.com/questions/62238275/quickly-copy-short-string-from-binary-file-as-hex-to-text-file-in-loop-part
    2) Then, load this list into WinHex as a list of search terms and run a "simultaneous search" in "logical" mode (meaning it analyses a given volume on a file-by-file basis and reports the logical offset where the string was found in each individual file).
    3) Then, based on the search report, for each group of identified files matching one of these extracted strings, compare checksums between the whole file fragment and the corresponding segment of the bigger file (I do this with PowerShell using a small CLI tool called "dsfo"; it could be done with PowerShell alone, as this article demonstrates, but it works well and makes for smaller scripts), and delete the fragment if there is indeed a complete match.

    But that method is quite complicated and tedious, and another difficulty is that, whatever offset value I choose, there are always hundreds of strings (out of a few thousand files) which are not specific enough to yield only relevant matches (for instance "00 00 00 00 …" or "FF FF FF FF …", or even more complex strings which happen to be present in many unrelated files). So I was wondering if there were more streamlined and efficient ways of performing that kind of task. I ran a Web search for "search duplicates within files", and this article appeared on the first page of results, none of which are actually relevant, since they all deal with complete / perfect duplicates, which, again, is quite trivial. WinHex has a "block-wise hashing and matching" feature, which would seem like it could do what I want, but it creates a hash database for every single sector of each input file (it can't be set to a bigger block value), requiring a huge amount of space just to store that (an MD5 hash entry takes 32 bytes, so building the hash database for 500GB of input files requires about 30GB, and actually twice that amount since it first creates a temporary file); and then the result of the analysis is useless for that purpose because, contrary to the "simultaneous search", it doesn't report logical offsets; it only reports a list of all sectors from input files indexed in the hash database which were found at physical offset X on the analysed volume. So, back to square one. é_è
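
As a side note on step 3 above: comparing a whole fragment against a segment of a bigger file can be done with PowerShell alone. Below is a minimal, untested sketch of hashing an arbitrary byte range; the paths, variable names and the choice of MD5 are only assumptions for illustration, and it assumes the fragment fits into memory:

$bigFile      = 'D:\recovered\big.vob'         # hypothetical bigger file
$fragmentFile = 'D:\recovered\fragment.vob'    # hypothetical fragment
$offset       = 123456                         # where the fragment is suspected to start in $bigFile
$length       = (Get-Item $fragmentFile).Length

# Read the byte range from the bigger file
$buffer = [Byte[]]::new($length)
$stream = [System.IO.FileStream]::new($bigFile, 'Open', 'Read')
$stream.Position = $offset
$read = 0
while ($read -lt $length) {
    $n = $stream.Read($buffer, $read, $length - $read)
    if ($n -eq 0) { break }
    $read += $n
}
$stream.Dispose()

# Hash the segment and compare it with the hash of the whole fragment
$md5 = [System.Security.Cryptography.MD5]::Create()
$segmentHash  = [System.BitConverter]::ToString($md5.ComputeHash($buffer, 0, $read)) -replace '-'
$fragmentHash = (Get-FileHash $fragmentFile -Algorithm MD5).Hash
$segmentHash -eq $fragmentHash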


  3. Hi Patrick,
    your script is perfect; I wonder, what if I want it to show all the duplicates except the first copy of the file, i.e. if the file exists 3 times, the tool shows only 2 lines to be removed, and if only twice, it shows only one line to be removed? This would make it way easier to select all lines and remove them at once.
    Also, any tips on adding a select-all feature to the pop-up window?
    Kind regards.
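
One possible way to do that, sketched against the $duplicates variable from the script above and not tested: keep the first file of each hash group and list only the remaining copies, so everything shown in the window can be selected and moved.

$result = foreach ($d in $duplicates) {
    # Skip the first file in each group so one copy always survives
    $d.Group | Select-Object -Property Path, Hash | Select-Object -Skip 1
}
# In the Out-GridView window, CTRL+A should select all rows at once.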


  4. Hey Patrick,
    what if you found 1000 duplicates? What would be the fastest way to select all the duplicates and copy them out without having to CTRL + click?
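
A possible shortcut, again only a sketch built on the variables from the script above: skip the Out-GridView selection entirely and move every copy except the first one in each group.

$date   = Get-Date -Format 'MM-dd-yyyy'
$target = "$env:SystemDrive\Duplicates_$date"
New-Item -ItemType Directory -Path $target -Force | Out-Null

foreach ($d in $duplicates) {
    # Keep the first file of each hash group, move the rest
    $d.Group | Select-Object -Skip 1 | ForEach-Object {
        Move-Item -Path $_.Path -Destination $target -Force
    }
}
# Note: duplicates that share the same file name will overwrite each other
# in the target folder because of -Force.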


  5. I found a minor error: in New-Item, just before -Path, there is this ` character, and there are more of them in the script.
    Then there is the date format: there is a y missing in MM/dd/yyy. I'm using MM-dd-yyyy, but that's fine.
    Now it works after removing the `.
    Thx.


  6. I like the idea, but why C:? Make it ask for the path where it will copy the duplicates. I've tried to change it to D:, but that fails.


      • Okay, but using the system drive is a no-go for me, so how do I change it to use a data drive (D:, E: or whatever)? That would solve it for me.


      • Tried using the default script settings but with a different path, and it found 10 files. Super. But after hitting OK, the script fails with:

        New-Item : A positional parameter cannot be found that accepts argument 'C:\Duplicates_05-12-2020'.
        At C:\Service\Scripts\find_duplicate_files.ps1:53 char:1
        + New-Item -ItemType Directory `-Path $env:SystemDrive\Duplicates_$date …

        Is it me or…
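
To use another drive or folder, as asked in this thread, one option (sketched and untested) is to prompt for the destination instead of hard-coding the system drive, and then build the target path with Join-Path:

$dupRoot = Read-Host 'Enter drive or folder for the duplicates (e.g. D:\ or E:\Review)'
$target  = Join-Path $dupRoot "Duplicates_$date"

New-Item -ItemType Directory -Path $target -Force | Out-Null
Move-Item $itemstomove.Path -Destination $target -Force
Start-Process $target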

