Comment on AdGuard/PiHole Blocklists merge duplicates
CarbonatedPastaSauce@lemmy.world 1 year ago
I doubt you’ll find something off the shelf for this. I wrote a PowerShell script that deduplicates lists and also does a pass over the results to drop any individual IPs already covered by a CIDR range in the list. If you’re interested I’ll share it.
But honestly you could probably have ChatGPT whip this up for you in your language of choice. It’s pretty straightforward.
nyar@lemmy.world 1 year ago
I’d like to see your script.
CarbonatedPastaSauce@lemmy.world 1 year ago
Sorry it took a while, I’m currently on vacation! But I had some time to reread it and sanitize it for public sharing. Here you go:
`# If you need a web proxy to download blocklists, uncomment the next line and modify the proxy URL
# $WebProxy = "example.notarealproxyserveraddress.com:8080"
# Change this path to the folder you want to store files in during processing, usually the script's directory
# All downloaded blocklists and final merged files will be stored here before copying to final destination
$ScriptDir = "C:\Scripts\Merge-Blocklists\"
# Path to file containing list of IP Blocklist URLs
# Create this text file with one URL per line for the blocklists you want to download and merge
$URLfile = Join-Path $ScriptDir "blocklist-URLs.txt"
# Path / filenames for the final output files
$IPOutputFile = Join-Path $ScriptDir "iplist.txt"
$NetOutputFile = Join-Path $ScriptDir "netlist.txt"
# Path to the script log file
$LogFile = Join-Path $ScriptDir "log.txt"
# Create blank log file
$null | Out-File $LogFile
# Path to merged file the script creates
$MergedFile = Join-Path $ScriptDir "BL_merged-list.txt"
# Create blank merged file
$null | Out-File $MergedFile
# Regexes to validate IPv4 addresses and CIDR ranges
$IPregex = '^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'
$CIDRregex = '^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(/(3[0-2]|[1-2][0-9]|[0-9]))$'
# Function to validate blocklist URLs
function ValidateURL {
    param(
        [string]$URL
    )
}
# Function to see if an IP is inside a CIDR
function CIDRcontainsIP {
    param(
        $IPAddress,
        $CIDR
    )
}
# Function to write to log and show in console
function LogThis {
    $TimeStamp = Get-Date -Format HH:mm:ss
    $LogData = $TimeStamp + " - " + $log
    Out-File -FilePath $LogFile -InputObject $LogData -Append
    Write-Host $log
}
# Get the list of IP blocklist URLs to pull from (ignoring lines that don't start with http)
try {
    $BlockListURLs = Get-Content $URLfile -ErrorAction Stop | Where-Object {$_ -match "^http"}
}
catch {
    $log = "Failed to load blocklist URLs from file $URLfile"; LogThis
    exit
}
# Validate BlockList URLs
foreach($bURL in $BlockListURLs) {
    if(!(ValidateURL -URL $bURL)) {
        $log = "BlockList URL $bURL is invalid"; LogThis
        $InvalidURLDetected = $true
    }
}
# Exit if any invalid URLs were detected
if($InvalidURLDetected) {
    $log = "Invalid URLs were detected in file $URLfile - remove or correct invalid entries and rerun script"; LogThis
    exit
}
# Clean up any pre-existing blocklist files
$PreviousFiles = Get-ChildItem -Path $ScriptDir -Filter "BL_*.txt"
if($PreviousFiles) {
    foreach($file in $PreviousFiles) {
        # Use the full path so removal does not depend on the current directory
        Remove-Item $file.FullName
    }
}
# Download the Blocklist files
foreach($URL in $BlockListURLs) {
    # Generate a filename from the domain name and target filename - strip the extension and add .txt
    $BlockListFile = Join-Path $ScriptDir ("BL_" + $URL.Split("/")[2] + "-" + ($URL.Split("/")[-1]).Split(".")[0] + ".txt")
}
# Import all the downloaded files and merge into a single file
$BlockListFiles = Get-ChildItem -Path $ScriptDir -Filter "BL_*.txt"
foreach($File in $BlockListFiles) {
    # Special handling for SpamHaus since they comment each line
    if($File.Name -match 'spamhaus') {
        $FileAppend = Get-Content $File.FullName | ForEach-Object {$_.Split(" ")[0]} | Where-Object {$_ -match $IPregex -or $_ -match $CIDRregex}
    }
    else {
        $FileAppend = Get-Content $File.FullName | Where-Object {$_ -match $IPregex -or $_ -match $CIDRregex}
    }
    $log = "Adding $($FileAppend | Measure-Object | Select-Object -ExpandProperty Count) lines from $($File.Name) to merge file"; LogThis
    $FileAppend | Out-File $MergedFile -Append
}
# Read in the merged file contents so it can be deduplicated
$MergedList = Get-Content $MergedFile
$PreDedupeCount = $MergedList | Measure-Object | Select-Object -ExpandProperty Count
$MergedList = $MergedList | Select-Object -Unique
$PostDedupeCount = $MergedList | Measure-Object | Select-Object -ExpandProperty Count
$log = "Removed $($PreDedupeCount - $PostDedupeCount) entries via deduplication"; LogThis
# Separate the results into hashtables for IP addresses and CIDR ranges
$IPList = @{}
$CIDRList = @{}
foreach($val in $MergedList) {
    if($val -match $CIDRregex) {
        $CIDRList.Add("$val",1)
    }
    elseif($val -match $IPregex) {
        $IPList.Add("$val",1)
    }
    else {
        $log = "Merged list value $val does not match IP or CIDR regex"; LogThis
    }
}
$IPcount = $IPList.GetEnumerator() | Measure-Object | Select-Object -ExpandProperty Count
$CIDRcount = $CIDRList.GetEnumerator() | Measure-Object | Select-Object -ExpandProperty Count
$log = "Found $IPcount unique IP addresses and $CIDRcount unique CIDR ranges to evaluate"; LogThis
# Build an array from $IPList hashtable so we can modify the hashtable without ending the foreach loop
$IPListCopy = $IPList.GetEnumerator() | Select-Object -ExpandProperty Name
# Evaluate all the individual IPs to see if they are contained in an existing CIDR
# If they are, set them for removal by making the hashtable value 0
$ProcessedIPs = 0
foreach($val in $IPListCopy) {
    foreach($CIDR in $CIDRList.Keys) {
        if(CIDRcontainsIP -IPAddress $val -CIDR $CIDR) {
            Write-Host "IP $val is in CIDR $CIDR" -ForegroundColor Yellow
            $IPList[$val] = 0
        }
    }
    $ProcessedIPs++
    if(($ProcessedIPs % 100) -eq 0) {
        Write-Host "Evaluated $ProcessedIPs of $IPcount IP addresses"
    }
}
$RemovedIPcount = $IPList.GetEnumerator() | Where-Object {$_.Value -eq 0} | Measure-Object | Select-Object -ExpandProperty Count
$log = "IP Address analysis found $RemovedIPcount IP addresses that were already contained in existing CIDR ranges"; LogThis
# Write the remaining IPs and CIDRs to the final output files
$FileHeader = "# Last Updated $(Get-Date)"
$FileHeader | Out-File $IPOutputFile -Encoding ASCII
$FileHeader | Out-File $NetOutputFile -Encoding ASCII
$IPList.GetEnumerator() | Where-Object {$_.Value -eq 1} | Select-Object -ExpandProperty Name | Sort-Object | Out-File -FilePath $IPOutputFile -Encoding ASCII -Append
$CIDRList.GetEnumerator() | Where-Object {$_.Value -eq 1} | Select-Object -ExpandProperty Name | Sort-Object | Out-File -FilePath $NetOutputFile -Encoding ASCII -Append
# Change these paths to wherever you want the final files to go
Copy-Item $IPOutputFile C:\web\blocklist\iplist.ipset -Force
Copy-Item $NetOutputFile C:\web\blocklist\netlist.netset -Force `
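The body of `CIDRcontainsIP` is not shown in the post, so anyone adapting the script has to supply their own. The following is a minimal sketch of one way such a check could work, not the author's implementation: it converts both addresses to integers and compares them after shifting out the host bits. The names `ConvertTo-IPv4Number` and `Test-IPv4InCidr` are hypothetical, and the sketch assumes its inputs are already well-formed IPv4 values (for example, already matched against `$IPregex` and `$CIDRregex`).

```powershell
# Hypothetical helper names; this is a sketch, not the code from the post above.

# Convert a dotted-quad IPv4 string such as "10.1.2.3" to a 64-bit integer.
function ConvertTo-IPv4Number {
    param([Parameter(Mandatory)][string]$Address)

    [long]$value = 0
    foreach ($octet in $Address.Split('.')) {
        $value = ($value * 256) + [long]$octet
    }
    return $value
}

# Return $true when $IPAddress falls inside the $CIDR range.
# Assumes both inputs are already well-formed IPv4 values.
function Test-IPv4InCidr {
    param(
        [Parameter(Mandatory)][string]$IPAddress,
        [Parameter(Mandatory)][string]$CIDR
    )

    $network, $prefix = $CIDR.Split('/')
    $hostBits = 32 - [int]$prefix

    # Shifting out the host bits leaves only the network portion to compare
    $ipNetwork   = (ConvertTo-IPv4Number -Address $IPAddress) -shr $hostBits
    $cidrNetwork = (ConvertTo-IPv4Number -Address $network) -shr $hostBits

    return $ipNetwork -eq $cidrNetwork
}

Test-IPv4InCidr -IPAddress '10.1.2.3' -CIDR '10.0.0.0/8'        # True
Test-IPv4InCidr -IPAddress '192.168.2.1' -CIDR '192.168.1.0/24' # False
```

Because it takes the same `-IPAddress` and `-CIDR` parameters the script passes to `CIDRcontainsIP`, a function along these lines could be dropped in under that name. The nested loop in the script calls it once per IP per CIDR, so on large lists it may be worth converting each CIDR once up front instead of on every call.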