PowerShell - Improve performance when searching for a string within multiple Word files
I have drafted a PowerShell script that searches for a string among a large number of Word files. The script works fine, but I have around 1 GB of data to search through, and it takes around 15 minutes.
Can you suggest modifications to make it run faster?
Set-StrictMode -Version Latest

$path = "c:\tester1"
$output = "c:\scripts\resultmatch1.csv"
$application = New-Object -ComObject Word.Application
$application.Visible = $false
$findText = "roaming"
$charactersAround = 30
$results = @()

function GetStringMatch {
    for ($i = 1; $i -le 4; $i++) {
        $j = "d" + $i
        $finalPath = $path + "\" + $j
        $files = Get-ChildItem $finalPath -Include *.docx,*.doc -Recurse | Where-Object { !($_.PSIsContainer) }

        # Loop through the *.doc(x) files in the current subdirectory
        foreach ($file in $files) {
            $document = $application.Documents.Open($file.FullName, $false, $true)
            $range = $document.Content

            if ($range.Text -match ".{$($charactersAround)}$($findText).{$($charactersAround)}") {
                $properties = @{
                    File       = $file.FullName
                    Match      = $findText
                    TextAround = $matches[0]
                }
                $results += New-Object -TypeName PSCustomObject -Property $properties
            }

            # Close every document after inspecting it, not just the ones that match
            $document.Close()
        }
    }

    if ($results) { $results | Export-Csv $output -NoTypeInformation }
    $application.Quit()
}

GetStringMatch
Import-Csv $output
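One small, hedged aside (my addition, not part of the original script): Quit() alone does not always let the hidden WINWORD.EXE process exit, because PowerShell still holds the COM reference. Releasing it explicitly after GetStringMatch returns can help:

# Hedged cleanup sketch: release the Word COM reference so the hidden
# WINWORD.EXE process can exit promptly once the search is done
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($application)
[GC]::Collect()
[GC]::WaitForPendingFinalizers()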
As mentioned in the comments, you might want to consider using the OpenXML SDK library (you can get the newest version of the SDK on GitHub), since it carries way less overhead than spinning up an instance of Word.
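If you don't have the SDK's v2.5 MSI installed, one alternative (an assumption on my part, not something from the original answer) is to pull the DocumentFormat.OpenXml package from NuGet and load the DLL from the package folder; the exact path depends on the version that gets installed:

# Hedged sketch: fetch the SDK via the NuGet package provider instead of the MSI
Register-PackageSource -Name nuget.org -Location 'https://www.nuget.org/api/v2' -ProviderName NuGet -ErrorAction SilentlyContinue
Install-Package DocumentFormat.OpenXml -ProviderName NuGet -Scope CurrentUser
# Adjust the version segment of this path to match what was installed
Add-Type -Path "$env:LOCALAPPDATA\PackageManagement\NuGet\Packages\DocumentFormat.OpenXml.2.5.0\lib\DocumentFormat.OpenXml.dll"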
Below, I've turned your current function into a more generic one that uses the SDK and has no dependencies on the caller/parent scope:
function Get-WordStringMatch {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [System.IO.FileInfo[]]$Files,
        [string]$FindText,
        [int]$CharactersAround
    )

    begin {
        # Import the OpenXML library
        Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' | Out-Null

        # Make a "shorthand" reference to the Word document type
        $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

        # Construct the regex pattern
        $Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
    }

    process {
        # Loop through the *.doc(x) files
        foreach ($File in $Files) {
            # Open the document, wrap the content stream in a StreamReader
            $Document       = $WordDoc::Open($File.FullName, $false)
            $DocumentStream = $Document.MainDocumentPart.GetStream()
            $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

            # Read the entire document
            if ($DocumentReader.ReadToEnd() -match $Pattern) {
                # Got a match? Output our custom object
                New-Object psobject -Property @{
                    File       = $File.FullName
                    Match      = $FindText
                    TextAround = $matches[0]
                }
            }

            # Clean up per file; disposing only in end{} would leak all but the last document
            $DocumentReader.Dispose()
            $DocumentStream.Dispose()
            $Document.Dispose()
        }
    }
}
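One caveat worth noting (my addition, not from the original answer): the stream returned by GetStream() is raw WordprocessingML, so the regex runs against XML markup; TextAround can contain tags, and a phrase split across formatting runs may be missed. A hedged variant of the loop body that matches on plain text instead, using the SDK's InnerText property:

# Hedged alternative for the body of the foreach loop above: flatten the
# document XML to plain text before matching, at the cost of some memory
$Document = $WordDoc::Open($File.FullName, $false)
try {
    if ($Document.MainDocumentPart.Document.InnerText -match $Pattern) {
        New-Object psobject -Property @{
            File       = $File.FullName
            Match      = $FindText
            TextAround = $matches[0]
        }
    }
}
finally {
    $Document.Dispose()
}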
Now that we have a nice function that supports pipeline input, all we need to do is gather our documents and pipe them to it!
# Variables
$Path             = "C:\tester1"
$Output           = "C:\scripts\resultmatch1.csv"
$FindText         = "roaming"
$CharactersAround = 30

# Gather the files (note $_, not $i, inside ForEach-Object)
$Files = 1..4 | ForEach-Object {
    $FinalPath = Join-Path $Path "d$_"
    Get-ChildItem $FinalPath -Recurse | Where-Object {
        !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension)
    }
}

# Run them through our new function
$Results = $Files | Get-WordStringMatch -FindText $FindText -CharactersAround $CharactersAround

# Got any results? Export them to CSV
if ($Results) {
    $Results | Export-Csv -Path $Output -NoTypeInformation
}
Since all of our components support pipelining, you could also do it in one go:
1..4 | ForEach-Object {
    $FinalPath = Join-Path $Path "d$_"
    Get-ChildItem $FinalPath -Recurse | Where-Object {
        !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension)
    }
} | Get-WordStringMatch -FindText $FindText -CharactersAround $CharactersAround | Export-Csv -Path $Output -NoTypeInformation
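Finally, since performance was the starting point: the four d1..d4 folders are independent, so they could be searched in parallel. A hedged sketch using background jobs (it assumes the function has been saved to a script file, here the hypothetical C:\scripts\Get-WordStringMatch.ps1, since jobs run in fresh sessions):

# Hedged parallelization sketch: one background job per subfolder
$Jobs = 1..4 | ForEach-Object {
    Start-Job -ArgumentList $_ -ScriptBlock {
        param($n)
        # Dot-source the function definition into the job's session
        . 'C:\scripts\Get-WordStringMatch.ps1'
        Get-ChildItem (Join-Path 'C:\tester1' "d$n") -Recurse |
            Where-Object { !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension) } |
            Get-WordStringMatch -FindText 'roaming' -CharactersAround 30
    }
}
$Jobs | Wait-Job | Receive-Job | Export-Csv -Path 'C:\scripts\resultmatch1.csv' -NoTypeInformation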