PowerShell - Improve performance when searching for a string within multiple Word files


I have drafted a PowerShell script that searches for a string among a large number of Word files. The script is working fine, but I have around 1 GB of data to search through, and it is taking around 15 minutes.

Can anyone suggest any modifications I can make to make it run faster?

    Set-StrictMode -Version Latest
    $path     = "c:\tester1"
    $output   = "c:\scripts\resultmatch1.csv"
    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    $findtext = "roaming"
    $charactersaround = 30
    $results = @()

    function getstringmatch {
        for ($i = 1; $i -le 4; $i++) {
            $j = "d" + $i
            $finalpath = $path + "\" + $j
            $files = Get-ChildItem $finalpath -Include *.docx,*.doc -Recurse | Where-Object { !($_.PSIsContainer) }

            # Loop through all *.doc files in the $path directory
            foreach ($file in $files) {
                # Open(FileName, ConfirmConversions, ReadOnly)
                $document = $application.Documents.Open($file.FullName, $false, $true)
                $range = $document.Content

                if ($range.Text -match ".{$($charactersaround)}$($findtext).{$($charactersaround)}") {
                    $properties = @{
                        file       = $file.FullName
                        match      = $findtext
                        textaround = $Matches[0]
                    }
                    $results += New-Object -TypeName PSCustomObject -Property $properties
                }

                # Close every document, not only the ones that matched
                $document.Close()
            }
        }

        if ($results) {
            $results | Export-Csv $output -NoTypeInformation
        }

        $application.Quit()
    }

    getstringmatch

    Import-Csv $output

As mentioned in the comments, you might want to consider using the OpenXML SDK library (you can also get the newest version of the SDK on GitHub), since it has far less overhead than spinning up an instance of Word.

Below I've turned your current function into a more generic one, using the SDK and with no dependencies on the caller/parent scope:

    function Get-WordStringMatch {
        param(
            [Parameter(Mandatory, ValueFromPipeline)]
            [System.IO.FileInfo[]]$files,
            [string]$findtext,
            [int]$charactersaround
        )

        begin {
            # Import the OpenXML library
            Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' | Out-Null

            # Make a "shorthand" reference to the Word document type
            $worddoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

            # Construct the regex pattern
            $pattern = ".{$charactersaround}$([regex]::Escape($findtext)).{$charactersaround}"
        }

        process {
            # Loop through the *.doc(x) files
            # Note: the SDK opens only Open XML packages (.docx); a legacy binary .doc will throw here
            foreach ($file in $files)
            {
                $document = $documentstream = $documentreader = $null
                try {
                    # Open the document read-only, wrap the content stream in a StreamReader
                    $document       = $worddoc::Open($file.FullName, $false)
                    $documentstream = $document.MainDocumentPart.GetStream()
                    $documentreader = New-Object System.IO.StreamReader $documentstream

                    # Read the entire document
                    if ($documentreader.ReadToEnd() -match $pattern)
                    {
                        # Got a match? Output our custom object
                        New-Object psobject -Property @{
                            file       = $file.FullName
                            match      = $findtext
                            textaround = $Matches[0]
                        }
                    }
                }
                finally {
                    # Clean up after every file, not just the last one
                    if ($documentreader) { $documentreader.Dispose() }
                    if ($documentstream) { $documentstream.Dispose() }
                    if ($document)       { $document.Dispose() }
                }
            }
        }
    }
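One caveat with the function above: `GetStream()` returns the raw XML of the main document part, so the pattern is matched against markup as well as text, and a phrase that Word has split across several formatting runs may not appear as one contiguous string. Here is a minimal sketch of one way around that, assuming .NET 4.5 or later for `System.IO.Compression`. It does not use the SDK at all, and the helper name `Get-DocxText` is hypothetical. It reads `word/document.xml` from the package and concatenates the `w:t` text elements, covering only the main body (no headers, footers or footnotes), and text inside text boxes may be counted twice.

```powershell
# Hypothetical helper (not part of the original answer): returns the body text
# of a .docx file by reading word/document.xml straight from the zip package.
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Get-DocxText {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
    try {
        $entry = $zip.GetEntry('word/document.xml')
        if (-not $entry) { return '' }

        # Load the main document part as XML
        $xml    = New-Object System.Xml.XmlDocument
        $stream = $entry.Open()
        try     { $xml.Load($stream) }
        finally { $stream.Dispose() }

        $ns = New-Object System.Xml.XmlNamespaceManager $xml.NameTable
        $ns.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

        # Join the text runs of each paragraph without a separator ...
        $paragraphs = foreach ($p in $xml.SelectNodes('//w:p', $ns)) {
            -join ($p.SelectNodes('.//w:t', $ns) | ForEach-Object { $_.InnerText })
        }

        # ... then join paragraphs with a space, so that "." in the search
        # pattern can still match across paragraph boundaries
        $paragraphs -join ' '
    }
    finally {
        $zip.Dispose()
    }
}
```

Inside the `process` block, something like `(Get-DocxText -Path $file.FullName) -match $pattern` could then stand in for the `StreamReader` call. Whether this is faster or slower than the SDK on your data is something you would need to measure.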

Now that you have a nice function that supports pipeline input, all you need to do is gather your documents and pipe them to it!

    # Variables
    $path     = "c:\tester1"
    $output   = "c:\scripts\resultmatch1.csv"
    $findtext = "roaming"
    $charactersaround = 30

    # Gather the files ($_ is the current number from 1..4)
    $files = 1..4 | ForEach-Object {
        $finalpath = Join-Path $path "d$_"
        Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PSIsContainer) -and @('.docx','.doc') -contains $_.Extension }
    }

    # Run them through our new function
    $results = $files | Get-WordStringMatch -findtext $findtext -charactersaround $charactersaround

    # Got any results? Export them to CSV
    if ($results) {
        $results | Export-Csv -Path $output -NoTypeInformation
    }

Since all of our components support pipelining, you could do it all in one go:

    1..4 | ForEach-Object {
        $finalpath = Join-Path $path "d$_"
        Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PSIsContainer) -and @('.docx','.doc') -contains $_.Extension }
    } | Get-WordStringMatch -findtext $findtext -charactersaround $charactersaround | Export-Csv -Path $output -NoTypeInformation
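One more note on the pattern itself: `.{30}` on both sides demands exactly 30 characters before and after the search text, so a hit that sits closer than that to the start or end of the content (or to a line break, since `.` does not match newlines) is not reported. Below is a small sketch of a more lenient variant using `{0,n}` quantifiers. The helper name `New-ContextPattern` is hypothetical and only serves to illustrate the idea.

```powershell
# Hypothetical helper: builds a pattern that captures UP TO $CharactersAround
# characters on either side of the literal search text.
function New-ContextPattern {
    param(
        [Parameter(Mandatory)]
        [string]$FindText,
        [int]$CharactersAround = 30
    )

    ".{0,$CharactersAround}$([regex]::Escape($FindText)).{0,$CharactersAround}"
}

# Only 5 characters precede the search text here, so the exact-count
# pattern ".{10}roaming.{10}" would not match this sample.
$sample  = 'data roaming is on'
$pattern = New-ContextPattern -FindText 'roaming' -CharactersAround 10

if ($sample -match $pattern) {
    $Matches[0]
}
```

With the lenient pattern the whole sample string is returned as the surrounding text, because both quantifiers take as much context as is available up to the limit.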
