PowerShell - Improve performance when searching for a string within multiple Word files
I have drafted a PowerShell script that searches for a string among a large number of Word files. The script works fine, but I have around 1 GB of data to search through, and it takes around 15 minutes.
Can you suggest modifications to make it run faster?
Set-StrictMode -Version Latest

$path = "c:\tester1"
$output = "c:\scripts\resultmatch1.csv"
$application = New-Object -ComObject Word.Application
$application.Visible = $false
$findText = "roaming"
$charactersAround = 30
$results = @()

function GetStringMatch {
    for ($i = 1; $i -le 4; $i++) {
        $j = "d" + $i
        $finalPath = $path + "\" + $j
        $files = Get-ChildItem $finalPath -Include *.docx,*.doc -Recurse | Where-Object { !($_.PSIsContainer) }

        # Loop through the *.doc(x) files in the current subdirectory
        foreach ($file in $files) {
            $document = $application.Documents.Open($file.FullName, $false, $true)
            $range = $document.Content

            if ($range.Text -match ".{$($charactersAround)}$($findText).{$($charactersAround)}") {
                $properties = @{
                    File       = $file.FullName
                    Match      = $findText
                    TextAround = $matches[0]
                }
                $results += New-Object -TypeName PSCustomObject -Property $properties
            }

            # Close every document after inspecting it, not just the ones that match
            $document.Close()
        }
    }

    if ($results) { $results | Export-Csv $output -NoTypeInformation }
    $application.Quit()
}

GetStringMatch
Import-Csv $output
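One small, hedged aside (my addition, not part of the original script): Quit() alone does not always let the hidden WINWORD.EXE process exit, because PowerShell still holds the COM reference. Releasing it explicitly after GetStringMatch returns can help:

# Hedged cleanup sketch: release the Word COM reference so the hidden
# WINWORD.EXE process can exit promptly once the search is done
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($application)
[GC]::Collect()
[GC]::WaitForPendingFinalizers()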
As mentioned in the comments, you might want to consider using the OpenXML SDK library (you can get the newest version of the SDK on GitHub), since it carries way less overhead than spinning up an instance of Word.
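If you don't have the SDK's v2.5 MSI installed, one alternative (an assumption on my part, not something from the original answer) is to pull the DocumentFormat.OpenXml package from NuGet and load the DLL from the package folder; the exact path depends on the version that gets installed:

# Hedged sketch: fetch the SDK via the NuGet package provider instead of the MSI
Register-PackageSource -Name nuget.org -Location 'https://www.nuget.org/api/v2' -ProviderName NuGet -ErrorAction SilentlyContinue
Install-Package DocumentFormat.OpenXml -ProviderName NuGet -Scope CurrentUser
# Adjust the version segment of this path to match what was installed
Add-Type -Path "$env:LOCALAPPDATA\PackageManagement\NuGet\Packages\DocumentFormat.OpenXml.2.5.0\lib\DocumentFormat.OpenXml.dll"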
Below, I've turned your current function into a more generic one that uses the SDK and has no dependencies on the caller/parent scope:
function Get-WordStringMatch {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [System.IO.FileInfo[]]$Files,
        [string]$FindText,
        [int]$CharactersAround
    )

    begin {
        # Import the OpenXML library
        Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' | Out-Null

        # Make a "shorthand" reference to the Word document type
        $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

        # Construct the regex pattern
        $Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
    }

    process {
        # Loop through the *.doc(x) files
        foreach ($File in $Files) {
            # Open the document, wrap the content stream in a StreamReader
            $Document       = $WordDoc::Open($File.FullName, $false)
            $DocumentStream = $Document.MainDocumentPart.GetStream()
            $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

            # Read the entire document
            if ($DocumentReader.ReadToEnd() -match $Pattern) {
                # Got a match? Output our custom object
                New-Object psobject -Property @{
                    File       = $File.FullName
                    Match      = $FindText
                    TextAround = $matches[0]
                }
            }

            # Clean up per file; disposing only in end{} would leak all but the last document
            $DocumentReader.Dispose()
            $DocumentStream.Dispose()
            $Document.Dispose()
        }
    }
}
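One caveat worth noting (my addition, not from the original answer): the stream returned by GetStream() is raw WordprocessingML, so the regex runs against XML markup; TextAround can contain tags, and a phrase split across formatting runs may be missed. A hedged variant of the loop body that matches on plain text instead, using the SDK's InnerText property:

# Hedged alternative for the body of the foreach loop above: flatten the
# document XML to plain text before matching, at the cost of some memory
$Document = $WordDoc::Open($File.FullName, $false)
try {
    if ($Document.MainDocumentPart.Document.InnerText -match $Pattern) {
        New-Object psobject -Property @{
            File       = $File.FullName
            Match      = $FindText
            TextAround = $matches[0]
        }
    }
}
finally {
    $Document.Dispose()
}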
Now that we have a nice function that supports pipeline input, all we need to do is gather our documents and pipe them to it!
# Variables
$Path             = "C:\tester1"
$Output           = "C:\scripts\resultmatch1.csv"
$FindText         = "roaming"
$CharactersAround = 30

# Gather the files (note $_, not $i, inside ForEach-Object)
$Files = 1..4 | ForEach-Object {
    $FinalPath = Join-Path $Path "d$_"
    Get-ChildItem $FinalPath -Recurse | Where-Object {
        !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension)
    }
}

# Run them through our new function
$Results = $Files | Get-WordStringMatch -FindText $FindText -CharactersAround $CharactersAround

# Got any results? Export them to CSV
if ($Results) {
    $Results | Export-Csv -Path $Output -NoTypeInformation
}
Since all of our components support pipelining, you could also do it in one go:
1..4 | ForEach-Object {
    $FinalPath = Join-Path $Path "d$_"
    Get-ChildItem $FinalPath -Recurse | Where-Object {
        !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension)
    }
} | Get-WordStringMatch -FindText $FindText -CharactersAround $CharactersAround | Export-Csv -Path $Output -NoTypeInformation
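Finally, since performance was the starting point: the four d1..d4 folders are independent, so they could be searched in parallel. A hedged sketch using background jobs (it assumes the function has been saved to a script file, here the hypothetical C:\scripts\Get-WordStringMatch.ps1, since jobs run in fresh sessions):

# Hedged parallelization sketch: one background job per subfolder
$Jobs = 1..4 | ForEach-Object {
    Start-Job -ArgumentList $_ -ScriptBlock {
        param($n)
        # Dot-source the function definition into the job's session
        . 'C:\scripts\Get-WordStringMatch.ps1'
        Get-ChildItem (Join-Path 'C:\tester1' "d$n") -Recurse |
            Where-Object { !($_.PSIsContainer) -and (@('.docx', '.doc') -contains $_.Extension) } |
            Get-WordStringMatch -FindText 'roaming' -CharactersAround 30
    }
}
$Jobs | Wait-Job | Receive-Job | Export-Csv -Path 'C:\scripts\resultmatch1.csv' -NoTypeInformation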