Storing HTML Documents in Elasticsearch


Scenario

I have HTML documents, let's say emails. I want to store these in Elasticsearch and search the plain text of the HTML emails.

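Problem

Elasticsearch will index the HTML tags and attributes, too. I don't want that. I want a search for span to match only when the word appears in the plain text, not when it is an HTML element name. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.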

Question

Would you recommend storing both an HTML-stripped field and the raw HTML field in the document? Or should I store the HTML document on S3 and keep only the stripped version in Elasticsearch? Does that make sense?

I don't know what happens when Elasticsearch indexes an HTML document, but I imagine it will index divs, spans, and attributes as well. These are things I definitely don't want to search for. So any suggestion on solving this problem would be great!
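To make the "stripped field plus HTML field" option concrete, here is a minimal sketch of stripping the markup on the client before indexing. It uses only Python's standard `html.parser`; the field names `html` and `text` and the helper `build_document` are hypothetical, and the whitespace handling is a rough approximation, not a reproduction of what Elasticsearch's `html_strip` does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    # Content of these elements is not visible text, so it is skipped.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse the whitespace that the
        # removed tags leave behind.
        return " ".join(" ".join(self._chunks).split())


def build_document(html):
    """Return the raw HTML plus a plain-text copy (hypothetical field names)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"html": html, "text": parser.text()}


if __name__ == "__main__":
    print(build_document("<span>some other content</span>"))
```

With a document shaped like this, only `text` would need to be searched; the `html` field could be kept for display and mapped with `"index": false`, or left out of Elasticsearch entirely if the original lives on S3.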

What am I doing now?

Right before I store a document in ES, I check whether the index exists for the document type. If not, I create it with a given mapping. The mapping looks like this:

    {
        "analysis": {
            "analyzer": {
                "htmlstripanalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": "standard",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        },
        "mappings": {
            "${type}": {
                "dynamic_templates": [
                    {
                        "_metadata": {
                            "path_match": "_metadata.*",
                            "mapping": {
                                "type": "keyword"
                            }
                        }
                    }
                ],
                "properties": {
                    "_tags": {
                        "type": "nested",
                        "dynamic": true
                    }
                }
            }
        }
    }

Warning: ignore the existing mappings. They have nothing to do with my intentions here; they were just already there.

I am replacing ${type} with the document type, let's say emails. What do I have to tell ES so that it does not index the HTML stuff?
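For illustration only, the `${type}` substitution could be done along these lines. This is a sketch, not the code actually used here: the template is a trimmed-down stand-in for the mapping above, and `render_mapping` is a hypothetical helper. Python's `string.Template` happens to use the same `${name}` placeholder syntax.

```python
import json
from string import Template

# Trimmed-down stand-in for the mapping above; only the placeholder matters.
MAPPING_TEMPLATE = Template("""
{
    "mappings": {
        "${type}": {
            "properties": {
                "_tags": {"type": "nested", "dynamic": true}
            }
        }
    }
}
""")


def render_mapping(doc_type):
    """Substitute the document type, then parse to catch broken JSON early."""
    return json.loads(MAPPING_TEMPLATE.substitute(type=doc_type))


if __name__ == "__main__":
    print(json.dumps(render_mapping("emails"), indent=2))
```

Parsing the rendered string with `json.loads` means a type name that breaks the JSON (for example one containing a quote) fails before anything is sent to ES.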

A complete test case:

    DELETE /test

    PUT /test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "htmlstripanalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["standard", "lowercase"],
              "char_filter": [
                "html_strip"
              ]
            }
          }
        }
      },
      "mappings": {
        "test": {
          "properties": {
            "html": {
              "type": "text",
              "analyzer": "htmlstripanalyzer"
            }
          }
        }
      }
    }

    POST /test/test/1
    {
      "html": "<td><tr>span<td></tr>"
    }

    POST /test/test/2
    {
      "html": "<span>whatever</span>"
    }

    POST /test/test/3
    {
      "html": "<html> <body> <h1 style=\"font-family: arial\">test</h1> <span>more test</span> </body> </html>"
    }

    POST /test/_search
    {
      "query": {
        "match": {
          "html": "span"
        }
      }
    }

    POST /test/_search
    {
      "query": {
        "match": {
          "html": "body"
        }
      }
    }

    POST /test/_search
    {
      "query": {
        "match": {
          "html": "more"
        }
      }
    }
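One way to see exactly which tokens the analyzer in the test case produces is to call the index's `_analyze` API with the same analyzer name. The sketch below builds such a request with Python's standard library. The address `http://localhost:9200` is an assumption, and `build_analyze_request` and `analyze` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable at this address.
ES_URL = "http://localhost:9200"


def build_analyze_request(index, analyzer, text, base_url=ES_URL):
    """Build (but do not send) a request to the _analyze API of an index."""
    body = {"analyzer": analyzer, "text": text}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(index, analyzer, text):
    """Send the request and return the token strings ES reports."""
    request = build_analyze_request(index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

Calling `analyze("test", "htmlstripanalyzer", "<span>whatever</span>")` against a running node would list the indexed terms, which shows directly whether `span` ends up among them.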
