Storing HTML Documents in Elasticsearch
Scenario
I have HTML documents, let's say emails. I want to store these in Elasticsearch and search the plain text of the HTML emails.
Problem
Elasticsearch would index the HTML tags and attributes, too. I don't want that. I want a search for span to match only if the word appears in the plain text, not as an HTML element. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.
Question
What would you recommend: storing an HTML-stripped field and an HTML field in the same document? Or should I store the HTML document on S3 and leave only the stripped version in Elasticsearch? Does that make sense?
I don't know what happens when Elasticsearch indexes an HTML document, but I imagine it will index the divs, spans, and attributes as well. These are things I definitely don't want to search for. So any suggestion on solving this problem would be great!
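To make the first option concrete (one document carrying both the raw HTML and a stripped copy), here is a minimal sketch that strips the markup on the client before indexing. It uses only Python's standard-library `html.parser`; the field names `html` and `text` and the helper names are hypothetical, not something Elasticsearch requires.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tag names and attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: text inside <script> and <style> also arrives here.
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space, then collapse whitespace runs.
    # This can split a word that is interrupted by an inline tag.
    return " ".join(" ".join(parser.parts).split())


def build_document(html):
    # "html" and "text" are hypothetical field names for this sketch.
    return {"html": html, "text": html_to_text(html)}


doc = build_document("<span>some other content</span>")
print(doc["text"])
```

With this approach only the `text` field would need to be searched, and the `html` field could be kept for display or moved to external storage such as S3.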
What am I doing now?
Right before I store a document in ES, I check whether the index exists for the document type. If not, I create the collection with the given mapping. The mapping looks like this:
{
  "analysis": {
    "analyzer": {
      "htmlstripanalyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": "standard",
        "char_filter": [ "html_strip" ]
      }
    }
  },
  "mappings": {
    "${type}": {
      "dynamic_templates": [
        {
          "_metadata": {
            "path_match": "_metadata.*",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "properties": {
        "_tags": { "type": "nested", "dynamic": true }
      }
    }
  }
}
Warning: ignore the existing mappings. They have nothing to do with my intentions here; they are just there.
I am replacing ${type} with the document type, let's say emails. What do I need to tell ES so that it does not index the HTML stuff?
A complete test case:
DELETE /test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlstripanalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"],
          "char_filter": [ "html_strip" ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "htmlstripanalyzer"
        }
      }
    }
  }
}

POST /test/test/1
{ "html": "<td><tr>span<td></tr>" }

POST /test/test/2
{ "html": "<span>whatever</span>" }

POST /test/test/3
{ "html": "<html> <body> <h1 style=\"font-family: arial\">test</h1> <span>more test</span> </body> </html>" }

POST /test/_search
{ "query": { "match": { "html": "span" } } }

POST /test/_search
{ "query": { "match": { "html": "body" } } }

POST /test/_search
{ "query": { "match": { "html": "more" } } }
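One way to check what the analyzer in the test case actually emits is Elasticsearch's `_analyze` endpoint, which returns the tokens produced for a piece of text. The sketch below only builds the request; the helper name `build_analyze_request` is hypothetical, and sending it is left to whatever HTTP client or console is already in use.

```python
import json


def build_analyze_request(index, analyzer, text):
    """Return (method, path, body) for an _analyze call on one index."""
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", f"/{index}/_analyze", body


# Index and analyzer names are taken from the test case above.
method, path, body = build_analyze_request(
    "test", "htmlstripanalyzer", "<span>whatever</span>"
)
print(method, path)
print(body)
```

If the `html_strip` character filter is applied as intended, the response should list tokens for the text content only and none for tag names such as `span`.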