indexing - How to remove scripts and styles from the content field of Solr indexes when indexing through a URL?


Whenever Solr indexes a collection (configset sample_techproducts_configs) from a URL, via the following command:

bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3  

the indexes created have a field named content, which is a copied text field. This field holds the content of the web page, parsed using the embedded Tika parser.

But when the web page contains a <script> or <style> tag in <body>, only the tags are removed; the script or styles inside the respective tags remain in the content of the web page, as shown in the responses to Solr queries.

How can I remove this unwanted content?

I read the InputStream for DATA_MODE_WEB in SimplePostTool (only for pages whose content type is "text/html"), remove the <script> and <style> tags together with their content, and then convert the resulting content string back into a stream using stringToStream(String) in the readPageFromUrl(URL u) function.
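
A minimal sketch of the stripping step described above. It is not the actual SimplePostTool patch: the class name HtmlScriptStripper and both method names are hypothetical, and the regex is an assumption that is good enough for typical pages but is not a full HTML parser (for example, it would be confused by a literal "</script>" inside a JavaScript string). The idea is to apply it to the page body only when the content type is "text/html", before the text is handed back as a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: removes <script> and <style>
// elements (tags plus everything between them) from an HTML string.
public class HtmlScriptStripper {

    // (?i) case-insensitive, (?s) lets '.' match newlines.
    // \1 forces the closing tag to match the opening tag name.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Returns the HTML with all script and style elements removed.
    public static String stripScriptAndStyle(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Converts the cleaned string back to a stream (UTF-8 is an assumption
    // here), playing the role that stringToStream(String) plays in the answer.
    public static InputStream toStream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello</p><script>alert('x');</script></body></html>";
        // prints <html><head></head><body><p>Hello</p></body></html>
        System.out.println(stripScriptAndStyle(html));
    }
}
```

Because the cleaning happens before Tika sees the page, the content field should then contain only the visible text. A dedicated HTML parser would be more robust than a regex if the pages are malformed.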

