indexing - How to remove scripts and styles from the content field of Solr indexes when indexing through a URL?
Whenever a Solr collection (configset sample_techproducts_configs) is indexed from a URL with the following command:
bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3
the indexed documents have a content field, which is a copied text field. Its value is the content of the web page as parsed by the embedded Tika parser.
However, when the web page contains a <script> or <style> tag inside <body>, the tags themselves are removed, but the script or style text inside them remains in the content of the web page and shows up in the responses to Solr queries.
How can I remove this unwanted content?
Do I need to read the InputStream for DATA_MODE_WEB in SimplePostTool (only for pages whose content type is "text/html"), remove the content of the <script> and <style> tags, and then convert the content_string back to a stream using stringToStream(String) in the readPageFromUrl(URL u) function?
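For illustration, the stripping step described in the question could look roughly like the sketch below. It is only a sketch under stated assumptions. The class HtmlCleaner and its methods stripScriptAndStyle and toStream are hypothetical names, not part of Solr. toStream stands in for the post tool's own string-to-stream helper. The sketch also assumes the page has already been read into a String and that a regular expression is good enough for well-formed <script> and <style> blocks. A real HTML parser would be more robust against malformed markup.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: removes <script> and <style>
// elements (tags and their content) from an HTML string.
class HtmlCleaner {

    // (?i) ignores case, (?s) lets '.' match newlines.
    // \1 makes the closing tag match the opening tag name.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Returns the HTML with every <script>...</script> and
    // <style>...</style> block removed.
    static String stripScriptAndStyle(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Assumption: stands in for the tool's own string-to-stream helper.
    static InputStream toStream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello</p><script>alert('x');</script></body></html>";
        // prints <html><head></head><body><p>Hello</p></body></html>
        System.out.println(stripScriptAndStyle(page));
    }
}
```

The cleaned string would then be turned back into a stream (here via toStream) before it is posted, and only when the response content type is "text/html", so that other document types pass through untouched.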