How to remove scripts and styles from the content field of Solr indexes when indexing through a URL?
Whenever Solr indexes a collection (created with the configset sample_techproducts_configs) from a URL, using the following command:
bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3
the resulting documents have a content field, which is a copied text field. This field holds the content of the web page as parsed by the embedded Tika parser.
However, when the web page contains <script> or <style> tags inside <body>, they are not removed: the scripts or styles inside those tags remain in the content of the web page and are shown in the responses to Solr queries.
How can I remove this unwanted content?
Do I have to read the InputStream for DATA_MODE_WEB in SimplePostTool (only for pages whose content type is "text/html"), remove the content of the <script> and <style> tags, and then convert the content string back to a stream using stringToStream(String) in the readPageFromUrl(URL u) function?
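To make the idea in the question concrete, here is a minimal sketch of the string-cleaning step only. It is not part of Solr: the class HtmlCleaner and its methods stripScriptAndStyle and toStream are hypothetical names chosen for illustration, and toStream merely stands in for the stringToStream(String) helper mentioned above. It assumes the page has already been read into a String and uses a regular expression, which is fine for simple pages but is not a full HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SimplePostTool.
final class HtmlCleaner {
    // (?i) ignores case, (?s) lets '.' match newlines.
    // \1 requires the closing tag to match the opening tag name.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Removes every <script>...</script> and <style>...</style> element,
    // including the tags themselves.
    static String stripScriptAndStyle(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Assumed stand-in for a string-to-stream conversion (UTF-8 encoded).
    static InputStream toStream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p><script>var x = 1;</script>"
            + "<style>p { color: red; }</style></body></html>";
        // prints <html><body><p>Hello</p></body></html>
        System.out.println(stripScriptAndStyle(html));
    }
}
```

If this approach were taken, the cleaning would presumably be applied only to responses whose content type is "text/html", with the cleaned string converted back to a stream before it is posted. A regex can miss edge cases such as a ">" inside an attribute value, so a real HTML parser may be a safer choice.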