C# - Regex replace everything not in specific UTF-8 character set ranges (whitelist)


I'm trying to keep only the characters from a specific Latin character set and strip everything else, non-printable characters included: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x

My regex looks like this:

var output = Regex.Replace(input, @"[^\u0020-\u007e]|[^\u00a0-\u00ff]", string.Empty);

I have a problem with the 'line separator' (U+2028) specifically, but I want to exclude control characters as well, so I wanted a whitelist rather than a blacklist.

I'm trying to include U+0020 (space) through U+007E (tilde) or U+00A0 (no-break space) through U+00FF (Latin small letter y with diaeresis).

I've got the negation wrong on the sets, but I can't figure out how to solve it. Any ideas?
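A small sketch of why the alternation misbehaves: the two ranges do not overlap, so every character falls outside at least one of them, and one of the two negated classes always matches. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class NegationDemo
{
    // Hypothetical helper: applies the pattern from the question unchanged.
    public static string StripWithAlternation(string input)
    {
        // "not in range A" OR "not in range B" is true for every character,
        // because no character can be inside both disjoint ranges at once.
        return Regex.Replace(input, @"[^\u0020-\u007e]|[^\u00a0-\u00ff]", string.Empty);
    }

    static void Main()
    {
        // ASCII letters, a line separator, and a Latin-1 letter are all removed.
        string result = StripWithAlternation("abc\u2028\u00e9");
        Console.WriteLine(result.Length); // prints 0
    }
}
```

The whitelist needs "not in A and not in B", which is what a single negated class containing both ranges expresses.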

Update

The following appears to work:

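// The input contains a line separator (U+2028) between the two span tags.
var input = "</span><span>\u2028    </span><span>";
// A single negated class holding both ranges removes everything outside the whitelist.
var output = Regex.Replace(input, @"[^\u0020-\u007e\u00a0-\u00ff]", string.Empty);
// gives: </span><span>    </span><span>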

Working example: http://rextester.com/yciwtn86420
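A minimal sketch that wraps the single negated class in a reusable helper and checks it against U+2028, a control character, and a character just past U+00FF. `LatinWhitelist` and `KeepLatin1Printable` are hypothetical names, not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

static class LatinWhitelist
{
    // Matches any character outside U+0020..U+007E and U+00A0..U+00FF.
    private static readonly Regex NotAllowed =
        new Regex(@"[^\u0020-\u007e\u00a0-\u00ff]");

    // Hypothetical helper: removes every character that is not whitelisted.
    public static string KeepLatin1Printable(string input)
    {
        return NotAllowed.Replace(input, string.Empty);
    }

    static void Main()
    {
        // U+2028 (line separator), U+0007 (bell) and U+0100 are outside the whitelist;
        // the ASCII letters and U+00FF are inside it.
        string cleaned = KeepLatin1Printable("a\u2028b\u0007c\u00ffd\u0100");
        Console.WriteLine(cleaned == "abc\u00ffd"); // prints True
    }
}
```

Note that this also drops tab, carriage return, and line feed, since they sit below U+0020; add them to the class if they should survive.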

