Regex to match non-UTF8 characters?

I'd like a regular expression to match any non-UTF8 characters. I thought to use \p in my regex but I don't believe there's a property for UTF-8, unless I'm mistaken.

Thanks in advance!

Dave

Comments

Migrateduser

Might be overkill, but have you tried the utf8 package from Perl 5.8?

perldoc utf8
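
A minimal sketch of how the utf8 package could be used for this check, assuming the input is a raw byte string. The helper name looks_like_utf8 is made up for illustration. Note that utf8::decode follows Perl's own (fairly lenient) notion of well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $bytes is a well-formed UTF-8 octet sequence.
# utf8::decode converts its argument in place, so work on a copy.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

# "\xC3\xA9" is the UTF-8 encoding of e-acute; a lone "\xE9" is Latin-1.
print looks_like_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
print looks_like_utf8("caf\xE9")     ? "valid\n" : "invalid\n";
```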

Migrateduser

I was hoping for something lighter than that. I've done this before (a few years ago), but now that I'm in my 30s, I have trouble remembering such things! The utf8 package is a good fallback, though.

Any straight regex, though?

ISCBorisB

It is much easier to check whether the string IS valid UTF-8; see for example here. In Perl you can then use if ($string !~ m/utf-8-regex/)...
If your issue is a configuration file though, tough luck.
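
For illustration, here is a sketch of what such a validity regex might look like. The linked pattern is not reproduced in this thread, so this is an assumption: it uses the widely circulated byte-range pattern that follows the UTF-8 encoding rules (no overlong forms, no surrogates, nothing above U+10FFFF). The names $utf8_char and is_valid_utf8 are made up, and the input is assumed to be a byte string.

```perl
use strict;
use warnings;

# One well-formed UTF-8 encoded character, expressed as byte ranges.
my $utf8_char = qr/
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
/x;

# Hypothetical helper: true when the whole byte string is valid UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$utf8_char)*\z/ ? 1 : 0;
}

my $string = "caf\xE9";    # Latin-1 e-acute, not valid UTF-8
if (!is_valid_utf8($string)) {
    print "string contains non-UTF-8 bytes\n";
}
```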

Migrateduser

Wow, is THAT ugly! But it works well, thank you!

My situation is that I need to strip out all non-UTF8 characters from data that comes from an external source (SharePoint). The character set could be absolutely anything (and often is), so a translation function is not feasible. I'll map certain characters to their UTF8 equivalents, but anything else will be trashed.

Again, thanks.
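
The stripping step described above could be sketched roughly like this, assuming the same kind of byte-range pattern for valid UTF-8 and a byte-string input. The name strip_non_utf8 is hypothetical. Any character mapping would have to run before this step, since the unmappable bytes are simply dropped.

```perl
use strict;
use warnings;

# One well-formed UTF-8 encoded character, expressed as byte ranges.
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: keep every well-formed UTF-8 sequence and drop
# any single byte that does not start one.
sub strip_non_utf8 {
    my ($bytes) = @_;
    my $clean = '';
    while ($bytes =~ /(?:($utf8_char)|.)/gs) {
        $clean .= $1 if defined $1;    # $1 is undef when only "." matched
    }
    return $clean;
}

# The stray \xE9 is removed; the valid \xC3\xA9 pair is kept.
my $cleaned = strip_non_utf8("ok\xE9\xC3\xA9!");
print length($cleaned), " bytes kept\n";
```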

ISCBorisB

Go to Google and search for "Demoronizer". This little gem by John Walker has helped me deal with MS-generated code before.
If you use it, make sure it runs before your own symbol mapping, etc.

Not sure if it'll help you, but I think it's worth a try.