Thread Regex for spam filter with non-ASCII (33 answers)
Opened by GwenDragon at 2012-06-17 18:27

GwenDragon
 2013-05-24 10:17
No, \p{Alpha} cannot match here, because:
Quote
For code points below 256 ...
if locale rules are in effect ...

The POSIX class matches according to the locale,


You can read this in perlrecharclass:
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
There are various other synonyms that can be used besides the names listed in the table. For example, \p{PosixAlpha} can be written as \p{Alpha} . All are listed in Properties accessible through \p{} and \P{} in perluniprops, plus all characters matched by each ASCII-range property.

Both the \p counterparts always assume Unicode rules are in effect. On ASCII platforms, this means they assume that the code points from 128 to 255 are Latin-1, and that means that using them under locale rules is unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the POSIX character classes are useful under locale rules. They are affected by the actual rules in effect, as follows:
If the /a modifier, is in effect ...
    Each of the POSIX classes matches exactly the same as their ASCII-range counterparts.
otherwise ...
    For code points above 255 ...
        The POSIX class matches the same as its Full-range counterpart.
    For code points below 256 ...
        if locale rules are in effect ...
            The POSIX class matches according to the locale, except that word uses the platform's native underscore character, no matter what the locale is.
        if Unicode rules are in effect or if on an EBCDIC platform ...
            The POSIX class matches the same as the Full-range counterpart.
        otherwise ...
            The POSIX class matches the same as the ASCII range counterpart.
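
The cases in the quoted rules can be tried out with a short script. This is only a sketch, assuming Perl 5.14 or later (for the /a and /u modifiers) and no `use locale`; the variable names are made up for illustration, and the locale branch is deliberately left out because its result depends on the system.

```perl
use strict;
use warnings;

# Assumption: Perl 5.14+, no "use locale", no "use feature 'unicode_strings'".
my $low  = "\xE4";      # ä as a single code point below 256 (not UTF-8 flagged)
my $high = "\x{3B1}";   # Greek small alpha, a code point above 255

# /a in effect: POSIX classes match only their ASCII-range counterparts.
my $posix_a = $low =~ /\A[[:alpha:]]\z/a ? 1 : 0;

# Unicode rules in effect (/u): same as the Full-range counterpart.
my $posix_u = $low =~ /\A[[:alpha:]]\z/u ? 1 : 0;

# Neither locale nor Unicode rules: same as the ASCII-range counterpart.
my $posix_default = $low =~ /\A[[:alpha:]]\z/ ? 1 : 0;

# Code point above 255: the Full-range counterpart is used.
my $posix_high = $high =~ /\A[[:alpha:]]\z/ ? 1 : 0;

# \p{Alpha} always assumes Unicode rules, as the quoted text says.
my $prop = $low =~ /\A\p{Alpha}\z/ ? 1 : 0;

printf "%-32s %d\n", '[[:alpha:]] with /a, U+00E4',  $posix_a;
printf "%-32s %d\n", '[[:alpha:]] with /u, U+00E4',  $posix_u;
printf "%-32s %d\n", '[[:alpha:]] default, U+00E4',  $posix_default;
printf "%-32s %d\n", '[[:alpha:]] default, U+03B1',  $posix_high;
printf "%-32s %d\n", '\p{Alpha}, U+00E4',            $prop;
```

Under these assumptions only the /a case and the plain default case for U+00E4 fail to match.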


Did you overlook that the locale is set to en_EN.UTF-8, and that multilingual characters come in? That includes á or ö, but those are < \xff, so no Unicode rules apply in the regex.

//EDIT:
Look:
root@srv2 ~ # perl -E "say 'testärger' ~~ /^\p{Alpha}+$/i"

root@srv2 ~ #

Because of the locale, the ä is simply not an alpha character.
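
One way to check what the regex actually sees is to compare the undecoded bytes with the decoded character string. The following is only a sketch: it assumes the input arrives as UTF-8 encoded bytes (which this thread does not confirm for the setup in question), and it uses the core Encode module. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: 'testärger' arrives as raw UTF-8 bytes, so ä is \xC3\xA4.
my $bytes = "test\xC3\xA4rger";

# Decoding turns the two bytes into the single character U+00E4.
my $chars = decode('UTF-8', $bytes);

# Same pattern as in the one-liner above, applied to both forms.
my $bytes_match = $bytes =~ /^\p{Alpha}+$/i ? 1 : 0;
my $chars_match = $chars =~ /^\p{Alpha}+$/i ? 1 : 0;

printf "bytes: length %d, match %d\n", length($bytes), $bytes_match;
printf "chars: length %d, match %d\n", length($chars), $chars_match;
```

In the undecoded form the pattern sees \xC3 and \xA4 as two separate code points, and \xA4 (the currency sign in Latin-1) is not alphabetic, so the match fails; the decoded string matches. Whether this or the locale is the deciding factor on a given server would need to be checked there.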
Last edited: 2013-05-24 10:33:55 +0200 (CEST)
