Leerzeichen-Regex lässt StackExchange ausfallen? - #185151 (Allgemeines zu Perl)

clms

2016-07-26 18:10

User since
2010-08-29
373 Artikel
BenutzerIn
[default_avatar]

2016-07-26T15:05:01 renee
Das hat wiederum damit zu tun, dass Du eine "Zeichenkette" (das A) verwendest. Die Regex-Engine erkennt, dass dieser Substring nicht vorkommt und bricht dann ruckzuck ab...

OK, das ist der Grund warum die Suche nach 'A' am Ende so schnell war.

Ich hatte in der Zwischenzeit weiter experimentiert.

*$ habe ich weggelassen, weil das ja immer matched, und es stattdessen mit +$ probiert.

Insgesamt habe ich jetzt drei Inputstrings mit fünf Regex kombiniert:

Die Input-Strings:

MxA: "A" am Ende
MxB: "B" am Ende
MxS: kein Zeichen mehr hinter dem Whitespace

Die Regex:

M1x: /\s+A/ - also Whitespace vor A
M2x: /(?<=\S)\s+A/ - also Whitespace mit Anker als lookbehind Assertion vor A
M3x: /\s+$/ - also Whitespace vor Zeilenende
M4x: /(?<=\S)\s+$/ - also Whitespace mit Anker als lookbehind Assertion vor Zeilenende
M5x: /\S\s+$/ - also non-Whitespare vor Whitespace vor Zeilenende

Hier der Code:

more (3.1kb):

use strict;
use warnings;

use Benchmark qw(:all :hireswallclock) ;

our $inputA = "X".(" \t" x 10000)."A";
our $inputB = "X".(" \t" x 10000)."B";
our $inputS = "X".(" \t" x 10000);

our $m1A = 0;
our $m1B = 0;
our $m1S = 0;
our $m2A = 0;
our $m2B = 0;
our $m2S = 0;
our $m3A = 0;
our $m3B = 0;
our $m3S = 0;
our $m4A = 0;
our $m4B = 0;
our $m4S = 0;
our $m5A = 0;
our $m5B = 0;
our $m5S = 0;

my $c1A = <<'CODE';
my $s = $inputA;
$m1A++ if $s =~ m/\s+A/g;
CODE

my $c1B = <<'CODE';
my $s = $inputB;
$m1B++ if $s =~ m/\s+A/g;
CODE
my $c1S = <<'CODE';
my $s = $inputS;
$m1S++ if $s =~ m/\s+A/g;
CODE

my $c2A = <<'CODE';
my $s = $inputA;
$m2A++ if $s =~ m/(?<=\S)\s+A/g;
CODE

my $c2B = <<'CODE';
my $s = $inputB;
$m2B++ if $s =~ m/(?<=\S)\s+A/g;
CODE

my $c2S = <<'CODE';
my $s = $inputS;
$m2S++ if $s =~ m/(?<=\S)\s+A/g;
CODE

my $c3A = <<'CODE';
my $s = $inputA;
$m3A++ if $s =~ m/\s+$/g;
CODE

my $c3B = <<'CODE';
my $s = $inputB;
$m3B++ if $s =~ m/\s+$/g;
CODE

my $c3S = <<'CODE';
my $s = $inputS;
$m3S++ if $s =~ m/\s+$/g;
CODE

my $c4A = <<'CODE';
my $s = $inputA;
$m4A++ if $s =~ m/(?<=\S)\s+$/g;
CODE

my $c4B = <<'CODE';
my $s = $inputB;
$m4B++ if $s =~ m/(?<=\S)\s+$/g;
CODE

my $c4S = <<'CODE';
my $s = $inputS;
$m4S++ if $s =~ m/(?<=\S)\s+$/g;
CODE

my $c5A = <<'CODE';
my $s = $inputA;
$m5A++ if $s =~ m/\S\s+$/g;
CODE

my $c5B = <<'CODE';
my $s = $inputB;
$m5B++ if $s =~ m/\S\s+$/g;
CODE

my $c5S = <<'CODE';
my $s = $inputS;
$m5S++ if $s =~ m/\S\s+$/g;
CODE

timethis(100_000,$c1A);
print "M1A: $m1A\n";
timethis(100_000,$c1B);
print "M1B: $m1B\n";
timethis(100_000,$c1S);
print "M1S: $m1S\n";
timethis(100_000,$c2A);
print "M2A: $m2A\n";
timethis(100_000,$c2B);
print "M2B: $m2B\n";
timethis(100_000,$c2S);
print "M2S: $m2S\n";
timethis(100_000,$c3A);
print "M3A: $m3A\n";
timethis(100_000,$c3B);
print "M3B: $m3B\n";
timethis(100_000,$c3S);
print "M3S: $m3S\n";
timethis(100_000,$c4A);
print "M4A: $m4A\n";
timethis(100_000,$c4B);
print "M4B: $m4B\n";
timethis(100_000,$c4S);
print "M4S: $m4S\n";
timethis(100_000,$c5A);
print "M5A: $m5A\n";
timethis(100_000,$c5B);
print "M5B: $m5B\n";
timethis(100_000,$c5S);
print "M5S: $m5S\n";

Das Ergebnis (hier mit Perl 5.18.2) hat mich überraschend:

Code: (dl )

timethis 100000: 2.16451 wallclock secs ( 2.17 usr +  0.00 sys =  2.17 CPU) @ 46082.95/s (n=100000)
M1A: 100000
timethis 100000: 1.70105 wallclock secs ( 1.70 usr +  0.00 sys =  1.70 CPU) @ 58823.53/s (n=100000)
M1B: 0
timethis 100000: 1.70482 wallclock secs ( 1.71 usr +  0.00 sys =  1.71 CPU) @ 58479.53/s (n=100000)
M1S: 0
timethis 100000: 1.7609 wallclock secs ( 1.76 usr +  0.00 sys =  1.76 CPU) @ 56818.18/s (n=100000)
M2A: 100000
timethis 100000: 1.70132 wallclock secs ( 1.70 usr +  0.00 sys =  1.70 CPU) @ 58823.53/s (n=100000)
M2B: 0
timethis 100000: 1.70033 wallclock secs ( 1.70 usr +  0.00 sys =  1.70 CPU) @ 58823.53/s (n=100000)
M2S: 0
timethis 100000: 7.1719 wallclock secs ( 7.17 usr +  0.00 sys =  7.17 CPU) @ 13947.00/s (n=100000)
M3A: 0
timethis 100000: 7.17198 wallclock secs ( 7.17 usr +  0.00 sys =  7.17 CPU) @ 13947.00/s (n=100000)
M3B: 0
timethis 100000: 2.04794 wallclock secs ( 2.05 usr +  0.00 sys =  2.05 CPU) @ 48780.49/s (n=100000)
M3S: 100000
timethis 100000: 155.864 wallclock secs (155.84 usr +  0.00 sys = 155.84 CPU) @ 641.68/s (n=100000)
M4A: 0
timethis 100000: 155.899 wallclock secs (155.86 usr +  0.01 sys = 155.87 CPU) @ 641.56/s (n=100000)
M4B: 0
timethis 100000: 1.75641 wallclock secs ( 1.76 usr +  0.00 sys =  1.76 CPU) @ 56818.18/s (n=100000)
M4S: 100000
timethis 100000: 7.17382 wallclock secs ( 7.18 usr +  0.00 sys =  7.18 CPU) @ 13927.58/s (n=100000)
M5A: 0
timethis 100000: 7.18063 wallclock secs ( 7.18 usr +  0.00 sys =  7.18 CPU) @ 13927.58/s (n=100000)
M5B: 0
timethis 100000: 1.91674 wallclock secs ( 1.92 usr +  0.00 sys =  1.92 CPU) @ 52083.33/s (n=100000)
M5S: 100000

Wenn nach Whitespace vor dem Zeilenende gesucht wird, könnte plötzlich Backtracking ins Spiel kommen. Die Fälle M3A und M3B, bei denen Backtracking auftreten kann, weil es keinen Treffer gibt, sind plötzlich 5 mal langsamer als die vergebliche Suche nach Whitespace vor dem 'A'. Matched die Regex dagegen (=> kein Backtracking) ist die Geschwindigkeit so wie in den vergleichbaren Fällen mit Treffern (M1A, M2A, M4S, M5S).

Erschreckend sind die Zeiten für M4A und M4B (Mismatches mit lookbehind als Anker): Fast 100 mal langsamer als im Fall wo nach Whitespace vor 'A' gesucht wird und immer noch 20 mal langsamer als ohne lookbehind Assertion.

In den Fällen M5A und M5B entspricht die Zeit dagegen den Fällen M3A und M3B obwohl hier wegen des \S am Anfang der Regex kein Backtracking auftreten kann.