The Perl solutions split the string into small parts (1-character or 4-character strings), then either XOR the parts together as strings (effectively treating them as bit vectors) or simply add up the code points of the characters. The 4-character variants use unpack to turn the resulting string into a number. The way the 1- or 4-character substrings are obtained also varies: split //, a regexp, or substr.
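To make the three reduction strategies concrete, here is a minimal standalone sketch of each, outside the benchmark harness. The sub names (sum_ord, xor_chars, xor_chunks) are hypothetical and chosen only for illustration, and the code assumes byte strings (no code points above 255) and that the bitwise feature is not enabled, so that ^ on two strings is a string XOR.

```perl
use strict;
use warnings;

my $str = "http://www.example.com/";   # sample input, taken from the corpus
my $max = 5;                           # number of buckets

# 1. Add up the code points of the individual characters.
sub sum_ord {
    my ($s) = @_;
    my $sum = 0;
    $sum += ord $_ for split // => $s;
    return $sum;
}

# 2. XOR the characters together as one-character strings.
sub xor_chars {
    my ($s) = @_;
    my $acc = "\x{00}";
    $acc ^= substr $s, $_, 1 for 0 .. length ($s) - 1;
    return ord $acc;
}

# 3. XOR chunks of up to 4 characters, then read the 4-byte result
#    as a 32-bit big-endian integer.
sub xor_chunks {
    my ($s) = @_;
    my $acc = "\x{00}" x 4;
    $acc ^= $_ for $s =~ /(.{1,4})/sg;
    return unpack "N", $acc;
}

printf "%-10s bucket %d\n", $_->[0], $_->[1]->($str) % $max
    for [sum_ord    => \&sum_ord],
        [xor_chars  => \&xor_chars],
        [xor_chunks => \&xor_chunks];
```

Any of the three ways of obtaining the substrings can be paired with any of the reductions; the pairing above is arbitrary.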
But whichever Perl method is used, it is still dwarfed by the C implementation in Digest::MD5. Code and results follow.
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'syntax';
use Benchmark 'cmpthese';
use Digest::MD5 'md5';
#
# Find out the "fastest" way of taking a string, and mapping that
# to a small integer (0 .. $MAX - 1), roughly evenly.
#
our $MAX = 5;
our @corpus = <DATA>; chomp @corpus;
@corpus = () if 0;
#
# Add really long string.
#
my $str = "";
$str .= chr int rand 255 for 1 .. 1000;
push @corpus => $str if 1;
my $tasks;
$$tasks {md5} = <<'--';
    foreach my $str (@::corpus) {
        my $c = ::md5 ($str);
        my $i = ord (substr $c, -1) % $::MAX;
    }
--
my $task_ord = <<'--';
    foreach my $str (@::corpus) {
        my $s = 0;
        $s += ord __INNER__;
        my $i = $s % $::MAX;
    }
--
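# After the XOR loop, $s is a one-character string, so take its
# code point before reducing it modulo $MAX.
my $task_xor = <<'--';
    foreach my $str (@::corpus) {
        my $s = "\x{00}";
        $s ^= __INNER__;
        my $i = ord ($s) % $::MAX;
    }
--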
my $task_pack = <<'--';
    foreach my $str (@::corpus) {
        my $s = "\x{00}\x{00}\x{00}\x{00}";
        $s ^= __INNER__;
        my $i = unpack ("N", $s) % $::MAX;
    }
--
$$tasks {$_ . "_ord"} = $task_ord for qw [regexp substr split];
$$tasks {regexp_ord} =~ s'__INNER__
'for $str =~ /(.)/sg'x;
$$tasks {split_ord} =~ s'__INNER__
'for split // => $str'x;
$$tasks {substr_ord} =~ s'__INNER__
'substr $str, $_, 1 for 0 .. length ($str) - 1'x;
$$tasks {$_ . "_xor"} = $task_xor for qw [regexp substr split];
$$tasks {regexp_xor} =~ s'__INNER__
'$_ for $str =~ /(.)/sg'x;
$$tasks {split_xor} =~ s'__INNER__
'$_ for split // => $str'x;
$$tasks {substr_xor} =~ s'__INNER__
'substr $str, $_, 1 for 0 .. length ($str) - 1'x;
$$tasks {$_ . "_pack"} = $task_pack for qw [regexp substr];
$$tasks {regexp_pack} =~ s'__INNER__
'$_ for $str =~ /(.{1,4})/sg'x;
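# The last chunk index is int ((length - 1) / 4), so a trailing chunk
# of 1 to 3 characters is included, as it is in the regexp variant.
$$tasks {substr_pack} =~ s'__INNER__'
substr $str, 4 * $_, 4
for 0 .. int ((length ($str) - 1) / 4)'x;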
cmpthese -1 => $tasks;
__DATA__
http://www.example.com/
http://www.example.com/some/longer/path/with/a/lot/of/parts/
http://www.example.com/almost/the/same1
http://www.example.com/almost/the/same2
               Rate regexp_ord regexp_xor split_ord split_xor substr_xor substr_ord regexp_pack substr_pack  md5
 regexp_ord   627/s         --        -8%      -34%      -40%       -55%       -56%        -73%        -85% -97%
 regexp_xor   684/s         9%         --      -28%      -35%       -51%       -52%        -71%        -84% -97%
  split_ord   956/s        53%        40%        --       -9%       -31%       -33%        -59%        -77% -95%
  split_xor  1046/s        67%        53%        9%        --       -25%       -27%        -55%        -75% -95%
 substr_xor  1389/s       122%       103%       45%       33%         --        -3%        -40%        -67% -93%
 substr_ord  1437/s       129%       110%       50%       37%         3%         --        -38%        -65% -93%
regexp_pack  2327/s       271%       240%      143%      122%        67%        62%          --        -44% -89%
substr_pack  4149/s       562%       506%      334%      297%       199%       189%         78%          -- -80%
        md5 20958/s      3242%      2962%     2091%     1904%      1409%      1358%        801%        405%   --
Besides the fact that C is much faster than Perl, we can deduce that iterating over the string in chunks of 4 characters is faster than one character at a time, and that using substr to retrieve characters is faster than either split or a simple regexp.
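For use outside the benchmark, the two fastest approaches could be wrapped up roughly as follows. This is only a sketch: the sub names bucket_xor4 and bucket_md5 are hypothetical, the input is assumed to be a byte string, and the bitwise feature is assumed to be off so that ^ on two strings is a string XOR.

```perl
use strict;
use warnings;
use Digest::MD5 'md5';

# Hypothetical helper: map a byte string to a bucket in 0 .. $max - 1 by
# XOR-ing 4-byte chunks and reading the result as a 32-bit integer.
sub bucket_xor4 {
    my ($str, $max) = @_;
    my $acc = "\x{00}" x 4;
    # int ((length - 1) / 4) is the index of the last chunk, so a trailing
    # chunk of 1 to 3 bytes is included; string XOR treats the shorter
    # operand as if it were padded with zero bits.
    $acc ^= substr $str, 4 * $_, 4 for 0 .. int ((length ($str) - 1) / 4);
    return unpack ("N", $acc) % $max;
}

# Hypothetical helper: same contract, using the last byte of the MD5 digest.
sub bucket_md5 {
    my ($str, $max) = @_;
    return ord (substr (md5 ($str), -1)) % $max;
}

for my $url ("http://www.example.com/almost/the/same1",
             "http://www.example.com/almost/the/same2") {
    printf "%s xor4=%d md5=%d\n",
        $url, bucket_xor4 ($url, 5), bucket_md5 ($url, 5);
}
```

The XOR variant only mixes bytes at the same position modulo 4, so it spreads similar strings less evenly than a real digest; given the timings above, the MD5 variant is probably the better default when Digest::MD5 is available.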
