The Perl solutions split up the string in small parts (1 char strings, or 4 char strings), then either (binary) xor the strings (effectively treating them as bitvectors), or just adding the code points of the characters. The 4 char splitting uses unpack to retrieve the number. There's also variation in getting the 1 or 4 char substrings: using split //, a regexp, or substr.
But whatever Perl method, it's still dwarfed by the C solution of Digest::MD5. Code and result follows.
#!/usr/bin/perl use strict; use warnings; no warnings 'syntax'; use Benchmark 'cmpthese'; use Digest::MD5 'md5'; # # Find out the "fastest" way of taking a string, and mapping that # to a small integer (0 .. $MAX - 1), roughly evenly. # our $MAX = 5; our @corpus = <DATA>; chomp @corpus; @corpus = () if 0; # # Add really long string. # my $str = ""; $str .= chr int rand 255 for 1 .. 1000; push @corpus => $str if 1; my $tasks; $$tasks {md5} = <<'--'; foreach my $str (@::corpus) { my $c = ::md5 ($str); my $i = ord (substr $c, -1) % $::MAX; } -- my $task_ord = <<'--'; foreach my $str (@::corpus) { my $s = 0; $s += ord __INNER__; my $i = $s % $::MAX; } -- my $task_xor = <<'--'; foreach my $str (@::corpus) { my $s = "\x{00}"; $s ^= __INNER__; my $i = $s % $::MAX; } -- my $task_pack = <<'--'; foreach my $str (@::corpus) { my $s = "\x{00}\x{00}\x{00}\x{00}"; $s ^= __INNER__; my $i = unpack ("N", $s) % $::MAX; } -- $$tasks {$_ . "_ord"} = $task_ord for qw [regexp substr split]; $$tasks {regexp_ord} =~ s'__INNER__ 'for $str =~ /(.)/sg'x; $$tasks {split_ord} =~ s'__INNER__ 'for split // => $str'x; $$tasks {substr_ord} =~ s'__INNER__ 'substr $str, $_, 1 for 0 .. length ($str) - 1'x; $$tasks {$_ . "_xor"} = $task_xor for qw [regexp substr split]; $$tasks {regexp_xor} =~ s'__INNER__ '$_ for $str =~ /(.)/sg'x; $$tasks {split_xor} =~ s'__INNER__ '$_ for split // => $str'x; $$tasks {substr_xor} =~ s'__INNER__ 'substr $str, $_, 1 for 0 .. length ($str) - 1'x; $$tasks {$_ . "_pack"} = $task_pack for qw [regexp substr]; $$tasks {regexp_pack} =~ s'__INNER__ '$_ for $str =~ /(.{1,4})/sg'x; $$tasks {substr_pack} =~ s'__INNER__' substr $str, 4 * $_, 4 for 0 .. int (length ($str)) / 4 - 1'x; cmpthese -1 => $tasks; __DATA__ http://www.example.com/ http://www.example.com/some/longer/path/with/a/lot/of/parts/ http://www.example.com/almost/the/same1 http://www.example.com/almost/the/same2
Rate regexp_ord regexp_xor split_ord split_xor substr_xor substr_ord regexp_pack substr_pack md5 regexp_ord 627/s -- -8% -34% -40% -55% -56% -73% -85% -97% regexp_xor 684/s 9% -- -28% -35% -51% -52% -71% -84% -97% split_ord 956/s 53% 40% -- -9% -31% -33% -59% -77% -95% split_xor 1046/s 67% 53% 9% -- -25% -27% -55% -75% -95% substr_xor 1389/s 122% 103% 45% 33% -- -3% -40% -67% -93% substr_ord 1437/s 129% 110% 50% 37% 3% -- -38% -65% -93% regexp_pack 2327/s 271% 240% 143% 122% 67% 62% -- -44% -89% substr_pack 4149/s 562% 506% 334% 297% 199% 189% 78% -- -80% md5 20958/s 3242% 2962% 2091% 1904% 1409% 1358% 801% 405% --Beside that fact that C is much faster than Perl, we can deduce that iterating over the string in chunks of 4 characters is faster than one char at the time, and that using substr to retrieve characters is faster than either split or a simple regexp.