Saturday, July 04, 2009

The Power of C.

A few days ago, I was discussing the following problem with one of my cow-orkers: given a string, map it to a small integer (0 - 4) in a fast and repeatable way. My gut feeling was that doing it in C beats anything you do in pure Perl, even if the C solution does more than required. Below is a benchmark with a couple of Perl solutions and solution that uses Digest::MD5.

The Perl solutions split up the string in small parts (1 char strings, or 4 char strings), then either (binary) xor the strings (effectively treating them as bitvectors), or just adding the code points of the characters. The 4 char splitting uses unpack to retrieve the number. There's also variation in getting the 1 or 4 char substrings: using split //, a regexp, or substr.

But whatever Perl method, it's still dwarfed by the C solution of Digest::MD5. Code and result follows.

#!/usr/bin/perl

use strict;
use warnings;
no  warnings 'syntax';
        
use Benchmark 'cmpthese';
use Digest::MD5 'md5';

#
# Find out the "fastest" way of taking a string, and mapping that
# to a small integer (0 .. $MAX - 1), roughly evenly.
#
     
our $MAX    = 5;
our @corpus = <DATA>; chomp @corpus;
@corpus     = () if 0;
        
#
# Add really long string.
#
my $str  = "";
   $str .= chr int rand 255 for 1 .. 1000;
push @corpus => $str if 1;

my $tasks;

$$tasks {md5} = <<'--';
    foreach my $str (@::corpus) {
        my $c  = ::md5 ($str);
        my $i  = ord (substr $c, -1) % $::MAX;
    }
--
my $task_ord = <<'--';
    foreach my $str (@::corpus) {
        my $s  = 0;
           $s += ord __INNER__;
        my $i  = $s % $::MAX;
    }
--
my $task_xor = <<'--';
    foreach my $str (@::corpus) {
        my $s  = "\x{00}";
           $s ^= __INNER__;
        my $i  = $s % $::MAX;
    }
--
my $task_pack = <<'--';
    foreach my $str (@::corpus) {
        my $s  = "\x{00}\x{00}\x{00}\x{00}";
           $s ^= __INNER__;
        my $i  = unpack ("N", $s) % $::MAX;
    }
--
   
$$tasks {$_ . "_ord"}  =  $task_ord for qw [regexp substr split];
$$tasks {regexp_ord}   =~ s'__INNER__
                           'for $str =~ /(.)/sg'x;
$$tasks {split_ord}    =~ s'__INNER__
                           'for split // => $str'x;
$$tasks {substr_ord}   =~ s'__INNER__
                           'substr $str, $_, 1 for 0 .. length ($str) - 1'x;
        
$$tasks {$_ . "_xor"}  =  $task_xor for qw [regexp substr split];
$$tasks {regexp_xor}   =~ s'__INNER__
                           '$_ for $str =~ /(.)/sg'x;
$$tasks {split_xor}    =~ s'__INNER__
                           '$_ for split // => $str'x;
$$tasks {substr_xor}   =~ s'__INNER__
                           'substr $str, $_, 1 for 0 .. length ($str) - 1'x;
     
$$tasks {$_ . "_pack"} =  $task_pack for qw [regexp substr];
$$tasks {regexp_pack}  =~ s'__INNER__
                           '$_ for $str =~ /(.{1,4})/sg'x;
$$tasks {substr_pack}  =~ s'__INNER__'
                           substr $str, 4 * $_, 4
                                   for 0 .. int (length ($str)) / 4 - 1'x;
     
cmpthese -1 => $tasks;

__DATA__
http://www.example.com/
http://www.example.com/some/longer/path/with/a/lot/of/parts/
http://www.example.com/almost/the/same1
http://www.example.com/almost/the/same2
               Rate regexp_ord regexp_xor split_ord split_xor substr_xor substr_ord regexp_pack substr_pack  md5
regexp_ord    627/s         --        -8%      -34%      -40%       -55%       -56%        -73%        -85% -97%
regexp_xor    684/s         9%         --      -28%      -35%       -51%       -52%        -71%        -84% -97%
split_ord     956/s        53%        40%        --       -9%       -31%       -33%        -59%        -77% -95%
split_xor    1046/s        67%        53%        9%        --       -25%       -27%        -55%        -75% -95%
substr_xor   1389/s       122%       103%       45%       33%         --        -3%        -40%        -67% -93%
substr_ord   1437/s       129%       110%       50%       37%         3%         --        -38%        -65% -93%
regexp_pack  2327/s       271%       240%      143%      122%        67%        62%          --        -44% -89%
substr_pack  4149/s       562%       506%      334%      297%       199%       189%         78%          -- -80%
md5         20958/s      3242%      2962%     2091%     1904%      1409%      1358%        801%        405%   --
Beside that fact that C is much faster than Perl, we can deduce that iterating over the string in chunks of 4 characters is faster than one char at the time, and that using substr to retrieve characters is faster than either split or a simple regexp.

Friday, July 03, 2009