When accessing individual characters in a string in Perl, is substr or splitting to an array faster? – Education Career Blog

I’m writing a Perl script in which I need to loop over each character of a string. There’s a lot of strings, and each is 100 characters long (they’re short DNA sequences, in case you’re wondering).

So, is it faster to use substr to extract each character one at a time, or is it faster to split the string into an array and then iterate over the array?

While I’m waiting for an answer, I suppose I’ll go read up on how to benchmark things in Perl.

,

It really depends on exactly what you’re doing with your data — but hey, you’re headed the right way with your last question! Don’t guess, benchmark.

Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here’s a little sample code to get started with:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= qw(G A T C)->rand 4 for 1 .. 100;

sub frequency_substr {
  my $length = length $dna;
  my %hist;

  for my $pos (0 .. $length) {
    $hist{$pos}{substr $dna, $pos, 1} ++;
  }

  \%hist;
}

sub frequency_split {
  my %hist;
  my $pos = 0;
  for my $char (split //, $dna) {
    $hist{$pos ++}{$char} ++;
  }

  \%hist;
}

sub frequency_regmatch {
  my %hist;

  while ($dna =~ /(.)/g) {
    $hist{pos($dna)}{$1} ++;
  }

  \%hist;
}


cmpthese(-5, # Run each for at least 5 seconds
  { 
    substr => \&frequency_substr,
    split => \&frequency_split,
    regex => \&frequency_regmatch
  }
);

And a sample result:

         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --

Turns out substr is surprisingly fast. 🙂

,

Here is what I would do instead of first trying to choose between substr and split:

#!/usr/bin/perl

use strict; use warnings;

my %dist;
while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}

Update:

My curiosity got the best of me. Here is a benchmark:

#!/usr/bin/perl

use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);
my @to_split = my @to_substr = my @to_match = map {
    join '', map $charsrand @chars, 1 .. 100
} 1 .. 1_000;

cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s$i };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };
        }
    }
}

Output:

         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --

,

I have an example in Mastering Perl dealing with this problem. Do you want to create a bunch of individual scalars, each of which carries around the memory overhead of a Perl scalar, or store everything in a single string to reduce memory but maybe do more work. You say that you have a lot of these, so leaving them as single strings might work out much better for you if you are worried about memory.

Mastering Perl also has a couple chapters dealing with benchmarking and profiling, if you’re curious about those.

Ether says to get it working first and worry about the rest later. Part of that is hiding the operations behind a task-oriented interface. A nice object-oriented module can do that for you. If you don’t like the implmentation, you change it. However, the programs at the higher level don’t have to change because the interface stays the same.

Leave a Comment