Ikanobori Weblog

Well, one of my posts caused quite a bit of attention so I am revisiting the subject. I’ll start by explaining what the idea is:

I have a text file containing numbers, a random amount of numbers per line and the numbers are of random size. Can you quickly give me a sum of all numbers? The file has to be read from stdin.

You can find the example file I used here.

I made a solution in Python, using generator comprehensions. However, a battle ensued in the comments to create the quickest version of this program and after a few comments about 30% was shaven off the time my Python program took. Then things took a turn, comments started dripping in of other languages. Haskell, PHP, Ruby and C amongst others were submitted.

Thus I have decided to create a little posting about this specific problem. Firstly I have compiled and run all the different versions and languages on my hardware and timed then. I ran every tool 10 times and averaged the result which resulted in this graph:

Source codes are here, however, if you think you can do any better than this please submit a link to a pastebin in the comments with some explanation and I’ll update this post and graph. I’d like for funky languages to be included as well, as long as I can compile it on my Debian server!

Results (scroll down for credits and source code for each language, think you can do better or have other languages? comment!):

  • PHP: 2.53 s
  • x86 asm: 0.04 s
  • Ruby 1.9.1: 3.59 s
  • C: 0.04 s
  • Python: 2.18 s
  • Haskell: 0.05 s
  • gawk: 1.61 s
  • Perl: 4.74 s
  • Vala: 2.01 s

I also ran the fastest three on a large input file. Haskell fails, giving the wrong number.

  • C: 2.91 s
  • x86 asm: 4.39 s

Clearly, C and Haskell reign supreme even knowing more about the processor instructions then RayNbow’s x86 solution. Python is fastest at the interpreted languages. I ran the Ruby sample on Ruby 1.8 as well which took 20ish seconds, Ruby is surely stepping up it’s game, speedwise, still lacking a bit behind PHP though.

C (luite) (gcc -O2):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char** argv) {
  uint64_t sum=0;
  char buf[8192];
  int cur=0;
  int i,j;
  while((i=fread(buf,1,8192,stdin))>0) {
    for(j=0;j<i;j++) {
      char c = buf[j];
      if(c >= '0' && c <= '9') {
  cur = (cur*10)+c-48;
      } else {
  sum+=cur;
  cur=0;
      }
    }
  }
  sum+=cur;
  printf("total: %llu\n", sum); /* what is the correct format type for uint64_t? */
}

Haskell (Don Stewart):

{-# LANGUAGE BangPatterns #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- A lazy bytestring solution
--

import qualified Data.ByteString.Char8 as S

main = do
    putStr "Grand total "
    print . go 0 =<< S.getContents
  where
    go !n !s = case S.readInt s of -- lazily reads current line
                    Nothing     -> n
                    Just (k,t)  -> go (n+k) (S.tail t)


{-

-- Important: turn on optimizations:

$ ghc -O2 --make A.hs             
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

$ cksum input_list_vs_gen.txt 
2913999841 17293232 input_list_vs_gen.txt

$ du -hs input_list_vs_gen.txt 
17M input_list_vs_gen.txt

$ time ./A < input_list_vs_gen.txt 
Grand total 17677692470
./A < input_list_vs_gen.txt  0.19s user 0.06s system 98% cpu 0.254 total

-}

PHP (John Piasetzki):

while($line = fgets(STDIN))
  $s += array_sum(explode(' ',$line));
echo $s;

Python (Pythoneer):

#!/usr/bin/env/python
import sys
from itertools import imap

print sum(sum(imap(int, line.split())) for line in sys.stdin)

Ruby (Thomas Hurst):

#!/usr/bin/env ruby
total = 0
STDIN.each_line {|line| line.split(" ").each {|x| total += x.to_i } }
puts "Grand total: #{total}"

x86 asm (James Block):

; for NASM, a 386 or better, and a 2.6-series Linux kernel
; it should run fine on a 64-bit system with appropriate flags for NASM
; assemble with nasm -f elf FILENAME.asm && ld FILENAME.o -o OUTPUTFILENAME
; run as ./OUTPUTFILENAME < INPUT_NUMBERS.txt
; expect weirdness with nonconforming input. this is pure assembly, you know
; each input number should be at most 32 bits, and the sum at most 64 bits

buffersize equ (1<<16)	;use 64kB chunks; can be set to any value
			;64kB is fastest on this old Willamette P4

section .data
	message: db "Grand total:"
	printbuffer: times 23 db 0x20
	dw 0x0A
	msglen equ $ - message

section .text
	global _start

_start:
	;we want to read a bunch of numbers from STDIN
	;we'll do this by allocating memory with mmap()
	;and then reading stdin with read()
	;
	;mmap signature:
	; void* mmap(void* start, size_t length, int prot, int flags,
	;	int fd, off_t offset);
	; (defined in <sys/mman.h>)
	;
	;arguments:
	; start - pass 0, usually ignored
	; length - pass $BIGNUM to allocate a $BIGNUM chunk
	; prot - pass PROT_READ | PROT_WRITE (0x01|0x02)
	; flags - pass MAP_PRIVATE (0x02) | MAP_ANONYMOUS (0x20)
	; fd - pass anything (-1)
	; offset - pass anything
	;
	; linux calling convention: 6-arg syscalls use the stack
	;   push arguments in REVERSE order
	;
	; system call number goes in eax:
	;  exit:	1
	;  read:	3
	;  write:	4
	;  mmap:	90
	;  munmap:	91

	mov esi, buffersize	;store buffer size in esi

	push dword 0	;offset
	push dword -1	;fd
	push dword 0x22	;flags - MAP_PRIVATE|MAP_ANONYMOUS
	push dword 0x03	;prot - PROT_READ|PROT_WRITE
	push dword esi	;allocation size
	push dword 0	;start
	mov eax, 90
	mov ebx, esp

	int 80h
	add esp, (4*6)	;clean up the refuse from calling mmap()

	;now eax holds the start address for our memory! yay!
	;let's put it in ebp instead, since ebp isn't good for much else
	mov ebp, eax

	;now let's read() in the data from stdin
	;read() signature:
	; ssize_t read(int fd, void* buf, size_t count)
	;3-arg syscalls pass arguments in registers, so
	;this one's much easier than mmap():

	mov ecx, eax
	mov ebx, 0
	mov eax, 3
	mov edx, esi
	int 80h

	;accumulate in the traditional pairing edx:eax
	;let ecx index our position in the memory block

	;read digits into ebx, accumulate into numbers in edi
	; then sum edi to the main accumulator, edx:eax
	; anything but a digit ends a number
	; zero byte ends all counting

	xor edx, edx
	xor eax, eax
	xor ecx, ecx
	dec ecx

readdigit:
	xor edi, edi
readdigit_lp:
	xor ebx, ebx
	inc ecx
	cmp ecx, esi
	je readmore
gotmore:mov bl, byte [ebp+ecx]
	test bl, bl
	jz finished_summing		;exit on zero byte

	sub bl, 48
	js next_number
	cmp bl, 9
	ja next_number	;if this is not a digit, skip it and sum

	shl edi, 1
	lea edi, [edi*4 + edi]
	add edi, ebx
	jmp readdigit_lp

next_number:
	add eax, edi
	jnc readdigit
	inc edx
	jmp readdigit

readmore:
	mov esp, eax
	mov eax, 3
	xor ebx, ebx
	mov ecx, ebp
	xchg edx, esi
	int 80h

	;if we didn't fill the whole buffer, stick a zero byte after the
	; bit we *did* read, to ensure proper termination of the loop

	cmp eax, edx
	je readcleanup
	mov byte [ebp+eax+1], 00
readcleanup:
	xor ecx, ecx
	xchg edx, esi
	mov eax, esp
	jmp gotmore

finished_summing:
	;edx:eax contains the 64-bit sum
	;now we have to print it
	;the following is an EXTREMELY ugly printf(%llu) substitute

	mov ecx, (printbuffer+22)
	mov ebp, 10
	mov esp, 429496729

printlp64:
	;if we can bail on this horrid 64-bit print loop
	; and use a much saner 32-bit version,
	; then, damn it, EXIT NOW
	test edx, edx
	jz printlp32

	;algorithm (64-bit divmod): let W = 2**32, Z = X + Y*W
	; (i.e., X is eax and Y is edx)
	; then let M = 10 (modulus) and define
	; X divmod M = (x_D, x_M)   (the definition of divmod:
	; Y divmod M = (y_D, y_M)   (quotient, remainder))
	; W divmod M = (W_D, W_M)
	;
	; let u = x_M + y_M * W_M and calculate
	; u divmod M = (u_D, u_M)
	;
	; then the following is true:
	; Z divmod M = (y_D * W + x_D + y_M*W_D + u_D, u_M)
	;
	; the following pile of junk implements the above
	;  (and allows for overflow in a couple places)

	mov edi, eax
	mov eax, edx
	xor edx, edx
	div ebp
	lea ebx, [edx*3]
	shl ebx, 1
	mov esi, eax
	mov eax, edx
	mul esp
	;edx:eax has y_M*W_D, ebx has y_M*W_M, edi has x, esi has y_D
	add esi, edx
	xor edx, edx
	xchg eax, edi
	div ebp
	;eax has x_D, edx has x_M, esi has y_D, edi has y_M*W_D
	add ebx, edx	;ebx has u
	add edi, eax
	xor edx, edx
	mov eax, ebx
	div ebp
	;eax has u_D, edx has u_M
	add dl, 0x30
	mov byte [ecx], dl
	dec ecx
	mov edx, esi
	add eax, edi
	jnc printlp64
	inc edx
	jmp printlp64

printlp32:
	div ebp
	add dl, 0x30
	mov byte [ecx], dl
	dec ecx
	xor edx, edx
	test eax, eax
	jnz printlp32

EXIT:
	mov eax, 4
	mov ebx, 1
	mov ecx, message
	mov edx, msglen
	int 80h

	mov eax, 1
	mov ebx, 0
	int 80h

Perl (Michael Pyne):

#!/usr/bin/env perl

use strict;

my $sum = 0;
while(<STDIN>) {
my @nums = split(' ');
foreach my $num (@nums) { $sum += $num; }
}

print "$sum\n";

Gawk (Eric Wong):

gawk 'END { printf "%d\n", n } { for (i=NF;i>0;--i) n += $i}'

Vala (snuxoll):

using GLib;

namespace snuxoll {
    class StdinReader : Object {

        public StdinReader () {

        }

        public bool eof() {
            return stdin.eof();
        }

        public string? read_line() {
            var line = new StringBuilder();
            var buffer = new char[1024];
            while ( ( !line.str.has_suffix ("\n") ) || this.eof() ) {
                string chunk = stdin.gets(buffer);
                if (chunk != null) {
                    line.append(chunk);
                } else {
                  return null;
                }
            }
            return line.str;
        }

    }
}

// start other file

using GLib;
using snuxoll;

class SumApplication : Object {

    private uint64 total { get; set; }

    public SumApplication() {
        this.total = 0;
    }

    public void run() {
        var reader = new StdinReader();
        while ( !stdin.eof() ) {
            var line = reader.read_line();
            if ( line != null ) {
              var nums = line.strip().split(" ");
              foreach(string num in nums) {
                    this.total += num.to_int();
                }
            }
        }
        stdout.printf("Grand Total: %llu\n", total);
    }

    static int main(string[] args) {
        var app = new SumApplication();
        app.run();
        return 0;
    }

}
88 COMMENTS
orgthingy
April 12, 2009

you can’t really judge for C code for example, in gcc you can have “performance over complination speed” and such, and you can use other compilers and will get different performance (like Tinycc or IntelCC) so you can’t really judge by that

but i do think snuxoll’s code is “Clean” as he usually is, well i thought he wasn’t that good even though he seemed to be in IRC..but after all he is a good coder :-)

Don Stewart
April 12, 2009

A faster Haskell version (are you compiling it with GHC, and using -O2)?

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3684#a3684

Runs in 0.2s on my core2. I can’t reproduced your 2s figure for the current example.

April 12, 2009

@Don Stewart:
I compiled the Haskell variant with -O2 yes. I’ll run your code now.

RayNbow
April 12, 2009

@dons
That one uses 64 bit Ints? I adapted the original shootout code because the machine I ran it on was 32 bits.

Don Stewart
April 12, 2009

Ok. readInteger might be more appropriate (if the spec requires arbitrary sized integers?).

Just as a sanity check, you should be getting a similar spread to the same benchmark on the shootout: http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

April 12, 2009

@Don Stewart:
Thanks for that link, also, I ran you code and it takes 0.29 seconds on the first run.

I’ll average it now and update the post.

Thomas Hurst
April 12, 2009

A minor change to the Ruby version simplifies it, uses less memory, and is almost twice as fast: http://rafb.net/p/beY0XQ71.html

luite
April 12, 2009

this C version is a bit faster:

http://pastebin.com/f26899b53

April 12, 2009

@Thomas Hurst:
Are you sure it’s not your file system cache playing with you? I’m getting comparable times from both Ruby versions, yours being a tiny tad slower.

Thomas Hurst
April 12, 2009

And modified to run a benchmark: http://rafb.net/p/UUlEjI48.html

Since it does a warmup run first, this one can be run under JRuby with good results:

Ruby 1.8.6: 7.265s
Ruby 1.9.1: 3.464s
JRuby 1.2.0: 2.828s
JRuby 1.2.0 –fast: 2.423s

April 12, 2009

@Thomas Hurst:
I see you are running a newer version than my Ruby, you have 1.9.1 which has some improvements over 1.9.0, that would explain the difference in speeds.

Thomas Hurst
April 12, 2009

Well, here you go: http://rafb.net/p/kUhFK118.html

Ruby 1.9.1:
Mine: 2.975s
Yours: 4.278s

Yours just runs JRuby out of memory here :/

April 12, 2009

@Thomas Hurst:
Please note, I did not write the Ruby solution or any of the fastest solutions listed here.

I’ll try and get Ruby 1.9.1 running on this box and run yours.

Anon
April 12, 2009

It might be worth replacing:
imull $10, %ebx
addl %eax, %ebx
with (intel asm, I don’t know how to convert it :s):
lea ebx, [ebx + 4*ebx] ; ebx *= 5
lea eax, [eax + 2*ebx] ; eax += 2*ebx
This removes the potentially slow imul and is probably similar to what the C compiler does.

Eric Wong
April 12, 2009

gawk ‘END { printf “%d\n”, n } { for (i=NF;i>0;–i) n += $i}’ < input_list_vs_gen.txt

gawk is decently fast for a scripting language, mawk is faster but overflows.

snuxoll
April 12, 2009

@ikanobori I wrote a better ruby solution, you seem to have used my old one.
http://gist.github.com/93128

Eric Wong
April 12, 2009

Erm, that’s a dashdash i to decrement. And obviously gawk does not use smart quotes :x

April 12, 2009

@Eric Wong:
I have learnt to read across Wordpress mishaps like that. I’ll run it in a few.

First I’m looking at snuxoll’s and Thomas Hurst’s solutions.

munchhausen
April 12, 2009

re: Your Ruby example

This approach – http://www.reddit.com/r/programming/comments/8bsy6/ruby_php_python_c_haskell_x86_asm_shootout/c08st58 – seems to be a bit faster.

Ken Hoxworth
April 12, 2009

This Ruby version seems to run a little faster on my system, but YMMV: http://gist.github.com/93796

I personally hate Ruby strings and try to avoid them whenever possible.

April 12, 2009

@Ken Hoxworth
Averages at ≈ 4 seconds, bit slower then the one by Thomas Hurst.

Isaac Gouy
April 12, 2009

> a random amount of numbers per line and the numbers are of random size

Do you have a specific reason to think the relative performance between language implementations will be any different than it was with the simple one number per line?

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Philip Jenvey
April 12, 2009

Python’s itertools.imap over map gives a small speedup

April 12, 2009

@Isaac Gouy:
None, whatsoever. This is just a spin off of an earlier article I wrote.

Joao Pedrosa
April 12, 2009

Here’s a simpler Ruby version:

total = 0
STDIN.each_line{|s|s.split(’ ‘).each{|s|total+=s.to_i}}
puts total

Joao Pedrosa
April 12, 2009

Oops. Just now I saw that the Ruby version had been updated with something similar to what I wrote. :P

Anyway… Let’s see if I can come up with another type of contribution to this.

Matt Gallagher
April 12, 2009

At 0.04 and 0.05 seconds for the fastest versions, I’d be worried that you’re not testing the code but the loading of the executable and runtime. I’d be interested to see this run on test data that takes at least 5 seconds for the fastest case.

Michael Pyne
April 12, 2009

Qt4/C++:
// g++ -o qt4-sum -O3 -I$QTDIR/include/{,Qt,QtCore} -L$QTDIR/lib -lQtCore sum.cpp
#include

int main()
{
quint64 sum = 0;
QFile standardInput;

standardInput.open(0, QFile::ReadOnly);

QByteArray line;
bool ok;
while(!standardInput.atEnd()) {
line = standardInput.readLine(8192);
QList nums = line.simplified().split(’ ‘);
Q_FOREACH(const QByteArray &num, nums) {
sum += num.toUInt(&ok);
}
}

// Ensure output is generated by using qCritical
// instead of qDebug
qCritical() << sum;
return 0;
}

Perl:
#!/usr/bin/env perl

use strict;

my $sum = 0;
while() {
my @nums = split(’ ‘);
foreach my $num (@nums) { $sum += $num; }
}

print “$sum\n”;

Timings (Avg 5 warm-cache runs):
Qt4: 1.672 s
Perl: 2.703 s

For reference, the Python version from Pythoneer gives me 3.564 s on this system with the same timing method.

Michael Pyne
April 12, 2009

Note that the while() I posted in the Perl version is supposed to have <STDIN> but the HTML-tag-alike was stripped out.

Michael Pyne
April 12, 2009

And likewise, #include <QtCore> for the Qt4 version.

gso3
April 12, 2009

no java?

jon
April 12, 2009

Dude I read slashdot and so that means you’re a fool and an idiot

John Piasetzki
April 12, 2009

This php solution runs faster on my machine (YMMV). http://gist.github.com/93848

blah
April 12, 2009

The naive and general lisp

(format t “Grand total: ~A~%”
(loop as n = (read nil nil nil) while n summing n))

takes a few secs on my system …but it sure was easy to write…

blah
April 12, 2009

Basic ports of C code approaches to common lisp normally yield times slower than C but faster than usual “scripting languages”, and indeed following gets about 0.7 secs on my system with clozure cl -

(defun cintsum (stream)
(loop with buf of-type (simple-string 8192) = (make-string 8192)
with position fixnum = 8192
with cur fixnum = 0
with sum fixnum = 0
while (> position 0)
do (progn
(setf position (read-sequence buf stream))
(loop
for n upto (1- position)
do (let ((digit
(case (aref buf n)
(# 0)
(#\1 1) (#\2 2) (#\3 3)
(#\4 4) (#\5 5) (#\6 6)
(#\7 7) (#\8 8) (#\9 9))))
(if digit
(setf cur (+ (the fixnum (* cur 10)) digit))
(progn
(incf sum cur)
(setf cur 0))))))
finally (progn
(incf sum cur)
(return sum))))

(format t “Grand total: ~A~%” (cintsum *standard-input*))

[...] Originally posted here: Ikanobori Weblog » Language performance on the sum of a file. [...]

Jeff
April 12, 2009

A more efficient way to do this in PHP would be:

Fynn
April 12, 2009

Faster Python version using lazy imap – shaves another 0.3s on my machine:

#!/usr/bin/env/python
import sys
from itertools import imap

print sum(sum(imap(int, line.split())) for line in sys.stdin)

Joao Pedrosa
April 12, 2009

OK. I bit the bullet and challenged myself by giving a new environment a shot and it worked. I made use of the Google v8 JavaScript engine but using it with a few standard libraries in the following repository:
http://github.com/isaacs/k7/tree/master

I created the example from scratch and implemented a couple of versions but found the first version to be better and here it goes, JavaScript for the win now:

function readLines(file_path,fn){
var px=system.posix,
fd, s, z, i, len, offset=0,
NL=’\n’,
ES=”",
BS=8096,
unused=[],
last_i;
fd = px.fopen(file_path,”r”);
try{
while((s=px.fread(1,BS,fd)).length>0){
len=s.length;
last_i=len-1;
offset=0;
for(i=0;i0){
fn(unused.join(ES)+s.substr(offset,i-offset));
unused=[];
}else if(i-1>=offset){
fn(s.substr(offset,i-offset));
}else{
fn(ES);
}
offset=i+1;
}else if(i==last_i){
unused.push(s.substr(offset,len-offset));
}
}
}
if(unused.length>0){
fn(unused.join(ES));
}
}finally{
px.fclose(fd);
}
}

var total=0,space=’ ‘;

readLines(”/dev/stdin”,
function(line){
var a=line.split(space), i, len;
for(i=0,len=a.length;i<len;i++){
total+=(parseInt(a[i],10));
}
});

print(total);

And it times like this for me:
$ time /opt/ruby-1.9.2/bin/ruby me.rb < input_list_vs_gen.txt
17677692470

real 0m6.747s
user 0m6.304s
sys 0m0.128s
$ time ./sum.js < input_list_vs_gen.txt
17677692470
real 0m7.230s
user 0m6.756s
sys 0m0.184s

To compile “k7″ I had to make a small change to the makefile for my Ubuntu system:

-BUILD_LIBS =-lpthread -ldl
+BUILD_LIBS =-lpthread -ldl -lrt

It’s interesting the performance of the v8 engine because the “readLines” API I implemented in high-level JavaScript code and yet it compared quite well versus C optimized code of other language implementations.

Joao Pedrosa
April 12, 2009

I mean the k7 version of Sebastien’s repository: http://github.com/sebastien/k7/tree/master

HG
April 12, 2009

ruby 1.9.1: 2.999s

#!/usr/bin/env ruby
total = 0
STDIN.each_line {|line|
total += line.split(” “).inject(0) {|sum, x| sum + x.to_i }
}
puts “Grand total: #{total}”

[...] Read more here: Ikanobori Weblog » Language performance on the sum of a file. [...]

James Block
April 12, 2009

As I posted on Reddit, I wrote a pure assembly version that’s basically exactly as fast as C: http://pastebin.com/f2c9584a6

Summing the numbers is trivial. Figuring out how to print a 64-bit number on a 32-bit CPU took
about five times as long as writing the summation code. Such is life.
From a very old P4 machine:

$ time ./james_block_asm < input_list_vs_gen.txt
Grand total: 17677692470

real 0m0.118s
user 0m0.076s
sys 0m0.040s

$ time ./ray_asm < input_list_vs_gen.txt
Grand total 17677692470

real 0m0.714s
user 0m0.692s
sys 0m0.020s

$ time ./C < input_list_vs_gen.txt
total: 17677692470

real 0m0.114s
user 0m0.100s
sys 0m0.012s

Joao Pedrosa
April 12, 2009

I tweaked the JavaScript version a little further and got quite a good speedup:

$ time ./sum.js 0){
trail_nl=s.charAt(s.length-1)==NL;
lines=s.split(NL);
if(trail_nl){
//when splitting with a trailling newline an “extra line” is added so discard it
lines.pop();
}
len=lines.length;
if(len==1&&!trail_nl){
unused.push(lines[0]);
}else{
last_i=len-1;
if(unused.length>0){
lines[0]=unused.join(ES)+lines[0];
unused=[];
}
for(i=0;i<len;i++){
if(i0){
fn(unused.join(ES));
}
}finally{
px.fclose(fd);
}
}

var total=0,space=’ ‘;

readLines(”/dev/stdin”,
function(line){
var a=line.split(space), i, len;
for(i=0,len=a.length;i<len;i++){
total+=(parseInt(a[i],10));
}
});

print(total);

In some ways, this JavaScript engine is everything that so many other high-level languages have tried to be. ;-) Bright future for it.

Joao Pedrosa
April 12, 2009

Sorry for trying to paste code in the comments. This time I have put it on gist: http://gist.github.com/93891

The newest Javascript version times in:
$ time ./sum.js < input_list_vs_gen.txt
17677692470
real 0m4.441s
user 0m4.008s
sys 0m0.232s

Michael
April 12, 2009

I’m wondering how Java compares. I’ve written a test: http://pastebin.com/f51d4c4b2

Originally, I was using String.split(), but (idea from the great shootout) I tried StringTokenizer instead. It’s still much slower than the C version; does anyone have any improvements?

juris galang
April 12, 2009

Slightly faster Ruby version:

#!/usr/bin/env ruby

class Array
def sum
inject(nil) { |sum, x| sum ? sum + x.to_i : x.to_i }
end
end

total = 0
STDIN.each_line { |line| total += line.split(’ ‘).sum }
puts “Grand total: #{total}”

this version’s performance:

> time ./sum.rb time ./sum.rb < numbers.txt
Grand total: 2599246967

real 0m0.658s
user 0m0.641s
sys 0m0.016s

juris galang
April 12, 2009

btw: that was in Ruby 1.9.1

juris galang
April 12, 2009

correction:

original performance:
> time ./sum.rb time ./sum.rb < numbers.txt
Grand total: 2599246967

real 0m0.623s
user 0m0.605s
sys 0m0.015s

Joe
April 12, 2009

Tcl:

set data [read stdin]
foreach num $data {
incr result $num
}

puts $result

In version 8.5 of the interpreter from ActiveState it runs between 3.4 and 3.6 seconds.

Awk:
awk ‘{for(i=1; i<=NF; i++) {sum = sum + $i}} END {printf(”%20f\n”, sum)}’ input_list_vs_gen.txt

.95-1.0 seconds.

woogley
April 12, 2009

Java version that completes in about 0.3 seconds:

http://pastebin.com/m68184c34

Trevor Caira
April 12, 2009

The given python version parses the file twice. If you parse it only once, you get a significant speedup:

print sum(map(int, sys.stdin.read().split()))

Replacing map with imap as noted by Philip Jenvey leads to yet further improvement:

print sum(imap(int, sys.stdin.read().split()))

mike
April 12, 2009

maybe you should use some ASM code that is worth half a damn

April 12, 2009

Hey guys, I want everybody to know that most of these comments were posted while I was sound asleep (my last comment dates from 3 AM last night).

I have to go search for some easter eggs but I’ll be back in a few hours and update the test results.

I’ll also include a larger test file so we can see how the very fast programs perform on a larger file so we don’t necessarily compare load times.

I’m particularly interested in the Java and new Python submits. Also I’m looking forward to running the improved asm piece.

[...] View original here:  Ikanobori Weblog » Language performance on the sum of a file. [...]

Phill
April 12, 2009

PHP is going to have a bit of a hard time with this.

The problem is mainly with array_sum, which makes up about 80% of the time taken and I havent experimented on that end of things enough to find a faster way yet.

My solution is too close to John Piasetzki’s so I wont bother posting it.
Used a Do loop rather than While which is slightly faster.

kesavan
April 12, 2009

php code that is at-least a minute faster (comparable to the python code now)…I have posted it here: http://pastebin.com/f26847cbe

kesavan
April 12, 2009

forgot to output the sum in the previous link that I had posted…
including echo: http://pastebin.com/d69b12c92

James Block
April 12, 2009

Why do I waste my time like this? A better assembly version, that can handle files bigger than its buffer: http://pastebin.com/f368d5959 . There should be no more limit on total size, though it’s still limited to 32-bit numbers and a 64-bit sum. It’s exactly as fast as the previous version on my machine, at the same speed as C.

snuxoll
April 12, 2009

I also wrote an implementation in vala, could probably get cleaner if I turned StdinReader into an interator, but I’m too lazy http://gist.github.com/93333.

April 12, 2009

@Juris Galang:
Your solution comes up with the wrong total, the correct total should be: 17677692470

@James Block:
I ran your code, came to 0.06 seconds, I will average it and add it to the post. Cool!

@Everyone else:
I’ll be spending my time after dinner on going through all the new submissions and adding them to my article as I average them out!

Markus
April 12, 2009

You could also use numpy:

import sys, numpy

print numpy.fromfile(sys.stdin, numpy.int64, sep=” “).sum()

which is 3 times faster than

print sum(sum(map(int, line.split())) for line in sys.stdin)

on my 32-Bit box (6.934s vs 2.346s)

Dax Huiberts
April 12, 2009

I have a slightly faster version of the ruby one: http://www.pastie.org/444396

Too bad ruby doesn’t have a build in sum method for enumerables (the rails extensions do though). So here is another version, albeit slower, which adds a simple sum method to Enumerable: http://www.pastie.org/444397

Now that makes the the line:
puts STDIN.sum{|l|l.split.sum(&:to_i)}
some tidy, compact and a strong piece of code :D

Marek Kubica
April 12, 2009

Here’s a simplistic recursive Scheme solution:

http://paste.lisp.org/display/78462

(it performs rather poor because of that deep recursion. No time to make a better one, though)

Superwayne
April 12, 2009

An implementation in “real” java (based on the Scanner class): http://pastebin.com/f72444f0a

However, pretty slow… needs about 9 s to complete.

John Stumpo
April 12, 2009

http://pastebin.com/m64a72659

x86 assembly implementation using the raw Linux syscall interface (not even libc).

[...] Ikanobori Weblog » Language performance on the sum of a file. print sum(sum(map(int, line.split())) for line in sys.stdin) (tags: microbenchmark python ruby haskell c) This entry was posted on Monday, April 13th, 2009 at 1:31 am and is filed under Bookmarks. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site. [...]

John Stumpo
April 12, 2009

http://pastebin.com/m15358830
Slightly optimized version of my previous posting. Still raw syscall interface with no libc or any other dynamic libs. On my system, the executable was 600 bytes.

April 13, 2009

@John Stump
The executable on my system comes to 600 bytes as well. However, it runs in .063 seconds.

I am going to have the fastest 3 (asm, haskell and c) do some crunching on a bigger file when I get the time. I’ll be sure to include yours as well.

April 13, 2009

Everybody:

If you submitted code in a language which was already in the list but you don’t see your code in the post that means it averages out at a slower speed than the current mention.

If you submitted code in a language not yet in the list that means that I haven’t yet installed your compiler but will get to it as soon as I have the time to add your code (this goes for the C++ (Qt4), Java, Vala and Tcl code).

Also I will be running all languages against a larger input file. The larger input file will consists of 1 GB. Maximum time to parse will be set to 30 seconds. It should show us more of a difference between the asm, C and haskell versions.

William D. Neumann
April 13, 2009

Since I’m too full of Easter ham to do much else but play with code, here are a couple of OCaml solutions. First the fast, imperative, C-like solution (note, if using OCaml versions < 3.11, replace Array1.unsafe_get with Array1.get):
http://pastebin.com/f6230b85e
(compile with ocamlopt unix.cmxa bigarray.cmxa [name of source file] -o [output file name])

And a “fancier”, yet slower, functional version using camlp4 stream parsers:
http://pastebin.com/f106fa718
(compile with ocamlopt -pp camlp4of [name of source file] -o [output file name])

James Block
April 13, 2009

Nice, John. I especially like the simple print loop… much tighter than mine, since you exploit the fact that printing the sample input requires only one full 64-bit modulus, after which they’re all 32 bits. Noticing that would have made my version a lot easier to write. It’s still about three times slower than mine though ;)

Mine is so fast because this problem is very simple: so simple that *all* intermediate computations fit in the 8 x86 GPRs. I never need to spill to memory, or even the stack (but I do need to abuse ESP to hold data — I can only get away with that by never pushing, popping, or calling anything). I haven’t tweaked my code for all-out speed, but it’s simple enough that there’s not going to be much low-hanging fruit; designing with
1. few memory references
2. little (turns out to be no) stack usage and
3. small instruction lengths
in mind is probably close enough to speed-of-light for this problem that breaking out PREFETCH et al won’t help much, and would have the (oh so dire) cost of breaking 386 compatibility.

One thing I should have done with my original version, though, was tweak the buffer size: I pulled 4MB out of a hat, with no rationale whatsoever. Turns out that 64kB is a lot faster… fast enough that my program now consistently beats C on this P4/1.7GHz.

http://pastebin.com/f74d9536e has the tweaked buffer size, along with a couple trivial changes to avoid the stack in a slowpath (the fastpath remains unchanged). It’s exactly as fast as changing the buffer size in the other version to 64kB.

jmelesky
April 13, 2009

The tested perl version is doing unnecessary copying. Removing that shaves at least a second off the time.

Replace the while loop with:
while(<STDIN>){
foreach my $num (split) { $sum += $num; }
}

On my machine takes it from 3.7ish seconds to 2.5ish.

jmelesky
April 13, 2009

I spoke too soon — using help from the standard library drops the perl speed even further. I’m clocking the following around 1.6s:

#!/usr/bin/env perl

use strict;
use List::Util ’sum’;

my $sum = 0;
while (<STDIN>) {
$sum += sum split;
}

print “$sum\n”;

April 13, 2009

@James Block:
Your ASM version now consistently outperforms the C version, averaging out at 0.04s. However, when I run it on a 762MB sample file it is considerably slower, taking 5 seconds where the C version only takes 2.9.

Any idea where that comes from?

TimG
April 13, 2009

This following python version took about half the time on my old P3 (linux) and about 3/4 the time on my (not quite as old) P4 (windows). It’s not quite as elegant as the original, however :p.

The buffer size seems to alter the speed a bit, but 8K seemed to work pretty well on both. Replacing map with imap was a fair bit slower on my P3, but gave a minor speedup on my P4. YMMV.

The save variable basically keeps track of any incomplete ints that get read from the file, and gets added onto the front of the next read.

import sys

def sumints(f, _bufsize=8192):
save = None
result = 0
while True:
buf = f.read(_bufsize)
if not buf:
return result
X = buf.split()
if save is not None:
if buf[0] in ‘ \n’:
result += int(save)
else:
X[0] = save + X[0]
if buf[-1] not in ‘ \n’:
save = X.pop()
else:
save = None
result += sum(map(int, X))

print sumints(sys.stdin)

JV
April 13, 2009

A slow idiomatic C++ solution
#include
#include
#include

int main(int argc, char** argv)
{
std::cout << “Grand Total: ” << std::accumulate(std::istream_iterator(std::cin), std::istream_iterator(), 0LL) << “\n”;
return 0;
}

PS
April 13, 2009

A scalable (reads line-by-line) Tcl solution:

#!/usr/bin/tclsh8.5
proc main {} {
while {[gets stdin line] >= 0} {
foreach num $line {
incr sum $num
}
}
puts $sum
}
main

James Block
April 13, 2009

I can’t replicate your results here on my P4/1.7GHz testing box with 256MB of RDRAM (yeah, it’s an old one, but it’s my usual box for random fooling around).

Too lazy to make a real test file, I concatenated the 17MB file with itself a bunch of times, doubling its size every time. This seems as good a test as any. I found that results depended a lot on the OS’s caches (this box runs Debian etch): after running “cat biginput biginput > doubleinput”, the first run afterward (of any program) would always be fastest, presumably due to cache effects. Things quickly got out of hand and runtime exploded when files exceeded the capacity of main memory, which is hardly surprising. As well, when programs take this long to run, you get interruptions from the OS scheduler and other threads, so the results tend to bounce around a lot. I probably should have tried renicing, but didn’t bother.

Summary results (first run discarded):
132MB input (6 runs each):
C: average 2.912s, min 2.727s
asm: average 2.617s, min 2.437s

264MB input (12 runs each):
C: average 13.309s, min 12.776s
asm: average 13.288s, min 12.656s

528MB input (3 runs each):
C: avg 26.760s, min 26.741s
asm: avg 26.565s, min 26.530s

1056MB input (4 runs each):
C: avg 54.226s, min 54.112s
asm: avg 53.911s, min 53.868s

The time command reports a distribution of where the time is spent, too: let’s look at some results from the 1GB test

C:
real 54.170s
user 6.436s
sys 1.312s

asm:
real 53.900s
user 5.156s
sys 1.328s

Both programs spend a little time in the kernel, actually doing something, and a reasonable amount of time actually crunching numbers. But the majority of the time is spent waiting for I/O to complete, at least on this pathetic old box. At this point we’re not testing execution speed, we’re testing the OS’s disk schedulers and caching.

Don Stewart
April 14, 2009

Hey, for the 1G version, I imagine the output overflows a 32 bit integer (you’re using 32 bits?). So in that case, replace S.readInt with S.readInteger, for an unbounded result.

Don Stewart
April 14, 2009

In the Haskell version, that is.

Wutz?
April 14, 2009

Ucan has teh better c0dez:

#!/usr/bin/env ruby

p STDIN.split(” “).size

kthxbye

Dmitri Sotnikov
April 15, 2009

here’s a clojure version, for best results run with java -server, eg:
cat input_list_vs_gen.txt | java -server -cp clojure.jar clojure.lang.Script sum.clj

(defn sum-line [#^String string]
(let [nums #^ints (map #(Integer/parseInt %) (. string (split " ")))]
(reduce + nums)))

(defn read-input [sum]
(let [chars (read-line)]
(if chars
(recur (+ sum (sum-line chars)))
sum)))

(time (println (read-input (int 0))))

I suspect it could be faster, so if anybody’s up for it I’d like to see that.

Dmitri Sotnikov
April 15, 2009

here’s a clojure version, sorry if it’s a repost, my initial comment didn’t seem to post:

(defn sum-line [#^String string]
(let [nums #^ints (map #(Integer/parseInt %) (. string (split " ")))]
(reduce + nums)))

(defn read-input [sum]
(let [chars (read-line)]
(if chars
(recur (+ sum (sum-line chars)))
sum)))

(time (println (read-input (int 0))))

Markus
April 16, 2009

Python multiprocessing version:

import sys
import multiprocessing

def sumline(line):
return sum(map(int, line.split()))

if __name__ == “__main__”:
pool = multiprocessing.Pool()
print sum(pool.imap_unordered(sumline, sys.stdin, 1000))

Original: 6,814s
Multiprocessing: 3.925s (dualcore)

[...] Ruby, PHP, Python, C, Haskell, x86 asm: Language performance on the sum of a file [...]

[...] to cobol, C, Java, Ruby/Haskell/Scala/. It’s gotten to the point now where its even efficient when it gets boiled all the way down! So what are some of the spoken language “idioms” [...]

Vincenzo
August 15, 2009

On my Mac Intel, Core 2 Duo, OS X Leopard, PHP 5.2.8, Python 2.5:

PHP: 2.046s
Python: 2.763s

Using the code in this post.
The PHP code that ‘KESAVAN’ in the comments claims to be faster is actually slower on my machine.

Post a comment