Skip to content

Python custom dictionary benchmark (with pyhash.super_fast_hash )

October 8, 2013

Recently I had a task to measure performance of python dictionary vs custom dictionary using Google pyhash super_fast_hash() (in my case but I can be any other hash function).

Actually Python provide not bad implementation of hash collection. To proof this statement it was performed standard performance test set(sequential insert, random insert, delete operation).  To see the difference it plots will be built.

Implementation

First of all we need to implement custom dictionary with overridden hash function:

#filename: custom_dictionary.py
import pyhash

class CustomDictionary(dict):
    def __init__(self, *arg, **kw):
        super(CustomDictionary, self).__init__(*arg, **kw)

    def __eq__(self, other):
        return super.__eq__(other)

    def __hash__(self):
        hasher = pyhash.super_fast_hash()
        return hasher(self)

After it we need some test harness to run performance tests. For these purposes timeit lib was used:

import unittest
import timeit
from custom_dictionary import CustomDictionary

class BenchTests(unittest.TestCase):
    min_keys = 2 * 1000 * 100
    max_keys = 10 * 1000 * 100
    step = 2 * 1000 * 100
    repeat_time = 10 # repeat time to get average value

    isForFirstTime = False

    def setUp(self):
        if not self.isForFirstTime:
            self.setupClass()
            self.__class__.isForFirstTime = True

    def setupClass(self):
        self.delete_file_content("output")

    def test_sequential_insert(self):

        current_step = self.min_keys

        while current_step <= self.max_keys:
            setup_statement = "from custom_dictionary import CustomDictionary; l = [(str(x), x) for x in range(%d)]; d = {}; cd = CustomDictionary()" % current_step

            t = timeit.Timer("""
            for s, i in l:
                d[s] = i
            """, setup_statement)

            t2 = timeit.Timer("""
            for s, i in l:
                cd[s] = i
            """, setup_statement)

            tm1 = t.timeit(number=self.repeat_time)
            tm2 = t2.timeit(number=self.repeat_time)

            self.write_to_output("sequential classy", (current_step, tm1))
            self.write_to_output("sequential custom", (current_step, tm2))

            current_step += self.step

All source files can be found on github https://github.com/tjlee/ya_task

Results

After that we’ve got the following result graphs:

deleterandomseq

Benchmarks were made on:

Mac Os X 2.4 GHz Intel Core 2 Duo / 4 GB 1333 MHz DDR3

Links:

pyhash https://code.google.com/p/pyfasthash/

timeit http://pymotw.com/2/timeit/

matplotlib http://matplotlib.org/

Advertisements
Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: