Memory-efficient Count-Min Sketch Counter (based on Madoka C++ library)
|travis| |coveralls| |pyversion| |version| |license|
Madoka is an implementation of a Count-Min sketch data structure for summarizing data streams.
String-int pairs in a Madoka-Sketch may take less memory than in a standard Python dict, Counter, Redis.
Counting error rate is about 0.0911 %
More details are described in Benchmark.ipynb
_
.. _Benchmark.ipynb: https://github.com/ikegami-yukino/madoka-python/blob/master/Benchmark.ipynb
This module is based on madoka
_ C++ library.
.. _madoka: https://github.com/s-yata/madoka
NOTE: Madoka-Sketch does not have index of keys. so Madoka-Sketch can not dump all keys such as Python dict's dict.keys()
. However, when set k
parameter to costructer, most_common
method (returns key and value as many as k
) is available.
Contributions are welcome!
::
$ pip install madoka
Madoka has some classes having same interface. These classes are vary in value data type. So you can choose for your purpose.
For example, if you wants to count float data, it's preferable to choose CroquisFloat class or CroquisDouble class.
From here, I will describe about Sketch class. But, Croquis classes have also same interfaces mostly. So you can use other classes by the same way as Sketch class. In that case, you should replace to intended class from "Sketch".
.. code:: python
import madoka sketch = madoka.Sketch()
Sketch madoka.Sketch([width=1048576, max_value=35184372088831, path='', flags=0, seed=0, k=5])
width
is a size of register. If you are worrying about gap, you should increase width
value. The larger width
is, the fewer mistakes madoka makes in estimating value. But, the larger width
is, the larger memory consumption is.
Permission of path
should be 644
k
means Top-K used by most_common
method. if you don't want to use most_common
method, then I recommend to set k=0
so it is slightly fast.
madoka.Sketch()
calls madoka.Sketch.create()
, so you don't have to explicitly call create()
in initialization
.. code:: python
sketch['mami'] += 1
or
.. code:: python
sketch.inc('mami')
int inc(key[, key_length=0])
key_length
is automatically determined when not giving key_length
. Thus, the order of parameters differs from original madoka C++ library... code:: python
sketch['mami'] += 6
or
.. code:: python
sketch.add('mami', 6)
int add(key, value[, key_length=0])
key_length
is automatically determined when not giving key_length
. Thus, the order of parameters differs from original madoka C++ library... code:: python
sketch['mami'] = 6
or
.. code:: python
sketch.set('mami', 6)
void set(key, value[, key_length=0])
Note that set()
does nothing when the given value is not greater than the current key value.
Also note that the new value is saturated when the given value is greater than the upper limit.
Additionally note that key_length
is automatically determined when not giving key_length
. Thus, the order of parameters differs from original madoka C++ library.
.. code:: python
sketch['mami']
or
.. code:: python
sketch.get('mami')
int get(key[, key_length=0])
key_length
is automatically determined when not giving key_length
. Thus, the order of parameters differs from original madoka C++ library... code:: python
sketch.values()
generator
.. code:: python
sketch.save('example.madoka')
void save(path)
path
should be 644.. code:: python
sketch.load('example.madoka')
void load(path)
path
should be 644.. code:: python
sketch.clear()
void clear()
create()
in maintaining current settings... code:: python
sketch.create()
void create([width=0, max_value=0, path=NULL, flags=0, seed=0])
path
should be 644.. code:: python
sketch.copy(othersketch)
.. code:: python
sketch += other_sketch
or
.. code:: python
sketch.merge(othersketch)
void merge(Sketch[, lhs_filter=None, rhs_filter=None])
.. code:: python
sketch.shrink(sketch, width=1000)
void shrink(Sketch[, width=0, max_value=0, filter=None, path=None, flags=0])
When width > 0, width must be less than source sketch
Permission of path
should be 644
.. code:: python
summed_sketch = sketch + other_sketch
.. code:: python
summed_sketch = sketch + {'mami': 1, 'kyoko': 2}
.. code:: python
'mami' in sketch
.. code:: python
sketch.inner_product(other_sketch)
list
.. code:: python
sketch['madoka'] = 1 sketch['mami'] = 2 sketch['sayaka'] = 3 sketch['kyouko'] = 4 sketch['homura'] = 5 sketch.median() # => 3
.. code:: python
sketch.filter(lambda x: x + 1)
void filter(Callable[, apply_zerovalue=False])
If apply_zerovalue = True, filter_method is applied also 0 values (It may be slow) (from version 0.6 or later)
Note that processing time increases according to sketch's width. If you feel this method is slow, I recommend setting width to less than 1000000 when creating sketch
.. code:: python
sketch.fromdict({'mami': 14, 'madoka': 13})
or
.. code:: python
sketch += {'mami': 14, 'madoka': 13}
.. code:: python
sketch.most_common()
generator most_common([k=5])
returns key-value pair as many as k
Note that this method is required to set k
parameter in constructer.
madoka
_ C++ library is licensed under the Simplified BSD License... |travis| image:: https://travis-ci.org/ikegami-yukino/madoka-python.svg?branch=master :target: https://travis-ci.org/ikegami-yukino/madoka-python :alt: travis-ci.org
.. |coveralls| image:: https://coveralls.io/repos/ikegami-yukino/madoka-python/badge.svg :target: https://coveralls.io/r/ikegami-yukino/madoka-python :alt: coveralls.io
.. |pyversion| image:: https://img.shields.io/pypi/pyversions/madoka.svg
.. |version| image:: https://img.shields.io/pypi/v/madoka.svg :target: http://pypi.python.org/pypi/madoka/ :alt: latest version
.. |license| image:: https://img.shields.io/pypi/l/madoka.svg :target: http://pypi.python.org/pypi/madoka/ :alt: license