Roll Your Own Data Persistence

Author: Sean J Taylor
Date: 2008.03.15
Venue:PyCon 2008

Get the Slides

The slides have tons of clickable links, you can view them here:

http://blog.codejams.net/pycon/

About Me

Who is this guy?
  • economics research at FRB, Python for research
  • web development in Python for Matrix Group, Intl.
  • love Python, open source, reinventing the wheel

Inspiration

CouchDB
basura
  • Sam Ruby's throwaway Python version
  • uses bsddb, simplejson, and wsgi

Keys and Values

Document-oriented Database are simple key-value stores.

Examples:

Example: Invoice

These are actual requirements from a real-world project.
  • Invoice has LineItems
  • LineItems may be split into accounting codes
  • Invoice has Payments
  • Must record amount paid on each accounting code for AR
  • The Invoice/LineItem/Payment/Allocation must be versioned
  • Need to be able to query for Invoices with certain complex properties

Example: Invoice

Here is the Invoice as a JSON document

{ "payments": [
  {"type": "VISA", "amount": 14.00}],
  "lineItems": [
    { "description": "Dive into Python",
      "amount": 17.00,
      "allocation": {
        "code1": {"amount":5.00, "paid": 5.00},
        "code2": {"amount":12.00, "paid": 9.00}}
    }]
}

Implementation

How will we build this?
  1. choose a way to persist string keys and string values
  2. choose a serialization format
  3. add features as "middleware"
  4. add querying by wrapping our db in a MapReduce implementation
  5. serve our database over HTTP using WSGI

Warnings

IANADBE:
I am not a database (expert|engineer). This approach has not been used in production.
Caveat Developer:
Let the developer beware.

We may be giving up some of the ACID properties that are so desirable in databases.

Storage: Roll Your Own

DictMixin is useful for constructing persistent dictionaries.

class DictMixin(object):
  def keys(self): pass
  def __getitem__(self, key): pass
  def __setitem__(self, key, value): pass
  def __delitem__(self, key): pass

All other mapping-interface methods are derived from these four operations. See Ian Bicking's Blog Entry for more info.

Storage: Roll Your Own

Example: FileStore

class FileStore(DictMixin):
  def __init__(self, path):
    self.path = path
  def keys(self):
    return os.listdir(self.path)
  def __getitem__(self, key):
    p = os.path.join(self.path, key)
    if not os.path.exists(p):
      raise KeyError(key)
    return open(p).read()

Storage: Roll Your Own

FileStore continued

class FileStorage(DictMixin):
  def __setitem__(self):
    f = open(os.path.join(self.path, key))
    f.write(value)
    f.close()
  def __delitem__(self, key):
    os.remove(os.path.join(self.path, key))

Add file locking and it's safe for multiple read/writes.

Storage: Roll Your Own

Other ideas for rolling your own storage:

Storage: Standard Library

The standard library comes with plenty of options.

>>> import *dbm as dbm
>>> db = dbm.open(filename, mode)

Storage: Standard Library

dumbdbm:
  • written in Python, a flat file
  • fallback option
dbm:
  • uses UNIX dbm library
  • removes items/values methods

Storage: Standard Library

gdbm:
  • GNU DBM library
  • removes items/values methods
  • adds firstkey/nextkey methods
dbhash:
  • BSD DB library
  • adds first/last/next/previous methods

Storage: Cheese Shop

I recommend shove, which implements storage backends for:
  • Amazon S3
  • FTP
  • ZODB
  • MySQL
  • many others

Storage Tradeoffs

Choice of storage will depend on your application. You may need:

  • to be able to edit the data with a text editor
  • to make the store accessible from other languages
  • to scale to a large number of documents
  • to partition the data or acheive some redundancy

Storage Benchmarks

Benchmark:
copy 849 manpages from storage backend to memory and back
backend writetime listkeys readtime
bsddb 11.40 0.80 2.65
sqlite 33.76 0.23 4.32
dbm 14.27 0.74 2.62
file 2.15 0.03 9.90

Serialization: Standard Library

We have some good options in the standard library:
  • pickle/cPickle
  • marshal (not safe)
  • repr/eval (not safe)

Serialization: Cheese Shop

JSON:
simplejson, cjson, jsonlib, others
YAML:
PySyck, PyYAML
XML:
ore.xd, pyxmlserial, xmlpolymerase

Serialization: YAML Example

payments:
 - amount: 14.00
   type: VISA
lineItems:
 - description: Dive Into Python
   amount: 17.00
   allocation:
     code1:
       amount: 5.00
       paid: 5.00
     code2:
       amount: 12.00
       paid: 9.00

Serialization Tradeoffs

  • interoperability
  • speed
  • human editability/readability

Serialization Benchmark

images/serialization_benchmark.png

Adding Features

To add features, we subclass a UserDict to make middleware

class UserDict(object):
  def __init__(self, data):
    self.data = data
  def __getitem__(self, key):
    return self.data[key]
  def __setitem__(self, key, value):
    self.data[key] = value

Adding Features: Serialization

Values are converted to/from JSON as they are set/get.

class JSONSerializer(UserDict):
  def __getitem__(self, key):
    return simplejson.loads(self.data[key])
  def __setitem__(self, key, value):
    self.data[key] = simplejson.dumps(value)

Adding Features: Data Integrity

Using FormEncode, we can easily validate a document has the correct schema:

class InvoiceSchema(Schema):
  payments = ForEach(PaymentSchema)
  lineItems = ForEach(LineItemSchema)
class PaymentSchema(Schema):
  type = UnicodeString()
  amount = Number()
class LineItemSchema(Schema):
  allocation = AllocationSchema()

Adding Features: Data Integrity

We'll pass our document through a validation test when it is saved.

class ValidatingDict(UserDict):
  def __init__(self, data, validator=None):
    self.validator = validator
    ...
  def __setitem__(self, key, value):
    self.validator(value)
    self.data[key] = value

Adding Features: Versioning

class VersionedDict(UserDict):
  def __getitem__(self, key):
    value = self.data[key]
    version = value['latest']
    if isinstance(key, tuple):
      key, version = key
    return value[version]
>>> db['docname']
>>> db['docname', 'revision number']

Adding Features: Versioning

class VersionedDict(UserDict):
  def __setitem__(self, key, value):
    try:
      versions = self.data[key]
    except KeyError:
      versions = {}
    if (versions and value['rev'] != versions['latest']):
      raise Conflict()
    rev = uuid.uuid1(); value['rev']=rev
    versions[rev] = value; versions['latest'] = rev
    self.data[key] = versions

Adding Features: Timestamp

class TimeStampDict(UserDict):
  def __setitem__(self, key, value):
    now = datetime.now()
    value['timestamp'] = now
    self.data[key] = value

Adding Features: Replication

class ReplicationDict(UserDict):
  def __init__(self, data, backup=None):
    self.backup = backup
    ...
  def __setitem__(self, key, value):
    t = Thread(target=self.backup.__setitem__,
               args=(key, value))
    t.start()
    ...

Adding Features: Compression

class ZipDict(UserDict):
  def __getitem__(self, key):
    value = self.data[key]
    return zlib.decompress(value)
  def __setitem__(self, key, value):
    value = zlib.compress(value)
    self.data[key] = value

Adding Features: Encryption

import Crypto.Cipher.AES as AES
class CipherDict(UserDict):
  def __init__(self, data, key=key):
    self.crypter = AES(key)
    ....

You get the idea.

Putting It All Together

from functools import partial
def compose(outer, inner):
  def composition(*args, **kwd):
    return outer(inner(*args, **kwd))
  return composition

compose_all = partial(reduce, compose)

Putting It All Together

store  = FileStore('/var/database')
backup = SimpleDBStore()
db = compose_all([
  partial(ValidatingDict,
    validator=InvoiceSchema.to_python),
  TimeStampDict,
  VersionedDict,
  JSONSerializer,
  partial(ReplicatingDict, backup=backup)
])(store)

Unsolved problem: how to share state across the stack.

Simple Querying

def has_high_balance(doc):
  if calculate_balance(doc) >= 50.00:
    return True
invoices = (inv for inv in db.itervalues()
            if has_high_balance(inv))

So maybe this isn't the most efficient approach, but it's simple.

Map-Reduce Querying

MapReduce was invented by Google to query incredibly large datasets.

It can be run incrementally if we're willing to store redundant information.

Map-Reduce Querying

To implement it, we will define views, which will consist of:
  • a map function
  • storage of intermediate map output
  • storage of reduce input
  • a reduce function
  • storage of the calculated view

The Map Function

The Map Function takes a document and may yield 0 or more key-value pairs.

def map_func(doc):
  for li in doc['lineItems']:
    for c, a in li['allocation'].iteritems():
      yield c, a['amount'] - a['paid']

Map Output

The Map function is run on each document as it is set:

map_output = {'invoiceId': [
  ('code1', 13.00),
  ('code2', 18.00),
  ('code1', 0.00),
  ('code3', 1.00)]}

This is cached for each key for each view in the database.

Constructing Reduce Input

reduce_input = defaultdict(list)
for values in map_output.itervalues():
  for key, val in values:
    reduce_input[key].append(val)

reduce_input should be persistent for incremental updates.

The Reduce Function

The Reduce function takes a key and a list of values to reduce. In our case, it will be a simple sum.

def reduce_func(accounting_code, amounts):
  return sum(amounts)

The Stored View

The resuling view on our data looks like this:

{'code1': 13.00,
 'code2': 18.00,
 'code3': 1.00}

MapReduceView

class MapReduceView(UserDict):
  def __init__(self, data, store1, store2):
    self.map_output = store1
    self.reduce_input = store2
    ....
  def map(self, key, value): abstract
  def reduce(self, key, values): abstract
  def update_view(self, key, value):
    # update map_output
    # update reduce_input
    # update data

More on Map Reduce

MapReduce -> Mapping Interface

To keep our interface from being polluted, we'll make the views addressable.

class MapReduceDict(UserDict):
  views = {'ar_by_accounting_code': my_view_instance}
  def __setitem__(self, key, value):
    for view in self.views.values():
      view.update_view(key, value)

  def __getitem__(self, key):
    if key.startswith('_view/'):
      return self.views[key[6:]]

The Server

To create a database server, we'll use WSGI to serve over HTTP. REST maps neatly onto the mapping interface.

Things to consider:
  • need to reverse serialization
  • how to access keys
  • need to worry about content-type
  • handling ETags would be nice

The Server

class HTTPDict(UserDict):
  def GET(self, uri, environ):
   try:
     data = self.data[uri]
   except KeyError:
     raise HTTPNotFound()
   return '200 OK', data

The Server

def POST(self, uri, environ):
  resource_id = str(uuid.uuid1())
  self._set_item(resource_id, environ)
  return '201 Created', ''
def PUT(self, uri, environ):
  new = uri not in self.data
  self._set_item(uri, environ)
  return '201 Created' if new else '204 No Content', ''

Things We're Still Missing

To be safe, we should also think about:
  • monitoring disk space
  • monitoring memory usage
  • parallelizing MapReduce with the processing module
  • testing concurrency
  • worry about consistency and locking
  • keeping views synced with data
  • transactions (can do this with BerkeleyDB)
  • many others

Summary

In summary:
  • Worse is better: choose a simple interface, implementation
  • UserDict and DictMixin are awesome, use them like WSGI
  • Document-oriented databases are an interesting approach to some problems
  • document-storage lets you do less work upfront, but more work is necessary later to construct queries

Thank You

Thanks for listening.

Contact me at:
seanjtaylor at gmail dot com