Ashish Banerjee's Website

Structured Datatypes in Python

Any useful programming language provides a means to implement composite data types, ie. data types built as a composition of primitive and other composite data types. Python provides many ways to implement these structures, but I want to write about how to do this using named tuples and dataclasses.

The reason why I like these two features is because they significantly improve the maintainability of a codebase. They very clearly declare all the fields a structure contains, and they integrate very nicely with Python's type hints. Additionally, they make it possible to refer to fields using their names (rather than using an index or some other kind of non-descriptive identifier), and they can be immutable. I won't get into the pros and cons of immutability (there's a good reference here for the interested reader), but in general, from the standpoint of maintainability, it's preferable to make data structures immutable whenever possible so it is easier to reason about the code.

Demonstration

Suppose we wish to store a Person object in memory. We'll define the fields that we wish to store as name, birthdate, occupation, and a unique identifier id.

NamedTuple

Here is how we can define our Person object using a named tuple:

from collections import namedtuple

Person = namedtuple('Person', ['name', 'birthdate', 'occupation', 'id'])
some_person = Person(
    name='John Doe', birthdate=date(2020, 11, 10), occupation='student',
    id=1)

Looks great, but what happened to the type hints?! It actually turns out that there is a separate syntax for doing this in the typing module:

from datetime import date
from typing import NamedTuple, Union

class Person(NamedTuple):
    name: str
    birthdate: date
    occupation: Union[str, None]
    id: int

some_person = Person(
    name='John Doe', birthdate=date(2020, 11, 10), occupation='student',
    id=1)

This approach satisfies all the criteria we outlined in the introduction. It provides type hints, it's immutable and it always guarantees that there are 4 fields that can be identified by name. What I like even more about this code is that anyone can see exactly what fields a Person structure contains by looking at its definition.

It is also possible to add methods to the class; for example:

class Person(NamedTuple):
    # field names here...

    def say_name(self):
        return f"My name is {self.name}"

Now we can run something like print(some_person.say_name()), and the logic for say_name() is inside the class, rather than in a function somewhere outside and far away.

It is also very easy to add fields to this structure. So long as the fields are always referred to by name, more fields can be added at will, and tools like grep can be used to easily find everywhere a particular field is used.

Here are some drawbacks to this method:

It is still possible to access fields by index. For example, this works:
```
birthdate = some_person[1]
```
However, this would break if fields were added to the top of the class. For example, if an address field were added between name and birthdate, then the above code would have to be changed because the birthdate would now store the value of the address instead.
To the best of my knowledge, it is not possible to introduce a subclass of structures defined as named tuples. Of course, it is always possible to create another class composed of this structure.

Dataclass

Data classes are very similar to named tuples. The syntax here is:

from dataclasses import dataclass
from datetime import date
from typing import Union

@dataclass(frozen=True)
class Person:
    name: str
    birthdate: date
    occupation: Union[str, None]
    id: int

some_person = Person(
    name='John Doe', birthdate=date(2020, 11, 10), occupation='student',
    id=1)

It has all the same benefits mentioned above for named tuples, with the caveat that dataclasses are mutable by default (hence the frozen=True option). Additionally, dataclasses do not suffer from the two drawbacks listed above. It is not possible to refer to the fields by index, and a subclass of Person can be written like so:

class Employee(Person):
    salary: int

some_employee = Employee(
    name='John Doe', birthdate=date(2020, 11, 10), occupation='student',
    id=1, salary=3000)

Conclusion

Named tuples and dataclasses appear to serve the same purpose, so how to choose between them? A good breakdown of the small but important differences can be found here. The main difference is:

Named tuples are backed by tuples, meaning they inherit most, if not all, of the properties of tuples. In other words, they are hashable (so they can be used as keys in dicts), orderable, immutable, etc.
Dataclasses are backed by dicts, but have many options which determine what other things they can do. They can be made mutable or immutable, they can be made hashable, comparable, etc. depending on what options are passed to the @dataclass decorator.

Another important point to note is that dataclasses were introduced in Python 3.7, while named tuples go all the way back to Python 2.6, so if compatibility with older versions of Python is a concern, it is better to use named tuples.

Other articles

Ascii-Art Conversion Using Python

Published: Wed 08 July 2020
By Ashish Banerjee

In Computers.

tags: Software Python

I recently came across an excellent blog post which has some suggestions on interesting short-term coding projects. I think these projects are a fun way to learn a new language (lately, I've been interested in learning Nim and Ocaml, for reasons that I'll try to explain some other time).

For …
read more
Transferring Files Over HTTP Using Python
Published: Sun 17 May 2020
By Ashish Banerjee

In Computers.

tags: Software python HTTP

Sending files around a network always seems to be difficult for some reason. But Python has a built-in utility that makes this much easier. Simply run this at the command line:
```
python3 -m http.server 8080
```
Python will start an HTTP server on port 8080 and it will serve all …
read more
My Python Workflow

Published: Sat 25 April 2020
By Ashish Banerjee

In Computers.

tags: Software python pdb vim

Python is my favourite language. Some find it surprising that my preferred editor for coding is not an IDE, but vim. Not even vim with plugins, but vim with very few customisations. This won't be an in-depth post on how I use vim - that would take me too long to …
read more