Engineering for All: Single Responsibility
“a class should have only one reason to change.” — Robert C. Martin
Single responsibility is the first of the SOLID principles, and its the most fundamental. Although initially, it seems very obvious and almost silly, when applied to a group of code, it can have a dramatic effect. Robert Martin’s fundamental idea is that the goal of SRP is to lessen the need for changing a unit’s code to only one reason. What ends up happening is a smaller, more precise grouping of code.
Now classically, SRP is related to Object-Oriented programming, but Let’s generalize and include all groups of code (functions, modules, etc..)
to apply it more broadly.
So restated this principle says you should have only one reason to change a group of code, which would mean it serves a singular responsibility or purpose. One purpose of this is to reduce how tightly coupled unique code is to each other. Within any group of code, all the code is tightly coupled or linked together, and a single change will often require several changes unrelated to the original goal.
Furthermore, this principle hints that this responsibility should be encapsulated as a single entity, abstracting the details away from anything calling that object.
Here is a very simplistic example of a class designed for data wrangling that doesn’t follow the SRP:
import pandas as pd
import pyarrow.parquet as pqclass Wrangler:
def __init__(self):
self._data: pd.DataFrame = pd.DataFrame() def read_csv(self, url: str):
self._data: pd.DataFrame = pd.read_csv(url) def read_json(self, url: str):
self._data: pd.DataFrame = pd.read_json(url) def read_parquet(self, url: str):
table = pq.read_table(url)
self._data: pd.DataFrame = table.to_pandas() def apply_pipeline1(self):
print("applying super secret ETL process 1") def apply_pipeline2(self):
print("applying super secret ETL process 2") def write_data(self, url: str):
self._data.to_csv(url) def apply_quality_checks(self):
print("applying super secret ETL quality checks") @property
def data(self):
return self._data
The above code might seem at first to be very convenient, and all of it related in some way. After all, it’s a single class to work with, that handles all the components needed for a data wrangling task. Each method is sharing the instance fields, which keeps everything simple, neat, and, most of all, the code is compact.
One side effect of this approach is that everything in the class is tightly connected together. When a small change must be made to any
component of the class, there will be cascading effects that need to be tests and resolved.
Another criticism to consider is that whenever you may need just one method like reading a CSV, you are forced to bring all the extra code along. With all that extra code, comes the possibility of complications from assumptions made by the developer, which could be anything from formatting, to memory size. When you need only a small portion of the functionality, you must use the whole class. Each feature isn’t encapsulated; instead, it is tightly woven together with the rest of the class.
Consider how you would go about testing this class and all its methods? By separating out responsibility, you remove the need to create a spider web of dependencies when testing the class without even discussing dependency injection.
Now let us refactor the above code to follow the SRP:
import pandas as pd
import pyarrow.parquet as pqclass CsvManager:
def __init__(self):
self._data: pd.DataFrame = pd.DataFrame() def read(self, url: str):
self._data: pd.DataFrame = pd.read_csv(url) def write_data(self, url):
self._data.to_csv(url) @property
def data(self):
return self._dataclass JsonManager:
def __init__(self):
self._data: pd.DataFrame = pd.DataFrame() def read(self, url: str):
self._data: pd.DataFrame = pd.read_json(url) def write_data(self, url: str):
self._data.to_csv(url) @property
def data(self):
return self._dataclass ParquetManager:
def __init__(self):
self._data: pd.DataFrame = pd.DataFrame() def read(self, url: str):
table = pq.read_table(url)
self._data: pd.DataFrame = table.to_pandas() def write_data(self, url: str):
self._data.to_csv(url) @property
def data(self):
return self._data
def apply_pipeline1(df):
print("applying super secret ETL process 1")def apply_pipeline2(df):
print("applying super secret ETL process 2")def apply_quality_checks(df):
print("applying super secret ETL quality checks")
We can now see that each file format is handled separately using handler classes. Also, there are separate supporting functions that are generic to whatever format the data is read. Now at first, this seems like we have replaced simple, easy to read code with significantly more code to maintain. I believe this is a legitimate concern, but I also believe the gains outweigh any added code complexity. Furthermore, it could be argued that by separating the code into logical single-purpose units, you are increasing readability and reducing overall complexity.
TLDR
Break your code into single-purpose units and never have a group of code that has more than one responsibility. This principle can be applied to classes, functions, or really any grouping of code you have even modules/packages.