In line with the previous article “Structure, readability and efficiency in code development”, I add some practical tips to improve Python development practices.
As you know, in Capitole we have presence in many different industries. Many of us are in data processing projects, in Data Science / Development /Devops positions and work both on physical servers and on cloud machines in AWS, Azure or other cloud services. For us it is very important to work efficiently and follow good practices in development, leaving a good image of our company wherever we go. This allows us to perform our job the best we can and makes things easier for the end customers of the developed product.
In this article, we share some of the thoughts that we have acquired over time, that are meant to help as tips to organize the code. They are simple tricks that can save a lot of time and misunderstandings in the day-to-day work of the team of developers.
Inline Tests
I know you test your code; otherwise, how do you know it works? But here’s the question: do you keep track of the tests you do? If not, how can others trust your code?
Welcome to the amazing world of unit testing. This is one of those things that might not seem fun at the beginning, but once you’ve experienced long hours wasted debugging code, and then hours saved thanks to testing your code, it magically becomes fun and a must.
I would want to teach you about the assert statement, also known as “inline tests”. These tests are useful to check if the input and output of your functions are correct.
Let me show you an example where this comes in handy. Let’s say you are working with a vector of probabilities, and you want to project to 0 or 1 depending on a threshold. This function is implementing this:
def project_to_zero_or_one(probabilities, threshold):
# define empty array
projections = np.empty_like(probabilities)
# project
projections[probabilities < threshold] = 0
projections[probabilities >= threshold] = 1
return projections
But what if there are nans in your input vector? What if one of the entries is <0 or >1? (remember probabilities are not defined outside the range [0,1]) What if the input is a matrix and not a vector?
I would like the code to tell me if anything like that is happening, meaning there’s something wrong somewhere else I need to fix before it’s too late.
def project_to_zero_or_one(probabilities, threshold):
# check input
assert probabilities.ndim == 1, “Input must be a vector!”
assert np.isnan(probabilities).sum() == 0, “Input contains NaN values!”
assert np.sum(probabilities > 1) == 0, f”There are probabilities > 1!”
assert np.sum(probabilities < 0) == 0, f”There are probabilities < 0!”
# define empty array
projections = np.empty_like(probabilities)
# project
projections[probabilities < threshold] = 0
projections[probabilities >= threshold] = 1
return projections
One practice I like to follow is extracting all assert statements out of the main function. This is particularly useful when you have other functions that use the same argument, such as probabilities, allowing you to reuse the code.
def _check_probabilities(probabilities):
assert probabilities.ndim == 1, ‘Input must be a vector!’
assert np.isnan(probabilities).sum() == 0, ‘Input contains NaN values!’
assert np.sum(probabilities > 1) == 0, ‘There are probabilities > 1!’
assert np.sum(probabilities < 0) == 0, ‘There are probabilities < 0!’
Code formatters and Stylers
You may not realize it yet, but you’ll spend most of your career reading code instead of writing it. Whether you work in a team and review your colleagues’ code, or when you are trying to solve a problem by looking for an answer on StackOverflow, or even when you come back to debug code you wrote months ago. In all those situations, you will be reading a lot of code.
For that reason, it is important to write code in a consistent and uniform way. This includes decisions such as maximum line length, empty lines between function definitions, and syntax conventions like vector[:-1] or vector[: -1]. These may seem like small details, but they have a significant impact on code readability for humans. The big question is, can all these small decisions be automated? Yes, indeed.
- A code formatter is a tool that automatically modifies the layout and style of source code to adhere to a specific set of formatting rules or guidelines. I highly recommend Black.
- On the other hand, a code styler is a tool that assists developers in applying a specific coding style or set of guidelines to their code. While similar to code formatters, code stylers are more flexible and suggest changes to the code instead of modifying it directly. For example, they may suggest renaming variables or removing unused libraries. I highly recommend flake8.
Structuring code as a package
Are you having trouble importing your own Python modules? Does the error ModuleNotFoundError: No module named ‘my_python_file’ look familiar? Have you already experienced the insecurity of knowing if you have installed your modules, where they are located or if you are using the correct path? It might be time to improve your code structure.
Whenever starting a new project, structure your code something like this:
my_project/
├── src/
│ ├── __init__.py
│ ├── my_module.py
│ └── my_folder/
│ ├── __init__.py
│ └── my_other_module.py
├── data/
│ ├── raw/
├── scripts/
│ ├── my_script.py
├── setup.py
└── README.md
A few things to note:
- When Python imports a package, it looks for the __init__.py file in the package directory and executes any code inside it.
- setup.py is a Python script that is used to define the metadata and dependencies of a Python package. The simplest it can be is:
from setuptools import setup, find_packages
setup(
name=’my_package’,
packages=find_packages(),
)
You can also specify dependencies, authors, versions, etc:
from setuptools import setup, find_packages
setup(
name=’my_package’,
version=’0.1′,
author=’John Doe’,
author_email=’john.doe@example.com’,
description=’A simple Python package’,
packages=find_packages(),
install_requires=[
‘numpy>=1.16.0’,
‘pandas>=0.23.4’,
],
)
Once your folders look like this (and you are in your virtual environment) type pip install -e path/to/my_project/. This will install your package in editable mode. This means that as you change your code your installed package is automatically updated, and you won’t need to reinstall anything.
Conclusion
In summary, good coding structure and practices not only improve development efficiency, but also facilitate collaboration and long-term code maintenance.
- The practice of testing (in an ordered and consistent manner) is essential to ensure in a reliable and controlled way that the code complies with the defined functionalities correctly.
- The use of code stylizers and formatters are essential habits to homogenize criteria in any development team. The key is to write code that is easily understandable, replicable, and adaptable, which will benefit both you and your teammates and customers.
- Structuring your own code as a package is a good practice that will make it easier to share and publish the code in the future and installing it in editable mode saves a lot of time, as it updates automatically.
Efficiency in code is ultimately efficiency in results.