Spring Cleaning Code
Tips and tricks after refactoring two Python projects.

Note: I’ll expand this blog post with more tips over time, to have it all in one location

At work it seems to be spring cleaning all year round: This year I have been tasked with refactoring two Python projects written by others. The first was based on a now-defunct GitHub project. It consisted of several .py files that somehow worked together by sheer will, approximately 2500 lines of code all in all, with lots of duplicated or even dead code. The project was in such a dire state that I basically rewrote it as a new version, with bits and pieces of the original strewn throughout the new code base. The second project loads data from an API, transforms it, then outputs it in a new format. It was around 1000 lines of code, structured by comments.

This article summarizes some of the refactoring I’ve done and the commonalities I saw in both projects. Please note: The people who wrote the code did their best given their knowledge, capabilities and time. Sometimes pressure is high for things to just run now, without any consideration of clean code guidelines. That’s okay! I don’t blame them in any way for how their code looks or behaves. What I share here are my insights for other people to benefit from, whether they are writing or refactoring code. The examples have been anonymized; the projects are sadly closed-source.

Initial Steps - If Possible

Refactoring means that the new codebase should do the same thing as the initial one, even if the whole has been massively changed. Before touching the code in any way, I did the following wherever it was missing:

  • Initialized a git repo (one project did not use version control - the horror!)
  • Added logging (see the Logging HOWTO in Python’s documentation)
  • Wrote black-box tests with Pytest, checking coverage with pytest-cov

Passing tests indicate that my changes preserve the original behavior. Failing ones uncover either existing bugs in the code or new ones introduced by me.
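As a minimal sketch of such a black-box test: the function transform() and its behavior are hypothetical stand-ins, not code from either project. The test pins only input and observed output, so it should keep passing while the internals get rewritten. Pytest collects any function whose name starts with test_.

```python
import logging

logger = logging.getLogger(__name__)


def transform(record: dict) -> dict:
    """Hypothetical stand-in for a legacy function that is being refactored."""
    logger.debug("Transforming record %s", record)
    return {"id": record["id"], "name": record["name"].strip().title()}


def test_transform_keeps_known_output():
    # Black-box: pin the observed output for a known input, ignore internals.
    result = transform({"id": 7, "name": "  ada lovelace "})
    assert result == {"id": 7, "name": "Ada Lovelace"}
```

Running pytest with the --cov option from pytest-cov then shows which lines such tests do not reach yet.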

Writing More Pythonic Code

Write code with the reader in mind: “Eighty percent of the time, code is read” has become my mantra, taught by my excellent Software Engineering professor during my computer science studies. As a consequence, code should be written as if it were text (hello journalism, my old friend!): Succinct, precise and with the reader in mind. I applied the mantra whenever I came across variable names like f_d_brb or check_value_init, where checked_files and delta_initial_check_last_check would have told me what I needed to know. Python is not Assembly, your code is probably not meant for a demo party, and IDEs have code completion - choose variable names that are readable.

Update January ‘24: Michael Hart writes Writing Code is the Same Thing as Writing Prose. This.

Documentation is your friend: Learn how to read Python’s documentation. Or check Stack Overflow, ask an AI, then cross-reference with the OG documentation. It contains excellent examples, hidden gems and best practices. I reduced a helper function of fifteen lines down to abs(a - b), which returns the absolute difference between two numbers.
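To illustrate: the built-in abs() takes a single argument, so the absolute difference of two numbers is abs(a - b). The verbose helper below is a hypothetical, shortened reconstruction of the kind of function that can be replaced, not the original code.

```python
def absolute_difference_verbose(a, b):
    """Hypothetical hand-rolled helper of the kind that abs() replaces."""
    if a > b:
        return a - b
    return b - a


# The built-in gives the same result in one expression.
assert absolute_difference_verbose(3, 10) == abs(3 - 10) == 7
assert absolute_difference_verbose(10, 3) == abs(10 - 3) == 7
```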

Documentation will make you friends: Writing a function of 500 lines without a docstring is evil. How am I supposed to know what this make_endpoint_happy() is doing? Annotate at least your bigger functions with docstrings - it will also help with generating documentation. And a function of 500 lines is a code smell in itself.
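A docstring does not need to be long. The sketch below reuses the name make_endpoint_happy(), but its parameters and behavior are invented for illustration, since the original is closed-source.

```python
def make_endpoint_happy(payload: dict, api_version: str = "v1") -> dict:
    """Prepare a payload so that the endpoint accepts it.

    The behavior here is invented for illustration: keys with a value
    of None are dropped and the API version is added.

    Args:
        payload: Raw key-value pairs collected by the caller.
        api_version: Version string to send along, "v1" by default.

    Returns:
        A new dict; the input payload is not modified.
    """
    cleaned = {key: value for key, value in payload.items() if value is not None}
    cleaned["api_version"] = api_version
    return cleaned
```

Calling help(make_endpoint_happy) in a Python shell prints this text, and documentation generators can pick it up as well.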

Magic Numbers Code Smell: “Eighty percent of the time, code is read”. What is -90? Degrees? Freezer temperature? My bank account? No, it’s THREE_MONTHS_AGO. Refactoring Guru has more on Magic Numbers.
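A small sketch of the fix, assuming that -90 is an offset in days (the original code is not shown, so the unit and the function cutoff_date() are assumptions):

```python
from datetime import date, timedelta

# Named constant instead of a bare -90 scattered through the code.
# Assumption: the value is an offset in days relative to a reference date.
THREE_MONTHS_AGO = -90


def cutoff_date(today: date, offset_days: int = THREE_MONTHS_AGO) -> date:
    """Return the date that lies offset_days away from today."""
    return today + timedelta(days=offset_days)
```

The reader now sees the intent at the call site, and the value only has to be changed in one place.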

Dataclasses to the rescue: Code contains a zoo of variables. Some are - like animals - closely related, some not. You want to group the monkeys together and separate them from the birds, thus creating more cohesion and signaling this to the reader (“Code is read”, remember?). Dataclasses are a fantastic device for that - better than dicts, less work than regular classes. With dataclasses, you can pass variables that belong together as a single function argument, and the reader will know that these variables belong together.
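As a rough sketch, here is how two related values could travel together. The class CheckWindow and the function summarize() are hypothetical; only the field names are borrowed from the variable names mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckWindow:
    """Hypothetical grouping of values that belong together."""

    checked_files: int
    delta_initial_check_last_check: float  # assumed to be seconds


def summarize(window: CheckWindow) -> str:
    # One argument instead of several loosely related ones.
    return (
        f"{window.checked_files} files checked, "
        f"{window.delta_initial_check_last_check:.1f}s between first and last check"
    )
```

The decorator generates __init__, __repr__ and __eq__, and frozen=True makes instances read-only, which helps when the values should not change after creation.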

Modern Python, less pain:

  • f-strings are great and were introduced in (gasp) Python 3.6. They make strings, variables, everything more readable (“Code is…” - you got it). Here’s a tutorial from Real Python and a quick cheatsheet. Concatenating Python strings with + will make me cry.
  • Pathlib makes your life so, so much easier when dealing with filepaths. Your code becomes easily portable to Linux and back to Windows without even thinking of backslashes. Use it as often as you can instead of os.path pains.
  • enumerate() instead of ugly loop counter variables. Real Python teaches how.
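The three tips combine nicely. The function list_reports() below is a made-up example: it lists the CSV files in a directory using pathlib, numbers them with enumerate() and formats each line with an f-string.

```python
from pathlib import Path


def list_reports(directory: Path) -> list[str]:
    """Return one numbered line per CSV file in directory, sorted by name."""
    lines = []
    # enumerate() supplies the counter; start=1 gives human-friendly numbering.
    for position, path in enumerate(sorted(directory.glob("*.csv")), start=1):
        size = path.stat().st_size
        lines.append(f"{position}: {path.name} ({size} bytes)")
    return lines
```

Paths are joined with the / operator, for example Path("data") / "reports", which works the same on Linux and Windows.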

Order your imports: Most IDEs do this automatically, but, well, I’ve seen different. First come the standard library modules, then third-party libraries, then your own. PEP 8 describes it.
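A sketch of the three groups, each separated by a blank line as PEP 8 recommends. The third-party and local imports are commented out so the snippet runs without installs; requests and myapp are placeholder names.

```python
# 1. Standard library
import logging
from pathlib import Path

# 2. Third-party packages (commented out so the sketch runs without installs)
# import requests

# 3. Your own modules (hypothetical module name)
# from myapp import core

logger = logging.getLogger(__name__)
PROJECT_ROOT = Path.cwd()
```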

PEP 8 is your friend: IDEs can run PEP 8 checks for beautiful code. There are also nice formatters such as black.

By reading just this section, look how many new friends you can make! Go and seize the opportunity!

Project Structure

Generally, separate the configuration of your app from the code itself, following a principle of the “Twelve-Factor App”. It avoids such basic issues as having credentials in your code, then in your repository, which attackers love to steal.
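One way to follow that principle is to read settings from environment variables, sketched here with the standard library only. The variable names MYAPP_API_URL and MYAPP_API_TOKEN and the default URL are made up for this example.

```python
import os
from collections.abc import Mapping


def load_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Read settings from the environment instead of hardcoding them."""
    return {
        # Optional setting with a harmless default (hypothetical URL).
        "api_url": environ.get("MYAPP_API_URL", "http://localhost:8000"),
        # Required secret: raises KeyError if it is not set.
        "api_token": environ["MYAPP_API_TOKEN"],
    }
```

Passing the mapping as a parameter keeps the function easy to test. The values themselves can live in a local .env file that is listed in .gitignore, so credentials never reach the repository.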

Here’s the structure that I use for most projects. It has crystallized out of practice; adapt as needed:

projectname
├── config
├── .env
├── .gitignore
├── pyproject.toml
├── src
│   └── myapp
│       ├── cli.py
│       ├── core.py
│       ├── definitions.py
│       ├── errors.py
│       ├── __init__.py
│       ├── __main__.py
│       └── model.py
└── tests

cli.py: Contains the entry point to the application.
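A minimal sketch of what such a cli.py could contain, using argparse from the standard library. The options --config and --verbose and the default path are assumptions for illustration, not the actual interface of either project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line interface in one place."""
    parser = argparse.ArgumentParser(prog="myapp", description="Example entry point.")
    # Hypothetical options; adapt them to the application.
    parser.add_argument("--config", default="config/settings.toml", help="path to the config file")
    parser.add_argument("-v", "--verbose", action="store_true", help="enable debug output")
    return parser


def main(argv=None) -> int:
    """Parse arguments and return an exit code."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Using config file {args.config}")
    return 0
```

In this layout, __main__.py would only import main from cli.py and call it, so that python -m myapp starts the application.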

core.py: Main logic of application.

definitions.py: Inspired by Django, a file that contains constants used all over the application, like filepaths or strings. For strings over several lines, consider using Jinja templates instead.

errors.py: Custom exception classes (subclasses of Exception) for handling errors in the application.
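A sketch of how such an errors.py might look. The class names are invented; the idea is one base class per application, so callers can catch everything the app raises with a single except clause.

```python
class MyAppError(Exception):
    """Base class for all errors raised by the application."""


class ConfigError(MyAppError):
    """Raised when a required setting is missing or invalid."""


class ApiError(MyAppError):
    """Raised when the remote API answers with an error."""

    def __init__(self, status_code: int, message: str):
        super().__init__(f"API returned {status_code}: {message}")
        # Keep the status code so callers can react to it.
        self.status_code = status_code
```

Code in core.py can then raise ApiError(503, "unavailable"), while cli.py catches MyAppError and turns it into a readable message and a non-zero exit code.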

model.py: Classes, dataclasses.

Epilogue on ‘Reading Code’

While writing this blog post I discovered by chance that my former Software Engineering professor has actually open-sourced the textbook which he used in my classes (German version, English version). Thank you! It’s excellent - it teaches the basics of Software Engineering and how to write clean code. A lot of my code practice today is thanks to his classes, and whatever I’ve learned back then frequently pops up in my mind when writing (and refactoring) code.

Image source: “a python doing spring cleaning”, Nightcafe/SDXL-MeinaCafe


Last modified on 2023-10-21
