
Moving all our Python code to a monorepo: pytendi

Organizing code across a company is not an easy task. Even as a relatively young startup with only five developers, we quickly found ourselves tangled in a web of Python repositories. The increasing number of code repositories made it harder to discover and reuse existing code and to streamline our development process.

To tackle these issues, we decided to migrate most of our Python code into a single monolithic repository—a monorepo. In this article, we dive deeper into the reasoning behind this decision, the tradeoffs we considered, how we structured our codebase using the Polylith architecture, what benefits and problems we’ve observed so far, and plans for the future.

TL;DR: We migrated our Python codebase to a monorepo, improving discoverability, reusability, and developer experience. The migration wasn’t without challenges, but the benefits in code organization, development speed, and overall productivity have made it worthwhile for our growing team.

The challenges of a growing codebase

As a provider of speech recognition APIs for healthcare, our work involves a significant amount of Python code, particularly for data science and machine learning tasks. Initially, individual repositories seemed manageable, but as our projects and team grew, we encountered several issues:

  • Discoverability: It is hard to find out whether code that does (something similar to) what you want already exists.
  • Reusability: Code is more difficult to reuse when it’s in another repository.
  • Developer experience: If a functionality you’re working on depends on an internal library, and that internal library has to be changed, the feedback cycle is long.

We recognized that these problems would only worsen as we continue to grow.

Discoverability

In a multi-repo environment, discovering what is and isn’t already out there can be difficult. To know whether some functionality has already been implemented by someone else, or how some function in a library is actually implemented, developers must manually search through multiple repositories, often with inconsistent naming conventions and documentation. With a large number of repositories, this can take a while.

Another option to navigate your codebase is search. However, how well this works across multiple repositories depends on the quality of your version control platform’s code search. We use Bitbucket, which has no regex search, fuzzy matching, or “go to reference / definition”, making more sophisticated searches across the codebase difficult, especially if you don’t know exactly what you’re looking for yet.

How a monorepo solves this

With a monorepo, it becomes a lot easier to download all potentially relevant code with a single git clone. Even though the cloning itself can take a bit longer (git has optimizations for this), it saves you from navigating tens of different repos in a web UI or cloning each one separately.

Once the code sits on your filesystem, command line tools such as grep and fzf provide powerful pattern-based matching and fuzzy search for code searchability. Your favorite IDE’s features such as go to definition and find references are a great way to quickly hop between different parts of the codebase and understand how they are connected.

Of course, there is no silver bullet. As the size of the codebase grows, tooling such as language servers can start to slow down. At some point, specialized tooling will need to be developed to handle the scale. However, if you’re not at Google or Meta scale, this point might take a long time to reach.

Reusability

How do you reuse code that is scattered across various repos? The obvious first solution is simple: copying and pasting code. However, we know that such an approach is unlikely to be very maintainable in the long term. While copy-paste can be very valuable and even preferable as a technique to prevent the creation of premature abstractions, we do start to need abstractions and code sharing at some point.

For most Python software, code sharing is facilitated through the use of packages. Code that is meant to be shared internally can be hosted on an internal package registry. This is definitely a good approach, as it works seamlessly with existing project management tooling like pip and poetry. Still, there are some downsides: we now need to maintain an internal packaging server, and installing the packages might take some time.

How a monorepo solves this

In a monorepo, the code is already right there on your system. If you structure the repository in a good way, all code is reachable by all other code through direct module imports (more on this later). For example, we can do:

# components/pytendi/mylib/__init__.py
def cool_function():
    print("I'm a cool function!")

# consumer.py
from pytendi.mylib import cool_function

cool_function()

It is therefore trivial to use any existing code in another project. One possible downside to this approach is that it’s easy to reach into components’ internal implementation, which can result in an overly coupled codebase. We touch on how we tackle this later.

Developer experience

New code mostly builds on top of existing code. If such existing code is distributed as packages, as described in the previous section, it’s not trivial to quickly patch these dependencies. Say you’re calling a function money_printer from a package whose source is located elsewhere:

# remote_package.py
def money_printer():
    return 100

# consumer.py
from remote_package import money_printer


print(f"What can I do with {money_printer()} euros?")

You need the implementation of money_printer to be updated slightly:

def money_printer():
    return 1000

You need to take the following steps to have these changes properly merged into your project:

  1. Install remote_package as editable (pip install -e) such that changes you make to it are directly visible without re-installing. Other options: directly edit the source code in venv/lib/pythonX.Y/site-packages/remote_package; or clone the original repo and import functionality from that folder.
  2. Make your changes and test.
  3. Upstream your changes by creating a pull request and merging. With a different repo, it is also more likely that this repo is maintained by another team, which possibly creates more friction and overhead.
  4. Bump the remote_package version and publish it to your registry.
  5. In your project, install the latest version of the package.

That’s a whole lot of steps. This process introduces delays and decreases productivity, especially for quick fixes or iterative development. What steps would we need to take in the monorepo setup?

  1. Make your changes and test.
  2. Create a pull request and merge.
  3. Pull the latest changes from main.

That’s only 60% of the steps! This immediate feedback loop accelerates development and reduces coordination overhead. While it may not seem like much, over time the extra steps needed to make even minor changes to packages add up.

There are also downsides:

  • Your repo’s HEAD is your latest version, and there are no package versions you can use as “checkpoints”. That means consumers are, in principle, forced to upgrade whenever a dependency’s interface changes. This can actually make it harder to update dependencies, as you have to ensure that all consumers are updated as well.

    One thing that makes this less of an issue in a monorepo is that we can use our IDE’s refactorings to be relatively sure a change is applied to both the dependency and all its consumers. For instance, the rename refactoring should (in theory) update all references to a function.

  • The ease with which implementations and interfaces can be changed might make updating too easy. If the development team is not disciplined, this can lead to more changes that break consumers. An important tool to combat this is a good test suite.

Tradeoffs, tradeoffs, tradeoffs

Everything we do has tradeoffs, especially in software engineering. We mentioned some of them in the previous sections, and want to explore a couple more general pros and cons.

The good

  • It’s easier to enforce a consistent code style with a centralized point for documentation and coding guidelines. Linting rules and tests are applied in a consistent way as their configuration only has to be specified once.
  • We’re able to unify otherwise disparate CI pipelines. As an example, some of our infrastructure uses Azure Functions (the Azure equivalent of AWS Lambda), which have a specific deployment pipeline. As we had multiple Azure Functions in different repos, essentially the same deployment pipeline was replicated across them. This increases the maintenance burden and can result in unexpected deviations between deployments when a pipeline is only updated in some repos. One downside is that the unified pipeline itself has become a bit more complex in order to deploy different projects. The same holds for linting and testing pipelines.
  • It’s easier to keep up to date with other developers’ work and progress — all changes are located in one central git history. Working in a shared repository rather than in isolation encourages more frequent peer/code review, so that issues in logic, documentation, coding style, or clarity of a component are raised and addressed more quickly.
  • New devs have a single repository that contains all the code, tools and documentation they need to get started.

The bad

One call away

Every function is a single import away. While this makes reuse easier, it also lets developers reach into a module’s internals that should stay private. When that happens, consumer code becomes coupled to those internals, meaning any change there can break the consumer. This also pressures the consumed module to avoid changing its implementation, leading to components that are harder to refactor. In other words, it’s easier for the codebase to devolve into a big ball of mud ™.

As Python has no built-in features to make implementation details inaccessible, preventing this relies on developer discipline. Since discipline alone doesn’t scale, this calls for either special tooling or structuring the codebase in a specific way.

Currently, we try to solve this by having each module only export a specific set of functions in its __init__.py. We only allow other modules to import functionality exported in another module’s __init__.py, which we can think of as that module’s interface. For example:

## components/pytendi/example/internal.py
def can_be_exported():
    print("I can be exported!")
    
    shy_function()
    
def shy_function():
    print("And I'm too shy!")

## components/pytendi/example/__init__.py
from pytendi.example.internal import can_be_exported

# Only `can_be_exported` is part of the module's "interface".
__all__ = ["can_be_exported"]


## bases/pytendi/consumer/core.py

# Now we can directly import `can_be_exported` from `example`.
from pytendi.example import can_be_exported

# The below raises an ImportError: `shy_function` is not part of the interface.
# from pytendi.example import shy_function

This way, the consumed module is the one in control of what part of its implementation it exposes as its interface, allowing encapsulation and decoupling of modules to be maintained. Currently, we don’t have the tooling to enforce this at the code level, so for now it’s mostly a matter of trusting developers and code review.
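Since we lack that tooling today, here is a minimal sketch of what such a check could look like, assuming the convention above — a few lines of ast that flag imports reaching past a component’s __init__.py (the function name and module paths are illustrative, not actual Attendi tooling):

```python
# check_interfaces.py -- hypothetical sketch of a lint rule that flags
# imports reaching into a component's internals instead of its interface.
import ast


def find_internal_imports(source: str) -> list[str]:
    """Return modules imported from deeper than a component's top level.

    Allowed:  from pytendi.example import can_be_exported
    Flagged:  from pytendi.example.internal import shy_function
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            parts = node.module.split(".")
            # Anything below `pytendi.<component>` reaches into internals.
            if parts[0] == "pytendi" and len(parts) > 2:
                violations.append(node.module)
    return violations


print(find_internal_imports("from pytendi.example.internal import shy_function"))
# → ['pytendi.example.internal']
```

Run over every file in a pre-commit hook or CI step, a check like this would turn the convention into something enforceable rather than a matter of trust.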

Once the codebase grows large enough, tooling like language servers, type checkers, and linters can get slow. Google’s famous monorepo google3 has tons of bespoke tooling to keep their codebase productive. Unfortunately, we don’t have that amount of resources to invest in developer tooling (yet), so we can only hope this doesn’t become an issue too quickly. In the meantime, there are some quick wins to be had, like only running tests for the components that have changed, and their consumers.
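That last quick win boils down to a small selection step. Below is a hypothetical sketch (not our actual CI code): the consumer graph is passed in directly, whereas in practice it would be derived from the imports between components:

```python
# test_selection.py -- hypothetical sketch: given changed files and a
# component -> direct consumers graph, decide which components' tests to run.

def affected_components(
    changed_files: list[str], consumers: dict[str, list[str]]
) -> set[str]:
    """Components whose tests should run: those touched by the change,
    plus everything that (transitively) consumes them."""
    affected = {
        path.split("/")[2]  # components/pytendi/<name>/...
        for path in changed_files
        if path.startswith("components/pytendi/")
    }
    # Walk the consumer graph to pick up transitive consumers.
    queue = list(affected)
    while queue:
        component = queue.pop()
        for consumer in consumers.get(component, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


graph = {"text_transforms": ["speech_to_text_postprocessing_app"]}
changed = ["components/pytendi/text_transforms/core.py"]
print(sorted(affected_components(changed, graph)))
# → ['speech_to_text_postprocessing_app', 'text_transforms']
```

The resulting set can then be mapped to test directories and handed to pytest, so CI time grows with the size of a change rather than the size of the repo.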

The ugly


As we will cover in a later section on structuring the monorepo, we have per-project pyproject.toml files (containing each project’s dependencies) and one master pyproject.toml containing all projects’ dependencies. One issue with a master list of dependencies is that all modules and projects basically have to use the same versions of dependencies, kept in lockstep. The longer the list of dependencies becomes, the greater the probability of version conflicts.
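As a rough sketch of the two levels (all package names and versions below are purely illustrative):

```toml
# pyproject.toml (repo root) -- the master list everything resolves against
[tool.poetry.dependencies]
python = "^3.11"
torch = "2.1.0"
fastapi = "^0.109"

# projects/speech_to_text_postprocessing_app/pyproject.toml -- the subset
# needed for this one deployable artifact
[tool.poetry.dependencies]
python = "^3.11"
torch = "2.1.0"   # must stay in lockstep with the root version
```

The duplication between the two files is exactly where lockstep pressure comes from: bumping torch for one project means bumping it for everyone.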

While Python has one of the most vibrant packaging ecosystems of any language, its package management tooling leaves much to be desired. We won’t go into this too deeply here; much has been written about it already. The most glaring issues we expect concern (supposedly) incompatible transitive dependencies.

We currently use Poetry as our package and project manager. Poetry might fail to install dependencies when it deems different packages incompatible with each other, while in actuality they work perfectly fine together. A large part of this is caused by overspecified dependency constraints by package authors, a problem in the Python packaging ecosystem in general. See https://iscinumpy.dev/post/bound-version-constraints/#tldr for more on this.

Poetry doesn’t allow overriding a package’s transitive dependency constraints, and the maintainers don’t plan on adding this feature. We therefore hope that new tooling like uv, which does support overriding dependency versions, will mature fast enough to be usable in production.
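For illustration, at the time of writing uv reads such overrides from pyproject.toml roughly like this (the pinned version is purely illustrative; check uv’s documentation for the current option name):

```toml
[tool.uv]
# Ignore whatever bound a transitive dependency declares on numpy
# and force this version instead.
override-dependencies = ["numpy==1.26.4"]
```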

Structuring the monorepo: Polylith

Once we’d made the decision to move to a monorepo, we needed to decide how to structure it. We had the following requirements:

  • Reusing code needs to be painless, from any part of the codebase. All of us have encountered the dreaded ImportError: attempted relative import with no known parent package error.
  • It should be clear to each developer what code they can find where, and modularity should be incentivized.

Luckily, some people had already thought about this. After some digging around, we came across the Polylith architecture. Originating from the Clojure community:

Polylith is a software architecture that applies functional thinking at the system scale. It helps us build simple, maintainable, testable, and scalable backend systems. (official docs)

I will only give a brief introduction here; I encourage the interested reader to read more about it on its website. David Vujic, who ported some of the tooling to Python, has also written an excellent blog post about it. To quote him:

A Polylith code-base is structured in a components-first architecture. Similar to LEGO, components are building blocks. A component can be shared across apps, tools, libraries, serverless functions and services. The components live in the same repository; a Polylith monorepo. The Polylith architecture is becoming popular in the Clojure community.

The Polylith architecture has four main building blocks (each corresponding to a top-level folder):

  1. Components: Small building blocks consisting of actual business logic. These can be combined like lego bricks to build more complicated functionality. Importantly, they shouldn’t contain application or infrastructure-level concerns. This keeps them reusable. Components expose their interface in their __init__.py.
  2. Bases: Entrypoints or gateways into your business logic (components). A base can for example be an API or a script. The base should contain as little business logic as possible, delegating the actual implementation to components. That way, components can be reused and business logic can be implemented without being coupled to a specific application. For instance, if you want to build both an API and a CLI tool that mostly contain the same functionality, they can both call the same components.
  3. Projects: Represent all information necessary to build a deployable artifact, combining one or more bases (usually one). For instance, if you’re deploying an API using Docker, your Dockerfile would be here. Project-specific dependencies are located in a project-specific pyproject.toml. This way, infrastructure and application-level concerns are clearly separated.
  4. Development: Code used for development. Can use all dependencies specified in the main pyproject.toml.

As an example, we have a project speech_to_text_postprocessing_app, which uses a base with the same name, which in turn uses components such as text_transforms and auto_punctuator.
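For that example, the folder layout looks roughly as follows (file placement is illustrative):

```
pytendi/
├── pyproject.toml                # master list of all dependencies
├── components/pytendi/
│   ├── text_transforms/
│   │   └── __init__.py           # the component's interface
│   └── auto_punctuator/
├── bases/pytendi/
│   └── speech_to_text_postprocessing_app/
├── projects/
│   └── speech_to_text_postprocessing_app/
│       ├── pyproject.toml        # project-specific dependencies
│       └── Dockerfile
└── development/
```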

The good

  • It’s very easy to reuse code, imports simply work.
  • Developing and iterating is greatly accelerated. The pyproject.toml in the root of the project contains all dependencies, so we can easily create a notebook and have instant access to all code in the codebase and its dependencies. This is a powerful feeling, as we don’t have to waste time setting up new packages or downloading them from a package registry.
  • David Vujic has put in the effort to create tooling that makes Polylith easy to work with, distributed for instance as a Poetry plugin. It simplifies the creation of new bases, components, and projects, and can show the dependency tree for each project.
  • The lego-brick philosophy is quite natural and simple to work with, promoting modularity.
  • The forced structure ensures consistency across the repo.

The bad

The Polylith architecture was originally designed to build backend systems. In our case, while we have some backend applications, a lot of the code we write is for research purposes. Different projects might need conflicting versions of libraries such as PyTorch. This is hard to accommodate with a shared pyproject.toml file. We’re still looking for better ways to deal with this.

After some discussion about this, David suggests:

A temporary solution to work around it is excluding some components from the root pyproject.toml, letting the individual project use the one that differs. If there are dependencies that are incompatible and need to be that permanently, then it could be a use case for having the specific project in a separate repo with its own dependencies. If that is needed for a project, then it is a simple task to extract it from the Monorepo.

With Polylith, we also found it difficult to keep bases thin, with business logic in components. Some business logic will only ever be used by a single base. For instance, we have a base priority_queue_functionapp, whose implementation is located in the priority_queue component. This component exports functions that are only used by priority_queue_functionapp. Since these two modules are already coupled, moving the implementation to the component seems a bit artificial and unnecessary.

The migration

After we decided to move forward with Polylith, we started migrating our existing code to our new monorepo ✨ pytendi ✨. The basic process is to copy all the application code to a new base, factoring out (reusable) implementation logic into components. We also took the opportunity to add tests to some legacy code and improve their structure. In total, the migration took around 3 months, most of which was spent refactoring. I’m happy to say that we’ve migrated all of our production applications successfully!

In the meantime, we’ve onboarded a machine learning engineer and a data engineer, both of whom were able to quickly start contributing. While it’s hard to measure properly, I do feel the monorepo has helped them find their way around the codebase more quickly and has increased developer productivity, my own included.

Of course, there are not only upsides. On the flip side:

  • For devs who have never worked with Polylith, the difference between projects, bases, and components is not always clear. This led to some initial confusion about what should go where.
  • Finding a set of dependencies that work well together can sometimes be tricky as certain projects require specific and obscure versions of dependencies that do not work well with others.
  • Such a big repo can be a bit overwhelming, especially for new joiners. Enforcing strict code standards, especially in PRs, is a must to prevent tech debt from being buried deep inside the repo, where future developers will likely find it intimidating to take on without fresh context of all the surrounding components.

Onwards

There’s still a lot to learn and questions to be asked. We’ll have to figure out how to keep the monorepo organized as more code is added and more devs work with it, and how we’ll handle potential scaling issues. A first step in this process will be creating a styleguide, so the code itself becomes more consistent. In the future, we’ll consider migrating to a package manager like uv that hopefully has more options to deal with conflicting dependency constraints. Overall, we’re happy with the migration and the tradeoffs it presents, and are excited to continue evolving our codebase to support our growth.

Special thanks to David Vujic for creating the Python Polylith tools and discussion.

About Attendi

Attendi is an Amsterdam-based startup focused on bringing best-in-class speech-to-text, optimized specifically for healthcare, to healthcare professionals. We’re always looking for good engineers! Interested in what we’re doing? Feel free to drop us an email at omar[at]attendi.nl