
Following a transparent and DevOps-centric approach, the documentation of CodeFloe's infrastructure setup is essential to the platform's identity. CodeFloe wants its users to know how everything is wired together and why certain architectural decisions have been made.

This open approach also allows for improvements and contributions from the community to make the platform better over time. It also helps users understand the platform's limitations and challenges.

Important

Besides the "how", we also aim to explain the "why" behind certain decisions. If a "why" is missing, let us know and we will try to improve the documentation!

Deployment Concept

CodeFloe is fully committed to using "Infrastructure as Code" (IaC) for all parts of its infrastructure. OpenTofu and Ansible are used to provision resources and configure them. Both tools are the de-facto standard for IaC and configuration management, respectively.

Besides a "local apply" concept, which is intended primarily for development and rescue purposes, the main development and apply workflow runs through (public) CI/CD workflows. This allows for a transparent, organized, and ideally reviewed deployment flow across different environments.
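
For illustration, a "local apply" could look roughly like the following. This is a minimal sketch; the repository layout, inventory, and playbook names are assumptions, not the actual ones.

# Hypothetical local apply flow with OpenTofu + Ansible
cd infra/opentofu
tofu init                      # fetch providers and modules
tofu plan -out=plan.tfplan     # review the planned changes first
tofu apply plan.tfplan         # provision the Hetzner resources

cd ../ansible
ansible-playbook -i inventory/prod site.yml   # configure the provisioned hosts

In the CI/CD variant, the same plan/apply split allows the plan output to be reviewed before the apply job runs.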

"Isn't it risky to publicly share the full infrastructure concept including IPs"?

Valid question! It can be, but it should not be 😉️

If the infrastructure is properly designed with respect to security-related aspects such as access control and automated CVE patching, the risk is very small, and in no way substantially higher than for non-transparent environments.

Surely, a fully disclosed architecture makes it a bit easier for malicious actors to probe certain parts. Yet it also allows "white hats" to do the same and report possible weaknesses before abuse takes place.

In the end, the biggest threat to IT systems is humans making mistakes during manual actions and planning. This happens on a daily basis, be it directly by applying a faulty config or indirectly by being unaware of the side-effects of a specific change. The goal is to minimize the need for such manual actions rather than to fear possible attacks from unknown actors.

Last but not least, we believe that more transparency around infrastructure and IT architecture is needed globally to improve the community's overall understanding of it. Public code sharing already has its place; it is time for public IT architectures to have one as well 🚀️.

Hardware

CodeFloe runs on Hetzner Cloud¹. Hetzner brings many essential points to the table:

  • great value/cost ratio
  • GDPR compliance
  • Terraform provider
  • VMs and bare-metal servers
  • S3-compatible object storage

These are effectively all the parts needed to run any service. Yes, a managed database is missing from that list. However, we are brave (enough) to roll our own database 🤓️, both to save costs and "because we can" 😉️.

CodeFloe uses a mix of dedicated servers and cloud instances. The latter are great for small services and dev instances. However, they only have a small root disk, which cannot be used for a Ceph cluster (Ceph requires a dedicated storage device). Also, the shared NVMe disks of the cloud instances are good, but not as good as the NVMe disks of dedicated servers. This is why we acquired three dedicated servers. These come with two disks, of which one is used for the Ceph cluster and the other for the database storage.

For CI/CD, we know that speed matters. Hence, an AX42 (artemis) and a Mac Mini M4 (gaia) process the builds for amd64 and arm64, respectively.

(The server names were inspired by the game "Horizon Forbidden West" 🎮️)

| Name | Env | CPU | Mem | Disk | OS | GB6 SC | GB6 MC | Used for | Costs/m (€) |
|---|---|---|---|---|---|---|---|---|---|
| demeter | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVMe | Alma10 | 1749 | 7352 | Git, DB | 40.7 |
| hades | prod | Intel XEON E-2276G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVMe | Alma10 | 1749 | 7352 | Git, DB | 37.7 |
| minerva | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVMe | Alma10 | 1749 | 7352 | Git, DB | 36.7 |
| artemis | prod | AMD Ryzen 7 PRO 8700GE | 64 GB DDR5 ECC | 2x 500 GB SAMSUNG MZVL2512HCJQ-00B0 NVMe | Alma9 | 2676 | 11864 | CI/CD | 47.3 |
| gaia | prod | Apple M4 | 32 GB DDR5 | 1x 500 GB APPLE SSD AP0512Z | macOS 15 | 3781 | 14858 | CI/CD | - |
| misc | prod | Shared ARM V8 | 8 GB DDR5 | 80 GB NVMe SSD | Alma9 | 1079 | 3490 | CI/CD, status, Forum | 6.49 |
| cf-dev | dev | Shared ARM V8 | 4 GB DDR5 | 40 GB NVMe SSD | Alma9 | 1035 | 1869 | - | 3.79 |
| misc-dev | dev | Shared ARM V8 | 4 GB DDR5 | 40 GB NVMe SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |
| pgnode1-dev | dev | Shared ARM V8 | 4 GB DDR5 | 40 GB NVMe SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |
| pgnode2-dev | dev | Shared ARM V8 | 4 GB DDR5 | 40 GB NVMe SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |
| pgnode3-dev | dev | Shared ARM V8 | 4 GB DDR5 | 40 GB NVMe SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |
| Total | | | | | | | | | 187.84 |

(GB6 SC/MC = Geekbench 6 single-core/multi-core score.)

Database

Alright, let's talk about the "heart" of each service: the database ⛁!

To start off, let's address the most prominent question: use a managed service or roll our own? This debate will likely be around for as long as humanity exists, and of course we thought about it too. Outlining all arguments for each side would fill multiple pages here, so we'll focus on explaining what we decided to do and why:

We went with Postgres deployed through autobase on individual VMs in an HA setup.

Reasoning

Postgres is the de-facto standard for large-scale DB needs. It is FOSS, (likely) used by the majority of projects, and has a strong ecosystem of extensions and tools (like autobase) which make it realistic to administer a large DB at scale. In addition, maintaining a self-hosted instance likely saves the project from financial death: the costs for managed DBs are far higher than for self-hosted ones, and once the instance requirements grow, the costs grow disproportionately with them.

We believe that it must be possible to run a production-grade Postgres service in 2025+ without a managed service.

And just to get something clear: we are not just starting out with Postgres 😉️ We have substantial experience in running databases, including Postgres, in production environments. Yet, "production" is always different and a project like CodeFloe has the potential to grow to another level we have not experienced before. This means both excitement and challenges ahead! But we are confident that we can handle it, together with the community and an open & transparent approach.

Technical

The database runs in a three-node setup, with automatic failover and replication orchestrated through patroni. Backups are executed via pgbackrest to two distinct (S3) locations. autobase allows adding additional nodes to the cluster if needed. We can also perform seamless major upgrades by simply executing an Ansible playbook.
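
For day-2 operations, the standard patroni and pgbackrest tooling applies. A minimal sketch, assuming the default patroni config path and a stanza named "main" (both assumptions, not the actual values):

# Inspect cluster topology, replication lag, and failover state
sudo patronictl -c /etc/patroni/patroni.yml list

# Trigger an on-demand differential backup and list existing backup sets
sudo -u postgres pgbackrest --stanza=main --type=diff backup
sudo -u postgres pgbackrest --stanza=main info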

Each node has a connection pooler (pgbouncer) in front of it. On top, each pooler has a load balancer (HAProxy) in front, which serves as the central entrypoint for any traffic. The LB contacts the connection pooler instances in a round-robin fashion, ensuring it always finds a healthy one in case one of the poolers is down. In the same manner, any client can use the LBs as the primary connection point(s) for the database.
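
Since PostgreSQL's client library (libpq) supports listing multiple hosts in one connection string, a client can name all LB endpoints and let the driver pick a reachable one. A hypothetical example (hostnames and port are made up):

# Connect through the HAProxy entrypoints; libpq tries the hosts in order,
# and target_session_attrs=read-write ensures we end up on the primary
psql "host=lb1.internal,lb2.internal,lb3.internal port=5000 dbname=forgejo user=forgejo target_session_attrs=read-write"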

For example, Forgejo can make use of a so-called "EngineGroup" connection (starting with v12), which allows specifying multiple connections for primary and replica nodes, respectively. The connection pool then contains the respective entrypoints of the HAProxy LBs sitting in front of the connection poolers.

Latency and IOPS

Both of these characteristics are known to be highly important for a database system.

To get some numbers, we ran the following pgbench benchmark:

sudo -u postgres createdb pgbench_test
sudo -u postgres pgbench -i -s 10 pgbench_test               # initialize with scale factor 10
sudo -u postgres pgbench -c 30 -j 4 -T 120 pgbench_test      # write test: 30 clients, 4 threads, 120 s
sudo -u postgres pgbench -S -c 30 -j 4 -T 120 pgbench_test   # read test: SELECT-only workload

which resulted in

| Host | Write TPS | Read TPS | Read Latency (ms) | Write Latency (ms) |
|---|---|---|---|---|
| demeter | 19971 | 160344 | 0.187 | 0.256 |

While the results for a pgbench benchmark are only comparable across the same test setup, the numbers give some indication of the overall performance characteristics of the database system.

Storage

Together with the database, this topic is surely among the most important ones: storage will at some point become the primary cost driver, and it also needs to deliver high performance, as the repository data is queried a lot.

To keep costs small, it is important to outsource all assets of secondary importance (packages, avatars, images) to an alternative storage provider, i.e. S3. This allows for substantially reduced storage costs without any user-facing downsides.
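
In Forgejo, such an offload is configured per storage section in app.ini. A hedged sketch using the MinIO-style storage keys; endpoint, bucket, and credentials are placeholders, not CodeFloe's actual values:

# Hypothetical app.ini snippet routing storage to an S3-compatible backend
# (merge into the existing app.ini rather than blindly appending)
cat >> /etc/forgejo/app.ini <<'EOF'
[storage]
STORAGE_TYPE = minio
MINIO_ENDPOINT = s3.example.com
MINIO_BUCKET = codefloe-assets
MINIO_ACCESS_KEY_ID = <access-key>
MINIO_SECRET_ACCESS_KEY = <secret-key>
MINIO_USE_SSL = true
EOF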

As the repository files are stored on disk, disk speed matters. Also, an HA-based architecture is required to tolerate hardware failures and allow for seamless node updates. This is why we are opting for a Ceph cluster across the three nodes, which we are currently working on. Right now, the Git service is running on a single node with frequent backups.

With all the HA ideas in mind, there is still a lot of work to be done on the Forgejo side. Having an HA storage setup and an HA database is only half the battle: Forgejo is not yet HA-ready, i.e. custom adaptations are required to make the individual components (queue, cache, cron) work reliably in HA mode.

Backups

Backups are a vital part of a public platform. At CodeFloe, we have backups for the following parts:

  • Database: via pgbackrest, with differential backups every hour and full backups every day (see the sketch after this list).
  • Repositories: backed up via restic every hour to two distinct S3 locations.
  • Other assets: assets which are stored natively in S3 are mirrored to another S3 location.
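
Put together, the recurring jobs boil down to something like the following sketch (bucket names, paths, and the stanza name are assumptions):

# Repository backup via restic to an S3 bucket (run once per S3 target;
# assumes RESTIC_PASSWORD and S3 credentials in the environment)
restic -r s3:s3.example.com/codefloe-repo-backup backup /var/lib/forgejo/repositories

# Database backups via pgbackrest: hourly differential, daily full (cron-driven)
pgbackrest --stanza=main --type=diff backup
pgbackrest --stanza=main --type=full backup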

Thanks to the IaC approach, no backups of the complete VM or its operating system are needed: any server can be restored to its current configuration by running an Ansible playbook.

Monitoring

Uptime monitoring is done through Gatus and available on status.codefloe.com.

Metrics monitoring is done through a k3s-hosted "Prometheus Stack" instance (which also comes with Grafana) provided by devXY.

Note

This instance is being used temporarily until there is a dedicated monitoring instance available within CodeFloe's own infrastructure.

A number of Grafana dashboards are available; they require a login, as the public dashboard sharing functionality of Grafana does not work with templated variables.

There are also common alert rules (disk space, CPU pressure, etc.) in place, which are delivered through ntfy.
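
ntfy keeps the delivery side simple: publishing a message is a plain HTTP POST, so any alerting hook can emit a notification. A small sketch (the topic URL is made up):

# Publish a test alert to an ntfy topic via a plain HTTP POST
curl -H "Title: Disk space alert" -d "demeter: / is 92% full" https://ntfy.sh/codefloe-alerts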

Secret Management

A super important topic for any environment focusing on (semi-)automated deployments. The secret provider must allow distributing secrets securely to CI jobs and to individuals, so that the latter can access and use them for local troubleshooting and rescue purposes.

It is still unclear whether we want to use OpenBao or HashiCorp Vault (which OpenBao was forked from).

Scaling Strategy

Being on Hetzner, we can scale vertically on the VM level up to 16 cores and 32 GB RAM. However, the primary limiting factor there will likely be the disk space (320 GB max). This mainly applies to the instance running Forgejo itself.

The database should be fine in terms of disk space for the foreseeable future. At some point, we might be forced to migrate the DB to bare-metal/dedicated servers, as their disks have twice the IOPS of the VM disks.

All of these options will be evaluated when we see bottlenecks in relation to performance. We should be able to catch these early through the metrics monitoring in place.

With respect to horizontal scaling of the Forgejo instance: as of today, Forgejo isn't cluster-ready, i.e. it cannot yet be run in HA mode in a sophisticated manner. While it can technically be done in k3s, there is no active communication between the instances. This means that certain operations would run redundantly, which could lead to inconsistencies. Additionally, this approach currently requires a RWX volume, which makes the instance substantially slower than using a RWO mount for the repository data.

Defense & Protection Measures

We have witnessed the historic attacks on the Codeberg infrastructure, primarily through DDoS attacks (on the technical side) and user spam (on the moderation side). DDoS is hard to prevent in the first place, even with active support from the underlying cloud provider. Whether an event is classified as a "DDoS" attack or just as "heavy traffic" highly depends on its magnitude.

We want to be frank and open here: we can't say what will happen until the first attack ;)

Note

Yes, we conducted load tests upfront (and will continue to do so in the dev environment), though these are hard to compare to a real DDoS attack.

In general, there are many ways to attack a public service. One cannot be prepared for all of them, especially not as a project of this scale without "big tech" driving it. We are trying our best to put practices in place that prevent unauthorized access, resource abuse, and other known risk factors such as SQL injection and cross-site scripting (XSS) attacks. This goes hand in hand with a clear and transparent RBAC system for all satellite services which could eventually influence the primary instance and its data.

Since we use HAProxy, we followed their article on bot protection and implemented a multi-layered defense system with the following measures:

Rate Limiting & Traffic Control

  • General rate limiting per IP address: 10,000 requests per 30 minutes (~5.5 requests/second sustained); see the sketch after this list
  • Per IP+URL rate limiting over a 24-hour window to prevent targeted resource abuse
  • Path-specific rate limiting with different thresholds:
      • Static assets (/assets, /avatars): 4,000 requests per 30 minutes
      • Large repositories (containing linux, bsd, kernel): 1,200 requests per 30 minutes
      • Git operations (commit, branch, compare): 2,000 requests per 30 minutes
  • Error-based rate limiting blocking IPs with >10 error responses over 5 minutes
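
As an illustration of how such limits are expressed in HAProxy, here is a self-contained sketch of the per-IP rate limiting. Names and layout are illustrative, not our production config; the frontend section comes last so that the excerpts further below can simply be appended while keeping the file valid.

# Self-contained HAProxy sketch: track per-IP request rates in a stick-table
# and deny IPs exceeding 10,000 requests per 30 minutes
cat > /tmp/haproxy-sketch.cfg <<'EOF'
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

backend be_forgejo
    server app 127.0.0.1:3000

frontend fe_https
    bind :8443
    default_backend be_forgejo
    stick-table type ip size 1m expire 30m store http_req_rate(30m)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 10000 }
EOF
haproxy -c -f /tmp/haproxy-sketch.cfg   # validate the sketch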

Attack-Specific Protection

  • Brute force attack prevention for login endpoints (>10 POST requests to /login within 3 minutes)
  • WordPress/CMS attack blocking for common attack vectors (/wp-admin/, /wordpress/), as sketched in the excerpt after this list
  • Bot persistence tracking that tags and continues blocking suspected bots until stick-table expiration
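
A corresponding excerpt, appended to the frontend of the sketch above (illustrative only):

# Append to the sketch's frontend: block common CMS attack paths
cat >> /tmp/haproxy-sketch.cfg <<'EOF'
    acl cms_probe path_beg -i /wp-admin/ /wordpress/
    http-request deny if cms_probe
EOF
haproxy -c -f /tmp/haproxy-sketch.cfg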

Bot Detection & Blocking

  • User-Agent based detection blocking known malicious bots (SemrushBot, AhrefsBot, MJ12bot, ZoominfoBot, DotBot, MauiBot); see the excerpt after this list
  • Outdated browser blocking preventing access from Chrome versions more than 10 major versions behind current
  • IP-based blocking using curated lists from the Ultimate Hosts Blacklist project
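
The User-Agent matching can again be sketched as an excerpt for the HAProxy example above (bot list shortened; illustrative):

# Append to the sketch's frontend: deny requests whose User-Agent contains
# the name of a known malicious bot (substring match, case-insensitive)
cat >> /tmp/haproxy-sketch.cfg <<'EOF'
    acl bad_bot hdr_sub(User-Agent) -i AhrefsBot MJ12bot DotBot MauiBot
    http-request deny if bad_bot
EOF
haproxy -c -f /tmp/haproxy-sketch.cfg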

Security Headers & Request Sanitization

  • HSTS enforcement with 2-year max-age, subdomain inclusion, and preload directive (see the excerpt after this list)
  • Anti-clickjacking protection via X-Frame-Options
  • MIME type sniffing prevention through X-Content-Type-Options
  • Secure cookie enforcement with Secure and SameSite=Lax attributes
  • Privacy protection including FLoC opt-out
  • Request header sanitization removing client-provided IP headers to prevent spoofing
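
The header-related measures map to plain HAProxy header rewrites; another excerpt in the style of the sketch above (values mirror the list: 2-year HSTS, anti-clickjacking, nosniff, header sanitization):

# Append to the sketch's frontend: sanitize inbound headers and set security headers
cat >> /tmp/haproxy-sketch.cfg <<'EOF'
    http-request del-header X-Forwarded-For
    http-response set-header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-Content-Type-Options "nosniff"
EOF
haproxy -c -f /tmp/haproxy-sketch.cfg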

Allowlist for good bots

On top, we also have a dedicated allowlist for "good bots". These are helpful for letting projects be found by search engines and the like. We are using a curated list from "AnTheMaker/GoodBots" for that.
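
Such an allowlist is typically an ACL fed from a pattern file; a sketch in the style of the excerpts above (the file path is assumed):

# Allow curated good bots by User-Agent; in a real config this rule must be
# evaluated before the deny rules, and the pattern file must exist
cat >> /tmp/haproxy-sketch.cfg <<'EOF'
    acl good_bot hdr_sub(User-Agent) -i -f /etc/haproxy/goodbots.lst
    http-request allow if good_bot
EOF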

Abuse Prevention

Abuse can happen in many ways; the most common are:

  • Abusing storage (both for packages and Git repositories)
  • Abusing CPU/Memory in CI/CD

For the first one, we have quotas in place per user/org. To prevent users from creating many accounts to circumvent this limitation and scale horizontally, we reserve the right to monitor potential abuse and take appropriate action. For example, if we see a user/org hitting the limit and simply opening other accounts to store similar content, this falls into the "abuse" category. We will moderate such cases and email the responsible people first. Should we take further actions (e.g. deleting/suspending accounts), we will do so transparently in the CodeFloe Forum and outline the reasoning behind it.

Note

We are fully aware that such actions can lead to potential distrust in the platform and a feeling of being treated unfairly. That is why we want to be fully clear and transparent about it. Most users (99.x%) will never even come close to being affected by such actions. As long as you use the platform in a "normal" way, there is no risk of being limited or suspended.

On the other hand, we (the platform owners) must have the right and options to take actions when it's appropriate, to also protect other users on the platform and the platform itself.

Last, a reminder: if it is purely about resources, you can always reach out to us to have your quota limit lifted in exchange for a monthly donation.


¹ There is no relationship/affiliation between CodeFloe and Hetzner Cloud.