In line with its transparent and DevOps-centric approach, documenting CodeFloe's infrastructure setup is essential to the platform's identity. CodeFloe wants its users to know how everything is wired together and why certain architectural decisions have been made.
This open approach also allows for improvements and contributions from the community to make the platform better over time. It also helps users understand the platform's limitations and challenges.
Important
Besides the "how", we also aim to explain the "why" behind certain decisions. If a "why" is missing, let us know and we will try to improve the documentation!
Deployment Concept
CodeFloe is fully committed to using "Infrastructure as Code" (IaC) for all parts of its infrastructure. OpenTofu and Ansible are used to provision and configure resources. Both tools are the de-facto standard for IaC and configuration management.
Besides a "local apply" concept, which is primarily aimed to be used for development and rescue purposes, the main development and apply workflow is done through (public) CI/CD workflows. This allows for transparency and organized, ideally also reviewed, deployment flow across different environments.
"Isn't it risky to publicly share the full infrastructure concept including IPs"?
Valid question! It can be, it should not 😉️
If the infrastructure is properly designed with respect to security-related aspects such as access control and automated CVE patching, the risk is very small, and in no way substantially higher than for non-transparent environments.
Admittedly, a fully disclosed architecture makes it a bit easier for malicious actors to probe certain parts. Yet it also allows "white hats" to do the same and report possible weaknesses before they can be abused.
In the end, the biggest threat to IT systems is humans making mistakes during manual actions and planning. This happens on a daily basis, be it directly by applying a faulty config or by being unaware of the side effects of a specific change. The goal is to minimize the need for such manual actions, rather than to fear possible attacks from unknown actors.
Last but not least, we believe that more transparency related to infrastructure and IT architecture is needed globally to improve the overall understanding of it in the community. Public code sharing already has its place. It is time to also find a place to share/show public IT architectures 🚀️.
Hardware
CodeFloe runs on Hetzner Cloud. Hetzner brings many essential points to the table:
- great value/cost ratio
- GDPR compliance
- Terraform provider
- VMs and bare-metal servers
- S3
These are effectively all the parts needed to run any service. Yes, a managed database is missing from that list. However, we are brave (enough) to roll our own database 🤓️, both to save costs and "because we can" 😉️.
Currently, most parts of the service run on virtual machines (VMs), which start at 2 CPU/4 GB memory and can be scaled up to 32 GB memory. One CI runner is deployed on a bare-metal server (AX42). As storage is bound to the VM size on Hetzner, the VM size will likely need to be increased for storage reasons rather than for compute needs. Once we hit storage limits for VMs or face other resource-related limitations, we'll migrate to bare-metal servers. These can attach multiple datacenter-grade NVMe drives, which deliver substantial performance improvements over the disks attached to the VMs.
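For illustration, provisioning such a VM with OpenTofu and the Hetzner provider looks roughly like the following sketch (resource name, server type and location are placeholders, not CodeFloe's actual definitions):

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

# Illustrative VM definition; scaling up later only means changing server_type.
resource "hcloud_server" "forgejo" {
  name        = "forgejo-01"
  server_type = "cx32"       # e.g. 4 vCPU / 8 GB
  image       = "debian-12"
  location    = "fsn1"
}
```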
Database
Alright, let's talk about the "heart" of each service: the database ⛁!
To start off, let's address the most prominent question: using a managed service or rolling our own. This debate will likely persist for as long as humanity exists, and of course we thought about it as well. Outlining all arguments for each case would fill multiple pages here, so we'll focus on explaining what we've decided to do and why:
We went with Postgres deployed through autobase on individual VMs in an HA setup.
Reasoning
Postgres is the de-facto standard for large-scale database needs. It is FOSS, likely used by the majority of projects, and has a strong ecosystem of extensions and tools (like autobase) which make it realistic to administrate a large DB at scale. In addition, maintaining a self-hosted instance likely saves the project from financial death: the costs for managed DBs are very high compared to self-hosting, and once the instance requirements grow, the costs grow steeply with them.
We believe that it must be possible to run a production-grade Postgres service in 2025 without a managed service.
And just to get something clear: we are not just starting out with Postgres 😉️ We have substantial experience in running databases, including Postgres, in production environments. Yet, "production" is always different and a project like CodeFloe has the potential to grow to another level we have not experienced before. This means both excitement and challenges ahead! But we are confident that we can handle it, together with the community and an open & transparent approach.
Technical
The database runs in a three-node setup, with automatic failover and replication orchestrated through patroni. Backups are executed via pgbackrest to two distinct (S3) locations. autobase allows adding further nodes to the cluster if needed, and seamless major upgrades can be performed by simply executing a playbook.

Each node has a connection pooler (pgbouncer) in front. On top, each pooler has a load balancer (LB, HAProxy) in front, which serves as the central entrypoint for any traffic. The LB contacts the connection pooler instances in a "round-robin" fashion, ensuring it always finds a healthy one in case one of the poolers is down. In the same manner, any client can use the LBs as the primary connection point(s) for the database.
For example, Forgejo (starting with v12) can make use of a so-called "EngineGroup" connection, which allows specifying multiple connections for primary and replica nodes, respectively. In our setup, this connection group contains the entrypoints of the HAProxy LBs sitting in front of the connection poolers.
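As a rough illustration of this layering, the central DB entrypoint in HAProxy could look like the following sketch (addresses, names and check settings are placeholders, not the production configuration):

```
listen postgres
    bind *:5432
    mode tcp
    balance roundrobin
    option tcp-check
    default-server inter 3s fall 3 rise 2
    # each server is a pgbouncer instance sitting in front of a Postgres node
    server pooler-1 10.0.1.11:6432 check
    server pooler-2 10.0.1.12:6432 check
    server pooler-3 10.0.1.13:6432 check
```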
Latency and IOPS
Both of these characteristics are known to be highly important for a database system.
During initial tests, the reported scores of the Hetzner VMs were OK to good (~22k IOPS; the bare-metal servers reach roughly twice that), especially for the price offered. There is surely still room for improvement, and this will also be needed once the platform experiences greater load. However, for the start and the foreseeable future, the current setup should be sufficient. Once we see potential bottlenecks related to disk performance, we can upgrade to better hardware at any time!
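For reference, such numbers can be reproduced with a simple synthetic benchmark; a typical random-read fio run could look like this (the parameters are illustrative, not the exact test we ran):

```
fio --name=randread --ioengine=libaio --direct=1 --rw=randread \
    --bs=4k --iodepth=64 --numjobs=4 --size=2G \
    --runtime=60 --time_based --group_reporting
```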
Storage
Together with the database, storage is surely among the most important topics: it will become the primary cost driver at some point, and it also needs to deliver high performance, as the repository data is queried a lot.
To keep costs small, it is important to outsource all assets of secondary importance (packages, avatars, images) to an alternative storage provider, i.e. S3. This allows for highly reduced storage costs without any user-facing downsides. In fact, it even allows putting a CDN in front of the S3 bucket, which can further improve performance. As mentioned in the database section, the disks attached to the Hetzner VMs are "okayish" in terms of speed. More importantly, their size is tightly coupled to the VM size: one can scale up to 320 GB, and from there onwards a switch to bare-metal servers is required.
This is also pretty much the roadmap for the future: once CodeFloe's disk usage approaches the capacity of the largest VM, we will start planning the migration to a bare-metal server with appropriate disks. The good thing is: disks can be added sequentially when needed, i.e. we can start with a 1-2 TB disk and take it from there. Yes, the bare-metal servers do not come with redundant backups by default, but given that we are rolling our own backups to two distinct S3 locations and the rest of the server storage is not of importance, we can safely rely on these backups.
Backups
Backups are a vital part of a public platform. At CodeFloe, we have backups for the following parts:
- Database: via pgbackrest, with `diff` backups every hour and `full` backups every day (see the config sketch after this list).
- Repositories: backed up via restic every hour to two distinct S3 locations.
- Other assets: assets which are stored natively in S3 are mirrored to another S3 location.
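To make the database part more tangible, a pgbackrest repository definition with two S3 targets could look roughly like this (bucket names, endpoints, paths and retention values are placeholders, and the S3 credentials are omitted):

```ini
[global]
# first S3 target
repo1-type=s3
repo1-s3-bucket=db-backup-primary
repo1-s3-endpoint=s3.example-provider-a.com
repo1-s3-region=eu-central
repo1-retention-full=7

# second, independent S3 target
repo2-type=s3
repo2-s3-bucket=db-backup-secondary
repo2-s3-endpoint=s3.example-provider-b.com
repo2-s3-region=eu-central
repo2-retention-full=7

[main]
pg1-path=/var/lib/postgresql/17/main
```

The hourly/daily schedule then simply calls `pgbackrest --stanza=main --type=diff backup` and `pgbackrest --stanza=main --type=full backup` via cron or systemd timers.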
Because of the IaC approach, no backups of the complete VM or its operating system are needed. Any server can be restored to its current configuration by running an Ansible playbook.
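Restoring a node therefore boils down to something like the following (inventory and playbook names are placeholders):

```
ansible-playbook -i inventories/production site.yml --limit forgejo-01
```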
Monitoring
Uptime monitoring is done through Gatus and available on status.codefloe.com.
Metrics monitoring is done through a k3s-hosted "Prometheus Stack" instance (which also comes with Grafana) provided by devXY.
Note
This instance is being used temporarily until there is a dedicated monitoring instance available within CodeFloe's own infrastructure.
The following Grafana dashboards are available (and require a login, as the public dashboard sharing functionality of Grafana does not work with templated variables):
- "Node Exporter Full"
- "PostgreSQL" (showing PostgreSQL-specific metrics)
- "HAProxy" (showing HAProxy-specific metrics)
- "Forgejo" (showing Forgejo-specific usage metrics)
There are also common alert rules (disk space, CPU pressure, etc.) in place, which are transmitted through ntfy.
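Delivering such an alert through ntfy is essentially a plain HTTP POST; a minimal example (server and topic are placeholders) could be:

```
curl -H "Title: Disk space low" -H "Priority: high" \
     -d "forgejo-01: /var/lib is above 90% usage" \
     https://ntfy.example.com/infra-alerts
```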
Secret Management
This is a very important topic for any environment focusing on (semi-)automated deployments. The secret provider must be able to distribute secrets securely to both CI jobs and individuals, so the latter can access and use them for local troubleshooting and rescue purposes.
It is still unclear whether we want to use OpenBao or HashiCorp Vault (which OpenBao was forked from).
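Regardless of the final choice, the consumption side would look very similar in both cases; a hypothetical example of a CI job or administrator fetching a secret with the OpenBao CLI (address, mount and path are made up):

```
export BAO_ADDR=https://secrets.example.com   # assumed env var, mirrors VAULT_ADDR
bao kv get -mount=secret -field=deploy_ssh_key infra/forgejo
```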
Scaling Strategy
Being on Hetzner, we can scale vertically on the VM level up to 16 cores and 32 GB RAM. However, the primary limiting factor there will likely be the disk space (320 GB max). This mainly applies to the instance running Forgejo itself.
The database should be fine in terms of disk space for the foreseeable future. At some point, we might be forced to migrate the DB to bare-metal/dedicated servers, as their disks have twice the IOPS of the VM disks.
All of these options will be evaluated when we see bottlenecks in relation to performance. We should be able to catch these early through the metrics monitoring in place.
With respect to horizontal scaling of the Forgejo instance: as of today, Forgejo isn't cluster-ready yet, i.e. Forgejo cannot properly be run in HA mode. While it can technically be done in k3s, there is no active communication between the instances. This means that certain operations would run redundantly, which could lead to inconsistencies. Additionally, this approach currently requires a RWX (ReadWriteMany) volume, which makes the instance substantially slower than using a RWO (ReadWriteOnce) mount for the repository data.
Defense & Protection Measures
We have witnessed the historic attacks on the Codeberg infrastructure, primarily through DDoS attacks (on the technical side) and user spam (on the moderation side). DDoS is hard to prevent in the first place, even with active support from the underlying cloud provider. It highly depends on the magnitude of the attack whether it will be classified as "DDoS" or just as "heavy traffic".
We want to be frank and open here: we can't say what will happen until the first attack ;)
Note
Yes, we conducted load tests upfront and will continue to do so in the dev environment, though these are hard to compare to a real DDoS attack.
In general, there are many ways one can attack a public service. One cannot be prepared for all of them, especially not as a project of this scale without "big tech" driving it. We are trying our best to put practices in place that prevent unauthorized access, resource abuse and other known risk factors, such as SQL injection and cross-site scripting (XSS) attacks. This goes hand in hand with a clear and transparent RBAC system for all satellite services which could eventually influence the primary instance and its data.
Since we use HAProxy, we have followed their article on bot protection and implemented a multi-layered defense system with the following measures:
Rate Limiting & Traffic Control
- Smart rate limiting per IP address over a 30-second sliding window with progressive penalties
- Per IP+URL rate limiting over a 24-hour window to prevent targeted resource abuse
- Path-specific rate limiting with different thresholds:
  - Static assets (`/assets`, `/avatars`): 4,000 requests per interval
  - Large repositories (containing `linux`, `bsd`, `kernel`): 1,200 requests per interval
  - Git operations (`commit`, `branch`, `compare`): 2,000 requests per interval
- Error-based rate limiting blocking IPs with >10 4xx responses over 5 minutes
- New page visit tracking with denial after 30 unique pages per 30s
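To make this more concrete, the core of such stick-table based rate limiting looks roughly like the following HAProxy sketch (table sizes, thresholds and names are illustrative, not the exact production values):

```
# dummy backends holding the stick-tables
backend st_src_rate
    stick-table type ip size 1m expire 30s store http_req_rate(30s)

backend st_src_url_rate
    stick-table type binary len 20 size 1m expire 24h store http_req_rate(24h)

frontend fe_https
    bind :443 ssl crt /etc/haproxy/certs/
    # track each request per source IP and per IP+URL combination
    http-request track-sc0 src        table st_src_rate
    http-request track-sc1 base32+src table st_src_url_rate
    # deny when either sliding-window rate is exceeded
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
    http-request deny deny_status 429 if { sc_http_req_rate(1) gt 50 }
```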
Attack-Specific Protection
- Brute force attack prevention for login endpoints (>10 POST requests to `/login` within 3 minutes)
- WordPress/CMS attack blocking for common attack vectors (`/wp-admin/`, `/wordpress/`)
- Bot persistence tracking that tags and continues blocking suspected bots until stick-table expiration
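In simplified form, the corresponding rules could look like this (in production they sit in the same frontend as the rate-limiting sketch above; paths, thresholds and table names are placeholders):

```
# stick-table counting login POSTs per source IP
backend st_login_rate
    stick-table type ip size 100k expire 3m store http_req_rate(3m)

frontend fe_https
    # block common CMS probe paths outright
    acl is_cms_probe path_beg -i /wp-admin/ /wordpress/
    http-request deny if is_cms_probe
    # throttle repeated login attempts per source IP
    acl is_login path_beg -i /login
    http-request track-sc2 src table st_login_rate if is_login METH_POST
    http-request deny deny_status 429 if is_login METH_POST { sc_http_req_rate(2) gt 10 }
```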
Bot Detection & Blocking
- User-Agent based detection blocking known malicious bots (`semrus`, `AhrefsBot`, `MJ12bot`, `ZoominfoBot`, `DotBot`, `MauiBot`)
- Outdated browser blocking preventing access from Chrome versions more than 10 major versions behind current
- IP-based blocking using curated lists from the Ultimate Hosts Blacklist project
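The User-Agent and IP based blocking boils down to a few ACLs; a trimmed-down sketch (the file path and the exact UA list are placeholders):

```
frontend fe_https
    # substring match against known bad crawler user agents
    acl bad_bot hdr_sub(User-Agent) -i semrus ahrefsbot mj12bot zoominfobot dotbot mauibot
    # source IPs taken from a curated blocklist file
    acl blocked_ip src -f /etc/haproxy/blocklist-ips.txt
    http-request deny deny_status 403 if bad_bot || blocked_ip
```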
Security Headers & Request Sanitization
- HSTS enforcement with 2-year max-age, subdomain inclusion, and preload directive
- Anti-clickjacking protection via X-Frame-Options
- MIME type sniffing prevention through X-Content-Type-Options
- Secure cookie enforcement with Secure and SameSite=Lax attributes
- Privacy protection including FLoC opt-out
- Request header sanitization removing client-provided IP headers to prevent spoofing
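In HAProxy terms, these headers are attached to responses in the main frontend; a condensed sketch (values mirror the list above, the cookie rewrite is intentionally simplified):

```
frontend fe_https
    # strip client-provided forwarding headers to prevent IP spoofing
    http-request del-header X-Forwarded-For
    # security and privacy response headers
    http-response set-header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header Permissions-Policy "interest-cohort=()"
    # naive cookie hardening: append Secure and SameSite attributes
    http-response replace-header Set-Cookie "(.*)" "\1; Secure; SameSite=Lax"
```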
Allowlist for good bots
On top, we also have a dedicated allowlist for "good bots". These are helpful for letting projects be found by search engines and the like. We are using a curated list from "AnTheMaker/GoodBots" for that.
Abuse Prevention
Abuse can happen in many ways, the most common are:
- Abusing storage (both for packages and Git repositories)
- Abusing CPU/Memory in CI/CD
For the first one, we have quotas in place per user/org. To prevent users from creating many accounts to circumvent this limitation and scale horizontally, we reserve the right to monitor potential abuse and take appropriate action. For example, if we see a user/org hitting the limit and simply opening other accounts to store similar content, this falls into the "abuse" category. We will moderate such cases and email the responsible people first. Should we take further action (e.g. deleting/suspending accounts), we will do so transparently in the CodeFloe Forum and outline the reasoning behind it.
Note
We are fully aware that such actions can lead to potential distrust in the platform and a feeling of being treated unfairly, which is why we want to be fully clear and transparent about it. Most users (99.x%) will never even come close to being affected by such actions. As long as you use the platform in a "normal" way, there is no risk of being limited or suspended.
On the other hand, we (the platform owners) must retain the right and the means to take action when appropriate, to protect both other users and the platform itself.
Lastly, a reminder: if it is purely about resources, you can always reach out to us to get your quota limit raised in exchange for a monthly donation.