Edit: it seems like my explanation turned out to be too confusing. In simple terms, my topology would look something like this:
I would have a reverse proxy hosted in front of multiple instances of git servers (let’s take 5 for now). When a client performs an action, like pulling a repo/pushing to a repo, it would go through the reverse proxy and to one of the 5 instances. The changes would then be synced from that instance to the rest, achieving a highly available architecture.
Basically, I want a highly available git server. Is this possible?
I have been reading GitHub’s blog on Spokes, their distributed system for Git. It’s a great idea except I can’t find where I can pull and self-host it from.
Any ideas on how I can run a distributed cluster of Git servers? I’d like to run it in 3+ VMs + a VPS in the cloud so if something dies I still have a git server running somewhere to pull from.
Thanks
I think I messed up my explanation again.
The load-balancer in front of my git servers doesn’t really matter. I can use whatever, really. What matters is: how do I make sure that when the client writes to a repo in one of the 5 servers, the changes are synced in real-time to the other 4 as well? Running rsync every 0.5 second doesn’t seem to be a viable solution
Why do you want 5 git servers instead of, say, 2? Are you after something more than high availability? Are you trying to run something like GitHub where some repos might have stupendous concurrent read traffic? What about update traffic?
What happens if the servers sometimes get out of sync for 0.5 sec or whatever, as long as each is in a consistent state at all times?
Anyway my first idea isn’t rsync, but rather, use update hooks to replicate pushes to the other servers, so the updates will still look atomic to clients. Alternatively, use a replicated file system under Ceph or the like, so you can quickly migrate failed servers. That’s a standard cloud hosting setup.
What real world workload do you have, that appeared suddenly enough that your devs couldn’t stay in top of it, and you find yourself seeking advice from us relatively clueless dweebs on Lemmy? It’s not a problem most git users deal with. Git is pretty fast and most users are ok with a single server and a backup.
Thanks for the comment. There’s no special use-case: it’ll just be me and a couple of friends using it anyway. But I would like to make it highly available. It doesn’t need to be 5 - 2 or 3 would be fine too but I don’t think the number would change the concept.
Ideally I’d want all servers to be updated in real-time, but it’s not necessary. I simply want to run it like so because I want to experience what the big cloud providers run for their distributed git services.
Thanks for the idea about update hooks, I’ll read more about it.
Well the other choice was Reddit so I decided to post here (Reddit flags my IP and doesn’t let me create an account easily). I might ask on a couple of other forums too.
Thanks
I see, fair enough. Replication is never instantaneous, so do you have definite bounds on how much latency you’ll accept? Do you really want independent git servers online? Most HA systems have a primary and a failover, so users only see one server. If you want to use Ceph, in practice all servers would be in the same DC. Is that ok?
I think I’d look in one of the many git books out there to see what they say about replication schemes. This sounds like something that must have been done before.
Well it’s a tougher question to answer when it’s an active-active config rather than a master slave config because the former would need minimum latency possible as requests are bounced all over the place. For the latter, I’ll probably set up to pull every 5 minutes, so 5 minutes of latency (assuming someone doesn’t try to push right when the master node is going down).
I don’t think the likes of Github work on a master-slave configuration. They’re probably on the active-active side of things for performance. I’m surprised I couldn’t find anything on this from Codeberg though, you’d think they have already solved this problem and might have published something. Maybe I missed it.
I didn’t find anything in the official git book either, which one do you recommend?
Are you familiar with git hooks? See
https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks
Scroll to the part about server side hooks. The idea is to automatically propagate updates when you receive them. So git-level replication instead of rsync.