We’re Brendan and Michael, the creators of Sourcebot (https://github.com/sourcebot-dev/sourcebot). Sourcebot is an open-source code search tool that allows you to quickly search across many large codebases. Check out our demo video here: https://youtu.be/mrIFYSB_1F4, or try it for yourself on our demo site here: https://demo.sourcebot.dev
In prior roles, we’ve both felt the pain of searching across hundreds of multi-million line codebases. Local tools like grep were ill-suited since you often only had a handful of codebases checked out at a time. Sourcegraph (https://sourcegraph.com/) solves this issue by indexing a collection of codebases in the background and exposing a web-based search interface. It is the de facto search solution for medium to large orgs, but is often cited as expensive ($49 per user / month) and recently went closed source (https://news.ycombinator.com/item?id=41296481). That’s why we built Sourcebot.
We designed Sourcebot to be:
- Easily deployed: we provide a single, self-contained Docker image (https://github.com/sourcebot-dev/sourcebot/pkgs/container/so...).
- Fast & scalable: designed to minimize search times (current average is ~73ms) across many large repositories.
- Cross code-host support: we currently support syncing public & private repositories in GitHub and GitLab.
- Quality UI: we like to think that a good looking dev-tool is more pleasant to use.
- Open source: Sourcebot is free to use by anyone.
Under the hood, we use Zoekt (https://github.com/sourcegraph/zoekt) as our code search engine, which was originally authored by Han-Wen Nienhuys and is now maintained by Sourcegraph (https://sourcegraph.com/blog/sourcegraph-accepting-zoekt-mai...). Zoekt works by building a trigram index from the source code, which enables extremely fast regular expression matching. Russ Cox has a great article on how trigram indexes work if you’re interested: https://swtch.com/~rsc/regexp/regexp4.html
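To make the trigram idea concrete, here is a minimal sketch in Python. It is not Zoekt's implementation: it only handles literal substring queries (the regular expression case derives a trigram query from the pattern, as the Russ Cox article describes), and the `TrigramIndex` class and its method names are made up for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index: trigram -> names of documents containing it.

    Hypothetical illustration only; a real engine also stores offsets,
    compresses postings, and handles regular expressions.
    """

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, name, content):
        self.docs[name] = content
        for tri in trigrams(content):
            self.postings[tri].add(name)

    def candidates(self, literal):
        """Documents containing every trigram of the query.

        This is a superset of the true matches. Queries shorter than
        three characters have no trigrams, so every document is a candidate.
        """
        tris = trigrams(literal)
        if not tris:
            return set(self.docs)
        return set.intersection(*(self.postings.get(t, set()) for t in tris))

    def search(self, literal):
        """Verify candidates with a real substring check."""
        return sorted(n for n in self.candidates(literal) if literal in self.docs[n])


if __name__ == "__main__":
    idx = TrigramIndex()
    idx.add("a.py", "def parse_config(path): pass")
    idx.add("b.py", "def render(): pass")
    print(idx.search("parse_config"))
```

The speedup comes from the candidate step: only documents that contain all of the query's trigrams are scanned, so most of the corpus is never touched.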
In the shorter term, there are several improvements we want to make, like:
- Improving how we communicate indexing progress (this is currently non-existent so it’s not obvious how long things will take)
- UX improvements like search history, query syntax highlighting & suggestions, etc.
- Small QOL improvements like bookmarking code snippets.
- Support for more code hosts (e.g., BitBucket, SourceForge, ADO, etc.)
In the longer term, we want to investigate how we could go beyond just traditional code search by leveraging machine learning to enable experiences like semantic code search (“where is system X located?”) and code explanations (“how does system X interact with system Y?”). You could think of this as a copilot being embedded into Sourcebot. Our hunch is that this will be useful to devs, especially when packaged with the traditional code search, but let us know what you think.
Give it a try: https://github.com/sourcebot-dev/sourcebot. Cheers!
I know that intentions can change, but I'm curious how you see it. Sourcegraph was pretty clearly always going to be a business-type-of-project, and like most business projects, relicensed everything to their custom enterprise license. Originally it was Apache 2 [1].
I love open source and I write a lot of it myself [2]. I use the MIT license, just like you've done here, and I admire that. I don't think you owe me or anyone else anything, and the MIT license makes that clear.
I am very interested in this project and I'd love to extend and contribute to it, but only if it's an actual open source project. Seems like every devtools-focused startup these days calls themselves "open source" but fails to actually build a community, because in reality it's just a marketing gimmick. Because the project is actually a company, the people involved never try very hard to build a community of contributors. When the company invariably cannot make money with an open source product, the code gets relicensed to be closed-source. The few people who had contributed end up getting played. That's what happened to Sourcegraph!
So: open source, or open source "for now"?
[0]: https://news.ycombinator.com/item?id=41715776
[1]: https://github.com/sourcegraph/sourcegraph-public-snapshot/c...
What they've described smells a lot like a thing that needs to become a business — see Sourcegraph — and Brendan [0] and Michael [1] are currently working together at a startup they founded.
I'm getting tired of seeing other businesses pissing in the pool by claiming to be "open source" purely for the marketing benefits, so I figured I'd ask up front and see what they say.
Should be a simple answer either way!
This is still day 1, so we honestly don't have an answer as to whether we will get to a point where we can monetize - it's too early to tell. However, if we do end up going down that road, I don't think generating revenue and being a good steward of open source are mutually exclusive.
My view is that there is a balance that can exist between open source and building a profitable business that doesn't negatively impact the open source community. Companies that come to mind that I think are striking this balance are PostHog & GitLab.
Great work so far; best of luck!
sorry for not responding to your email, I was swamped.
I looked through the source code, but I can only find UI (i.e. browser) code. Does this do anything beyond delivering a more functional and prettier UI on top of an existing zoekt deployment? If not, everybody would be better served if you tried to improve the UI inside Zoekt, which currently is a live demonstration of (my lack of) web app programming skills.
Have you thought of how you will achieve your further goals (e.g. semantic search)? That will require server-side changes, but you currently have no Go code at all.
Yea that is correct - in its current state, it's functionally a UI wrapper on top of the zoekt-webserver API. One of the reasons why we decided to go with a separate app is that we have much more experience with TypeScript, React, and Next.js (the web framework we are using), so it felt like we could move a lot quicker using what we know.
In terms of semantic search, that is still very early days - my intuition is that having a separate "semantic code indexer" server written in Python would again allow us to move quickly (since all of the ML libraries are written in Python).
If you’re curious about the source, as I was, here it is: https://github.com/sourcegraph/zoekt/blob/main/web/templates...
It looks like you're working on this full-time (and it's a lot of work to build great code search, as I know from working on my own product).
What are your plans for monetizing / building a sustainable business without inevitably going closed source like Sourcegraph?
I understand intentions can change, but there's a difference, and I'm curious to know the answer.
Looking beyond the immediate, I think there is a lot of fertile ground with respect to making engineering teams more efficient beyond just regular code search. Semantic code search, for example, is one of those features that I really wish I had when I was at my last job - it would have made onboarding onto new codebases much easier.
Would love to hear more about your use cases: brendan@sourcebot.dev
Based on regexp
However, Hound does the job well.
--- a/README.md
+++ b/README.md
@@ -1,256 +1,256 @@
- We do not collect or transmit [any information related to your codebase](https://github.com/search?q=repo:sourcebot-dev/sourcebot++captureEvent&type=code)
+ We do not collect or transmit [any information related to your codebase](https://demo.sourcebot.dev/search?query=repo%3Asourcebot-dev%2Fsourcebot%20captureEvent)
which regrettably currently says "No results found" :-( but there are a few things that need fixing, at least repo redirects and case-insensitive `repo:` arguments.
It's not open source but I use it all the time. Far superior to Github's search.
I don't have the experience to know if that's cheaper (for the hoster) than just periodically calling the $(git fetch --mirror) endpoint. I could see opening a conversation with the major providers asking which they would prefer, since it's in everyone's best interest to not unduly hammer them
To the best of my knowledge, any such quotas are per API key. It's possible they are per account, but creating accounts is free.
Also, any such mechanism would only be to advise the sync process that a commit (or push) had occurred, and it would still use the $(git fetch --mirror) process but would just be an optimization of not running it (all the time|too infrequently)
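A rough Python sketch of that optimization, under stated assumptions: the `MirrorSyncer` class, its method names, and the one-hour fallback interval are all hypothetical, and nothing here reflects how Sourcebot actually syncs. A push notification only marks a repository as dirty; the fetch itself is still the ordinary mirror update, with a periodic fallback in case a notification is missed.

```python
import subprocess
import time


class MirrorSyncer:
    """Fetch a repo when a push was reported or the last fetch is too old.

    Hypothetical sketch: `fetch` is any callable taking a repo identifier.
    """

    def __init__(self, fetch, max_interval_s=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_interval_s = max_interval_s
        self._clock = clock
        self._dirty = set()      # repos with a reported push since last fetch
        self._last_fetch = {}    # repo -> clock value at last fetch

    def notify_push(self, repo):
        """Called by whatever receives the code host's push notification."""
        self._dirty.add(repo)

    def tick(self, repos):
        """Run one scheduling pass and return the repos that were fetched."""
        now = self._clock()
        fetched = []
        for repo in repos:
            last = self._last_fetch.get(repo)
            stale = last is None or now - last >= self._max_interval_s
            if repo in self._dirty or stale:
                self._fetch(repo)
                self._dirty.discard(repo)
                self._last_fetch[repo] = now
                fetched.append(repo)
        return fetched


def git_fetch(repo_dir):
    # Assumes repo_dir was created with `git clone --mirror`, so a plain
    # fetch with pruning updates every ref from the remote.
    subprocess.run(["git", "-C", repo_dir, "fetch", "--prune", "origin"], check=True)
```

In use, something like `MirrorSyncer(fetch=git_fetch)` would have `tick(...)` called on a short timer; most passes then do no network work at all, which addresses both the "all the time" and the "too infrequently" cases.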
Can it work against in-place repos, for example if hosted on the same server as a code forge installation?
Currently we don't support in-place repos, but feel free to file an issue and we'd be happy to take a look.
For example I’d like to index branches release1, release2, etc. but not have it index developer temporary gitlab MR branches.
I assume HEAD refers to the head of the default branch when cloning the repository.
Still, neat. Glad to have an easy to deploy open source tool like this.
And you can persist indexes across restarts by mounting a volume to the `/data` directory (e.g., `-v $(pwd):/data`). Indexes are stored in a `.sourcebot` cache directory.
Thanks for the interest!