Show HN: Transform your codebase into a single Markdown doc for feeding into AI

tesserato.web.app

279 points by tesserato 8 days ago

CodeWeaver is a command-line tool designed to weave your codebase into a single, easy-to-navigate Markdown document. It recursively scans a directory, generating a structured representation of your project's file hierarchy and embedding the content of each file within code blocks. This tool simplifies codebase sharing, documentation, and integration with AI/ML code analysis tools by providing a consolidated and readable Markdown output.

Terretta 7 days ago

CodeWeavers is a software company that focuses on Wine development and sells a proprietary version of Wine called CrossOver for running Windows applications on macOS, ChromeOS and Linux.

https://en.wikipedia.org/wiki/CodeWeavers

Trademark is active. It's an Ⓡ not just a ™, registered not just trademarked. To keep it, they have to demonstrate they defend it.

https://www.trademarkia.com/codeweavers-76546826

While this project drops the final "s", you don't get to launch an OS called "Window". The test is a fuzzy match based on likelihood of confusion.

  • jychang 7 days ago

    Yeah, I was thinking "what do the Wine guys have to do with this?"

    This project is definitely going to get C&D'd.

    • williamcotton 7 days ago

      Do you think they would actually litigate? They seem like different products serving entirely different markets so I am not sure that the trademark infringement claim is very defensible. And how do they prove damages?

  • pstuart 7 days ago

    CodeLoom could work instead

crisbal_ 8 days ago

I use the following for feeding into AI

   find . -type f -print -exec cat {} \; -exec echo \;
Which prints, for each file (including those in subfolders), the filename followed by the file's content.

Then `| pbcopy` to copy to clipboard and paste it into ChatGPT or similar.
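
For reference, a slightly hardened sketch of the same one-liner (the `.git` exclusion assumes a git checkout, and the `grep -I` binary check is a GNU/BSD grep feature, not POSIX):

```shell
# Same idea, hardened a bit: only regular files, skip .git, skip files
# that look binary (grep -I), and print a header before each file.
find . -type f -not -path './.git/*' -exec sh -c '
  for f; do
    grep -Iq . "$f" || continue   # -I makes grep treat binary files as non-matching
    printf "==> %s <==\n" "$f"
    cat "$f"
    echo
  done
' _ {} +
```

Append `| pbcopy` (macOS) as before to land the result on the clipboard.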

  • singpolyma3 8 days ago

    I guess this only works for very small codebase?

    • OsrsNeedsf2P 8 days ago

      Correct, but it's the same as what OP shared.

      You should use Aider/Cursor for proper indexing/intelligent codebase referencing

      • starfezzy 7 days ago

        Cramming thousands of tokens of potentially irrelevant context through unclear indexing paths isn't "proper".

        The best results come from feeding precisely targeted context directly into the prompt, where you know exactly what the model sees and how it processes it. The prompt receives the most accurate use of attention—whereas god knows what the pipeline is for cursor or what extra layers and context restrictions they add on top of base Claude.

        Giving the model a clean project hierarchy accomplishes a lot efficiently in terms of context tokens. The key is ensuring it only sees what's relevant, without diluting its attention.

        Tools like repomix and OP's version, feeding targeted context straight into models like Claude or Google's offerings, outperform Copilot and Cursor in my experience, even though they use the same base models. Use the highest-quality attention (the prompt context) directly, rather than layers of uncertainty and "proper indexing".

      • soco 7 days ago

        I'm still puzzled how come people are convinced by Cursor, while my experience was meh at best. Can it index your stuff? Okay, it can. Can it refactor a simple function? No it cannot, it can't even rename a damn Java class. How can I trust it to then generate code based on my codebase? So, what is your use case then? Or can anybody point me to some blogs/articles/videos showing some real use cases for Cursor? Real as in, something that it provenly can do?

        • risyachka 7 days ago

          I think you know the correct answer:)

          • soco 4 days ago

            I hoped to be wrong but no comment so far even tried to bring a based argument... eh, maybe I'll try again in a year.

        • kristiandupont 7 days ago

          >Can it refactor a simple function?

          Certainly, I do that several times a day.

          • rob 7 days ago

            Listen, I don't want to brag too much, but it even made me a function today.

        • astar1 7 days ago

          >Java

          found the problem

        • jcgrillo 7 days ago

          Correct. It's Web 3.0 2.0. You're supposed to play along to make the stock prices go up and to the right.

      • boredemployee 8 days ago

        not sure if it's cursor's fault, but very often it doesn't give me the real or complete code of my codebase when auto editing/auto completing.

        any tips?

  • usagisushi 7 days ago

    Or, for a lazier approach:

        $ head -10000 *
        ==> package.json <==
        {
          "name": ...
          ...
        ==> tsconfig.json <==
        {
           "extends": ...
          ...
    
        $ head -10000 * | llm -s "generate a patch to switch this project to esm"
  • DrPhish 8 days ago

    That's very nice and compact. I do the same with a short bash script, but wrap each file in triple-backticks and attempt to put the correct language label on each, e.g.:

    Filename: demo.py

    ```python

       ...python code here...
    
    ```
    • genewitch 7 days ago

      Seconded, because just having something autowrapped like that and putting it on the clipboard would save me time: release the Snyder cut, er, bash script!
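
Not the parent's actual script, but a minimal sketch of what it might look like (the extension-to-language map is a guess; extend to taste):

```shell
#!/usr/bin/env sh
# Walk the tree and print each file wrapped in a fenced code block,
# with a best-guess language label taken from the file extension.

fence=$(printf '\140\140\140')   # three backticks (octal 140), built
                                 # indirectly so this script can itself
                                 # live inside a fenced block

lang_for() {
  case "$1" in
    *.py) echo python ;;
    *.go) echo go ;;
    *.js) echo javascript ;;
    *.sh) echo bash ;;
    *)    echo text ;;
  esac
}

find . -type f -not -path './.git/*' | while read -r f; do
  printf 'Filename: %s\n\n%s%s\n' "$f" "$fence" "$(lang_for "$f")"
  cat "$f"
  printf '%s\n\n' "$fence"
done
```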

beklein 8 days ago

Tip: If you ever need to do this on a public GitHub repository you can use "gitingest".

This will open a website that creates a copy of all the file contents of the repo (code, docs, ...). It's a great tool when working with new/obscure code in LLMs, in my opinion.

The UX is just so easy and great: change the URL from <https://github.com/user_name/repo_name> to <https://gitingest.com/user_name/repo_name>

//edit: fixed URLs

  • mkagenius 8 days ago

    I copied the UX to my https://gitpodcast.com (creates a podcast from a GitHub repo; same trick, replace `hub` with `podcast`)

    • pratyahava 7 days ago

      I am very impressed by gitpodcast. I just listened to one podcast and, first of all, I am pleased with the idea; the voices are also pleasant to listen to. Thanks for sharing!

reddalo 8 days ago

Unfortunate naming, given that CodeWeavers is already a company making a Windows "emulator" for Linux and macOS. [1]

[1] https://www.codeweavers.com/

  • Arch-TK 7 days ago

    CodeWeavers actually make Wine itself, not just some "emulator". They then distribute this along with some QOL tools as a commercial product called CrossOver.

  • lgas 7 days ago

    All names are taken. There's no need to point this out every time.

    • anamexis 7 days ago

      Not all names are registered trademarks for software.

    • Rexxar 7 days ago

      Some are more confusing than others.

    • gloosx 7 days ago

      Huewoblfan is not taken! Noiewoidc is free. XIONqlic – totally available, can mean a range of things! Ciohupoij – a bit of asian flavour but still a valid free name.

pmx 8 days ago

How does this compare to / differ from https://github.com/yamadashy/repomix ?

  • tesserato 8 days ago

    Some advantages of CodeWeaver: it is compiled, so it might be faster, and you can grab a compatible executable from the releases section instead of using `go install`, so no dependencies. You can manually specify what to exclude via a comma-separated list of regular expressions, so it might be more flexible. I never used Repomix, so those assumptions might not hold. On the other hand, Repomix seems to be far more complete, a full-fledged solution for converting source code to monolithic representations. I wrote CodeWeaver because I only needed something that worked and that, occasionally, I could trust to keep sensitive data away from sketchy LLMs (and I wasn't aware of other solutions).

  • akoculu 8 days ago
    • imdsm 8 days ago

      I simply have a bash script called printall which takes in some args, and outputs markdown codeblocks with filenames and a tree. One of hundreds of scripts built up over the years.

      • akoculu 8 days ago

        if you add fzf to speed up file / folder selection, you'll have your own llmcat :)

  • ycombinatornews 8 days ago

    My question exactly. Repomix seems to be a tested utility for something like that.

  • ActVen 8 days ago

    Same question here. I have found repomix to get the job done really well.

retropragma 8 days ago

I really want a tool like this that can extract a function and its dependency graph (to a certain depth maybe, and/or exclude node_modules).

I wrote this library [1] and hope to add the fine-grained "reference resolution" utility to it at some point, which could make implementing such a tool a lot simpler.

[1]: https://github.com/aleclarson/ts-module-graph

therealmarv 8 days ago

I use aider /copy-context command for that

https://aider.chat/docs/usage/copypaste.html

and with /paste you can apply the changes.

  • anotherpaulg 7 days ago

    Thanks for letting folks know about aider's /copy-context command.

    To add some more detail, aider has a mode/UX that is optimized for "copy and paste" coding with LLM web chats. The "big brain" LLM in the web chat does the hard work, and a cheap/local LLM works with aider to apply edits to your local files.

    There's a little demo video in the link above that should give you the gist.

juunge 7 days ago

I’ve made a CLI tool that does something similar, called Copcon:

https://github.com/kasperjunge/copcon

Point it at a code project directory to get a file tree and content, optionally with a git diff, copied to the clipboard - ready for copy pasting into ChatGPT.

It is very true that this only works for small projects, as you will bloat the LLM’s context with large codebases.

My solution to this is two files you can use to steer the tool’s behavior:

- .copconignore: For ignoring specific files and directories.

- .copcontarget: For targeting specific files and directories (applied before .copconignore).

These two files provide great control over what to include and exclude in the copied context.

tempoponet 8 days ago

A new tool like this comes out every week, and that's great! But I think it's fair to ask how this compares to popular ones like RepoMix? Anyone keeping an eye on this space will want to know why this is different from what's already out there and being used.

  • tesserato 8 days ago

    I actually wrote this a couple of months ago, so perhaps nothing similar existed back then (I remember doing some research back then, mostly focused on VS Code plugins). Nevertheless, the idea was also to test how Golang could facilitate the distribution of such micro tools throughout the internal team, so I probably would have still made it. It is nice to know that similar tools exist. I'll take a look at them.

maurycy 8 days ago

  find . -type f -name '*.py' -exec sh -c 'echo "# $1"; cat "$1"; echo ""' _ {} \; | pbcopy
rapind 8 days ago

Somewhat related. I built an Elm app all in one file as an experiment and to see if I like it. It's a little over 7k lines and I'm occasionally adding more to it.

It's actually pretty straightforward if you're in a language with lexical scoping, and it simplifies some things: no include/cyclic-import issues, no modules, no hunting through files, etc.

I feel like this set up could integrate really well w/ AI models.

I've found that the only real limitation, at least in my experiment, was a lack of decent editor support. I use vim so this wasn't really much of an issue for me with many great ways to navigate a file, and a combination of vertical and horizontal splits on a large screen, but when I opened it up in other "modern" editors the ergonomics fell apart quite a bit.

I think the biggest downside was re-using variable names between large scopes occasionally made it hard to find the reference I wanted (E.g. i, x, key, val), but again, better editor support allowing you to limit your search to within the current scope would help. Also easily mitigated with more verbose throwaway variable naming.

  • squeegee_scream 8 days ago

    I write Elm and use Emacs primarily, and sometimes Neovim. Are you using LSP in Vim? You're doing it right by staying in one file until it hurts, that's the recommendation for Elm, but I can't recall if I've had issues using go-to-def or other LSP functions like you're describing

    • rapind 7 days ago

      No LSP. It honestly doesn’t speed me up any. I already have the standard library memorized, plus some of the common community lib methods (List.Extra) and my typing speed is faster than I can think anyways.

      I’m thinking the same approach would also work well in F#, Haskell, OCaml.

  • Aurornis 8 days ago

    > no hunting through files, etc.

    It’s easy to switch to files by name with a few keystrokes. Files are named to group the things I’m looking for.

    I would much rather do that than try to search through a 7,000 line file for what I need.

    > I feel like this set up could integrate really well w/ AI models.

    Massive files or too many files break AI models. Grouping functionality into smaller files and including only relevant files is key. The file and folder names can be hints about where to find the right files to include.

    • rapind 7 days ago

      > I would much rather do that than try to search through a 7,000 line file for what I need.

      I mean I'm not arguing for it as a best practice. I did it as an experiment (as I stated), and discovered it's actually really easy, and snappy for me to navigate in Vim. Mileage may vary with other editors. Have you tried it?

      > Massive files or too many files break AI models

      It's growing faster than I code! With the latest Gemini, at least, the context is much larger at 1-2 million tokens. I'm sure we'll hit a ceiling though, but I also think we may find some context caching / RAG-type optimizations eventually.

  • cruffle_duffle 8 days ago

    The big problem with that is you’ll eventually blow your context window feeding the model with stuff that it mostly doesn’t need in order to complete its task.

    • rapind 7 days ago

      I can’t think of anything I’d want to add to the context for Elm at least, assuming the standard libraries are already in the model (or can be added via RAG). Gemini is at 2M tokens now and I expect this will grow at least until it’s no longer meaningful.

stan_kirdey 8 days ago

Nice! Built something similar in Rust that supports local and remote repos: https://crates.io/crates/r2md

  • tesserato 8 days ago

    I thought of using Rust, but ultimately chose Go. I'll take a look and see how something similar came out in Rust!

    • jdironman 8 days ago

      Something I didn't dig to find, but is it possible for these applications to also respect .gitignores? Might be a handy flag!

      • fullstackchris 7 days ago

        In any node project that basically _must_ be done or your source code will be eclipsed by whatever is in node_modules
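
For what it's worth, in a git repo you get .gitignore respected for free by iterating over `git ls-files` instead of `find` (a sketch; swap `pbcopy` for `xclip`/`xsel` outside macOS):

```shell
# git ls-files lists only tracked files, so .gitignore'd paths like
# node_modules never appear in the output.
git ls-files | while read -r f; do
  printf '==> %s <==\n' "$f"
  cat "$f"
  echo
done
```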

skeledrew 8 days ago

This is like a rediscovery of an org-mode capability that has existed for decades, and doesn't do as much.

  • hatmatrix 7 days ago

    Is it? I use org-babel regularly but wasn't aware of it - what's the function called? As great as org-mode / org-babel is, the user base is too small for it not to get overlooked.

    • skeledrew 7 days ago

      Well in general I've put entire projects into org docs, and ran the code blocks, essentially using it like a Jupyter notebook (although honestly it wasn't always as smooth as I'd like). And I haven't done this myself, but there's a neat literate programming talk from the last EmacsConf[0] in which the presenter showed some custom capabilities which improved the experience even more for him.

      [0] https://emacsconf.org/2024/talks/literate/

Pawamoy 7 days ago

Following the /llms.txt standard proposal, I created a MkDocs plugin that generates an /llms.txt file at the root of your site. So, same thing, but it generates the Markdown document from your docs (possibly containing an API reference) instead of your code.

hatmatrix 7 days ago

Such functionality would be useful for developing some scripts and then converting them to a Quarto document [1].

[1] https://quarto.org/

  • tesserato 5 days ago

    I've never used Quarto, but I might give it a go someday. I currently have a convoluted workflow for generating math-heavy documents that involves generating equations using SymPy in a notebook, accumulating them in a string, and ultimately dumping the string into a Markdown. I would love to simplify this sooner rather than later. I'm also keeping an eye on https://typst.app/ and hoping for a sane alternative to LaTeX to emerge.

  • mbonnet 7 days ago

    Second hooray for Quarto. Great tool.

causal 8 days ago

This could be a lot better. The example linked in the Github README is a markdown file full of binary garbage because it also tried to convert gzip files to markdown.

Pretty big flag that this isn't ready for primetime.

  • tesserato 8 days ago

    Thank you for pointing that out. Just fixed it.

ainiriand 8 days ago

My codebase sitting at 4M lines: hold my spaghetti.

  • nahco314 8 days ago

    This is self-promotional, but https://github.com/nahco314/feed-llm has a TUI for choosing what to give to the LLM. There are many similar tools out there, but I think this approach is relatively effective for larger codebases.

  • ycombinatornews 8 days ago

    You can ask Cursor to use information from a specific folder (aka your 4M lines) and it will summarize it and use that.

    Not a replacement for full 4M lines but it might work for some tasks/prompts

nunodonato 7 days ago

This kind of context is really useful for LLMs, but in any significant project, including all the code in this manner will easily exceed context limitations. I've been wanting to do something like this for my PHP projects, but instead of dumping the entire files, it would just create a map of method signatures, variables, etc. That should give good enough information about what each file is used for and can do, while being small enough to be ingested by AI.
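
A rough sketch of that signature-map idea for PHP (the regex is approximate, not a real parser; tree-sitter or PHP's own tokenizer would do this properly):

```shell
# Emit, per PHP file, only the lines that declare classes/interfaces/
# traits or functions -- a crude "signature map" instead of full bodies.
find . -type f -name '*.php' -exec sh -c '
  for f; do
    printf "==> %s <==\n" "$f"
    grep -En "(class|interface|trait|function)[[:space:]]" "$f"
    echo
  done
' _ {} +
```

`grep -n` keeps the line numbers, which gives the LLM a cheap way to refer back to specific locations.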

  • panarky 7 days ago

    > including all code in this manner will easily exceed context limitations

    The context window for Gemini 2.0 Flash can handle roughly 50000 lines of code, and 2.0 Pro can handle twice that.

    • nunodonato 7 days ago

      That goes faster than you think. Also, the model's attention to (and memory of) facts in the context degrades as the context grows, which might hurt when you just want to dump everything at once.

lornajane 8 days ago

For extra points, compile your docs into one file and feed it that as well.

(unless the reason you're giving AI the code is that you don't have any docs for either humans or machines)

mtrovo 8 days ago

Anybody with experience of using something like this with a big codebase and Gemini 2M context window? I tried a while ago (before 2.0 Flash) to solve some refactoring tasks and even after spending some time on prompt wrangling I didn't manage to get good results out of it.

I don't know what kind of agent architecture Cursor uses internally but it seems much better designed at finding where changes need to be made.

  • tesserato 8 days ago

    In my experience with feeding large codebases to Gemini, simple tasks work ok (enumerate where such and such happens, find where a certain function is called, list TODOs throughout the code, etc.), but tasks that require a bit more logic are trickier. Nevertheless, I had some success with moderately complex refactoring tasks in Python codebases.

OsrsNeedsf2P 8 days ago

This thread has convinced me that Aider/Cursor need to do more marketing.

  • larusso 8 days ago

    Maybe. But maybe some like the more disconnected way of coding with AI.

    • lgas 7 days ago

      Why? It's just moving more of the grunt work of shuffling things around to the human?

      • larusso 7 days ago

        For me it's about still feeling in control, and the fact that I don't want to inject it into every workflow. I'm open to AI and use it daily, but my terms may be different than others'. I want to control what I share and how. People have secrets and other things in a project. I sometimes rename things because the AI should only deal with the big picture. Paint me paranoid, but that's how it is for me.

  • ako 7 days ago

    Same for windsurf, I’ve been using it to generate documentation for code bases. It will generate markdown with mermaid diagrams to explain whatever you want to know; from the component architecture of an entire application, to the sequence diagram for a specific button, and data and ER diagrams.

    But the approach to fit your entire codebase into one document so you can include it in your prompt context seems a dead end, instead the llm can use an agent to do targeted search through your code.

  • rane 7 days ago

    Cursor is all the rage. Nobody talks about Aider, sadly.

    • replwoacause 7 days ago

      I partially disagree. Maybe it depends what circles you run in but at least here on HN I’ve seen Aider mentioned more times than I can count. Is cursor more popular? Yeah…but the people here are talking about Aider. That’s how I learned about it.

  • esafak 8 days ago

    The future is not evenly distributed.

Conasg 8 days ago

I made a similar tool in Golang, https://github.com/foresturquhart/grimoire. It tries to be a bit cleverer, by prioritising files that have had many commits, respecting .gitignore files, and excluding useless content like binaries or vector images.

  • tesserato 8 days ago

    I can think of no use case where binaries are desired in such representation, so I might bake binary exclusion into CodeWeaver as well. SVGs, on the other hand, might be wanted sometimes, in web design contexts. I'll take a look at your implementation and see what I can learn.

    • franze 7 days ago

      thisismy has a -g option for greedy which then also takes binaries

  • codecraze 8 days ago

    Nice! Written in go. I like that :)

Keyframe 8 days ago

Wouldn't it be wonderful to have a tool where you interact with AI interactively through the codebase via IDE / vim / emacs tree? Say, you open your codebase and start with prompts and AI+tool navigates to a function or a place where it needs to and modifies stuff while chatting to you about it? Or you jump to somewhere, highlight where you are to scope down the focus of it (while it still retains all of the code in history/memory). Sort of like pair programming. It sounds so obvious that I'm almost sure I've missed that already existing somewhere. I think I tried google's thing (forgot the name) but it sucked / wasn't that.

  • squeegee_scream 8 days ago

    I think you’re describing Aider.chat. There are 2 Emacs packages for it, one official and a very recent fork. Aider is a cli so it works great with vim as well.

    In Emacs I’ve had good experience with gptel as well but I prefer aider for the coding workflow

    • skeledrew 7 days ago

      Yep, I've particularly been enjoying the recent "watchfiles" feature where a comment can be added to the source file, and ending it with "ai?" or "ai!" triggers use of said comment as a prompt to ask about or change that section upon save.

    • Keyframe 8 days ago

      I'll check it out, thanks!

  • zknowledge 8 days ago

    Apologies if I'm missing something, but aren't you describing Cursor/Copilot/Windsurf?

    • Keyframe 8 days ago

      you're not. looks like that's kind of it, but would the thing have the context of the whole project when I'm in a file/class/function? With copilot, in my case, it was so far mostly like a fancy autocomplete that has immediate vicinity in its memory where it would be vastly more useful if it had the context of the whole project / all files.

      • cjonas 8 days ago

        Cursor indexes the entire codebase with embeddings. It works well in small single-app projects

        • kohlerm 8 days ago

          it is also the "right thing to do" IMHO.

    • ajoseps 8 days ago

      the vscode extension cline also does this

  • meesles 8 days ago

    This doesn't sound good to me, you end up with a large codebase that no human has actually laid eyes on. When you get a bug weird enough that you can't reason the LLM through it, then what? What if a bug is because of interactions between two systems, and you don't own one of them? What if there's an issue due to convoluted business process failures, that just end in a bug report like "my data is missing!"? I honestly think in the latter case, the LLM will just fix a 'bug' and miss the forest for the trees.

    I prefer the idea of the other comment reply where you use AI as a tool to explore a codebase and assist you, not something you instruct to do the work. It can accelerate you building that experience and intuition at a level we've never been able to do before.

    • ako 7 days ago

      An llm itself is a large codebase that no human has laid eyes on, instead you validate it through testing.

      Regarding testing, I’ve had an interaction with windsurf where I told it there was a bug in the application it generated. It replied “I’ve added some log statements, can you run it and tell me what you see, then I’ll know what to fix”… The llm was instructing me…

    • Keyframe 8 days ago

      Nothing like that at all. For example I have a few codebases kind of large (for certain quantity of large) where I know the code since either I wrote it or participated heavily in. Talking snippets at a time loses a ton of context which would yield better offered solutions if you had, well.. the whole context.

  • hk__2 8 days ago

    I tried various solutions but I still haven’t found a chat tool that allows me to navigate a large monorepo. I’d like to be able to say "open the file where there is the function to do <xyz>", but current tools don’t understand that.

    • lgas 7 days ago

      This works fine in Cursor. As far as I know, you can't say "open the file..." but you can say "where is the function to do <xyz>" and it'll include a link to the file in its response, and then you can click to open it.

_puk 7 days ago

Whilst the pendulum seems well on its way to swinging from microservices back to monoliths, I'm thinking we'll end up in a place that limits the volume and complexity of the code in a single service, so that it's just large enough to encompass a single responsibility.

Then we can easily drop in and out of using LLMs in the code space.

Service Oriented Architecture lends itself well to the limited context of these models.

Maybe we can revive literate programming and simply build everything from a single markdown document..

  • azthecx 7 days ago

    Microservices lend themselves to architectural decisions that LLMs are just not trained to understand.

    It's one thing for it to be trained on billions of LOC and be useful; it's another for it to have a quality dataset large enough to give it context and understanding of something like Kafka partition ordering and its possible interactions with something like a database and at-least-once delivery. It will give you an explanation of those things in isolation, but not in combination.

ActVen 8 days ago

Any unique benefits over using this vs something like Repomix? https://github.com/yamadashy/repomix

  • tesserato 8 days ago

    CodeWeaver is compiled, so it might be faster. Also, you can grab a compatible executable from the releases section and you're good to go, instead of using `go install`, so no dependencies. Personally, I considered following the `.gitignore` route but found that manually specifying what to exclude via a comma-separated list of regular expressions provided me with the flexibility I needed (the initial setup might be a bit tedious, but then again, you can use an LLM for that).

cjonas 8 days ago

I could see this being quite useful in the background for apps like Cursor when they need to perform a full codebase search. I imagine it could be more effective in breaking up larger codebases where embeddings start to fall short. If you could fit the entire document into context, you'd be able to "point the model" in the right direction.

The challenge is maintaining it... but maybe you'd ask the model to do that incrementally on every commit, or just throw it away and regenerate it from scratch occasionally.

resters 8 days ago

See the script I created that does something similar with a few improvements for large projects:

https://paste.mozilla.org/9rD95yAy

I would like to be able to create sets of files that I can easily send to the clipboard in this kind of format. The files could correspond to the ones relevant to a particular feature, etc. They don't always fall under the same subtree of the source code, and the entire source code is too big for the context.
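
One way to approximate that "named sets of files" workflow on top of the same formatting: keep a plain-text manifest per feature, one path per line, anywhere in the tree. A sketch only; the `ctxcopy` name and the `featuresets/` layout are made up, and `pbcopy` is macOS-specific:

```shell
# ctxcopy <manifest>: read file paths (one per line) from a manifest
# and copy them, with "==> file <==" headers, to the clipboard.
# Substitute xclip/xsel for pbcopy outside macOS.
ctxcopy() {
  while read -r f; do
    printf '==> %s <==\n' "$f"
    cat "$f"
    echo
  done < "$1" | pbcopy
}

# e.g. one manifest per feature:  ctxcopy featuresets/auth.txt
```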

fragmede 8 days ago

Which like, kinda neat that it exists, but who's using tooling that bad that they're manually copying and pasting that much code into, what, a web browser text entry box?

Use better tools people!

  • nahco314 8 days ago

    I have always used o1 pro and Deep Research, but these are only available through the web UI. There is no doubt that Cursor and others have a better UI, but the demand for this type of tool exists because OpenAI does not release an API

  • rorytbyrne 8 days ago

    This seems useful for building new tools. It's not strictly an end-user tool.

    • dazzawazza 8 days ago

      Exactly, the LLM-RAG boffins are all over stuff like this.

davidbarker 7 days ago

If it's useful to anyone, I made a VS Code/Cursor extension that combines all open files into one big text document.

I use it with ChatGPT's o1 pro (which can handle around 100,000 tokens).

1. Open all of the files I think are relevant

2. Use the extension to combine them

3. Copy and paste into ChatGPT

https://marketplace.visualstudio.com/items?itemName=DVYIO.co...

rorytbyrne 8 days ago

Does anyone know of tools that go the other direction? i.e. taking a technical writeup (scientific paper, architecture docs, or similar) and emitting a candidate codebase.

  • elashri 8 days ago

    Maybe I don't understand but isn't this what you use LLMs for?

  • lgas 7 days ago

    Yes, I often use one LLM to generate a PRD and then include it in the codebase, then ask the Cursor agent to implement some part of the system using the PRD as a reference. It can't emit an entire codebase in one shot (unless it's a trivial project like "build me a flappy bird clone"), but you can use it as scaffolding to manage implementing a whole project in chunks.

  • codazoda 8 days ago

    I don’t know of a tool but I’ve had some success doing this with a one shot short prompt. I say something like, “Here’s a readme. Develop this in Go.” Followed by the readme.

    I’ve been getting complete working code with this strategy but I’m creating projects that are relatively simple.

    I also notice that I have to give a little deeper context about “how” it should work, which I normally wouldn’t do.

shipp02 7 days ago

Given the limited context length of most LLMs, is there value in turning an entire codebase into a doc to feed into an LLM?

I think cherry-picking relevant sections would be necessary to make it function effectively. Has anyone tried using tree-sitter to recursively feed it the source for functions used in the section we want to analyze to optimize for this?

__mharrison__ 8 days ago

Interesting. I've been converting Jupyter notebooks into markdown for the same purpose. Am considering making a custom tool.

  • tesserato 8 days ago

    I also have this use case, and would be interested in such a tool. If you intend to write your tool in Golang, consider instead extending CodeWeaver.

narmiouh 8 days ago

If I'm reading this correctly, why include all the code in the markdown? The AI model that would use this is necessarily consuming all the concatenated code plus the explanation of the code; I'm not sure which is better, because the LLM then already has access to the entire code as part of the markdown?

mkagenius 8 days ago

I have one for CVEs, in case there are security folks here: it recursively follows reference links to find details like the code commit diff that fixed the vulnerability, and generates a single JSON.

1. https://github.com/BandarLabs/cveingest

bkyan 7 days ago

Oh, cool -- this is made with Golang! I'll have to see if I can wrap it in a desktop GUI using Wails.

replwoacause 7 days ago

I see lots of folks here using LLMs on their codebases. Does that mean there isn’t much concern about sharing your app’s code with an LLM? Have people just gotten comfortable with this now? Or does it only matter for closed-source or proprietary codebases?

  • ako 7 days ago

    You can run an LLM on your local machine, and you can get LLM sandboxes for your company.

emmelaich 8 days ago

Is this related to https://gitingest.com/ at all? Which seems to be a service doing a similar thing.

  • BoorishBears 8 days ago

    There are a ridiculous number of projects doing this.

    I'm always baffled by the response they get, since this is also the most impractical, poorly scaling way to insert an LLM into your development process.

    On one hand, if you realize that, there may be times when you get lucky with the size of a codebase and the nature of your questions, and it works acceptably.

    But on the other, this feels like the kind of thing someone who's heard others rave about the utility of AI will try with too large a codebase, paste the result into ChatGPT, and then watch the LLM underperform because it's flooded with irrelevant context for every basic operation it's asked to do.

    There are very few times when providing the entire codebase in the context window instead of the relevant code to a single operation makes sense.

  • tesserato 8 days ago

    It is not. Others have commented pointing to services similar to this one, though.

Alifatisk 8 days ago

There is also repo2txt.simplebasedomain.com/local.html

squeegee_scream 8 days ago

This is great, but I’m pretty sure it’s trivial using Emacs and Org mode. You could then use pandoc to convert Org to Markdown.

  • lgas 7 days ago

    It's trivial using a number of approaches, e.g. a simple bash or Python script. But I think there's still a fair amount of value in building a common tool for these sorts of things. Everyone who builds their own one-off solution will inevitably encounter more and more of the edge cases (oh, I need to honor .gitignore... oh, I need to be able to override .gitignore and include some ignored things... oh, I need to deal with huge files... etc.), and a common tool can collect the ways of dealing with all of them.

    No one person will need every edge case handled, but whatever edge cases they do need will already be handled. The overall time and frustration saved this way can be huge.
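A minimal sketch of the core loop such a common tool grows around (the hard-coded `SKIP_DIRS` set stands in for real .gitignore parsing, which is exactly where the edge cases live):

```python
import os
from pathlib import Path

# stand-in for proper .gitignore handling
SKIP_DIRS = {".git", "node_modules", "__pycache__"}

def bundle(root: str) -> str:
    """Concatenate every file under root into one Markdown string:
    a heading per file path, followed by its content in a fence."""
    parts = []
    for dirpath, dirnames, filenames in os.walk(root):
        # prune ignored directories in place so os.walk skips them
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            rel = path.relative_to(root)
            parts.append(f"## {rel}\n\n```\n{text}\n```\n")
    return "\n".join(parts)
```

Honoring nested .gitignore files, override whitelists, and size limits all slot into that inner loop, which is the argument for sharing one tool rather than each person re-growing this script.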

novemp 7 days ago

How do you do the opposite of this? Transform your markdown files into a codebase that AI can't leech off of?

strizzo 8 days ago

There’s ClipSource for VSCode that does this

  • tesserato 3 days ago

    From the description, seems to only work with Python codebases.

atum47 8 days ago

Damn, I did that the other day, but manually: I just cat'ed everything in a folder in the order I wanted and fed it to ChatGPT so it could write a README for tiny.js.

mmanfrin 7 days ago

I built a simple tool to do something similar. It's meant for a monorepo and bundles each subfolder into a subfolder-code.txt file that you can upload to AIs.

https://github.com/manfrin/bundle-codebases

I don't see much merit in things like markdown or syntax highlighting, as that's just extra noise for the AI. My script tries to cut down on any extraneous data, since the things I'm working on are near the context limit of consumer AIs.

My script also ignores anything in .gitignore and will take a .codebundlerwhitelist (I hate this name and have meant to change it) to only bundle files matching patterns you specify.
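The whitelist check itself is small. Here is a sketch using stdlib `fnmatch` globs (the function name and exact matching semantics are assumptions for illustration, not mmanfrin's actual implementation):

```python
import fnmatch

def matches_whitelist(path: str, patterns: list) -> bool:
    """True if the path matches any whitelist glob, fnmatch-style."""
    return any(fnmatch.fnmatch(path, pat) for pat in patterns)

# e.g. matches_whitelist("src/app.py", ["*.py", "*.md"]) is True,
# while "image.png" would be excluded from the bundle
```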

  • antirez 7 days ago

    Not just extra noise, but also extra tokens.

sandGorgon 7 days ago

How does this compare to code2prompt or files2prompt? Any benchmarks on which one works better for LLMs?

schaefer 8 days ago

Wait, just one question…

Can I call this C++ code “machine code” now?