Show HN: Transform your codebase into a single Markdown doc for feeding into AI

tesserato.web.app

279 points by tesserato 8 days ago

CodeWeaver is a command-line tool designed to weave your codebase into a single, easy-to-navigate Markdown document. It recursively scans a directory, generating a structured representation of your project's file hierarchy and embedding the content of each file within code blocks. This tool simplifies codebase sharing, documentation, and integration with AI/ML code analysis tools by providing a consolidated and readable Markdown output.

Terretta 7 days ago

CodeWeavers is a software company that focuses on Wine development and sells a proprietary version of Wine called CrossOver for running Windows applications on macOS, ChromeOS and Linux.

https://en.wikipedia.org/wiki/CodeWeavers

Trademark is active. It's an Ⓡ not just a ™, registered not just trademarked. To keep it, they have to demonstrate they defend it.

https://www.trademarkia.com/codeweavers-76546826

While this project drops the final "s", you don't get to launch an OS called "Window". The test is a fuzzy match based on likelihood of confusion.

  • jychang 7 days ago

    Yeah, I was thinking "what do the Wine guys have to do with this?"

    This project is definitely going to get C&D'd.

    • williamcotton 7 days ago

      Do you think they would actually litigate? They seem like different products serving entirely different markets so I am not sure that the trademark infringement claim is very defensible. And how do they prove damages?

  • pstuart 7 days ago

    CodeLoom could work instead

crisbal_ 8 days ago

I use the following for feeding into AI

   find . -type f -print -exec cat {} \; -exec echo \;
Which prints, for each file (including those in subfolders), the filename followed by the file's content.

Then `| pbcopy` to copy to clipboard and paste it into ChatGPT or similar.
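
For reference, a slightly hardened sketch of the same one-liner (the `.git` exclusion assumes a git checkout, and the `grep -I` binary check is a GNU/BSD grep feature, not POSIX):

```shell
# Same idea, hardened a bit: only regular files, skip .git, skip files
# that look binary (grep -I), and print a header before each file.
find . -type f -not -path './.git/*' -exec sh -c '
  for f; do
    grep -Iq . "$f" || continue   # -I makes grep treat binary files as non-matching
    printf "==> %s <==\n" "$f"
    cat "$f"
    echo
  done
' _ {} +
```

Append `| pbcopy` (macOS) as before to land the result on the clipboard.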

  • singpolyma3 8 days ago

    I guess this only works for very small codebase?

    • OsrsNeedsf2P 8 days ago

      Correct, but it's the same as what OP shared.

      You should use Aider/Cursor for proper indexing/intelligent codebase referencing

      • starfezzy 7 days ago

        Cramming thousands of tokens of potentially irrelevant context through unclear indexing paths isn't "proper".

        The best results come from feeding precisely targeted context directly into the prompt, where you know exactly what the model sees and how it processes it. The prompt receives the most accurate use of attention—whereas god knows what the pipeline is for cursor or what extra layers and context restrictions they add on top of base Claude.

        Giving the model a clean project hierarchy accomplishes a lot efficiently in terms of context tokens. The key is ensuring it only sees what's relevant, without diluting its attention.

        Tools like repomix and OP's version, feeding targeted context straight into models like Claude or Google's offerings, outperform Copilot and Cursor in my experience, even though they use the same base models. Use the highest-quality attention (the prompt context) directly, rather than layers of uncertainty and "proper indexing".

      • soco 7 days ago

        I'm still puzzled how come people are convinced by Cursor, while my experience was meh at best. Can it index your stuff? Okay, it can. Can it refactor a simple function? No it cannot, it can't even rename a damn Java class. How can I trust it to then generate code based on my codebase? So, what is your use case then? Or can anybody point me to some blogs/articles/videos showing some real use cases for Cursor? Real as in, something that it provenly can do?

        • risyachka 7 days ago

          I think you know the correct answer:)

          • soco 4 days ago

            I hoped to be wrong but no comment so far even tried to bring a based argument... eh, maybe I'll try again in a year.

        • kristiandupont 7 days ago

          >Can it refactor a simple function?

          Certainly, I do that several times a day.

          • rob 7 days ago

            Listen, I don't want to brag too much, but it even made me a function today.

        • astar1 7 days ago

          >Java

          found the problem

        • jcgrillo 7 days ago

          Correct. It's Web 3.0 2.0. You're supposed to play along to make the stock prices go up and to the right.

      • boredemployee 8 days ago

        not sure if it's cursor's fault, but very often it doesn't give me the real or complete code of my codebase when auto editing/auto completing.

        any tips?

  • usagisushi 7 days ago

    Or, for a lazier approach:

        $ head -10000 *
        ==> package.json <==
        {
          "name": ...
          ...
        ==> tsconfig.json <==
        {
           "extends": ...
          ...
    
        $ head -10000 * | llm -s "generate a patch to switch this project to esm"
  • DrPhish 8 days ago

    That's very nice and compact. I do the same with a short bash script, but wrap each file in triple-backticks and attempt to put the correct language label on each, e.g.:

    Filename: demo.py

    ```python

       ...python code here...
    
    ```
    • genewitch 7 days ago

      Seconded, because just having something autowrapped like that and putting it on the clipboard would save me time: release the Snyder cut, er, bash script!
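
Not the parent's actual script, but a minimal sketch of what it might look like (the extension-to-language map is a guess; extend to taste):

```shell
#!/usr/bin/env sh
# Walk the tree and print each file wrapped in a fenced code block,
# with a best-guess language label taken from the file extension.

fence=$(printf '\140\140\140')   # three backticks (octal 140), built
                                 # indirectly so this script can itself
                                 # live inside a fenced block

lang_for() {
  case "$1" in
    *.py) echo python ;;
    *.go) echo go ;;
    *.js) echo javascript ;;
    *.sh) echo bash ;;
    *)    echo text ;;
  esac
}

find . -type f -not -path './.git/*' | while read -r f; do
  printf 'Filename: %s\n\n%s%s\n' "$f" "$fence" "$(lang_for "$f")"
  cat "$f"
  printf '%s\n\n' "$fence"
done
```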

beklein 8 days ago

Tip: If you ever need to do this on a public GitHub repository you can use "gitingest".

This will open a website that creates a copy of all the file contents of the repo (code, docs, ...). It's a great tool when working with new/obscure code in LLMs, in my opinion.

The UX is just so easy and great: change the URL from <https://github.com/user_name/repo_name> to <https://gitingest.com/user_name/repo_name>

//edit: fixed URLs

  • mkagenius 8 days ago

    I copied the UX to my https://gitpodcast.com (creates a podcast from a GitHub repo; same trick, replace `hub` with `podcast`)

    • pratyahava 7 days ago

      I am very impressed by gitpodcast. I just listened to one podcast and, first of all, I am pleased with the idea; the voices are also pleasant to listen to. Thanks for sharing!

reddalo 8 days ago

Unfortunate naming, given that CodeWeavers is already a company making a Windows "emulator" for Linux and macOS. [1]

[1] https://www.codeweavers.com/

  • Arch-TK 7 days ago

    CodeWeavers actually make Wine itself, not just some "emulator". They then distribute this along with some QOL tools as a commercial product called CrossOver.

  • lgas 7 days ago

    All names are taken. There's no need to point this out every time.

    • anamexis 7 days ago

      Not all names are registered trademarks for software.

    • Rexxar 7 days ago

      Some are more confusing than others.

    • gloosx 7 days ago

      Huewoblfan is not taken! Noiewoidc is free. XIONqlic – totally available, can mean a range of things! Ciohupoij – a bit of asian flavour but still a valid free name.

pmx 8 days ago

How does this compare to / differ from https://github.com/yamadashy/repomix ?

  • tesserato 8 days ago

    Some advantages of CodeWeaver: it is compiled, so it might be faster, and you can grab a compatible executable from the releases section instead of using `go install`, so no dependencies. You can manually specify what to exclude via a comma-separated list of regular expressions, so it might be more flexible. I never used Repomix, so those assumptions might not hold. On the other hand, Repomix seems to be far more complete, a full-fledged solution for converting source code to monolithic representations. I wrote CodeWeaver because I only needed something that worked and that, occasionally, I could trust to keep sensitive data away from sketchy LLMs (and I wasn't aware of other solutions).

  • akoculu 8 days ago
    • imdsm 8 days ago

      I simply have a bash script called printall which takes in some args, and outputs markdown codeblocks with filenames and a tree. One of hundreds of scripts built up over the years.

      • akoculu 8 days ago

        if you add fzf to speed up file / folder selection, you'll have your own llmcat :)

  • ycombinatornews 8 days ago

    My question exactly. Repomix seems to be a tested utility for something like that.

  • ActVen 8 days ago

    Same question here. I have found repomix to get the job done really well.

retropragma 8 days ago

I really want a tool like this that can extract a function and its dependency graph (to a certain depth maybe, and/or exclude node_modules).

I wrote this library [1] and hope to add the fine-grained "reference resolution" utility to it at some point, which could make implementing such a tool a lot simpler.

[1]: https://github.com/aleclarson/ts-module-graph

therealmarv 8 days ago

I use aider /copy-context command for that

https://aider.chat/docs/usage/copypaste.html

and with /paste you can apply the changes.

  • anotherpaulg 7 days ago

    Thanks for letting folks know about aider's /copy-context command.

    To add some more detail, aider has a mode/UX that is optimized for "copy and paste" coding with LLM web chats. The "big brain" LLM in the web chat does the hard work, and a cheap/local LLM works with aider to apply edits to your local files.

    There's a little demo video in the link above that should give you the gist.

juunge 7 days ago

I’ve made a CLI tool that does something similar, called Copcon:

https://github.com/kasperjunge/copcon

Point it at a code project directory to get a file tree and content, optionally with a git diff, copied to the clipboard - ready for copy pasting into ChatGPT.

It is very true that this only works for small projects, as you will bloat the LLM’s context with large codebases.

My solution to this is two files you can use to steer the tool’s behavior:

- .copconignore: For ignoring specific files and directories.

- .copcontarget: For targeting specific files and directories (applied before .copconignore).

These two files provide great control over what to include and exclude in the copied context.

tempoponet 8 days ago

A new tool like this comes out every week, and that's great! But I think it's fair to ask how this compares to popular ones like RepoMix? Anyone keeping an eye on this space will want to know why this is different from what's already out there and being used.

  • tesserato 8 days ago

    I actually wrote this a couple of months ago, so perhaps nothing similar existed back then (I remember doing some research back then, mostly focused on VS Code plugins). Nevertheless, the idea was also to test how Golang could facilitate the distribution of such micro tools throughout the internal team, so I probably would have still made it. It is nice to know that similar tools exist. I'll take a look at them.

maurycy 8 days ago

  find . -type f -name '*.py' -exec sh -c 'echo "# $1"; cat "$1"; echo ""' _ {} \; | pbcopy
rapind 8 days ago

Somewhat related. I built an Elm app all in one file as an experiment and to see if I like it. It's a little over 7k lines and I'm occasionally adding more to it.

It's actually pretty straightforward if you're in a language with lexical scoping, and it simplifies some things: no include/cyclic-import issues, no modules, no hunting through files, etc.

I feel like this set up could integrate really well w/ AI models.

I've found that the only real limitation, at least in my experiment, was a lack of decent editor support. I use vim so this wasn't really much of an issue for me with many great ways to navigate a file, and a combination of vertical and horizontal splits on a large screen, but when I opened it up in other "modern" editors the ergonomics fell apart quite a bit.

I think the biggest downside was re-using variable names between large scopes occasionally made it hard to find the reference I wanted (E.g. i, x, key, val), but again, better editor support allowing you to limit your search to within the current scope would help. Also easily mitigated with more verbose throwaway variable naming.

  • squeegee_scream 8 days ago

    I write Elm and use Emacs primarily, and sometimes Neovim. Are you using LSP in Vim? You're doing it right by staying in one file until it hurts, that's the recommendation for Elm, but I can't recall if I've had issues using go-to-def or other LSP functions like you're describing

    • rapind 7 days ago

      No LSP. It honestly doesn’t speed me up any. I already have the standard library memorized, plus some of the common community lib methods (List.Extra) and my typing speed is faster than I can think anyways.

      I’m thinking the same approach would also work well in F#, Haskell, OCaml.

  • Aurornis 8 days ago

    > no hunting through files, etc.

    It’s easy to switch to files by name with a few keystrokes. Files are named to group the things I’m looking for.

    I would much rather do that than try to search through a 7,000 line file for what I need.

    > I feel like this set up could integrate really well w/ AI models.

    Massive files or too many files break AI models. Grouping functionality into smaller files and including only relevant files is key. The file and folder names can be hints about where to find the right files to include.

    • rapind 7 days ago

      > I would much rather do that than try to search through a 7,000 line file for what I need.

      I mean I'm not arguing for it as a best practice. I did it as an experiment (as I stated), and discovered it's actually really easy, and snappy for me to navigate in Vim. Mileage may vary with other editors. Have you tried it?

      > Massive files or too many files break AI models

      It's growing faster than I code! With the latest Gemini, at least, the context is much larger at 1-2 million tokens. I'm sure we'll hit a ceiling though, but I also think we may find some context caching / RAG-type optimizations eventually.

  • cruffle_duffle 8 days ago

    The big problem with that is you’ll eventually blow your context window feeding the model with stuff that it mostly doesn’t need in order to complete its task.

    • rapind 7 days ago

      I can’t think of anything I’d want to add to the context for Elm at least, assuming the standard libraries are already in the model (or can be added via RAG). Gemini is at 2M tokens now and I expect this will grow at least until it’s no longer meaningful.

stan_kirdey 8 days ago

Nice! Built something similar in Rust that supports local and remote repos: https://crates.io/crates/r2md

  • tesserato 8 days ago

    I thought of using Rust, but ultimately chose Go. I'll take a look and see how something similar came out in Rust!

    • jdironman 8 days ago

      Something I didn't dig to find, but is it possible for these applications to also respect .gitignores? Might be a handy flag!

      • fullstackchris 7 days ago

        In any node project that basically _must_ be done or your source code will be eclipsed by whatever is in node_modules
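
For what it's worth, in a git repo you get .gitignore respected for free by iterating over `git ls-files` instead of `find` (a sketch; swap `pbcopy` for `xclip`/`xsel` outside macOS):

```shell
# git ls-files lists only tracked files, so .gitignore'd paths like
# node_modules never appear in the output.
git ls-files | while read -r f; do
  printf '==> %s <==\n' "$f"
  cat "$f"
  echo
done
```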

skeledrew 8 days ago

This is like a rediscovery of an org-mode capability that has existed for decades, and doesn't do as much.

  • hatmatrix 7 days ago

    Is it? I use org-babel regularly but wasn't aware of it - what's the function called? As great as org-mode / org-babel is, the user base is too small for it not to get overlooked.

    • skeledrew 7 days ago

      Well in general I've put entire projects into org docs, and ran the code blocks, essentially using it like a Jupyter notebook (although honestly it wasn't always as smooth as I'd like). And I haven't done this myself, but there's a neat literate programming talk from the last EmacsConf[0] in which the presenter showed some custom capabilities which improved the experience even more for him.

      [0] https://emacsconf.org/2024/talks/literate/

Pawamoy 7 days ago

Following the /llms.txt standard proposal, I created a MkDocs plugin that generates an /llms.txt file at the root of your site. So, same thing, but it generates the Markdown document from your docs (possibly containing an API reference) instead of your code.

hatmatrix 7 days ago

Such functionality would be useful for developing some scripts and then converting them to a Quarto document [1].

[1] https://quarto.org/

  • tesserato 5 days ago

    I've never used Quarto, but I might give it a go someday. I currently have a convoluted workflow for generating math-heavy documents that involves generating equations using SymPy in a notebook, accumulating them in a string, and ultimately dumping the string into a Markdown. I would love to simplify this sooner rather than later. I'm also keeping an eye on https://typst.app/ and hoping for a sane alternative to LaTeX to emerge.

  • mbonnet 7 days ago

    Second hooray for Quarto. Great tool.

causal 8 days ago

This could be a lot better. The example linked in the Github README is a markdown file full of binary garbage because it also tried to convert gzip files to markdown.

Pretty big flag that this isn't ready for primetime.

  • tesserato 8 days ago

    Thank you for pointing that out. Just fixed it.

ainiriand 8 days ago

My codebase sitting at 4M lines: hold my spaghetti.

  • nahco314 8 days ago

    This is self-promotional, but https://github.com/nahco314/feed-llm has a TUI for choosing what to give to the LLM. There are many similar tools out there, but I think this approach is relatively effective for larger codebases.

  • ycombinatornews 8 days ago

    You can ask Cursor to use information from a specific folder (aka your 4M lines) and it will summarize it and use that.

    Not a replacement for full 4M lines but it might work for some tasks/prompts

nunodonato 7 days ago

This kind of context is really useful for LLMs, but in any significant project, including all the code in this manner will easily exceed context limitations. I've been wanting to do something like this for my PHP projects, but instead of dumping the entire files, it would just create a map of method signatures, variables, etc. That should give good enough information about what each file is used for and can do, while being small enough to be ingested by AI.
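
A rough sketch of that signature-map idea for PHP (the regex is approximate, not a real parser; tree-sitter or PHP's own tokenizer would do this properly):

```shell
# Emit, per PHP file, only the lines that declare classes/interfaces/
# traits or functions -- a crude "signature map" instead of full bodies.
find . -type f -name '*.php' -exec sh -c '
  for f; do
    printf "==> %s <==\n" "$f"
    grep -En "(class|interface|trait|function)[[:space:]]" "$f"
    echo
  done
' _ {} +
```

`grep -n` keeps the line numbers, which gives the LLM a cheap way to refer back to specific locations.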

  • panarky 7 days ago

    > including all code in this manner will easily exceed context limitations

    The context window for Gemini 2.0 Flash can handle roughly 50000 lines of code, and 2.0 Pro can handle twice that.

    • nunodonato 7 days ago

      That goes faster than you think. Also, the model's attention to (and memory of) facts in the context degrades as the context grows, which might hurt when you just want to dump everything at once.

lornajane 8 days ago

For extra points, compile your docs into one file and feed it that as well.

(unless the reason you're giving AI the code is that you don't have any docs for either humans or machines)

mtrovo 8 days ago

Anybody with experience of using something like this with a big codebase and Gemini 2M context window? I tried a while ago (before 2.0 Flash) to solve some refactoring tasks and even after spending some time on prompt wrangling I didn't manage to get good results out of it.

I don't know what kind of agent architecture Cursor uses internally but it seems much better designed at finding where changes need to be made.

  • tesserato 8 days ago

    In my experience with feeding large codebases to Gemini, simple tasks work ok (enumerate where such and such happens, find where a certain function is called, list TODOs throughout the code, etc.), but tasks that require a bit more logic are trickier. Nevertheless, I had some success with moderately complex refactoring tasks in Python codebases.

OsrsNeedsf2P 8 days ago

This thread has convinced me that Aider/Cursor need to do more marketing.

  • larusso 8 days ago

    Maybe. But maybe some like the more disconnected way of coding with AI.

    • lgas 7 days ago

      Why? It's just moving more of the grunt work of shuffling things around to the human?

      • larusso 7 days ago

        For me it's about still feeling in control, and the fact that I don't want to inject it into every workflow. I'm open to AI and use it daily, but my terms may be different than others'. I want to control what I share and how. People have secrets and other things in a project. I sometimes rename things because the AI should only deal with the big picture. Paint me paranoid, but that's how it is for me.

  • ako 7 days ago

    Same for windsurf, I’ve been using it to generate documentation for code bases. It will generate markdown with mermaid diagrams to explain whatever you want to know; from the component architecture of an entire application, to the sequence diagram for a specific button, and data and ER diagrams.

    But the approach to fit your entire codebase into one document so you can include it in your prompt context seems a dead end, instead the llm can use an agent to do targeted search through your code.

  • rane 7 days ago

    Cursor is all the rage. Nobody talks about Aider, sadly.

    • replwoacause 7 days ago

      I partially disagree. Maybe it depends what circles you run in but at least here on HN I’ve seen Aider mentioned more times than I can count. Is cursor more popular? Yeah…but the people here are talking about Aider. That’s how I learned about it.

  • esafak 8 days ago

    The future is not evenly distributed.

Conasg 8 days ago

I made a similar tool in Golang, https://github.com/foresturquhart/grimoire. It tries to be a bit cleverer, by prioritising files that have had many commits, respecting .gitignore files, and excluding useless content like binaries or vector images.

  • tesserato 8 days ago

    I can think of no use case where binaries are desired in such representation, so I might bake binary exclusion into CodeWeaver as well. SVGs, on the other hand, might be wanted sometimes, in web design contexts. I'll take a look at your implementation and see what I can learn.

    • franze 7 days ago

      thisismy has a -g option for greedy which then also takes binaries

  • codecraze 8 days ago

    Nice! Written in go. I like that :)

Keyframe 8 days ago

Wouldn't it be wonderful to have a tool where you interact with AI interactively through the codebase via IDE / vim / emacs tree? Say, you open your codebase and start with prompts and AI+tool navigates to a function or a place where it needs to and modifies stuff while chatting to you about it? Or you jump to somewhere, highlight where you are to scope down the focus of it (while it still retains all of the code in history/memory). Sort of like pair programming. It sounds so obvious that I'm almost sure I've missed that already existing somewhere. I think I tried google's thing (forgot the name) but it sucked / wasn't that.

  • squeegee_scream 8 days ago

    I think you’re describing Aider.chat. There are 2 Emacs packages for it, one official and a very recent fork. Aider is a cli so it works great with vim as well.

    In Emacs I’ve had good experience with gptel as well but I prefer aider for the coding workflow

    • skeledrew 7 days ago

      Yep, I've particularly been enjoying the recent "watchfiles" feature where a comment can be added to the source file, and ending it with "ai?" or "ai!" triggers use of said comment as a prompt to ask about or change that section upon save.

    • Keyframe 8 days ago

      I'll check it out, thanks!

  • zknowledge 8 days ago

    Apologies if I'm missing something, but aren't you describing Cursor/Copilot/Windsurf?

    • Keyframe 8 days ago

      you're not. looks like that's kind of it, but would the thing have the context of the whole project when I'm in a file/class/function? With copilot, in my case, it was so far mostly like a fancy autocomplete that has immediate vicinity in its memory where it would be vastly more useful if it had the context of the whole project / all files.

      • cjonas 8 days ago

        Cursor indexes the entire codebase with embeddings. It works well in small single-app projects

        • kohlerm 8 days ago

          it is also the "right thing to do" IMHO.

    • ajoseps 8 days ago

      the vscode extension cline also does this

  • meesles 8 days ago

    This doesn't sound good to me, you end up with a large codebase that no human has actually laid eyes on. When you get a bug weird enough that you can't reason the LLM through it, then what? What if a bug is because of interactions between two systems, and you don't own one of them? What if there's an issue due to convoluted business process failures, that just end in a bug report like "my data is missing!"? I honestly think in the latter case, the LLM will just fix a 'bug' and miss the forest for the trees.

    I prefer the idea of the other comment reply where you use AI as a tool to explore a codebase and assist you, not something you instruct to do the work. It can accelerate you building that experience and intuition at a level we've never been able to do before.

    • ako 7 days ago

      An llm itself is a large codebase that no human has laid eyes on, instead you validate it through testing.

      Regarding testing, I’ve had an interaction with windsurf where I told it there was a bug in the application it generated. It replied “I’ve added some log statements, can you run it and tell me what you see, then I’ll know what to fix”… The llm was instructing me…

    • Keyframe 8 days ago

      Nothing like that at all. For example I have a few codebases kind of large (for certain quantity of large) where I know the code since either I wrote it or participated heavily in. Talking snippets at a time loses a ton of context which would yield better offered solutions if you had, well.. the whole context.

  • hk__2 8 days ago

    I tried various solutions but I still haven’t found a chat tool that allows me to navigate a large monorepo. I’d like to be able to say "open the file where there is the function to do <xyz>", but current tools don’t understand that.

    • lgas 7 days ago

      This works fine in Cursor. As far as I know, you can't say "open the file..." but you can say "where is the function to do <xyz>" and it'll include a link to the file in its response, and then you can click to open it.

_puk 7 days ago

Whilst the pendulum seems well on its way to swinging from microservices back to monoliths, I'm thinking we'll end up in a place that limits the volume and complexity of the code in a single service, so that it's just large enough to encompass a single responsibility.

Then we can easily drop in and out of using LLMs in the code space.

Service Oriented Architecture lends itself well to the limited context of these models.

Maybe we can revive literate programming and simply build everything from a single markdown document..

  • azthecx 7 days ago

    Microservices lend themselves to architectural decisions that LLMs are just not trained to understand.

    It's one thing for it to be trained on billions of LOC and be useful; it's another for it to have a quality dataset large enough to give it context and understanding of something like Kafka partition ordering and its possible interactions with something like a database and at-least-once delivery. It will give you an explanation of those things in isolation, but not in combination.

ActVen 8 days ago

Any unique benefits over using this vs something like Repomix? https://github.com/yamadashy/repomix

  • tesserato 8 days ago

    CodeWeaver is compiled, so it might be faster. Also, you can grab a compatible executable from the releases section and you're good to go, instead of using `go install`, so no dependencies. Personally, I considered following the `.gitignore` route but found that manually specifying what to exclude via a comma-separated list of regular expressions provided me with the flexibility I needed (the initial setup might be a bit tedious, but then again, you can use an LLM for that).

cjonas 8 days ago

I could see this being quite useful in the background for apps like Cursor when they need to perform a full codebase search. I imagine it could be more effective in breaking up larger codebases where embeddings start to fall short. If you could fit the entire document into context, you'd be able to "point the model" in the right direction.

The challenge is maintaining it... but maybe you'd ask the model to do that incrementally on every commit, or just throw it away and regenerate it from scratch occasionally.

resters 8 days ago

See the script I created that does something similar with a few improvements for large projects:

https://paste.mozilla.org/9rD95yAy

I would like to be able to create sets of files that I can easily send to the clipboard in this kind of format. The files could correspond to the ones relevant to a particular feature, etc. They don't always fall under the same subtree of the source code, and the entire source code is too big for the context.
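
One way to approximate that "named sets of files" workflow on top of the same formatting: keep a plain-text manifest per feature, one path per line, anywhere in the tree. A sketch only; the `ctxcopy` name and the `featuresets/` layout are made up, and `pbcopy` is macOS-specific:

```shell
# ctxcopy <manifest>: read file paths (one per line) from a manifest
# and copy them, with "==> file <==" headers, to the clipboard.
# Substitute xclip/xsel for pbcopy outside macOS.
ctxcopy() {
  while read -r f; do
    printf '==> %s <==\n' "$f"
    cat "$f"
    echo
  done < "$1" | pbcopy
}

# e.g. one manifest per feature:  ctxcopy featuresets/auth.txt
```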

fragmede 8 days ago

Which like, kinda neat that it exists, but who's using tooling that bad that they're manually copying and pasting that much code into, what, a web browser text entry box?

Use better tools people!

  • nahco314 8 days ago

    I have always used o1 pro and Deep Research, but these are only available through the web UI. There is no doubt that Cursor and others have a better UI, but the demand for this type of tool exists because OpenAI does not release an API

  • rorytbyrne 8 days ago

    This seems useful for building new tools. It's not strictly an end-user tool.

    • dazzawazza 8 days ago

      Exactly, the LLM-RAG boffins are all over stuff like this.

davidbarker 7 days ago

If it's useful to anyone, I made a VS Code/Cursor extension that combines all open files into one big text document.

I use it with ChatGPT's o1 pro (which can handle around 100,000 tokens).

1. Open all of the files I think are relevant

2. Use the extension to combine them

3. Copy and paste into ChatGPT

https://marketplace.visualstudio.com/items?itemName=DVYIO.co...

rorytbyrne 8 days ago

Does anyone know of tools that go the other direction? i.e. taking a technical writeup (scientific paper, architecture docs, or similar) and emitting a candidate codebase.

  • elashri 8 days ago

    Maybe I don't understand but isn't this what you use LLMs for?

  • lgas 7 days ago

    Yes, I often use one LLM to generate a PRD and then include it in the codebase, then ask the Cursor agent to implement some part of the system using the PRD as a reference. It can't emit an entire codebase in one shot (unless it's a trivial project like "build me a flappy bird clone"), but you can use it as scaffolding to manage implementing a whole project in chunks.

  • codazoda 8 days ago

    I don’t know of a tool but I’ve had some success doing this with a one shot short prompt. I say something like, “Here’s a readme. Develop this in Go.” Followed by the readme.

    I’ve been getting complete working code with this strategy but I’m creating projects that are relatively simple.

    I also notice that I have to give a little deeper context about “how” it should work, which I normally wouldn’t do.

shipp02 7 days ago

Given the limited context length of most LLMs, is there value in turning an entire codebase into a doc to feed into an LLM?

I think cherry-picking relevant sections would be necessary to make it function effectively. Has anyone tried using tree-sitter to recursively feed it the source for functions used in the section we want to analyze to optimize for this?

__mharrison__ 8 days ago

Interesting. I've been converting Jupyter notebooks into markdown for the same purpose. Am considering making a custom tool.

  • tesserato 8 days ago

    I also have this use case, and would be interested in such a tool. If you intend to write your tool in Golang, consider instead extending CodeWeaver.

narmiouh 8 days ago

If I'm reading this correctly, why include all the code in the markdown? The AI model that would use this is necessarily consuming all the concatenated code plus the explanation of the code; I'm not sure which is better, because the LLM then already has access to the entire code as part of the markdown?

mkagenius 8 days ago

I have one for CVEs, in case there are security folks here: it recursively follows reference links to find details like the code commit diff that fixed the vulnerability, and generates a single JSON.

1. https://github.com/BandarLabs/cveingest

bkyan 7 days ago

Oh, cool -- this is made with Golang! I'll have to see if I can wrap it in a desktop GUI using Wails.

replwoacause 7 days ago

I see lots of folks here using LLMs on their codebases. Does that mean there isn’t much concern about sharing your app’s code with an LLM? Have people just gotten comfortable with this now? Or does it only matter for closed-source or proprietary codebases?

  • ako 7 days ago

    You can run an LLM on your local machine, and you can get LLM sandboxes for your company.

emmelaich 8 days ago

Is this related to https://gitingest.com/ at all? Which seems to be a service doing a similar thing.

  • BoorishBears 8 days ago

    There are a ridiculous number of projects doing this.

    I'm always baffled by the response they get, since this is also the most impractical, poorly scaling way to insert an LLM into your development process.

    On one hand, if you realize that, there may be times when you get lucky with the size of a codebase and the nature of your questions, and it works acceptably.

    But on the other, this feels like the kind of thing someone who's heard others rave about the utility of AI will try with too large a codebase, paste the result into ChatGPT, and then watch the LLM underperform because it's flooded with irrelevant context for every basic operation it's asked to do.

    There are very few times when providing the entire codebase in the context window instead of the relevant code to a single operation makes sense.

  • tesserato 8 days ago

    It is not. Others have commented pointing to services similar to this one, though.

Alifatisk 8 days ago

There is also repo2txt.simplebasedomain.com/local.html

squeegee_scream 8 days ago

This is great, but I’m pretty sure it’s trivial using Emacs and Org mode. You could then use pandoc to convert Org to Markdown.

  • lgas 7 days ago

    It's trivial using a number of approaches, e.g. a simple bash or Python script. But I think there's still a fair amount of value in building a common tool for these sorts of things. Everyone who builds their own one-off solution will inevitably encounter more and more of the edge cases (oh, I need to honor .gitignore... oh, I need to be able to override .gitignore and include some ignored things... oh, I need to deal with huge files... etc.), and a common tool can collect the ways of dealing with all of them.

    No one person will need every edge case handled, but whatever edge cases they do need will already be handled. The overall time and frustration saved this way can be huge.
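A minimal sketch of the core loop such a common tool grows around (the hard-coded `SKIP_DIRS` set stands in for real .gitignore parsing, which is exactly where the edge cases live):

```python
import os
from pathlib import Path

# stand-in for proper .gitignore handling
SKIP_DIRS = {".git", "node_modules", "__pycache__"}

def bundle(root: str) -> str:
    """Concatenate every file under root into one Markdown string:
    a heading per file path, followed by its content in a fence."""
    parts = []
    for dirpath, dirnames, filenames in os.walk(root):
        # prune ignored directories in place so os.walk skips them
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            rel = path.relative_to(root)
            parts.append(f"## {rel}\n\n```\n{text}\n```\n")
    return "\n".join(parts)
```

Honoring nested .gitignore files, override whitelists, and size limits all slot into that inner loop, which is the argument for sharing one tool rather than each person re-growing this script.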

novemp 7 days ago

How do you do the opposite of this? Transform your markdown files into a codebase that AI can't leech off of?

strizzo 8 days ago

There’s ClipSource for VSCode that does this

  • tesserato 3 days ago

    From the description, seems to only work with Python codebases.

atum47 8 days ago

Damn, I did that the other day, but manually: I just cat'ed everything in a folder in the order I wanted and fed it to ChatGPT so it could write a README for tiny.js.

mmanfrin 7 days ago

I built a simple tool to do something similar. It's meant for a monorepo and bundles each subfolder into a subfolder-code.txt file that you can upload to AIs.

https://github.com/manfrin/bundle-codebases

I don't see much merit in things like markdown or syntax highlighting, as that's just extra noise for the AI. My script tries to cut down on any extraneous data, since the things I'm working on are near the context limit of consumer AIs.

My script also ignores anything in .gitignore and will take a .codebundlerwhitelist (I hate this name and have meant to change it) to only bundle files matching patterns you specify.
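The whitelist check itself is small. Here is a sketch using stdlib `fnmatch` globs (the function name and exact matching semantics are assumptions for illustration, not mmanfrin's actual implementation):

```python
import fnmatch

def matches_whitelist(path: str, patterns: list) -> bool:
    """True if the path matches any whitelist glob, fnmatch-style."""
    return any(fnmatch.fnmatch(path, pat) for pat in patterns)

# e.g. matches_whitelist("src/app.py", ["*.py", "*.md"]) is True,
# while "image.png" would be excluded from the bundle
```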

  • antirez 7 days ago

    Not just extra noise, but also extra tokens.

sandGorgon 7 days ago

How does this compare to code2prompt or files2prompt? Any benchmarks on which one works better for LLMs?

schaefer 8 days ago

Wait, just one question…

Can I call this C++ code “machine code” now?