My AI Use Cases | Retrospective
From blog re-platforming and static/taint analysis across multiple languages to SAST tooling and AI-assisted code review: my use cases and experiences with AI.
- Overview
- Projects
- Project 1: Blog
- Speed
- Maintainability
- Cost
- Project 2: Taint Tracking (20+ languages)
- Methodology
- Problems During Development
- Cost
- Project 3: AI Audit Loop + MCP Tooling
- Results
- Methodology & Iteration
- Issues
- Cost
- Project 4: Bug Bounty Targets
- Cost & Output Quality
- Other Thoughts on AI
- Security
- Claude Code’s Quota Model
- Conclusion
TL;DR: If you haven’t started using AI tooling, now is a great time to start. Tooling and models are at a point where the output is at least usable and interesting, and plans like Claude Code’s are heavily subsidized, offering far more tokens than equivalent API pricing. I would say it makes sense to collect a backlog of projects you’ve wanted to create, get a $200/mo subscription for one month if you can afford it, and just go wild.
Overview
Since GPT-3.5 I have been using ChatGPT through the web interface, copy/pasting code snippets. It helped me quickly build out components that involved somewhat complicated setups like Cytoscape.js and greatly sped up ramp time for side projects. Instead of reading docs, it was nice to prompt for examples and see how to manipulate an API within the context of my problem. Since then, I have been using GitHub Copilot for IDE-integrated auto-complete assistance. While these have been nice to use (with all their shortcomings), this post mainly focuses on my recent usage of Claude Code - a product I took a more serious look at now that Opus 4.5, MCP, Skills, and other standards/tooling have had time to mature. At this point I have spent almost 2 months on the Claude $200/mo plan.
Projects
Project 1: Blog
I have used Claude Code to rewrite this site from a Jekyll build engine to Astro. I always avoided writing blog posts because my Jekyll process was so customized that even with the few pages on my site, builds took minutes and hot reloading did not work properly. Now that I am no longer hosting on GitHub Pages, Jekyll is no longer needed. Astro provides a nice base with faster builds and hot reloading, and its templates are more readable than other established products like Hugo. I am happy with this decision so far.
Speed
I mean, what can I say, I did all this re-platforming work on my blog in 2 days of part-time work. That’s crazy.
Maintainability
I can see how this is a problem. Claude Code has done a wonderful job of implementing a pretty cool glitch system where the screen glitches out every once in a while and gives the site character. I did have to ask for a refactor to split logic into files, but otherwise I notice a lot of what I would consider “magic” values and CSS all over the place. If I want to edit this in the future, it is not only a problem that I do not know the codebase; the structure makes me feel effectively vendor-locked into AI products for making future updates without a significant investment of time and effort. While this glitch system is just for fun, I can imagine production software that gets equally complex running into similar issues.
Cost
From a cost perspective, the Claude Code 200/mo plan seems to be heavily subsidized. I’ve used maybe 3% of a weekly allowance to completely re-platform, migrate, fix a few bugs, and iterate on the glitch engine. Honestly pretty cheap for the amount of time it saved.
Project 2: Taint Tracking (20+ languages)
I have a few other tools to assist with static analysis of source code. As a good test case for Claude Code, I tried to create a tool that uses tree-sitter to perform taint tracking across function calls for multiple languages, and to detect state-variable writes from tainted input. If we step back, there are tools like CodeQL, Semgrep, etc. that will try to map your codebase. These tools work across multiple languages, but ultimately every language is different. They, along with tree-sitter (usually the parsing engine underneath such tooling, language servers, etc.), have always had the problem of maintaining support across many languages. Often this requires community involvement or relatively large teams of domain-specific developers, and as a result the APIs within each tool deviate per language, even for languages that are semantically similar.
Due to this variation, a side project with so many languages to support would be impossible for me to complete alone, even when relying on a single parsing engine like tree-sitter.
What a cool use case for AI!
I thought this would be a nice test of Claude Code because even though the language support is extensive, the task the program needs to do is quite constrained: no complex GUI or overlapping features/actions.
And it did it! I’m not saying what I built is perfect: it has bugs, a lot of them, and I’m sure it has coverage issues. That said, the fact that I have a working tool of any sort is pretty nifty. I pipe the output into my AI-assisted code review tooling and hope that it helps guide me to a few hot spots.
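To make the idea concrete, here is a minimal, single-language sketch of the kind of flow the tool looks for. It uses Python’s ast module as a stand-in for tree-sitter, and the source/sink names and the tracker itself are invented for illustration, not the real tool’s code:

```python
import ast

# Illustrative source/sink names; the real tool's per-language rules differ.
SOURCES = {"input"}          # functions returning attacker-controlled data
SINKS = {"system", "eval"}   # dangerous consumers, e.g. os.system

def call_name(call):
    """Return the called function's simple name, if recoverable."""
    fn = call.func
    if isinstance(fn, ast.Name):
        return fn.id
    if isinstance(fn, ast.Attribute):
        return fn.attr
    return None

def find_tainted_sinks(code):
    """Flag sink calls whose argument traces back to a source via
    straight-line assignments (no branches, no inter-procedural flow)."""
    tainted, findings = set(), []
    for stmt in ast.parse(code).body:
        # x = input(...) marks x tainted; x = y propagates taint
        if isinstance(stmt, ast.Assign) and len(stmt.targets) == 1:
            tgt, val = stmt.targets[0], stmt.value
            if isinstance(tgt, ast.Name):
                if isinstance(val, ast.Call) and call_name(val) in SOURCES:
                    tainted.add(tgt.id)
                elif isinstance(val, ast.Name) and val.id in tainted:
                    tainted.add(tgt.id)
        # report sink(x) where x is currently tainted
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call) and call_name(node) in SINKS:
                for arg in node.args:
                    if isinstance(arg, ast.Name) and arg.id in tainted:
                        findings.append((node.lineno, call_name(node), arg.id))
    return findings
```

The real version has to reproduce roughly this logic per language on top of tree-sitter grammars, which is exactly the breadth problem described above.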
Methodology
For this particular project I was largely using the Anthropic Web Portal where you can spin up instances of Claude Code on your GitHub repos and commit & push back. It was actually quite nice to code from my phone while out and about or lying in bed.
Problems During Development
- Claude Web Portal doesn’t seem to support long-running tasks.
- Skill & spec files (e.g. CLAUDE.md) not being respected.
- Tests would constantly be implemented to validate the current state of the tool, not to validate proper output. I had to repeatedly tell it to have the LLM evaluate what the test output should be based on the source code input; otherwise it would generate tests that were very loose and did not actually validate that we implemented the functionality properly.
- Opus would not thoroughly think through all edge cases. I constantly had to offer “what about this” prompts, e.g. when dealing with decorators or other language-specific features, frameworks, or coding practices like dynamically registered callstacks.
- The newly released Claude remote control feature doesn’t respect YOLO mode settings, so babysitting consent prompts causes friction.
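To illustrate the loose-test problem from the list above, here is a hedged sketch; the tracker stub and its interface are invented for the example:

```python
def run_tracker(code):
    """Invented stub standing in for the real tracker; always 'finds' one flow."""
    return [{"source": "input", "sink": "system", "line": 2}]

# Loose test (the kind the LLM kept generating): it passes as long as the
# tool returns *something*, so it locks in current behavior, not correctness.
def test_loose():
    assert run_tracker("cmd = input()\nos.system(cmd)") is not None

# Stricter test: the expected flow is derived from the input snippet itself,
# so a wrong or missing finding actually fails the test.
def test_expected_flow():
    flows = run_tracker("cmd = input()\nos.system(cmd)")
    assert any(f["source"] == "input" and f["sink"] == "system" for f in flows)

test_loose()
test_expected_flow()
```

The first style tells you the tool ran; only the second tells you the analysis is right, which is the distinction I kept having to spell out in prompts.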
Cost
This project ended up creating a massive number of tests, 5000+, that had to be refactored due to the problems listed above. I think it took the equivalent of 1.5 weeks’ worth of $200/mo Claude Code tokens. Was it worth it? Maybe? Considering this built something I wouldn’t have had time to put together myself at all, it isn’t too expensive - and hiring a separate developer to do this work would have cost significantly more.
Project 3: AI Audit Loop + MCP Tooling
I have a previous project where I use LSP servers to parse a codebase and pull out callstacks, state variable usage (reads/writes), etc. My goal was to build a Ralph-like audit loop that leverages this data through an MCP server.
For this project I created an audit loop of tasks that can dynamically generate more tasks as it explores the codebase. The MCP server was also vibe-coded, as was a pre-analysis script that looks for common sources/sinks, and the taint tracking tool mentioned above was integrated as well. With all this data, the goal is for the AI-assisted SAST tool to find issues in source code that has been pre-indexed by the other tooling.
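The dynamic task generation can be sketched as a simple worklist. Here, run_audit_task is a stub standing in for the real model call (Claude Code plus MCP context); the task names and findings are invented for illustration:

```python
from collections import deque

def run_audit_task(task):
    """Stub for one audit pass: returns (findings, follow-up tasks).
    The real version prompts the model with MCP-indexed code data."""
    followups = {"auth": ["session-handling"]}
    findings = {"auth": ["missing authz check on /admin"], "session-handling": []}
    return findings.get(task, []), followups.get(task, [])

def audit_loop(seed_tasks):
    queue, seen, report = deque(seed_tasks), set(seed_tasks), []
    while queue:
        task = queue.popleft()
        task_findings, new_tasks = run_audit_task(task)
        report.extend(task_findings)
        for t in new_tasks:          # dynamically grow the worklist
            if t not in seen:        # dedupe so exploration terminates
                seen.add(t)
                queue.append(t)
    return report
```

Deduplicating tasks matters: without the seen set, an exploratory agent happily re-proposes areas it has already audited and the loop never converges.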
Results
I have not used the tool too much but it seems pretty nice. It finds things the new claude-code-security-review does not find, although at a much higher token cost. I have found RCE, SSRF, Authorization issues, parallel data structure desyncs, race conditions, etc. Some pretty interesting bugs, actually.
I notice it struggles with IDORs; I’m not sure why. It also struggles with some state/logic-based issues in web3 projects, maybe because those complex, math-heavy systems usually rely on fuzzing or other methods to find crazy states. It may be a good idea to use the MCP data to generate fuzzing ideas for further testing.
Methodology & Iteration
For building the tool, I have mentioned at a high level the tooling that I have that plugs into the audit loop.
To iterate on a better audit loop:
- I added a --debug parameter that instructs the LLM to output issues it sees during the audit, how it thinks we should improve our MCP server and tooling, etc.
- I have another project to track open source vulnerabilities and their code patches based on osv.dev data. Additionally, web3 companies have open source reports like code4rena reports that can be used to pull the source code, scan it, and see what the tool missed. This has been a good way to iterate as well.
- After an audit I will ask Claude Code to use the report to try and create dynamic POCs against locally hosted versions of the apps for dynamic testing, and sometimes it finds things we missed. So then I try to feed that learning back into the audit process as well.
Issues
I’m just going to say it: spec driven development sucks!!
We all know LLMs have issues, and many tout Test-Driven Development (TDD) to get around some of the rough edges. As mentioned before, LLMs like to make flaky tests, but what happens when your codebase is skill .md files? Well, then you end up with spec.md files to make sure anything added to a skill doesn’t go against the spec. Except now you have a non-deterministic next-token generator making sure you didn’t mess up your project with an update, which is functionally untestable in practice. Context bloat is a real thing; deleting things it shouldn’t have is a problem; duplication, which is somewhat fine (sometimes) in a normal codebase, is a larger issue here due to context constraints; and the AI doesn’t always think through changes in totality or how everything integrates together. I don’t want to say it’s been a large point of frustration, but the fact is I DON’T KNOW WHAT I’VE BROKEN. I’ve caught a few things myself, but otherwise, who knows if the audit loop is in its best state or not. Is the MCP server hosting too many tools, or are its APIs clear and optimized for our purposes? Probably not.
Again, it’s cool I have something that works, it finds real bugs, it’s a great first pass at code review. Is it perfect? No, and that’s ok. For all my gripes about the process, I think it’s pretty nifty.
Cost
From a cost perspective, there is quite a lot of iteration on building the tool, but I’m seeing about 8-20% of weekly usage per audit on the Claude Code $200/mo plan (the higher end comes from decompiled apps with a lot of libraries and source code, even after stripping out large minified files, unrelated files, etc.). If we assume 4 weeks of usage at 100% quota per week, that is 400% of weekly quota per month for $200, so 8-20% of a week works out to roughly $4-10 per audit. For a first pass at a codebase that you maybe scan once or twice before deployment, this is super worth it given the output I’m seeing. That said, this is only on the highly subsidized Anthropic plan; using the API this number would balloon, so I am hopeful ad-hoc pricing can drop to this level without having to commit to monthly fees.
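The back-of-envelope math above can be written out explicitly (assuming the plan amounts to roughly 4 full weekly quotas per month):

```python
# Cost per audit on the $200/mo plan, treating the month as ~4 weekly quotas.
monthly_price = 200.0
weeks_per_month = 4
weekly_value = monthly_price / weeks_per_month   # ~$50 of quota per week

for share in (0.08, 0.20):   # observed 8-20% of weekly usage per audit
    print(f"{share:.0%} of a week is about ${weekly_value * share:.2f} per audit")
# prints roughly $4.00 at 8% and $10.00 at 20%
```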
Project 4: Bug Bounty Targets
The idea was to take something like this bounty-targets-data repo and have the LLM navigate to each program page to evaluate which programs had good open source targets, or had things like Docker Hub or AWS Marketplace containers where source was available through other means.
For this I leveraged a Ralph loop and vibe coded a small script to generate the PRD file, do the data scraping and analysis, then consolidate and output an HTML report. A Ralph loop was necessary because the Claude Code web portal did not offer long enough running instances for agents, and I did not want to give input or have checkpoints midway through the analysis.
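For readers unfamiliar with the pattern, a Ralph loop just re-runs the same prompt until the agent signals completion; all progress lives in the files the agent edits, not in conversation state. A minimal sketch, where `claude -p` (non-interactive print mode) is my assumption about the CLI shape and the completion check is a placeholder:

```python
import subprocess

def claude_once(prompt):
    """One non-interactive agent pass; assumed CLI flags, verify locally."""
    subprocess.run(["claude", "-p", prompt], check=True)

def ralph_loop(run_once, is_done, prompt, max_iters=50):
    """Invoke run_once(prompt) repeatedly until is_done() or the cap hits.
    Returns the number of passes taken."""
    for i in range(max_iters):
        if is_done():   # e.g. a marker file the PRD tells the agent to write
            return i
        run_once(prompt)
    return max_iters
```

A completion check might look for a marker file the PRD instructs the agent to create once every program page has been analyzed; the iteration cap keeps a confused agent from burning quota forever.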
Cost & Output Quality
The cost was quite steep here: with all the analysis it had to do, it probably used the equivalent of a week’s worth of tokens on the $200/mo Claude Code plan, and I ran it a couple of times with tweaks for re-runs. Output was OK but not ideal; it would get confused about which GitHub repos and Docker containers were associated with each scope and project. Some output may be usable, but I’m still investigating. It feels like it took a lot of tokens and I’m not super happy with the output quality, but again, it’s kind of cool and better than nothing; if I can find one decent target from it, maybe it will be worth it.
Other Thoughts on AI
Security
Obviously there are multiple issues with the tech: granting too many permissions to these models, putting the runtimes in environments that have access to secrets or other services, malicious MCP servers, malicious skills, skill-induced typo-squatting situations, the fact that prompt injection and non-determinism are still largely unsolved since safeguards are more band-aids than true fixes, etc. etc. etc. I’m not going to list all the risks here as it’s a bit of a moving target, but just know that you should be aware of the risks before using AI and be careful about what attack surface you open yourself up to. Companies like Trail of Bits are trying to put together frameworks to make usage more secure, like their claude-code-devcontainer project.
Claude Code’s Quota Model
Usage starts at your first message, so you need to be on top of sending a message to reset your weekly quota. I just wanted to mention it here because it’s unlike any product I’ve seen before. Very strange; I don’t love it. During my second month of $200/mo usage I had to wait a good 4-5 days before I could start using the quota I had just purchased... it kind of stressed me out, really. But for the token pricing you get with the plan, it’s just something to live with.
Conclusion
AI is awesome and has its place. It’s great until it’s not. Use with caution and be aware of its edges. I would say at this point if you are totally against AI, you should open up a bit more, but don’t listen to those that say it’s a panacea, it’s not.
AI is a force multiplier for side projects and POCs, and probably production work too, although use it with caution there. I’ve seen people gripe about AI capabilities in real codebases, but the reality is that companies are outsourcing development work to cheaper countries anyway, where code quality and attack surface suffer similarly, though in different ways and at different magnitudes.
As someone with a full time job and family with small kids at home, I have very limited time to work on these project ideas. AI feels like a bit of an equalizer for investigating ideas where, even if I feel like my learning has not been as deep as if I had hand coded all this, it has been a great experience to flow through so many projects and see a few nice results.