
BRIEFING Nº 019

Anthropic's Mythos Model Too Dangerous to Release? Plus Ramp's Background Coding Agent Build

Plus how Ramp built a full-context coding agent on Modal and why this changes AI development workflows

Sunday, April 12, 2026


Anthropic made an unusual move this week: they built something so capable they won't let you use it. Their Mythos model reportedly finds security exploits faster than any previous AI system, but instead of celebrating, they're keeping it locked away. Meanwhile, Ramp showed exactly how to build practical AI agents that actually work in production.

This disconnect tells the real story. While frontier labs debate theoretical risks, builders are shipping real solutions. Let's break down what actually matters for your next project.

What Changed

Anthropic announced they've limited the release of Mythos, their newest model, because it's "too capable of finding security exploits in software relied upon by users around the world." The company claims Mythos discovered thousands of zero-day vulnerabilities during testing, creating what they call an "unacceptable risk" if widely deployed.

But here's the part that matters for builders: this isn't just about cybersecurity theater. According to sources familiar with the testing, Mythos demonstrated a fundamentally different approach to code analysis. Instead of pattern matching against known vulnerability signatures, it appears to reason about code logic in ways that mirror how experienced security researchers think.

The model reportedly scored 89% on the CyberBench evaluation suite, compared to GPT-4's 67% and Claude 3.5 Sonnet's 71%. More importantly, in blind testing against real-world codebases, Mythos found 3.2x more exploitable vulnerabilities than the previous best AI system.

Simultaneously, Ramp published detailed documentation on how they built what they call a "full context background coding agent" using Modal. Their system runs continuously, analyzing codebases, suggesting refactors, and even implementing small fixes without developer intervention.

The timing isn't coincidental. As frontier models become capable enough to find serious security flaws, the same reasoning capabilities make them powerful coding assistants. The question becomes: do you build with these capabilities, or do you wait for permission?

Why It Matters

For beginner and intermediate developers, this week reveals three practical shifts worth understanding:

First, the capability plateau is breaking. Mythos suggests we're moving past the "better autocomplete" phase of AI coding tools. When a model can reason about exploit chains and code logic patterns, it can also reason about architecture decisions, performance bottlenecks, and refactoring opportunities in ways that previous systems couldn't.

Second, background agents are becoming viable. Ramp's implementation proves you don't need a massive team to build AI systems that work continuously on your codebase. Their agent runs on Modal's serverless infrastructure, scales automatically, and costs roughly $200 per month for a mid-sized engineering team. The key insight: instead of building one general-purpose agent, they built specialized agents for specific tasks like dependency updates, test generation, and code quality analysis.

Third, the risk calculation is changing. Anthropic's decision to limit Mythos reflects a broader industry tension. As AI systems become more capable, the delta between what's available to individual developers and what's available to well-funded teams or bad actors grows. This creates pressure to either democratize access quickly or restrict it entirely.

For small teams and individual builders, this means two things: get comfortable with the current generation of tools before they potentially become restricted, and learn to build systems that can adapt when better models become available.

The practical implication: if you're building AI-assisted workflows now, design them to be model-agnostic. Ramp's architecture works with GPT-4, Claude, or any reasonably capable language model because they focused on the orchestration layer, not model-specific features.
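One way to picture that orchestration-layer idea is a thin provider-neutral interface: the pipeline depends only on a "prompt in, text out" function, so swapping GPT-4, Claude, or GLM means swapping one callable. This is a minimal sketch of the pattern, not Ramp's actual code; the stub backends stand in for real API clients.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewAgent:
    # The only contract the orchestration layer relies on: prompt -> response.
    complete: Callable[[str], str]

    def review(self, diff: str) -> str:
        prompt = f"Review this diff and list issues:\n{diff}"
        return self.complete(prompt)

# Stub backends standing in for real model API calls.
def fake_gpt4(prompt: str) -> str:
    return "gpt-4: no issues found"

def fake_claude(prompt: str) -> str:
    return "claude: no issues found"

agent = ReviewAgent(complete=fake_gpt4)
print(agent.review("+ x = 1"))

# Switching models is a one-line change; the pipeline is untouched.
agent.complete = fake_claude
print(agent.review("+ x = 1"))
```

In a real deployment each backend wraps a vendor SDK call, but nothing downstream of `ReviewAgent` needs to know which one is active.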

Tool Radar

Modal (Production Ready)

Ramp's case study makes Modal impossible to ignore. It's a serverless compute platform specifically designed for AI workloads, and their background agent implementation shows why traditional cloud platforms fall short for this use case. Modal handles GPU scaling, container management, and async job orchestration without the typical DevOps overhead.

The killer feature: you can deploy Python functions that scale from zero to hundreds of instances automatically. Ramp's agents run as Modal functions that trigger on code commits, scheduled intervals, or external webhooks. Modal offers $30 in free credits, enough to replicate Ramp's basic setup.

GLM-5.1 (Worth Testing)

Zhipu AI released GLM-5.1 this week, and the benchmarks suggest it's competitive with GPT-4 on reasoning tasks while running inference 40% faster. More importantly for builders, it has a 128K context window and costs roughly half what you'd pay for equivalent GPT-4 usage.

The catch: documentation is sparse, and the API client libraries are still rough around the edges. But if you're building applications where cost matters more than polish, it's worth a weekend experiment.

Google's Gemini API Dials (Underrated)

Google quietly launched what they call "Flex and Priority Inference" for the Gemini API. Translation: you can now trade response speed for cost, similar to Anthropic's prompt caching but more granular.

The practical win: for batch processing tasks like code analysis or documentation generation, you can cut API costs by 60% by using the "flex" tier, which delivers results in 30-90 seconds instead of 3-5 seconds. Perfect for background agents where latency doesn't matter.

Build With It

Let's build a simplified version of Ramp's background coding agent. This implementation focuses on automated code review suggestions, something you can deploy and test this week.

Step 1: Set up the Modal environment

Create a Modal app that monitors your Git repository. The key insight from Ramp's implementation: don't try to analyze entire codebases at once. Instead, focus on diffs and recently changed files.

import modal
from modal import Image, App

# Container image with the dependencies the agent needs
image = Image.debian_slim().pip_install("openai", "gitpython", "requests")
app = App("code-review-agent", image=image)

@app.function()
def analyze_diff(repo_url: str, commit_hash: str, diff_content: str):
    # Run your review prompts over the diff and collect findings
    suggestions = []  # your analysis logic here
    return suggestions

Step 2: Create the analysis pipeline

The core logic analyzes code diffs for common issues: security vulnerabilities, performance problems, and maintainability concerns. Ramp's approach uses multiple specialized prompts rather than one general-purpose review prompt.
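A minimal sketch of that multi-prompt idea: run each specialized prompt over the same diff and collect the results per category. The prompt texts and the `call_model` stub here are illustrative placeholders, not Ramp's actual prompts or model calls.

```python
# One focused prompt per concern, rather than a single catch-all review prompt.
SPECIALIZED_PROMPTS = {
    "security": "List potential security vulnerabilities in this diff:",
    "performance": "List performance problems in this diff:",
    "maintainability": "List maintainability concerns in this diff:",
}

def call_model(prompt: str) -> str:
    # Stub: replace with a real LLM API call.
    return "(model output)"

def analyze(diff: str) -> dict:
    # Each category gets its own narrow prompt over the same diff.
    return {
        category: call_model(f"{instructions}\n\n{diff}")
        for category, instructions in SPECIALIZED_PROMPTS.items()
    }

results = analyze("+ password = 'hunter2'")
print(results)
```

Keeping the categories in a dict also makes it trivial to add or disable a pass without touching the pipeline.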

Step 3: Set up GitHub webhook integration

Use Modal's webhook decorator to trigger analysis on every push. The webhook receives the payload, extracts the diff, and queues the analysis job.
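The handler body itself is plain Python; in Modal you would expose it as a web endpoint (e.g. with `@modal.web_endpoint(method="POST")` on an `@app.function()`) and queue the job asynchronously with `.spawn()`. This sketch shows only the payload handling, using the fields GitHub's push event actually sends (`repository.clone_url`, `after` for the head commit SHA):

```python
def handle_push(payload: dict) -> dict:
    """Extract what the analysis job needs from a GitHub push event."""
    repo_url = payload["repository"]["clone_url"]
    commit = payload["after"]  # head commit SHA after the push
    # In the deployed agent this would be: analyze_diff.spawn(repo_url, commit, diff)
    return {"queued": True, "repo": repo_url, "commit": commit}

# A trimmed-down push event payload for illustration.
event = {
    "repository": {"clone_url": "https://github.com/acme/app.git"},
    "after": "abc123",
}
print(handle_push(event))
```

Returning quickly and spawning the analysis as a separate job matters here: GitHub times out webhooks that take more than a few seconds to respond.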

Step 4: Output actionable suggestions

Instead of generic feedback, format suggestions as concrete pull request comments with specific line numbers and proposed fixes. Ramp's system includes confidence scores for each suggestion, allowing developers to filter by reliability.
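A sketch of that output step, assuming a simple suggestion record (the field names here are illustrative, not Ramp's schema): filter by confidence, then shape each surviving suggestion like a GitHub review comment with a path, line number, and body.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    path: str
    line: int
    severity: str
    message: str
    confidence: int  # 1-10 scale

def to_pr_comments(suggestions, min_confidence=7):
    """Drop low-confidence suggestions and format the rest as
    review-comment payloads (path + line + body)."""
    return [
        {
            "path": s.path,
            "line": s.line,
            "body": f"[{s.severity}] {s.message} (confidence {s.confidence}/10)",
        }
        for s in suggestions
        if s.confidence >= min_confidence
    ]

comments = to_pr_comments([
    Suggestion("app/db.py", 42, "High", "Possible SQL injection", 9),
    Suggestion("app/db.py", 50, "Low", "Consider renaming variable", 4),
])
print(comments)
```

The confidence threshold is the knob that keeps noisy suggestions out of developers' review queues; start high and lower it as you tune the prompts.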

The full implementation requires about 200 lines of Python and handles repository authentication, diff parsing, and result formatting. Deploy it once, and it runs continuously without server management.

Cost breakdown: For a team pushing 50 commits per week, expect roughly $15-25 monthly in Modal compute costs plus API usage for the language model. The time savings on code review typically pays for itself within the first week.
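For your own planning, the cost math is easy to run with your team's numbers. This back-of-envelope model uses the 50-commits-per-week figure above; the token count per review and the per-token price are illustrative assumptions, not any specific provider's rates.

```python
commits_per_week = 50
weeks_per_month = 4.33
tokens_per_review = 6_000        # assumed: diff plus three prompt passes
price_per_1k_tokens = 0.01       # assumed blended input/output rate, USD

monthly_reviews = commits_per_week * weeks_per_month
llm_cost = monthly_reviews * tokens_per_review / 1000 * price_per_1k_tokens
print(f"~{monthly_reviews:.0f} reviews/month, ~${llm_cost:.2f} in LLM usage")
```

Swap in your own commit rate and your provider's pricing to see whether the agent pays for itself for your team.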

Prompt to Steal

Multi-Stage Code Analysis Prompt

Based on Ramp's approach, this prompt breaks code review into specific, actionable stages:

You are a senior software engineer reviewing a code diff. Analyze this change in three passes:

PASS 1 - SECURITY SCAN: Scan for potential security vulnerabilities. Focus on:
- SQL injection risks in database queries
- Authentication/authorization bypasses
- Input validation gaps
- Sensitive data exposure

PASS 2 - PERFORMANCE REVIEW: Identify performance concerns:
- Inefficient database queries or N+1 problems
- Memory leaks or resource management issues
- Algorithmic complexity problems
- Caching opportunities

PASS 3 - MAINTAINABILITY CHECK: Evaluate code quality:
- Code duplication or DRY violations
- Unclear variable/function names
- Missing error handling
- Test coverage gaps

For each issue found, provide:
1. Specific line number(s)
2. Severity level (Critical/High/Medium/Low)
3. Concrete fix suggestion
4. Confidence score (1-10)

Code diff to analyze:
[PASTE DIFF HERE]

Why this works: The three-pass structure mirrors how experienced developers actually review code, and the specific output format makes it easy to convert suggestions into actionable pull request comments. Ramp found that structured prompts like this reduce false positives by roughly 40% compared to general review prompts.

Worth Watching

The Mythos situation creates an interesting precedent. If Anthropic can justify withholding capabilities for security reasons, expect other labs to follow suit selectively. The question becomes: which capabilities get restricted, and who decides?

More immediately, watch for Google's response to Ramp's Modal implementation. Google Cloud has been pushing hard into the AI development platform space, and Ramp's case study shows exactly what developers want: simple deployment, automatic scaling, and cost transparency.

Also monitor the GLM-5.1 ecosystem development. If the Chinese model proves reliable at half the cost of GPT-4, it could force OpenAI and Anthropic to adjust their pricing strategies faster than expected.

The broader trend: AI development is moving from "can we build it" to "should we build it" to "how do we deploy it safely at scale." The teams that figure out the deployment and safety pieces will have lasting advantages over those focused purely on capability development.

Until next week,

Edward Yi
