
Initial Study: Outperforming Traditional Security Tools with AI-Powered Analysis

The Challenge: Testing AI Against Real-World Vulnerabilities

For our initial study, we developed a custom Flask application in Python to rigorously test our AI-native security tool. The application contained 14 carefully crafted vulnerabilities, including several subtle logical issues that conventional static and dynamic analysis tools typically miss. This provided a challenging and realistic environment to evaluate the true capabilities of our contextual analysis approach.

What makes logical vulnerabilities particularly insidious is that they often arise from business logic flaws rather than coding mistakes. Traditional tools struggle with these because they lack the contextual understanding required to identify when application workflows violate security principles.

Our End-to-End Scanning Pipeline

Our framework enables systematic, repeatable scanning of software repositories—whether applied to entire codebases, scoped to specific pull requests, or extended to legacy systems that often lack complete documentation. This flexible design ensures scalability from large multi-service systems to targeted incremental changes.

Step 1: Repository Ingestion and Context Summarization

The scanning process begins with comprehensive ingestion of the repository, including source code, configuration files, documentation, and dependency manifests. Our Program Analysis Engine parses these artifacts to construct structural representations including:

  • Abstract Syntax Trees (ASTs): Capturing the grammatical structure of code
  • Control-Flow Graphs (CFGs): Mapping execution paths through the program
  • Data-Flow Graphs (DFGs): Tracking how data moves and transforms
  • Call Graphs: Understanding function relationships and dependencies

These graph structures are stored and maintained within the engine, allowing efficient reuse for future scans by focusing only on newly added or modified components. Our fine-tuned LLM then generates hierarchical summaries at multiple levels (file, module, and system) that describe functionality, interactions, and security-relevant mechanisms.
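The Program Analysis Engine itself is not published, but as a rough illustration of the kind of structure these graphs capture, here is a minimal call-graph extractor built on Python's standard-library ast module (the sample source and function names are hypothetical):

```python
import ast
from collections import defaultdict

# Hypothetical snippet of application code to analyze.
SOURCE = '''
def hash_password(pw):
    return md5(pw)

def register(user, pw):
    store(user, hash_password(pw))
'''

def build_call_graph(source: str) -> dict:
    """Map each function definition to the set of names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    graph[node.name].add(inner.func.id)
    return dict(graph)

# hash_password calls {md5}; register calls {store, hash_password}
print(build_call_graph(SOURCE))
```

Even this toy graph shows why such structures help: the edge from register to md5 (via hash_password) is exactly the kind of relationship a scanner needs in order to reason about where weak hashing actually gets used.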

Step 2: Vulnerability Detection and Policy Enforcement

The vulnerability-detection agent dynamically queries relevant context before initiating its reasoning. It formulates precise code-level queries, executes them against the program analysis engine, and retrieves targeted structural slices of the code under review.

In parallel, it issues natural-language queries to our knowledge base to obtain supplementary security intelligence such as CVEs, CWEs, or organization-specific policies. By combining insights from both the program analysis artifacts and the evolving knowledge base, the agent identifies vulnerabilities, evaluates them against project-specific security policies, and ensures potential violations are surfaced with actionable detail.
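The real agent, engine, and knowledge-base interfaces are not published; as a hypothetical sketch under those caveats, the two retrieval channels in this step might be combined roughly like this:

```python
class ProgramEngineStub:
    """Stand-in for the program analysis engine: returns code slices by symbol.
    (Names and interfaces here are assumptions, not the real API.)"""
    def __init__(self, slices: dict):
        self.slices = slices
    def slice_for(self, symbol: str) -> list:
        return self.slices.get(symbol, [])

class KnowledgeBaseStub:
    """Stand-in for the knowledge base: matches intelligence entries by keyword."""
    def __init__(self, intel: dict):
        self.intel = intel
    def search(self, text: str) -> list:
        return [note for keyword, note in self.intel.items() if keyword in text]

def detect(symbol, engine, kb) -> list:
    """Pair each line of the retrieved code slice with matching security
    intelligence; the real agent hands both contexts to the LLM instead."""
    findings = []
    for line in engine.slice_for(symbol):      # structural context (Step 1 graphs)
        for issue in kb.search(line):          # CVE/CWE/policy intelligence
            findings.append({"symbol": symbol, "line": line, "issue": issue})
    return findings

engine = ProgramEngineStub({"hash_password": ["digest = md5(pw)"]})
kb = KnowledgeBaseStub({"md5": "CWE-327: use of a broken cryptographic algorithm"})
print(detect("hash_password", engine, kb))
```

The point of the sketch is the shape of the loop, not the matching logic: structural slices answer "what does this code do," while the knowledge base answers "what is known to be dangerous," and the agent reasons over both.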

Step 3: Automated Remediation Suggestions

The remediation agent generates secure code patches, configuration adjustments, or design recommendations to address identified vulnerabilities. Fixes are aligned with organizational policies and verified against functional requirements through automated test integration. Depending on organizational preference, patches can either be auto-applied or flagged for developer review and approval.
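The dispatch logic described above can be sketched as follows (a simplified, hypothetical model; the actual remediation agent's interfaces are not published):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Patch:
    """A proposed fix: the file touched, the code removed, the code added."""
    file: str
    before: str
    after: str

def dispatch(patch: Patch, run_tests: Callable[[Patch], bool],
             auto_apply: bool) -> str:
    """A patch becomes eligible only once automated tests pass; organizational
    preference then decides between auto-apply and developer review."""
    if not run_tests(patch):
        return "rejected: breaks functional tests"
    return "applied" if auto_apply else "flagged for developer review"

# Example: swap MD5 password hashing for a purpose-built password hash.
patch = Patch("auth.py",
              before="hashlib.md5(pw.encode()).hexdigest()",
              after="generate_password_hash(pw)")
print(dispatch(patch, run_tests=lambda p: True, auto_apply=False))
# flagged for developer review
```

Gating on the test suite first matters: an auto-applied fix that silently breaks a workflow would cost more developer trust than the vulnerability it removes.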

The Test Environment

In this first version, we built the pipeline using Qwen3 32B as the base model, a prototype of our program analysis engine providing code context, and a knowledge base combining general security intelligence with system-specific information. For this study, our evaluation focused specifically on vulnerability detection.

To establish a robust baseline for comparison, we tested the same Flask application against leading security tools including:

  • CodeQL: GitHub's semantic code analysis engine
  • SonarQube Cloud: Popular continuous code quality and security platform
  • Semgrep Community: Lightweight static analysis tool

All tools were run with their recommended or default settings. For CodeQL, we additionally enabled the extended set of queries to maximize coverage and ensure a fair comparison.

The Vulnerability Test Set

The test set was curated to evaluate performance across a comprehensive spectrum of vulnerability types:

  • Common Web Flaws: Stored Cross-Site Scripting (XSS) and Log Injection
  • File Upload Issues: Unchecked file size, dangerous file types, and file name overwrites
  • Cryptographic Weaknesses: Use of broken algorithms (MD5) for password hashing
  • Credential Exposure: Clear-text credentials stored in cookies
  • Logical Flaws: Insecure Direct Object Reference (IDOR) requiring workflow understanding

The inclusion of logical flaws like IDOR was specifically designed to challenge the tools' contextual analysis capabilities, as these vulnerabilities require understanding the application's intended behavior and access control model.
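The study application itself is not published, but a framework-agnostic sketch shows why the IDOR class is hard for pattern matchers: the vulnerable code is syntactically unremarkable, and only the intended ownership model reveals the flaw.

```python
# Illustrative only -- not the study's Flask application.
ORDERS = {1: {"owner": "alice", "total": 42},
          2: {"owner": "bob", "total": 99}}

def get_order(order_id: int, current_user: str) -> dict:
    """VULNERABLE: returns whichever order the caller asks for. The lookup is
    well-formed, so signature-based scanners see nothing wrong."""
    return ORDERS[order_id]

def get_order_fixed(order_id: int, current_user: str) -> dict:
    """FIXED: the ownership rule is enforced before the record is returned."""
    order = ORDERS[order_id]
    if order["owner"] != current_user:
        raise PermissionError("caller does not own this order")
    return order
```

Nothing in the vulnerable function matches a known-bad pattern; detecting it requires knowing that orders are meant to be private to their owners, which is exactly the contextual knowledge this test set was built to probe.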

Initial Findings: A Promising Start

In the first phase, we reviewed the qualitative output from our tool, including the generated graphs and the file-, folder-, and repository-level summaries. The descriptions and contextual analysis were accurate and met our expectations, demonstrating the AI's ability to grasp the codebase's architecture and purpose from a security-centric viewpoint.

Quantitative Results: The Numbers Speak

The second phase involved quantitative analysis of our tool's performance against the 14 known vulnerabilities, with a comparative baseline provided by other tools. Our tool successfully identified 13 of the 14 vulnerabilities, achieving:

  • Recall: 0.92 (92% of vulnerabilities detected)
  • Precision: 1.0 (zero false positives)
  • F1 Score: 0.96 (excellent balance of precision and recall)

This significantly outperformed the conventional security tools in our test environment.
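For reference, these figures follow from the standard definitions; the reported values appear to be truncated (not rounded) to two decimal places, and under that convention they can be reproduced directly:

```python
def truncate2(x: float) -> float:
    """Truncate to two decimals, matching the table's apparent convention."""
    return int(x * 100) / 100

def score(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return truncate2(precision), truncate2(recall), truncate2(f1)

print(score(tp=13, fp=0, fn=1))   # Reware: (1.0, 0.92, 0.96)
print(score(tp=7,  fp=1, fn=7))   # CodeQL: (0.87, 0.5, 0.63)
```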

Comparative Analysis: Head-to-Head Results

The table below presents the performance metrics across all tested tools. Our approach (Reware) demonstrated the highest F1 score, with nearly double the detection rate of the next-best tool:

┌────────────────────┬─────────────┬───────────┬────────────┬──────────────┐
│ Metric             │ Reware      │ CodeQL    │ Semgrep    │ SonarQube    │
├────────────────────┼─────────────┼───────────┼────────────┼──────────────┤
│ True Positives ↑   │ 13          │ 7         │ 5          │ 3            │
│ False Negatives ↓  │ 1           │ 7         │ 9          │ 11           │
│ False Positives ↓  │ 0           │ 1         │ 3          │ 2            │
│ Precision ↑        │ 1.00        │ 0.87      │ 0.62       │ 0.60         │
│ Recall ↑           │ 0.92        │ 0.50      │ 0.35       │ 0.21         │
│ F1 Score ↑         │ 0.96        │ 0.63      │ 0.45       │ 0.31         │
└────────────────────┴─────────────┴───────────┴────────────┴──────────────┘

↑ = higher is better  |  ↓ = lower is better

Key Insights from the Study

Several important observations emerged from our initial study:

  • Zero False Positives: Our tool achieved perfect precision on this test set, so every alert it raised was a genuine vulnerability rather than a false alarm
  • Contextual Understanding: The tool successfully identified logical flaws like IDOR that require understanding application workflows—something traditional tools consistently missed
  • Significant Performance Gap: We detected nearly twice as many vulnerabilities as CodeQL, and more than four times as many as SonarQube
  • Scalable Architecture: The graph-based analysis and incremental scanning approach demonstrated the potential for efficient large-scale deployments

What We Missed: The One That Got Away

Transparency is crucial in research. Our tool missed one vulnerability in the test set. This miss provides valuable insight into areas for improvement and highlights that even advanced AI systems require continuous refinement. We're using this finding to enhance our detection algorithms and expand our training data.

Looking Forward: Next Steps

While these results are encouraging, we acknowledge that this was an initial evaluation on a limited set of vulnerabilities in a single application. To fully assess the robustness and generalizability of our approach, further studies on a broader range of real-world repositories are necessary.

Such evaluations will help uncover potential limitations and guide systematic improvements to each component of our system, including:

  • The underlying language model and its fine-tuning
  • The program analysis engine's graph construction algorithms
  • The evolving knowledge base and retrieval mechanisms
  • Integration with existing development workflows and CI/CD pipelines

Our framework still requires scaling, hardening, and comprehensive validation before it can be reliably adopted in real-world software development environments. However, these initial results demonstrate the significant potential of AI-powered contextual analysis to transform application security.

The Path to Production

We're committed to rigorous testing and transparent reporting of our findings. Our roadmap includes:

  • Expanding testing to diverse open-source projects and languages
  • Conducting blind comparative studies with security researchers
  • Building integration points for popular development platforms
  • Developing comprehensive remediation and automated fixing capabilities
  • Creating detailed metrics and reporting dashboards for security teams

Join Our Research Program

We're looking for early adopters to help us test and refine our AI-powered security analysis. Get exclusive early access and help shape the future of application security.
