LLMs In Software Engineering: 2025's Top ArXiv Papers

by Alex Johnson

Stay up-to-date with the latest advancements in software engineering through this curated collection of arXiv papers from December 5th, 2025. This compilation focuses on the intersection of Large Language Models (LLMs) and various software engineering tasks, providing insights into cutting-edge research and practical applications.

PBFuzz: Agentic Directed Fuzzing for PoV Generation

LLM-based fuzzing takes center stage with PBFuzz, an agentic directed fuzzing framework for Proof-of-Vulnerability (PoV) generation. The authors, Haochen Zeng, Andrew Bao, Jiajun Cheng, and Chengyu Song, introduce an approach that mimics how human experts identify and exploit vulnerabilities. PBFuzz frames PoV input generation as satisfying two sets of constraints, reachability and triggering, which existing methods often struggle to satisfy efficiently. The framework automates the expert workflow of code analysis, semantic constraint extraction, hypothesis formation, test input encoding, and refinement from debugging feedback, and it addresses four key challenges: autonomous code reasoning, custom program-analysis tools, persistent memory to avoid hypothesis drift, and property-based testing for efficient constraint solving. The results on the Magma benchmark are compelling: PBFuzz triggers 57 vulnerabilities, surpassing all baselines and uniquely exposing 17 of them, within a 30-minute budget per target, significantly faster than conventional approaches. By automating the traditionally manual and time-consuming process of vulnerability discovery and PoV generation, this agentic approach offers a more efficient means of securing software systems and a valuable tool for developers and security professionals.
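To make the agentic loop concrete, the sketch below shows one plausible shape for it in Python: reason about the target, form a hypothesis, sample candidate inputs property-based-testing style, and refine from crash feedback. The function names (query_llm, run_target), the memory layout, and the random input generator are all assumptions for illustration, not the authors' implementation.

```python
import random
import subprocess
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Persistent memory of extracted constraints and tried hypotheses (to avoid drift)."""
    constraints: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM backend (assumed; not specified here)."""
    raise NotImplementedError

def property_based_inputs(hypothesis: str, n: int = 32) -> list[bytes]:
    """Sample candidate PoV inputs from a generator derived from the hypothesis.
    Random byte strings stand in for a real property-based generator."""
    return [bytes(random.getrandbits(8) for _ in range(64)) for _ in range(n)]

def run_target(binary: str, data: bytes) -> tuple[bool, str]:
    """Run the target on one input; a signal-terminated run counts as a triggered PoV."""
    proc = subprocess.run([binary], input=data, capture_output=True, timeout=10)
    return proc.returncode < 0, proc.stderr.decode(errors="replace")

def pov_loop(binary: str, source_snippet: str, budget: int = 50) -> bytes | None:
    memory = AgentMemory()
    # 1. Autonomous code reasoning: extract reachability and triggering constraints.
    memory.constraints.append(query_llm(f"List path and trigger constraints:\n{source_snippet}"))
    for _ in range(budget):
        # 2. Form a hypothesis about an input shape that satisfies the constraints.
        hyp = query_llm(f"Constraints: {memory.constraints}\nTried: {memory.hypotheses}\nPropose next input shape.")
        memory.hypotheses.append(hyp)
        # 3. Property-based testing: try many sampled inputs matching the hypothesis.
        feedback = ""
        for candidate in property_based_inputs(hyp):
            crashed, feedback = run_target(binary, candidate)
            if crashed:
                return candidate  # PoV found
        # 4. Refine the constraints using debugging feedback from the failed attempts.
        memory.constraints.append(query_llm(f"Refine constraints given feedback:\n{feedback}"))
    return None
```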

Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Code vulnerability detection receives a boost from retrieval-augmented few-shot prompting, as explored by Fouad Trad and Ali Chehab, who compare the technique against fine-tuning, a common way of adapting LLMs to specific tasks. Because few-shot performance depends heavily on the selection and quality of in-context examples, particularly in complex domains, the study examines retrieval-augmented prompting as a strategy for improving it on code vulnerability detection: identifying security-relevant weaknesses in code snippets from a predefined set of vulnerability categories. The evaluation uses the Gemini-1.5-Flash model across three approaches: standard few-shot prompting with random examples, retrieval-augmented prompting with semantically similar examples, and retrieval-based labeling. Retrieval-augmented prompting consistently outperforms the other prompting strategies, achieving an F1 score of 74.05% and a partial match accuracy of 83.90% at 20 shots, and it also beats zero-shot prompting and a fine-tuned Gemini while avoiding the training time and cost of fine-tuning. Fine-tuning CodeBERT yields higher performance, but requires additional resources. The work lays out the trade-offs clearly: retrieval-augmented prompting is a cost-effective and efficient alternative to fine-tuning, opening the door to LLM-based vulnerability detection in resource-constrained environments.
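The core mechanism is easy to sketch: embed a labeled pool of snippets, retrieve the most similar ones for each query, and prepend them as shots. The sketch below assumes a sentence-transformers embedding model, a toy labeled pool, and a generic prompt format; the paper's actual retrieval setup and its Gemini-1.5-Flash call are not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model used for semantic similarity (an assumption, not the paper's choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical labeled pool of (code snippet, vulnerability category) examples.
pool = [
    ("strcpy(buf, user_input);", "CWE-120: buffer overflow"),
    ("query = 'SELECT * FROM users WHERE id=' + uid", "CWE-89: SQL injection"),
    ("os.system('ping ' + host)", "CWE-78: OS command injection"),
]
pool_embeddings = model.encode([code for code, _ in pool], convert_to_tensor=True)

def build_prompt(target_code: str, k: int = 2) -> str:
    """Retrieve the k most similar labeled snippets and prepend them as few-shot examples."""
    query_emb = model.encode(target_code, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_embeddings)[0]
    top_k = scores.argsort(descending=True)[:k].tolist()
    shots = "\n\n".join(
        f"Code:\n{pool[i][0]}\nVulnerability: {pool[i][1]}" for i in top_k
    )
    return (
        f"{shots}\n\n"
        f"Code:\n{target_code}\n"
        "Vulnerability (choose from the predefined categories):"
    )

# The resulting prompt would then be sent to the LLM (Gemini-1.5-Flash in the paper).
print(build_prompt("system('rm -rf ' + path)"))
```

Compared with random few-shot selection, the only change is how the shots are picked, which is why the approach adds retrieval cost but no training cost.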

HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding

Hanjun Luo, Chiming Ni, Jiaheng Wen, and their colleagues introduce HAI-Eval, a unified benchmark designed to measure human-AI synergy in collaborative coding. LLM-powered coding agents are reshaping the development paradigm, but existing evaluation systems fail to capture this shift, remaining focused on well-defined algorithmic problems. HAI-Eval's core innovation lies in its