Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality
Abstract
Analysis of real-world developer-LLM conversations reveals patterns in task outcomes, code quality, and common issues across multiple programming languages.
Large Language Models (LLMs) are becoming integral to modern software development workflows, assisting developers with code generation, API explanation, and iterative problem-solving through natural language conversations. Despite widespread adoption, there is limited understanding of how developers interact with LLMs in practice and how these conversational dynamics influence task outcomes, code quality, and software engineering workflows. To address this, we leverage CodeChat, a large dataset of 82,845 real-world developer-LLM conversations containing 368,506 code snippets generated in over 20 programming languages, derived from the WildChat dataset. We find that LLM responses are substantially longer than developer prompts, with a median token-length ratio of 14:1. Multi-turn conversations account for 68% of the dataset and often evolve due to shifting requirements, incomplete prompts, or clarification requests. Topic analysis identifies web design (9.6% of conversations) and neural network training (8.7%) as the most frequent LLM-assisted tasks. Evaluation across five languages (Python, JavaScript, C++, Java, and C#) reveals prevalent and language-specific issues in LLM-generated code: generated Python and JavaScript code often contains undefined variables (83.4% and 75.3% of code snippets, respectively); Java code lacks required comments (75.9%); C++ code frequently omits headers (41.1%); and C# code shows unresolved namespaces (49.2%). Within a conversation, syntax and import errors persist across turns; however, documentation quality in Java improves by up to 14.7%, and import handling in Python improves by 3.7%, over five turns. Prompts that point out mistakes in code generated in earlier turns and explicitly request a fix are the most effective at resolving these errors.
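The undefined-variable and syntax-error rates above come from static checks on standalone generated snippets. The sketch below is not the paper's analysis pipeline; it is a minimal, simplified illustration of how such checks can be approximated for Python snippets using only the standard `ast` module (the example snippet and the crude, scope-insensitive name resolution are assumptions for illustration).

```python
import ast
import builtins

def check_snippet(source: str) -> dict:
    """Two crude static checks on a Python snippet:
    (1) does it parse, and (2) which names are used but never defined?"""
    report = {"syntax_error": None, "undefined_names": set()}
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        report["syntax_error"] = str(exc)
        return report

    defined = set(dir(builtins))  # treat builtins as always available
    used = []
    for node in ast.walk(tree):
        # Names introduced by defs/classes, assignments, function args, and imports.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.append(node.id)

    report["undefined_names"] = {name for name in used if name not in defined}
    return report

# A standalone snippet that references names it never defines:
print(check_snippet("result = model.predict(X_test)"))
# -> no syntax error, but 'model' and 'X_test' are flagged as undefined
```

A production analysis would use a proper linter (e.g., per-language tooling) rather than this whole-module name walk, but the sketch conveys what "undefined variable in a standalone snippet" means in practice.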
Community
We introduce CodeChat, a large-scale dataset of 82,845 real-world developer-LLM conversations with 368,506 generated code snippets in over 20 programming languages, derived from the WildChat dataset. Our analysis shows that developers turn to LLMs for tasks such as web design and neural network training, but LLM-generated code frequently contains defects such as syntax errors and undefined variables. Dataset: https://huggingface.co/datasets/Suzhen/CodeChat
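For readers who want to browse the conversations directly, here is a minimal sketch using the Hugging Face `datasets` library. The repo id comes from the link above; the split and column names are not stated here, so the code inspects them rather than assuming a schema.

```python
from datasets import load_dataset

# Load the CodeChat dataset from the Hub (repo id taken from the link above).
dataset = load_dataset("Suzhen/CodeChat")

# Use whichever split the repository actually exposes.
split = list(dataset.keys())[0]
example = dataset[split][0]

# Inspect the real schema before relying on any field names.
print(split, example.keys())
```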
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity (2025)
- ChatGPT for Code Refactoring: Analyzing Topics, Interaction, and Effective Prompts (2025)
- Is LLM-Generated Code More Maintainable & Reliable than Human-Written Code? (2025)
- The Impact of Large Language Models (LLMs) on Code Review Process (2025)
- Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects (2025)
- Exploring Direct Instruction and Summary-Mediated Prompting in LLM-Assisted Code Modification (2025)
- Exploring the Challenges and Opportunities of AI-assisted Codebase Generation (2025)