Dimitris Kyrkos

"Beyond Linting: A Data-Driven Approach to Suggesting Better Code, Not Just Flagging Bad Code"

Intro:

Every developer has experienced this loop: you run your linter or static analysis tool, it highlights a dozen issues – long methods, high cyclomatic complexity, tight coupling – and then… you're on your own. You know what's wrong. You just don't know what better looks like in your specific context.

A recently published paper in IET Software tackles this gap head-on. Titled "A Data-Driven Methodology for Quality Aware Code Fixing" by Thomas Karanikiotis and Andreas Symeonidis (Aristotle University of Thessaloniki), it presents a system that doesn't just detect code quality problems – it recommends concrete, higher-quality alternatives drawn from real-world code.

Here's how it works, and why it matters for developer tooling.

The Problem: Detection Without Direction

Static analysis has matured significantly. Tools like SonarQube, ESLint, Pylint, and platforms like Cyclopt can evaluate code across dimensions such as maintainability, security, readability, and reusability. They grade your codebase, flag violations, and prioritize technical debt.

But there's a disconnect. Once you know that a function has excessive complexity or poor cohesion, refactoring it still requires judgment, effort, and domain knowledge. For junior developers especially, the distance between "this method is too complex" and "here's how to decompose it properly" can be enormous.
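
To make that concrete, here's a toy, stdlib-only check in the spirit of those linters (my own illustration, not code from the paper or any of the tools above). It can tell you a function is too complex – and that's where it stops.

```python
import ast

# Constructs that add a decision point, roughly following McCabe.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

def cyclomatic_complexity(source: str) -> int:
    """Rough estimate: 1 + the number of branching constructs in the snippet."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

snippet = '''
def pick_shipping(order):
    if order.express:
        if order.weight > 20:
            return "freight"
        return "courier"
    for item in order.items:
        if item.fragile:
            return "padded"
    return "standard"
'''

score = cyclomatic_complexity(snippet)
if score > 3:
    # This is where most tools stop: a warning, but no better alternative.
    print(f"warning: cyclomatic complexity {score} exceeds threshold 3")
```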

The paper proposes bridging that gap with a recommendation engine built on top of quality-annotated code snippets.

The Approach: Functional Match + Quality Upgrade

The methodology works in three core stages:

  1. Dataset Construction
    The researchers built a rich dataset on top of the CodeSearchNet corpus, enriching each code snippet with static analysis metrics: complexity, coupling, cohesion, documentation quality, coding violations, readability scores, and source code similarity metrics.

  2. Functional Similarity Assessment
    When a developer submits a code snippet, the system identifies functionally equivalent alternatives – code that does the same thing, verified through advanced similarity techniques. This is the crucial step: the replacement must actually work for the same purpose.

  3. Quality-Aware Ranking
    The system then ranks the functionally equivalent candidates by quality metrics. The top suggestions are snippets that not only match what your code does but score measurably better on maintainability, readability, and structural quality (a rough sketch of this filter-then-rank flow follows the list).
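
Here is a minimal, hypothetical sketch of that filter-then-rank flow. The similarity and quality functions below are crude stand-ins for illustration only; the paper uses dedicated code-similarity techniques and the full set of static analysis metrics.

```python
from dataclasses import dataclass
import difflib

@dataclass
class Snippet:
    code: str
    maintainability: float  # 0..1, pre-computed offline by static analysis
    readability: float      # 0..1, higher is better

def functional_similarity(a: str, b: str) -> float:
    # Stand-in only: the paper relies on dedicated code-similarity techniques,
    # not a plain text ratio.
    return difflib.SequenceMatcher(None, a, b).ratio()

def quality_score(s: Snippet) -> float:
    # Stand-in aggregation of the quality dimensions into a single number.
    return 0.5 * s.maintainability + 0.5 * s.readability

def recommend(query: str, corpus: list[Snippet], k: int = 3,
              min_similarity: float = 0.6) -> list[Snippet]:
    # Stage 2: keep only candidates that plausibly do the same thing.
    candidates = [s for s in corpus
                  if functional_similarity(query, s.code) >= min_similarity]
    # Stage 3: among those, surface the best-scoring alternatives.
    return sorted(candidates, key=quality_score, reverse=True)[:k]
```

Swap in a real similarity model and your preferred metric weights; the shape of the pipeline – filter by function first, then rank by quality – is the point.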

A key design decision: the system also evaluates syntactic similarity, prioritizing alternatives that look similar to the original. This minimizes the cognitive overhead of adopting a suggestion – you're not replacing your entire approach, just getting a cleaner version of it.
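
One way to picture that preference (again my own illustration, not the paper's method): measure how large the diff between your snippet and a candidate would be, and favor candidates that change fewer lines.

```python
import difflib

def changed_lines(original: str, candidate: str) -> int:
    """Count added/removed lines between your snippet and a suggested one."""
    diff = difflib.unified_diff(original.splitlines(), candidate.splitlines(),
                                lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

original = "def total(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
candidate = "def total(xs):\n    return sum(xs)\n"
print(changed_lines(original, candidate))  # smaller diffs are easier to review and adopt
```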

What Makes It Interesting for Practitioners

Language-agnostic architecture. The methodology isn't tied to a single language. The quality metrics and similarity assessments are designed to work across different programming languages, which matters in polyglot codebases.

Practical over theoretical. The evaluation shows the system produces alternatives that are both functionally equivalent and syntactically close to the originals – meaning they're actually usable, not academic curiosities that happen to score well on metrics.

Closes the feedback loop. If you're already using quality dashboards (Cyclopt's quality scoring, for instance, evaluates maintainability, security, readability, and reusability on every commit), this kind of recommendation system turns passive monitoring into active guidance. Instead of a grade, you get a path to a better grade.

The Bigger Picture

This research sits at the intersection of several trends in developer tooling:

  • AI-assisted coding is everywhere, but most tools focus on generation, not the improvement of existing code

  • Technical debt management is increasingly data-driven, yet remediation is still manual

  • Code reuse from open source is standard practice, but quality filtering is rarely systematic

The paper argues – and I think convincingly – that we have enough data in open-source repositories to build quality-aware recommendation systems that work. The CodeSearchNet corpus alone contains millions of functions across six languages. Enriching that data with quality metrics transforms it from a search index into a quality improvement engine.
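
As a rough picture of what that enrichment pass could look like – field names and metrics here are illustrative, not the actual CodeSearchNet schema or the paper's pipeline:

```python
import json

def quality_metrics(code: str) -> dict:
    # Stand-in for real static analysis (complexity, coupling, readability, ...).
    lines = code.splitlines() or [""]
    return {
        "loc": len(lines),
        "avg_line_length": sum(len(l) for l in lines) / len(lines),
    }

def enrich_corpus(in_path: str, out_path: str) -> None:
    # One function per JSONL record; "code" and "quality" are illustrative field names.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            record["quality"] = quality_metrics(record["code"])
            dst.write(json.dumps(record) + "\n")
```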

Try the Research Yourself

The paper is published open access under CC BY 4.0:

Full paper: DOI: 10.1049/sfw2/4147669
Zenodo archive (PDF): zenodo.org/records/18269879

If you're building developer tools, working on code quality infrastructure, or just interested in where static analysis is heading, it's worth a read.

What's your experience with the gap between code quality detection and actual fixes? Do you trust automated suggestions, or do you prefer manual refactoring? Drop your thoughts below.

Top comments (2)

oleg kholin

The idea of “not just detecting problems, but actually suggesting better solutions” sounds compelling at first. But once you look closer, the notion of what counts as “better” turns out to be quite narrow and debatable. What is really happening here is not a true shift from analysis to meaningful improvement, but rather a more refined way of selecting similar pieces of code that look cleaner according to formal criteria. It doesn’t eliminate the gap between “what’s wrong” and “how to fix it” — it sidesteps it through analogy.
The most fragile assumption is the ability to reliably find functionally equivalent alternatives. In practice, the system almost inevitably relies on structural or textual similarity rather than genuine behavioral equivalence. This may work in simple cases, but as soon as the code becomes more complex, the risk increases that the suggested alternative behaves differently. This is especially true in the presence of hidden logic, side effects, or contextual dependencies. As a result, the “better” code may simply be different code.
The way quality is defined is also problematic. Metrics such as complexity, coupling, or readability create an impression of objectivity, but they are only indirect signals. Code that scores well on these metrics is not necessarily better designed or easier to understand. There is a real risk that developers start optimizing for the metrics themselves rather than for the actual needs of the system. In that case, what improves is the report, not the code.
The attempt to preserve syntactic similarity — so that suggestions do not feel too foreign — makes the tool more usable, but also more limited. In order to reduce cognitive friction, the system avoids substantial structural changes and stays within the boundaries of minor edits. As a result, it tends to “polish” the code rather than truly improve it.
Another issue lies in the reliance on open-source repositories as a source of “good” solutions. While they provide large volumes of code, the quality is highly uneven. They include outdated patterns, ad hoc workarounds, and simply poor implementations. At the same time, the context in which this code was written is largely lost. The system sees fragments and their frequency, but not the reasons behind them. In effect, it learns what is common, not what is actually better.
Because of this, refactoring is reduced to replacing one fragment with another. But real improvement in code involves working with structure and logic, not just selecting a cleaner-looking variant. What we see here is closer to pattern substitution than to genuine design thinking. This can be useful for small fixes, but it does not address deeper issues.
As a result, the scope of applicability is fairly limited. It can help with simple improvements, style normalization, or onboarding less experienced developers. But when it comes to complex logic or architecture, it stops providing meaningful guidance. The system does not understand the purpose of the code or its role within a larger system — it only operates on how the code looks.
In contrast to more generative approaches, this method is tightly constrained by its dataset. It does not create new solutions; it selects from existing ones. This makes it more predictable, but also less flexible. It avoids extreme errors, but at the cost of being unable to suggest anything fundamentally better.
In the end, everything happens at the surface level: code is compared, scored, and slightly improved according to formal signals. The deeper layer — meaning, structure, intent — remains outside its reach. And that is precisely where its limitations become most apparent.

Dimitris Kyrkos

Hey, thanks for the detailed critique – you raise fair points, especially around the limits of metric-based quality and the noise in open-source datasets. You're right that this isn't "design thinking" and won't restructure your architecture for you. The paper is pretty upfront about that scope, though: it's not trying to replace a senior engineer's judgment on complex refactoring; it's targeting the layer where you have a working function that could be measurably cleaner, and it gives you a concrete option instead of just a warning.

On functional equivalence being fragile: agreed, that's the hardest part, and the paper leans on CodeSearchNet's similarity techniques rather than claiming full behavioral verification. But I'd push back a bit on the "it just learns what's common" framing. The whole point of the quality-ranking step is to filter away from the average and surface the better-scoring alternatives, not the most frequent ones. Is it a solved problem? No. But as a complement to existing tooling – especially for junior devs staring at a SonarQube report with no idea what to do next – I think it's a meaningful step, not a replacement for deeper work.