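Connecting Grafana Incident Alerts to Stack-Trace-Based Source Code Fix Recommendations

When you build incident analysis on top of Grafana Alerting, logs and metrics alone get you fairly far at first.

Pass the LLM the alert time, the ERROR logs queried from Loki, and the metric trends queried from Prometheus, and it will produce reasonable root cause candidates and remediation directions.

But to actually fix an incident, you eventually have to look at the source code.

JVM applications in particular often leave a stack trace in the logs, and a stack trace already contains the class name, method name, file name, and line number.

So this time I added the following capabilities to the Grafana incident analysis flow in Spring-AI-Ops.

Extract JVM stack traces from Loki logs
Resolve stack trace frames to actual source files in the repository
Add only the source snippet around the failing line to the LLM prompt
Have the LLM return source code fix recommendations as JSON
Click a file path below the AI analysis result to compare the original code and the suggested fix in a popup

I did not send the whole repository to the LLM.
Instead, the implementation cuts out only the area around the files and lines that the incident log already points to.

1. The existing Grafana incident analysis flow

The existing flow looked roughly like this.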

Grafana Alert Webhook
        │
        ├─ Query Loki logs
        ├─ Query Prometheus metrics (optional)
        │
        ├─ LLM analysis request
        │    alert context
        │    + log section
        │    + metric section
        │
        └─ Save AnalyzeFiringRecord and push over WebSocket

ObservabilityFacade.analyzeFiring() already ran the Loki and Prometheus queries in parallel.

This change adds the source checkout as one more parallel task.

val logFuture = CompletableFuture.supplyAsync({ executeFindLog(request) }, executor)
val metricFuture = CompletableFuture.supplyAsync({ executeFindMetric(request) }, executor)
val checkoutFuture = CompletableFuture.supplyAsync({ getSourcePath(targetApplication) }, executor)

val logResults: LokiQueryResult = logFuture.get()
val metricResults: PrometheusQueryResult? = metricFuture.get()
val sourcePath: Path? = checkoutFuture.get()

While the logs and metrics are being queried, the Git repository is checked out as well.
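One thing the parallel lookup leaves open is what happens when the checkout itself throws or hangs, given that sourcePath is already nullable. The following is a minimal sketch, not the project's actual code: supplyOptional is a hypothetical helper, the timeout is an assumed parameter, and completeOnTimeout requires Java 9 or later. It turns an exception or a slow task into null so that the log and metric analysis can continue without source context.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.Executor
import java.util.concurrent.TimeUnit

// Hypothetical helper (not part of Spring-AI-Ops): runs an optional task and
// degrades to null when it throws or exceeds the timeout, instead of failing
// the caller that later invokes get().
fun <T : Any> supplyOptional(
    executor: Executor,
    timeoutSeconds: Long,
    task: () -> T?,
): CompletableFuture<T?> =
    CompletableFuture.supplyAsync<T?>({ task() }, executor)
        // Completes with null if the task has not finished in time.
        // This does not cancel the underlying task.
        .completeOnTimeout(null, timeoutSeconds, TimeUnit.SECONDS)
        // Maps any exception thrown by the task to null.
        .exceptionally { null }
```

With a helper like this, the checkout line could read supplyOptional(executor, 30) { getSourcePath(targetApplication) }, where 30 seconds is only an illustrative value. Whether a failed checkout should be silent or logged is a design choice the sketch does not settle.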

The checkout target is the Git URL and deploy branch registered in the application settings.
For a GitHub or GitLab URL the stored token is used, and if there is no token, null is passed instead. In other words, the token is optional.

private fun resolveAccessToken(gitUrl: String): String? {
    val lower = gitUrl.lowercase()
    return when {
        lower.contains("github") -> githubService.getToken()
        lower.contains("gitlab") -> gitlabService.getToken()
        else -> null
    }
}

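2. Extracting only the raw log text from Loki logs

The existing LokiQueryResult.createLogSectionPrompt() builds a prompt string to show to the LLM.

For the stack trace parser, however, plain log bodies are easier to handle than a string mixed with UI/prompt decoration such as timestamps and stream labels.

So I added a method to LokiQueryResult that extracts the raw log text.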

fun rawLogText(): String {
    return data?.result
        ?.asSequence()
        ?.flatMap { stream -> stream.values.asSequence() }
        ?.mapNotNull { entry -> entry.getOrNull(1) }
        ?.joinToString(System.lineSeparator())
        ?: ""
}

Loki's values usually look like this.

[
  ["1710000000000000000", "actual log line"]
]

Only the second element of each entry is collected and handed to the parser.
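To make the extraction concrete, here is a small self-contained sketch. LokiStream and LokiData are hypothetical stand-ins for the real response DTOs, which have more fields, and the sketch joins lines with "\n" instead of System.lineSeparator() only to keep the result platform-independent.

```kotlin
// Hypothetical minimal stand-ins for the Loki response DTOs.
data class LokiStream(val values: List<List<String>>)
data class LokiData(val result: List<LokiStream>)

data class LokiQueryResult(val data: LokiData?) {
    // Collects the second element ([timestamp, line]) of every entry.
    fun rawLogText(): String =
        data?.result
            ?.asSequence()
            ?.flatMap { stream -> stream.values.asSequence() }
            ?.mapNotNull { entry -> entry.getOrNull(1) }
            ?.joinToString("\n")
            ?: ""
}

val sample = LokiQueryResult(
    LokiData(
        listOf(
            LokiStream(
                listOf(
                    listOf("1710000000000000000", "java.lang.IllegalStateException: boom"),
                    listOf("1710000000000000001", "\tat com.example.service.FooService.doWork(FooService.kt:42)"),
                    listOf("1710000000000000002"), // malformed entry without a log line, skipped
                ),
            ),
        ),
    ),
)
```

Calling sample.rawLogText() yields the exception line followed by the frame line, with the malformed entry dropped. Because the parser scans the joined text, it should not matter whether a stack trace arrives as one multi-line entry or as one entry per line.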

3. Implementing the JVM stack trace parser

The target JVM frame format is as follows.

at com.example.service.FooService.doWork(FooService.kt:42)
at com.example.controller.FooController.handle(FooController.java:19)

The regular expression is built like this.

private val frameRegex =
    Regex("""\s*at\s+([\w.$]+)\.([\w$<>]+)\(([^():]+)(?::(\d+))?\)""")
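As a quick check of what the four capture groups hold, the sketch below applies the same pattern to a few sample lines. describeFrame is a hypothetical helper written only for this illustration.

```kotlin
// Same pattern as in the article.
val frameRegex =
    Regex("""\s*at\s+([\w.$]+)\.([\w$<>]+)\(([^():]+)(?::(\d+))?\)""")

// Hypothetical helper: renders the captured groups of the first frame in a line,
// or returns null when the line contains no frame.
fun describeFrame(line: String): String? {
    val groups = frameRegex.find(line)?.groupValues ?: return null
    // Group 4 is an empty string when the optional ":line" part is absent.
    val lineNumber = groups[4].ifEmpty { "?" }
    return "${groups[1]}#${groups[2]} (${groups[3]}:$lineNumber)"
}
```

For the FooService.kt frame above this returns com.example.service.FooService#doWork (FooService.kt:42). Constructor frames such as <init> match because the method group allows angle brackets. Frames printed with a module prefix, for example java.base/java.lang.Thread.run(Thread.java:833), do not match, since "/" is not part of the class name group; those are JDK frames, so skipping them is likely acceptable here.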

Parse results are held in a StackTraceFrame DTO.

data class StackTraceFrame(
    val className: String,
    val methodName: String,
    val fileName: String?,
    val lineNumber: Int?,
) {
    val packageName: String
        get() = className.substringBeforeLast('.', "")

    val simpleClassName: String
        get() = className.substringAfterLast('.').substringBefore('$')
}

The conversion from MatchResult to StackTraceFrame is split out into a public extension function.

fun MatchResult.toStackTraceFrame(): StackTraceFrame {
    return StackTraceFrame(
        className = groupValues[1],
        methodName = groupValues[2],
        fileName = groupValues[3].takeIf { it != "Unknown Source" && it != "Native Method" },
        lineNumber = groupValues.getOrNull(4)?.takeIf { it.isNotBlank() }?.toIntOrNull(),
    )
}

The parser itself is a Spring bean.

@Component
class StackTraceParser {
    fun parse(logText: String, maxFrameCount: Int = 5): List<StackTraceFrame> {
        if (logText.isBlank() || maxFrameCount <= 0) {
            return emptyList()
        }

        return frameRegex.findAll(logText)
            .map { matchResult -> matchResult.toStackTraceFrame() }
            .filterNot { frame -> isExternalLibraryFrame(frame) }
            .distinctBy { frame -> "${frame.className}.${frame.methodName}:${frame.fileName}:${frame.lineNumber}" }
            .take(maxFrameCount)
            .toList()
    }
}

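At first I considered adding a separate basePackage setting, but left it out of scope.

Managing a package prefix for every application can become an operational burden.
Instead, I started by excluding frames with clearly external library prefixes (java., kotlin., org.springframework., reactor., io.netty., and so on).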
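The parser above calls isExternalLibraryFrame, whose body is not shown. A minimal sketch of how such a check could look follows; the prefix list is an assumption based on the prefixes named in this post, and the project's actual list may be longer.

```kotlin
data class StackTraceFrame(
    val className: String,
    val methodName: String,
    val fileName: String?,
    val lineNumber: Int?,
)

// Assumed prefix list, taken from the prefixes mentioned in the article.
val externalLibraryPrefixes = listOf(
    "java.", "javax.", "jakarta.", "kotlin.",
    "org.springframework.", "reactor.", "io.netty.",
)

// A frame counts as external when its fully qualified class name
// starts with one of the known library prefixes.
fun isExternalLibraryFrame(frame: StackTraceFrame): Boolean =
    externalLibraryPrefixes.any { prefix -> frame.className.startsWith(prefix) }
```

Keeping the trailing dot in each prefix matters: "kotlin." excludes kotlin.collections but not a hypothetical application package such as kotlinapp.service.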

4. Source file resolution

Next, each stack trace frame has to be linked to an actual source file.

For example, given the frame below,

com.example.service.FooService.doWork(FooService.kt:42)

the following direct path candidates are checked first.

src/main/kotlin/com/example/service/FooService.kt
src/main/java/com/example/service/FooService.java
src/test/kotlin/com/example/service/FooService.kt
src/test/java/com/example/service/FooService.java

Real projects, however, are often multi-module.

service-a/src/main/kotlin/...
domain/src/main/java/...

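So when no direct path matches, the whole repository is scanned for files whose name matches fileName, and a file whose package declaration at the top matches the stack trace package is preferred.

This logic is implemented as extension functions in PathExtensions.kt.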

fun Path.resolveSourceFile(frame: StackTraceFrame): Path? {
    val directMatch = directSourceCandidates(frame).firstOrNull {
        it.exists() && it.isRegularFile()
    }
    if (directMatch != null) {
        return directMatch
    }

    val fallbackFileNames = frame.candidateFileNames()
    if (fallbackFileNames.isEmpty()) {
        return null
    }

    val matchedFiles = Files.walk(this).use { stream ->
        stream
            .filter { path -> path.isRegularFile() }
            .filter { path -> path.extension.lowercase() in sourceExtensions }
            .filter { path -> path.name in fallbackFileNames }
            .sorted()
            .toList()
    }

    if (matchedFiles.isEmpty()) {
        return null
    }

    return matchedFiles.firstOrNull { path -> path.hasPackageDeclaration(frame.packageName) }
        ?: matchedFiles.first()
}

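Candidate file names are the frame's fileName (when it ends in .kt or .java) plus .kt and .java names derived from simpleClassName, so resolution still works when fileName is missing.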

private fun StackTraceFrame.candidateFileNames(): List<String> {
    return buildList {
        fileName?.takeIf { it.endsWith(".kt") || it.endsWith(".java") }?.let { add(it) }
        add("${simpleClassName}.kt")
        add("${simpleClassName}.java")
    }.distinct()
}
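resolveSourceFile relies on two helpers that are not shown: one that builds the direct path candidates and one that checks the package declaration. The following is a rough sketch of how they could work. The source roots, the name directCandidates, and both bodies are assumptions for illustration, not the contents of PathExtensions.kt.

```kotlin
import java.nio.file.Path
import kotlin.io.path.useLines

// Assumed source roots, matching the direct path examples in the article.
val sourceRoots = listOf("src/main/kotlin", "src/main/java", "src/test/kotlin", "src/test/java")

// Hypothetical helper: combines every source root with the package directory
// and each candidate file name.
fun Path.directCandidates(packageName: String, fileNames: List<String>): List<Path> {
    val packagePath = packageName.replace('.', '/')
    return sourceRoots.flatMap { root ->
        fileNames.map { fileName -> resolve(root).resolve(packagePath).resolve(fileName) }
    }
}

// Sketch of a package check: true when the first "package ..." line in the file
// declares exactly packageName (a trailing Java semicolon is ignored).
fun Path.hasPackageDeclaration(packageName: String): Boolean =
    useLines { lines ->
        lines.map { it.trim() }
            .firstOrNull { it.startsWith("package ") }
            ?.removePrefix("package ")
            ?.trim()
            ?.removeSuffix(";")
            ?.trim() == packageName
    }
```

Unlike the four example paths above, this sketch tries every file name under every root, which also covers Kotlin files placed under src/main/java. The package check reads lines lazily and stops at the first declaration, so large files are not read in full.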

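5. Extracting the source snippet

I avoided sending whole files to the LLM.

When lineNumber is present, only up to 40 lines on each side of that line are kept.
When lineNumber is missing or falls outside the file, only the first 80 lines of the file are passed.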

fun Path.extractSourceSnippet(
    repositoryRoot: Path,
    frame: StackTraceFrame,
    radius: Int = 40,
): SourceSnippet {
    val lines = readLines()
    val focusLine = frame.lineNumber?.takeIf { it in 1..lines.size }
    val startLine = if (focusLine != null) {
        (focusLine - radius).coerceAtLeast(1)
    } else {
        1
    }
    val endLine = if (focusLine != null) {
        (focusLine + radius).coerceAtMost(lines.size)
    } else {
        80.coerceAtMost(lines.size)
    }

    val content = (startLine..endLine).joinToString(System.lineSeparator()) { lineNumber ->
        val marker = if (lineNumber == focusLine) ">>" else "  "
        "$marker ${lineNumber.toString().padStart(4)} | ${lines[lineNumber - 1]}"
    }

    return SourceSnippet(
        filePath = repositoryRoot.relativize(this).invariantSeparatorsPathString,
        startLine = startLine,
        endLine = endLine,
        focusLine = focusLine,
        content = content,
    )
}
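The window arithmetic is easier to check in isolation. The sketch below mirrors the start/end computation of extractSourceSnippet as a pure function; snippetWindow and fallbackLines are names introduced here for illustration only.

```kotlin
// Mirrors the startLine/endLine logic above: a window of `radius` lines on each
// side of the focus line, or the first `fallbackLines` lines when there is no
// usable line number. Line numbers are 1-based and inclusive.
fun snippetWindow(
    focusLine: Int?,
    totalLines: Int,
    radius: Int = 40,
    fallbackLines: Int = 80,
): IntRange {
    val focus = focusLine?.takeIf { it in 1..totalLines }
    return if (focus != null) {
        (focus - radius).coerceAtLeast(1)..(focus + radius).coerceAtMost(totalLines)
    } else {
        1..fallbackLines.coerceAtMost(totalLines)
    }
}
```

For a 200-line file and focus line 42 the window is 2..82, so a full window holds 81 lines. Near the top or bottom of a file the window is clamped rather than shifted, so it can be shorter than that.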

The snippet goes into the prompt in the following form.

File: src/main/kotlin/com/example/service/FooService.kt
Lines: 2-82
Focus line: 42

     40 | val response = connector.call(...)
     41 | ...
>>   42 | return response.data.result.flatMap { ... }
     43 | ...
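The SourceSnippet type and its createSourceSnippetPrompt() method are used in this post but not listed. A minimal sketch that would produce the layout above might look like this; the field names follow the constructor call in extractSourceSnippet, while the rendering itself is an assumption.

```kotlin
data class SourceSnippet(
    val filePath: String,
    val startLine: Int,
    val endLine: Int,
    val focusLine: Int?,
    val content: String,
) {
    // Assumed rendering, modeled on the prompt excerpt in the article:
    // a short header, a blank line, then the numbered source lines.
    fun createSourceSnippetPrompt(): String = buildString {
        appendLine("File: $filePath")
        appendLine("Lines: $startLine-$endLine")
        if (focusLine != null) {
            appendLine("Focus line: $focusLine")
        }
        appendLine()
        appendLine(content)
    }
}
```

Omitting the "Focus line" header when there is no line number keeps the prompt from suggesting a precise location that the stack trace did not actually provide.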

6. IncidentSourceContextService

Now the pieces built so far are combined in a single service.

LokiQueryResult.rawLogText()
        │
        ▼
StackTraceParser.parse()
        │
        ▼
repositoryRoot.resolveSourceFile(frame)
        │
        ▼
sourceFile.extractSourceSnippet(repositoryRoot, frame)
        │
        ▼
IncidentSourceContext

The service code is organized as follows.

@Service
class IncidentSourceContextService(
    private val stackTraceParser: StackTraceParser,
) {
    fun createContext(logResults: LokiQueryResult, repositoryRoot: Path?): IncidentSourceContext {
        val frames = stackTraceParser.parse(logResults.rawLogText())
        if (repositoryRoot == null || frames.isEmpty()) {
            return IncidentSourceContext(
                frames = frames,
                snippets = emptyList(),
                unresolvedFrames = frames,
            )
        }

        val snippets = mutableListOf<SourceSnippet>()
        val unresolvedFrames = mutableListOf<StackTraceFrame>()
        val resolvedSnippetKeys = mutableSetOf<String>()

        frames.forEach { frame ->
            val sourceFile = repositoryRoot.resolveSourceFile(frame)
            if (sourceFile == null) {
                unresolvedFrames.add(frame)
                return@forEach
            }

            val snippet = sourceFile.extractSourceSnippet(repositoryRoot, frame)
            val snippetKey = "${snippet.filePath}:${snippet.startLine}:${snippet.endLine}:${snippet.focusLine}"
            if (resolvedSnippetKeys.add(snippetKey)) {
                snippets.add(snippet)
            }
        }

        return IncidentSourceContext(
            frames = frames,
            snippets = snippets,
            unresolvedFrames = unresolvedFrames,
        )
    }
}

Building the source section of the prompt is delegated to the DTO rather than the service.

data class IncidentSourceContext(
    val frames: List<StackTraceFrame>,
    val snippets: List<SourceSnippet>,
    val unresolvedFrames: List<StackTraceFrame>,
) {
    fun createSourceSectionPrompt(): String {
        if (snippets.isEmpty()) {
            return ""
        }

        return buildString {
            appendLine("## Related source snippets")
            appendLine("The following snippets were selected from JVM stack trace frames. Treat them as focused source context, not as the full repository.")
            appendLine()
            snippets.forEach { snippet ->
                appendLine(snippet.createSourceSnippetPrompt())
            }
        }
    }
}

7. Extending the LLM prompt

The existing incident analysis prompt was centered on alerts, logs, and metrics.

A source section was added to it.

fun executeAnalyzeFiring(
    alertSection: String,
    logSection: String,
    metricSection: String = "",
    sourceSection: String = "",
): String

When a source section is present, it is appended to the prompt.

if (sourceSection.isNotBlank()) {
    append(sourceSection)
    appendLine()
}

The response format was tightened as well.

The LLM is asked to write the markdown report first, and then to put the source code suggestion JSON between delimiters at the bottom.

---SOURCE_CODE_SUGGESTIONS_JSON_START---
[
  {
    "filePath": "relative/path/to/File.kt",
    "originalCode": "exact original code lines",
    "suggestionCode": "replacement code lines",
    "description": "Why this change helps",
    "lineNumber": 42
  }
]
---SOURCE_CODE_SUGGESTIONS_JSON_END---

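The reason is that it is hard to build a UI from markdown alone.

The analysis report is shown as readable markdown, while the code fix recommendations need to be stored separately as structured data so that the UI can handle them safely.

8. SourceCodeSuggestion record

The fix recommendation is modeled as a Java record.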

public record SourceCodeSuggestion(
    String filePath,
    String originalCode,
    String suggestionCode,
    String description,
    Integer lineNumber
) { }

The following field was added to AnalyzeFiringRecord.

List<SourceCodeSuggestion> sourceCodeSuggestions

ObservabilityFacade splits the LLM response into markdown and JSON.

private fun parseAnalyzeFiringResponse(raw: String): Pair<String, List<SourceCodeSuggestion>> {
    val startIdx = raw.indexOf(SOURCE_CODE_SUGGESTIONS_START)
    if (startIdx == -1) {
        return Pair(raw.trim(), emptyList())
    }

    val markdown = raw.substring(0, startIdx).trim()
    val afterStart = raw.substring(startIdx + SOURCE_CODE_SUGGESTIONS_START.length)
    val endIdx = afterStart.indexOf(SOURCE_CODE_SUGGESTIONS_END)
    val jsonText = (if (endIdx == -1) afterStart else afterStart.substring(0, endIdx)).trim()
    val sanitized = codeAnalysisResultHandler.sanitizeControlChars(jsonText)

    val suggestions = runCatching {
        codeAnalysisResultHandler.parseJsonArray(sanitized, SourceCodeSuggestion::class.java)
    }.getOrElse {
        log.warn("Failed to parse source code suggestions JSON — attempting recovery: {}", it.message)
        codeAnalysisResultHandler.recoverIssuesFromJson(sanitized, SourceCodeSuggestion::class.java)
    }
    return Pair(markdown, suggestions)
}

The LLM may not always return perfectly valid JSON, so I reused the lenient JSON handler already used by the existing code risk analysis.
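One malformed shape that models commonly produce is a Markdown code fence wrapped around the JSON inside the delimiters. Whether the existing handler already covers that case is not shown here, so the following is only a sketch of a possible pre-processing step. splitReportAndJson is a hypothetical function, and the delimiter values are copied from the response format above.

```kotlin
val SOURCE_CODE_SUGGESTIONS_START = "---SOURCE_CODE_SUGGESTIONS_JSON_START---"
val SOURCE_CODE_SUGGESTIONS_END = "---SOURCE_CODE_SUGGESTIONS_JSON_END---"

// Three backticks, built at runtime so this example contains no literal fence.
val codeFence = "`".repeat(3)

// Hypothetical pre-processing step: returns the markdown report and the raw
// JSON text, removing a surrounding Markdown code fence if the model added one.
fun splitReportAndJson(raw: String): Pair<String, String> {
    val startIdx = raw.indexOf(SOURCE_CODE_SUGGESTIONS_START)
    if (startIdx == -1) return raw.trim() to ""

    val afterStart = raw.substring(startIdx + SOURCE_CODE_SUGGESTIONS_START.length)
    // substringBefore returns the whole string when the end delimiter is missing.
    val payload = afterStart.substringBefore(SOURCE_CODE_SUGGESTIONS_END).trim()
    val json = payload
        .removePrefix(codeFence + "json")
        .removePrefix(codeFence)
        .removeSuffix(codeFence)
        .trim()
    return raw.substring(0, startIdx).trim() to json
}
```

The JSON text returned here would still go through the regular parsing and recovery path; the sketch only removes one predictable wrapper before that happens.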

9. UI: adding Source Code Suggestions below AI Analysis

In the UI, the source code suggestion list is attached at the very bottom of the existing AI Analysis area.

function renderAnalysisLayers(appName, record) {
    ...
    return `
        ...
        <div class="analysis-layer">
            <div class="layer-header">AI Analysis<span class="layer-header-disclaimer">* AI-generated results may not always be accurate.</span></div>
            <div class="analysis-text markdown-body">${renderMarkdown(record.analyzeResults)}</div>
            ${renderSourceCodeSuggestions(record.sourceCodeSuggestions)}
        </div>`;
}

Each file path is rendered as a clickable button.

function renderSourceCodeSuggestions(suggestions) {
    const items = Array.isArray(suggestions) ? suggestions.filter(Boolean) : [];
    if (items.length === 0) return '';

    return `<div class="source-suggestion-list">
        <div class="source-suggestion-list-title">Source Code Suggestions</div>
        ${items.map((suggestion, idx) => {
            const filePath = suggestion.filePath || 'Unknown file';
            const line = suggestion.lineNumber == null ? '' : `:${suggestion.lineNumber}`;
            const description = suggestion.description ? `<div class="source-suggestion-summary">${escHtml(suggestion.description)}</div>` : '';
            return `<div class="source-suggestion-item">
                <button class="source-suggestion-link" data-suggestion-idx="${idx}" title="${escHtml(filePath)}">
                    ${escHtml(filePath)}${escHtml(line)}
                </button>
                ${description}
            </div>`;
        }).join('')}
    </div>`;
}

Clicking a file path opens a popup.

The left side shows originalCode and the right side shows suggestionCode.
The Copy button copies only the suggested code.

async function copySourceSuggestionCode() {
    const code = activeSourceSuggestion?.suggestionCode || '';
    if (!code) return;

    const copyBtn = document.getElementById('source-suggestion-copy-btn');
    try {
        await writeClipboardText(code);
        copyBtn.textContent = 'Copied';
    } catch (e) {
        copyBtn.textContent = 'Copy failed';
    }
}

Some environments block the browser Clipboard API, so a textarea + execCommand('copy') fallback is included as well.

10. Tests

Because this work mixes pure functions with orchestration, the tests are split into several layers.

StackTraceParserTest

Parsing regular JVM frames
Handling frames without a line number
Excluding external library frames
Deduplication
Top 5 limit
Handling Unknown Source and Native Method

PathExtensionsTest

direct path resolve
multi-module fallback resolve
package declaration matching
simpleClassName fallback when fileName is missing
focus line snippet
no-line top section snippet

IncidentSourceContextServiceTest

Source snippet creation from Loki logs + a temp repository
Unresolved handling when there is no repository root
Handling frames whose source file cannot be found
Duplicate snippet removal

ObservabilityFacadeTest

The checkout result is passed on to the source context service
A GitHub URL passes the GitHub token to clone
A GitLab URL passes the GitLab token to clone
The source code suggestion JSON in the LLM response is mapped to AnalyzeFiringRecord

UI check

JS syntax was checked with the following command.

node --check src/main/resources/static/js/aiops.js

The related tests were run as well.

./gradlew test --tests 'com.walter.spring.ai.ops.facade.ObservabilityFacadeTest'
./gradlew test --tests 'com.walter.spring.ai.ops.service.IncidentSourceContextServiceTest'
./gradlew test --tests 'com.walter.spring.ai.ops.util.PathExtensionsTest'

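11. Decisions made during implementation

basePackage is not managed as a setting

At first I considered keeping a basePackage per application as a setting.

But that creates the operational burden of keeping each application's package prefix accurate.
It can also become confusing with multi-module or legacy package structures.

So the first implementation only excludes clearly external library prefixes.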

java.
javax.
jakarta.
kotlin.
org.springframework.
reactor.
io.netty.
...

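If needed later, it would be better to improve this through repository analysis or package candidate inference rather than application settings.

The whole repository is not sent to the LLM

The purpose of this feature is incident analysis, not static analysis.
The incident log already provides candidate locations, so sending the whole repository would only waste tokens and could blur the result.
That is why only the files and nearby lines that the stack trace points to are sent.

The exact deployed commit is not tracked yet

For now, the latest code on the configured deploy branch is checked out.
In real operations, the line numbers in a stack trace refer to the build that was running, so the commit SHA deployed at the time of the incident is needed for the snippet to line up accurately.

12. Remaining improvements

This implementation is close to an MVP.
To bring it up to production level, the following remain.

Checking out the exact deployed commit
- Currently the deploy branch is used.
- A way to find and check out the commit SHA at the time of the incident is needed.
Handling incidents without a stack trace
- Timeout, latency, and saturation incidents may have no stack trace.
- Logger names, HTTP paths, controller mappings, trace spans, and similar signals can serve as fallbacks.
Expanding to surrounding caller/callee files
- The top frame alone may not be enough to identify the cause.
- One option is to add a limited number of caller/callee candidate files from the same package.
LLM JSON stability
- The delimiter + JSON approach is practical but not perfect.
- A model/API that allows tighter control over the response format could improve stability.

13. Wrapping up

In incident analysis, logs and metrics tell you “what happened”.

The actual fix, however, happens in the source code.

This work uses the JVM stack traces already present in Loki logs to connect incident analysis and source fix suggestions into a single flow.

Exact deployed commit tracking and a fallback for missing stack traces are still open, but at least the following flow is now possible.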

Grafana Alert
  → Loki Logs
  → Stack Trace
  → Source File
  → Source Snippet
  → LLM Root Cause
  → Source Code Suggestion
  → UI Popup Compare

With this, AI-based incident analysis has started to move from simply “summarizing logs” toward something more practical: “proposing fix candidates”.