Kagent: Bringing AI Agents to Kubernetes!

Somaz 2026. 6. 18. 00:00

Overview

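When something breaks in a Kubernetes cluster, how do we usually respond?

1. Check the Slack alert
2. Check pod status with kubectl
3. Read the logs
4. Check Prometheus metrics
5. Copy and paste the error message into ChatGPT
6. Try what ChatGPT suggests
7. Hit another error
8. Go back to ChatGPT...
9. Repeat... 😫

Does this loop sound familiar? We get help from AI, but we still have to act as the bridge between the AI and the infrastructure. The AI cannot see our cluster, and we have to run everything it suggests by hand.

Kagent addresses this problem. The AI agent runs inside the Kubernetes cluster itself, where it diagnoses problems, plans a fix, and takes action on its own. We no longer have to be the go-between.

This post covers what Kagent is, how it works, and how to install and use it in practice.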

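What is Kagent?

One-line summary

"A framework for autonomous AI agents that run inside a Kubernetes cluster"

Basic information

Developer: Solo.io
Announced: March 17, 2025
License: Apache 2.0 (open source)
Status: CNCF Sandbox project
Built on: Microsoft AutoGen framework
GitHub: https://github.com/kagent-dev/kagent

Why was it built?

It grew out of Solo.io's work solving problems for its customers. Troubleshooting, configuration management, and deployment automation in complex Kubernetes environments required too much manual work, and Kagent was developed to automate that work with AI.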

Kagent's core architecture

Kagent consists of three layers.

1. Tools (tool layer)

These are MCP (Model Context Protocol) style functions that an AI agent can call.

Built-in tools

- Kubernetes Tools
  ├── GetResources (list resources)
  ├── DescribeResource (show details)
  ├── GetPodLogs (fetch logs)
  ├── GetEvents (check events)
  ├── ApplyManifest (apply a resource)
  └── CreateResource (create a resource)

- Helm Tools
  ├── GetRepositories (list repositories)
  ├── GetCharts (list charts)
  ├── InstallChart (install a chart)
  └── UpgradeRelease (upgrade a release)

- Prometheus Tools
  ├── QueryMetrics (query metrics)
  └── GetAlerts (list alerts)

- Argo Tools
  ├── GetApplications (list applications)
  ├── SyncApplication (sync an application)
  └── GetRollouts (list rollouts)

- Istio Tools
  ├── GetVirtualServices
  ├── GetDestinationRules
  └── GetGateways

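Custom tools can be added

HTTP API calls
Database queries
Slack message delivery
Integration with external systems

2. Agents (agent layer)

Agents are AI systems that operate autonomously. They are not simple chatbots: an agent makes a plan, executes it, and analyzes the results to decide its next action.

Components of an Agent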

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: kubernetes-expert
spec:
  # System prompt (defines the Agent's role)
  systemPrompt: |
    You are a Kubernetes expert specializing in troubleshooting.
    You help diagnose issues, analyze logs, and suggest fixes.

  # LLM settings
  modelConfig:
    name: gpt-4
    provider: openai

  # Tools the Agent can use
  tools:
    - name: kubectl-tools
    - name: prometheus-tools

  # Can collaborate with other Agents
  agents:
    - name: helm-agent

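Agent characteristics

Understands natural-language instructions
Plans multi-step tasks
Analyzes results and adapts
Collaborates with other Agents (A2A, Agent-to-Agent)

3. Framework (framework layer)

This is the interface for managing and running Agents.

Three ways to manage Agents

a) CLI (Command Line Interface)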

# List Agents
kagent list agents

# Run an Agent
kagent run kubernetes-expert "Check if all pods are running"

# Check the results
kagent logs kubernetes-expert

b) Web UI Dashboard

Manage Agents from the browser
Watch execution results in real time
Create, edit, and delete Agents

c) Declarative YAML

# agent-config.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: my-custom-agent
spec:
  # ...

What Kagent can do

1. Automated troubleshooting

Scenario: a pod is in CrashLoopBackOff

The traditional way

1. kubectl get pods -n production
2. kubectl describe pod <pod-name>
3. kubectl logs <pod-name>
4. Check Prometheus metrics
5. Analyze the cause
6. Search the web for a fix
7. Apply the fix
8. Verify

With Kagent

User: "There is an outage in the production namespace. Please take a look."

Kagent Agent:
1. [auto] Check pod status in the production namespace
2. [auto] Find a pod in CrashLoopBackOff: payment-api
3. [auto] Analyze the pod logs
   → "Error: Cannot connect to database: connection refused"
4. [auto] Check Service status
   → database-service is healthy
5. [auto] Check NetworkPolicy
   → Traffic from payment-api to database is blocked!
6. [proposal] Present a NetworkPolicy fix
7. [run after approval] Apply the NetworkPolicy
8. [auto] Confirm the pod is running normally ✓

Total time: 2 minutes (down from 30 minutes)
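
The first two automated steps (list pods, spot the unhealthy one) can be approximated by hand. The following is a minimal sketch, not Kagent's own implementation: `unhealthy_pods` is a hypothetical helper name, and the sample text is made up so the snippet runs without a cluster.

```shell
# Print the names of pods whose STATUS is neither Running nor Completed.
# Input: the text output of `kubectl get pods -n <namespace>` on stdin,
# header line included (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Made-up sample output, hard-coded so the sketch runs without a cluster.
sample='NAME                          READY   STATUS             RESTARTS   AGE
payment-api-5d9c7b6f4-abcde   0/1     CrashLoopBackOff   7          12m
redis-6c7f8d9b8-9pm3n         1/1     Running            1          3h'

printf '%s\n' "$sample" | unhealthy_pods
```

Against a real cluster the same filter would be used as `kubectl get pods -n production | unhealthy_pods`.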

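2. Automated canary deployment

User: "Please do a canary deployment of payment-api v2. Start at 10% of traffic
      and increase it step by step as long as the error rate stays below 1%."

Kagent Agent:
1. [plan] Define the canary strategy
   - 10% → 25% → 50% → 100%
   - Wait 5 minutes at each step
   - Check that the error rate is < 1%

2. [execute] Create an Argo Rollout
   - Deploy payment-api v2 (10% of traffic)

3. [monitor] Check Prometheus metrics
   - Error rate: 0.3% ✓
   - Response time: 120ms average ✓

4. [proceed] Increase to 25%

5. [monitor] Continue...

6. [done] 100% rollout complete
   - Total time: 20 minutes
   - Deployed safely with no errors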
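
The promotion rule in this plan is simple enough to write down. Here is a rough sketch of that decision logic, assuming the 10 → 25 → 50 → 100 steps and the 1% threshold from the request above; `next_canary_weight` is a hypothetical helper, and the "rollback" outcome is an assumption since the scenario only shows the success path.

```shell
# next_canary_weight CURRENT_WEIGHT ERROR_RATE_PERCENT
# Prints the next traffic weight, or "rollback" when the error rate
# is at or above the 1% threshold.
next_canary_weight() (
  current=$1
  error_rate=$2
  # awk does the floating-point comparison; exit status 0 means "healthy".
  if ! awk -v e="$error_rate" 'BEGIN { exit !(e < 1.0) }'; then
    echo rollback
    exit 0
  fi
  case $current in
    10) echo 25 ;;
    25) echo 50 ;;
    50|100) echo 100 ;;
    *) echo 10 ;;   # any other value: start at the first step
  esac
)

next_canary_weight 10 0.3   # prints 25
next_canary_weight 25 2.5   # prints rollback
```

The function body runs in a subshell, so its variables do not leak into the calling shell.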

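3. Applying Zero Trust security policies

User: "Please apply a Zero Trust policy to the production namespace."

Kagent Agent:
1. [analyze] Map the current service-to-service traffic
   - payment-api → database
   - payment-api → redis
   - api-gateway → payment-api

2. [generate] Create NetworkPolicies automatically
   - Allow only the required traffic
   - Deny everything else by default

3. [generate] Create Istio AuthorizationPolicies
   - Enforce mTLS
   - Apply JWT authentication

4. [test] Connectivity test
   - Confirm all services can communicate normally

5. [apply] Roll out the policies

6. [report] Generate a Zero Trust rollout report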
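
The "deny everything by default" part of step 2 corresponds to a standard Kubernetes NetworkPolicy with an empty pod selector. As a small sketch (the policy name `default-deny-all` and the helper `default_deny_policy` are illustrative choices, not something Kagent is confirmed to generate), the manifest can be produced like this:

```shell
# default_deny_policy NAMESPACE
# Prints a NetworkPolicy that selects every pod in the namespace and
# allows no ingress or egress until further policies open specific paths.
default_deny_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}

default_deny_policy production
```

To apply it, pipe the output into `kubectl apply -f -`. Allow rules for paths such as payment-api → database would then be added as separate policies.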

Hands-on: installing and using Kagent

Before you start

Requirements

Kubernetes cluster (Minikube, Kind, EKS, etc.)
Helm 3.x
kubectl
LLM API key (OpenAI, Anthropic, Gemini, etc.)
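
Before going further it helps to confirm the command-line tools are actually installed. A minimal pre-flight sketch, where `missing_tools` is a hypothetical helper name:

```shell
# missing_tools TOOL...
# Prints the name of each tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Anything printed here still needs to be installed.
missing_tools kubectl helm minikube
```

No output means all three tools were found.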

My environment

# Using Minikube
minikube version
# minikube version: v1.33.0

kubectl version
# Client Version: v1.30.0
# Server Version: v1.30.0

helm version --short
# v3.15.0

Step 1: Start the Minikube cluster

# Start Minikube
minikube start --cpus=4 --memory=8192

# Check cluster status
kubectl get nodes

Step 2: Install Kagent

Install with Helm

# Add the Kagent Helm repository
helm repo add kagent https://kagent-dev.github.io/kagent
helm repo update

# Create the namespace
kubectl create namespace kagent-system

# Create a secret for the OpenAI API key
kubectl create secret generic openai-secret \
  --from-literal=apiKey=YOUR_OPENAI_API_KEY \
  -n kagent-system

# Install Kagent
helm install kagent kagent/kagent \
  --namespace kagent-system \
  --set controller.image.tag=v0.6.6 \
  --set modelConfig.provider=openai \
  --set modelConfig.secretName=openai-secret

See the linked documentation for details.

Verify the installation

# Check the Kagent pods
kubectl get pods -n kagent-system

# Example output:
# NAME                                READY   STATUS    RESTARTS   AGE
# kagent-controller-7d8f9c5d4-xk2m9   1/1     Running   0          2m
# kagent-ui-5f9b8d7c6-p4n8k          1/1     Running   0          2m
# kagent-engine-6c8d7b5f4-q3m7n      1/1     Running   0          2m

Step 3: Open the Web UI

# Access the UI via port-forward
kubectl port-forward -n kagent-system svc/kagent-ui 8080:80

# Open in a browser
# http://localhost:8080

What you can do in the UI

Browse the built-in Agents
Create new Agents
Run Agents and view results
Manage Tools

Step 4: Run your first Agent - Kubernetes Expert

1. Create the Agent (YAML)

# k8s-expert-agent.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: k8s-expert
  namespace: kagent-system
spec:
  description: "Kubernetes troubleshooting expert"
  
  systemPrompt: |
    You are KubeExpert, a Kubernetes specialist.
    Your role is to help diagnose and resolve Kubernetes issues.
    
    When analyzing problems:
    1. Check pod status first
    2. Review logs for errors
    3. Verify service connectivity
    4. Check resource constraints
    5. Suggest actionable fixes
    
    Always explain your reasoning and provide clear solutions.
  
  modelConfig:
    name: gpt-4
    provider: openai
    temperature: 0.3  # lower temperature = more consistent answers
  
  tools:
    - name: kubectl
      type: kubernetes
      permissions:
        - get
        - list
        - describe
      resources:
        - pods
        - services
        - deployments
        - events
    
    - name: logs
      type: kubernetes
      permissions:
        - logs

Deploy the Agent

kubectl apply -f k8s-expert-agent.yaml

# Check the Agent
kubectl get agents -n kagent-system

2. Ask the Agent a question (CLI)

# Install the Kagent CLI
curl -sL https://kagent.dev/install.sh | bash

# Chat with the Agent
kagent chat k8s-expert

# Type a question at the prompt:
> Show me all pods in default namespace

Agent response:
[Thinking] I'll check the pods in the default namespace...
[Action] Running: kubectl get pods -n default
[Result] 
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7d8f5c8b9-2xk4m   1/1     Running   0          5h
redis-6c7f8d9b8-9pm3n   1/1     Running   1          3h

[Summary] There are 2 pods in the default namespace:
- nginx: Running normally
- redis: Running with 1 restart (check logs if concerned)

Step 5: Real-world scenario - automated incident diagnosis

Scenario setup: deploy a broken app

# Create a deployment with a problem
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: app
        # busybox ships the nc applet used by the connectivity check below
        image: busybox:1.36
        env:
        - name: DATABASE_URL
          value: "postgres://nonexistent-db:5432/mydb"
        command: ["/bin/sh", "-c"]
        args:
        - |
          echo "Connecting to database at \$DATABASE_URL..."
          nc -zv nonexistent-db 5432 || exit 1
          sleep 3600
EOF

# Check pod status (CrashLoopBackOff appears)
kubectl get pods -n default

Ask the Agent to diagnose

kagent chat k8s-expert

> The broken-app pod is in CrashLoopBackOff. Please diagnose and suggest a fix.

The Agent's automated diagnosis

[Thinking] Pod is in CrashLoopBackOff. I need to investigate...

[Action 1] Getting pod details
→ kubectl describe pod broken-app-xxx

[Analysis 1] 
- Container is failing to start
- Exit code: 1
- Last restart: 30 seconds ago

[Action 2] Checking pod logs
→ kubectl logs broken-app-xxx --previous

[Analysis 2] Log shows:
"Connecting to database at postgres://nonexistent-db:5432/mydb..."
"nc: bad address 'nonexistent-db'"

[Root Cause Found]
The application is trying to connect to 'nonexistent-db' which doesn't exist.

[Action 3] Checking services
→ kubectl get svc -n default

[Analysis 3]
No service named 'nonexistent-db' found in the namespace.

[Solution Suggested]
Option 1: Deploy the required database service
Option 2: Update DATABASE_URL to point to an existing service
Option 3: Remove the database check for testing

Would you like me to create a postgres database service? (yes/no)

Step 6: Build a custom Agent - Prometheus monitoring Agent

# prometheus-agent.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: metrics-analyzer
  namespace: kagent-system
spec:
  description: "Prometheus metrics analysis expert"
  
  systemPrompt: |
    You are a Site Reliability Engineer specializing in metrics analysis.
    You analyze Prometheus metrics to identify performance issues and anomalies.
    
    When analyzing metrics:
    1. Look for unusual spikes or drops
    2. Correlate metrics across services
    3. Identify potential bottlenecks
    4. Suggest optimization opportunities
    5. Alert on concerning trends
  
  modelConfig:
    name: gpt-4
    provider: openai
    temperature: 0.2
  
  tools:
    - name: prometheus
      type: prometheus
      endpoint: "http://prometheus-server.monitoring:9090"
      queries:
        - name: pod_cpu_usage
          query: 'rate(container_cpu_usage_seconds_total[5m])'
        - name: pod_memory_usage
          query: 'container_memory_working_set_bytes'
        - name: http_request_rate
          query: 'rate(http_requests_total[5m])'
        - name: http_error_rate
          query: 'rate(http_requests_total{status=~"5.."}[5m])'
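
The config above lists the request rate and the 5xx rate as separate queries, whereas a threshold such as "error rate below 1%" needs their ratio. As a hedged sketch, the ratio expression can be assembled from the same metric name and window; `error_ratio_query` is a hypothetical helper, and wrapping both sides in `sum()` is one common way to aggregate across series, not something the original config specifies.

```shell
# error_ratio_query METRIC WINDOW
# Prints a PromQL expression for the share of requests that returned 5xx.
error_ratio_query() (
  metric=$1
  window=$2
  printf 'sum(rate(%s{status=~"5.."}[%s])) / sum(rate(%s[%s]))\n' \
    "$metric" "$window" "$metric" "$window"
)

error_ratio_query http_requests_total 5m
```

A result of 0.01 from this expression corresponds to a 1% error rate.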

Usage example

kagent chat metrics-analyzer

> Analyze CPU usage for the last hour and identify any pods using excessive resources

Agent response:
[Querying] Fetching CPU metrics for the past hour...

[Analysis] CPU Usage by Pod:
1. payment-api: 85% average (⚠️ High)
   - Peak: 95% at 14:23
   - Recommendation: Consider scaling up or optimizing

2. user-service: 42% average (✓ Normal)
   - Stable usage pattern

3. database: 68% average (⚠️ Moderate)
   - Increasing trend detected
   - May need investigation

[Action Suggested]
1. Scale payment-api from 2 to 4 replicas
2. Profile payment-api to find CPU hotspots
3. Monitor database queries for optimization

Would you like me to scale payment-api now? (yes/no)

Step 7: Multi-Agent collaboration - a complex scenario

Scenario: Helm deployment + monitoring setup

Agent configuration

# deployment-coordinator.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: deployment-coordinator
  namespace: kagent-system
spec:
  description: "Coordinates deployment and monitoring setup"
  
  systemPrompt: |
    You coordinate between helm-agent and metrics-analyzer to:
    1. Deploy applications via Helm
    2. Verify deployment success
    3. Setup monitoring and alerts
    4. Run smoke tests
  
  # Can call other Agents
  agents:
    - name: helm-agent
      role: deployment
    - name: metrics-analyzer  
      role: monitoring
  
  tools:
    - name: kubectl

Run

kagent chat deployment-coordinator

> Deploy nginx-ingress chart and setup monitoring with alerts

Agent collaboration flow:
[Coordinator] Breaking down the task...

[Step 1] Delegating to helm-agent
→ helm-agent: Installing nginx-ingress chart...
→ helm-agent: ✓ Chart installed successfully

[Step 2] Verification
→ Coordinator: Checking pods...
→ Coordinator: ✓ All pods running

[Step 3] Delegating to metrics-analyzer
→ metrics-analyzer: Creating ServiceMonitor...
→ metrics-analyzer: Setting up alerts...
→ metrics-analyzer: ✓ Monitoring configured

[Step 4] Final verification
→ Coordinator: Running smoke test...
→ Coordinator: Testing HTTP endpoint...
→ Coordinator: ✓ All tests passed

[Completion Report]
- nginx-ingress v4.7.1 deployed
- 3 replicas running
- Monitoring active with 5 alerts configured
- Smoke tests: 100% pass rate

Alternative: a free hands-on setup without an API key (Ollama)

Step 1: Install and run Ollama

Install Ollama on your local machine

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Download and run a model

# Start the Ollama server
ollama serve

# In another terminal, download a model
ollama pull llama3.2:3b  # 3B-parameter model (lightweight)
# or
ollama pull llama3.2:1b  # 1B-parameter model (even lighter)

# Test
ollama run llama3.2:3b "Hello, how are you?"

Step 2: Deploy Ollama to Kubernetes

Create the Ollama Deployment

# ollama-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
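      # Init container that downloads the model
      initContainers:
      - name: model-puller
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Start a temporary server, pull the model, then stop the server
          ollama serve &
          SERVER_PID=$!
          sleep 10
          ollama pull llama3.2:3b
          PULL_STATUS=$?
          kill "$SERVER_PID"
          exit "$PULL_STATUS"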
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
      
      # Main container
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
      
      volumes:
      - name: ollama-data
        emptyDir: {}

---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
  - port: 80
    name: http
    targetPort: http
    protocol: TCP

Deploy

kubectl apply -f ollama-deployment.yaml

# Watch pod status
kubectl get pods -n ollama -w

# Wait for the model download to finish (2-5 minutes)
kubectl logs -n ollama -f deployment/ollama -c model-puller

# Check the service
kubectl get svc -n ollama

Step 3: Connect Kagent to Ollama

Install Kagent (without an API key)

# Install the Kagent CRDs
helm install kagent-crds \
  oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
  --namespace kagent \
  --create-namespace

# Install Kagent (using Ollama)
helm install kagent \
  oci://ghcr.io/kagent-dev/kagent/helm/kagent \
  --namespace kagent \
  --set providers.default=ollama \
  --set providers.ollama.host=http://ollama.ollama.svc.cluster.local:80

Create the Ollama ModelConfig

# ollama-modelconfig.yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: llama3-local
  namespace: kagent
spec:
  model: llama3.2:3b
  provider: Ollama
  ollama:
    host: http://ollama.ollama.svc.cluster.local:80

kubectl apply -f ollama-modelconfig.yaml

# Check the ModelConfig
kubectl get modelconfig -n kagent

Step 4: Create your first Agent (free!)

# local-k8s-agent.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: k8s-expert-local
  namespace: kagent
spec:
  description: "Kubernetes expert using local LLM"
  
  systemPrompt: |
    You are a Kubernetes troubleshooting expert.
    Help diagnose and resolve Kubernetes issues.
    Keep your answers concise and actionable.
  
  # Use the Ollama model
  modelConfig:
    name: llama3-local
  
  tools:
    - name: kubectl
      type: kubernetes
      permissions:
        - get
        - list
        - describe
      resources:
        - pods
        - services
        - deployments

kubectl apply -f local-k8s-agent.yaml

# Check the Agent
kubectl get agents -n kagent

Step 5: Use the Agent (completely free!)

Using the UI

# Open the UI
kubectl port-forward -n kagent svc/kagent-ui 8080:80

# Open http://localhost:8080 in a browser
# Select the k8s-expert-local Agent and start chatting

Using the CLI

# Chat directly through the Kagent CLI
kagent chat k8s-expert-local

> Show me all pods in default namespace

Hands-on examples: completely free

Example 1: Check pod status

kagent chat k8s-expert-local

> Check if there are any pods with issues in the cluster

[Agent response]
Let me check the pod status across all namespaces...

[Action] Running kubectl get pods --all-namespaces

I found:
- default namespace: All pods running (2/2)
- kube-system namespace: All pods running (8/8)
- kagent namespace: All pods running (3/3)
- ollama namespace: 1 pod running (1/1)

Everything looks healthy!

Example 2: Check resource usage

> What pods are using the most memory?

[Agent response]
Checking resource usage...

Top 3 memory consumers:
1. ollama-xxx: 6.2Gi (Ollama LLM server)
2. kagent-engine-xxx: 512Mi
3. coredns-xxx: 128Mi

The Ollama pod is expected to use high memory since it's running the LLM model.

Cost comparison

Option	Cost	Performance	Setup time
OpenAI API	$0.03/1K tokens	Best	5 min
Anthropic API	$0.025/1K tokens	Best	5 min
Ollama (local)	Free	Medium	10 min
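
To put the per-1K-token prices in perspective, here is a small arithmetic sketch using the figures from the table. `estimate_cost` is a hypothetical helper, and the one-million-token volume is an arbitrary example, not a measured workload.

```shell
# estimate_cost TOKENS PRICE_PER_1K
# Prints the cost in dollars: tokens / 1000 * price per 1K tokens.
estimate_cost() {
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.2f\n", t / 1000 * p }'
}

estimate_cost 1000000 0.03    # 1M tokens at $0.03/1K  -> 30.00
estimate_cost 1000000 0.025   # 1M tokens at $0.025/1K -> 25.00
```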

Recommended Ollama models

Lightweight models (4GB RAM or less)

ollama pull llama3.2:1b    # 1.3GB - lightest
ollama pull phi3:mini      # 2.3GB - Microsoft
ollama pull gemma2:2b      # 1.6GB - Google

Mid-size models (8GB RAM)

ollama pull llama3.2:3b    # 2GB - recommended!
ollama pull mistral:7b     # 4.1GB - good performance

Larger models (16GB+ RAM)

ollama pull llama3.1:8b    # 4.7GB
ollama pull qwen2.5:14b    # 9GB
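
The three tiers above can be turned into a simple lookup. This is only a sketch of the grouping in this post: `recommend_model` is a hypothetical helper, it picks one model per tier, and treating anything under 8GB as the lightweight tier is an assumption.

```shell
# recommend_model RAM_GB
# Prints one model tag per RAM tier, following the lists above.
recommend_model() {
  if [ "$1" -ge 16 ]; then
    echo "llama3.1:8b"
  elif [ "$1" -ge 8 ]; then
    echo "llama3.2:3b"
  else
    echo "llama3.2:1b"
  fi
}

recommend_model 8    # prints llama3.2:3b
```

The result can be passed straight to `ollama pull`, for example `ollama pull "$(recommend_model 8)"`.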

Troubleshooting

Problem 1: The Ollama pod is too slow

# Allocate more resources
resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "16Gi"
    cpu: "8"

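Problem 2: You want to use a smaller model

# Pull the 1B model inside the cluster's Ollama pod
kubectl exec -n ollama deployment/ollama -- ollama pull llama3.2:1b

# Update the ModelConfig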
kubectl patch modelconfig llama3-local -n kagent \
  --type merge \
  -p '{"spec":{"model":"llama3.2:1b"}}'

Problem 3: You want to use a GPU

# If a GPU is available
containers:
- name: ollama
  image: ollama/ollama:latest
  resources:
    limits:
      nvidia.com/gpu: 1  # use one GPU

Hybrid strategy:

# Create two ModelConfigs
- llama3-local (free, for everyday queries)
- gpt-4 (paid, for complex queries)

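Pros and cons

Ollama (local LLM)

Pros

Completely free
Data never leaves your environment (privacy)
Works without an internet connection
No API keys to manage

Cons

Relatively slow
High resource usage (memory, CPU)
Lower quality than OpenAI/Anthropic models
Large models need a GPU

Cloud LLMs (OpenAI, Anthropic)

Pros

Best performance
Fast responses
No local resource burden

Cons

Costs money
Requires an API key
Data is sent to an external service
Requires an internet connection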

Best Practices

1. Security

Managing API keys

# Store keys in a Secret (never hard-code them)
kubectl create secret generic llm-credentials \
  --from-literal=openai-key="$OPENAI_API_KEY" \
  --from-literal=anthropic-key="$ANTHROPIC_API_KEY" \
  -n kagent-system

RBAC setup

# Grant the Agent only the minimum permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kagent-readonly
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list"]  # no write permissions
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
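
A Role grants nothing until it is bound to a subject. The sketch below prints a matching RoleBinding; the binding name `kagent-readonly-binding` and the `kagent` ServiceAccount in `kagent-system` are taken from the cleanup and troubleshooting sections of this post, while `role_binding_yaml` itself is a hypothetical helper.

```shell
# role_binding_yaml ROLE_NAMESPACE SA_NAME SA_NAMESPACE
# Prints a RoleBinding that attaches the kagent-readonly Role
# to the given ServiceAccount.
role_binding_yaml() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kagent-readonly-binding
  namespace: $1
subjects:
  - kind: ServiceAccount
    name: $2
    namespace: $3
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kagent-readonly
EOF
}

role_binding_yaml production kagent kagent-system
```

Pipe the output into `kubectl apply -f -` to create the binding.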

2. Cost Management

Monitoring token usage

# Add limits to the Agent configuration
spec:
  modelConfig:
    maxTokens: 4000  # cap on response tokens
    budget:
      daily: 1000000  # daily token limit
      perQuery: 10000  # per-query token limit
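
What such a budget means in practice can be shown with a small check. This is an illustrative sketch of the two limits above, not Kagent's enforcement logic; `within_budget` and the sample numbers are assumptions.

```shell
# Limits copied from the configuration above.
DAILY_LIMIT=1000000
PER_QUERY_LIMIT=10000

# within_budget USED_TODAY QUERY_TOKENS
# Succeeds (exit status 0) only if the query fits both limits.
within_budget() {
  [ "$2" -le "$PER_QUERY_LIMIT" ] && [ $(( $1 + $2 )) -le "$DAILY_LIMIT" ]
}

if within_budget 995000 4000; then echo "allowed"; else echo "rejected"; fi
```

Here 995,000 tokens already used plus a 4,000-token query stays under the daily limit, so the query is allowed.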

Using caching

spec:
  cache:
    enabled: true
    ttl: 3600  # cache identical queries for 1 hour

3. Observability

Logging Agent activity

# Check Agent logs
kubectl logs -n kagent-system deployment/kagent-engine -f

# Execution history for a specific Agent
kagent history k8s-expert

# Check metrics
kubectl port-forward -n kagent-system svc/kagent-metrics 9090:9090

4. Testing

Test Agents before deploying them

# test-agent.yaml
spec:
  testMode: true  # dry run, nothing is actually executed
  
  testScenarios:
    - name: "pod-failure"
      input: "A pod is in CrashLoopBackOff"
      expectedActions:
        - "describe pod"
        - "get logs"
        - "analyze error"

Resource cleanup

This section shows how to clean up cluster resources once you finish the hands-on steps.

1. Delete Kagent resources

If you installed with Helm

# Remove Kagent
helm uninstall kagent -n kagent-system

# Remove the CRDs
helm uninstall kagent-crds -n kagent

# Delete the namespaces
kubectl delete namespace kagent-system
kubectl delete namespace kagent

Delete Agents you created yourself

# List the Agents you created
kubectl get agents -A

# Delete the Agents
kubectl delete agent k8s-expert -n kagent-system
kubectl delete agent metrics-analyzer -n kagent-system
kubectl delete agent deployment-coordinator -n kagent-system
kubectl delete agent k8s-expert-local -n kagent

# Delete the ModelConfig
kubectl delete modelconfig llama3-local -n kagent

2. Delete Ollama resources

Remove Ollama completely

# Delete the Deployment and Service
kubectl delete deployment ollama -n ollama
kubectl delete service ollama -n ollama

# If you used a PV/PVC (only if they exist)
kubectl delete pvc ollama-data -n ollama
kubectl delete pv ollama-pv

# Delete the namespace
kubectl delete namespace ollama

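3. Delete test resources from the hands-on steps

# Delete the test apps created during the walkthrough
kubectl delete deployment broken-app -n default
kubectl delete deployment nginx -n default
kubectl delete deployment redis -n default

# Delete test services (this removes every Service in the default namespace)
kubectl delete service --all -n default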

4. Clean up local Ollama (optional)

Clean up Ollama on your local machine

# List downloaded models
ollama list

# Delete models
ollama rm llama3.2:3b
ollama rm llama3.2:1b
ollama rm phi3:mini

# Stop the Ollama process
pkill ollama

# Uninstall on macOS
brew uninstall ollama

# Uninstall on Linux
sudo systemctl stop ollama
sudo rm /usr/local/bin/ollama
sudo rm -rf /usr/share/ollama

5. Clean up Secrets and ConfigMaps

# Delete the API key secrets
kubectl delete secret openai-secret -n kagent-system
kubectl delete secret anthropic-secret -n kagent-system
kubectl delete secret llm-credentials -n kagent-system

# Delete ConfigMaps (if any)
kubectl delete configmap -n kagent-system --all

6. Clean up RBAC resources

# Delete the ClusterRole you created
kubectl delete clusterrole kagent-reader

# Delete the ClusterRoleBinding
kubectl delete clusterrolebinding kagent-reader-binding

# Delete the Role and RoleBinding
kubectl delete role kagent-readonly -n production
kubectl delete rolebinding kagent-readonly-binding -n production

7. Full reset (delete everything)

Clean up all resources at once

#!/bin/bash

# cleanup-kagent.sh
echo "Starting Kagent resource cleanup..."

# Remove Helm releases
helm uninstall kagent -n kagent-system 2>/dev/null
helm uninstall kagent-crds -n kagent 2>/dev/null
helm uninstall kagent -n kagent 2>/dev/null

# Remove Ollama
kubectl delete namespace ollama --grace-period=0 --force 2>/dev/null

# Remove Kagent
kubectl delete namespace kagent-system --grace-period=0 --force 2>/dev/null
kubectl delete namespace kagent --grace-period=0 --force 2>/dev/null

# Remove RBAC
kubectl delete clusterrole kagent-reader 2>/dev/null
kubectl delete clusterrolebinding kagent-reader-binding 2>/dev/null

# Remove test resources (this deletes every Deployment in the default namespace)
kubectl delete deployment --all -n default 2>/dev/null

# Stop Minikube
minikube stop
echo "Cleanup complete!"

Run

chmod +x cleanup-kagent.sh

./cleanup-kagent.sh

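Cleanup checklist

Items to check after finishing the hands-on steps

Remove the Kagent Helm charts
Delete the Ollama Deployment
Delete every Agent you created
Delete the test Deployments
Delete Secrets and ConfigMaps
Delete RBAC resources
Stop or delete the Minikube cluster
Delete local Ollama models (optional)
Clean up Docker images (optional)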

Troubleshooting

Problem 1: The Agent does not respond

Symptom

kagent chat k8s-expert
# Timeout waiting for response...

Fix

# 1. Check the Engine pod status
kubectl get pods -n kagent-system -l app=kagent-engine

# 2. Check the logs
kubectl logs -n kagent-system deployment/kagent-engine

# 3. Check the API key
kubectl get secret openai-secret -n kagent-system -o yaml

# 4. Check the network
kubectl exec -it -n kagent-system deployment/kagent-engine -- \
  curl -I https://api.openai.com

Problem 2: Permission errors

Symptom

Error: pods is forbidden: User "system:serviceaccount:kagent-system:kagent" 
cannot list resource "pods" in API group ""

Fix

# Check RBAC
kubectl get clusterrolebinding | grep kagent

# Create a ClusterRole
kubectl create clusterrole kagent-reader \
  --verb=get,list,watch \
  --resource=pods,services,deployments

# Create a ClusterRoleBinding
kubectl create clusterrolebinding kagent-reader-binding \
  --clusterrole=kagent-reader \
  --serviceaccount=kagent-system:kagent

Problem 3: The LLM gives odd responses

Symptom: the Agent gives irrelevant answers or gets stuck in an infinite loop

Fix

# Improve the system prompt
spec:
  systemPrompt: |
    You are a Kubernetes expert.
    
    IMPORTANT RULES:
    - Always verify information before acting
    - If unsure, ask for clarification
    - Never make destructive changes without confirmation
    - Limit analysis to 5 steps maximum
    - If stuck, explain the issue and stop
  
  # Adjust the temperature (lower is more consistent)
  modelConfig:
    temperature: 0.1  # 0.0 ~ 1.0

  # Limit the maximum number of iterations
  maxIterations: 10

Real-world use cases

Case 1: Automated response to overnight incidents

Situation:

A production incident occurs at 3 a.m.
The on-call engineer is asleep

Kagent configuration

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: incident-responder
spec:
  # Integrate with Prometheus alerts
  triggers:
    - type: prometheus-alert
      severity: critical

  automation:
    enabled: true  # take action automatically
    requireApproval: false  # no approval needed in an emergency
    
  actions:
    - diagnose: true
    - attempt-fix: true
    - notify-on-call: true
    - create-incident-report: true

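Result

The Agent diagnoses the problem automatically
Restarts the pod that ran out of memory
Recovery within 5 minutes
The engineer only reviews the report in the morning

Case 2: Faster developer onboarding

Problem: a new developer does not know Kubernetes

Solution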

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: newbie-helper
spec:
  systemPrompt: |
    You are a friendly Kubernetes tutor for new developers.
    Explain concepts simply and provide step-by-step guidance.
    Always include educational context with your answers.

Usage example

Developer: "How do I check my app's logs?"

Agent:
Let me help you check your logs! 📚

[Explanation]
Logs in Kubernetes are stored in each pod's containers.
To view them, you use the `kubectl logs` command.

[Your App]
I see you have a pod called "user-api-6c7f8d9b8-9pm3n"

[Command to run]
kubectl logs user-api-6c7f8d9b8-9pm3n

[Result]
(shows logs...)

[Learn More]
- Add `-f` to follow logs in real-time
- Add `--previous` to see logs from crashed pods
- Add `--tail=100` to see last 100 lines only

Try it yourself! Let me know if you need help.

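Outlook and roadmap

The Kagent project is evolving quickly.

Current status (as of December 2025)

CNCF Sandbox project
800+ GitHub stars
400+ Discord members
Integrations with Argo, Helm, Istio, Kubernetes, and Prometheus

Planned (roadmap)

OpenTelemetry integration
Expanded multi-agent workflows
Tools for more CNCF projects
Visual Agent design in the GUI
Agent marketplace (community-shared Agents)

Wrapping up

The paradigm for operating Kubernetes is changing.

Past

Problem occurs → a person diagnoses → a person fixes → a person verifies

Present (Kagent)

Problem occurs → AI diagnoses → AI fixes → AI verifies → a person only approves

Future

Problem occurs → AI handles everything automatically → people focus on strategic decisions

Getting started

Kagent is still at an early stage, but it has a lot of potential.

For DevOps engineers

Freedom from repetitive troubleshooting
Automation of complex deployments
A lighter 24/7 on-call burden

For platform teams

Stronger developer self-service
Standardized operating procedures
Democratized knowledge (anyone can work like an expert)

For the whole organization

Shorter incident response times
Lower operating costs
Engineers can focus on higher-value work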

Somaz | DevOps Engineer | Kubernetes & Cloud Infrastructure Specialist
