Shrikar Kaduluri - DevOps Engineer

Professional Summary

Platform and DevOps Engineer with a Master's degree in Computer Science and hands-on experience building production-inspired cloud platforms. Strong background in incident management, monitoring, reliability, and infrastructure automation using Terraform, Kubernetes, and CI/CD pipelines. Experienced across Azure and AWS environments with a reliability-first mindset shaped by real-world production support and on-call troubleshooting. Microsoft Certified Azure Administrator (AZ-104).

Technical Skills

DevOps & Automation

Terraform - Used in: IDP, Multi-Cloud K8s
Infrastructure as Code (IaC) - Used in: IDP, Multi-Cloud K8s
Docker - Used in: IDP, Multi-Cloud K8s
GitHub Actions - Used in: IDP (CI/CD Pipeline)
Jenkins - Used in: Previous enterprise projects
CI/CD Pipelines - Used in: IDP, Multi-Cloud K8s

Cloud & Infrastructure

Microsoft Azure (VMs, VNets, NSGs) - Used in: Ohio DRC, IDP, Multi-Cloud K8s
Azure Load Balancers - Used in: IDP (AKS Infrastructure)
Azure Monitor & Entra ID - Used in: Ohio DRC, Infosys
Amazon Web Services (AWS) - Used in: Multi-Cloud K8s (EKS)

Operations & Reliability

ServiceNow (ITSM) - Used in: Ohio DRC (99.8% SLA)
Incident Management - Used in: Ohio DRC, Infosys
Monitoring & Alerting - Used in: Observability Stack, Ohio DRC
Root Cause Analysis - Used in: Ohio DRC, Infosys

Scripting & OS

Python - Used in: IDP (Orchestration), Automation
Bash - Used in: CI/CD, Infrastructure Scripts
PowerShell - Used in: Ohio DRC (Windows Server)
Linux - Used in: IDP, Multi-Cloud K8s, Observability
Windows Server - Used in: Ohio DRC (Enterprise IT)

Professional Experience

Information Technologist I – Infrastructure & Operations

Ohio Department of Rehabilitation and Correction

Aug 2024 - Present

Support and stabilize production infrastructure services across 50+ enterprise endpoints, acting as an escalation point for availability, reliability, and performance-related incidents
Responsible for critical infrastructure supporting 50+ endpoints across correctional facilities. Escalation path for P1/P2 incidents affecting system availability. Average response time: <15 minutes for critical alerts.
Manage incidents and service requests through ServiceNow, performing root cause analysis and maintaining 99.8% uptime SLAs across supported systems
Handle ~40 tickets/week ranging from P3 service requests to P1 outages. RCA documentation for all P1/P2 incidents. Maintained 99.8% uptime across monitored systems, exceeding 99.5% SLA target.
Collaborate with network, identity, and application teams within a hybrid Active Directory and Azure Entra ID environment, reducing incident resolution time by 15%
Improved cross-team collaboration through structured handoff procedures and shared runbooks. Reduced average incident resolution time from 2.3 hours → 1.95 hours (15% improvement) by streamlining identity-related troubleshooting workflows.
Apply DevOps practices to improve operational efficiency through infrastructure automation, CI/CD workflows, and monitoring configurations using Terraform, GitHub Actions, and cloud-native tooling
Built Terraform modules for Azure resource provisioning, reducing manual setup time by 60%. Implemented GitHub Actions workflows for automated config validation. Created custom Azure Monitor alerts reducing false-positive alert noise by 40%.
Maintain operational documentation and contribute to continuous service improvement initiatives to strengthen system reliability and support readiness
Authored 20+ runbooks and troubleshooting guides for common incident scenarios. Maintained knowledge base articles in Confluence reducing onboarding time for new team members. Led bi-weekly service improvement meetings, implementing 8 process improvements in incident response workflow.

AI Operations & QA Intern

WelSpot

Mar 2024 - Aug 2024

Performed performance testing and operational validation of Large Language Models (LLMs) to improve reliability and response consistency
Conducted load testing on GPT-4 and Claude models with varying prompt sizes. Validated response consistency across 1000+ test cases. Identified performance degradation patterns at >4K token context windows, informing prompt engineering best practices.
Validated Retrieval-Augmented Generation (RAG) workflows integrating vector databases and Google BigQuery to support scalable AI operations
Tested RAG pipeline accuracy with a ChromaDB vector store and BigQuery data sources. Validated that retrieval improved response accuracy from 72% to 89% compared with zero-shot prompting. Documented performance tradeoffs between embedding models (OpenAI vs open-source).

System Engineer – Application Support & QA

Infosys

Nov 2020 - Apr 2022

Conducted system observability and log analysis using Azure Monitor and Nmon to maintain system health during peak workloads
Monitored Linux server performance during month-end batch processing peaks (CPU, memory, disk I/O). Used Nmon for real-time metrics and Azure Monitor Logs for historical analysis. Identified a memory leak pattern causing weekly restarts and recommended JVM tuning that reduced restart frequency to monthly.
Authored Root Cause Analysis (RCA) reports using SQL and Excel to support performance remediation and operational stability initiatives
Wrote 15+ RCA documents for production incidents. Used SQL queries to analyze transaction logs and identify performance bottlenecks. Excel dashboards visualized incident trends, leading to database index optimization reducing query times by 35%.

Project Deep Dives

Internal Developer Platform (IDP)

Kubernetes • Terraform • GitHub Actions • Python

Built a self-service platform enabling developers to deploy applications without infrastructure knowledge.

Deployment Time: 2h → 15min
Success Rate: 99.95%
Automation: 100%

§ The Problem

Development teams were spending 2+ hours per deployment managing infrastructure: provisioning VMs, configuring networking, setting up load balancers, and debugging deployment failures. This created a bottleneck where platform engineers became gatekeepers, and developers couldn't iterate quickly.

Average 12 manual steps per deployment

3-5 day backlog for infrastructure requests

40% of deployments failed on first attempt

§ Solution Architecture

Built a declarative, GitOps-driven platform where developers define application requirements in YAML, and the platform handles infrastructure provisioning, deployment orchestration, and rollback safety automatically.

Developer Push: Git push triggers GitHub Actions
→ Validation: Schema validation, security scans, policy checks
→ Terraform Apply: Infrastructure provisioned via IaC
→ K8s Deployment: Application deployed to AKS cluster
→ Health Checks: Automated validation & smoke tests
→ Success/Rollback: Auto-rollback if health checks fail
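
The validation stage in this flow could look roughly like the sketch below. The platform's actual schema is not shown here, so the field names (`name`, `image`, `replicas`, `port`) and the limits are hypothetical, and the spec is passed in as an already-parsed dict rather than read from a YAML file.

```python
from typing import Any, Dict, List

# Hypothetical required fields and their expected types; not the platform's real schema.
REQUIRED_FIELDS = {"name": str, "image": str, "replicas": int, "port": int}


def validate_app_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable validation errors (empty if the spec is valid)."""
    errors: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in spec:
            errors.append(f"missing required field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(spec[field], bool) or not isinstance(spec[field], expected_type):
            errors.append(f"{field} must be of type {expected_type.__name__}")
    # Range checks only run once the basic shape is correct.
    if not errors:
        if not 1 <= spec["replicas"] <= 50:
            errors.append("replicas must be between 1 and 50")
        if not 1 <= spec["port"] <= 65535:
            errors.append("port must be between 1 and 65535")
    return errors
```

Returning every error at once, rather than failing on the first one, lets a CI job report a complete list to the developer in a single run.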

§ Key Technical Decisions

Why Kubernetes?

Needed declarative API, self-healing, and horizontal scaling. AKS provided managed control plane reducing operational burden.

Why GitHub Actions?

Developers already worked in GitHub, so it avoided context switching to a separate CI tool and offered native YAML-based configuration.

Terraform State Management

Used Azure Blob Storage with state locking to prevent concurrent modification conflicts in a multi-team environment.

Rollback Strategy

Implemented blue-green deployments with automated health checks. Failed deployments auto-revert within 60 seconds.

§ Code Sample: Deployment Orchestration

Python deploy_orchestrator.py

# Simplified deployment orchestrator with health checks and rollback
import subprocess
import time
from typing import Dict, Optional

class DeploymentOrchestrator:
    def __init__(self, app_name: str, namespace: str):
        self.app_name = app_name
        self.namespace = namespace
        self.rollback_revision = None
    
    def deploy(self, manifest_path: str) -> bool:
        """Deploy application with automatic rollback on failure"""
        # Store current revision for rollback
        self.rollback_revision = self._get_current_revision()
        
        # Apply Kubernetes manifest
        if not self._apply_manifest(manifest_path):
            return False
        
        # Wait for rollout to complete
        if not self._wait_for_rollout():
            self._perform_rollback()
            return False
        
        # Run health checks
        if not self._run_health_checks():
            self._perform_rollback()
            return False
        
        return True
    
    def _run_health_checks(self, timeout: int = 60) -> bool:
        """Validate deployment health"""
        checks = [
            self._check_pod_health(),
            self._check_service_endpoints(),
            self._smoke_test_api()
        ]
        return all(checks)
    
    def _perform_rollback(self):
        """Automatic rollback to previous stable revision"""
        print(f"Deployment failed. Rolling back to revision {self.rollback_revision}")
        # Pass -n and the namespace as separate argv entries so kubectl parses them correctly
        subprocess.run([
            "kubectl", "rollout", "undo",
            f"deployment/{self.app_name}",
            f"--to-revision={self.rollback_revision}",
            "-n", self.namespace
        ])
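
The simplified sample accepts a `timeout` in `_run_health_checks` but does not use it, and it evaluates each check exactly once. One way a deadline could be honored is sketched below; `wait_until_healthy` and its parameters are hypothetical helpers for illustration, not part of the orchestrator.

```python
import time
from typing import Callable, Iterable


def wait_until_healthy(
    checks: Iterable[Callable[[], bool]],
    timeout: float = 60.0,
    interval: float = 2.0,
) -> bool:
    """Poll every check until all pass or the deadline expires."""
    checks = list(checks)
    deadline = time.monotonic() + timeout
    while True:
        # all() short-circuits, so later checks are skipped once one fails.
        if all(check() for check in checks):
            return True
        # Stop if another sleep would carry us past the deadline.
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

Passing the checks as callables means they are re-evaluated on each poll, which gives pods that are still starting a chance to become ready before a rollback is triggered.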

§ What I Learned

Developer UX matters more than technical elegance: Initially over-engineered with complex abstractions. Simplifying to a "push code, get deployment" model increased adoption 3x.
Rollback safety is non-negotiable: First version had manual rollback. After 2 production incidents, automated rollback on health check failures became mandatory.
Observability from day one: Added structured logging and metrics early. Debugging production issues without logs would've been impossible.
Drift detection matters: Manual changes to infrastructure caused subtle bugs. Added Terraform drift detection in CI to catch manual modifications.
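
A drift check of the kind described in the last point could be sketched as follows. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The function names and status strings are hypothetical, and the sketch assumes the plan runs against a branch whose configuration has already been applied, so any reported change indicates drift rather than pending work.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, CHANGES_PRESENT = 0, 1, 2


def interpret_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == NO_CHANGES:
        return "in-sync"
    if code == CHANGES_PRESENT:
        return "drift"
    return "error"


def detect_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether live infrastructure differs from code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduled CI job could call `detect_drift` and fail or notify when the result is anything other than "in-sync".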

Multi-Cloud Kubernetes Platform

EKS • AKS • Helm • ArgoCD • Istio

Deployed and managed production Kubernetes clusters across AWS and Azure with service mesh.

Microservices: 50+
Uptime SLA: 99.9%
Deployment Model: GitOps

§ The Problem

Microservices ran across AWS and Azure with inconsistent deployment practices, no service-to-service authentication, and manual certificate management. Teams were deploying via kubectl commands, leading to configuration drift and "works on my machine" deployment failures.

§ Solution Architecture

Implemented GitOps-driven Kubernetes platform with Istio service mesh for traffic management, observability, and mTLS. ArgoCD monitors Git repositories and automatically syncs cluster state, eliminating manual kubectl operations.

Architecture: GitOps flow → ArgoCD → Multi-cluster deployment → Istio service mesh → Observability

§ Key Challenges & Solutions

Challenge: Service-to-service authentication across 50+ microservices
Solution: Istio mTLS for automatic certificate rotation and zero-trust networking. No application code changes required.

Challenge: Configuration drift between AWS and Azure clusters
Solution: GitOps with ArgoCD ensures Git is single source of truth. Automated drift detection alerts on manual changes.

Challenge: Network visibility and debugging distributed systems
Solution: Istio sidecars provide distributed tracing, metrics, and traffic visualization with little code instrumentation; applications only need to forward trace headers so spans join into a single trace.

§ What I Learned

Service mesh adds complexity—justify it: Istio solved real problems (mTLS, observability) but increased debugging difficulty. Worth it for 50+ services, overkill for 5.
GitOps prevents drift, but needs discipline: Teams tried manual hotfixes during incidents. Strict Git-only policy prevented production surprises.
Multi-cloud is hard—abstract when necessary: AWS and Azure Kubernetes differences required abstraction layer. Helm charts with cloud-specific values worked well.
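
The "shared chart, cloud-specific values" approach can be illustrated with a small sketch of layered values. This is similar in spirit to how Helm layers values files, but it is not Helm's implementation. The `BASE`, `AWS`, and `AZURE` dictionaries are hypothetical examples, not the project's real values.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values: shared defaults plus per-cloud overrides.
BASE = {"replicaCount": 2, "service": {"type": "LoadBalancer", "annotations": {}}}
AWS = {"service": {"annotations": {"service.beta.kubernetes.io/aws-load-balancer-type": "nlb"}}}
AZURE = {"service": {"annotations": {"service.beta.kubernetes.io/azure-load-balancer-internal": "true"}}}

aws_values = merge_values(BASE, AWS)
azure_values = merge_values(BASE, AZURE)
```

Keeping the defaults in one place means each cloud file only has to state what differs, which keeps the per-cloud overrides short and easy to review.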

Observability Stack Implementation

Prometheus • Grafana • Loki • Jaeger

Designed and deployed comprehensive monitoring and logging infrastructure.

70% Faster MTTD
100+ Dashboards
Full Stack Tracing

§ The Problem

Production incidents took 45+ minutes to detect because teams were SSH-ing into servers to grep logs. No centralized logging, no structured metrics, no way to trace requests across microservices. "It's slow" bug reports had no data to debug.

§ Three Pillars of Observability

Metrics (Prometheus)

Time-series data for system health: request rates, error rates, latencies, resource usage. Pre-aggregated for fast querying.

Example: alerting on rate(http_requests_total{status="500"}[5m]) catches error spikes
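
Because a counter such as `http_requests_total` only ever increases, a spike shows up in its rate of change rather than its raw value. The sketch below is a simplified version of that idea and not Prometheus's exact algorithm; the function names and the threshold are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the samples, tolerating resets."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value in full.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def should_alert(samples: List[Sample], threshold: float) -> bool:
    """Fire when the error rate exceeds `threshold` errors per second."""
    return per_second_rate(samples) > threshold
```

For example, samples of 100, 130, and 190 errors taken 30 seconds apart give an increase of 90 over 60 seconds, or 1.5 errors per second.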

Logs (Loki)

Centralized log aggregation from all services. Indexed by labels, not full-text search. Cheaper than Elasticsearch at scale.

Example: {app="api", level="error"} finds all API errors

Traces (Jaeger)

Distributed tracing shows request flow across microservices. Identifies bottlenecks and cascade failures.

Example: See 200ms database query slowing down checkout flow
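
The kind of question a trace answers ("which step is slow?") can be sketched with a small self-time calculation. The `Span` structure and the checkout trace below are hypothetical and much simpler than Jaeger's data model, and the sketch assumes child spans run sequentially rather than concurrently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str] = None  # name of the parent span, None for the root


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent == span.name)
    return max(span.duration_ms - children, 0.0)


def find_bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Hypothetical checkout trace.
trace = [
    Span("checkout", 260.0),
    Span("cart-service", 40.0, parent="checkout"),
    Span("orders-db-query", 200.0, parent="checkout"),
]
bottleneck = find_bottleneck(trace)
```

Comparing self time rather than total duration keeps the root span, which always has the longest total, from being reported as the bottleneck.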