Imran
Hasan
Systems engineer specializing in the boundary between software and hardware.
[ eBPF | kernel_tracing | high_concurrency_systems ]
"Perf is a story, not a number."— Linux Kernel Principles
"Stability is maintained through the Single Responsibility Principle. Build small, fix fast."
Professional
Proof.
K8s Cluster & Devtron Setup
"Imran is the best freelancer we have ever met for Kubernetes, period. He knows what he is doing and he can consult what we need."
Docker Dev Environment
"Probably the best freelancer I have worked with so far. Great Communication. Followed the requirements perfectly."
Service Administration
"Imran done a fantastic job by assisting our current developer team with expert knowledge on a server issue."
Phoenix & RabbitMQ Container
"Helped me with my Elixir / Phoenix project involving RabbitMQ. Would def work again!"
The Stack
User Space
Applications, libraries, and runtime environments. Where business logic breathes and fails. Go binaries, Kubernetes pods, and the chaos of the edge.
Kernel Space
The primitive heart. Process scheduling, memory paging, and the virtual file system. This is where I spend my time: optimizing the cold, hard logic of hardware abstraction.
Memory Layout (Abstract)
"Perf is a story, not a number."— Linux Kernel Principles
L3—L7
Protocols
Going beyond simple HTTP. Tuning the Linux networking stack for high throughput and low latency. Implementing Anycast, BGP peering, and high-performance packet filtering at the XDP level.
"Global connectivity is a property of the routing table."
eBPF / XDP Hooking
Bypassing the standard kernel network stack for extreme performance. Implementing DDoS mitigation and load balancing directly in the NIC driver phase.
Global Anycast
Announcing the same IP space from multiple geographical locations. Leveraging BGP path selection to route traffic to the nearest healthy edge node.
Congestion Management
Tuning TCP BBR for high-speed cross-data-center replication. Minimizing bufferbloat and maximizing bandwidth utilization on high-latency links.
Protocol
Design
Custom network protocols, binary serialization, and protocol analysis. From wire format design to congestion control and zero-copy implementations.
Binary Protocol Design
Custom wire formats with versioning and backward compatibility
Protocol Analysis
Wireshark dissectors, packet capture, and traffic analysis
Performance Optimization
Zero-copy I/O, kernel bypass with DPDK, and QUIC implementation
AI Internals
Precision & GEMM
Most focus on layers; I focus on the SRAM. Understanding how BF16 and FP8 precision affects the gradient flow and how FlashAttention sidesteps the HBM bottleneck. It's not just math; it's memory management.
- Triton Kernel Ops
- Quantization Theory
- SM Occupancy Optimization
Inference Lifecycle
Producing models is easy; scaling them is war. Implementing KV-Cache eviction strategies for long-context windows and building low-latency Model-Mesh architectures that handle 10k+ requests per second without jitter.
- Serving & Autoscaling
- Weight Serialization
- Continuous Training Loops
Binary & Model Exploits
LLMs are just functions with huge attack surfaces. Beyond simple prompt injection, I study Adversarial Perturbations to fool computer vision models and Model Inversion attacks to leak training data from frozen weights.
- Poisoning Data Lakes
- Oracle Extraction
- Inference Side-Channels
Weight Distribution Map
NODE_ID: 0x9f2a-7c1b // TENSOR_CORE_ACTIVE
LATENT_SPACE_DENSITY: [VALIDATING_GRADIENTS] >> ERROR_0.021
IO_TRITON_KERNEL: CACHED_SRAM_ALLOC(32KB)
"The most dangerous vulnerability in an AI system isn't the prompt; it's the assumption that the weights are immutable."
Technical
Journal.
Documenting the journey through low-level systems. From Netfilter deep-dives to Kubernetes operator internals, these articles explore the "Why" behind the architecture.
Understanding iptables: A Comprehensive Guide
Diving deep into the Netfilter framework, chain logic, and packet traversal through the Linux networking stack.
READ_FULL_DUMP
Kubernetes Networking Internal Flow
Analyzing CNI plugins, Service proxy logic, and how packets escape the container namespace into the physical wire.
READ_FULL_DUMP
Kernel-Informed Scaling in AWS
Using eBPF to monitor L1 cache misses as a metric for infrastructure horizontal pod autoscaling.
READ_FULL_DUMP
Distributed
Tracing
Building production-grade observability with OpenTelemetry, Jaeger, and custom instrumentation. Tracking requests across 100+ microservices with sub-millisecond precision.
Span Context Propagation
W3C Trace Context headers across HTTP, gRPC, and message queues
Sampling Strategies
Tail-based sampling with 1% overhead at 1M req/s
Custom Instrumentation
Auto-instrumentation for Golang stdlib and third-party libraries
Performance
Optimization
Connection Pooling
Implemented database connection pooling with pgBouncer. Reduced connection overhead from 50ms to <1ms.
Query Optimization
Analyzed slow queries with EXPLAIN ANALYZE. Added strategic indexes, reducing query time by 95%.
Caching Strategy
Multi-layer caching with Redis and in-memory LRU. 98% cache hit rate for hot data.
Profiling Results
Chaos Engineering
Failure Injection
Deliberately introducing failures to test system resilience. Network partitions, pod crashes, and resource exhaustion.
Steady State Hypothesis
Production Chaos Experiments
"The best time to find out your system can't handle failure is before your customers do."
Cloud
Architecture
Multi-region design, disaster recovery, and cost optimization. Building resilient systems with active-active failover and automated recovery across AWS, GCP, and Azure.
Multi-Region Design
Active-active across 5 regions with global load balancing
Disaster Recovery
RPO <5min, RTO <15min with automated failover
Cost Optimization
FinOps practices reducing cloud spend by 40% with spot instances
Kubernetes
Internals
Deep dive into K8s control plane, custom schedulers, CNI plugins, and operator patterns. Managing 10,000+ pods across multi-region clusters with custom resource definitions.
Custom Scheduler
Topology-aware scheduling with GPU affinity and NUMA optimization
CNI Deep Dive
Cilium eBPF networking with service mesh integration
Operator Pattern
Custom controllers with reconciliation loops and leader election
LLVM IR &
SSA Form.
The "Middle-End" of modern computing. Transforming high-level code into a mathematically sound Intermediate Representation. Static Single Assignment (SSA) ensures every variable has exactly one definition, enabling clean optimization.
"Complexity is a compiler's optimization problem."
Front-End
Parsing source text into AST and initial IR.
Middle-End
Passes: Inlining, Vectorization, Dead-Code Elimination.
Back-End
Instruction selection for specific silicon (x86/ARM).
Abstract
Interpretation.
Proving the Nullity
Solving the Halting Problem by approximation. Using mathematical domains to prove that certain execution paths are unreachable or that a pointer can never be null, enabling peak performance without runtime checks.
"To understand the code, we must understand the space of all possible codes."
Just-In-Time.
Self-Mutation.
Machine Generation
Synthesizing executable code on the fly. JIT compilers transform bytecode into native machine instructions at runtime, making dynamic languages like JavaScript and Lua compete with C in performance-critical hot loops.
"The fastest code is the code you write while the program is running."
Kernel
Observability
PROBE_TYPE: KPROBE / TRACEPOINT
FILTER: PID_FILTER_ENABLED
DUMP_LEVEL: VERBOSE
01. Dynamic Tracing
Using `perf` and `bpftrace` to analyze production workloads without adding significant overhead. Visualizing bottlenecks on flame graphs and tracing syscall latency in real time.
02. Context Jitter
Reducing context switching overhead by tuning task priority (niceness) and CPU affinity. Optimizing for NUMA locality and minimizing L1/L2 cache misses in high-speed Golang runtimes.
03. Soft-IRQ / Tasklets
Managing deferred work execution and interrupt storms. Distributing packet processing load across cores using RPS/RFS and ensuring deterministic response times under heavy sustained I/O.
"To know the kernel is to trace the kernel."
Async Everything.
Moving past the synchronous bottleneck. Implementing `io_uring` to eliminate syscall overhead in high-throughput database engines and file servers.
Ring Buffer Dynamics
Zero-Syscall I/O
By sharing memory between user-space and kernel-space via ring buffers, we submit I/O requests and reap completions without a single context switch.
Polled Mode
Eliminating interrupts entirely. The kernel threads poll the submission queue, further reducing latency for ultra-fast NVMe storage.
// TRUTH LIVES IN THE RING BUFFER
Database
Internals
Query optimization, storage engine design, and transaction isolation. From B-tree indexing to MVCC implementation and distributed consensus protocols.
Query Optimizer
Cost-based optimization with statistics and cardinality estimation
Storage Engine
LSM-tree implementation with compaction strategies and bloom filters
ACID Guarantees
Snapshot isolation with MVCC and write-ahead logging
SLAB &
SLUB.
Caching at the speed of hardware. The kernel's SLUB allocator avoids fragmentation by keeping "caches" of commonly used object types. No more expensive buddy allocator calls for small tasks.
Object Caching
Rather than allocating and freeing raw pages, the kernel maintains a pool of initialized objects (task_struct, mm_struct). Reuse is cheaper than initialization.
Cache Locality
SLUB minimizes metadata overhead and maximizes L1 cache utilization by aligning objects to processor cache lines.
/proc/slabinfo excerpt
"Memory allocation is not a request; it's a negotiation with the hardware."
CFS Internals.
Fairness in O(log N).
The Red-Black Tree
The Completely Fair Scheduler (CFS) doesn't use standard queues. Instead, it balances runnable tasks in a time-ordered red-black tree. The task with the smallest `vruntime` (virtual runtime) resides at the left-most node—always ready to be picked next.
vruntime Tracking
Every cycle a task spends on the CPU increases its `vruntime`. Tasks with higher priority (lower nice value) see their `vruntime` increase slower—giving them effectively more "fair" time on the processor.
Preemption Latency
The maximum delay between a task becoming runnable and actually running. Tuned for sub-millisecond response in interactive workloads.
Load Balancing
Pushing and pulling tasks across runqueues to ensure even distribution across logical cores and NUMA nodes.
"Fairness is not a feeling; it's a calculated O(log N) property of the runqueue."
Lockless RCU.
Read-Copy-Update (RCU)
Scaling to thousands of cores without contention. RCU allows many readers to access a data structure simultaneously without taking any locks, while writers perform updates by creating clones.
Writer Logic (Abstract C)
// CONTENTION IS THE DEBT OF SHARED STATE
KVM &
VM-Exits.
Hardware Assist
Leveraging Intel VT-x and AMD-V to run guest code at near-native speeds. The kernel acts as a traffic controller, catching "sensitive" instructions via VM-Exits.
Virtio Rings
Bypassing device emulation. Virtio provides a standardized interface for guest-to-host communication via shared memory ring buffers.
Context Switch Visual
Exit Reasons (Abstract)
"A hypervisor is just a kernel that manages other kernels."
LSM Bound.
Landlock.
The next evolution in process isolation. Fine-grained, userspace-driven sandboxing using modern Linux Security Modules.
Landlock Hooking
Unlike traditional LSMs like AppArmor, which require root to load profiles, Landlock allows an unprivileged process to restrict its own access to the file system.
Namespace Pivot
Using `unshare()` and `pivot_root()` to create a detached view of the system. Implementing private mounts, network stacks, and user mappings without heavy virtualization.
// ISOLATION IS THE ONLY TRUTH
Reverse
Engineering
Deep binary analysis, malware dissection, and vulnerability discovery. From x86/ARM disassembly to advanced decompilation and exploit development.
Static Analysis
IDA Pro, Ghidra, Binary Ninja for control flow reconstruction
Dynamic Analysis
GDB, WinDbg, Frida for runtime instrumentation and hooking
Malware Analysis
Sandbox evasion detection, C2 protocol reverse engineering
Fuzzing &
Exploits
Coverage-guided fuzzing, exploit development, and vulnerability discovery. From AFL++ instrumentation to custom mutators and proof-of-concept exploits.
Coverage-Guided Fuzzing
AFL++, LibFuzzer with custom mutators and dictionaries
Exploit Development
ROP chain construction, heap exploitation, kernel exploits
Triage & Analysis
Automated crash analysis with ASAN, UBSAN, and Valgrind
UEFI &
Secure Boot.
Defending the foundation. Validating the entire boot chain from the BIOS to the kernel using cryptographic signatures. Preventing bootkits from persisting in the motherboard's SPI flash memory.
Firmware Verification Chain
Covert
Channels.
Information Leakage.
When the infrastructure itself becomes the carrier. Smuggling data across network boundaries by modulating timing, packet loss, or unused fields in DNS, ICMP, and TCP headers that egress filtering ignores.
Packet Timing Modulation
Heap
Grooming.
Turning chaos into predictability. By carefully spraying allocations and deallocations, we can "groom" the memory layout of the browser or kernel heap to place controllable data exactly where an exploit needs it.
Use-After-Free
Exploiting the gap between object destruction and pointer nullification. Grooming ensures the "freed" slot is immediately occupied by attacker-controlled data.
Heap Spraying
Exhausting the memory allocator to force new allocations into predictable regions, bypassing basic ASLR and guard-page protections.
Grooming Outcome: Success
Return-Oriented
Programming.
The Art of Gadgets
Bypassing Data Execution Prevention (DEP/NX) by repurposing existing code within the target process. By chaining together small snippets of code (gadgets) ending in a `ret` instruction, we can execute arbitrary logic without injecting any shellcode.
Chain Construction
"Code reuse is the most efficient form of malware."
Block
Layer CoW.
Moving away from "overwrite-in-place." Modern systems use Copy-on-Write to ensure atomic snapshots and data integrity. If the power drops, the system remains consistent.
MQ Dispatching
Scaling multi-queue block devices to avoid software bottlenecks. Each core has its own submission queue, minimizing lock contention during high-IOPS NVMe operations.
Checksum Verification
Silent bit-rot detection. The block layer stores a hash of the data alongside the block itself, verifying the integrity on every read.
VFS Cache Layer
Page cache abstracts physical storage, providing unified access to files via memory address space.
Bi-directional I/O
Scatter/Gather lists allow the transfer of non-contiguous physical memory chunks into a single storage operation.
Write-Back Merging
The kernel merges adjacent write requests in time, reducing seek overhead and increasing overall throughput.
/* CONSISTENCY IS A PROPERTY OF THE DATA, NOT THE PHYSICAL DRIVE */
Silicon.
DMA Engines
Offloading the processor. Direct Memory Access (DMA) allows peripherals to read and write to system memory without taxing the CPU, enabling Gbps-scale throughput in modern networking and storage.
Bus Mastery Diagram
MSI-X Interrupts
Message Signaled Interrupts avoid the sharing problems of traditional pin-based signals. Each multi-queue NIC can trigger a specific vector to notify the exact CPU core responsible for the data.
IOMMU Protection
The "MMU for devices." IOMMU restricts peripherals to specific memory ranges, preventing "malicious" hardware (or buggy drivers) from writing outside their designated buffers.
TLP
"At the end of the day, everything is just a pointer to a silicon register."
Timing is
Everything.
Hacking without touching the logic. Measuring nanosecond differences in response times to leak cryptographic keys from the CPU cache. Even "correct" code can be vulnerable to its own execution speed.
"A microsecond is an eternity to a modern processor."
Cache-Line Visualization
// Constant-time implementation required to prevent leak.
// mask = -(a == b);
// result = (select_a & mask) | (select_b & ~mask);
Glitch.
Voltage Injection
Hacking the physics of the chip. By dropping the VCC voltage for a few nanoseconds at the precise moment a cryptographic check is performed, we can flip a single bit in the CPU's internal pipeline, causing a `branch_if_equal` to always succeed.
Clock Glitching
Injecting double-pulses into the clock line to force the processor to skip instructions, bypassing security checks entirely.
Electromagnetic Faults
Using high-power magnetic pulses (EMFI) to induce currents in the silicon die from a distance, flipping bits without physical contact.
"The hardware is the final authority, but the hardware is not a god. It obeys the laws of physics, and physics can be glitched."
BadUSB &
HID Shadows.
The Physical Trust Problem
When the machine trusts the hardware implicitly. By emulating a Human Interface Device (HID), a malicious micro-controller can type passwords, execute scripts, and exfiltrate data at 1000 WPM, bypassing all software-based network protections.
"If an attacker has physical access to the device, it's no longer the user's device."
Selected Work
2023—2025
Kernel-Informed Scaling
eBPF-Based MLOps
High-Freq Observability
Low-Latency Gateway
I make complex systems work in production. Not in demos. Not in staging. Production.
Eight years building infrastructure that handles millions of requests. I've migrated monoliths to Kubernetes, built MLOps platforms from scratch, and reduced cloud bills by 40%.
My philosophy is simple: automate everything, measure everything, and make complexity disappear. When systems "just work," that's when the engineering is solid.