Multipart Form Uploads - Busboy and Node Streams
I was recently investigating ways to improve the efficiency of file uploads to a Node.js server. This need arose after encountering a production bug where the absence of a maximum file size limit for uploads led to an out-of-memory crash due to file buffers consuming excessive heap memory. In this Node.js server, I was using Express and express-openapi-validator to document the server’s API with an OpenAPI specification. express-openapi-validator utilizes multer for file uploads. I had previously encountered this library whenever file uploads from forms needed to be handled in Node.js, but I never questioned why a separate library was necessary for file uploads. This time, I decided to go deeper to understand if a dedicated package for file uploads is truly needed, and if so, what specific benefits Multer or similar libraries provide. I initially needed to find a configuration option in express-openapi-validator to set a request-wide limit on the maximum size (in bytes) of data allowed in a request, including all file attachments. The express-openapi-validator package offers a fileUploader configuration (fileUploader documentation) that passes options directly to multer. ...
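Since the excerpt above is about finding a size limit, here is a minimal sketch (not the post's actual config) of how multer options can be forwarded through express-openapi-validator's fileUploader setting. The spec path and limit values are made up for illustration, and note that multer's fileSize limit applies per file rather than to the request body as a whole.

```ts
// A minimal sketch: passing multer options through express-openapi-validator's
// `fileUploader` setting. Paths and numbers here are example values only.
import express from 'express';
import * as OpenApiValidator from 'express-openapi-validator';

const app = express();

app.use(
  OpenApiValidator.middleware({
    apiSpec: './openapi.yaml',        // hypothetical spec location
    fileUploader: {
      // Forwarded directly to multer: `fileSize` caps each uploaded file,
      // `files`/`fields` cap how many parts a request may carry.
      limits: {
        fileSize: 5 * 1024 * 1024,    // 5 MB per file (example value)
        files: 3,
        fields: 20,
      },
    },
  }),
);
```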
Building Fault Tolerant KV Storage System - Part 2
We previously discussed replicated state machines, leader election in the RAFT consensus algorithm, and log-based state maintenance. Now, we’ll focus on log replication across peer servers. We’ll also examine how RAFT ensures that the same commands are applied to the state machine at a given log index on every peer, a guarantee that follows from the leader’s one-way distribution of log entries to followers. Leader Initialisation Once a candidate becomes leader, we call the setupLeader function, which starts a goroutine for each peer in the RAFT cluster. Each peer’s goroutine is responsible for replicating new log entries or sending heartbeats via the AppendEntries RPC. ...
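The series implements this in Go, with one goroutine per peer; purely to illustrate the structure described above, here is a hypothetical TypeScript sketch where each peer gets its own asynchronous replication loop. The names (Peer, sendAppendEntries, the exact shape of setupLeader) are illustrative rather than the post's actual code, and a real leader would also step down on seeing a higher term and retry with a decremented nextIndex when a follower rejects entries.

```ts
// Hypothetical sketch of a per-peer replication loop (the post uses Go goroutines).
interface LogEntry { term: number; command: string }

interface Peer {
  id: string;
  nextIndex: number;                  // next log index to send to this peer
  sendAppendEntries(term: number, entries: LogEntry[]): Promise<boolean>;
}

class Leader {
  constructor(private term: number, private log: LogEntry[], private peers: Peer[]) {}

  // Called once a candidate wins the election: start one loop per peer.
  setupLeader(heartbeatIntervalMs = 100): void {
    for (const peer of this.peers) {
      void this.replicateLoop(peer, heartbeatIntervalMs);
    }
  }

  // Send any new entries; an empty AppendEntries doubles as a heartbeat.
  private async replicateLoop(peer: Peer, intervalMs: number): Promise<void> {
    while (true) {
      const entries = this.log.slice(peer.nextIndex);
      const ok = await peer.sendAppendEntries(this.term, entries);
      if (ok) peer.nextIndex = this.log.length;
      await new Promise((r) => setTimeout(r, intervalMs));
    }
  }
}
```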
Building Fault Tolerant KV Storage System - Part 1
This blog post is part of a series detailing my implementation of a fault-tolerant key-value server using the RAFT consensus protocol. Before diving into RAFT and its mechanics, it’s crucial to grasp the concept of a replicated state machine and its significance in building systems that are both fault-tolerant and highly available. Replicated State Machine A replicated state machine is essentially a collection of identical machines working together. One machine acts as the “master,” handling client requests and dictating the system’s operations. The other machines, the “replicas,” diligently copy the master’s state. This “state” encompasses all the data necessary for the system to function correctly, remembering the effects of previous operations. Think of a key-value store: the state would be the key-value pairs themselves, replicated across all machines to maintain consistency and resilience. Having the master’s state copied across multiple machines enables us to use these replicas in several ways. They can take over as the new master if the original fails, or handle read operations to reduce the load on the master. Because the state is copied across machines, network issues like latency, packet loss, and packet reordering significantly affect how closely a replica’s state matches the master’s. The difference between the master’s (most up-to-date) state and a replica’s state is known as replication lag. Generally, we aim to minimize this lag, and different systems offer varying levels of consistency (which we will discuss later when covering replication in RAFT). ...
Map Reduce
This article shares learnings from Google’s influential MapReduce paper and explores the challenges encountered while implementing a simplified version. Our system uses multiple worker processes, running on a single machine and communicating via RPC, to mimic key aspects of a distributed environment. What is Map-Reduce At its core, MapReduce is a programming model and an associated framework for processing and generating massive datasets using a parallel, distributed algorithm, typically on a cluster of computers. You might already be familiar with map and reduce operations from functional programming languages. For instance, in JavaScript, array.map() transforms each element of an array independently based on a mapper function, while array.reduce() iterates through an array, applying a reducer function to accumulate its elements into a single output value (e.g., a sum, or a new, aggregated object). ...
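To make the map/reduce analogy concrete, here is a small word-count example (mine, not the post's): the "map" step emits (word, 1) pairs per document and the "reduce" step folds them into totals, which is the same shape the MapReduce framework distributes across workers.

```ts
// Word count expressed as a map phase followed by a reduce phase.
const documents = ['the quick brown fox', 'the lazy dog'];

// "Map" phase: emit a (key, value) pair per word.
const pairs = documents.flatMap((doc) =>
  doc.split(/\s+/).map((word) => [word, 1] as const),
);

// "Reduce" phase: accumulate values for each key.
const counts = pairs.reduce<Record<string, number>>((acc, [word, n]) => {
  acc[word] = (acc[word] ?? 0) + n;
  return acc;
}, {});

console.log(counts); // { the: 2, quick: 1, brown: 1, fox: 1, lazy: 1, dog: 1 }
```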
Debugging Redis Latency
This article is about how, at work, we solved the issue of high response times when executing Redis commands from a Node.js server against a Redis-compatible database known as Dragonfly. Background After introducing metrics to our Node.js service, we started recording the overall response time whenever a Redis command was executed. We had a wrapper service around a Redis driver known as ioredis for interacting with our Redis-compatible database. Once we set up Grafana dashboards for metrics like cache latency, we saw unusually high p99 latency numbers, close to 200ms. This is a very large number, especially considering the underlying database query itself typically takes less than 10ms to complete. To understand why this latency was so high, we needed more detailed insight than metrics alone could provide. As part of a broader effort to set up our observability stack, I had been exploring various tracing solutions – options ranged from open-source SDKs (OpenTelemetry Node.js SDK) with a self-deployed trace backend, to third-party managed solutions (Datadog, Middleware, etc.). For this investigation, we decided to proceed with a self-hosted Grafana Tempo instance to test the setup and feasibility. (So far, the setup is working great, and I’m planning a detailed blog post on our observability architecture soon). With tracing set up, we could get a waterfall view of the path taken by the service while responding to things like HTTP requests or event processing, which we hoped would pinpoint the source of the delay in our Redis command execution. ...
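As an illustration of the kind of instrumentation the excerpt describes, here is a hypothetical sketch of a wrapper that times each ioredis call and records the duration; recordCacheLatency stands in for whatever metrics client the service actually uses and is not the post's code.

```ts
// Hypothetical latency instrumentation around an ioredis command.
import Redis from 'ioredis';
import { performance } from 'node:perf_hooks';

const redis = new Redis(); // assumes a locally reachable Redis/Dragonfly instance

function recordCacheLatency(command: string, ms: number): void {
  // In a real service this would feed a histogram metric for dashboards.
  console.log(`cache_latency{command="${command}"} ${ms.toFixed(1)}ms`);
}

async function timedGet(key: string): Promise<string | null> {
  const start = performance.now();
  try {
    return await redis.get(key);
  } finally {
    recordCacheLatency('get', performance.now() - start);
  }
}
```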
Socket File Descriptors and TCP Connections
Socket File Descriptors and Their Kernel Structures A socket is a special type of file descriptor (FD) in Linux, represented as socket:[inode]. Unlike regular file FDs, socket FDs point to in-memory kernel structures, not disk inodes. The /proc/<pid>/fd directory lists all FDs for a process, including sockets. The inode number of a socket can be used to inspect its details via tools like ss and /proc/net/tcp. Example: Checking Open FDs for Process 216 ls -l /proc/216/fd Output: ...
Understanding Inodes and Disk Layout
Overall Organization Of Data In Disks Assume we have a 256 KB disk. Disk Blocks: The basic units of storage on the disk, each 4 KB in size. The disk is divided into these blocks, numbered from 0 to N-1 (where N is the total number of blocks). Inode Bitmap (i): Block 1; a bitmap tracking which inodes are free (0) or in-use (1). Data Bitmap (d): Block 2; a bitmap tracking which data blocks are free (0) or allocated (1). Inode Table (I): Blocks 3-7; an array of inodes, where each inode (256 bytes) holds metadata about a file, like size, permissions, and pointers to data blocks. Five 4 KB blocks hold 80 inode structures of 256 bytes each (16 per block). Data Region (D): Blocks 8-63; the largest section, storing the actual contents of files and directories. Inode Every inode has a unique identifier called an inode number (or i-number). This number acts like a file’s address in the file system, allowing the operating system to quickly locate its inode. For example: ...
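A quick back-of-the-envelope check of the layout arithmetic above (my numbers, matching the 256 KB example disk):

```ts
// Sanity-checking the example disk layout described above.
const BLOCK_SIZE = 4 * 1024;      // 4 KB blocks
const DISK_SIZE = 256 * 1024;     // 256 KB disk
const INODE_SIZE = 256;           // bytes per inode
const INODE_TABLE_BLOCKS = 5;     // inode table occupies blocks 3-7

const totalBlocks = DISK_SIZE / BLOCK_SIZE;               // 64 blocks
const inodesPerBlock = BLOCK_SIZE / INODE_SIZE;           // 16 inodes per block
const totalInodes = INODE_TABLE_BLOCKS * inodesPerBlock;  // 80 inodes
const dataBlocks = totalBlocks - 8;                       // blocks 8-63 => 56 blocks

console.log({ totalBlocks, inodesPerBlock, totalInodes, dataBlocks });
```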
Files And Directories
Files and directories File systems virtualize persistent storage (e.g., hard drives, SSDs) into user-friendly files and directories, adding a third pillar to OS abstractions (processes for CPU, address spaces for memory). File Paths and System Calls Files are organized in a tree-like directory structure, starting from the root (/). A file’s location is identified by its pathname (e.g., /home/user/file.txt). To interact with files, processes use system calls: open(path, flags): Opens a file and returns a file descriptor (fd). read(fd, buffer, size): Reads data from the file into a buffer using the fd. write(fd, buffer, size): Writes data to the file via the fd. close(fd): Closes the file, freeing the fd. File Descriptors A file descriptor is a small integer, unique to each process, that identifies an open file. When a process calls open(), the operating system assigns it the next available fd (e.g., 3, 4, etc.). Every process starts with three default fds: ...
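As a small aside (not from the post), Node's fs module exposes the same open/read/close flow on a raw file descriptor, which makes the fd lifecycle easy to see; the path below is just an example:

```ts
// The open -> read -> close lifecycle of a file descriptor via Node's fs module.
import { openSync, readSync, closeSync } from 'node:fs';

const fd = openSync('/etc/hostname', 'r');   // open(path, flags) -> fd (example path)
const buffer = Buffer.alloc(64);
const bytesRead = readSync(fd, buffer, 0, buffer.length, 0); // read(fd, buffer, size)
console.log(`fd=${fd} read ${bytesRead} bytes:`, buffer.subarray(0, bytesRead).toString());
closeSync(fd);                               // close(fd) frees the descriptor
```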
RAID (Redundant Array of Inexpensive Disks)
RAID Disks Three axes on which disks are analysed: Capacity - how much capacity is needed to store X bytes of data; Reliability - how fault-tolerant the disk is; Performance - read and write speeds (sequential and random). To make a logical disk (comprising a set of physical disks) reliable we need replication, so there is a tradeoff with capacity and performance (write amplification). A collection of physical disks presented as a single logical disk also includes a small amount of compute and some non-volatile RAM to complete the disk controller component. This RAM is also used as a write-ahead log (WAL) for faster writes, much like in databases. In a way, this set of disks faces challenges similar to those of distributed databases. ...
Multilevel Page Table
Segmented Page Table A page table can grow large: for a 32-bit address space with a 4 KB page size, 20 bits are used for the virtual page number, giving 2^20 page table entries (about 4 MB per page table at 4 bytes per entry). Each process has its own page table, so we could easily be storing hundreds of MB for page tables alone, which is not good. In the example page table above, with 4 bits for the VPN (virtual page number), only VPNs 0, 4, 14 and 15 are valid, i.e. pointing to a PFN (physical frame number); the other PTEs (page table entries) just take up space without being used. We can use segmentation here, with base and bound registers for each page table, to store only valid PTEs in the table. This splits the virtual address further to also contain segment bits identifying which segment the address belongs to (code, heap or stack). Instead of using a single Base Page Table Register to find the page table, we now use Base Page Table Register[Segment] to get the physical address of the page table for a given segment. ...
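A quick check of the sizing arithmetic above (mine, not the post's): with 4 KB pages, 12 bits go to the page offset, leaving 20 VPN bits, i.e. 2^20 entries of 4 bytes each, or 4 MB per process.

```ts
// Back-of-the-envelope page table sizing for a 32-bit address space.
const VA_BITS = 32;
const PAGE_SIZE = 4 * 1024;        // 4 KB pages -> 12 offset bits
const PTE_SIZE = 4;                // bytes per page table entry

const offsetBits = Math.log2(PAGE_SIZE);        // 12
const vpnBits = VA_BITS - offsetBits;           // 20
const entries = 2 ** vpnBits;                   // 2^20 = 1,048,576 PTEs
const tableBytes = entries * PTE_SIZE;          // 4 MB per process

console.log({ vpnBits, entries, tableMB: tableBytes / (1024 * 1024) }); // => 4 MB
```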