A Micro Scalable Architecture

Gerrit Riessen
10 min read · Feb 25, 2019

Micro Scaling is the principle of scaling micro services to maximise resource utilisation. Having multiple scaling strategies makes it possible to achieve improved resource utilisation. By scaling strategies, I am referring to how and which micro services are scaled. For example, a service might be scaled back to free resources to scale another service up.

Why is maximising resource utilisation important? Because cloud computing providers rent servers, not resources, and each server has a fixed amount of resources, so utilising all of those resources minimises cost. Optimally utilising resources should also improve performance; services aren’t scaled merely to optimise resource utilisation. I am following the twin goals of optimal resource utilisation and maximum performance.

Photo by Zbynek Burival on Unsplash

To illustrate micro scaling, I will describe an architecture for sending large, heterogeneous collections of Http requests, along with some scaling strategies for it. I will first discuss some of the intricacies involved in sending Http requests and, in particular, the unpredictable nature of response times. After that, I will discuss different approaches to sending Http requests and handling their responses. The last two parts present the architecture of the system and some scaling strategies for optimising performance and resource utilisation.

Hypertext Transfer Protocol

Hypertext Transfer Protocol (Http) is a request/response protocol: the client sends a request to the server and waits for a response. It is analogous to a Q&A: you ask a question and wait for an answer. Sometimes the question has to be asked again or clarified. It is the same with Http: ask, wait, answer, rinse & repeat.

It is the waiting that is the problem: waiting times are unpredictable and inconsistent. But since the response tells the client whether the request succeeded, it is important to wait for it and evaluate it.
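As a minimal illustration of the ask-and-wait cycle, the sketch below times a single request and treats a slow server as a timeout. The URL and the five-second limit are arbitrary assumptions for the example, not part of the architecture described later.

```python
import time
import requests  # assumes the requests library is installed

URL = "https://example.com/api"  # placeholder endpoint

def timed_request(url, limit=5.0):
    """Send one request and report how long the waiting took."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=limit)
        return response.status_code, time.monotonic() - start
    except requests.exceptions.Timeout:
        # The server did not answer within the limit -- the great unknown.
        return None, time.monotonic() - start

if __name__ == "__main__":
    status, waited = timed_request(URL)
    print(f"status={status} waited={waited:.2f}s")
```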

Response Times: The Great Unknown

The diagram demonstrates the issue: unpredictable response times. Response times depend on a number of factors, most of which the client cannot influence. Servers are notoriously unstable, leading to unknown response times.

Irregular response times make it hard to predict how long it will take to send large batches or streams of Http requests. This becomes important when requests need to be sent and responded to in a timely fashion. It is this problem that I will attempt to tackle and along the way, describe what I mean by micro scaling.

Three Approaches to Sending Http Requests

The first approach is to process them one after the other, the second is to send them all at once, and the third is a combination of the two. Choosing between them is a matter of resource requirements, complexity and time.

Three Approaches to sending Http Requests

(In the diagram, t represents the response time for the respective request.)

Serial is the simplest and uses the least resources. But it takes the most time, since each request is sent only after a response has been received for the previous one. This can easily lead to a build-up of requests, since response times are unpredictable and delays accumulate.

Parallel sends all the requests at the same time, which takes more resources, and complexity also increases since those resources need to be managed. In addition, the number of requests directly determines the amount of resources required. This can lead to resource bottlenecks and, in turn, to a build-up of requests.

Asynchronous is a combination of serial and parallel. Requests are sent in serial, while responses are handled in parallel. Of course, response handling is again limited by available resources. On the other hand, these resources can be better utilised by resource pooling, since resources aren’t directly associated with the request.

Decoupling requests and responses requires storing an association between the two, which makes asynchronous also the most complex approach. Without that association it would not be possible to retry a request, since the response determines whether to retry or not.
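To make the trade-offs concrete, here is a minimal sketch of the three approaches using Python’s asyncio and the aiohttp client. The URLs, the request-id tagging and the helper names are my own assumptions for illustration, not part of the architecture described here.

```python
import asyncio
import aiohttp  # assumes the aiohttp client library is installed

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholder endpoints

async def fetch(session, url):
    async with session.get(url) as resp:
        return resp.status

async def serial(urls):
    # One request at a time: simplest and cheapest, but delays accumulate.
    async with aiohttp.ClientSession() as session:
        return [await fetch(session, url) for url in urls]

async def parallel(urls):
    # Everything at once: fastest, but resource use grows with the batch size.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

async def tagged_fetch(session, request_id, url):
    # Carry the request id with the call so the response can be matched
    # back to its request later (and retried if necessary).
    return request_id, await fetch(session, url)

async def asynchronous(urls):
    # Fire requests off one after the other, handle responses as they arrive.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(tagged_fetch(session, i, url))
                 for i, url in enumerate(urls)]
        results = {}
        for finished in asyncio.as_completed(tasks):
            request_id, status = await finished
            results[request_id] = status
        return results

if __name__ == "__main__":
    print(asyncio.run(asynchronous(URLS)))
```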

Complexity and time are both hard to optimise: complexity can be abstracted and time can be saved, so instead I will focus on optimising resource utilisation. As discussed above, this is important since resources are still sold in the form of servers, which implicitly batch resources.

Optimising Resource Utilisation

Let’s begin with a visualisation of resource allocation over time, where the workload is irregular and unpredictable, making resource optimisation, and in turn resource utilisation, a non-trivial exercise.

Resource allocation for irregular Workload

Resource allocation also includes resources that have been allocated but are not being used. This is especially clear for the upper dashed-blue line, which represents over-provisioning: it covers peak workloads but wastes a lot of resources at other times. The lower dashed-blue line, on the other hand, represents under-provisioning: the system experiences performance issues during peak workloads but is, on average, in a stable state. Having dedicated servers in a datacenter is a typical example of this kind of static resource allocation.

The dotted-red line represents actively allocating resources according to workload requirements. Characteristically, allocation is always slightly behind the increase in workload, and deallocation of resources is delayed since resource releasing can require manual intervention. This is something that can happen when using a cloud computing provider for example.

The solid-green line represents resource allocation that tracks the workload. It’s still a little behind the workload jumps, since increases in workload need to be detected, however it is closer to being optimal than anything else. In part, this is something that Kubernetes enables and where Serverless Cloud Computing is heading.

For a micro service architecture that sends streams of Http requests, the aim is to reach the green line by scaling micro services to optimise resource utilisation.

Architecture

The server setup is a typical collection of servers at a typical cloud computing provider. Each server has its own resources, so the total resources are fragmented into the amounts free on each server. On top of the servers, Kubernetes is installed, and the architecture uses specific features of Kubernetes to enable micro scaling. The aim of this architecture is to scale with the workload and use all available resources, i.e., to optimise resource utilisation.

The architecture is designed to accept Http requests as serial streams and respond asynchronously as requests are completed. Completed means that a single request either succeeded, or was retried until it succeeded or finally failed. Failure can have several causes: the request hit a server error, the request was malformed, the server was unreachable, or the request exhausted all its retries.

Http Request Handling Architecture

A quick run-through, starting on the top left hand side: the input is a serial stream of Http requests. Serial is fine here because the response time is guaranteed: the Proxy component responds immediately, and requests are processed after responding to the client. Each request is pushed into a Redis database for a Proxy Store component to handle.
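As a minimal sketch of the Proxy idea, assuming Flask and the redis-py client (the endpoint, queue name and payload shape are illustrative, not taken from the actual system), the handler acknowledges the request immediately and leaves the real work to whoever consumes the Redis list:

```python
import json
import uuid

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
store = redis.Redis()  # assumes Redis is reachable on localhost

@app.route("/requests", methods=["POST"])
def accept_request():
    # Respond immediately with an id; processing happens after the response.
    payload = {"id": str(uuid.uuid4()), "request": request.get_json()}
    store.lpush("proxy:incoming", json.dumps(payload))
    return jsonify({"accepted": payload["id"]}), 202
```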

The Proxy Store component takes requests from Redis and moves them into the Kafka requests instance. Along the way, requests are annotated with initial retry counts and exponential backoff times. This ensures that all requests are self-contained and have no interdependencies with the system or other requests.
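A sketch of that hand-over, assuming the redis-py and kafka-python clients (the queue, topic and field names are my own placeholders):

```python
import json

import redis
from kafka import KafkaProducer  # assumes kafka-python is installed

store = redis.Redis()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

def proxy_store_loop():
    while True:
        # Block until the Proxy has pushed a request into Redis.
        _, raw = store.brpop("proxy:incoming")
        envelope = json.loads(raw)
        # Annotate so the request is self-contained: every later component
        # can decide about retries without consulting anything else.
        envelope["retries_left"] = 3
        envelope["backoff_seconds"] = 1  # doubled on every retry
        producer.send("requests", envelope)
```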

Next, the Parc component (top right corner) takes requests from Kafka and makes the actual requests to the external servers. Kafka decouples the Proxy Store and the Parc components: there is no direct dependency between them, and the workload can be sharded across requests.

Each Parc consists of four separate components that can also be scaled independently. Redis is the central component and decouples the other three “satellite” components. The Incoming component takes requests from Kafka and pushes them into Redis; the Caller component retrieves the requests, performs them and pushes the responses back into Redis. Finally, the Outgoing component interfaces with both Kafka instances: requests that should be retried are pushed into Kafka requests, while responses that are complete are pushed to Kafka responses.
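A sketch of the Caller’s inner loop, assuming redis-py and the requests library (the list names and the response envelope are illustrative assumptions):

```python
import json

import redis
import requests

store = redis.Redis()

def caller_loop():
    while True:
        # Pull the next request the Incoming component placed into Redis.
        _, raw = store.brpop("parc:todo")
        envelope = json.loads(raw)
        try:
            resp = requests.get(envelope["request"]["url"], timeout=10)
            envelope["attempts"] = envelope.get("attempts", []) + [resp.status_code]
            envelope["done"] = resp.ok
        except requests.RequestException as exc:
            envelope["attempts"] = envelope.get("attempts", []) + [str(exc)]
            envelope["done"] = False
        # Hand the outcome back to Redis for the Outgoing component.
        store.lpush("parc:done", json.dumps(envelope))
```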

Requests that need retrying aren’t retried directly because they have an exponential retry backoff time. They remain in Redis until they are ready to be retried, and are then pushed back into Kafka so that another Parc component may handle them; retries don’t need to be handled by the same Parc.
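One way to sketch this waiting room is a Redis sorted set scored by the retry-due time, with the Outgoing component periodically pushing everything that has become due back into the Kafka requests topic. The names and the scheduling policy here are assumptions, not the system’s actual implementation:

```python
import json
import time

import redis
from kafka import KafkaProducer

store = redis.Redis()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

def schedule_retry(envelope):
    # Double the backoff and park the request until it is due again.
    envelope["retries_left"] -= 1
    envelope["backoff_seconds"] *= 2
    due_at = time.time() + envelope["backoff_seconds"]
    store.zadd("parc:retry", {json.dumps(envelope): due_at})

def flush_due_retries():
    # Requests whose backoff has expired go back to Kafka; any Parc may pick them up.
    for raw in store.zrangebyscore("parc:retry", 0, time.time()):
        store.zrem("parc:retry", raw)
        producer.send("requests", json.loads(raw))
```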

Completed requests that have been pushed to Kafka responses are then handled by a Responder component (bottom right) that sends them back to the user. As requests are self-contained, the responses of all attempts to handle a request are attached to the final response, so the client can analyse exactly what happened while handling their request. In addition, since the original request is contained in the response sent to the client, the client does not need to maintain an association between request and response.

Micro Scaling: Scheduler Component

The Scheduler component is responsible for determining resource availability and implementing various scaling strategies. Importantly, it never uses more resources than are available: Kubernetes will randomly shut down services when this happens, making the entire system unstable.

Each component is small in terms of resource requirements, and they follow the Unix philosophy of doing only one thing and doing it well. Kubernetes allows base and upper resource requirements (requests and limits) to be defined for components. It is important that components define these accurately and keep the range small, else micro scaling becomes impractical.
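For illustration, this is roughly how such bounds could be declared with the official Kubernetes Python client; the container name, image and the concrete numbers are placeholder assumptions:

```python
from kubernetes import client  # assumes the kubernetes Python client is installed

# A narrow gap between requests (base) and limits (upper bound) keeps the
# Scheduler's resource arithmetic predictable.
caller_container = client.V1Container(
    name="parc-caller",
    image="example/parc-caller:latest",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "50m", "memory": "64Mi"},
        limits={"cpu": "100m", "memory": "96Mi"},
    ),
)
```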

Components managed by the Scheduler

Besides querying resource availability, the Scheduler also monitors all Redis and Kafka instances. Included, for example, are Redis queue lengths and read-lags in consuming requests from Kafka. It needs these metrics to decide when to scale components.

Components that aren’t managed by the Scheduler are assumed to be already running. That is, the Proxy, the Proxy Redis and both Kafka instances are already available. Parc components are completely managed over their lifecycle; this includes the Parc-specific Redis instances and the individual Parc components.

Initial scaling begins with the Proxy Stores, which are the first components to be scaled up when requests come in. When there is a build-up of requests in the Proxy Redis, the Scheduler spins up Proxy Stores to handle those requests. In turn, there will be a build-up of requests in Kafka, causing the Scheduler to spin up Parc components. Note: the Scheduler maintains no state of its own; it works from the state of the system.
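A deliberately naive sketch of such a stateless decision loop, assuming redis-py for the queue length and a hypothetical scale_deployment() helper standing in for a call to the Kubernetes API (the thresholds and replica counts are invented):

```python
import time

import redis

store = redis.Redis()

def scale_deployment(name, replicas):
    # Hypothetical helper: in a real Scheduler this would patch the
    # deployment's replica count through the Kubernetes API.
    print(f"scaling {name} to {replicas} replicas")

def scheduler_loop():
    while True:
        # No state of its own: every decision is derived from the system.
        backlog = store.llen("proxy:incoming")
        if backlog > 1000:
            scale_deployment("proxy-store", 4)
        elif backlog < 10:
            scale_deployment("proxy-store", 1)
        # Kafka read-lag checks for Parc and Responder components would
        # follow the same pattern.
        time.sleep(5)
```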

Each Parc consumes from a single Kafka topic, and the number of topics is dynamic; initially there is a single topic. This allows the Scheduler to increase the number of Kafka topics to spread the workload among multiple Parc components. The final components to be scaled are the Responders: noticing a build-up of responses in the response Kafka, the Scheduler dynamically spins Responders up.
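Creating such an additional request topic could look roughly like this with kafka-python’s admin client (the topic name, partition count and replication factor are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic  # assumes kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

def add_request_topic(index):
    # A new topic gives the Proxy Stores another shard to write to,
    # which a freshly started Parc can then consume on its own.
    topic = NewTopic(name=f"requests-{index}", num_partitions=3, replication_factor=1)
    admin.create_topics([topic])
```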

Having started the core of the system, the Scheduler now begins to maintain and optimise the system for the current throughput. Scaling down is also part of what the Scheduler does: if throughput drops, components are scaled down.

Scaling Strategies

If the Scheduler notices that a Parc component isn’t keeping up with its Kafka request topic, it can spin up more Incoming handlers and Callers for that Parc. Alternatively, the Scheduler can spin up a completely new Parc by creating a new Kafka topic for the Proxy Stores to write to.

If the outgoing response queue is building up, the Outgoing components are scaled up. Since Responders send Http requests to external endpoints, they ironically face the same difficulties that this architecture was designed to solve. Retries are harder to handle, since they build up in the Parc Redis and are only dequeued once the time comes to retry them; there is no easy way to handle a build-up of retries.

Caller components can either be scaled individually or their parallelism can be increased. Callers use lightweight threads and can easily handle more requests. Increasing parallelism is preferable when response times are low and responses are small; with larger responses and longer response times, having more Parcs with Callers reduces the storage requirements of the individual Redis instances.
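One way to picture “increasing parallelism” within a single Caller is a concurrency limit that the Scheduler could raise or lower. This is only a sketch with asyncio; the default limit and the worker shape are assumptions:

```python
import asyncio

import aiohttp

async def run_caller(urls, parallelism=10):
    # The semaphore is the tunable: raising it lets one Caller instance keep
    # more requests in flight instead of starting additional instances.
    limit = asyncio.Semaphore(parallelism)

    async def call(session, url):
        async with limit:
            async with session.get(url) as resp:
                return url, resp.status

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(call(session, url) for url in urls))
```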

Finally, if there aren’t enough resources for scaling the components, the number of servers in the cluster can be increased, or alternatively a new cluster may be set up and traffic redirected there.

Summing Up

Ultimately the goal is to optimise the utilisation of available server resources. There will always be various approaches to doing this, and micro scaling is one of them. As always, there isn’t a one-size-fits-all solution for building software. As such, this is an experiment with an architecture and a problem that enable micro scaling. Whether I will find the optimal solution remains to be seen.

Micro scaling seems to work best if components are decoupled, have small resource footprints and are stateless. Statelessness allows components to be ephemeral, so they can easily be scaled up or down as needed. Small resource footprints allow the smallest slices of available resources to be used, and decoupling avoids “scaling chains”, whereby component A can only be scaled up if component B is available.

In addition, good monitoring of the entire system, to define the system state, and the development of scaling strategies are also vital.

Some components are fixed and always need to be available, especially those that maintain user state. These can’t be spontaneously scaled, which creates a base requirement in resources.

Postscript: Serverless Cloud Computing

Serverless Cloud Computing is a new paradigm which applies the concepts of functional programming to cloud computing. Basically, “units of code” (be they functions, servers or scripts) are executed when necessary and users pay only for that execution time. So it becomes impossible to waste resources, since resources are only used when necessary. It is functional in the sense that data is passed through and no longer directly stored.

In the end, the ideal solution is the one that comes as close as possible to ephemeralization!
