Linux namespaces and how Docker uses them internally

While using Docker it is quite natural to feel like we are dealing with a virtual machine: we can install things in a container, it has its own IP, its file tree is different, and whatnot. In reality, containers are just a mechanism that allows Linux processes to run with some degree of isolation and separation. Containers basically combine a few different Linux primitives, and Linux namespaces are what make most of this possible.

 

What are namespaces? Namespaces are a Linux kernel primitive that can be used to control the access and visibility of resources. Generally, when properties like names or permissions change, the change is visible only within the namespace and not to processes outside of it. So it looks like the process has its own isolated copy of a given resource. Docker uses these properties of namespaces to provide isolation. Namespaces can also map resources outside a namespace to resources inside it. Linux has at least 7 kinds of namespaces that cover different kinds of resources: Mount, UTS, IPC, PID, Network, User, and Cgroup. Sometimes we use these namespaces individually and sometimes we have to use them together, because types like IPC, Cgroup, and PID rely on file systems that need to be mounted. A namespace does not stay open when there is no running process in it (or a bind mount holding it). That’s why a Docker container gets killed when the process inside it dies; it comes down to how Linux treats PID 1, which I am going to discuss later in this blog.

How would I find out the namespaces of a process? We can find information about any process by looking at the proc virtual file system. If we look carefully we should find a directory called ns which contains symbolic links for each namespace the process belongs to. These symbolic links tell us the namespace type and the inode number that the kernel uses to identify it. By comparing inode numbers we can confirm whether two processes are part of the same namespace or not.

readlink /proc/$$/ns/uts
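
For example, a minimal sketch of comparing two processes (the second PID is hypothetical; substitute any process you like):

readlink /proc/$$/ns/uts        # e.g. uts:[4026531838]

readlink /proc/1234/ns/uts      # same inode number means same UTS namespace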

How can we use namespaces? A set of Linux syscalls is used to work with namespaces, most commonly clone, unshare, and setns. clone is a more general version of the fork system call that allows us to create a new child process; it is controlled by a set of flags that we can set if we want the new child to be created inside new namespaces. setns lets a process move itself from the namespace it currently resides in to another existing namespace, and unshare is a syscall that a running process can use to move into newly created namespaces; it supports the same flags as clone. Namespaces automatically close when nothing is holding them open, so a namespace ceases to exist when its last process stops or is killed (unless something like a bind mount keeps it alive). Creating a user namespace requires no privileges, but creating the other namespace types does, so from time to time you will need sudo. We can use tools like unshare, nsenter, and ip netns to create, switch to, or enter namespaces from the command line.

Every Docker container has its own hostname. Docker uses the UTS namespace to achieve this. UTS namespaces isolate two system identifiers: the node name (generally known as the hostname) and the NIS domain name. On a Linux system there might be multiple UTS namespaces; if a process changes the hostname, the change will be visible only to the other processes within the same namespace.

Let’s use the unshare command to create new namespaces and execute some commands inside them.

echo $$

readlink /proc/$$/ns/uts

sudo unshare -u bash

echo $$

readlink /proc/$$/ns/uts

hostname new_host_name

readlink /proc/$$/ns/mnt

nsenter -t PID -u bash
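
To confirm the isolation, compare the hostnames inside and outside the namespace (a minimal sketch; the shell started above is assumed to still be running):

hostname        # inside the namespace: new_host_name

From another terminal on the host, hostname will still print the original name.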

Another fascinating feature of a Linux container is that it makes us believe that every container has a completely different set of files, but in reality it uses Linux mount namespaces to provide a separate view of the container image mounted as the container’s root file system. A mount namespace allows us to isolate the set of mount points seen by a process. Mount points are just tuples that keep track of the device, its pathname, and the parent of the mount. By tracking parents, mount points help us maintain a single directory hierarchy. So across namespaces we can have processes with different sets of mount points, each seeing a completely different directory tree. The host file system is hidden from view, while volumes can still be used to share data across containers or between the host and a container; volumes are really just mounts added to the container’s mount namespace. There are also ways to create shared subtrees that allow us to propagate mount points across different namespaces.
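
Here is a minimal sketch of mount isolation (assuming util-linux unshare, which makes mounts in the new namespace private by default):

sudo unshare -m bash

mount -t tmpfs tmpfs /mnt

findmnt /mnt        # the tmpfs is visible here

exit

findmnt /mnt        # back on the host, the mount is gone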

Docker makes sure that from one container we can’t see the processes of another container. PID namespaces play a great role in this. A PID namespace isolates process IDs: we can only see the ID of a process when we are part of the same PID namespace. Since every PID in a namespace is unique within that isolated space, we don’t need to worry about potential ID conflicts. We can use PID namespaces in many ways; for example, if we want to freeze a process and move it to another machine, we can do that without worrying about PID conflicts on the destination. PID namespaces also allow each namespace to have its own PID 1. PID 1 is special on any UNIX system: it is the parent of all orphaned processes and is responsible for certain system initialization. The PID namespace is also different from the other namespaces in that it maintains a hierarchy, which can be nested up to 32 levels deep. Each PID namespace has a parent, the namespace of the process that created it via clone() or unshare(). When a fork happens the child stays at the same level, but when a clone with the new-PID flag happens the child goes one level lower. A process in a parent namespace can see all the processes in its descendant namespaces, but processes in a lower-level (child) namespace can’t see processes in the levels above. When PID 1 on the host goes down or gets killed, the Linux kernel panics; when PID 1 in a namespace goes away, the namespace is considered unusable.

sudo unshare -p -f bash
echo $$ 
ls /proc/

We are going to see many processes, but we should not see any, because the new PID namespace should not have any processes yet, right? The reason is that we unshared only the PID namespace and forgot about the mount namespace, so /proc is still the host’s proc mount and we still see all the processes visible in the original namespace.

Now let’s unshare mount and PID namespace together:

sudo unshare  -m -p -f bash

mount -t proc none /proc
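
Now the proc mount reflects the new PID namespace. A quick check (a sketch; exact PIDs and output will vary):

ps ax        # should list only bash as PID 1 and the ps process itself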

Docker typically uses a separate network namespace per container. A network namespace provides a separate view of the network, with its own network resources: routing tables, firewall rules, socket port numbers, some directories under /proc/net and /sys/class/net, and so on. Network namespaces can be connected using a virtual ethernet device pair, or veth pair, with one end in each namespace. These veth pairs are usually connected to a Linux bridge to enable outbound connectivity. Kubernetes pods and ECS tasks share a single network namespace, so all of the containers in the same pod or task get a unified view of the network and a single IP address. Every container has its own socket port number space, and with some NAT rules every container could run a web server on port 80.

ip link

sudo unshare --net bash

ip link

mkdir -p /var/run/netns

touch /var/run/netns/new

mount --bind /proc/$$/ns/net /var/run/netns/new

exit
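
Because the bind mount keeps the namespace alive, back on the host we can now address it with the ip netns tooling (a sketch, assuming the mount above succeeded):

ip netns list        # should show: new

sudo ip netns exec new ip link        # only the loopback interface exists inside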


docker run --name busybox --detach busybox

ps ax | grep busybox

sudo readlink /proc/PID/ns/net

docker inspect busybox | jq .[0].NetworkSettings.IPAddress

nsenter --target PID --net ifconfig eth0

ip link | grep veth

ifconfig vethxxxxxx
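
The host side of the veth pair is attached to Docker’s bridge; a quick way to list all interfaces enslaved to it (assuming the default bridge name docker0):

ip link show master docker0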

 

Since Docker containers are just a clever use of Linux namespaces, we can use all the namespace-related commands and tools to debug containers as well. For example, to enter namespaces we can use tools like nsenter and ip netns (or the setns syscall from code). To demonstrate, let’s run a Docker image, find the process ID of its main process, and then use nsenter to enter the namespaces Docker created and run commands inside them.

docker run --name redis --detach redis
ps ax | grep redis
nsenter --target PID --mount /usr/local/bin/redis-server



How Docker internally handles resource limits using Linux control groups

While using Docker it is quite natural to feel like we are dealing with a virtual machine: we can install things in a container, it has an IP, we can SSH into it, and we can allocate memory and CPU to it like any virtual machine. In reality, containers are just a mechanism that allows Linux processes to run with some degree of isolation and separation, built from a few different Linux primitives combined. If containers are just based on Linux primitives, how are we able to set limits on memory and CPU? The answer is already available in the Linux kernel: Docker leverages Linux control groups.

What are control groups? Control groups are commonly known as cgroups. Cgroups are the abstract framework in Linux for tracking, grouping, and organizing Linux processes. Every process in a Linux system is tracked by one or more cgroups. Typically cgroups are used to associate processes with resources: we can use them to track how much of a specific type of resource a particular group of processes is using. Cgroups play a big role when we are dealing with multitenancy, because they enable us to limit or prioritize specific resources for a group of processes. This is particularly necessary because we don’t want one tenant’s processes to consume all the CPU or, say, all the I/O bandwidth. We can also pin a process to a particular CPU core using cgroups.

We interact with the abstract framework of cgroups through subsystems. The subsystems are the concrete implementations that are bound to resources; some of the Linux subsystems are memory, CPU time, PIDs, devices, and network. Each subsystem is independent of the others and can organize its own processes separately, so one process can be part of several independent cgroups. All of the cgroup subsystems organize their processes in a structured hierarchy, and controllers are responsible for distributing specific types of resources along that hierarchy. Each subsystem has an independent hierarchy, and every task or process ID running on the host is represented in exactly one of the cgroups within a given subsystem hierarchy. These independent hierarchies allow advanced process-specific resource segmentation: for example, two processes can share a limit on the total amount of memory they consume while one of them gets more CPU time than the other.

This hierarchy is maintained as a directory and file structure. By default it is mounted under the /sys/fs/cgroup directory, but it can be mounted at another directory of choice as well. We can even mount cgroup hierarchies in multiple locations, which comes in handy when a single instance is used by multiple tenants and we want each tenant’s cgroups mounted in their own disk area. Mounting a cgroup filesystem can be done using the command:

mount -t cgroup2 none $MOUNT_POINT
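
To see where cgroup hierarchies are already mounted on the current system, a quick check:

mount | grep cgroup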

Now let’s explore the cgroup virtual filesystem:

ls /sys/fs/cgroup
ls /sys/fs/cgroup/devices

Some resource controllers apply settings from the parent level down to the child level; an example of such a controller is the devices controller. Others consider each level in the hierarchy independently; the memory controller, for example, can be configured this way. In each directory you’ll see a file called tasks, which holds the process IDs of all the processes assigned to that particular cgroup.

cat /sys/fs/cgroup/devices/tasks

It shows you all of the processes that are in this particular cgroup inside this particular subsystem. But suppose you have an arbitrary process and you want to find out which cgroups it’s assigned to; you can do that with the proc virtual file system, which contains a directory corresponding to each process ID.

ls /proc

To see our current process we can use the command:

 echo $$
cat /proc/<pid>/cgroup
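
Each line of this file has the form hierarchy-ID:controller:path, for example (illustrative output only; yours will differ):

12:pids:/user.slice
4:memory:/user.slice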

Let’s say I want to monitor a cgroup for how much memory it is using. We can read the virtual file system and see what it returns.

cat /sys/fs/cgroup/memory/memory.usage_in_bytes

What are these files and directories we see everywhere? They are just interfaces into the kernel’s data structures for cgroups. Each directory has the same structure: even if you create a new directory, the kernel automatically creates a matching set of files inside it. Let’s create a cgroup. To create one, we just need to create a directory; at least, this is how the kernel keeps track of it.

sudo mkdir /sys/fs/cgroup/pids/new_cgroup
ls /sys/fs/cgroup/pids/new_cgroup
cat /sys/fs/cgroup/pids/new_cgroup/tasks

The kernel keeps track of these processes using these directories and files, so adding or removing processes or changing any settings is nothing but changing the contents of the files.

cat /sys/fs/cgroup/pids/new_cgroup/tasks
echo $$ | sudo tee /sys/fs/cgroup/pids/new_cgroup/tasks

We have written one line to the tasks file, haven’t we? Now let’s see what is inside it:

cat /sys/fs/cgroup/pids/new_cgroup/tasks

 

We should see at least two PIDs, because when a new process is started it begins in the same cgroups as its parent process. When we run cat, our shell starts another process, and that process appears in the tasks file too.

 

We can limit the number of processes that a cgroup is allowed to run by modifying the pids.max file.

echo 2 | sudo tee /sys/fs/cgroup/pids/new_cgroup/pids.max

Now let’s try to run more than two processes; the cgroup will refuse to let the shell fork and the command will fail.
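
For example (a sketch; the exact error text varies by shell):

ls | cat

bash: fork: retry: Resource temporarily unavailable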

 

Now that we have a basic understanding of cgroups, let’s investigate the cgroups inside our Docker containers.

Let’s try to run a Docker container with its CPU shares set to 512 and explore the cgroups.

docker run --name demo --cpu-shares 512 -d --rm busybox sleep 1000

docker exec demo ls /sys/fs/cgroup/cpu

docker exec demo cat /sys/fs/cgroup/cpu/cpu.shares

 

So basically Docker is manipulating the same cgroup settings files that we just edited by hand. Interesting indeed. If it is not a virtual machine, these files should be visible on the host machine too, shouldn’t they? Yes, you got it right. Usually these files are located somewhere under /sys/fs/cgroup/cpu/docker, in a directory named after the container’s full ID (a long SHA-256 style hash).

ls /sys/fs/cgroup/cpu/docker
cat /sys/fs/cgroup/cpu/docker/<full-container-id>/cpu.shares
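
On hosts that use the cgroupfs driver, a hypothetical way to build that path directly from the container ID:

CONTAINER_ID=$(docker inspect --format '{{.Id}}' demo)

cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares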

The cgroup mechanism works hand in hand with namespaces. Once a container’s processes are placed in a cgroup, everything that is a member of that cgroup becomes controlled by it, and there is even a dedicated cgroup namespace that isolates a container’s view of the cgroup hierarchy.

Just a word of warning: before modifying anything in a cgroup we should keep in mind what dependencies there might be on the existing hierarchy. For example, Amazon ECS or Google’s cAdvisor use the existing hierarchy to know where to read CPU and memory utilization information.

Reflection on “Designing Event-Driven Systems” for microservices using Kafka

Previously in my blog I have talked about microservices. In a microservice architecture, service teams think of their software as being a cog in a far larger machine. Suddenly, for a service team, business people are not the only consumers: other applications consume the service too, and they really care that it is reliable. Applications start to become platforms. As we discussed previously, microservices do not share a single database across multiple applications, yet most of the time a service needs to access data owned by others, and it has to do so in a decoupled way, because the key concepts behind microservices are strong cohesion and low coupling. One of the suggested patterns for this is the event-driven design pattern. Thank you Confluent for offering me the free book on event-driven systems, “Designing Event-Driven Systems: Concepts and Patterns for Streaming Services with Apache Kafka” by Ben Stopford.

The beauty of events is that they decouple: no service orders another to do things, which reduces dependencies between services. Events also let a new component join the system without requiring modifications to the other services. There are even techniques that use events to decouple an application from a database that is growing huge. In short, an event-based approach fits whenever service-to-service data sharing is needed.

To make event-driven programming easier, Kafka has many useful features. Kafka has the Connect interface, which pulls and pushes data to a wide range of interfaces and datastores, and streaming APIs that can manipulate data on the fly. Those attributes make Kafka look a bit like request-response middleware such as an Enterprise Service Bus, but the comparison is superficial: ESBs focus on the integration of legacy and off-the-shelf systems using an ephemeral and comparably low-throughput messaging layer, while Kafka has a distributed cluster at its core that provides high availability, storage, and linear scale-out. Using Kafka we can store data, let users define queries and execute them over the data held in the log, and pipe the results into views that users can query directly. It also supports transactions, just like a database; it is often said that Kafka is a database turned inside out, a tool for storing data, processing it in real time, and creating views. Like most streaming systems, it implements the broad pattern where a set of streams is prepared and then work is performed one event at a time; joins, filters, and processing steps can all be implemented in Kafka Streams or KSQL.

Where does Kafka sit at the architectural level? A single Kafka cluster can be placed at the center of an organization. Kafka’s architecture inherits more from storage systems like HDFS, HBase, or Cassandra than from traditional messaging systems that implement JMS (Java Message Service) or AMQP (Advanced Message Queuing Protocol), and that makes it more scalable. The underlying abstraction is a partitioned log, sequentially appended to and spread across multiple computers with redundancy, designed to handle high-throughput streaming, mission-critical scenarios, and cases where ordering needs to be preserved. Its ability to store datasets removes the queue-depth problems that plagued traditional messaging systems. At the heart of the Kafka messaging system sits a partitioned, replayable log. This replayable-log approach has two primary benefits. First, it makes it easy to react to events that are happening now, with a toolset specifically designed for manipulating them. Second, it provides a central repository that can push whole datasets to wherever they may be needed. When a service wants to read messages from Kafka, it “seeks” to the position of the last message it read, then scans sequentially, reading messages in order while periodically recording its new position in the log. Taking a log-structured approach has an interesting side effect: both reads and writes are sequential operations. This makes them sympathetic to the underlying media, leveraging prefetch, the various layers of caching, and natural batching of operations, which in turn makes them efficient. In fact, when you read messages from Kafka, the server doesn’t even import them into the JVM (Java Virtual Machine); data is copied directly from the disk buffer to the network buffer (zero copy), an opportunity afforded by the simplicity of both the contract and the underlying data structure.

To ensure high availability, Kafka keeps redundant copies of the same log across multiple machines (usually three) as replicas, which makes it failure tolerant. When a machine goes down we do not lose any data, and when that machine comes back up it syncs with the other nodes and starts contributing to the cluster again, which makes Kafka not only reliable but also scalable; with Kafka, hitting a scalability wall is virtually impossible in the context of business systems. Even though Kafka keeps multiple copies of the same data across machines, if we need paranoid levels of durability for highly sensitive data we may require that data be flushed to disk synchronously, at a cost in throughput. If we absolutely have to rely on this approach, it is recommended to increase the producer batch size to increase the effectiveness of each disk flush. When no key is provided, the producer usually spreads data across the available partitions in a round-robin fashion, but when a key is provided with the message, a hash of the key determines the partition number. Kafka ensures that messages with the same key are always sent to the same partition, which provides strong ordering per key. Sometimes key-based ordering isn’t enough; to maintain global ordering, use a single-partition topic. When we read from and write to a topic, we are really reading from and writing to all of its partitions. By default, messages are retained for some configurable amount of time per topic. Kafka also has special topics called compacted topics, which store data with respect to a consistent key and work a bit like a simple log-structured merge tree (LSM tree): Kafka periodically scans through the topic for old messages that have been superseded (based on their key) and removes them. So if you find two records with the same key, don’t open a bug report right away. Although it’s not uncommon to see retention-based or compacted topics holding more than 100 TB of data, do not confuse Kafka with a database. Multitenancy is very common when we are dealing with multiple services, but it opens up the potential for inadvertent denial-of-service, causing degradation or instability of services. To solve this problem Kafka provides throughput quotas, which let us define the amount of bandwidth allocated to specific services so that everything operates within the boundaries of enforced service-level agreements (SLAs). Client authentication is provided through either Kerberos or Transport Layer Security (TLS) client certificates, ensuring that the Kafka cluster knows who is making each request.
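
As a small, hedged illustration of keys and replication using the command-line tools that ship with Kafka (the broker address, topic name, and key format are assumptions for the example):

kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 3 --bootstrap-server localhost:9092

kafka-console-producer.sh --topic orders --bootstrap-server localhost:9092 --property "parse.key=true" --property "key.separator=:"

Every record typed as key:value now goes to the partition chosen by hashing the key, so all events for the same order ID stay in order.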

Now let’s go back to event-driven programming. We are used to batch operations and sequential programming: when a database is connected to our software, we are used to an architecture where we ask a question and wait for an answer. But in a world where data multiplies like rabbits and our database may be huge, maybe we should not waste our time waiting for the answer; maybe we should ask, keep doing whatever else needs to be done, and let the database tell us, by issuing an event, when it is done producing the result we asked for. Nowadays, applications using event-driven architectures, Event Sourcing, and CQRS (Command Query Responsibility Segregation) use them as a way to break away from the pains of scaling database-centric systems. Maybe our database is not huge and the database-centric approach works wonderfully, but that works for individual applications, and we live in a world of interconnected systems. We need a mechanism for sharing data from one application to another that complements this complex, interconnected world, and with events we can constantly push data from one application into the others. To be event driven an application needs to react; a complex application may need to blend multiple streams together to make something meaningful out of them, build views, change state, and move itself forward.

One such pattern for service-to-service communication is the Event Collaboration Pattern. It allows a set of services to collaborate around a single business workflow, with each service doing its bit by listening to events and then creating new ones. For example, the basket service publishes an event to the OrderRequested topic when an order is placed. The order service is subscribed to OrderRequested; when it receives the event it validates the order and, once done, publishes an event to OrderValidated. The payment service is interested in order validations, so it subscribes to OrderValidated; when a new event arrives it tries to process the payment and publishes the result to OrderPaymentProcessed. The order service listens to OrderPaymentProcessed and confirms the order by publishing to OrderConfirmed. The shipping service follows that topic; it goes through many statuses, and every status change can be its own topic that interests the order service. Finally, when the order is delivered, the shipping service publishes to OrderDelivered, and the order service, subscribed to that topic, changes the status of the order to delivered. As we can see, every service is decoupled, and each one collaborates with the others by publishing events rather than making synchronous REST calls. If a service is not scaling properly it just accumulates unprocessed messages on its topics; it won’t crash the system or bring the whole thing down. That is the beauty of this pattern. One thing to note is that the longer the chain of events, the slower the end-to-end flow becomes, so to keep the system fast we need to embrace parallel execution.

Let’s talk about the shared database approach. Kafka’s log can make data available centrally as a shared source of truth but with the simplest possible contract, which keeps applications loosely coupled. Query functionality is not shared; it is private to each service, allowing teams to move quickly by retaining control of the datasets they use. For example, suppose that in our previous example both the order and email services are interested in the OrderDelivered event, and the email service sends an email based on it while the order service is lagging behind in updating the order. The user might then get an email with a link to a page that has not been updated yet, which makes the system look inconsistent; this is exactly the kind of consistency trade-off the CAP theorem describes. In situations like this, querying Kafka can come in handy: for a certain event we can look up the associated events and check whether they have been fulfilled before a service does what it is supposed to do. But, as mentioned previously in this blog, a paranoid style of data storage may be necessary in this approach.

A database has a number of core components, including a commit log, a query engine, indexes, and caching. Instead of conflating these components inside a single black box called a database, stream processing tools let us separate them; the parts can live in different places, joined together by the log. Kafka Streams can create indexes or views, and these views behave like a continuously updated cache living inside, or close to, your application. So in this approach, rather than letting our application ask for data, we push data to the application and let it process the tasks it needs to do.

Speaking of commands and queries: commands are actions, requests for something to happen; events are both a fact and a notification, with no expectation of any future action; queries are requests to look something up, expecting no side effects, only data in return. From the perspective of a service, events lead to less coupling than commands and queries. Command Sourcing is essentially a variant of Event Sourcing applied to the events that come into a service rather than to the events it creates. If our service keeps a database that is updated by recalculating and processing events, there are a few other benefits: if events are stored immutably, in the order they were created, the resulting event log provides a comprehensive audit of exactly what the system did. Another amazing thing about event-driven programming is that if at some point we lose a few of our services to a bug, the data they still need to process is not lost, because the events are stored in Kafka, and Kafka can store data persistently for as long as we configure it to. If we implement CQRS using Kafka, imagine how powerful that gets: since Kafka keeps all its logs in persistent storage, heaven forbid our database gets into some sort of trouble, we are not missing any data whatsoever; after rolling back to a backup, we can repopulate the rest of the data from the Kafka log. That is pretty neat.

In-process views with tables and state stores are an Event Sourcing and CQRS technique that can be implemented with Kafka’s Streams API, since it lets us implement a view natively, right inside the Streams API, without needing an external database. In the example above, when the OrderValidated event returns to the orders service, the database is updated with the final state of the order before the call returns to the user. One of the most reliable and efficient ways to feed such views is a technique called change data capture (CDC). Most databases write every modification operation to a write-ahead log, so that if the database encounters an error it can recover its state from there; many also provide some mechanism for capturing modification operations that were committed. Connectors that implement CDC repurpose these, translating database operations into events that are exposed in a messaging system like Kafka. Because CDC makes use of a native “eventing” interface it is very efficient, as the connector monitors a file or is triggered directly when changes occur rather than issuing queries through the database’s main API; it is also very accurate, since polling the main API creates an opportunity for operations to be missed if several arrive for the same row within a polling period. Some popular databases with CDC support in Kafka Connect are MySQL, Postgres, MongoDB, and Cassandra. There are also proprietary CDC connectors for Oracle, IBM, SQL Server, and more.

Speaking of collisions and merging: collisions occur when two services update the same entity at the same time. If we design the system to execute serially this won’t happen, but if we allow concurrent execution it can. There is a formal technique for merging data in this way with guaranteed integrity; it is called a conflict-free replicated data type, or CRDT. A useful way to generalize these ideas is to isolate consistency concerns into owning services using the single writer principle. The idea relates to the trouble caused when locks are used widely in concurrent environments, and to the efficiencies we can often gain by consolidating writes to a single thread; it is closely related to the Actor model. From a services perspective it also marries with the idea that services should have a single responsibility: responsibility for propagating events of a specific type is assigned to a single service, a single writer. This single-writer approach can be implemented in Kafka by enforcing write permissions on topics. With the single writer principle we are basically accepting eventual consistency rather than global consistency.

Kafka transactions remove duplicates, allow groups of messages to be committed atomically, and cover state stores backed by a Kafka topic. When an HTTP call fails (maybe due to a timeout) it is very common to retry it, but maybe our retry generates multiple database entries; it gets tricky when this happens in a system where payments are involved. Transactions make the system idempotent and allow the creation of long chains of services, where the processing of each step in the chain is wrapped in exactly-once guarantees. Kafka has to provide these guarantees itself, at a higher level of abstraction than a single network connection, because a broker sits in the middle: there are actually two opportunities for duplication. Sending a message to Kafka might fail before an acknowledgment is sent back to the client, with a subsequent retry potentially resulting in a duplicate message; on the other side, the process reading from Kafka might fail before offsets are committed, meaning that the same message might be read a second time when the process restarts. Kafka transactions use a commit marker message, an idea introduced in the Snapshot Marker Model: after the messages are sent, a commit marker is written, and until it is, the messages are not available to transactional readers. The overhead of these commit markers can be reduced by batching wisely; for context, batches that commit every 100 ms with a 1 KB message size have about a 3% overhead compared to in-order, at-least-once delivery.

Kafka consumers and producers need to agree on the data format and schema. For schema management, Protobuf and JSON Schema are both popular, but most projects in the Kafka space use Avro. For central schema management and verification, Confluent has an open source Schema Registry that provides a central repository for Avro schemas. New schema updates need to be backward compatible. Sometimes it is absolutely necessary to make a change that breaks backward compatibility; for cases like this, creating a new topic is recommended, and we would need to push data to both topics for a while.

There are actually two types of table in Kafka Streams: KTables and Global KTables. With just one instance of a service running, these behave equivalently, but if we scaled the service out so it had four instances running in parallel, we’d see slightly different behavior. This is because Global KTables are broadcast: each service instance gets a complete copy of the entire table. Regular KTables are partitioned: the dataset is spread over all service instances. Whether a table is broadcast or partitioned affects the way it can perform joins. With a Global KTable, because the whole table exists on every node, we can join on any attribute we wish, much like a foreign-key join in a database. This is not true of a KTable: because it is partitioned, it can be joined only by its primary key, just as you have to use the primary key when you join two streams. So to join a KTable or stream on an attribute that is not its primary key, we must perform a repartition.

Ensuring data consistency while scaling can also be challenging, as data propagation across nodes takes time. To solve this we can partition the topic by, say, product ID; that way records with the same product ID always go to the same node, which keeps the operation consistent. Inventory products and order products may be spread across multiple nodes, and to be able to join them we need to reshuffle everything; this process is called rekeying to join, and data arranged this way is termed co-partitioned. Once rekeyed, the join can be performed without any additional network access. But then again, in our next operation we may need to join a few other tables, so we have to rekey again and our previous arrangement gets shuffled; depending on the situation this can make the approach less practical.

Reflection on Building Microservices: Designing Fine-Grained Systems

Recently I have been reading “Building Microservices: Designing Fine-Grained Systems” by Sam Newman. It is indeed a must-read if you are trying to move toward microservices. This book is going to influence my work a lot, so I thought I should write a reflection blog, which is what I am attempting to do here. Obviously my point of view will differ on many things; I will explain my takeaways from the book and try to reflect my own perspective and experience as well. Sam Newman is not a fanboy like me, so he expresses caution about Docker, but I am a fan of Docker and Kubernetes, so I have a rather different perspective on how they make managing and orchestrating microservices easier and where they fit in. Maybe I will write about that in another blog.

I agree with Sam Newman when he compares our software industry with the fashion industry. We are easily dominated by tech trends and tend to follow whatever the big organizations are doing, but there are reasons for that. The big organizations setting the trends run a lot of experiments, publish their findings, and sometimes publish open source tools as a byproduct of their work. Smaller organizations like us usually don’t get the luxury of experimenting at that scale or of innovating, but we can easily become consumers of those tools and take advantage of the knowledge that has been shared with us. So we take the easy route.

One of the key things about microservices is that every service needs some form of autonomy. It can depend on other services for data or information that is outside its boundary or context, but it has to own its own data source, where it stores and retrieves the data that is within its boundary. It can use the programming language that best matches its goal and the database that matches its needs. This independence of tooling is a great feature; it reduces tech debt in a very positive way, and it is probably the main selling point of microservices.

In a service-oriented architecture it can be common practice to directly access the database that another service primarily owns; the reasoning is simple: I need data that the other service has, so I access it directly. But what happens when the owner of that service changes its schema to represent its data better? The services consuming that data break and can no longer get the data they absolutely need. So we need less coupling between services than we got used to in that style of architecture. The obvious approach is to use REST APIs: when we need data from another service, we request it through an API call to the service that owns it, without knowing the internal details of how the data is stored. We need the data, not its internal representation. Loose coupling lets a team change the internal representation of things without impacting the results it is expected to produce. The important point is that it allows us to improve and change without huge hassles and long meetings with the stakeholders who run the other services that would be affected by a change in our internal representation. This is not just about databases: systems designed around RPC calls that change or fetch resources of other services are prone to the same problem.

When we talk about moving to microservices, it is easy to imagine a huge code base with millions of lines of code that does pretty much everything on its own, struggling with load, getting out of hand and almost unmanageable. We don’t want that, so we want to split this huge monolith into a few manageable services with manageable loads. Dividing a monolith is not always an easy task, and setting boundaries is often the hardest part, but every service needs a well-defined boundary. When dividing a monolith we can follow a simple rule of thumb: code that changes together stays together (high cohesion), and code that does not change together may be split apart. Then we can split further based on domain, functionality, or special technical needs. For example, a Facebook-like friends network can leverage a graph database such as Neo4j, in which case it deserves to be a separate service that uses the technical tools appropriate to it.

Service-to-service communication in microservices is tricky. Many organizations use trusted-network security, where the idea is that we try our best (and invest more) to secure our perimeter; it is hard to get inside the perimeter, but once you are inside, you are trusted with everything. For organizations like this, a few extra headers identifying the user are enough. But other organizations cannot guarantee with 100% accuracy that their perimeter is secure; there are plenty of techniques that allow attackers to breach the boundary while pretending to be someone else. So we may need to maintain some form of security and privacy even when we are communicating inside our secured perimeter, from one service to another. All internal communication can be encrypted with HTTPS/TLS, which encrypts the messages and verifies that the source and destination are who they claim to be. Some organizations rely on JWTs, which are lightweight, but every service still needs the capability to verify the token, and the shared secret used for verification has to be distributed between services in a secure fashion.

When deciding between orchestration and choreography, choreography is usually the better pattern. If we let one service orchestrate everything, we are giving it power over the others, and eventually it becomes a god service that dominates the rest; we don’t want service-to-service domination, we want an even distribution of load, responsibility, and importance. Orchestration has practical problems too: in a microservice world, new services appear and old services get discarded every day, and it becomes a real hassle for the owner of the god service to track them and add or remove the corresponding service calls. Rather than one service making REST API calls to tell other services to do certain things, we can fire events, and any service interested in an event subscribes to it and plays its part when the event is triggered. If a new service comes along, it simply subscribes too. This can be implemented using Kafka or maybe Amazon SQS.

In a microservice world it is extremely important to isolate bugs, because a bug in one service tends to cripple others and it becomes difficult to identify which service the real bug lives in. So we need a testing setup that isolates a service while testing: the API calls it makes to other services need to be mocked, which means we need stub servers capable of imitating the expected behavior of the other services. When testing, it is also wise to have a separate environment with its own set of databases, so we can test how resource creation behaves in a particular service. Isolated testing matters, but integration tests are also important: with so many moving pieces, a service might be calling an API that has been removed or whose URL has changed (maybe the owner forgot to inform the other stakeholders), in which case the calling service won’t be able to do what it is supposed to do. Integration tests verify that, when everything is put together, the system still behaves the way it should.

Every service needs to be monitored, and there has to be a defined set of actions to take when a service is down. For example, a sudden bug in one service can consume huge amounts of CPU or memory and take the service offline, which can affect other services running on the same machine; it is usually recommended to host one service per machine so that the failure of one service does not cascade into another. There should also be a well-placed rule, or circuit breaker, that defines when we consider a service down and stop sending traffic to it. Another thing to consider: if only one service out of hundreds is down, is it wise to present the whole system as down? Instead we can replace a portion of the website with something else. For example, when the ordering service of an e-commerce site is down, it would not be wise to shut the whole website down; we can let users visit the site and, rather than a “place order” button, show them “call our customer care to place an order”. So, implemented wisely, microservices can reduce our embarrassment. Netflix has an open source smart router called Zuul that can open or close a feature for a group of users while the rest get the default; that can come in handy from time to time as well. There are many tools for monitoring servers: from the open source world we can use Nagios, or we can use third-party monitoring tools like Datadog, New Relic, or CloudWatch, and the options are many.

When talking about monitoring, we also need to be conscious that sometimes our app will pass all the health checks yet still produce a lot of errors. For situations like this we need to monitor the logs, and a centralized log aggregation and monitoring system (the famous ELK stack, for example) can help. As we are dealing with microservices, there are many servers, and it is not practical to poke into every one of them to collect logs during a crisis; a log aggregation platform is a necessity, that’s how I want to put it. The log aggregation system also needs to be intelligent enough to fire an alarm when it sees a certain number of error messages, and depending on the severity it may be time to tell the circuit breaker that we don’t want any traffic on that service until things have been sorted out.

In microservices we are dealing with hundreds or thousands of services, so there will always be some deployment going on, in this service or that one. If a DevOps culture is adopted, small changes are encouraged to go to production; the philosophy is that a bug introduced by a small change is easier to spot and fix than one buried in a huge chunk of code. As the services are interconnected, if a service goes down even for a second other services are affected, so zero downtime is a must. We can use blue/green or canary deployment strategies to achieve zero downtime, and it is absolutely important. If something goes wrong in a deployment there has to be a well-defined strategy to roll back to the previous build, and it is better if a CI/CD pipeline is attached to it. Server automation tools like Chef and Puppet come in handy because they help us automate the monotonous tasks; when automation scripts are written and tested over and over again, deployment becomes faster and more reliable, and the same goes for rollback.

Microservices can also make security easier. As we know from the trade-offs behind the CIA triad, we can’t have everything: often, when we want more security, we have to slow our systems down. Now that we are splitting our services, do all of them need the same amount of security (read: do all of them need to get slower)? Often some services need more security than others, so microservices give us the opportunity to apply the right security where it is needed instead of slowing the whole system down when only a specific service needs special measures.

Sometimes DNS management gets troublesome; it can take hours for changes to propagate across servers. When we are dealing with hundreds of interconnected services, it is really troublesome if DNS records are not seen the same way from server to server, so we can run our own service discovery mechanism or our own DNS server. We can take advantage of ZooKeeper, Consul, Eureka, and so on.

Scaling microservices can be challenging at times. It is said that people use 20% of the features 80% of the time; in other words, most features are rarely used. So when we have a heavy load, scaling all services up linearly will usually be the wrong choice, and we need more fine-grained scaling rules implemented for each service separately. Our load balancers are getting smarter every day, so it won’t be too hard to implement.

Newman points out a few interesting things about adopting microservices; one of the key points is Conway’s Law. If an organization is not structured so that teams can work independently within their own areas and collaborate with other departments the way microservices do, it will be challenging for that organization to adopt microservices. It makes me wonder whether, for a smaller startup, microservices are really the solution to look for. Also, when it comes to ownership of services, in a smaller organization the same person or team will end up owning all of the services, which is a huge overhead as well.

Database strategies for microservices

In my last blog I talked about the problem of services locking up when they share a database; how can we solve it?

Shared Tables: shared tables are an easy, quick-and-dirty solution that is very common, but be aware that it is high maintenance.
In MySQL we can use the FEDERATED storage engine (http://dev.mysql.com/doc/refman/5.1/en/federated-storage-engine.html) to do this. We have to create a federated table based on the table at the remote location that we want to access.

CREATE TABLE remote_user (
  username varchar(20) NOT NULL,
  password varbinary(20) NOT NULL,
  PRIMARY KEY(username)
) ENGINE=FEDERATED DEFAULT CHARSET=utf8 CONNECTION='mysql://username:password@someip:port/db/user';

Database View: a database view is a comparatively better approach when the cases are simple, because it allows another representation of the database model that is more suitable for the consumer. The most amazing thing about database views is that they are supported by a wide range of databases, but under heavy use we can see performance issues. While considering a database view we must ensure that the two databases can connect to each other without any network or firewall issues, and keep in mind that most database views are read-only; updating through them might get tricky.

CREATE TABLE federated_table (
    [column definitions go here]
)
ENGINE=FEDERATED
CONNECTION='mysql://username:password@someip:port/db/user';

Triggers:
Database triggers might come in handy where one database operation should trigger another database update. We can bind triggers to INSERT, UPDATE, and DELETE events (BEFORE or AFTER), as in this BEFORE INSERT example.

CREATE TRIGGER user_bi BEFORE INSERT ON user FOR EACH ROW
BEGIN
  INSERT INTO remote_user (username,password) VALUES (NEW.username,NEW.password);
END

Data Virtualization: when we are dealing with microservices, possibly some of our databases are running on MySQL while other services use other DBMSs; in that case a data virtualization strategy is necessary. One open source data virtualization platform is Teiid. When following this strategy we must know whether we are dealing with stale data or not, and be aware of the performance cost: it adds another hop, since the data is not being accessed directly from the database.

Event sourcing: rather than making database operations directly, we can model the changes as a stream of events flowing one after another through a message broker. Then it does not matter how many users are accessing your database, it will never lock up, but it will take more time to process the data.

Change Data Capture: another approach is Change Data Capture (CDC), an integration strategy that captures the changes being made to data and makes them available as a sequence of events to the other databases or systems that need to know about those changes. It can be implemented using Apache Kafka, Debezium, and so on.

Simple tricks that can help us achieve zero downtime when dealing with DB migrations

Currently we are dealing with quite a few deployment processes. For a company that embraces DevOps culture, deployment happens many times a day: a tiny fraction of code changes goes to deployment, and because the change is so small it is easier to spot a bug. If the bug is crucial, it may be time to roll back to an older version, and to be able to roll back we need a database that accepts the rollback; yet we have to do it all with zero downtime, so that users don’t notice a thing. That is often not as easy as it sounds in principle.

Before describing a few key ideas for solving this common problem, let’s discuss a few of the most common deployment architectures.

A blue/green deployment architecture consists of two different versions of the application running concurrently, one of them serving as the production stage and the other as the staging platform for the new release, but both versions of the app must be able to handle 100% of the requests. We configure the proxy to stop forwarding requests to the blue deployment and start forwarding them to the green one on the fly, so that no incoming request is lost during the switch from blue to green.

Canary deployment is a deployment architecture where, rather than switching all users to the new version at once, we migrate a small percentage of users, or a specific group of users, to the new version. Canary deployment is a bit more complicated to implement because it requires smart routing; Netflix’s OSS Zuul can be a tool that helps. Feature toggles can be implemented using FF4J or Togglz.

As we can see, most of these deployment processes require two versions of the application running at the same time. The problem arises when a database with migrations is involved, because both versions of the application must be compatible with the same database, so the schema versions between consecutive releases must be mutually compatible.

Now how can we achieve zero downtime on these deployment strategies?

So we can’t do database migrations that are destructive or can potentially cause us to lose data. In this blog we will discuss how to approach such database migrations:

One of the most common problems we face during large UPDATE or ALTER TABLE operations is that they lock up the database. We don’t control how long an ALTER TABLE will take, but in most popular DBMSs on the market, issuing an ALTER TABLE ... ADD COLUMN statement won’t lead to locking. So, for example, if we want to change the type of a field, rather than changing the field type in place we can add a new column.

When adding the column we should not add a NOT NULL constraint at the very beginning of the migration, even if the model requires it, because only the new version of the application will populate the new column; the current version doesn’t provide any value for it, so the constraint would break the current version’s INSERT/UPDATE statements. We need to ensure that the new version reads values from the old column but writes to both columns, so that all new rows have both columns populated with correct values. Once new rows are being handled this way, it is time to deal with the old data: we need to copy the data from the old column to the new one so that all existing rows also have both columns populated, and this is where the locking problem shows up if we try to do it in a single UPDATE.

Instead of issuing a single statement to achieve the column rename, we’ll need to get used to breaking big changes into multiple smaller ones. One solution is to take baby steps like this:

ALTER TABLE customers ADD COLUMN correct VARCHAR(20);

UPDATE customers SET correct = wrong WHERE id BETWEEN 1 AND 100;

UPDATE customers SET correct = wrong WHERE id BETWEEN 101 AND 200;

ALTER TABLE customers DROP COLUMN wrong;

Once we are done populating the new column from the old one, and we are confident that we will never need the old version again, we can drop the old column. As this is a destructive operation, the dropped data is lost and no longer recoverable.

As a precaution, we should drop it only after a quarantine period. After the quarantine period, when we are confident enough that we will no longer need the old version of the schema, nor a rollback that requires that version of the schema, we can stop populating the old column. If you decide to execute this step, make sure to drop any NOT NULL constraint on the old column first, or you will prevent your code from inserting new rows.

Higher Level View of RDMA Programming and its Vocabulary

Recently I have come across a pretty cool technology called RDMA (Remote Direct Memory Access). It enables direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. In this blog I will note down a few of the vocabulary terms that come in handy when dealing with RDMA.

A Queue Pair (QP) consists of a Send Queue (SQ) and a Receive Queue (RQ). When we want to send data we post a request to the SQ, and when we want to receive data we post to the RQ. Both of them can be attached to a Completion Queue (CQ). The completion queue is used by the network adapter to notify us of the status of completed Work Requests; each Completion Queue Entry (CQE) holds information about the completion status of a completed work request.

When we want the adapter to send or receive, we post a request; these are called Work Requests (WRs). In a send request we specify how much data will be sent and the memory buffer where that data is located (for both connected and unconnected transports), where the data should be sent to, and the type of the send request. In a receive request we specify the maximum data size to be received and the memory buffer where the data should land. Completions of a send queue and a receive queue can be reported to the same or to different completion queues.

Work requests in the same work queue are processed in the order they were posted, but ordering is not maintained across different work queues. Every work request has its own user-defined id wr_id and a set of flags; for example wr.send_flags = IBV_SEND_SIGNALED requests the generation of a completion entry once the data is transmitted. Work requests can be handled in a chain by pointing wr.next at another work request.
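
To make this concrete, here is a minimal sketch of posting one send work request with libibverbs. It assumes a QP that is already in the RTS state and a buffer buf of msg_size bytes registered earlier as the memory region mr (all four names are assumptions for the sketch); for simplicity it targets a connected QP, since a UD QP would additionally need the address handle fields discussed later.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to exist already: qp (in RTS), mr, buf, msg_size. */
struct ibv_sge sge = {
    .addr   = (uintptr_t)buf,   /* where the data to send lives        */
    .length = msg_size,         /* how much data to send               */
    .lkey   = mr->lkey,         /* local key of the registered region  */
};

struct ibv_send_wr wr = {
    .wr_id      = 1,                 /* our own user-defined id            */
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,       /* type of the send request           */
    .send_flags = IBV_SEND_SIGNALED, /* ask for a CQE once it is sent      */
    .next       = NULL,              /* another WR could be chained here   */
};

struct ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(qp, &wr, &bad_wr))
    perror("ibv_post_send");

The designated initializers leave every other field zeroed, which matters because the verbs layer expects unused attributes to be zero.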

ibv_create_cq is the verb that creates a CQ. Operations can complete successfully or with an error; either way the result is reported through a completion queue entry (CQE). Polling the CQ (with ibv_poll_cq) is how we retrieve CQEs from it, and the outcome is reported in the status field of each completion entry.
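
As a rough sketch, assuming ctx is a device context obtained from ibv_open_device (shown further down), creating a small CQ and draining its completions looks like this:

#include <infiniband/verbs.h>
#include <stdio.h>

/* Create a CQ with room for 16 entries; no completion channel, so we poll. */
struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

/* Poll until the CQ is empty; each ibv_wc is one completion entry. */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) > 0) {
    if (wc.status != IBV_WC_SUCCESS)
        fprintf(stderr, "WR %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
}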

We create a QP using the ibv_create_qp function. As parameters it takes a Protection Domain (PD) and a set of attributes. A protection domain groups resources together: resources from the same protection domain (for example a QP and an MR) are allowed to work with each other, while resources from different protection domains are not. We allocate a protection domain by calling ibv_alloc_pd. The attribute struct would look something like this:

struct ibv_qp_init_attr qp_init_attr;
struct ibv_cq *cq;                      /* created earlier with ibv_create_cq() */

memset(&qp_init_attr, 0, sizeof(qp_init_attr)); /* unused attributes must be zero */
qp_init_attr.send_cq = cq;
qp_init_attr.recv_cq = cq;
qp_init_attr.qp_type = IBV_QPT_UD;
qp_init_attr.cap.max_send_wr = 2;
qp_init_attr.cap.max_recv_wr = 2;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr); /* pd comes from ibv_alloc_pd() */

more at: https://www.rdmamojo.com/2012/12/21/ibv_create_qp/

Here max_send_wr is the maximum number of outstanding work requests we want to allow on the send queue before they complete; note that it should not be larger than the number of CQEs the CQ was created with. max_recv_wr is the same limit for the receive queue.
In max_send_sge and max_recv_sge, SGE is shorthand for Scatter/Gather Entry, so these are the maximum number of scatter/gather entries allowed per work request. The device limits can be queried using ibv_query_device.

As you can see we have set qp_type to IBV_QPT_UD, which stands for Unreliable Datagram. In the reliable connected case a QP talks to exactly one remote RC QP, but an unreliable datagram QP allows one-to-many communication without requiring any previous connection setup.

Like many things in network programming, a QP goes through a series of states before it can actually process sends and receives.

RESET: By default, upon creation, a QP is in the RESET state. It cannot process any work request in this state.
INIT: After its initial configuration the QP moves from RESET to INIT. From here receive buffers can be posted to the receive queue with ibv_post_recv, but they will not be consumed until the QP reaches the RTR state.
RTR: Next the QP moves to Ready To Receive, where it is configured to receive data.
RTS: Finally the QP moves to Ready To Send, where it is configured to send data; from this state the device can post sends using ibv_post_send.

After creation, you modify the QP's state using ibv_modify_qp. When modifying a QP, pkey_index, port_num, and qkey (for unreliable datagram only) may be necessary. All QPs that want to communicate over unreliable datagram must share the same qkey.
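
For the UD QP created above, the state transitions are done with ibv_modify_qp roughly like this (a sketch with error checking omitted; the port number and qkey are arbitrary values chosen for illustration):

#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp_attr attr;

/* RESET -> INIT: pkey index, port and qkey are required for a UD QP. */
memset(&attr, 0, sizeof(attr));
attr.qp_state   = IBV_QPS_INIT;
attr.pkey_index = 0;
attr.port_num   = 1;
attr.qkey       = 0x11111111;   /* all UD QPs that talk to each other share this */
ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                         IBV_QP_PORT  | IBV_QP_QKEY);

/* INIT -> RTR: nothing extra is needed for UD. */
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTR;
ibv_modify_qp(qp, &attr, IBV_QP_STATE);

/* RTR -> RTS: set the initial send packet sequence number. */
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTS;
attr.sq_psn   = 0;
ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN);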

To make RDMA do anything, the network adapter needs permission to access local data. This is done through a Memory Region (MR). A memory region has an address, a size, and a set of permissions that control access to the memory pointed to by the region.

To register a memory region we call ibv_reg_mr. It takes the Protection Domain, the starting virtual memory address, the size, and access flags such as local read, local write, remote read, remote write, and atomic operations. Local read access is needed when the adapter has to gather data from local memory while processing an RDMA operation; local write access is needed when the adapter has to scatter data into local memory when receiving a send operation; remote access is needed when the adapter has to expose local data to an RDMA operation received from a remote process.
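
A minimal sketch of registering a buffer, assuming pd is the protection domain allocated earlier with ibv_alloc_pd:

#include <infiniband/verbs.h>
#include <stdlib.h>

size_t size = 4096;
void *buf = malloc(size);

/* Allow the adapter to write into this buffer locally and let remote
   peers read it with RDMA READ. */
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

/* mr->lkey goes into our own scatter/gather entries; mr->rkey is what a
   remote peer needs in order to access this region. */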

To open a device for communication we call ibv_open_device, which returns a device context that the other verbs calls work on. The cq_context, channel and comp_vector parameters of ibv_create_cq only become necessary when dealing with completion events rather than polling.

If we want to send data from A to B, we need a source and a destination address, or a destination GID, which is known as an address vector.
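
Under libibverbs an address vector is expressed as an address handle. Here is a minimal sketch, assuming remote_lid is a value learned out-of-band (for example from ibstat on the peer) and pd is the protection domain from earlier:

#include <infiniband/verbs.h>
#include <string.h>

struct ibv_ah_attr ah_attr;
memset(&ah_attr, 0, sizeof(ah_attr));
ah_attr.dlid     = remote_lid;  /* destination LID on the local subnet */
ah_attr.sl       = 0;           /* service level                       */
ah_attr.port_num = 1;           /* local port to send from             */

struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);

/* For a UD send, the work request then carries the address handle plus the
   remote QP number and qkey in its wr.wr.ud fields. */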

We can collect our device details using the ibstat command. But note that we first need to connect the two devices, install MLNX_OFED (which provides ibstat), change the port type (IB/Eth) if needed, and check that the ports are in the LinkUp state; if we are running over InfiniBand, a subnet manager such as opensm must be running. We can also collect that information programmatically using the ibv_get_device_list function.
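
A rough sketch of the programmatic route, using ibv_get_device_list and ibv_open_device:

#include <infiniband/verbs.h>
#include <stdio.h>

int num;
struct ibv_device **devs = ibv_get_device_list(&num);
if (!devs || num == 0) {
    fprintf(stderr, "no RDMA devices found\n");
} else {
    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    /* Open the first device to get the context used by the other verbs calls. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    ibv_free_device_list(devs);
}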

Under the hood, libibverbs handles the RDMA-related operations: creating, modifying, querying and destroying resources, posting send and receive work to QPs, and retrieving completions from completion queues.

As tcpdump does not work when we are dealing with InfiniBand (the traffic bypasses the OS network stack), we can use ibdump for debugging.

Running an on-premise local MySQL replica with an AWS RDS Aurora master

To solve our problem we are running a hybrid cloud: a few of our services run in the cloud and some run on-premise locally in our country, where our users are and where AWS does not provide service. To be able to do that we need a local replica of our database that can serve reads.

First we need to create a replication user on the Aurora master:

CREATE USER 'replica'@'%' IDENTIFIED BY 'slavepass'; 
GRANT REPLICATION SLAVE ON *.* TO 'replica'@'%';

Then create a new DB cluster parameter group and set binlog_format to MIXED. Modify the Aurora cluster to use the custom parameter group and restart the DB to apply those changes. Now if you run the following command you will be able to see the binlog file name and position.

show master status

Now we need to dump the master's data into a SQL dump so that we can feed our slave database.

mysqldump --single-transaction --routines --triggers --events -h XXX.azhxxxxxx2zkqxh3j.us-east-1.rds.amazonaws.com -u bhuvi --password='xxx' my_db_name > my_db_name.sql

It can be GB to TB of data depending on your database size. So it will take time to download.

Run the following to find your MySQL configuration file:

mysqld --help --verbose | grep my.cnf

For me it is /usr/local/etc/my.cnf

vi /usr/local/etc/my.cnf

and change server-id to:

[mysqld]
server-id = 2

Now let's import the data into our local MySQL:

mysql -u root --password='xxx' my_db_name < my_db_name.sql

Now we need to tell our slave database who its master is. The master host is the Aurora cluster endpoint, the user and password are the replication user we created earlier, and the log file and position come from the SHOW MASTER STATUS output:

CHANGE MASTER TO  
MASTER_HOST = 'RDS END Point name',  
MASTER_PORT = 3306,  
MASTER_USER = '',  
MASTER_PASSWORD = '',  
MASTER_LOG_FILE='',  
MASTER_LOG_POS=;

Now we need to start the slave.

start slave;

Setting up (comodo) ssl for your website on aws

We bought our SSL certificate from Comodo through name.com, as we got a better deal there. After we sent them our certificate signing request, Comodo sent us the files listed below via email, issued against my private key. In this post I will write about how I set the whole thing up on AWS.

First of all, before purchasing, I had to send them a CSR, which I generated with OpenSSL using the following command:

openssl req \
       -newkey rsa:2048 -nodes -keyout domain.key \
       -out domain.csr

Which was pretty easy. And as we had bought a Comodo Essential SSL Wildcard, we could buy it without verifying our company, fairly easily and in less than 5 minutes.

After our successful purchase Comodo sent us the following files as a zip in my email:
domain_com.crt
COMODORSADomainValidationSecureServerCA.crt
COMODORSAAddTrustCA.crt
AddTrustExternalCARoot.crt
domain_com.crt is our primary certificate, COMODORSADomainValidationSecureServerCA.crt and COMODORSAAddTrustCA.crt are the intermediate certificates, and AddTrustExternalCARoot.crt is the root certificate.

Now it gets a little tricky because our certificates are currently in .crt format, but we want them in *.pem format, so we need to convert them:

openssl x509 -in ./AddTrustExternalCARoot.crt -outform pem -out ./pem/AddTrustExternalCARoot.pem
openssl x509 -in ./COMODORSAAddTrustCA.crt -outform pem -out ./pem/COMODORSAAddTrustCA.pem
openssl x509 -in ./COMODORSADomainValidationSecureServerCA.crt -outform pem -out ./pem/COMODORSADomainValidationSecureServerCA.pem
openssl x509 -in ./domain_com.crt -outform pem -out ./domain.pem

We also need the private key we generated earlier (domain.key) converted to PEM:

openssl rsa -in ./domain.key -outform PEM -out domain.key.pem

Let's create the certificate chain first:

$ cat ./COMODORSADomainValidationSecureServerCA.pem > ./CAChain.pem
$ cat ./COMODORSAAddTrustCA.pem >> ./CAChain.pem
$ cat ./AddTrustExternalCARoot.pem >> ./CAChain.pem

Now you need to log in to your AWS console and search for ACM (AWS Certificate Manager); if it is your first time there, click on Provision certificates.

It is time to import your certificate into ACM. In the import form, paste domain.pem into the Certificate body field, domain.key.pem into the Certificate private key field, and CAChain.pem into the Certificate chain field.

So that's it, we are done importing.

Now if you have a load balancer you can take advantage of this SSL certificate. Use an existing load balancer or feel free to create one; add an HTTPS listener instead of HTTP, and for the certificate choose ACM and select your domain's certificate.

You are good to go.

Configuring django for centralised log monitoring with ELK stack with custom logging option (eg. client ip, username, request & response data)

When you are lucky enough to have enough users that you decide to roll out another cloud instance for your Django app, logging becomes a little tougher, because your architecture now needs a load balancer that proxies each request to one instance or another. Previously all logs lived on one machine, so monitoring was easy: when someone reported an error we went to that machine and looked for it. Now, with multiple instances, we have to check every instance, which, security risks aside, is a lot of work. So I think it is wise to have a centralised log aggregation service.

For log management and monitoring we are using Elasticsearch, Logstash and Kibana, popularly known as the ELK stack. In this blog we will log pretty much every request and its corresponding response so that the debugging process becomes handy for us. To serve this purpose we will leverage Django middleware and python-logstash.

First of all let’s configure our settings.py for logging:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': True,
    'formatters': {
        
        'standard': {
            'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
        },
        'logstash': {
            '()': 'proj_name.formatter.SadafNoorLogstashFormatter',
        },
    },
    'handlers': {
        'default': {
            'level':'DEBUG',
            'class':'logging.handlers.RotatingFileHandler',
            'filename': '/var/log/proj_name/django.log',
            'maxBytes': 1024*1024*5, # 5 MB
            'backupCount': 5,
            'formatter':'standard',
        },  
        'logstash': {
          'level': 'DEBUG',
          'class': 'logstash.TCPLogstashHandler',
          'host': 'ec2*****.compute.amazonaws.com',
          'port': 5959, # Default value: 5959
          'version': 1, # Version of logstash event schema. Default value: 0 (for backward compatibility of the library)
          'message_type': 'logstash',  # 'type' field in logstash message. Default value: 'logstash'.
          'fqdn': False, # Fully qualified domain name. Default value: false.
          #'tags': ['tag1', 'tag2'], # list of tags. Default: None.
          'formatter': 'logstash',
      },

        'request_handler': {
            'level':'DEBUG',
            'class':'logging.handlers.RotatingFileHandler',
            'filename': '/var/log/proj_name/django.log',
            'maxBytes': 1024*1024*5, # 5 MB
            'backupCount': 5,
            'formatter': 'standard',
        },
    },
    'loggers': {
        'sadaf_logger': {
            'handlers': ['default', 'logstash'],
            'level': 'DEBUG',
            'propagate': True
        },
    }
}

As you can see we are using a custom logging formatter. We could leave this out and the default LogstashFormatterVersion1 format would work just fine, but I chose to define my own format because my requirements are different: I am running behind a proxy server, and I want to log which user made the request and from which IP. So roughly my log formatter looks like the following:

from logstash.formatter import LogstashFormatterVersion1


class SadafNoorLogstashFormatter(LogstashFormatterVersion1):

    def format(self, record):
        # The middleware below passes the Django request in `extra`,
        # so it is available here as record.request.
        caddr = "unknown"
        if 'HTTP_X_FORWARDED_FOR' in record.request.META:
            # Behind a proxy this header carries the original client address.
            caddr = record.request.META['HTTP_X_FORWARDED_FOR']  # .split(",")[0].strip() to keep only the first hop

        message = {
            '@timestamp': self.format_timestamp(record.created),
            '@version': '1',
            'message': record.getMessage(),
            'host': self.host,

            'client': caddr,
            'username': str(record.request.user),

            'path': record.pathname,
            'tags': self.tags,
            'type': self.message_type,

            # Extra fields
            'level': record.levelname,
            'logger_name': record.name,
        }

        # Add extra fields (request method, url, body, ... set by the middleware)
        message.update(self.get_extra_fields(record))

        # If exception, add debug info
        if record.exc_info:
            message.update(self.get_debug_fields(record))

        return self.serialize(message)

As our requirement is to log every request, our middleware may look like the following:

import logging

from django.utils.deprecation import MiddlewareMixin

request_logger = logging.getLogger('sadaf_logger')


class LoggingMiddleware(MiddlewareMixin):
    """
    Provides full logging of requests and responses.
    """
    _initial_http_body = None

    def __init__(self, get_response):
        self.get_response = get_response

    def process_request(self, request):
        # request.body can no longer be read in process_response (the stream
        # has been consumed by then), so keep a copy of it here.
        self._initial_http_body = request.body

    def process_response(self, request, response):
        """
        Log every GET request and every JSON POST request with its response.
        """
        if request.path.startswith('/') and (
                (request.method == "POST" and
                 request.META.get('CONTENT_TYPE') == 'application/json')
                or request.method == "GET"):
            status_code = getattr(response, 'status_code', None)
            log_lvl = logging.INFO
            if status_code and status_code >= 400:
                log_lvl = logging.ERROR

            request_logger.log(
                log_lvl,
                "{}: {}".format(request.method, request.GET),
                extra={
                    'request': request,
                    'request_method': request.method,
                    'request_url': request.build_absolute_uri(),
                    'request_body': self._initial_http_body.decode("utf-8"),
                    'response_body': response.content,
                    'status': response.status_code,
                })
        return response

So pretty much you are done. Log in to your Kibana dashboard, create an index pattern for the index you are interested in, and see your logs.