Developing applications / microservices is definitely a skill that your team needs to gain, but what does it mean when you say “I am done”? What is the definition of “DONE”?
A lot of times you hear from us software engineers the following statements:
- “I am done coding that part but I still need to wrap up my unit-testing”.
- “My user story is done, but we just need to finish writing the automation tests.”
- “I am done, but I just need to add better logging in the code and add some better exception handling.”
- “I am done, but I have not deployed to QA environment yet; it works on my machine.”
We all have been in these shoes. We need to step back and look at this from the software engineering 10,000-foot level and truly understand what it means to deliver something and call it done and be enterprise worthy at the same time.
Here is my recommended list of things that your TEAM needs to accomplish in order to truly have production and enterprise worthy applications/microservices:
(1) Architecture diagram representing interactions of your family of microservices:
After going through the design and contract definitions of your APIs/microservices, your team needs to put together a single diagram that explains the interaction among microservices (owned by your team) and how these microservices interact with other teams’ microservices. This is basically the blueprint for your team’s work and this picture needs to be constantly updated as you are evolving your services.
(2) Service names, boundaries (domain lines) and API definitions/standards:
Names mean things. So you first need to properly name your services, and those are the names that you would use when talking to your teammates and clients of your services/APIs.
I have a separate article on how you go about defining what a microservice is: https://medium.com/@almirx101/micro-in-microservices-677e183baa5a
Essentially, you need to define the purpose and boundaries of your service.
Then you get into API routes and properly defining them for each service. The goal is to keep the routes RESTful and if you run into the situation when they are not, then it should trigger you to revisit the purpose and boundaries of that given microservice.
(3) Application CONFIG file defining your service:
The assumption would be that your platform is set up in such a way that you, as the owner of a microservice, would properly set up your application CONFIG file, which defines your microservice and all the resources that it needs. This config file defines anything from what database you need, database details, what cache you need, what queues you need, how much memory and CPU you want your application to use, what physical server you want your application to run on, etc.
Basically this config file is the contract between you (as the owner of a microservice) and the DevOps scripts/tools that take this config file and use it as the definition to deploy your service into a given environment, whether it is in the cloud or on-prem.
You own this config file, but with ownership comes responsibility. It is important that this config file is code-reviewed properly and managed properly to keep the stability of your application.
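To make the idea of a config contract a bit more concrete, here is a minimal sketch in Python. Every field name in it (service_name, memory_mb, depends_on and so on) is a hypothetical illustration and not the format of any particular platform; the point is only that the contract can be validated before the deploy tooling ever sees it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical contract between a service owner and the deploy tooling."""
    service_name: str
    memory_mb: int
    cpu_units: int
    database: Optional[str] = None       # e.g. "orders-db"
    cache: Optional[str] = None
    queues: Tuple[str, ...] = ()
    depends_on: Tuple[str, ...] = ()     # other services this one calls


def validate(config):
    """Return a list of problems; an empty list means the config is acceptable."""
    problems = []
    if not config.service_name:
        problems.append("service_name is required")
    if config.memory_mb <= 0:
        problems.append("memory_mb must be positive")
    if config.cpu_units <= 0:
        problems.append("cpu_units must be positive")
    if config.service_name in config.depends_on:
        problems.append("a service cannot depend on itself")
    return problems


orders = ServiceConfig(
    service_name="orders",
    memory_mb=512,
    cpu_units=256,
    database="orders-db",
    queues=("order-events",),
    depends_on=("inventory", "payments"),
)
```

A check like validate(orders) could run as part of the code review pipeline, so that a broken config is rejected before it reaches any environment.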
(4) API Dependencies (Direct or via Pub/Sub approach):
As you are defining your microservices and API contracts, it is important that you explicitly think about API dependencies. Think about the type of dependency you want to achieve. You can directly call another microservice via an HTTP call (sync or async), or you can use the publish/subscribe methodology. It depends on your use case and there is no right or wrong here. The important thing is that you give it explicit thought as you are designing your family of microservices, the interaction among them, and their interaction with outside services. Ideally you could have this API dependency definition within the config file mentioned above.
(5) API Dependency Analysis Tool:
As you have all these API dependencies, they need to be managed somehow. You need a holistic view of all your services. Therefore, you need to develop a tool that performs static analysis through your config files and maps a graph of all microservices and how they interact with each other. To supplement this static analysis tool, you would also need a tool that in real-time maps all the microservices interactions that are actually taking place in production.
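The static half of such a tool can start very small. The sketch below assumes each config has already been loaded into a dictionary with hypothetical service_name and depends_on keys; it builds the graph, reports dependencies that nobody owns, and looks for a circular dependency. Treat it as a starting point and not as a finished analysis tool.

```python
def build_graph(configs):
    """Map each service name to the set of services it calls."""
    return {c["service_name"]: set(c.get("depends_on", [])) for c in configs}


def unknown_dependencies(graph):
    """Dependencies that no config in the graph declares as a service."""
    return sorted({dep for deps in graph.values() for dep in deps} - graph.keys())


def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if there is none."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We came back to a service that is still on the current path.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in sorted(graph.get(node, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for name in sorted(graph):
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Hypothetical configs for three services.
configs = [
    {"service_name": "orders", "depends_on": ["inventory", "payments"]},
    {"service_name": "inventory", "depends_on": []},
    {"service_name": "payments", "depends_on": ["orders", "fraud-check"]},
]
graph = build_graph(configs)
```

With these sample configs, find_cycle(graph) reports the orders/payments loop and unknown_dependencies(graph) flags fraud-check as a dependency with no config behind it.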
(6) Logging / correlationId / Log-Levels:
First agree on the format of your log messages. Then implement a logging framework in a library that all your microservices can utilize.
Make sure that you introduce the concept of correlationId that can be used to map the flow, especially between microservices. For example, your microservice could be logging a requestId or some form of a GUID unique to that service, but in order to connect things among multiple components/services, you need the concept of correlationId.
Now that you have the foundation and the principles down, your team needs to have the right mindset; it looks simple on paper, but there is much more to it. How many times have you been in a situation when you were troubleshooting a production issue and you said to yourself: “I wish I had a log line for this condition”? When it comes to logging, here are some tips:
- Log positive as well as negative cases
- Any time you add an IF condition in your code, you need to think about what you will log in the IF, ELSE IF and ELSE.
- Log your exceptions where you catch them, and enrich the log with the context you have at that level, because that context is not available at the level above in the stack trace.
- Introduce the concept of log-levels and make sure that the production environment is not logging data that should not be logged (personal info) and that excessive logging does not negatively impact performance.
An unwritten rule with logging is that you do NOT put a log statement in a loop unless you truly understand that the number of iterations will not impact your performance with multiple requests going into your service.
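Here is a minimal sketch of how the shared logging library could carry a correlationId, using only the Python standard library. The header name X-Correlation-Id and the log format are my assumptions for illustration; use whatever your team agreed on.

```python
import contextvars
import logging
import uuid

# Holds the correlationId for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlationId onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name, level=logging.INFO):
    """Create a logger whose lines always include the correlationId."""
    logger = logging.getLogger(name)
    logger.setLevel(level)  # the log-level is what you tighten in production
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s "
        "correlationId=%(correlation_id)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_request(headers, logger):
    """Reuse the caller's correlationId, or start a new one at the edge."""
    cid = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    correlation_id.set(cid)
    logger.info("request received")
    # Pass the same id on every outbound call so the flow can be joined up.
    return {"X-Correlation-Id": cid}
```

Because the id lives in a context variable, the code deeper in the service does not need to pass it around by hand; every log line written while handling that request picks it up.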
(7) Log Aggregator Tool:
Your team, or your company overall, needs to pick a log aggregator tool that allows you to view all your flows through a website and to perform fancy queries on your logs. For example, you can pick Splunk. Your service will use a library to stream all the logs into Splunk and Splunk would be your eyes into the system.
Now that you have a tool like Splunk, you need to take it to the next level. Each team should be responsible to build dashboards in Splunk that track the health of your microservices. These dashboards could be used by any technical person troubleshooting things in production.
Then the next thing to do is for your Network Operation Center to set up alerts in Splunk so that they get informed when certain limits are reached and when to open a severity 1 or 2 ticket with the right microservices team.
(8) Exception Handling in your code:
In your choice of general purpose language, you have the concept of exception handling (try-catch or try-except blocks). Even though your services would be micro-services, it does not mean that the whole microservice internally contains only one layer. Most likely you will have your API level or controller layer or resource layer. That’s the code where your API routes are defined and where you have the entry points for each route. That code typically should do just some validation and then proxy your call down into your business layer, and your business layer makes calls into the database/repository layer and/or into another API via your adapter layer. As you are writing code in these layers, you need to think about whether you need the try/catch blocks or not. For example, there is no reason to catch an exception at the lower level unless you need to log something that you have ONLY available at that layer and not at the layer of the invoker.
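A small sketch of that layering in Python follows. The function names, the region rule and the OrderNotFound class are all hypothetical; the only thing I want to show is that the try/except sits in the one layer that has extra context to log.

```python
import logging

logger = logging.getLogger("orders")


class OrderNotFound(Exception):
    """Raised by the business layer when an order does not exist."""


def repository_get_order(table, order_id):
    # Repository layer: no try/except. It has nothing useful to add,
    # so a KeyError simply travels up to the caller.
    return table[order_id]


def business_get_order(tables, order_id, customer_id):
    # Business layer: the region is worked out here and is not known to
    # the controller, so this is the one place worth catching and logging.
    region = "eu" if customer_id.startswith("eu-") else "us"
    try:
        return repository_get_order(tables[region], order_id)
    except KeyError as exc:
        logger.warning("order lookup failed order_id=%s region=%s",
                       order_id, region)
        raise OrderNotFound(order_id) from exc


def controller_get_order(tables, order_id, customer_id):
    # Controller layer: validate the input, then delegate.
    if not order_id:
        raise ValueError("order_id is required")
    return business_get_order(tables, order_id, customer_id)
```

Raising with "from exc" keeps the original KeyError attached, so whoever reads the stack trace later still sees where the failure started.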
I don’t want to get into too much detail on this topic, as it deserves a totally separate article. I just wanted to point out that exception handling needs to be taken seriously if you want to achieve that enterprise grade.
(9) Error Handling:
Don’t mistake Error handling for Exception handling. Error handling is different. An error could be a result of a specific IF condition in your code or it could be a result of an exception that got caught. Basically your team needs to decide how to translate all the errors within the microservice into HTTP codes, and you can also enrich the JSON response with some extra details. Don’t forget that you can, for example, use a specific range of custom 5xx HTTP codes to provide a more meaningful response to your client. Your team needs to analyze all the paths in the code and decide what HTTP codes will be sent to the client. Don’t just have all errors be HTTP 500; that helps neither your clients nor you when you are troubleshooting your service.
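One way to keep that translation in a single place is a lookup table from error type to HTTP code. The sketch below is only an illustration: the error classes and the JSON field names are hypothetical, and I am using standard status codes here where your team may decide on custom ones.

```python
from http import HTTPStatus


class ValidationError(Exception):
    """The request failed one of our IF conditions."""


class NotFoundError(Exception):
    """The requested resource does not exist."""


class DownstreamTimeout(Exception):
    """A service we depend on did not answer in time."""


ERROR_TO_STATUS = {
    ValidationError: HTTPStatus.BAD_REQUEST,        # 400
    NotFoundError: HTTPStatus.NOT_FOUND,            # 404
    DownstreamTimeout: HTTPStatus.GATEWAY_TIMEOUT,  # 504
}


def to_http_response(error, correlation_id):
    """Translate any error into an HTTP code plus a JSON-ready body."""
    known = type(error) in ERROR_TO_STATUS
    status = ERROR_TO_STATUS.get(type(error), HTTPStatus.INTERNAL_SERVER_ERROR)
    body = {
        "code": type(error).__name__ if known else "InternalError",
        # Never leak internal details on server-side failures.
        "message": str(error) if status < 500 else status.phrase,
        "correlationId": correlation_id,
    }
    return int(status), body
```

Anything that is not in the table still ends up as a 500, but now that is a deliberate fallback and not the default for every error path.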
(10) Timeouts in your API calls:
Timeouts in your API calls are something that is typically taken very lightly. We engineers just code the call, and then when things stop working, we wonder what the defaults are and whether those defaults make sense. Understand what the defaults are in the library that you are using to perform HTTP/API calls. Then assess what is reasonable from the client’s point of view.
Dynamically variable timeouts are even better if your team has time to implement them. Let’s assume you have a soft timeout of 15 seconds for outbound API calls from your microservice. Now your service is calling another microservice and it starts timing out at 15 seconds. You could build smarts into your system where timeouts dynamically change from the 15-second soft limit to 20 seconds. Your application then knows what percentage improvement you made by increasing the timeout to 20 seconds, and based on that data it can decide whether to stay at 20 seconds, increase the timeout to 25 seconds, or go back down to 15 seconds. This mechanism allows your application to keep functioning and serving customers. Obviously you would also need a hard limit for your timeout, because you need to draw a line and determine what timeout is truly not acceptable for end customers. This mechanism also allows you to be a bit more aggressive by starting with lower soft limits, which improves the overall performance of your microservice because you don’t have as many connections held open due to longer timeouts.
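A deliberately simplified sketch of such a mechanism is below. It uses the 15-second soft limit and 5-second steps from the example above; the 30-second hard limit, the window size and the 20% threshold are my assumptions. It also decides purely on the timeout rate per window, which is a simpler rule than measuring the percentage improvement after each change.

```python
class AdaptiveTimeout:
    """Moves the timeout between a soft and a hard limit based on recent calls."""

    def __init__(self, soft=15.0, hard=30.0, step=5.0, window=20, threshold=0.2):
        self.soft, self.hard, self.step = soft, hard, step
        self.window = window          # number of calls per decision
        self.threshold = threshold    # timeout rate that triggers an increase
        self.current = soft
        self._outcomes = []           # True means the call timed out

    def record(self, timed_out):
        """Record one call outcome and adjust the timeout once per window."""
        self._outcomes.append(bool(timed_out))
        if len(self._outcomes) < self.window:
            return
        rate = sum(self._outcomes) / len(self._outcomes)
        self._outcomes.clear()
        if rate > self.threshold:
            # Too many timeouts: give the downstream service more time,
            # but never go past the hard limit.
            self.current = min(self.current + self.step, self.hard)
        elif rate == 0:
            # A clean window: move back toward the soft limit.
            self.current = max(self.current - self.step, self.soft)
```

Your HTTP client would read the current value before each outbound call and call record() with the outcome afterwards.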
(11) API Contracts & Stubbed or Mock services:
This item is more about the development methodology that I would strongly recommend. I have a short article that talks about developing the layers of cake instead of slice of cake: https://medium.com/@almirx101/developing-layers-of-cake-instead-of-slices-of-cake-in-your-software-engineering-projects-71483aff2df1
I strongly recommend that you force yourself to fully define the API contracts for all your microservices. As you go through this exercise, you will realize that you need to reach out to other teams to agree on how you will invoke their APIs and, at the same time, you should reach out to your clients and agree on what the best contract for your API is.
After defining the contracts, my strong recommendation would be to develop the stubbed or mock versions of your APIs honoring these contracts. So if I called any of your microservices, I would get proper JSON responses. That means you internally would have that JSON hardcoded or sitting in some temporary JSON file until you eventually implement the data/repository layer and your choice of database itself.
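A stub like that can be tiny. Here is a sketch of one as a plain WSGI application using only the Python standard library; the routes and JSON fields are hypothetical and would come from your agreed contract.

```python
import json

# Canned responses that honor the agreed contract (hypothetical routes and fields).
STUB_RESPONSES = {
    ("GET", "/v1/orders/123"): (200, {"orderId": "123", "status": "SHIPPED", "total": 42.5}),
    ("GET", "/v1/orders/999"): (404, {"code": "NotFoundError", "message": "order not found"}),
}

REASONS = {200: "OK", 404: "Not Found", 501: "Not Implemented"}


def stub_app(environ, start_response):
    """WSGI app that answers from STUB_RESPONSES until the real layers exist."""
    key = (environ["REQUEST_METHOD"], environ["PATH_INFO"])
    status, body = STUB_RESPONSES.get(
        key, (501, {"message": "route not stubbed yet"}))
    payload = json.dumps(body).encode("utf-8")
    start_response(f"{status} {REASONS[status]}", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To serve it locally you could run wsgiref.simple_server.make_server("", 8000, stub_app).serve_forever() from the standard library, and later swap the canned dictionary for calls into your real business layer without changing the contract.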
If everybody did this for each one of their microservices, then you are in theory building a thin layer of the cake or a thin layer of your microservice platform. This would enable your Front End team developing a website or a mobile application to implement the full journey A to Z by calling all these mock services. The sooner this full UI journey is given to product management team, the sooner they will get a chance to feel how good their requirements are and the sooner they will be able to adjust things and steer the ship in the right direction so that your end customers ultimately win.
(12) API Documentation and Swagger docs:
Don’t underestimate the power of documentation. I am not a big fan of LONG documents, but I am a big proponent of single-page documents. If it is longer than a page, the chance that people will read it is low. However, a single-page document will be read, and if supplemented by good examples, it is even more powerful. Please follow this recommendation for your API documentation. In theory, the Swagger documentation could replace the majority of your API documentation.
(13) Postman app in Chrome:
Postman app for Chrome is what I have been using to try out different APIs. If you use this app or similar type of app, I strongly recommend that you keep the collection of API calls saved and committed into your GitHub repo so that other members of the team don’t have to figure it out from scratch. Examples are the best way to share the documentation as long as you maintain these examples.
(14) Service/API Versioning and Regular clean-up:
In the API world, you can introduce API versions and you can require clients to pass you the version of the API that they want to call. Typically you can follow the Major.Minor.Patch versioning. For example, you can start with version 1.0.0 of your API. As you make small changes and fix defects, you keep incrementing the patch version. If you introduce some new features, then you increment the minor version. However, if you introduce a big set of new features and also refactor your API in a way that breaks some contracts, then you increment the major version.
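The rules above are simple enough to write down as code. This is a small sketch, with the change labels "fix", "feature" and "breaking" being names I picked for illustration:

```python
def bump(version, change):
    """Return the next Major.Minor.Patch version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        # Contract-breaking change: new major, reset minor and patch.
        return f"{major + 1}.0.0"
    if change == "feature":
        # New, backwards-compatible feature: new minor, reset patch.
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        # Small change or defect fix: new patch only.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

So starting from 1.0.0, a defect fix gives 1.0.1, a new feature on top of that gives 1.1.0, and a contract-breaking refactor after that gives 2.0.0.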
My recommendation is that you as a team don’t maintain/support too many versions at any given time. As you deprecate the old API versions, please make sure that you properly clean up all the resources that have been used by that version in order to save costs. For example, that could be anything from dedicated servers for that API version to a dedicated cache, dedicated database, configuration files, etc.
(16) Backwards Compatibility:
One way to retain backwards compatibility in APIs is to introduce a new API version, but if you did that for every little thing, you would end up in API version management hell very soon. Introduce guidelines in your teams for when to increment API versions. But let’s focus a bit on keeping backwards compatibility within the existing API version. When it comes to model classes in your Java/C#/Python/NodeJS code and the data structure/schema in your choice of database, it is important that changes are always backwards compatible. For example, you could introduce new attributes in these model classes, but if you make these new attributes mandatory, then you need to think about how your code would read the pre-existing records in the database, which do not have them. You need to make sure that your code gracefully handles that. As for changing the existing attributes in your model classes, that should never happen; that’s an interface change and you should never do interface changes.
Therefore, my general advice is to always think about backwards compatibility and make it a part of your programming culture.
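As a small illustration of reading pre-existing records gracefully, here is a sketch with a hypothetical Customer model class where loyalty_tier was added after records already existed in the database:

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    customer_id: str
    name: str
    # Added in a later release. The default keeps pre-existing records readable.
    loyalty_tier: str = "NONE"


def customer_from_record(record):
    """Build a Customer from a stored record, old or new."""
    known = {f.name for f in fields(Customer)}
    # Ignore attributes this version of the code does not know about, so a
    # record written by newer code does not break older code either.
    return Customer(**{k: v for k, v in record.items() if k in known})
```

The new attribute is effectively mandatory for the business logic, but because it has a default, an old record without it still loads instead of failing.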
(17) Security behind your API routes and header information in POST/PUT calls:
Assuming that you have a mechanism to configure your API routes as internal to the platform or external to outside client, you need to have a process around tracking the security of your API routes. This is important especially if your external routes could have different roles depending on what token you get authenticated for.
Then when we get a bit deeper into the HTTP call itself, it is important that you have standards around what is acceptable in the header of the POST/PUT HTTP calls. This will save you a lot of headaches. Now that you know what is acceptable in the header, you need to have mechanisms for deciding whether a given header attribute can be injected by your external client or not. Even if it is injected, it would not matter as long as, further down this HTTP call, you strip out that information and replace it with correct values; if you have this type of capability, you are that much better off.
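A minimal sketch of that strip-and-replace step is below. The allow-list and the x-user-id header are hypothetical names for illustration; the idea is that anything the client is not allowed to set gets dropped, and identity headers are always written by the platform after authentication.

```python
# Headers an external client is allowed to send (hypothetical allow-list).
ALLOWED_FROM_CLIENT = {"content-type", "accept", "authorization", "x-correlation-id"}


def sanitize_headers(incoming, authenticated_user_id):
    """Drop headers the client may not set, then inject trusted values."""
    clean = {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() in ALLOWED_FROM_CLIENT
    }
    # Always overwritten by the platform, whatever the client sent.
    clean["x-user-id"] = authenticated_user_id
    return clean
```

With this in place, a client that sends its own X-User-Id: admin header gains nothing, because the value is replaced with the id that came out of token authentication.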
(18) Security — Encryption of data at the attribute/column level:
To be at the enterprise level and pass PCI regulations, you need to have a library for your microservices that fully abstracts the encryption of data. The management of keys needs to be done on dedicated appliances.
(19) API Automation approach:
First, you need to implement an automation framework that your teams can use.
Then your team needs to focus on automating the most likely real-life scenarios with APIs instead of trying to exhaust the permutations on each individual API.
Initial automation tests should exercise how your family of APIs work together. Then you can expand your horizon and focus on more holistic view joining forces with automation engineers from other teams.
(20) Batch process within your platform:
The microservice architecture has the event driven methodology built in, but there are still 3rd parties that interact with your platform and require a batch oriented design. Therefore, your team needs to make sure that you have a reference architecture to handle batch oriented implementations. Depending on whether you use a NoSQL or a relational type of database, this batch oriented solution will be different, but you need to have it.
(21) Load and Performance testing:
Load and performance testing needs to be done for each microservice before that microservice can be accepted into the production environment and opened to real customers. Your team needs to be able to determine with confidence for every microservice:
- What is the comfortable memory & CPU usage while the service is functioning fine but close to crashing?
- What is the memory & CPU usage when the service starts crashing in the load test?
- How many requests your service can handle per minute or per second?
Your team needs to document all of this in order to be accepted into production.
(22) Circuit-breaker pattern especially when calling external (3rd party) APIs:
In order to prevent the back-pressure effect on your microservices, you need to introduce the circuit-breaker pattern for situations when your microservice calls into another service/API. This is especially crucial when your microservice calls into an API external to your company; you don’t want the stability of 3rd parties to have a direct impact on your application and cause so much back-pressure that the other pieces of your service are affected.
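For readers who have not seen the pattern in code, here is a bare-bones sketch. The thresholds are arbitrary assumptions, and in a real service you would more likely reach for an existing, well-tested library than roll your own; this is only meant to show the closed, open and half-open states.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that keeps failing."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit is open; failing fast")
            # Half-open: the wait is over, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would keep one breaker per downstream dependency and wrap each outbound call in breaker.call(...), so a struggling 3rd party costs you a fast error instead of a pile of hanging connections.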
(23) Database design:
First you need to decide whether you will embrace the NoSQL world or stick to the more traditional SQL/relational database world.
Let’s assume you decide to pick NoSQL, then you need to keep in mind that there is NO SUCH THING as no schema. You technically always have some data structure but your database system may not be enforcing it, and with code-first methodology, your application code would be enforcing all of this. Even if you pick NoSQL, your data still needs to be eventually streamed or replicated into some form of relational database if you want your reporting/analytics team to perform analysis on it.
Therefore, regardless of NoSQL or SQL choice, your team needs to implement the streaming/replication of data into a reporting database.
Don’t forget about the database backups. I am talking about the cold backups or snapshots. For example, if you realize that your data has been unreliable for the last few hours, or the system fully went down, then you need to have the option to pull the backups and restore from earlier that day.
(24) Database capacity provisioning:
Your team needs to provision your database for your scale of traffic and the amount of data. For example in AWS DynamoDB NoSQL world, you need to provision the reads and writes per second and you pay for what you provisioned. At the same time you need to be able to handle quick bursts of traffic or longer or consistent increases in traffic. Similar concepts need to be applied in the relational database world as well. For example, in Amazon Aurora you don’t necessarily provision tables ahead of time, but you do need to worry about the overall amount of data you can store in Aurora instances.
(25) Data / Data structure Governance & Reporting/Analytics:
What do I mean by “data governance”? Your team needs to have procedures with the goal of preventing breaking changes to data structures and negative impacts on the integrity of data. For example, in the NoSQL world, code comes first, but to manage data structure changes, there are no stored procedures; you really need to have procedures for software engineers to inform the data team about any model class changes in your Java/C#/Python/Go/NodeJS code. You always need to be backwards compatible, and since this NoSQL data eventually ends up in some form of relational database, you need to keep informing the data team about the attributes that have been added to the data structure. Ideally you would need an automated process that does static code analysis and always detects changes in these model classes even if programmers forget to inform each other and the database teammates.
On the other hand, if you are in the relational database world, you are controlling things through your stored procedures that are abstracting the data structure from software engineers and the database team is in more control.
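For the NoSQL case, the automated detection of model class changes could start from something as simple as comparing two snapshots of a model's attributes. The sketch below assumes a hypothetical step that has already extracted each model class into a dictionary of attribute name to type name; how you extract that from your code base is a separate problem.

```python
def diff_schema(old, new):
    """Compare two attribute-to-type snapshots of the same model class."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        # Removing or retyping an attribute breaks readers of existing data;
        # adding one does not, as long as it has a sensible default.
        "breaking": bool(removed or changed),
    }
```

Run in the build pipeline, a report like this could notify the data team about every added attribute and fail the build on anything marked as breaking.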
Reporting and analytics has always been one of those afterthoughts. In theory, you should start with this. Before a line of code is written, you could specify what defines “success” for you and you should be able to break it down into multiple measures. I am sure that the requirements and designs will change throughout the implementation phase, but at the principle level, it stays the same. Therefore, introduce the concepts of reporting and analytics into your work environment from the early phases of design and development.
In conclusion, writing code is relatively simple, and when your software engineers say they are done, you should clearly define what “done” means from an enterprise point of view.
Thank you for reading this long article and I hope you found it useful as you go down this path of implementing and releasing enterprise level applications/microservices.