Monday, June 26, 2017

Top 25 things to achieve Enterprise-Level Microservices — What your team should accomplish

Developing applications/microservices is definitely a skill that your team needs to gain, but what does it mean when you say “I am done”? What is the definition of “DONE”?
A lot of the time you hear the following statements from us software engineers:
  • “I am done coding that part but I still need to wrap up my unit-testing”.
  • “My user story is done, but we just need to finish writing the automation tests.”
  • “I am done, but I just need to add better logging in the code and add some better exception handling.”
  • “I am done, but I have not deployed to QA environment yet; it works on my machine.”
We have all been in these shoes. We need to step back, look at this from the 10,000-foot software engineering level, and truly understand what it means to deliver something, call it done, and have it be enterprise worthy at the same time.
Here is my recommended list of things that your TEAM needs to accomplish in order to truly have production and enterprise worthy applications/microservices:
(1) Architecture diagram representing interactions of your family of microservices:
After going through the design and contract definitions of your APIs/microservices, your team needs to put together a single diagram that explains the interaction among microservices (owned by your team) and how these microservices interact with other teams’ microservices. This is basically the blueprint for your team’s work and this picture needs to be constantly updated as you are evolving your services.
(2) Service names, boundaries (domain lines) and API definitions/standards:
Names mean things. So you first need to properly name your services, and those are the names you would use when talking to your teammates and to the clients of your services/APIs.
I have a separate article on how you go about defining what a microservice is: https://medium.com/@almirx101/micro-in-microservices-677e183baa5a
Essentially, you need to define the purpose and boundaries of your service.
Then you get into API routes and properly defining them for each service. The goal is to keep the routes RESTful; if you run into a situation where they are not, that should trigger you to revisit the purpose and boundaries of that microservice.
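For illustration only, here is a minimal sketch of what RESTful routes could look like for a hypothetical Customers microservice, using Python and Flask; the service name, routes and fields are my own assumptions, not a prescription:

# A minimal sketch of RESTful routes for a hypothetical Customers microservice (Flask).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/customers", methods=["GET"])
def list_customers():
    # Collection route: nouns in the path, no verbs.
    return jsonify([{"customerId": "123", "name": "Jane Doe"}])

@app.route("/customers/<customer_id>", methods=["GET"])
def get_customer(customer_id):
    # Resource route identified by id; a route like /getCustomerData would be a
    # hint that the service boundary or contract needs another look.
    return jsonify({"customerId": customer_id, "name": "Jane Doe"})

@app.route("/customers", methods=["POST"])
def create_customer():
    payload = request.get_json() or {}
    return jsonify({"customerId": "124", **payload}), 201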
(3) Application CONFIG file defining your service:
The assumption would be that your platform is set up in such a way that you, as the owner of a microservice, properly set up your application CONFIG file, which defines your microservice and all the resources that it needs. This config file defines anything from what database you need, database details, what cache you need, what queues you need, how much memory and CPU you want your application to use, what physical server you want your application to run on, etc.
Basically this config file is the contract between you (as the owner of a microservice) and the DevOps scripts/tools that take this config file and use it as the definition to deploy your service into a given environment, whether it is in the cloud or on-prem.
You own this config file, but with ownership comes responsibility. It is important that this config file is code-reviewed and managed properly to maintain the stability of your application.
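As an illustration only (your platform will define its own schema), here is a sketch of what such a service definition could contain, expressed as a Python dictionary; every key and value shown is a hypothetical example:

# Illustrative only: a sketch of what a service definition could contain.
# The actual schema is whatever your platform and DevOps tooling agree on.
SERVICE_CONFIG = {
    "service": "customers-api",
    "resources": {
        "database": {"engine": "dynamodb", "table": "customers"},
        "cache": {"engine": "redis", "size_mb": 256},
        "queues": ["customer-events"],
    },
    "runtime": {"memory_mb": 512, "cpu": 0.5, "instances": 3},
    "dependencies": ["orders-api", "notifications-api"],  # ties into item (4) below
}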
(4) API Dependencies (Direct or via Pub/Sub approach):
As you are defining your microservices and API contracts, it is important that you explicitly think about API dependencies. Think about the type of dependency you want to achieve. You can directly call another microservice via an HTTP call (sync or async), or you can use the publish/subscribe methodology. It depends on your use case and there is no right or wrong here. The important thing is that you give it explicit thought as you are designing your family of microservices, the interaction among them, and the interaction with outside services. Ideally you could have this API dependency definition within the config file mentioned above.
(5) API Dependency Analysis Tool:
As you have all these API dependencies, they need to be managed somehow. You need a holistic view of all your services. Therefore, you need to develop a tool that performs static analysis of your config files and maps out a graph of all microservices and how they interact with each other. To supplement this static analysis tool, you would also need a tool that maps, in real time, all the microservice interactions that are actually taking place in production.
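Here is a rough sketch of the static-analysis half in Python, assuming each service keeps a JSON config file with a "dependencies" list as in the sketch under item (3); the file layout and field names are assumptions:

# Walks a directory of service config files and prints the dependency graph.
import json
from pathlib import Path

def build_dependency_graph(config_dir):
    graph = {}
    for config_file in Path(config_dir).glob("*/service-config.json"):
        config = json.loads(config_file.read_text())
        graph[config["service"]] = config.get("dependencies", [])
    return graph

if __name__ == "__main__":
    graph = build_dependency_graph("./services")
    for service, deps in graph.items():
        for dep in deps:
            print(f"{service} -> {dep}")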
(6) Logging / correlationId / Log-Levels:
First agree on the format of your log messages. Then implement a logging framework in a library that all your microservices can utilize.
Make sure that you introduce the concept of a correlationId that can be used to map a flow, especially across microservices. For example, your microservice could be logging a requestId or some form of a GUID unique to that service, but in order to connect things among multiple components/services, you need the concept of a correlationId.
Now that you have the foundation and the principles down, your team needs to have the right mindset; it looks simple on paper, but there is much more to it. How many times have you been in a situation where you were troubleshooting a production issue and said to yourself: “I wish I had a log line for this condition”. When it comes to logging, here are some tips:
  • Log positive as well as negative cases
  • Any time you add an IF condition in your code, you need to think about what you will log in the IF, ELSE IF and ELSE.
  • Log your exceptions if you catch them, and enrich the log with the context you have at that level and not at the level above in the stack trace.
  • Introduce the concept of log-levels and make sure that the production environment is not logging data that should not be logged (personal info) and that the excessive logging does not negatively impact performance.
An unwritten rule of logging is that you do NOT put a log statement in a loop unless you truly understand that the number of iterations will not impact your performance when multiple requests are coming into your service.
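To make the correlationId idea concrete, here is a minimal Python sketch; the header name X-Correlation-Id and the JSON log format are assumptions that your team would standardize in a shared library:

import json
import logging
import uuid

logger = logging.getLogger("customers-api")

def log_event(correlation_id, level, message, **fields):
    # One structured log line; the correlationId ties log lines together across services.
    logger.log(level, json.dumps({"correlationId": correlation_id, "message": message, **fields}))

def handle_request(headers, payload):
    # Reuse the caller's correlationId if present; otherwise start a new one.
    correlation_id = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    if payload.get("customerId"):
        log_event(correlation_id, logging.INFO, "looking up customer", customerId=payload["customerId"])
    else:
        # Log the negative case too, not just the happy path.
        log_event(correlation_id, logging.WARNING, "request received without customerId")
    return correlation_id  # pass it along on outbound calls so flows can be stitched together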
(7) Log Aggregator Tool:
Your team, or your company overall, needs to pick a log aggregator tool that allows you to view all your flows through a website and perform fancy queries on your logs. For example, you can pick Splunk. Your service would use a library to stream all the logs into Splunk, and Splunk would be your eyes into the system.
Now that you have a tool like Splunk, you need to take it to the next level. Each team should be responsible for building dashboards in Splunk that track the health of its microservices. These dashboards could be used by any technical person troubleshooting things in production.
The next thing to do is for your Network Operations Center to set up alerts in Splunk so that they get informed when certain limits are reached and know when to open a severity 1 or 2 ticket with the right microservices team.
(8) Exception Handling in your code:
In your choice of general-purpose language, you have the concept of exception handling (try-catch or try-except blocks). Even though your services are microservices, that does not mean a microservice internally contains only one layer. Most likely you will have your API level, or controller layer, or resource layer. That’s the code where your API routes are defined and where you have the entry points for each route. That code typically should do just some validation and then proxy your call down into your business layer; your business layer makes calls into the database/repository layer and/or into another API via your adapter layer. As you write code in these layers, you need to think about whether you need a try/catch block or not. For example, there is no reason to catch an exception at a lower level unless you need to log something that is ONLY available at that layer and not at the layer of the invoker.
I don’t want to get into too much detail on this topic as it deserves a totally separate article. I just wanted to point out that exception handling needs to be taken seriously if you want to achieve that enterprise grade.
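To illustrate the layering point, here is a small sketch in Python; the class and function names and the database_lookup helper are hypothetical, not a prescribed structure:

import logging

class RepositoryError(Exception):
    """Raised by the repository layer with the context only it has."""

def get_customer_record(customer_id):            # repository layer
    try:
        return database_lookup(customer_id)      # hypothetical data-access helper
    except ConnectionError as exc:
        # Only this layer knows which datastore call failed; enrich and re-raise.
        raise RepositoryError(f"customer lookup failed for id={customer_id}") from exc

def get_customer(customer_id):                   # controller layer
    try:
        return get_customer_record(customer_id)
    except RepositoryError:
        # The top layer logs once, with the full picture, and decides the response.
        logging.exception("GET /customers/%s failed", customer_id)
        raise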
(9) Error Handling:
Don’t mistake error handling for exception handling. Error handling is different. An error could be the result of a specific IF condition in your code, or it could be the result of an exception that got caught. Basically, your team needs to decide how to translate all the errors within the microservice into HTTP codes, and you can also enrich the JSON response with some extra details. Don’t forget that you can, for example, use a specific range of 5xx custom HTTP codes to provide a more meaningful response to your client. Your team needs to analyze all the paths in the code and decide what HTTP codes will be sent to the client. Don’t just have all errors be HTTP 500; that helps neither your clients nor you when you are troubleshooting your service.
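As a sketch of that mapping exercise (the error names, status codes, and response shape below are purely illustrative):

# Translate internal error conditions into specific HTTP codes and an enriched
# JSON body instead of a blanket 500.
ERROR_MAP = {
    "CUSTOMER_NOT_FOUND": (404, "No customer exists for the given id"),
    "DUPLICATE_CUSTOMER": (409, "A customer with this id already exists"),
    "DOWNSTREAM_TIMEOUT": (504, "A dependent service did not respond in time"),
}

def error_response(error_code, correlation_id):
    status, message = ERROR_MAP.get(error_code, (500, "Unexpected error"))
    body = {"errorCode": error_code, "message": message, "correlationId": correlation_id}
    return status, body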
(10) Timeouts:
Timeouts in your API calls are something that is typically taken very lightly. We engineers just code it up, and then when things stop working, we wonder what the defaults are and whether those defaults make sense. Understand what the defaults are in the library that you are using to perform HTTP/API calls. Then assess what is reasonable from the client’s point of view.
Dynamically variable timeouts are even better if your team has time to implement them. Let’s assume you have a soft timeout of 15 seconds for outbound API calls from your microservice. Now your service is calling another microservice and it starts timing out at 15 seconds. You could build smarts into your system where the timeout dynamically changes from the 15-second soft limit to 20 seconds. Then your application knows what percentage improvement you made by increasing the timeout to 20 seconds. Based on that data, your application can decide whether to stay at 20 seconds, increase the timeout to 25 seconds, or go back down to 15 seconds. This mechanism allows your application to still function and serve customers. Obviously you would also need a hard limit for your timeout, because you need to draw a line and determine what timeout is truly unacceptable for end customers. This mechanism allows you to be a bit more aggressive by starting with lower soft limits, which improves the overall performance of your microservice because you don’t keep as many connections open due to long timeouts.
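Here is a deliberately simplified sketch of the soft/hard timeout idea using the Python requests library; the limits and step size are assumptions, and a real implementation would adjust them based on collected data rather than a blind retry loop:

import requests

SOFT_TIMEOUT_SECONDS = 15
HARD_TIMEOUT_SECONDS = 30
STEP_SECONDS = 5

def call_with_adaptive_timeout(url):
    timeout = SOFT_TIMEOUT_SECONDS
    while True:
        try:
            return requests.get(url, timeout=timeout)
        except requests.Timeout:
            timeout += STEP_SECONDS
            if timeout > HARD_TIMEOUT_SECONDS:
                # Past the hard limit: stop retrying and let the caller decide
                # what to tell the end customer.
                raise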
(11) API Contracts & Stubbed or Mock services:
This item is more about the development methodology that I would strongly recommend. I have a short article that talks about developing layers of cake instead of slices of cake: https://medium.com/@almirx101/developing-layers-of-cake-instead-of-slices-of-cake-in-your-software-engineering-projects-71483aff2df1
What I strongly recommend is that you force yourself to fully define the API contracts for all your microservices. As you go through this exercise, you will realize that you need to reach out to other teams to agree on how you will invoke their APIs, and at the same time you should reach out to your clients and agree on what the best contract for your API is.
After defining the contracts, my strong recommendation would be to develop stubbed or mock versions of your APIs honoring these contracts. So if I called any of your microservices, I would get proper JSON responses. That means you internally would have that JSON hardcoded or sitting in some temporary JSON file until you eventually implement the data/repository layer and your choice of database itself.
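For example, a stubbed version of a Customers API could be as small as this sketch; the route and fields are assumed to match whatever contract your teams agreed on:

# A stub that honors the agreed contract while the real data layer is still being built.
from flask import Flask, jsonify

stub_app = Flask(__name__)

@stub_app.route("/customers/<customer_id>", methods=["GET"])
def get_customer_stub(customer_id):
    # Hardcoded response matching the agreed JSON contract; swap in the real
    # repository layer later without changing the contract.
    return jsonify({"customerId": customer_id, "name": "Jane Doe", "status": "ACTIVE"})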
If everybody did this for each one of their microservices, then you are in theory building a thin layer of the cake, or a thin layer of your microservice platform. This would enable your front-end team developing a website or a mobile application to implement the full journey from A to Z by calling all these mock services. The sooner this full UI journey is given to the product management team, the sooner they will get a chance to feel how good their requirements are, and the sooner they will be able to adjust things and steer the ship in the right direction so that your end customers ultimately win.
(12) API Documentation and Swagger docs:
Don’t underestimate the power of documentation. I am not a big fan of LONG documents, but I am a big proponent of single-page documents. If it is longer than a page, the chance that people will read it is low. However, a single-page document will be read, and if it is supplemented by good examples, it is even more powerful. Please follow this recommendation for your API documentation. In theory, the Swagger documentation could replace the majority of your API documentation.
(13) Postman app in Chrome:
Postman app for Chrome is what I have been using to try out different APIs. If you use this app or similar type of app, I strongly recommend that you keep the collection of API calls saved and committed into your GitHub repo so that other members of the team don’t have to figure it out from scratch. Examples are the best way to share the documentation as long as you maintain these examples.
(14) Service/API Versioning and Regular clean-up:
In the API world, you can introduce API versions and require clients to pass you the version of the API that they want to call. Typically you can follow Major.Minor.Patch versioning. For example, you can start with version 1.0.0 of your API. As you make small changes and fix defects, you keep incrementing the patch version. If you introduce some new features, then you increment the minor version. However, if you introduce a big set of new features and also refactor your API in a way that breaks some contracts, then you increment the major version.
My recommendation is that you as a team don’t maintain/support too many versions at any given time. As you deprecate old API versions, please make sure that you properly clean up all the resources that have been used by that version in order to save costs. For example, that could be anything from dedicated servers for that API version to a dedicated cache, a dedicated database, configuration files, etc.
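As a small illustration, the major version can be exposed in the route itself so that old versions can be deprecated and cleaned up independently; the paths and fields below are hypothetical:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/customers/<customer_id>", methods=["GET"])
def get_customer_v1(customer_id):
    return jsonify({"customerId": customer_id, "name": "Jane Doe"})

@app.route("/v2/customers/<customer_id>", methods=["GET"])
def get_customer_v2(customer_id):
    # v2 breaks the v1 contract (the name field is split), hence the major-version bump.
    return jsonify({"customerId": customer_id, "firstName": "Jane", "lastName": "Doe"})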
(15) Cache:
A cache can be used as a mechanism to store configuration values and speed up access to them. A cache can also be used to hold transactional information. For example, you could have a Customers microservice that saves transactional information into the cache for 15 minutes because the front-end (HTML/JavaScript) code of your website accesses your Customers API on many pages. That’s where you can improve performance by caching this information, and obviously any time a customer’s information changes, you need to update the cache. My recommendation is that you analyze your use cases and see how and whether you need to use a cache. If you do decide to use a cache, then make sure that your application/microservice cache size can be determined on a service-by-service basis. Some services need more cache and some less, and this approach will keep your costs down.
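A minimal sketch of that 15-minute customer cache, assuming Redis as the cache engine; the key format and connection details are assumptions:

import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CUSTOMER_TTL_SECONDS = 15 * 60

def get_customer_cached(customer_id, load_from_db):
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    customer = load_from_db(customer_id)
    # Expire after 15 minutes; also delete/overwrite this key whenever the customer record changes.
    cache.set(key, json.dumps(customer), ex=CUSTOMER_TTL_SECONDS)
    return customer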
(16) Backwards Compatibility:
One way to preserve backwards compatibility in APIs is to introduce a new API version, but if you did that for every little thing, you would end up in API version management hell very soon. Introduce guidelines in your teams for incrementing API versions. But let’s focus a bit on keeping backwards compatibility within the existing API version. When it comes to model classes in your Java/C#/Python/NodeJS code and the data structure/schema in your choice of database, it is important that these changes are always backwards compatible. For example, you could introduce new attributes in these model classes, but if you make these new attributes mandatory, then you need to think about how your code would read the pre-existing records in the database. You need to make sure that your code gracefully handles that. As for changing the existing attributes in your model classes, that should never happen; that’s an interface change, and you should never do interface changes.
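Here is a small sketch of what a backwards-compatible model change could look like in Python; the field names are hypothetical:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:
    customer_id: str
    name: str
    # Added later; optional with a default, so records written before the
    # change still load without errors.
    loyalty_tier: Optional[str] = None

def from_record(record: dict) -> Customer:
    return Customer(
        customer_id=record["customerId"],
        name=record["name"],
        loyalty_tier=record.get("loyaltyTier"),  # tolerate pre-existing records
    )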
Therefore, my general advice is to always think about backwards compatibility and make it a part of your programming culture.
(17) Security behind your API routes and header information in POST/PUT calls:
Assuming that you have a mechanism to configure your API routes as internal to the platform or external to outside clients, you need to have a process around tracking the security of your API routes. This is especially important if your external routes can have different roles depending on what token you are authenticated with.
Then, when we get a bit deeper into the HTTP call itself, it is important that you have standards around what is acceptable in the header of POST/PUT HTTP calls. This will save you a lot of headaches. Now that you know what is acceptable in the header, you need to have mechanisms for deciding whether a given header attribute can be injected by your external client or not. Or, even if it is injected, it would not matter if, further down the HTTP call, you strip out that information and replace it with correct values; if you have this type of capability, you are that much better off.
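A tiny sketch of that idea; the header names and the whitelist are assumptions:

# Only whitelisted headers from the external client survive, and sensitive
# ones are replaced with values the platform itself established.
ALLOWED_INBOUND_HEADERS = {"Content-Type", "Accept", "X-Correlation-Id"}

def sanitize_headers(inbound_headers, authenticated_customer_id):
    headers = {k: v for k, v in inbound_headers.items() if k in ALLOWED_INBOUND_HEADERS}
    # Even if the client injected its own value, overwrite it with what the
    # authentication layer determined.
    headers["X-Customer-Id"] = authenticated_customer_id
    return headers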
(18) Security — Encryption of data at the attribute/column level:
To be at the enterprise level and pass PCI regulations, you need a library for your microservices that fully abstracts the encryption of data, and the management of keys needs to be done on dedicated appliances.
(19) API Automation approach:
First, you need to implement an automation framework that your teams can use.
Then your team needs to focus on automating the most likely real-life scenarios with APIs instead of trying to exhaust the permutations on each individual API.
Initial automation tests should exercise how your family of APIs works together. Then you can expand your horizons and focus on a more holistic view, joining forces with automation engineers from other teams.
(20) Batch process within your platform:
The microservice architecture has the event-driven methodology built in, but there are still 3rd parties that interact with your platform and require a batch-oriented design. Therefore, your team needs to make sure that you have a reference architecture to handle batch-oriented implementations. Depending on whether you use a NoSQL or a relational database, this batch-oriented solution will be different, but you need to have it.
(21) Load and Performance testing:
Load and performance testing needs to be done for each microservice before that microservice can be accepted into the production environment and opened to real customers. Your team needs to be able to determine with confidence, for every microservice:
  • What is the memory & CPU usage at which the service still functions fine but is close to its limit?
  • What is the memory & CPU usage at which the service starts crashing in the load test?
  • How many requests can your service handle per minute or per second?
Your team needs to document all of this in order for the service to be accepted into production.
(22) Circuit-breaker pattern especially when calling external (3rd party) APIs:
In order to prevent a back-pressure effect on your microservices, you need to introduce the circuit-breaker pattern for situations when your microservice calls into another service/API. This is especially crucial when your microservice calls into an API external to your company; you don’t want the stability of 3rd parties to have a direct impact on your application and cause so much back-pressure that the other pieces of your service are affected.
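Here is a minimal, hand-rolled sketch of the pattern in Python so the idea is concrete; the thresholds are assumptions, and in practice you would likely reach for an existing library (for example pybreaker in Python or Hystrix on the JVM) rather than this simplified version:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the circuit is open, fail fast instead of piling up slow downstream calls.
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            raise RuntimeError("circuit open: failing fast instead of calling downstream")
        try:
            result = func(*args, **kwargs)
            self.failures, self.opened_at = 0, None  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # too many consecutive failures: open the circuit
            raise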
(23) Database design:
First you need to decide whether you will embrace the NoSQL world or stick to the more traditional SQL/relational database world.
Let’s assume you decide to pick NoSQL; then you need to keep in mind that there is NO SUCH THING as no schema. You technically always have some data structure, but your database system may not be enforcing it, and with a code-first methodology your application code ends up enforcing all of this. Even if you pick NoSQL, your data still needs to eventually be streamed or replicated into some form of relational database if you want your reporting/analytics team to perform analysis on it.
Therefore, regardless of NoSQL or SQL choice, your team needs to implement the streaming/replication of data into a reporting database.
Don’t forget about database backups. I am talking about cold backups or snapshots. For example, if you realize that your data has been unreliable for the last few hours, or the system has gone fully down, then you need to have the option to pull the backups and restore from earlier that day.
(24) Database capacity provisioning:
Your team needs to provision your database for your scale of traffic and amount of data. For example, in the AWS DynamoDB NoSQL world, you provision reads and writes per second and you pay for what you provisioned. At the same time, you need to be able to handle quick bursts of traffic as well as longer, sustained increases in traffic. Similar concepts need to be applied in the relational database world as well. For example, in Amazon Aurora you don’t necessarily provision tables ahead of time, but you do need to worry about the overall amount of data you can store in your Aurora instances.
(25) Data / Data structure Governance & Reporting/Analytics:
What do I mean by “data governance”? Your team needs to have procedures with the goal of preventing breaking changes to data structures that would negatively impact the integrity of your data. For example, in the NoSQL world, code comes first and there are no stored procedures to manage data structure changes; you really need procedures for software engineers to inform the data team about any model class changes in your Java/C#/Python/Go/NodeJS code. You need to always be backwards compatible, and since this NoSQL data eventually ends up in some form of relational database, you need to keep informing the data team about the attributes that have been added to the data structure. Ideally you would have an automated process that performs static code analysis and always detects changes in these model classes, even if programmers forget to inform each other and their database teammates.
On the other hand, if you are in the relational database world, you are controlling things through your stored procedures that are abstracting the data structure from software engineers and the database team is in more control.
Reporting and analytics has always been one of those afterthoughts. In theory, you should start with it. Before a line of code is written, you could specify what defines “success” for you, and you should be able to break it down into multiple measures. I am sure that the requirements and designs will change throughout the implementation phase, but at the level of principle it stays the same. Therefore, introduce the concepts of reporting and analytics into your work environment from the early phases of design and development.
In conclusion, writing code is relatively simple; when your software engineers say they are done, you should clearly define what “done” means from an enterprise point of view.
Thank you for reading this long article and I hope you found it useful as you go down this path of implementing and releasing enterprise level applications/microservices.
Almir Mustafic





Saturday, June 24, 2017

Amazon WorkSpaces — Desktop in the cloud for non-technical users and even software engineers

I am not sure how many of you have heard of Amazon WorkSpaces. Let me give you some information and my analysis after using it for a week.
What is it?
It is Amazon’s solution for desktops in the cloud.
What’s the point?
Well, if you want to use a basic, less powerful laptop or even a Chromebook as your daily computer, you will not be missing the full functionality of a powerful Windows 10 machine. For example, you can use your Chromebook to connect to your Amazon WorkSpace machine and get the full Windows 10 experience, as if that remote workspace machine were really part of your Chromebook. Where it makes the most sense is using Chromebooks to connect to your Amazon WorkSpace machine: Chromebooks with Chrome OS are fast for anything you do in the browser, they are very secure, and they are generally cheaper than other laptops. Then, when you want to do some more advanced work (even software development), you connect to your remote workspace with the client app that Amazon provides. It basically turns your Chromebook into a Windows 10 laptop; it gives you that experience.
Instead of using a Chromebook as your laptop, you can also use an old Windows or an old macOS device and still remote into your Amazon WorkSpace machine and use the more powerful Windows 10 machine for all the work that you need to perform.
Who would consider this?
If your computer (Windows/macOS) is getting old, or you already have a Chromebook, then you may consider setting up an Amazon WorkSpace Windows 10 machine and paying monthly for it instead of buying a new laptop in your local tech store or online. You really need to work out for yourself what makes sense by doing some cost analysis. Ask yourself how often you buy new laptops. Based on my rough calculations, if you buy a new laptop every 2-3 years because you need to keep up with demands, then an Amazon WorkSpace could be a good option for you. However, if you change your laptop every 4 or 5+ years, then from a cost point of view it makes sense to continue buying new laptops instead of getting an Amazon WorkSpace.
On the other hand, in the enterprise world, companies get solutions like Amazon WorkSpaces because they want to reduce their maintenance/support costs. If you as a helpdesk specialist set up your employees with Chromebooks and then you give them access to an Amazon WorkSpace Windows 10 machine that already has MS Office and other apps installed, then you will spend less time supporting your users and you will have more control managing these machines in the cloud. Pushing updates and rolling out bigger OS changes will be much simpler for your helpdesk department.
Here is the pricing information:
[Image: Amazon WorkSpaces bundle pricing table]
You can see that, based on these machine specifications, a business person and most employees can easily utilize the “Standard” bundle. I am actually experimenting with the “Standard” bundle for software engineering purposes and it seems to be fast enough for some lightweight open-source type of development. The “Standard” bundle performs as well as my MacBook Air with 4 GB of memory. If I wanted to do some more intensive software development work, I am sure that the “Performance” bundle would do just fine, even though 7.5 GB of memory seems a bit low compared to what you would typically get if you were buying a laptop for software development; you would typically go for 16 GB or 32 GB of memory. The “Graphics” bundle seems to be a good option if you want to occasionally use it for some heavy processing, but it is a costly option for us consumers.
The important thing is that you are the local admin on the machine in the workspace. Therefore, you can install different software on that Windows 10 machine. For example, here is the list of applications that I have installed so far: Chrome, Dropbox, WinSCP, Python, a list of Python modules, PyCharm IDE for Python development, AWS CLI, and Git for Windows. I also plan to install Java and an IDE for Java.
To set up Amazon WorkSpaces, just follow the instructions here: https://aws.amazon.com/workspaces/
Once you set it up on the AWS website and install the client app on your laptop, you can open the client app, set up the credentials and then it will connect you into the workspace Windows 10 machine. Here are some screenshots:


  • Login windows in the client app
  • You are connected and in
  • Start menu
  • A few windows opened in the workspace machine
  • PyCharm IDE for Python coding
  • Switching between virtual desktops

You can also expand your client app across two monitors and the screen resolution adjusts perfectly.


Amazon WorkSpaces client app expanded across multiple monitors

I hope you found this article useful. I will continue using Amazon WorkSpaces for a couple of months and I will keep you updated.
Almir Mustafic



Thursday, June 15, 2017

AWS — When to use Amazon Aurora instead of DynamoDB

Amazon DynamoDB as a managed database will work for you if you prefer a code-first methodology. You will be able to scale it easily if your application inserts data and reads data by your hash key or primary key (hash + sort key). It is also good if your application is doing some queries on the data, as long as the result set of those queries returns less than 1 MB of data. Basically, if you stick to the functionality that is typically required by websites in real time, then DynamoDB will perform for you. Obviously you will need to provision the reads and writes properly and implement some auto-scaling on DynamoDB WCUs and RCUs, but after you do all of that homework, it will be smooth for you without needing to manage much.
However, there are cases when you will need to go back to relational databases in order to accomplish your business requirements and technical requirements.
For example, let’s assume that your website calls one of your microservices, which in turn inserts data into its table. Then let’s assume that you need to search the data in this table and perform big extracts which then have to be sent to a 3rd party that deals with your data in a batch-oriented way. If you need to, for example, query and extract 1 million records from your DynamoDB table, it can take up to 4.7 hours based on my prototypes using the standard AWS DynamoDB library from a Python or C# application. The way you read this amount of data is by using LastEvaluatedKey within DynamoDB: you query/scan and get 1 MB of data (due to the cutoff), and then, if LastEvaluatedKey does not indicate the end of the result set, you loop and continue fetching more results until you exhaust the list. This is feasible but not fast and not scalable.
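For reference, here is roughly what that loop looks like in Python with boto3 (the table name is illustrative); each scan call returns at most 1 MB of data, so a large extract turns into many sequential round trips:

import boto3

def extract_all_items(table_name):
    table = boto3.resource("dynamodb").Table(table_name)
    items = []
    scan_kwargs = {}
    while True:
        response = table.scan(**scan_kwargs)
        items.extend(response["Items"])
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break  # reached the end of the result set
        # Continue the scan from where the previous 1 MB page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key
    return items

# items = extract_all_items("customers")  # hypothetical table name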
My test client was outside the VPC; obviously, if you run it within the VPC, you will almost double your performance, but when it comes to bigger extracts, it still takes a long time. If you are dealing with fewer than 100,000 records, it is manageable within DynamoDB, but when you exceed 1 million records, it gets unreasonable.
So what do you do in this case? I am sure that you can improve the performance of the extract by using Data Pipeline and similar approaches that are more optimized, but you are still limited.
Basically, your solution would be to switch to a relational database where you can make your querying much faster and you have the concept of a transaction that helps with any concurrency issues you might have been challenged by. If you want to stay within the Amazon managed world, then Amazon Aurora looks very attractive. It has limitations on the amount of data, but most likely those limits are high enough not to be a problem for your business. As for the big-extract performance challenge, your extracts will go from hours (with DynamoDB) to minutes with Aurora.
Please consider this in your designs. Performing big extracts is the opposite of event-driven architecture, but these types of requirements still exist due to the need to support legacy systems that you have to interact with, or systems that have not adjusted their architecture to your methodologies.
Thank you for reading.
Almir Mustafic.