Thursday, June 18, 2015

Detect errors in time to gauge the success of your launch — Simple concepts, huge benefits!

When you develop your applications, do you put thought into a proper logging mechanism? In a perfect world, there would be no errors in the logs, but we all know that's not the case in reality. So how do you know that the project you just launched to a production environment is not introducing new errors, or new types of errors, on top of the existing errors that your triage team has been investigating?
Maybe on day 1 of your launch you don't see customer complaints, but in reality your systems are hurting or bleeding slowly. Yes, a network operations team could be checking the high-level health of the systems through a list of dashboards, but what are we software developers going to do to introduce a new level of detection?

This all starts with the enterprise architecture and frameworks you build. Let's assume you have a very robust logging mechanism and that this mechanism allows you to log the happy path. Let's also assume that you have very clean guidelines for error and exception handling, and for utilizing the logging framework where necessary.
Now that you have all of the above in place, at the beginning of a project that implements business requirements within the existing framework, you have the ability to cleanly define the top-20 cases that measure the success of the new code/features. Each developer can use this top-20 list as a guideline while developing the code and logging the happy/negative cases. Let's say your code is now in production and you are scanning through the logs manually, looking for the top-20 cases. Is this efficient? Are you supposed to do this manually on a daily basis?
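To make this concrete, a top-20 list can be as simple as a set of named patterns to look for in the logs. The sketch below is purely hypothetical; the scenario names and regexes are made up for illustration, and yours would come out of your own business requirements:

    # Hypothetical top-20 catalog: each entry maps a scenario name to a
    # regex that identifies that scenario in the application logs.
    TOP_20_PATTERNS = {
        "payment_gateway_timeout": r"PaymentGatewayTimeout",
        "order_validation_failed": r"OrderValidationError",
        "inventory_lookup_error": r"InventoryServiceException",
        # ... and so on, up to the 20 scenarios agreed on at project kickoff
    }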
My recommendation is that you develop a lightweight solution that will automatically do the following for you (a rough sketch in code follows the list):
  • Scan the logs on a daily/hourly basis, produce the counts of the top-20 scenarios, and display the results in a table on an internal dashboard website.
  • Detect when the number of errors in any category increases by more than X% (a daily comparison of errors per Y units of work, where a "unit of work" is something you define and tie to the traffic on your website).
  • Detect when new types of errors and exceptions start happening, so that the team can manually assess the situation, add each new type to the top-20 list, and start tracking it on a daily basis.
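Here is a minimal sketch of such a scanner in Python. Everything in it is an assumption made for illustration: the convention that error lines contain "ERROR" or "Exception", the idea of normalizing counts by units of work pulled from your traffic metrics, and the 20% default threshold. A real solution would persist the counts and publish them to your internal dashboard instead of just returning them.

    import re
    from collections import Counter

    def scan_log(log_path, patterns):
        """Count occurrences of each known scenario in one log file and
        collect error lines that match no known pattern (bullets 1 and 3)."""
        compiled = {name: re.compile(rx) for name, rx in patterns.items()}
        counts = Counter({name: 0 for name in patterns})
        unknown_errors = []
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                matched = False
                for name, rx in compiled.items():
                    if rx.search(line):
                        counts[name] += 1
                        matched = True
                # Assumed logging convention: error lines contain
                # "ERROR" or "Exception".
                if not matched and ("ERROR" in line or "Exception" in line):
                    unknown_errors.append(line.rstrip())
        return counts, unknown_errors

    def flag_regressions(today, yesterday, units_today, units_yesterday,
                         threshold_pct=20.0):
        """Compare error rates per unit of work between two scan windows
        and return the categories that grew by more than threshold_pct
        (bullet 2). A unit of work could be an order, a page view, etc."""
        flagged = []
        for name, count in today.items():
            rate_now = count / max(units_today, 1)
            rate_before = yesterday.get(name, 0) / max(units_yesterday, 1)
            if rate_before == 0:
                if rate_now > 0:
                    flagged.append(name)
            elif (rate_now - rate_before) / rate_before * 100 > threshold_pct:
                flagged.append(name)
        return flagged

A cron job could run scan_log hourly against a catalog like TOP_20_PATTERNS above, store the counts, and render them as a table on the dashboard; the unknown_errors list is what the team triages before promoting a new entry into the top-20.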
If you have all of this automated, then there is no manual work needed when you launch something to production. You will be able to tell if your new code is hurting the numbers in the existing top-20 categories, and you will also be able to tell if you have started introducing new types of errors that hurt the revenue of your company. Let's assume that your production deployment involves deploying to a smaller/secondary data center first and then to the rest of your data centers. This type of mechanism can help you decide whether to continue deploying to the rest of the data centers after deploying to that smaller one.
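With the counts above in hand, that go/no-go decision can be mechanical. A hedged sketch, assuming a 10% tolerance and that you already collect per-data-center counts and traffic volumes:

    def safe_to_continue_rollout(canary_counts, baseline_counts,
                                 canary_units, baseline_units,
                                 max_increase_pct=10.0):
        """Return True only if no top-20 category's error rate in the
        smaller/secondary data center exceeds the baseline rate by more
        than max_increase_pct."""
        for name, count in canary_counts.items():
            canary_rate = count / max(canary_units, 1)
            baseline_rate = baseline_counts.get(name, 0) / max(baseline_units, 1)
            if baseline_rate == 0:
                if canary_rate > 0:
                    return False  # brand-new error type: stop and triage
            elif (canary_rate - baseline_rate) / baseline_rate * 100 > max_increase_pct:
                return False
        return True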
These are all simple concepts. You can spend minimal effort building this yourself, or maybe decide to buy a solution. The important thing is to always take the “keep it simple” approach in decision making.

Conclusion:

Start by tracking the top-20 errors on a daily/hourly basis and use the percentage of change as the gauge for the success of the code you push to production environments. Detect newly introduced low-level engineering errors in production in time to gauge the success of your launch. Don’t over-design this! Keep it simple!
Almir Mustafic (almirsCorner.com)

#softwareengineering #softwaredevelopment #code #coding #programming #software #errors #ExceptionHandling #NOC #monitoring
