Case Study: Reducing toil of resolving issue in Node JS - Introduction



Category: Programming
Source: dev.to

Hi there! Welcome to the first of a four-part journey where I share the transformation we achieved in addressing and resolving issues within our Node.js projects at Kargo. This adventure led us to dramatically reduce the time and effort required to tackle these challenges head-on.

In this opening segment, we delve into the core problems that initially confronted us and outline the innovative solutions we proposed.

Background

Some time ago, the Tech team at Kargo embarked on an interesting project: building a new Node.js backend service to support a feature that required specific integration with a third-party platform. This was a unique challenge for Kargo's engineers, because a Node.js backend was a new addition to our technology stack, which was usually based on Elixir or Go. Despite the initial hardships stemming from our unfamiliarity with this stack, we persevered and delivered the required functionality. The project not only met but exceeded our success metrics, leading to a well-deserved celebration among the team.

However, as time progressed and the project's scope expanded, operational issues and bugs became more frequent, and the time required to resolve them grew significantly. On average, the team needed approximately four hours to address each problem, with around three issues arising per week. This not only cost considerable time but also began to hurt team productivity. More importantly, the user experience and the product's overall success suffered as frequent issues and extended downtime became more common.

Frustrated Engineer

Diagnosis

After some investigation, we found that the Node.js project lacked the observability and debugging tools that we have in Kargo's usual backend services (Elixir/Go). This resulted in increased effort across the maintenance lifecycle (Detect, Diagnose, Debug). To illustrate, here's a hypothetical example of how a particular issue was resolved:

  1. Issue detection: The team receives a report from a user that a feature is not working as expected. An engineer is assigned to investigate. There is no dashboard for the project, so the engineer usually needs to read the relevant code for the impacted feature, then check the log files, run custom database queries, and take other ad hoc actions to confirm the issue.
  2. Issue diagnosis: To determine the most likely cause of the issue, the engineer needs to be familiar with the codebase involved. Occasionally they are lucky and a log entry is explicit enough to point to the cause. Most of the time, though, they need to add more logging to the codebase and redeploy it to get more information.
  3. Issue debugging: Once the engineer has some idea of the cause, usually without being 100% certain, they implement a probable fix and redeploy the code through CI/CD. If the issue persists, they repeat the process. It usually takes 3 to 4 attempts before the issue is resolved.

Let's fix this

Solution

We brainstormed how to improve the situation, also looking at which tools and techniques from our Elixir/Go backend services might help. We came up with the following solutions, which we think will help address the problems in each stage (Detect, Diagnose, Debug):

  1. Monitoring Dashboard + Canonical Log: By implementing an action-oriented dashboard, we've created a quick, centralized access point for vital aspects of the project, significantly speeding up our ability to detect and validate issues. This dashboard, built on a canonical log record approach, offers the flexibility needed to perform custom advanced analyses with ease. Our aim with this system is to verify common issues within 5 minutes of engineering effort. For example, we reduced the time to check for a frequent session connection issue from 20 minutes to just 1 minute.
  2. On-Demand Diagnostic Logging: We developed a feature for dynamic, detailed logging of operations, enabling us to drill down into issues as they happen. This was particularly effective in a recent incident where the typical logs were inconclusive. By enabling on-demand logging, we pinpointed the problem within minutes, a process that could otherwise have taken multiple redeployments.
  3. Developer Code Execution: Just as Elixir's remote shell capability helps Kargo's engineers rapidly diagnose and debug issues in the backend, we would like to have a similar capability in the Node.js project.
  4. Operational Handbook: We compiled a comprehensive handbook detailing common issues and their solutions, which serves as a first reference point for our engineers. One memorable success story involved a team member who was relatively new to the project and was able to resolve an issue independently by following the handbook while the usual maintainer was taking a day off. This not only makes issue resolution more efficient, but also gives engineers assurance.
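To make the canonical log idea in item 1 more concrete, here is a minimal sketch of one way to emit a single structured log line per handled request. It is an illustration under stated assumptions, not Kargo's actual implementation: the helper name `startCanonicalLog`, the field names (`event`, `userId`, `sessionConnected`, `outcome`, `durationMs`), and the injectable `now`/`write` options are all hypothetical.

```javascript
// Sketch only: helper and field names are assumptions for illustration,
// not the schema used in the project described above.
function startCanonicalLog(event, { now = Date.now, write = console.log } = {}) {
  const startedAt = now();
  const fields = { event };

  return {
    // Attach any key/value learned while the request is being handled.
    set(key, value) {
      fields[key] = value;
      return this; // allow chaining: log.set(...).set(...)
    },
    // Emit exactly one JSON line that summarises the whole request.
    finish(outcome) {
      const record = { ...fields, outcome, durationMs: now() - startedAt };
      write(JSON.stringify(record));
      return record;
    },
  };
}

// Usage: one line per request, carrying everything needed to query it later.
const log = startCanonicalLog('message.received');
log.set('userId', 'user-123').set('sessionConnected', false);
log.finish('error');
```

Because every request produces one self-describing record, questions such as "how many session connections failed in the last hour, and for which users?" become a filter over log lines rather than a code-reading exercise.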

Screenshot of Grafana Dashboard
An action-oriented dashboard helps engineers quickly understand and troubleshoot an issue and decide on the fix.
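As a rough illustration of how a dashboard panel could be derived from structured log records, the sketch below summarises JSON log lines for one event type. The record shape (`event`, `outcome`, `userId`) and the function name `summarizeEvent` are assumptions made for this example. In practice the aggregation would more likely run as a query in the logging or metrics backend than in application code.

```javascript
// Sketch only: assumes each log line is a JSON object with hypothetical
// `event`, `outcome` and `userId` fields.
function summarizeEvent(lines, event) {
  let total = 0;
  let errors = 0;
  const affectedUsers = new Set();

  for (const line of lines) {
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (record.event !== event) continue;

    total += 1;
    if (record.outcome === 'error') {
      errors += 1;
      if (record.userId) affectedUsers.add(record.userId);
    }
  }

  return {
    total,
    errors,
    errorRate: total === 0 ? 0 : errors / total,
    affectedUsers: [...affectedUsers],
  };
}

// Usage with a few made-up log lines.
const sampleLines = [
  '{"event":"session.connect","outcome":"ok","userId":"u1"}',
  '{"event":"session.connect","outcome":"error","userId":"u2"}',
  '{"event":"message.received","outcome":"ok","userId":"u1"}',
];
console.log(summarizeEvent(sampleLines, 'session.connect'));
```

A summary like this answers "is the issue real, and how wide is it?" in one step, which is the kind of check the dashboard is meant to make fast.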

Ideal Outcome

Our target is to reduce the time needed to resolve an issue from 4 hours to 30 minutes, thereby reducing the impact of issues on user experience and product success and helping the team focus on delivering more value to users.

An ideal scenario of how an issue is resolved in the Node.js project looks like this:
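To put the target in perspective, here is a small back-of-the-envelope calculation using the figures quoted in this article (about three issues per week, four hours each, and a 30-minute target). It assumes the issue rate stays at three per week, which is an assumption for illustration rather than a measured result.

```javascript
// Weekly engineering time spent on issue resolution, in hours.
function weeklyToilHours(issuesPerWeek, hoursPerIssue) {
  return issuesPerWeek * hoursPerIssue;
}

const before = weeklyToilHours(3, 4);   // 3 issues x 4 hours   = 12 hours/week
const target = weeklyToilHours(3, 0.5); // 3 issues x 0.5 hours = 1.5 hours/week

console.log({ before, target, saved: before - target });
// prints { before: 12, target: 1.5, saved: 10.5 }
```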

  1. Issue detection
    For the 50% of issues that are common, the issue would be automatically detected and alerted on from the metric dashboard. The engineer could easily validate that the issue exists by looking at the metric dashboard and resolve it by following the operational handbook.
    For the remaining 50% of issues, the engineer could run an analysis based on the dashboard plus extended queries over the canonical log to confirm the issue and its scope. If the issue is persistent, the engineer could easily create a new metric dashboard for it from the canonical log data.

  2. Issue diagnosis
    When there's an unknown issue, the engineer could turn on diagnostic logging for the users they are interested in (based on user ID) and get the relevant debug logs from the WhatsApp bot server to form a reasonable guess about where the issue is. The engineer could then use the developer remote execution capability to validate whether the guess is correct.

  3. Issue debugging
    Once the root cause of the issue is identified, the engineer could implement a fix and test it first using developer remote execution. Once the fix is confirmed, the engineer could deploy it through CI/CD.
    Because the root cause is now well identified, the engineer could resolve the issue in just one try.
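The diagnosis step above relies on switching detailed logging on for specific users without a redeploy. A minimal in-memory sketch of that idea follows. The factory name `createDiagnosticLogger`, its methods, and the time-to-live parameter are hypothetical, and a service running on several instances would presumably need to keep the toggle in a shared store instead of process memory.

```javascript
// Sketch only: names and the TTL behaviour are assumptions for illustration.
function createDiagnosticLogger({ now = Date.now, write = console.log } = {}) {
  const enabledUntil = new Map(); // userId -> expiry timestamp in ms

  // Turn diagnostic logging on for one user for a limited time.
  function enable(userId, ttlMs) {
    enabledUntil.set(userId, now() + ttlMs);
  }

  function disable(userId) {
    enabledUntil.delete(userId);
  }

  function isEnabled(userId) {
    const expiry = enabledUntil.get(userId);
    if (expiry === undefined) return false;
    if (now() >= expiry) {
      enabledUntil.delete(userId); // expired: clean up so logging stops by itself
      return false;
    }
    return true;
  }

  // Write a debug line only when the user is currently being diagnosed.
  function debug(userId, message, details = {}) {
    if (!isEnabled(userId)) return false;
    write(JSON.stringify({ level: 'debug', userId, message, ...details }));
    return true;
  }

  return { enable, disable, isEnabled, debug };
}

// Usage: enable verbose logs for one user for 15 minutes.
const diagnostics = createDiagnosticLogger();
diagnostics.enable('user-123', 15 * 60 * 1000);
diagnostics.debug('user-123', 'session lookup', { found: false }); // written
diagnostics.debug('user-456', 'session lookup', { found: true });  // skipped
```

The expiry is a design choice in this sketch: it keeps verbose logging from staying on indefinitely if someone forgets to switch it off.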

Conclusion

That's the background of the problem we were facing. The next write-ups will focus on how we implemented each of the solutions we proposed.

What do you do to make sure your team can resolve issues quickly? Share your experience in the comments below!

Stay tuned for the next part.
