Adding content to 102-resiliency

2 years ago · 7acb0250fc
parent ec95c7b7f5
commit 7acb0250fc
11 changed files with 71 additions and 9 deletions
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/101-high-availability/circuit-breaker.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/101-high-availability/circuit-breaker.md
@@ -1,5 +1,6 @@
 # Circuit Breaker
 Circuit Breaker in system design is a pattern that prevents an application from repeatedly attempting an operation that is likely to fail. Once the operation has failed a configured number of times, the breaker trips and further calls fail immediately instead of reaching the struggling dependency. This helps prevent cascading failures, gives the application a place to provide fallback behavior, and exposes a useful signal about system health. It is usually implemented as a state machine with closed, open, and half-open states, either by hand or through a library such as Hystrix (for Java).
 Learn more from the following links:
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/101-high-availability/index.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/101-high-availability/index.md
@@ -1,5 +1,6 @@
 # High availability
 High availability in system design refers to the ability of a system to continue operating even in the event of a failure or outage. This is often achieved through redundancy: multiple copies of the system run at the same time, and if one copy fails, the others take over. Common techniques are redundancy, load balancing, and failover. It can be measured using metrics such as Mean Time Between Failures (MTBF), Mean Time To Recovery (MTTR), and availability (the fraction of time the system is operational).
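To make the metrics concrete, here is a minimal sketch of how availability is commonly estimated from MTBF and MTTR. The formula `MTBF / (MTBF + MTTR)` is the usual steady-state approximation; the function names are illustrative and not part of the roadmap content.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=365 * 24):
    """Expected downtime per year for a given availability fraction."""
    return (1 - avail) * hours_per_year
```

For example, an MTBF of 999 hours and an MTTR of 1 hour give an availability of 0.999, which corresponds to roughly 8.76 hours of downtime per year.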
 Learn more from the following links:
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/bulkhead.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/bulkhead.md
@@ -1 +1,8 @@
-# Bulkhead
+# Bulkhead
 Bulkhead in system design refers to a technique for isolating different parts of a system so that a failure or overload in one part does not degrade the whole system. The term comes from the partitions, or bulkheads, that divide a ship's hull into watertight compartments. The pattern makes it possible to isolate critical parts of the system, prevent cascading failures, and give different types of requests their own resources. It can be implemented in several ways, such as dedicated thread pools or worker groups, and it is often combined with circuit breakers.
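As a rough illustration of the isolation idea, the sketch below caps the number of concurrent calls allowed into one compartment with a semaphore and rejects the rest immediately. The class and exception names are hypothetical, and a real system might use separate thread pools or processes instead.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        # Each compartment owns a fixed number of slots.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up every caller in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance means that one saturated dependency exhausts only its own slots.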
 Learn more from the following links:
 - [Bulkhead pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead)
 - [Get started with Bulkhead](https://dzone.com/articles/resilient-microservices-pattern-bulkhead-pattern)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/circuit-breaker.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/circuit-breaker.md
@@ -1 +1,8 @@
-# Circuit breaker
+# Circuit Breaker
 Circuit Breaker in system design is a pattern that prevents an application from repeatedly attempting an operation that is likely to fail. Once the operation has failed a configured number of times, the breaker trips and further calls fail immediately instead of reaching the struggling dependency. This helps prevent cascading failures, gives the application a place to provide fallback behavior, and exposes a useful signal about system health. It is usually implemented as a state machine with closed, open, and half-open states, either by hand or through a library such as Hystrix (for Java).
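The state-machine approach can be sketched in a few lines. This is a minimal, single-threaded illustration, not the Hystrix implementation; the class name, the default threshold, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold; a failed half-open trial
            # call also lands here and restarts the open period.
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call in `breaker.call(...)` and treat `CircuitOpenError` as the cue to use a fallback.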
 Learn more from the following links:
 - [Circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern)
 - [Overview of Circuit Breaker](https://medium.com/geekculture/design-patterns-for-microservices-circuit-breaker-pattern-276249ffab33)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/compensating-transaction.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/compensating-transaction.md
@@ -1 +1,8 @@
-# Compensating transaction
+# Compensating Transaction
 A Compensating Transaction in system design refers to a mechanism for reversing or undoing the effects of a previously executed operation. It is used to return the system to a consistent state when a later step of a multi-step operation fails. It is most useful where the steps cannot be wrapped in a single ACID transaction, for example when they span several services or data stores; within a single database, similar effects are achieved with undo logs and savepoints.
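The undo idea can be illustrated with a small sketch: each step is paired with a compensating action, and when a step fails, the compensations for the steps that already completed run in reverse order. The function name and the `(action, compensation)` pairing are assumptions chosen for the example.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            # Only steps that finished successfully need undoing later.
            completed.append(compensate)
    except Exception:
        # Undo in reverse order, most recent step first.
        for compensate in reversed(completed):
            compensate()
        raise
```

In practice the compensations themselves can fail, so they are usually written to be idempotent and safe to retry.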
 Learn more from the following resources:
 - [Compensating Transaction pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction)
 - [Intro to Compensating Transaction](https://en.wikipedia.org/wiki/Compensating_transaction)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/health-endpoint-monitoring.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/health-endpoint-monitoring.md
@@ -1 +1,8 @@
-# Health endpoint monitoring
+# Health Endpoint Monitoring
 Health Endpoint Monitoring in system design refers to a technique for monitoring the health of a system by periodically sending requests to a dedicated endpoint, called a "health endpoint". The endpoint returns a response indicating the current status of the system, such as whether it is running properly or has any issues. This makes it possible to monitor the overall health of the system, gain insight into its performance, and automate the monitoring process. It can be implemented in several ways, such as periodic requests (polling) and event-based monitoring.
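A minimal sketch of what a health endpoint might compute is shown below. It aggregates a set of named checks into an HTTP status code and a JSON body; the function name, the check names, and the choice of 503 for an unhealthy result are assumptions for the example, and the web framework that would expose this at a path such as `/health` is left out.

```python
import json


def health_report(checks):
    """Run each check and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises counts as unhealthy, not as a crash.
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = json.dumps({
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
    })
    return (200 if healthy else 503), body
```

A monitoring tool can then poll the endpoint on a schedule and alert on any non-200 response.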
 To learn more visit the following links:
 - [Health Endpoint Monitoring pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/health-endpoint-monitoring)
 - [Explaining the health endpoint monitoring pattern](https://www.oreilly.com/library/view/java-ee-8/9781788830621/5012c01e-90ca-4809-a210-d3736574f5b3.xhtml)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/index.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/index.md
@@ -1 +1,7 @@
-# Resiliency
+# Resilience
 Resilience in system design refers to the ability of a system to withstand and recover from disruptions, failures, or unexpected conditions. It means the system can continue to function and provide service even when faced with stressors such as high traffic, failures, or unexpected changes. Resilience can be achieved by designing the system to be redundant, fault-tolerant, and scalable, and by adding automatic recovery, monitoring, and alerting mechanisms. It can be measured by Recovery Time Objective (RTO), Recovery Point Objective (RPO), Mean Time To Failure (MTTF), and Mean Time To Recovery (MTTR).
 Learn more from the following links:
 - [System Resilience: What Exactly is it?](https://insights.sei.cmu.edu/blog/system-resilience-what-exactly-is-it/)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/leader-election.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/leader-election.md
@@ -1 +1,8 @@
-# Leader election
+# Leader Election
 Leader Election in system design is a pattern used to elect a leader among a group of distributed nodes. The leader is responsible for coordinating the activities of the other nodes and making decisions on behalf of the group. This matters in distributed systems because a single point of coordination and decision-making reduces the risk of conflicting actions or duplicate work, and electing a new leader when the current one fails provides fault tolerance. Leader election can be built on consensus algorithms and protocols such as Raft, Paxos, and Zab.
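The algorithms named above are too involved for a short example, so the sketch below shows a much simpler lease-based approach instead: whichever node holds an unexpired lease is the leader, and it must keep renewing the lease to stay leader. The `LeaseStore` class is a hypothetical in-memory stand-in for a shared store that offers an atomic compare-and-set; it is not Raft, Paxos, or Zab.

```python
class LeaseStore:
    """In-memory stand-in for a shared store with atomic updates."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, now, ttl):
        """Acquire or renew the lease; return True if node_id is leader."""
        lease_free = self.holder is None or now >= self.expires_at
        if lease_free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + ttl
            return True
        return False
```

If the leader crashes and stops renewing, the lease expires and the next node to call `try_acquire` takes over.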
 To learn more, visit the following links:
 - [Overview of Leader Election](https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/)
 - [What is Leader Election in system design?](https://www.enjoyalgorithms.com/blog/leader-election-system-design)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/queue-based-load-leveling.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/queue-based-load-leveling.md
@@ -1 +1,7 @@
-# Queue based load leveling
+# Queue-Based Load Leveling
 Queue-based load leveling in system design refers to a technique for managing the workload of a system by using a queue to buffer incoming requests and process them at a steady pace. With a queue in between, the system can absorb bursts of incoming requests without being overwhelmed, and work left in the queue keeps it busy during quieter periods. The pattern smooths out bursts of incoming requests, reduces idle periods, and provides a place to prioritize and monitor requests. It can be implemented in several ways, such as an in-memory queue or a persistent queue.
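A minimal in-memory version of the idea might look like the sketch below: producers put work on a bounded queue as fast as it arrives, and a single worker drains it at its own pace. The `start_worker` helper and the `None` sentinel used for shutdown are assumptions for the example; a production system would more likely use a persistent message queue.

```python
import queue
import threading


def start_worker(tasks, handle, results):
    """Start a thread that drains `tasks` one item at a time."""
    def loop():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: stop the worker
                tasks.task_done()
                break
            results.append(handle(item))
            tasks.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```

Creating the queue with `queue.Queue(maxsize=...)` bounds the buffer, so producers block on `put` once it is full instead of overwhelming the worker.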
 To learn more visit the following links:
 - [Queue-Based Load Leveling pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/queue-based-load-leveling)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/retry.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/retry.md
@@ -1 +1,8 @@
-# Retry
+# Retry
 Retry in system design refers to the process of automatically re-executing a failed operation in the hopes of getting a successful outcome. Retries are used to handle transient failures such as network errors, temporary unavailability of a service, or other issues that may be resolved quickly. Retries can be an effective way of dealing with these types of failures, as they can help to ensure that the system continues to function, even in the face of temporary disruptions.
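One common way to implement retries is exponential backoff with jitter, sketched below. The function name, the default exception types, and the injectable `sleep` parameter are assumptions made for the example rather than a prescribed API.

```python
import random
import time


def retry(func, attempts=3, base_delay=0.1,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out retries
            # from many clients so they do not arrive in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, which avoids retrying failures that are not transient.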
 Learn more from the following resources:
 - [Introducing Retry](https://engineering.grab.com/designing-resilient-systems-part-2)
 - [Retry pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/retry)
--- a/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/scheduler-agent-supervisor.md
+++ b/src/roadmaps/system-design/content/118-cloud-design-patterns/103-reliability-patterns/102-resiliency/scheduler-agent-supervisor.md
@@ -1 +1,7 @@
-# Scheduler agent supervisor
+# Scheduler Agent Supervisor
 Scheduler Agent Supervisor in system design is a pattern for coordinating a set of distributed actions as a single operation. The Scheduler arranges the steps of the operation and records their state, Agents carry out the calls to remote services or resources, and the Supervisor monitors the steps and arranges recovery when errors or failures occur. This pattern can be used to build robust and fault-tolerant systems by ensuring that tasks are executed as intended and that any errors or failures are handled appropriately.
 Learn more from the following links:
 - [Scheduler Agent Supervisor pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/scheduler-agent-supervisor)