MS SRE Workshop Notes Taken
Decide what level of each flow counts as healthy, e.g. storage account: 5 ms is healthy.
Analyse the failure, then implement mitigations.
Example experiment: put pods into CrashLoop for 10 minutes. What will happen? Response time will increase.
Getting to the public website is critical. Understand what the first step is.
E.g. for a website, large customers run web and backend traffic in different clusters; small footprints can start from one cluster.
Other items mentioned: APIM for AI, better observability, AKS egress node pool, NSG login for different apps cpus.
Non-functional requirements: how do I know what is good and what is bad? E.g. response time. Architects capture the non-functional requirements, together with product owners and the infra and platform teams.
Design the health level at the flow level.
John runs some chaos experiments. Reference: Chaos Mesh Overview | Chaos Mesh.
Install the Chaos Studio prerequisites, then create the Chaos Studio target and experiments.
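A minimal sketch of the "health at the flow level" idea from these notes, assuming health is judged by comparing the median response time of a flow against a per-flow threshold. Only the 5 ms storage account figure comes from the notes; the names FlowTarget and flow_health, the choice of median, and all sample values are made up for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class FlowTarget:
    """Hypothetical per-flow health target (a non-functional requirement)."""
    name: str
    healthy_ms: float  # response time at or below this counts as healthy


def flow_health(target: FlowTarget, samples_ms: list[float]) -> str:
    """Classify a flow from its response-time samples.

    Assumption: the median of the samples is compared with the threshold.
    A real setup might use a percentile or an error rate instead.
    """
    if not samples_ms:
        return "unknown"  # no data, so no verdict
    return "healthy" if median(samples_ms) <= target.healthy_ms else "degraded"


# The 5 ms storage account threshold is the example given in the notes.
storage = FlowTarget("storage-account", healthy_ms=5.0)

# Made-up samples: a normal baseline, and a window during a chaos
# experiment (e.g. pods in CrashLoop) where response time increases.
baseline = [3.1, 4.0, 4.8, 3.9, 4.4]
during_experiment = [4.2, 9.5, 18.0, 25.3, 12.7]

print(flow_health(storage, baseline))           # healthy
print(flow_health(storage, during_experiment))  # degraded
```

Comparing the verdict before and during an experiment is one way to answer "how do I know the good and bad" for a single flow.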