- rough transactional flow diagram
- example application under test (AUT)
  - implemented for training and experimentation purposes
- hypothetical scenario in production
  - Offers API slows down, sometimes taking up to 20 sec
  - Offers API owned by a different team, no time to remediate
  - if this persists for more than a few min, Travel app remains unresponsive
    - processing backlog of requests
  - support team engaged to recycle
- make Travel application more resilient by adding a timeout
  - business agrees to 5 sec timeout
  - if no response from Offers, Travel API will return an error, but user can proceed
-------
- experiment designed to recreate conditions seen in production
  - injecting a 20 second response latency into the GetOffers method in Offers API
  - running a load test on the AUT
- Our hypothesis is that response time for our consuming components (for example Travel API) will be at least 20 seconds and will probably degrade further due to thread starvation.
-------
- observe outcome of experiment with Dynatrace dashboard
  - 4 SRE Golden Signals (LEST): latency, errors, saturation, throughput
  - start and end of the injection indicated by red lines
- before first red line we are at steady state
  - throughput is consistent
  - response time and saturation are low
  - virtually no failures
- at first red line we inject 20 sec delay into Offers
  - latency increased up to 4 min
  - 20-30% failure rate
  - threads fully saturated
  - response times now include time waiting for an available thread + Offers delay
  - as observed in prod
- after 2nd red line
  - AUT did not recover
  - threads remained maxed out
  - recycle required to restore service
- successfully recreated the conditions seen in production in our test lab
-------
- apply mitigation and retry
  - modified the call from the Travel API to the GetOffers method of the Offers API
  - request timeout set to 5 seconds
  - if no Offers response, an error will be returned to the business logic layer within the Travel API
- Our hypothesis is that user response time will be at least 5 seconds but may also degrade further due to thread starvation.
-------
- back in our Dynatrace dashboard we see differences
- as before: generate production-like load + inject 20 seconds of latency into Offers API
  - Offers API latency is at 20 seconds, but Travel API latency is flat at 5 seconds
  - error chart is showing 100% failures, but this is expected: every Offers call exceeds the 5 sec timeout
  - 80-90% thread saturation
  - dip in throughput
- largest item of note here: AUT was able to recover
- additional tuning and experimentation can be done
  - follow-up experiment: inject 4.5 sec delay, so Offers responds just under the timeout value
-------
- So how did we do this?
- several capabilities and tooling developed over the past few years
- Traffic: Service Virtualization for Performance platform
  - developed in-house at TD
  - virtualize components within our AUT
  - inject behaviours in between 2 components
    - response time delays
    - erroneous responses
- PACE: the Performance Agent for Chaos Engineering
  - developed in-house at TD
  - inject failures at application, network and infrastructure level
- Perfhawk: in-house developed load testing stack
  - generate load on the AUT
- Dynatrace
  - observing the impacts and cascade of effects
- working on practices for utilizing other tools
  - Chaos Monkey for Spring Boot
    - dependency that can be included in the application
    - inject chaos from within component
  - Chaos Toolkit
    - provides developers the ability to define chaos experiments in YAML
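
A minimal sketch of the 5 second timeout mitigation described in these notes, for illustration only. The notes do not show the actual Travel API code, so everything here is an assumption: the function names `get_offers_with_timeout` and `simulated_offers`, the pool size, and the shape of the returned result are all hypothetical. The idea is that the caller stops waiting after the timeout and gets an error result with an empty offers list, so the user can still proceed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool for outbound Offers calls; the size is an assumption.
_offers_pool = ThreadPoolExecutor(max_workers=8)


def simulated_offers(delay_s):
    """Stand-in for the Offers API: responds after delay_s seconds."""
    time.sleep(delay_s)
    return ["10% off hotels"]


def get_offers_with_timeout(fetch_offers, timeout_s=5.0):
    """Call fetch_offers, but stop waiting after timeout_s seconds.

    Returns a dict with an 'offers' list and an 'error' marker so the
    business logic can continue without offers when the call is too slow.
    """
    future = _offers_pool.submit(fetch_offers)
    try:
        return {"offers": future.result(timeout=timeout_s), "error": None}
    except FutureTimeout:
        # The caller is released here. cancel() only helps if the task has
        # not started yet; a running call keeps its worker thread busy.
        future.cancel()
        return {"offers": [], "error": "offers_timeout"}
    except Exception as exc:
        # Any other failure from the Offers call is also reported as an error.
        return {"offers": [], "error": f"offers_failed: {exc}"}
```

For example, `get_offers_with_timeout(lambda: simulated_offers(20), timeout_s=5.0)` would return the `offers_timeout` result after about 5 seconds. Note that in this sketch the worker thread stays occupied until the slow call finishes; a timeout set on the HTTP client itself would release the thread as well.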