At Illuminate, Genesys Cloud principal architect Kal Patel shared how they use Sumo Logic, how they manage it at scale, and some of the lessons they’ve learned in the last year.
Genesys Cloud is a contact center offering, a complex deployment made up of around 370 microservices. They are NOC-free, and everything is automated and immutable. Their environments are currently deployed across about 10 core accounts in 9 AWS regions, alongside some that are still in R&D and testing. They are PCI and HIPAA compliant.
Their teams take on an entrepreneurial approach, meaning “if you build it, you're responsible for not only just building it, but running it, supporting it, securing it and owning it all the way,” as Kal shared. This level of ownership across their teams helps ensure conscientiousness in identifying the right tools and microservices needed to address problems that they’re trying to solve.
How Genesys uses Sumo Logic
Genesys uses Sumo Logic to stay on top of all of their deployments. Sumo Logic is used across multiple groups in the company, with each group deriving value based on their needs from the same set of data ingested into Sumo. At Genesys, whoever has access to Sumo Logic can query across any data in their account.
Their development team uses Sumo Logic in training new hires--bringing them up to speed on the pull requests, code reviews, and logs. Kal shared that having new hires drop in to help debug and triage issues in the Dev and Test environments helps with training as well.
Every service team has their own set of dashboards they use to monitor their services, something really important for a NOC-free operation. What has proven useful are the PagerDuty alerts that provide context via references to the dashboards and the playbooks needed to resolve issues swiftly.
They try to make sure the dashboards they use in Prod are exactly the same in the Dev and Test environments, and this applies to alerting as well.
Genesys’ testing team creates their own custom dashboards from deployment data, test results, and actual application failure rates. Having all this data in one place helps the test team debug and triage everything together. They create and configure dashboards to track progress on their activities, and alerts to monitor and identify what needs to be done.
For performance and scale monitoring, since they are trying their best to build dashboards in Dev that would get promoted all the way up, the Dev and Test teams are looking at the same dashboards with the same data and metrics.
Once dashboards are promoted to production, the Ops and Problem Management team is able to use those dashboards to monitor, debug and triage blast radiuses for any incident. The Ops team also uses Sumo Logic to do capacity planning, especially since capacity planning in the cloud is different from that in an on-premises environment.
The Genesys team uses dashboards to monitor service usage--from there they can have capacity conversations with vendors.
Kal shared an example of how they use Sumo Logic dashboards for infrastructure monitoring by showing their AWS CloudTrail dashboard. They can internally keep an eye on what kind of errors they’re running into, as well as detailed context behind the errors, while also having good information that helps with capacity planning, if they’re going above thresholds. From there, they can reach out to AWS and ask why they’re being rate limited or throttled, and they can discuss with the AWS team if there’s something that needs to be adjusted on how they’re doing things in a certain environment.
As another example, Kal shared that the load testing dashboard is used by multiple teams internally to communicate when there's an incident. According to Kal, because all teams are looking at the same dashboard, there’s very open communication between everyone on exactly what’s going on, what the timelines are, and so on. This is the case for both the Load Testing dashboard and the CloudTrail dashboard, and many other dashboards being used by teams at Genesys.
This approach has allowed them to hold blameless postmortems that anyone can attend where they do thorough root cause analysis (RCA) reviews and Sumo Logic log walkthroughs. When needed, they’re also able to provide pointers to the logs via a shared link or an actual shared code from Sumo for those who want to follow along during the call or want to learn more about it after the call.
The Security team uses Sumo Logic for monitoring compliance and triage. Here are some of the common security monitoring metrics Kal shared:
Publicly exposed security groups
Services going out of compliance
Externally facing instances
The Genesys security team doesn’t patch any of their boxes--they operate by a rule that every service should be deployed at least once a month. Every redeploy runs on a new AMI, so updates are picked up through the new AMI instead of by patching in between.
To ensure the enforcement of this rule, they set up a dashboard that tracks which service or group hasn’t deployed lately, as well as if there are services that have been running for more than 30 days. Kal shared that this approach to dashboarding means it’s good when there’s nothing in it. If you see something pop up, that’s when you take action. This approach has allowed them to stay focused and know what action exactly to take.
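The check behind such a dashboard can be sketched in a few lines. This is only an illustration of the 30-day rule described above, not Genesys's actual query: the function name, the service names, and the dates are all hypothetical, and in practice the last-deploy timestamps would come from deployment logs.

```python
from datetime import datetime, timedelta

# The "deploy at least once a month" rule, expressed as a maximum age.
MAX_AGE = timedelta(days=30)


def stale_services(last_deployed, now, max_age=MAX_AGE):
    """Return names of services whose last deploy is older than max_age,
    oldest first. An empty list means nothing needs attention."""
    stale = [(name, ts) for name, ts in last_deployed.items() if now - ts > max_age]
    return [name for name, _ in sorted(stale, key=lambda item: item[1])]


# Hypothetical data for illustration only.
now = datetime(2021, 6, 30)
last_deployed = {
    "routing": datetime(2021, 6, 20),   # 10 days old: fine
    "billing": datetime(2021, 5, 1),    # 60 days old: stale
    "presence": datetime(2021, 5, 25),  # 36 days old: stale
}
print(stale_services(last_deployed, now))  # ['billing', 'presence']
```

An empty result corresponds to the "good when there's nothing in it" dashboard; anything listed is a prompt to redeploy.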
The Support team at Genesys has very similar use cases to the Operations and Security teams.
An observation Kal shared is that not all support engineers are as comfortable writing complex queries for Sumo Logic. This is where templatized queries and parameters in Sumo Logic have been a big help as they enable quicker turnaround times and have promoted regular collaboration between teams. When support engineers have questions, they can be very specific by referencing templates.
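To illustrate the idea of a parameterized query, here is a small sketch using Python's standard `string.Template` as a stand-in. It does not use Sumo Logic's own template feature, and the template name, placeholder names, and parameter values are assumptions made for the example.

```python
from string import Template

# Hypothetical saved template; the placeholder names are illustrative only.
ERRORS_BY_HOST = Template(
    '_sourceCategory=$category "$keyword" | count by _sourceHost'
)


def render_query(template, **params):
    """Fill in every placeholder; raises KeyError if a parameter is missing."""
    return template.substitute(**params)


query = render_query(ERRORS_BY_HOST, category="prod/routing", keyword="timeout")
print(query)
```

A support engineer only supplies the parameters, and a question about results can point at the template by name rather than at a hand-written query.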
Group effort: How Genesys manages Sumo Logic at scale
The Genesys team has nine instances generating around 275 terabytes of logs per month and no dedicated Sumo Logic management team. Sumo Logic management is a shared responsibility divided into two parts: one centrally managed, the other managed by service teams. Everybody who has access to the Sumo Logic account can query logs, create alerts, and build dashboards.
Genesys has set three ground rules to guide their management effort.
Infrastructure as code
All changes must be idempotent
Trust but verify
The first two rules are standard, as Kal said, but he went deeper into how they have operationalized Trust but verify in their teams.
Trust but verify is crucial to them, especially because they need to stay HIPAA and PCI compliant. Because there are many instances running and everything is automated, he shared that there can be times when they apply a configuration and just forget about it. Deployments happen in a click of a button. In this situation, there may be times when there’s a silent failure and a configuration would be missed. They have set up a great system, and this ground rule is part of it--the state of all accounts must always be reviewed and consistency ensured.
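A minimal sketch of how idempotent changes and "trust but verify" fit together, assuming each account's configuration can be read back as a simple key-value mapping. The account names and setting keys are hypothetical, and a real check would read the state from the platform rather than from in-memory dictionaries.

```python
def apply_config(account, desired):
    """Idempotently apply desired settings to one account.
    Returns the keys that actually changed; a second call returns []."""
    changed = [key for key, value in desired.items() if account.get(key) != value]
    for key in changed:
        account[key] = desired[key]
    return changed


def find_drift(accounts, desired):
    """Verify step: map each account name to the keys that differ from desired.
    Accounts that match are omitted, so an empty dict means all are consistent."""
    drift = {}
    for name, config in accounts.items():
        diff = sorted(key for key, value in desired.items() if config.get(key) != value)
        if diff:
            drift[name] = diff
    return drift


# Hypothetical settings and accounts for illustration only.
desired = {"retention_days": 90, "audit_enabled": True}
accounts = {
    "us-east": {"retention_days": 90, "audit_enabled": True},
    "eu-west": {"retention_days": 30},  # a silently missed configuration
}

print(find_drift(accounts, desired))  # {'eu-west': ['audit_enabled', 'retention_days']}
apply_config(accounts["eu-west"], desired)
print(find_drift(accounts, desired))  # {}
```

Because applying is idempotent, re-running it against every account is safe, and the separate verification pass is what catches a deployment that failed silently.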
Two approaches to management: Prevention and deflection
Genesys takes two approaches to managing Sumo Logic: Prevention and deflection.
Everything is automated and they build a lot of abstractions. Their internal teams use their internal tools instead of going to the Sumo Logic platform. This helps direct users to exactly what they need to do.
Proactively engage users. They can see dashboard queries that tell them exactly what users are trying to do. Kal shared that when they see users encountering failures or errors, they can quickly go in and proactively reach out to address them.
Empower users to troubleshoot their issues. They provide checklists and reference implementations so users can quickly narrow down issues as they arise and ask for help when needed.
Build community. Genesys has established a chat room and guild meetings so users can help each other. They have also instituted more “lunch and learn” office hours where they pick a specific topic and dig into it, and where Sumo experts come in for Q&A so that users can ask them questions directly.
Build feedback loops. Kal stressed the importance of building feedback loops not just with users, but also the account teams. Their internal chat room and guild meetings help further facilitate these feedback loops.
Genesys uses three things to monitor and optimize costs.
Ingest budget - Notifies the team when somebody is pushing too much data
Data tiers - They are in constant contact with Sumo to ensure they reduce and manage cost accordingly.
Log Ingest dashboard - They use it to monitor and optimize cost by making sure they are staying within the upper and the lower bounds. They also use prediction functions within Sumo Logic to ensure they’re trending in the right direction.
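The bounds check can be illustrated with a simple linear projection. This is a stand-in for the idea only, not Sumo Logic's prediction functions, and the daily volumes and bounds below are made-up numbers.

```python
def project_month_end(daily_gb, days_in_month):
    """Linear projection: average daily ingest so far times days in the month."""
    if not daily_gb:
        raise ValueError("need at least one day of ingest data")
    return sum(daily_gb) / len(daily_gb) * days_in_month


def budget_status(projected_gb, lower_gb, upper_gb):
    """Classify a projection against the lower and upper bounds."""
    if projected_gb > upper_gb:
        return "over"
    if projected_gb < lower_gb:
        return "under"
    return "within"


# Hypothetical numbers for illustration only.
daily_gb = [100.0, 110.0, 120.0]
projected = project_month_end(daily_gb, days_in_month=30)
print(projected)                                                  # 3300.0
print(budget_status(projected, lower_gb=2500.0, upper_gb=3000.0))  # over
```

Checking the lower bound as well as the upper one matters: ingest that drops far below the expected range can indicate that a source has stopped sending logs.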
Lessons Using Sumo Logic
To conclude his presentation, Kal shared the best practices the Genesys Cloud team has formed in the past year with Sumo Logic.
Encourage collaboration among users and with the account team when possible
Constantly review configuration and access patterns
Help teams find a balance between automation and feature requests
Naming conventions are important
Plan for success: things might not work the same at scale
Kal Patel is a principal architect for Genesys Cloud where he runs DevOps and SRE.