Showing posts with label Incident Response. Show all posts

Wednesday, November 20, 2024

Building AI Models for IT Operations

Sunil Mallya (@sunilmallya, Co-Founder & CTO @_FlipAI) talks about building AI models to help IT Operations teams simplify troubleshooting, incident response and pattern detection. 

SHOW: 874

SHOW TRANSCRIPT:
The Cloudcast #874 Transcript

SHOW VIDEO:
https://youtube.com/@TheCloudcastNET

CLOUD NEWS OF THE WEEK -
http://bit.ly/cloudcast-cnotw

NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST:
"CLOUDCAST BASICS"

SHOW NOTES:

Topic 1 - Welcome to the show.

Topic 2 - Let’s begin by talking about your broader view of where IT Operations is today for large companies. What are their biggest challenges, and where are their opportunities for big improvements?

Topic 3 - One of the biggest challenges for IT Operations is the sheer number of systems, tools, and data sources, and the difficulty of coordinating work (troubleshooting, incident response, pattern detection, etc.) across them. Is a modern approach emerging to make this better?

Topic 4 - Where does the potential of AI models come into play here? Is it more focused on being able to ask the system for a diagnosis, or can it start getting closer to taking independent action?

Topic 5 - How do you train an AI model with so many different pieces of customer data?

Topic 6 - Let’s talk about AI Agents and what they can begin to accomplish within an enterprise.

FEEDBACK?

Wednesday, February 26, 2020

DevOps and Incident Response Evolution

Chris Riley (@hoardinginfo, DevOps Advocate, @Splunk) talks about the state of DevOps, the evolution of Incident Response with Machine Learning, Service vs. Site Reliability, and using Incident Response to increase the quality of development.

SHOW: 439

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

SHOW NOTES:

Topic 1 - Welcome to the show. Tell everyone a little about yourself; you’ve been active in the DevOps space for quite some time.

Topic 2 - About a year ago we had your peer and good friend of the show, Josh Atwell, on to talk about the State of DevOps in 2019. What are your thoughts on the changes over the last 12 months, and where are we headed in 2020?

Topic 3 - One item in particular that has drawn my attention is your discussion of Incident Response and Machine Learning. Can you tell everyone a little bit about that and why you believe it will be valuable going forward?

Topic 4 - In a way, this feels like a transition into the next evolution of our model. First we had separate dev and ops teams that never talked, then we put them together, then every device and app started spitting out logs and alerts, and next thing you knew, we were drowning in data. The complexity of these systems has grown exponentially. Fair?

Topic 5 - You recently wrote a post on the VictorOps blog about SRE and the meaning of the “S”. You propose that, more and more, it should stand for Service Reliability Engineer rather than the more traditional Site Reliability Engineer, especially as we move toward a subscription-based model. Can you explain your thinking there?

Topic 6 - When I think of Incident Response, I think of production environments. As part of VictorOps, I’m sure you see a lot of use cases and have solved some pretty unique customer problems. How can this be applied outside of production, say for application testing or quality checks before code hits production? Is that a valid approach?

FEEDBACK?