ADR-84: Crashbot

More details about this document
Latest published version:
https://adr.decentraland.org/adr/ADR-84
Authors:
manumena

Context and Problem Statement

As the community, the roadmap, and the product-engineering teams have grown, the amount of entropy introduced into the system has increased exponentially, and this affects the reliability of the system. Even for teams that hold a very high technical bar, backed by automated and mature processes and mechanisms, the number of incidents tends to increase on a logarithmic scale with the size of the organization, so providing visibility on those incidents to the rest of the organization and to the systems' users is key.

Today, incidents are communicated, updated, and resolved in the #crash Slack channel, with the channel's subject used as a status display. This is only visible to people with access to Decentraland's Slack workspace, which means a lack of transparency: anyone, and community members in particular, should be able to know the status of our incidents. The channel can also get messy or difficult to read, especially when more than one incident is occurring simultaneously. For these reasons, a more sophisticated (and automated) incident management process must be implemented.

Goals

Boost internal communication and alignment on incident-management matters by automating the process, while increasing the transparency and visibility of the platform status for the community.

Proposed solution

crashbot: a service that acts as the interface with Slack, serving as the point of contact for incidents and the place to update incident information, while collecting information that can be shared with the community.

The crashbot scope will include the following:

Design

stateDiagram-v2
    SupportTeam --> SlackApp
    
    SlackApp --> Server
    
    Server --> Database
    Database --> Server

    Server --> StatusPage
  1. The support team communicates with the service through commands in a Slack app, such as /create-incident and /update-incident.
  2. The Slack app calls the corresponding server endpoint.
  3. The server updates the database.
  4. The status page hits the server's /list endpoint and receives a JSON response to populate the page.
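
The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not the crashbot implementation: the Incident fields, the "open"/"closed" status values, the "<id> <status>" argument format for /update-incident, and all function and class names are assumptions made for the example, and an in-memory dictionary stands in for the database.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from itertools import count


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Incident:
    # Hypothetical record shape; the ADR does not define the fields.
    id: int
    title: str
    status: str
    updated_at: str


class IncidentStore:
    """In-memory stand-in for the Database node in the diagram."""

    def __init__(self):
        self._incidents = {}
        self._ids = count(1)

    def create(self, title):
        incident = Incident(next(self._ids), title, "open", _now())
        self._incidents[incident.id] = incident
        return incident

    def update(self, incident_id, status):
        incident = self._incidents[incident_id]  # KeyError if the id is unknown
        incident.status = status
        incident.updated_at = _now()
        return incident

    def list_incidents(self):
        return [asdict(i) for i in self._incidents.values()]


def handle_command(store, command, text):
    """Steps 2 and 3: route a slash command to a database operation."""
    if command == "/create-incident":
        return store.create(title=text)
    if command == "/update-incident":
        # Assumed argument format: "<id> <new status>"
        incident_id, status = text.split(maxsplit=1)
        return store.update(int(incident_id), status)
    raise ValueError(f"unknown command: {command}")


def list_endpoint(store):
    """Step 4: the JSON body the status page would receive from /list."""
    return json.dumps(store.list_incidents())


if __name__ == "__main__":
    store = IncidentStore()
    handle_command(store, "/create-incident", "Example outage")
    handle_command(store, "/update-incident", "1 closed")
    print(list_endpoint(store))
```

Keeping the command handler separate from the /list serializer mirrors the diagram: the Slack app only writes through the server, and the status page only reads.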

Endpoints

License

Copyright and related rights waived via CC0-1.0. Living