Microservice architectures amplify the volume and complexity of security vulnerabilities, making it increasingly difficult for security and SRE teams to decide which services to patch, when to patch them, and how to coordinate patches under strict cost and availability constraints. Traditional prioritization schemes based on CVSS scores or static business criticality heuristics ignore inter-service dependencies, deployment topologies, and operational costs such as downtime, rollback risk, and engineering effort. In this paper, we propose M-VP2, a microservice-oriented vulnerability patch planning framework that formulates patch scheduling as a cost-aware multi-agent reinforcement learning (MARL) problem. Each microservice is modeled as an autonomous agent that selects patching actions over time (e.g., patch now, defer, or batch with other changes), while a joint reward function balances security risk reduction, patching and downtime cost, and compliance with service-level objectives. The environment captures call-graph dependencies, cascading failure modes, and temporal exploit likelihood, enabling agents to learn coordination strategies that avoid risky simultaneous updates on tightly coupled services. We design a hierarchical actor–critic architecture with centralized training and decentralized execution, augmented with a risk-aware reward shaping mechanism to penalize unsafe patch combinations and SLA violations. Extensive simulation experiments on synthetic and real-world–inspired microservice topologies show that M-VP2 reduces expected breach risk and aggregate patching cost by up to double-digit percentages compared with CVSS-based heuristics, greedy risk–cost ranking, and single-agent RL baselines, while producing patch plans that are more stable, interpretable, and aligned with operational constraints.