Since the emergence of convolutional neural networks (CNNs) and, later, vision transformers (ViTs), the standard paradigm for model development has been to stack identical block types with varying parameters/hyper-parameters. To leverage the benefits of different architectural designs (e.g., CNNs and ViTs), we propose alternating structurally different types of blocks to generate a new architecture, mimicking how Lego blocks can be assembled. Using two CNN-based blocks and one SwinViT-based block, we investigate three variations of the proposed LegoNet that apply this block-alternation concept to the segmentation task in medical imaging. We also study a new clinical problem that has not been investigated before: segmentation of the right internal mammary artery (RIMA) and perivascular space from computed tomography angiography (CTA), a structure that has been shown to have prognostic value for primary cardiovascular outcomes. We compare the model's performance against popular CNN and ViT architectures on two large datasets (achieving a Dice similarity coefficient (DSC) of 0.749 on the larger dataset). We also evaluate the model's performance on three external testing cohorts, where an expert clinician corrected the model-segmented results (DSC > 0.90 for all three cohorts). To assess the suitability of the proposed model for clinical use, we perform intra- and inter-observer variability analyses. Finally, we investigate a joint self-supervised learning approach to determine its impact on model performance.
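The block-alternation idea can be illustrated with a minimal sketch. This is a hypothetical, framework-free illustration of cycling through heterogeneous block types when building a network; the block names (`ConvBlockA`, `SwinViTBlock`, `ConvBlockB`) are placeholders standing in for the two CNN-based blocks and one SwinViT-based block mentioned above, not the paper's actual module definitions.

```python
from itertools import cycle, islice

# Placeholder block type names (assumption: the real LegoNet uses two
# CNN-based blocks and one SwinViT-based block with their own parameters).
BLOCK_TYPES = ["ConvBlockA", "SwinViTBlock", "ConvBlockB"]

def build_legonet_layout(depth):
    """Return the sequence of block types for a network of `depth` blocks,
    alternating structurally different blocks Lego-style instead of
    repeating a single block type."""
    return list(islice(cycle(BLOCK_TYPES), depth))

layout = build_legonet_layout(6)
print(layout)
# → ['ConvBlockA', 'SwinViTBlock', 'ConvBlockB',
#    'ConvBlockA', 'SwinViTBlock', 'ConvBlockB']
```

In a real implementation each name would map to an instantiated module (e.g., a residual convolutional block or a Swin transformer stage), and the three LegoNet variations would correspond to different alternation orders and depths.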