Acoustic side-channel attacks (ASCAs) exploit the unintended sound emitted by keyboards to infer typed input, but existing methods generally assume manually labelled training data and controlled environments, which limits their applicability to realistic scenarios such as online lectures. We develop a pipeline that automatically labels keystroke-sound samples captured from online coding tutorials: video frames are processed with optical character recognition (OCR) to extract the ground-truth character sequence, audio is segmented into clips centred on detected click events, and the two streams are aligned. A convolutional neural network (CNN) is trained on mel-spectrogram features, and transfer learning adapts the pretrained model to a target user from only a few samples. Our dataset covers 50 unique keys from standard QWERTZ keyboards recorded during real programming lectures. On a held-out test set, the CNN achieves 98.1 % top-1, 99.4 % top-2 and 100 % top-3 accuracy. Transfer learning retains strong performance with as few as 13 samples per key. Pairing OCR-derived ground truth with acoustic CNN classification removes the labelling bottleneck that has limited previous ASCAs, and the transfer-learning stage makes the attack viable with minimal per-victim data. All code, trained models, and labelled datasets are released to support reproducible research.
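
For readers unfamiliar with the reported metrics, top-k accuracy counts a prediction as correct when the true key is among the k classes the classifier scores highest. The following is a minimal sketch in Python with NumPy; the function name `top_k_accuracy`, the three-class setup, and the toy scores are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A sample is a hit if its true label appears anywhere in its top-k set.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy classifier scores: 4 samples over 3 hypothetical key classes (assumed data).
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
    [0.2, 0.7, 0.1],
])
labels = np.array([0, 1, 1, 1])  # assumed ground-truth class indices

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k):.2f}")
```

On this toy data the sketch reports 0.50, 0.75 and 1.00 for k = 1, 2 and 3. Top-k accuracy can never decrease as k grows, which is consistent with the ordering of the reported top-1, top-2 and top-3 figures.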