M.Sc Thesis

M.Sc Student: Farhi Ido
Subject: Deep-Learning Based Echocardiogram View Classification Using Spatio-Temporal
Department: Department of Biomedical Engineering


            Improvements in medical imaging have enabled the collection of increasingly detailed and precise data that are useful for making diagnoses and monitoring treatments. In particular, ultrasound imaging of the heart, known as echocardiography, has proven to be an effective imaging tool. It is therefore routinely used in the diagnosis, management, and follow-up of patients with suspected or known heart disease.

            Recently, the ever-shrinking size and increased mobility of ultrasound machines have improved their efficiency and expanded their use by less trained operators. Although echocardiography videos can be acquired by a less trained operator, their analysis still requires interpretation by a highly trained professional.

            The essential first step in developing an autonomous echocardiography diagnostic system is automatic view classification. Once the view is classified, it becomes possible to analyze the heart’s performance, compare it to previous exams of the same patient or to other patients’ exams, and finally offer a diagnosis or treatment.

            Automatic echocardiography view classification is considered a challenging problem. Several views may look very similar in different parts of their clips, and although intra-view variability is high, inter-view variability is relatively low. The low signal-to-noise ratio, concealment, reflections and artifacts typical of ultrasound imaging exacerbate the problem. In addition, since the acquisition process is manual, there is an inherent inconsistency between samples as well as variability relative to the "gold standard".

            This work presents a novel method for automatic classification of the six standard echocardiography views (Mitral valve, Papillary muscle, Apex, 2 Chamber, 3 Chamber, 4 Chamber) using machine learning. The method uses a deep learning network with 3D convolution layers, which allow the network to "run" across both the spatial and temporal dimensions and find patterns that characterize each view. Because the dataset used in this study was relatively small, a dedicated method was developed to increase the number of samples by breaking down each video into 5 clips. The final video classification is based on the aggregate results of all five clips. The breakdown offers two significant benefits. First, the decomposition increases the number of training samples by a factor of 5. Second, analogous to the approach known as "ensemble", the independent classification of 5 clips adds a degree of noise to the calculations, which helps improve accuracy by several percentage points.
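
The clip breakdown and aggregation described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how the 5 clips are cut or how their results are combined, so the sketch assumes contiguous equal-length clips and averaging of per-clip class probabilities. The names `split_into_clips`, `aggregate_clip_predictions`, `classify_video` and `clip_classifier` are hypothetical.

```python
import numpy as np

# The six views named in the abstract, in an arbitrary (assumed) index order.
VIEWS = ["Mitral valve", "Papillary muscle", "Apex",
         "2 Chamber", "3 Chamber", "4 Chamber"]


def split_into_clips(video, n_clips=5):
    """Split a (frames, height, width) video into n_clips contiguous clips.

    Assumption: clips are contiguous and of equal length; trailing frames
    that do not fill a whole clip are dropped.
    """
    clip_len = video.shape[0] // n_clips
    if clip_len == 0:
        raise ValueError("video has fewer frames than requested clips")
    return [video[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def aggregate_clip_predictions(clip_probs):
    """Average per-clip class probabilities (assumed aggregation rule).

    Returns the index of the winning view and the averaged probabilities.
    """
    mean_probs = np.mean(np.asarray(clip_probs, dtype=float), axis=0)
    return int(np.argmax(mean_probs)), mean_probs


def classify_video(video, clip_classifier, n_clips=5):
    """Classify each clip independently, then aggregate to one video label.

    clip_classifier is a placeholder for the trained network: any callable
    that maps one clip to a vector of six class probabilities.
    """
    clips = split_into_clips(video, n_clips)
    clip_probs = [clip_classifier(clip) for clip in clips]
    return aggregate_clip_predictions(clip_probs)
```

During training, each clip returned by `split_into_clips` would be treated as a separate sample carrying its parent video's label, which is where the factor-of-5 increase in training samples comes from.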

            Finally, the network's decisions were analyzed using Grad-CAM, which highlighted the areas of the video used for classification and showed that the network relies on features similar to those a human expert would use. In addition, to test whether the network uses temporal information, it was trained once with frames in their original order and once with frames in random order. The significant decrease in performance when training on randomly ordered frames shows that the network uses both temporal and spatial information.
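
The frame-order test can be sketched as follows. This is only an illustration of the shuffling step, assuming clips are stored as `(frames, height, width)` arrays; the helper names `shuffle_frames` and `temporal_accuracy_drop` are hypothetical, and the training runs themselves are not shown.

```python
import numpy as np


def shuffle_frames(clip, rng):
    """Return a copy of a (frames, height, width) clip with frames reordered.

    Each frame is left intact, so spatial content is preserved and only the
    temporal order is destroyed.
    """
    order = rng.permutation(clip.shape[0])
    return clip[order]


def temporal_accuracy_drop(acc_ordered, acc_shuffled):
    """Accuracy lost when the same network is trained on shuffled frames.

    A clearly positive value suggests the network relies on temporal order.
    """
    return acc_ordered - acc_shuffled


# Usage: build the shuffled counterpart of one clip.
rng = np.random.default_rng(seed=0)
clip = np.arange(10 * 4 * 4, dtype=float).reshape(10, 4, 4)
shuffled_clip = shuffle_frames(clip, rng)
```

Training one network on the original clips and another on their shuffled counterparts, then comparing the two accuracies, mirrors the comparison described above.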

            The network achieves 90% accuracy on test-set videos that were not used for training or parameter optimization. It especially excels at distinguishing between the long-axis and short-axis view groups, reaching nearly 100% accuracy. When the network parameters were optimized using 10-fold cross-validation, accuracy reached 96.5%.