Artificial Intelligence in Skin Cancer Diagnosis: A Reality Check
The field of skin cancer detection offers a compelling use case for the application of artificial intelligence (AI) within the realm of image-based diagnostic medicine. Through the analysis of large datasets, AI algorithms have the capacity to classify clinical or dermoscopic images with remarkable accuracy. Although these AI-based applications can operate both autonomously and under human supervision, the best results are achieved through a collaborative approach that combines the strengths of AI systems and human experts. However, most studies assess the diagnostic accuracy of AI in artificial settings rather than in real-world scenarios. Consequently, the practical utility of AI-assisted diagnosis in a clinical environment is still largely unknown. Furthermore, there exists a knowledge gap concerning the optimal use cases and deployment settings for these AI systems as well as the practical challenges that may arise from widespread implementation. This review explores the advantages and limitations of AI in a variety of real-world contexts, with a specific focus on its value to consumers, general practitioners, and dermatologists.
Introduction
The first instance of a writer speculating that humans could be replaced by intelligent machines is often attributed to Aristotle. In a section of his work Politics, he states “For if each of the tools were able to complete its own work when commanded, or by perceiving in advance, (…) master-craftsmen would need no subordinates nor masters of slaves” (Lagrandeur, 2020). Since the time of Aristotle, humanity has experienced continual technological evolution, progressing from simple tools to modern computers. At the dawn of the computer age, researchers initially believed that they could achieve meaningful results through rule-based systems, much like the syllogisms proposed by Aristotle. In recent years, however, data-driven artificial intelligence (AI) has gained prominence over rule-based AI.
Modern machine learning techniques rely on vast datasets to identify patterns useful for classification. Diagnostic imaging represents one of the most promising arenas for AI research. Skin cancer detection in particular serves as an appealing application for AI, given that diagnoses often hinge on the subjective visual interpretation of clinical and dermoscopic images. AI-assisted diagnosis promises several advantages. For instance, AI could improve access to specialist-level expertise. The scarcity of dermatologists is a serious problem in many regions, often leading to protracted waiting times for specialist appointments. In addition, there is growing optimism that AI-based systems might offer greater consistency and higher accuracy than human experts. Esteva et al (2017) first demonstrated the efficacy of convolutional neural networks (CNNs) for the task of image-based classification in dermatology. CNNs are specialized types of neural networks that are particularly well suited for image analysis and are predominantly trained using supervised learning techniques. This requires that the images in the training dataset be labeled with a diagnosis, which serves as the ground truth. This label is essential for the system to learn the relationship between the input data and the corresponding diagnosis. Esteva et al (2017) demonstrated that an AI trained in this manner could differentiate between malignant and benign skin lesions at an expert level. Subsequent studies, such as those by Haenssle et al (2020) and Tschandl et al (2019b), corroborated these findings across different image sets and categories of skin disease. Fueled by advances in computing power and the increasing availability of training images, the diagnostic accuracy of AI systems continued to improve over time. However, the practical utility of these AI systems in real-world settings remains a subject of ongoing inquiry because most studies have been conducted in artificial environments. In this review, we explore AI applications in various real-world scenarios, shedding light on both their strengths and limitations.
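To make this supervised setup concrete, the sketch below shows a minimal CNN classifier in PyTorch. It is purely illustrative: the architecture, image size, and single training step are simplified assumptions for exposition, not the transfer-learning models used in the cited studies.

```python
# A minimal sketch of supervised CNN training for lesion classification.
# Illustrative only: the architecture and data are placeholder assumptions,
# not the models or datasets used in the studies cited above.
import torch
import torch.nn as nn

class LesionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two small convolutional blocks extract image features
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Final linear layer maps features to 2 classes: benign vs. malignant
        self.classifier = nn.Linear(32 * 56 * 56, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = LesionCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One supervised step: the diagnosis label is the ground truth that lets
# the network learn the mapping from image to diagnosis.
images = torch.randn(8, 3, 224, 224)   # placeholder batch of RGB images
labels = torch.randint(0, 2, (8,))     # placeholder ground-truth diagnoses
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```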

AI Tools for Consumers
Over recent decades, numerous smartphone apps designed for self-evaluation of skin lesions have emerged. More than half of these apps equip users with self-monitoring features, including the ability to log, organize, and track moles using the built-in camera on their mobile devices. Some apps serve as intermediaries, transmitting images to qualified experts for evaluation of potential risks on the basis of the submitted photos. Other apps autonomously categorize lesions as either high or low risk, offering guidance on whether medical consultation may be necessary.
In 2018, the Cochrane Skin Cancer Diagnostic Test Accuracy Group conducted a systematic review of studies published up to August 2016 examining the diagnostic accuracy of smartphone apps for identifying melanoma and borderline lesions (Chuchu et al, 2018). Their review highlighted the overall low methodological quality and scarcity of robust evidence in this area. Of the 1051 studies assessed, 203 were reviewed, 16 were identified as potentially eligible, and ultimately, only 2 were included. The primary reasons for exclusion were inappropriate study population, inappropriate index tests, absence of categorical outcome measures in the form of 2 × 2 tables, and the study being derivative in nature. Among the 2 studies that were included, one was a monocentric retrospective case-control study that aimed to analyze the accuracy of 4 undisclosed mobile apps in diagnosing 188 skin lesions (Wolf et al, 2013). The other was a prospective trial focused on testing the diagnostic capabilities of the SkinVision App in 144 lesions (Maier et al, 2015). The reference standard for both studies was the histopathological report. Both studies were considered to have high risks of bias and applicability concerns regarding population selection. They were conducted in specialized centers, which differed from the community setting where these apps are intended for use. In addition, many lesions were excluded owing to poor image quality, an issue that should be considered in real-world applications. Both studies also received high-risk ratings for bias concerning flow and timing. Across the 4 AI-based apps tested in these studies, true positive rates ranged from 7% (95% confidence interval [CI] = 2–16) to 73% (95% CI = 52–88), and true negative rates ranged from 37% (95% CI = 29–46) to 94% (95% CI = 87–97). The authors concluded that smartphone apps using AI-based analysis did not show sufficient accuracy and that they were associated with a high likelihood of missing melanomas (Chuchu et al, 2018).
Similar conclusions were reached in a more recent systematic review that also focused on the diagnostic accuracy of AI-powered smartphone apps for detecting skin cancer (Freeman et al, 2020). In addition to the 2 studies analyzed by Chuchu et al (2018), the authors included 5 more studies published after 2016 as well as 2 studies excluded by Chuchu et al (2018) owing to the low number of lesions included (Chadwick et al, 2014; Robson et al, 2012). Nearly all the included studies were prospective, and only 2 studies were retrospective (Chadwick et al, 2014; Wolf et al, 2013). One study broadened its focus beyond melanocytic lesions and included other types of skin cancers such as basal cell carcinoma (BCC) and squamous cell carcinoma (SCC) (Thissen et al, 2017). Three of the studies used expert assessment as the reference standard (Chung et al, 2018; Nabil et al, 2017; Ngoo et al, 2018), a criterion considered to be at high risk for bias. Only 2 trials were rated as being at low risk for patient selection bias (Dorairaj et al, 2017; Thissen et al, 2017), whereas all the studies were deemed to have a high risk of bias concerning flow and timing, patient selection, and index tests. One significant critique concerning all included studies was that the image quality was likely better than what would be expected in real-world settings. In retrospective studies, archived images were selected on the basis of their quality, whereas in prospective studies, clinicians acquired images using a standard protocol under optimized conditions with a single camera rather than participants using their own devices. It is worth noting that only 2 (TeleSkin skinScan and SkinVision) of the 6 mobile apps mentioned in this systematic review were still available at the time the manuscript was drafted. Two had been withdrawn from the market after investigations by the US Federal Trade Commission, and 2 others were no longer available. Although no peer-reviewed studies evaluating the TeleSkin skinScan app could be located, the review authors did provide an analysis of the sensitivity and specificity of the SkinVision app, conducting a per-lesion assessment across various trials. In the study by Thissen et al (2017), the top-performing app exhibited a sensitivity of 88% and a specificity of 79%. In a hypothetical population of 1000 adults, assuming a melanoma prevalence of 3%, this level of performance would result in 4 of 30 melanomas going undetected. In addition, over 200 individuals would be burdened with false-positive results. The authors concluded that the poor performance was likely due to the low methodological quality of the included studies. Specific issues highlighted included selective participant recruitment, inadequate reference standards, differential verification processes, and a high frequency of images that could not be evaluated.
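The hypothetical-population figures above follow from simple arithmetic on the reported sensitivity and specificity. The short calculation below reproduces them using only the numbers cited in the review (88% sensitivity, 79% specificity, 3% prevalence, 1000 adults).

```python
# Reproducing the review's hypothetical-population arithmetic using only
# the figures cited above; no additional data are assumed.
population = 1000
prevalence = 0.03      # assumed melanoma prevalence
sensitivity = 0.88     # best-performing app (Thissen et al, 2017)
specificity = 0.79

melanomas = population * prevalence                             # 30 true melanomas
missed = melanomas * (1 - sensitivity)                          # 3.6 -> about 4 missed
false_positives = (population - melanomas) * (1 - specificity)  # 203.7 -> over 200

print(f"melanomas={melanomas:.0f}, missed={missed:.1f}, "
      f"false positives={false_positives:.0f}")
```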
The SkinVision app was also assessed in a single-center, comparative, observational cohort study carried out in Switzerland from January to June 2021 (Jahn et al, 2022). In this study, patients underwent comprehensive examinations conducted by dermatologists, which included total body photography (TBP). In addition, all nevi measuring 3 mm or larger as well as any smaller melanocytic lesions deemed suspicious were evaluated using smartphones equipped with the specialized app. Of 1204 pigmented skin lesions analyzed, the smartphone app classified 980 (81%) lesions as benign and flagged 224 (19%) as carrying an increased risk for melanoma. In contrast, dermatologists diagnosed 1195 (99.3%) lesions as benign and identified only 9 (0.7%) as suspicious. As a result, the app classified pigmented skin lesions as suspicious 27 times more frequently than the dermatologists did, indicating a concerning rate of false positives.
Since 2020, the landscape of algorithm-based apps for skin cancer detection has been rapidly evolving, in part owing to significant improvements in the processing power and camera quality of mobile devices. Sun et al (2022) identified 25 apps focused on skin cancer self-detection by searching the Apple App Store, Google Play Store, and Google Search. They then tested these apps using a small independent set of clinical images comparable with those typically submitted through smartphones. The study highlighted considerable variability in the apps’ outputs, including diagnosis, risk category, and risk score, with mean diagnostic accuracies of 56%, 60%, and 64%, respectively. The authors concluded that although a direct comparison was challenging, the accuracy of these apps was highly variable and generally low. Such inconsistencies could potentially lead to false reassurance and reluctance to seek medical care or, conversely, to emotional distress and increased healthcare utilization. However, a significant drawback of this study lies in its retrospective design. Specifically, if the mobile app under test did not permit image uploads, images were displayed on a 4K ultrahigh-definition monitor and then captured with a smartphone, a method far removed from real-life usage.
In a separate study, 2 new market-approved AI algorithms designed to detect and analyze skin tumors were evaluated in a single-center, prospective validation trial conducted in a tertiary referral center in Austria between June 2018 and December 2019 (Kränke et al, 2023); the study employed 5 different mobile phones. A total of 1171 lesions were included in the analysis. When classifying the lesions into benign versus nonbenign categories, the algorithm designed for analysis showed a sensitivity of 95.4% (95% CI = 93.5–97.3), whereas the algorithm focused on detection had a sensitivity of 96.4% (95% CI = 93.5–98.9). Specificity was 90.3% (95% CI = 88.1–92.5) for the analysis algorithm and 94.9% (95% CI = 92.5–97.2) for the detection algorithm. Although these results are promising, the study authors noted that applicability may be limited when the individuals selecting and capturing the images are patients rather than clinicians.
In conclusion, although there has been notable progress in the regulation of consumer apps designed for skin cancer diagnosis, stronger clinical validation is still required to substantiate their purported benefits. Consumer apps could offer immediate preliminary assessments from the comfort of one’s home, increasing access to healthcare services for those limited by geography, time, or financial constraints. They could also encourage regular skin self-examinations, promoting early detection and more proactive healthcare habits. However, using consumer-generated images of skin lesions for skin cancer detection without proper oversight carries risks stemming from the absence of the expertise, standardized procedures, and supervision that clinicians usually provide.
AI as a Tool for General Practitioners
In theory, AI incorporated into primary care, if equipped with sufficient sensitivity, could enhance the effective triage of high-risk lesions toward secondary care. This would ensure that individuals with skin cancer promptly receive appropriate medical attention. Conversely, a high specificity could minimize unnecessary referrals and promptly alleviate patient anxiety. Likewise, combining AI with teledermatology could enable referral urgency to be communicated to specialized clinics or dermatologists, supplemented by relevant clinical information and images. This integration has the potential to augment the efficiency and precision of referral processes. However, it is crucial to maintain a high level of specificity to avoid unacceptable spikes in referral rates to dermatologists and specialized centers (Ferrante di Ruffano et al, 2018). Jones et al (2022) conducted a systematic review encompassing all studies related to AI technologies aimed at facilitating the early detection of skin cancer within primary and community care settings, covering studies published from January 1, 2000 to August 9, 2021. The authors categorized study populations into high-prevalence and low-prevalence populations and found that only 2 studies were conducted in low-prevalence populations. Hence, they opted to review the data from all 272 studies that employed AI and machine learning to assess skin lesions because these studies remain relevant to the implementation of these technologies in primary care. Although diagnostic accuracy was reasonable for melanoma (89.5%), SCC (85.3%), and BCC (87.6%), methodological heterogeneity and the lack of data from low-prevalence clinical settings suggest caution in recommending widespread adoption of AI in primary care. The authors stressed that the populations under examination did not adequately reflect the diversity of the broader population. Furthermore, they highlighted challenges such as limited access to nonpublic datasets, insufficient transparency regarding training methodology, the lack of prospective studies, the potential for bias and overfitting, and concerns regarding AI performance with out-of-distribution images (Dick et al, 2019; Jones et al, 2022).
The first randomized controlled trial in this field aimed to investigate whether a multiclass AI algorithm could enhance the accuracy of nondermatologists examining patients with suspicious skin lesions (Han et al, 2022). These skin lesions were identified either by the patient or a clinician without the use of dermoscopy. A total of 524 biopsy-proven cases and 52 clinically diagnosed cases were randomly selected for evaluation, with or without AI support, by first-year dermatology residents and nondermatology trainees. The AI-aided group achieved a top-1 accuracy of 53.9%, contrasting with the unaided group’s accuracy of 43.8%. The improvement was statistically significant among nondermatology trainees with limited dermatology experience, whereas it was not significant for dermatology residents.
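For readers unfamiliar with the metric, top-1 accuracy counts a multiclass prediction as correct only when the single highest-ranked diagnosis matches the ground truth. The snippet below illustrates the computation; it is not code from Han et al (2022), and the toy scores and labels are invented for the example.

```python
# Illustrative top-k accuracy for a multiclass classifier (toy data,
# not from Han et al, 2022). With k=1, only the single highest-scoring
# diagnosis is compared against the ground truth.
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    top_k = np.argsort(scores, axis=1)[:, -k:]       # k highest-scoring classes
    hits = np.any(top_k == labels[:, None], axis=1)  # true label among them?
    return float(hits.mean())

scores = np.array([[0.7, 0.2, 0.1],   # model ranks class 0 first
                   [0.1, 0.3, 0.6],   # model ranks class 2 first
                   [0.4, 0.5, 0.1]])  # model ranks class 1 first
labels = np.array([0, 1, 1])
print(top_k_accuracy(scores, labels, k=1))  # 2 of 3 correct -> ~0.667
```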
Escalé-Besa et al (2023) conducted a prospective study to compare the diagnostic accuracy of AI with that of general practitioners (GPs) in a real-life setting. Eleven GPs were tasked with making a diagnosis and then submitting the image to an AI model. Three questions probed the GPs on whether the model’s results aided in diagnosis and management. The scope of this study extended beyond skin tumors to include various inflammatory conditions for which the AI was not adequately trained, such as granuloma annulare, scabies, and hidradenitis. Even after adjustments for this inadequate training, the diagnostic accuracy of the AI was lower than that of GPs and dermatologists (Escalé-Besa et al, 2023).
Regarding satisfaction and the acceptance of AI, Escalé-Besa et al (2023) found that 92% of GPs responded affirmatively to the question of whether AI aided in their differential diagnosis approach. In 60% of cases, the AI tool was instrumental in arriving at the diagnosis, and in 34% of cases, teledermatology consultation could have been avoided. Two survey-based studies yielded similar results, with GPs expressing strong support for AI and perceiving significant advantages, contrary to dermatologists’ concerns about replacement (Samaran et al, 2021; Sangers et al, 2023).
AI as a Tool for Dermatologists
Within dermatology, AI finds application across diverse tools dedicated to the detection of skin cancer, including TBP and dermoscopy. Market-approved AI-powered software for dermatologists, such as the MoleAnalyzer Pro (FotoFinder ATMB) and DEXI (Vectra WB360), is typically integrated with specialized hardware. These AI-powered applications have undergone testing in retrospective and prospective trials (Cerminara et al, 2023; Haenssle et al, 2020; Jahn et al, 2022; Winkler et al, 2023).
When applied to TBP for patients with multiple nevi, AI assumes a supportive role for dermatologists. In such cases, AI assists in sorting and visualizing nevi, empowering dermatologists to effectively compare images and identify changes and outliers. AI further facilitates the swift recognition of new or evolving lesions (Figure 1) by juxtaposing total-body images from current and previous visits (Salerni et al, 2012).
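As a deliberately simplified illustration of this comparison step, the sketch below flags lesions in a current visit that have no nearby counterpart in a previous visit. It assumes lesions have already been detected and registered to common body-map coordinates; commercial systems such as those named above rely on proprietary detection and registration pipelines, and the coordinates and matching tolerance here are invented.

```python
# A simplified sketch of flagging new lesions between two TBP visits.
# Assumes lesions are already detected and mapped to shared body-map
# coordinates; real systems use proprietary registration pipelines.
import numpy as np

def flag_new_lesions(previous: np.ndarray, current: np.ndarray,
                     max_dist: float = 5.0) -> np.ndarray:
    """Return indices of current-visit lesions with no nearby prior match.

    previous, current: (n, 2) arrays of lesion coordinates (e.g., mm on a
    registered body map); max_dist: matching tolerance (assumed value).
    """
    new_indices = []
    for i, lesion in enumerate(current):
        dists = np.linalg.norm(previous - lesion, axis=1)
        if dists.min() > max_dist:   # no prior lesion close enough
            new_indices.append(i)
    return np.array(new_indices)

prev_visit = np.array([[10.0, 20.0], [55.0, 80.0]])
this_visit = np.array([[10.5, 19.8], [55.2, 80.1], [120.0, 40.0]])
print(flag_new_lesions(prev_visit, this_visit))  # [2] -> one new lesion
```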
Original article published at https://www.sciencedirect.com/science/article/pii/S0022202X23029640 by Gabriella Brancaccio, Anna Balato, Josep Malvehy, Susana Puig, Giuseppe Argenziano, Harald Kittler