The Importance of Data Sources for Machine Learning Applications in Autism: A Mini Review

Autism spectrum disorder (ASD) is a group of lifelong heterogeneous neurodevelopmental conditions with a wide range of severity levels that affect social communication and social interaction. Diagnosis of ASD relies on subjective observation of these clinical phenotypes. The growing body of big data generated by subjective methods, and more recently by objective high-throughput technologies such as omics for the detection of biomolecules, is being successfully applied to a rapidly growing number of machine learning (ML) algorithms to inform research on diagnostics and interventions for patients with ASD. While most reviews in this area focus on the ML approaches, we highlight the impact of the database on the expected outcomes in ML-based ASD research studies.


Introduction
Autism spectrum disorder (ASD) is a set of neurodevelopmental conditions diagnosed by a qualified clinician such as a developmental pediatrician or neurologist [1]. It is characterized by qualitative impairments in social interaction and communication, as well as restricted, repetitive, and/or stereotyped patterns of behavior [2]. ASD diagnosis is not a straightforward process and is often made long after onset. In most cases, assessment is reliable at the age of 2 years [3], and sometimes at 18 months [4], while onset can occur as early as the first or second trimester [5], as fever-associated immune disturbances in response to prenatal infectious agent exposure [6] lead to a pleiotropic effect on metabolic pathways [7].
There is no "one-pill-fits-all" approach for ASD treatment. Personalized educational and behavioral therapies are the main approaches, supplemented with prescription medication in 48% of children [8]. Evaluation of treatment effectiveness in children with ASD is challenging due to the variability in the symptoms expressed and in severity levels, both among children with ASD and within each child over time. It also requires stepwise assessments that often involve interactions with family members and caregivers. Dykens et al. [9] showed that these in-person interactions for ASD evaluation can introduce additional variation resulting from the child's distress, impacting the outcome.
With the significant increase in data availability, ML algorithms are being utilized to develop new methods to diagnose, distinguish categories of patients, predict and monitor the efficacy of therapy, and identify the underlying conditions of ASD. This mini review aims to provide an overview of ASD ML studies with a focus on the importance of database selection for ML applications and the ability to incorporate bioinformatics tools such as systems biology and disease genomics to achieve the desired outcomes.

Data Sources
The significant increases in the number of ASD cases, the amount of ASD-related data from multiple technologies (e.g., genomics, rs-fMRI) and the number of sources (e.g., national databases, foundations for ASD research) are currently driving the growth in database sources available for ML applications [10,11]. However, ML predictions in life sciences are heavily dependent on high-quality data characterized by: 1) correct experimental design, in which investigators can estimate the errors and understand the bias and sensitivity of the data; 2) standardization of data repositories, in which the processes for data extraction, analysis, and quality control are standardized; and 3) reproducibility, meaning a statistical design and analysis that ensure reproducibility of a study at the experimental, empirical, computational and ethical levels [12]. While the information regarding experimental design and data repositories is documented in the source, reproducibility is often an unknown factor, with a higher impact in data from subjective evaluation. In a study aiming to determine the validity of the findings of 100 peer-reviewed studies published in three psychology journals, the authors found that 50% of the studies could not be reproduced [13].
Although successful replication establishes only the validity of the results, it is a prerequisite for medical and physiological interpretation of ML predictions.
The availability of multiple high-quality data sources for clinical phenotypes, obtained via a variety of modalities including observations by parents, clinicians, video, and audio devices [11], and of omics techniques that numerically quantify fundamental biological processes [14], has the potential to associate behavior with omics information in children with ASD, but presents challenges for interpretation. For example, linking available data from clinical phenotypes of ASD to genetic factors such as the high-confidence ASD (hcASD) genes during fetal development [5], environmental factors such as maternal nutrition and viral and bacterial infections [15], and cultural beliefs at the community level that can delay early intervention and impact the severity of clinical phenotypes of ASD [16] is possible, but interpretation is difficult.
Figure 1 illustrates the general path from genotype to clinical phenotype and is divided into two parts: 1) the cellular, which is evaluated by objective omics methods; and 2) the clinical phenotype, in which subjective human interpretation is required during the process. Although the figure presents a straight line between the two parts, results are not entirely a continuation, as objective and subjective evaluations capture different aspects of the diagnosis and act as complementary rather than overlapping information. This can affect the ability of ML to predict clinical phenotype directly from genetics.

Cellular Level
At the single-cell level, the technologies to extract data belong to the omics disciplines, a suffix used in life sciences to describe the large-scale data/information required to understand a complete biological system [17].
Using cellular features such as DNA and mRNA in a high-throughput manner, researchers can characterize different biological systems in a static or dynamic mode and connect the information from DNA all the way downstream to a metabolite. Genomics databases can integrate with disease genomics to identify disease-associated genes and disease-causing mutation biomarkers, and multiple omics databases can integrate with bioinformatics platforms such as systems biology to construct networks, predict interactions and monitor dynamic responses [18]. Since gene expression is regulated at the mRNA and protein levels from transcription initiation to protein degradation, metabolomics has the best tools to link the individual physiological/pathophysiological state to both downstream objective methods and upstream subjective methods while factoring in the impact of genetics, environmental stimuli, diet, and gut microbiome [19].

System Level
At the system level (Figure 1), upstream of the omics disciplines is the brain image-derived phenotype, a quantifiable data-driven approach that associates brain activity and the clinical phenotype of ASD. Linked to a specific area in the brain, functional magnetic resonance imaging (fMRI) is a noninvasive functional imaging technique in which the metabolic activity of tissues is determined indirectly via oxygen consumption [20]. Resting-state fMRI (rs-fMRI) is an advanced alternative that quantifies the spontaneous brain activity of an individual in the absence of stimuli (at rest).
[...] Interview-Revised (ADI-R), the main data sources, together with the Childhood Autism Rating Scale (CARS) and Gilliam Autism Rating Scale (GARS) [11]. Parental multiple-choice questionnaires [22] and Likert scale surveys [23] are often used with other data sources in ASD studies.
Linking omics technology and behavioral assessment was previously reported by Bent et al. [24] in a study that statistically correlated metabolomics with clinical phenotype in children with ASD treated with a sulforaphane supplement from broccoli. In this study, parental reports suggested a metabolic link between sphingolipids/sphingomyelins and improvement in clinical phenotype. A recent study by Quillet et al. [25] identified biomarkers distinguishing ASD and typically developing (TD) groups and linked the cannabinoids THC, CBD and CBG with metabolite levels in children with ASD. It was the first to use ML algorithms on a pharmacometabolomics dataset of previously identified cannabis-responsive biomarkers and other metabolites in children with ASD that shift toward the physiological levels determined in TD children after successful medical cannabis (MC) treatment [23,26].

Machine Learning Applications
Since 2012, researchers have trained ML algorithms on a wide range of data types to improve diagnostic processes and the understanding of ASD [11,27]. ML applications are often used to facilitate the direct diagnosis of ASD in individual patients, integrate observational data, and facilitate the analysis of parent-reported questionnaires and reported behavior from home-recorded videos [22], and/or kinematic and motion features from video recordings of adults [21]. ML applications are also used for ASD biomarker discovery, training on data acquired from a broad range of technologies: fMRI [20], metabolomics [25,28], proteomics [29] and transcriptomics [30]. At the genome level, ML has been applied for functional characterization of the genetic basis of ASD by constructing a gene-interaction network model [31].
These examples highlight the progress made in artificial intelligence (AI) in the past few years, and its potential for healthcare applications in general and for ASD diagnostics in particular as the availability, diversity, and quality of relevant data grow. This progress is driven by the ability of ML models to find complex, non-linear relationships in the data better than more traditional data analysis methods. The studies describe in detail the processes followed for the data processing and feature engineering steps. This is a key aspect of ML applications, as the data fed to the models is central to their performance.

Machine Learning Approaches
The quantity of data has a major impact as well. ML methods such as Support Vector Machines (SVM), Random Forest, Gradient Boosting, and Deep Neural Networks must be selected to suit the size of the dataset and the type of data. Datasets with a large number of features per sample require more samples and more complex models, such as deep neural networks (DNN). These networks, with multiple layers of artificial neurons, or computational units, are capable of modelling non-linear relations, and are associated with the branch of AI called deep learning that has enabled recent breakthroughs in applications such as computer vision, speech recognition, language modelling and medical image analysis [33-37].
In most of the ASD-related studies, a data-centric approach was adopted, where efforts focus on engineering the available data to get the best result using classic ML algorithms [32]. This includes curating a subset of samples with pre-defined properties and then finding the subset of features that yields more robust predictions over the available dataset. Biomarker discovery studies use iterative approaches to develop effective ML models that obtain good diagnostic predictors [28-30]. This necessity for data engineering reflects the need to get relevant results from datasets with limitations.
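The two data-centric steps described above (curating samples, then narrowing the feature set) can be illustrated with a minimal sketch. It is not the procedure of any cited study: the cohort, the age-based inclusion rule, the feature names `f1` and `f2`, and the choice of a scaled difference of group means as the ranking score are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def curate(samples, keep):
    """Keep only samples that satisfy a pre-defined inclusion rule."""
    return [s for s in samples if keep(s)]


def effect_size(samples, feature):
    """Absolute difference of group means, scaled by the overall spread."""
    asd = [s["features"][feature] for s in samples if s["label"] == "ASD"]
    td = [s["features"][feature] for s in samples if s["label"] == "TD"]
    spread = pstdev(asd + td)
    return abs(mean(asd) - mean(td)) / spread if spread else 0.0


def top_features(samples, k):
    """Return the k feature names that best separate the two groups."""
    names = list(samples[0]["features"])
    return sorted(names, key=lambda f: effect_size(samples, f), reverse=True)[:k]


# Hypothetical toy cohort (values invented for illustration only).
cohort = [
    {"label": "ASD", "age": 6, "features": {"f1": 9.0, "f2": 1.0}},
    {"label": "ASD", "age": 7, "features": {"f1": 8.0, "f2": 2.0}},
    {"label": "TD", "age": 6, "features": {"f1": 2.0, "f2": 1.5}},
    {"label": "TD", "age": 8, "features": {"f1": 1.0, "f2": 1.0}},
    {"label": "TD", "age": 30, "features": {"f1": 9.0, "f2": 9.0}},
]

# Step 1: curate samples with a pre-defined property (assumed: age <= 12).
curated = curate(cohort, lambda s: s["age"] <= 12)
# Step 2: keep the single most discriminative feature.
selected = top_features(curated, k=1)
print(len(curated), selected)
```

In practice, a selection step of this kind would normally be repeated inside each cross-validation fold, since ranking features on the full dataset before evaluation tends to give optimistic performance estimates.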
Developing ML applications involves trade-offs between the choice of models, the quantity of data available and the data engineering needed to improve the quality of the data. While there is no single definition of data quality, in the ML discipline we commonly refer to data that enables achievement of intended goals. While informative high-quality data can be hard to collect, behavioral data with limited quality and interpretation biases from hundreds or thousands of surveys is often more readily available. Abbas et al. [22] studied 10 features in over 5,000 individuals with ASD and over 1,300 TD individuals. Feature engineering may be used to reduce interpretation bias, for example, in the development of diagnostic tools by focusing on finding a small subset of features with sufficient generalization power. By contrast, fMRI is an extremely rich type of data, with a large amount of information that is not necessarily relevant to ASD itself and is prone to noise [38]. Annotating this type of data and gathering large datasets is challenging and very time consuming. Authors can use pre-existing knowledge to select robust samples and to distill the data to correlations between Regions Of Interest (ROIs) so that it is possible to train a model on a smaller number of samples [38].
Studies focusing on lower levels of systems biology such as transcriptomics, proteomics or metabolomics contain feature-rich datasets with limited numbers of samples. These can be managed through data-centric methods to obtain potential diagnostic solutions [28-30]. However, the data sources are of higher quality and point to a broader range of questions that can be answered given sufficient resources and larger datasets. For example, Quillet et al. [25] successfully used a pharmacometabolomics approach for distinguishing ASD groups and pharmacodynamic indications of cannabinoids using 645 features in 15 children with ASD and 9 TD children. This study linked metabolic changes in children with ASD to known biomarkers that can indicate clinical phenotype, such as the stress biomarker cortisol and the aggression biomarker dehydroepiandrosterone sulfate (DHEA-S).
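The distillation of fMRI data to correlations between ROIs can be sketched as follows. This is a minimal illustration, not the pipeline of the cited work: it assumes that each ROI has already been reduced to a single time series, uses Pearson correlation as the pairwise measure, and the ROI names and signal values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def connectivity_features(roi_series):
    """Map {roi_name: time series} to one correlation per unordered ROI pair."""
    names = sorted(roi_series)
    return {
        (a, b): pearson(roi_series[a], roi_series[b])
        for a, b in combinations(names, 2)
    }


# Hypothetical toy scan: three ROIs, five time points each.
toy_scan = {
    "roi_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "roi_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with roi_a
    "roi_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as roi_a rises
}

features = connectivity_features(toy_scan)
for pair, r in features.items():
    print(pair, round(r, 3))
```

With n ROIs, each scan is summarized by n(n-1)/2 correlation values regardless of how many time points were recorded, which is what makes training on a smaller number of samples more tractable.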
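The contrast between the survey-based and omics-based datasets mentioned above can be made concrete with a short calculation of samples per feature. The counts come from the text; treating "over 5,000" and "over 1,300" as exactly 5,000 and 1,300 is an assumption that gives a lower bound, and the ratio itself is only an illustrative summary, not a metric reported by either study.

```python
def samples_per_feature(n_samples, n_features):
    """Ratio of available samples to measured features."""
    return n_samples / n_features


# Behavioral survey data (Abbas et al.): 10 features,
# at least 5,000 ASD + 1,300 TD individuals (lower bound, assumed exact here).
survey_ratio = samples_per_feature(5000 + 1300, 10)

# Pharmacometabolomics data (Quillet et al.): 645 features,
# 15 children with ASD + 9 TD children.
omics_ratio = samples_per_feature(15 + 9, 645)

print(f"survey: {survey_ratio:.0f} samples per feature")
print(f"omics:  {omics_ratio:.3f} samples per feature")
```

The survey dataset has hundreds of samples for every feature, whereas the omics dataset has far fewer samples than features, which is why the latter depends on the data-centric curation described in this section.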

Bioinformatics Integration
Both supervised and unsupervised ML techniques have been successfully integrated with multi-omics databases.
Feldner-Busztin et al. [39] indicated the potential of the technologies while emphasizing the need to increase the sample size for each omics and the overall overlapping omics data per sample, namely the genomics, epigenomics, transcriptomics and metabolomics per sample. A range of analytical techniques are applied in several papers covering genomics [31], RNA signature [30,40], proteomics [29], and metabolomics [25,28]. In particular, two high-level bioinformatics annotation engines (Gene Ontology: geneontology.org; and KEGG: www.genome.jp/kegg/) are applied across these patient-derived bioinformatics datasets to permit classification of genes, RNA, proteins and metabolites. Annotations and clusters demonstrate relevance: 1) to medical condition (in this case ASD vs. TD); 2) with cellular and organ location; 3) with metabolic pathways permitting elucidation of high-level effects such as inflammation; and 4) with neuronal activity (e.g., endocannabinoid pathways and neuronal signaling).

Future Perspective
We are at an inflection point where the omics and analytics fields are maturing, ML is being applied across omics data, and pharmacometabolomics biomarkers (cannabis-responsive) are being identified. Providing the right sample size and features, together with the available tissue-, patient-, cohort-, and pathophysiology-specific bioinformatics knowledge, will allow ML applications to associate the current clinical phenotypes with underlying conditions of ASD and assist in diagnostic and therapeutic solutions. The growth in qualitative and quantitative data and the growing affordability of personal collecting devices and omics instruments, together with standardization of databases, show promise to provide the much-needed breakthroughs to effectively diagnose and treat ASD. These methods can also help to elucidate the extended endocannabinoid metabolism and related pathways, to drive drug discovery and development, and to permit quantitative diagnosis for ASD.

Figure 1: A simplified presentation of the hierarchical path from genotype to clinical phenotype with respect to the effect of genetics and environment in patients with ASD and the indicated methods to extract data and integrate it into a dataset.