---
output:
  pdf_document:
    citation_package: natbib
    keep_tex: true
    fig_caption: true
    latex_engine: pdflatex
    template: mod-svm-latex-ms.tex
header-includes:
  - \linespread{2}
  - \usepackage[left]{lineno}
  - \linenumbers
title: "Do androids dream of electric sorghum?"
subtitle: "Predicting Phenotypes from Multi-Scale Genomic and Environmental Data using Neural Networks and Knowledge Graphs"
author:
- name: Ryan P Bartelme^[Corresponding Author, rbartelme@arizona.edu]
  affiliation: University of Arizona
  #email: "rbartelme@arizona.edu"
  # https://github.com/rbartleme
  # orcid: 0000-0002-6178-2603
- name: Michael Behrisch
  affiliation: Utrecht University
  # email: m.behrisch@uu.nl
- name: Emily J Cain
  affiliation: University of Arizona
  # https://github.com/MagicMilly
  # email: ejcain@arizona.edu
  # orcid: 0000-0002-4671-1524
- name: Remco Chang
  affiliation: Tufts University
  # http://www.cs.tufts.edu/~remco
  # email: remco@cs.tufts.edu
  # orcid: 0000-0002-6484-6430
- name: Ishita Debnath
  affiliation: Michigan State University
  # email: debnathi@msu.edu
  # orcid: 000-0002-8626-4167
- name: P. Bryan Heidorn
  affiliation: University of Arizona
  # https://github.com/BryanHeidorn
  # http://si.arizona.edu/users/bryan-heidorn
  # email: heidorn@arizona.edu
  # orcid: 0000-0002-4601-8180
- name: Pankaj Jaiswal
  affiliation: Oregon State University
  # email: jaiswalp@science.oregonstate.edu
  # orcid: 0000-0002-1005-8383
- name: David S LeBauer
  affiliation: University of Arizona
  # https://github.com/dlebauer
  # email: dlebauer@arizona.edu
  # orcid: 0000-0001-7228-053X
- name: Ab Mosca
  affiliation: Tufts University
  # email: amosca01@cs.tufts.edu
  # orcid: 0000-0002-9008-5516
- name: Monica Munoz-Torres
  affiliation: Oregon State University
  # https://github.com/monicacecilia
  # email: munoztmo@oregonstate.edu
  # orcid: 0000-0001-8430-6039
- name: Arun Ross
  affiliation: Michigan State University
  # https://www.cse.msu.edu/~rossarun/
  # rossarun@msu.edu
  # orcid: 0000-0001-8850-3013
- name: Kent Shefchek
  affiliation: Oregon State University
  # email: shefchek@oregonstate.edu
  # orcid: 0000-0001-6439-2224
- name: Tyson L Swetnam
  affiliation: University of Arizona
  # https://github.com/tyson-swetnam
  # email: tswetnam@arizona.edu
  # orcid: 0000-0002-6639-7181
- name: Anne E Thessen
  affiliation: Oregon State University
  # https://github.com/diatomsRcool
  # email: annethessen@gmail.com
  # orcid: 0000-0002-2908-3327
abstract: "The interplay between an organism's genes, its environment, and the expressed phenotype is dynamic. These interactions within ecosystems are shaped by non-linear multi-scale effects that are difficult to disentangle into discrete components. In the face of anthropogenic climate change, it is critical to understand environmental and genotypic influences on plant phenotypes and phenophase transitions. However, the relevant datasets are difficult to integrate and make interoperable. Advances in the fields of ontologies, unsupervised learning, and genomics may help overcome these disparate data schemas. Here we present a framework to better link phenotypes, environments, and genotypes of plant species across ecosystem scales. This approach, which uses phenotypic data, knowledge graphs, and deep learning, serves as the groundwork for a new scientific sub-discipline: *“Computational Ecogenomics”*."
keywords: "machine learning, genomics, phenotype"
date: "`r format(Sys.time(), '%B %d, %Y')`"
fontsize: 11pt
geometry: margin=1in
fontfamily: mathpazo
spacing: double
bibliography: references.bib
biblio-style: plantphenomics
---

```{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#library("magick")
library("here")
library("tinytex")
library("yaml")
```

# Introduction

Unprecedented anthropogenic climate change requires that we be able to adapt crops and modify ecosystems. Understanding the genomic responses of plants and animals to environmental variation allows us to predict those responses [@ungerer2008ecological; @des2013genotype]. Environmental factors that influence organismal phenotype, fecundity, morbidity, and mortality in turn affect agricultural and natural ecosystems (Figure 1). However, multi-scale effects over time and space, combined with the non-linearity of natural systems [@lorenz1963deterministic; @ruel1999jensen; @west2009general], obscure the signal of biological processes, interactions with the environment, and the resulting (observable) phenotypes. Maintaining the innumerable benefits and services that agronomic and natural systems provide is therefore critical to our survival and prosperity.

![System-level interactions, divided into environmental variation (brown cartoons, right of line) interacting across all levels of organismal structure (blue cartoons, left of line).](Figure1.png)

Existing models for predicting phenotypes from genetic and environmental data focus on single species or single ecosystems.
However, both societal need and technical capabilities are moving toward larger-scale questions that require the integration of multi-modal data. In addition to relatively small and heterogeneous data sets, researchers are relying on national and global data collection efforts [@balch2020neon] such as the National Ecological Observatory Network (NEON) [@keller2008continental] and airborne and space-based Earth Observation Systems (EOS). These efforts produce large quantities of homogeneous "born-digital" data, but a significant interdisciplinary data-integration task remains. Ontologies and knowledge graphs, which use semantic similarity to cope with problems of granularity and terminology [@mungallintegrating2010; @stuckyplant2018], facilitate data integration at scales beyond a single taxon or single ecosystem. In addition, as the data sets and questions become more complicated, non-linear predictive models are increasingly needed to address them.

# Challenges of Dataset Interoperability

Incorporating genomic data into phenomics is challenging. Many recent studies have used only environmental features and machine learning to predict phenotypes of lilacs, honeysuckle, rice, and wheat [@ALDERMAN20171; @nissanka2015calibration; @mehdipoor2019geocomputational]. There are a number of methods linking genomic data to environments or traits. For example, genome-wide association studies (GWAS) enable researchers to examine the influence of single nucleotide polymorphisms (SNPs) on phenotypes in both natural and controlled settings [@beyer2019loci; @schlappi2017assessment; @spindel2016genome]. GWAS often provide generalized and mixed linear model associations between SNPs and environmental variables, roughly analogous to Genes + Environment = Phenotype (G+E=P).

There are limitations in the assumptions made by existing methodologies, such as GWAS, that directly attribute plant phenotypes to environmental variables.
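
To make the additive analogy concrete, the relationship can be sketched as below. This is only an illustrative form: the symbols $P_{ij}$, $\mu$, $G_i$, $E_j$, $(GE)_{ij}$, and $\varepsilon_{ij}$ are notation assumed for this sketch and are not taken from the cited studies.

$$
P_{ij} = \mu + G_i + E_j + \varepsilon_{ij}
$$

Here $P_{ij}$ is the phenotype of genotype $i$ in environment $j$, $\mu$ is the overall mean, $G_i$ and $E_j$ are additive genotype and environment effects, and $\varepsilon_{ij}$ is residual error. A common extension adds a genotype-by-environment interaction term:

$$
P_{ij} = \mu + G_i + E_j + (GE)_{ij} + \varepsilon_{ij}
$$

Even in the extended form, every effect has to be specified in advance as a model term, and anything left unspecified is absorbed into the residual.
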
These methods do not explicitly incorporate biological and molecular interactions, such as post-translational modification of macromolecules [@running2014role], plant-microbe interactions [@oyserman2019extracting], or endogenous siRNA [@katiyar2006pathogen]. However, a machine learning (ML) approach allows these biological phenomena to be accounted for as latent variables while probing the interactions of genomes, environments, and phenotypes in a multidimensional manner.

Conventional observations and statistical models are giving way to remotely sensed observation and trait collection, which rely on ML and computer vision for measurements. For example, Bayesian Belief Networks [@cooper1990computational] may be implemented to associate environmental variables with traits, and Generative Adversarial Networks [@radford2015unsupervised] to classify plant phenotype images. Rather than simply generating large quantities of machine-readable data [@hampton2013big] and implementing ML methods ad hoc [@pichler2020machine], ecologists are now grappling with how to interpret the massive quantities of unstructured data that are available at scale. Unfortunately, the ML that provides a scalable, non-linear method for using these data relies on complex “black box” methodologies to assess explanatory variables, which does not aid interpretation. Here we introduce the GenoPhenoEnvo project, an effort to predict phenotype from genetic and environmental data while developing novel representations of the ML “black box” internals.

# Future Directions

Our project has the long-term goal of developing predictive analytics based on an organism’s genetic code and its associated phenotypic response to environmental change. To design an initial analytical framework and workflow, we will first use phenomic, genomic, and environmental data about sorghum (_Sorghum bicolor_).
These data are available through the TERRA-REF (Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform) project [@lebauer2020data; @burnette2018terra]. After training the ML model on the highly controlled and thorough TERRA-REF data set, we aim to test and further develop the model with data from less controlled, lower-resolution environments, drawing on sources such as NEON and the National Phenology Network. The first challenge is to prepare these data for use as input into ML models.

In addition to empirical data, ontologically supported knowledge graphs can be used to inform the ML [@mungall2017monarch]. Knowledge graphs (KGs) are directed graphs that represent knowledge in a computational format and are an integral part of Google’s Answer Box and IBM’s Watson. KGs can help constrain and prioritize results, provide quality control, fill in data gaps with inferencing, and integrate heterogeneous data. In this project, KGs provide an easier way to integrate data from established plant phenome databases such as Planteome [@cooper2018planteome], Gramene [@jaiswalgramene2011], and TAIR [@pooletair2007]. The true power of ontologies and KGs lies in their formal logical structure, which enables inferential and similarity analyses [@mungall2017monarch; @washington2009linking]. In particular, it is this latter feature that will enable data set interoperability. As we begin to make predictions in more complicated systems with multiple species and heterogeneous data, the knowledge graphs will be critical for managing phenotype data.

The GenoPhenoEnvo (GPE) project aims to predict phenotype from genomic and environmental data using a multimodal approach to training ML models. We are actively developing a visualization tool to increase understanding of why the model gave a particular result.
In this way, the GPE project will work toward phenotype predictions and an increased understanding of the biological and molecular processes that translate genotype to phenotype. In addition to increased awareness of molecular effects, the ML models could enable specific ecological hypothesis testing or the prediction of long-term speciation events driven by environmental factors. Ultimately, we believe this combination of methods will generate a new scientific sub-discipline, one we have called “_Computational Ecogenomics_.”

# Author contributions

Authors after the first are listed alphabetically, with the senior author last. Contributions included conceptualization (c), editing (e), review (r), and text writing (w). RPB: cew, MB: c, EC: cr, RC: c, ID: c, PBH: c, PJ: c, DSL: cerw, AM: c, MMT: cerw, AR: c, KS: c, TLS: cerw, AET: cerw.

# References