Analytical Theories and Techniques in Healthcare: Genome Sequencing
By Nathan B. Smith
Big data analytics (BDA) uses a wide range of mathematical methods to gain insight from very large and diverse datasets, which may be unstructured, semi-structured, or structured. These algorithms include machine learning (ML) and deep learning (DL) techniques, which fall under the umbrella of artificial intelligence (AI). The purpose of BDA is to enable more rapid and accurate decision-making, modeling, and forecasting of future events. Classification and pattern matching are also essential BDA applications.
Discussion
Genetics is a branch of molecular biology that examines heredity through the production of proteins. Proteins are the building blocks of living creatures and are synthesized according to instructions encoded in DNA (deoxyribonucleic acid). The DNA molecule is a double-helical structure composed of guanine, adenine, cytosine, and thymine nucleotides. The nucleus of every living cell contains the whole genome, which holds the instructions for constructing all the proteins required for the organism's development. The human genome, for instance, consists of about 3 billion nucleotide base pairs arranged in a precise sequence along the double helix. During cell division, the DNA double helix unzips to provide a template for synthesizing messenger ribonucleic acid (mRNA); the construction of an mRNA molecule is known as transcription. mRNA is a single-stranded molecule that in turn serves as the template for protein synthesis, and the precise order of its nucleotides determines which protein is formed. Translation is the process of decoding a single-stranded mRNA molecule into a protein (Poulson, 2018).
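To make the transcription and translation steps concrete, the following Python sketch walks a short DNA string through both. It is illustrative only: the codon table is truncated to a handful of entries, and the input is assumed to be the coding strand, so transcription reduces to a T-to-U substitution.

```python
# Minimal sketch of transcription and translation, assuming a coding-strand
# DNA string and a deliberately tiny codon table (the real table has 64 entries).

CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe",
    "GGC": "Gly", "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def transcribe(dna: str) -> str:
    """Transcription: copy the coding strand into mRNA,
    replacing thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> list[str]:
    """Translation: read the mRNA three nucleotides (one codon) at a
    time and look up the amino acid each codon encodes."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGC")   # -> "AUGUUUGGC"
print(translate(mrna))           # -> ['Met', 'Phe', 'Gly']
```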
Most big data analytics work on the human genome is directly tied to medicine and health. DNA does more than direct the production of functional proteins: mutations (errors or variations) in the DNA sequence can cause diseases such as cancer. As a result, personalized genomics, as facilitated by BDA, opens a groundbreaking new field of tailored medicine. BDA provides techniques for detecting the significant genetic variants that shape how individuals respond to personalized pharmacological therapies (Topol, 2019).
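The core idea behind variant detection can be sketched as comparing a sample sequence against a reference and recording the mismatches. Real variant callers work on aligned reads with quality scores and handle insertions and deletions; the sequences below are invented for illustration.

```python
# Minimal sketch of single-nucleotide variant detection: compare a sample
# sequence against a reference, position by position, and record mismatches.
# Real variant calling is far more involved; these sequences are invented.

def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each
    single-nucleotide difference between two equal-length sequences."""
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]

reference = "GATTACAGATTACA"
sample    = "GATTCCAGATAACA"
print(find_snvs(reference, sample))  # -> [(4, 'A', 'C'), (10, 'T', 'A')]
```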
The Human Genome Project (HGP) set out to determine the sequence of human genetic material. The result is the reference human genome, the blueprint for building a human being. The project began in 1990 and was finished in 2003, utilizing several DNA sequencing techniques, the most notable of which was a DNA polymerase-based technique created by Frederick Sanger (Sanger & Coulson, 1975). Other, more efficient approaches, such as next-generation sequencing (NGS), have since been developed and commercialized by firms like Illumina. NGS has made DNA sequencing a reasonably affordable and rapid technique (Kulski, 2016).
Next-generation sequencing
Next-generation sequencing (NGS) technology has led to an unrivaled explosion in the amount of genomic data. This escalation has in turn raised the challenges of sharing, archiving, integrating, and analyzing these data. The scale and throughput of NGS make the analysis of these vast genomic datasets, gene interactions, annotations, and expression studies difficult, but this limitation can be overcome by tools and algorithms built on a big data framework, which reveal hidden patterns in sequencing, analysis, and annotation. The Apache Hadoop framework provides an on-demand, scalable environment for large-scale data analysis, with components for partitioning large datasets across clusters of commodity hardware in a fault-tolerant manner. Packages such as MapReduce, CloudBurst, Crossbow, Myrna, Eoulsan, DistMap, SEAL, and Contrail perform various NGS tasks, including adapter trimming, quality checking, read mapping, de novo assembly, quantification, expression analysis, variant analysis, and annotation. Tripathi et al. (2016) review these applications of Hadoop technology to NGS, along with their usage and limitations.
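The MapReduce pattern underlying these Hadoop-based tools can be illustrated without a cluster. The sketch below counts k-mer frequencies across a set of reads in explicit map, shuffle, and reduce stages; it is a single-process stand-in for what Hadoop distributes across commodity hardware, and the reads are invented.

```python
# Single-process sketch of the MapReduce pattern that Hadoop-based NGS
# tools distribute across a cluster: count k-mer frequencies in reads.
from collections import defaultdict

def map_phase(read: str, k: int = 3):
    """Map: emit (k-mer, 1) for every k-length window in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle: group all counts emitted for the same k-mer."""
    groups = defaultdict(list)
    for kmer, count in pairs:
        groups[kmer].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each k-mer."""
    return {kmer: sum(counts) for kmer, counts in groups.items()}

reads = ["GATTACA", "TACAGAT"]  # invented short reads
pairs = (pair for read in reads for pair in map_phase(read))
print(reduce_phase(shuffle(pairs)))  # e.g. {'GAT': 2, 'TAC': 2, 'ACA': 2, ...}
```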
Genomics and natural language processing
The Human Genome Project and MEDLINE are two of the most heavily mined databases globally. Genetic sequencing is addressed in the biomedical literature, but it also appears that sequence may reveal a lot about the biomedical literature. Biological natural language processing is a new field of study that investigates the links between genes, sequences, and the scientific literature to provide a foundation for a new generation of data-mining technologies.
A typical DNA sequence "read" is captured as plain-text data representing the order of the four nucleotides. Nowadays, search engines and databases are used to explore and manage enormous text repositories, including collections of DNA "reads." These engines and databases are built on a set of text processing, indexing, and search capabilities known as natural language processing (NLP) technologies (Yandell & Majoros, 2002).
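The text-indexing idea carries over directly: the sketch below builds a small inverted index mapping k-mers to the reads that contain them, the same structure a search engine builds for words in documents. The reads are invented examples.

```python
# Minimal sketch of an inverted index over DNA reads: k-mers play the
# role that words play in a text search engine. Reads are invented.
from collections import defaultdict

def build_index(reads: list[str], k: int = 4) -> dict[str, set[int]]:
    """Map each k-mer to the set of read IDs containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index

reads = ["ACGTACGT", "TTACGTAA", "GGGGCCCC"]
index = build_index(reads)
print(index["ACGT"])   # -> {0, 1}: reads 0 and 1 contain this k-mer
```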
Exploring and maintaining biomedical literature with these tools presents certain complications, because the links between biomedical texts and biological sequences are unique, and understanding the complicated relationships among genes, sequences, and texts is difficult. Advances in genetics offer new ways to study texts and blur the barriers between bioinformatics and NLP. Biological NLP (bio-NLP) combines bioinformatics and NLP for sequence and textual analysis (Yandell & Majoros, 2002).
Some bio-NLP researchers use texts to learn about protein interactions and struggle to adapt standard NLP technology to the task. Others use texts to improve sequence-retrieval algorithms and to annotate sequences. To reach its full potential, bio-NLP must go beyond information management and generate gene-function predictions that can be confirmed at the bench. Using sequence and text together to extract latent information from the biomedical literature shows promise; realizing that promise will require more and better ontologies, software that can draw inferences from combined sequence and textual information, and access to the full content of articles (Yandell & Majoros, 2002).
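One simple starting point for the interaction-mining task described above is sentence-level co-occurrence of gene names. The sketch below is a deliberately naive baseline with an invented gene lexicon and an invented sentence; published systems rely on syntactic parsing and learned models.

```python
# Naive bio-NLP sketch: flag candidate protein interactions when two
# known gene names co-occur in the same sentence. The lexicon and the
# example text are invented for illustration.
import re
from itertools import combinations

GENE_LEXICON = {"BRCA1", "TP53", "EGFR"}  # assumed toy lexicon

def candidate_interactions(text: str):
    """Yield gene-name pairs that co-occur within one sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        mentioned = {g for g in GENE_LEXICON if g in sentence}
        yield from combinations(sorted(mentioned), 2)

abstract = "BRCA1 binds TP53 in vitro. EGFR expression was unchanged."
print(list(candidate_interactions(abstract)))  # -> [('BRCA1', 'TP53')]
```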
Million Veteran Program
The Department of Veterans Affairs' Million Veteran Program (MVP) is a national research program to determine how genes, lifestyle, and military experiences impact health and illness. Since its inception in 2011, approximately 870,000 Veteran partners have joined one of the world's most extensive genetics-and-health projects. Veterans who partner with MVP contribute to the betterment of the lives of their fellow Veterans and, ultimately, of all people. Already, scientific findings from MVP are helping the Veterans Health Administration achieve its goal of transforming the health of future generations (Department of Veterans Affairs, 2022).
As of May 2022, approximately 40 active projects have access to MVP big data for genomic and epidemiological investigations, and more than 125 publications have appeared since 2018 in journals including Nature, Cell, and PLOS. The expectation is that research findings from MVP data will eventually aid the advancement of individualized treatment in VA health care (Department of Veterans Affairs, 2021).
MVP can be described as an observational cohort study and mega-biobank within the Department of Veterans Affairs (VA) health care system. Information is gathered from participants through questionnaires, the VA electronic health record, and a blood sample for genomic (DNA) and other testing. MVP is tied to several existing initiatives, both as peer-reviewed research studies and as efforts to build an infrastructure for future, broad-based research applications. Formal planning for MVP began in 2009, the protocol was approved in 2010, and enrollment began in 2011. As of August 3, 2015, 397,104 Veterans had been enrolled, with recruitment at a steady state of 50 sites nationwide. Among the N = 199,348 participants with genotyping data available at that time, the majority (92.0 percent) were male, and 55.0 percent were between the ages of 50 and 69 (as expected); based on self-reported race, White (77.2 percent) and African American (13.5 percent) populations are well represented (Gaziano et al., 2016).
Conclusion
Big data analytics approaches play a critical role in enabling next-generation sequencing of DNA. At its core, the work concerns information encoded as sequences of nucleotides in genomic DNA. The human genome comprises approximately 3 billion nucleotide base pairs and contains over 20,000 genes. Mapping a single genome generates hundreds of gigabytes of raw data, and cohort-scale studies quickly reach terabytes and beyond. To map the genes, find patterns, and evaluate the interactions of numerous genes, BDA methods such as ML and DL are applied. Many research studies under the Million Veteran Program (MVP) umbrella leverage next-generation sequencing (NGS). Data in the form of sequence "reads" are captured as massive text-based datasets, which are then analyzed using natural language processing (NLP), an essential technique of artificial intelligence, machine learning, and big data analytics.
References
Department of Veterans Affairs. (2021). VA's Million Veteran Program Publications. Washington, DC: Department of Veterans Affairs. https://www.research.va.gov/MVP/publications.pdf
Department of Veterans Affairs. (2022). VA Million Veteran Program. Retrieved from Department of Veterans Affairs website: https://www.mvp.va.gov/pwa/
Gaziano, J. M., Concato, J., Brophy, M., & Fiore, L. (2016). Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of Clinical Epidemiology, 70, 214-223. https://doi.org/10.1016/j.jclinepi.2015.09.016
Kulski, J. K. (2016). Next-generation sequencing: An overview of the history, tools, and "omic" applications. In J. Kulski, Next-generation sequencing: Advances, applications, and challenges (pp. 1-60). London, UK: Open Access Publishing. https://doi.org/10.5772/61964
Poulson, B. (2018). The data science of healthcare, medicine, and public health [Online course]. LinkedIn Learning. https://www.linkedin.com/learning/the-data-science-of-healthcare-medicine-and-public-health-with-barton-poulson/applying-data-science-to-healthcare-medicine-and-public-health
Sanger, F., & Coulson, A. R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology, 94(3), 441-446. https://doi.org/10.1016/0022-2836(75)90213-2
Topol, E. J. (2019). Deep medicine: How artificial intelligence can make healthcare human again. New York, NY: Basic Books.
Tripathi, R., Sharma, P., Chakraborty, P., & Varadwaj, P. K. (2016). Next-generation sequencing revolution through big data analytics. Frontiers in Life Science, 9(2), 119-149. https://doi.org/10.1080/21553769.2016.1178180
Yandell, M. D., & Majoros, W. H. (2002). Genomics and natural language processing. Nature Reviews Genetics, 3, 601-610. https://doi.org/10.1038/nrg861