Big Data Analytics, Frameworks, Applied Statistics, and Tools for Bioinformatics

By Nathan B. Smith

Over the past few years, the amount of publicly accessible genetic data has grown significantly, coinciding with a dramatic decline in the cost of genome sequencing. Recent studies have assembled cohorts of more than 100,000 people, and their data have been analyzed jointly to characterize genetic variation across populations. As a result, each cohort has produced massive volumes of variant data.

Discussion

Genomic medicine, often called "personalized medicine," uses a patient's genetic information to tailor diagnostic and therapeutic decisions. "Big data analytics" refers to analyzing large and varied datasets to uncover previously unseen patterns, correlations, and other insights. Integrating and manipulating large-scale next-generation sequencing (NGS) data together with heterogeneous clinical data from electronic health records (EHRs) on a big data infrastructure is challenging. It also presents a workable opportunity to develop efficient and effective methods for identifying clinically actionable genetic variants for individualized diagnosis and therapy (He, Ge, & He, 2017).
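To make the integration step concrete, here is a minimal sketch of joining variant calls to EHR records by patient identifier and keeping only variants on an "actionable" list. Everything in it is an assumption for illustration: the record types, the field names, the `flag_actionable` helper, and the placeholder identifiers such as `VAR_A` are hypothetical and do not come from He, Ge, and He (2017) or from any real dataset.

```python
from dataclasses import dataclass


# Hypothetical record types; real NGS and EHR data are far richer than this.
@dataclass(frozen=True)
class VariantCall:
    patient_id: str
    variant_id: str


@dataclass(frozen=True)
class EhrRecord:
    patient_id: str
    diagnosis: str


def flag_actionable(variant_calls, ehr_records, actionable_ids):
    """Return (patient_id, variant_id, diagnoses) for actionable variants
    found in patients who also have at least one EHR record."""
    # Index the EHR side by patient so the join is a dictionary lookup.
    diagnoses_by_patient = {}
    for record in ehr_records:
        diagnoses_by_patient.setdefault(record.patient_id, []).append(record.diagnosis)

    flagged = []
    for call in variant_calls:
        if call.variant_id in actionable_ids and call.patient_id in diagnoses_by_patient:
            flagged.append(
                (call.patient_id, call.variant_id, diagnoses_by_patient[call.patient_id])
            )
    return flagged


# Invented placeholder data, for illustration only.
calls = [
    VariantCall("P1", "VAR_A"),
    VariantCall("P2", "VAR_B"),
    VariantCall("P3", "VAR_A"),  # no EHR record, so this call is not flagged
]
ehr = [
    EhrRecord("P1", "dx-1"),
    EhrRecord("P1", "dx-2"),
    EhrRecord("P2", "dx-3"),
]
flagged = flag_actionable(calls, ehr, actionable_ids={"VAR_A"})
print(flagged)
```

At cohort scale the same join would run on a distributed platform rather than in a single process, but the logic of matching on a shared patient key stays the same.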

The Million Veteran Program (MVP) is a national research program run by the Department of Veterans Affairs (VA) to learn how genetics, lifestyle, and military experiences affect health and illness. Since its launch in 2011, roughly 870,000 Veterans have enrolled, making it one of the world's largest genetics and health programs. Veterans who partner with MVP contribute to improving the lives of fellow Veterans and, in the long run, of all people. Scientific discoveries from MVP already support the Veterans Health Administration and advance the program's aim of improving the health of future generations (Department of Veterans Affairs, 2022).

MVP, conducted within the Veterans Health Administration, is a strong case study of big data analytics in genomics and personalized medicine. As of May 2022, roughly 40 active projects have access to the data housed within MVP for genomic and epidemiological research. Since 2018, more than 125 publications have appeared in journals such as Nature, Cell, and PLOS. Discoveries from research using MVP data are expected to contribute eventually to tailored treatment options within the VA health care system (Department of Veterans Affairs, 2021).

MVP combines an observational cohort study with a mega-biobank built within the VA health care system. Participant information is gathered through several methods, including questionnaires, the VA electronic health record, and a blood sample for genomic (DNA) analysis. Ongoing MVP activities include peer-reviewed research projects and work on an infrastructure for future broad-based research. Formal planning for MVP began in 2009, the protocol was approved in 2010, and the program launched in 2011.

As of August 3, 2015, 397,104 veterans had enrolled in the program, with a steady state of about 50 recruiting sites around the country. Among the 199,348 participants with genotyping data available at that time, most were male (92.0%), as anticipated, and 55.0% were between the ages of 50 and 69. By self-reported race, white participants made up the majority (77.2%), and African American participants made up 13.5% (Gaziano et al., 2016).

Processing the ever-increasing volumes of genomic data is a formidable task, especially given the rapid advancement of next-generation sequencing technologies. As a result, there is an immediate demand for powerful and highly scalable computing systems. Apache Spark, one of the state-of-the-art parallel computing platforms, is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing. Its resilient distributed dataset (RDD) abstraction provides fault tolerance and scalability. In terms of performance, Spark may be up to 100 times faster than Hadoop for in-memory workloads and up to 10 times faster for disk-based workloads. It offers high-level application programming interfaces in Java, Scala, Python, and R, and it includes several higher-level components: Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream processing. Spark-based applications are used in next-generation sequencing (NGS) and third-generation sequencing (TGS), as well as in other biological fields, including epigenetics, phylogeny, and drug development (Guo et al., 2018).
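The kind of computation Spark distributes can be illustrated with the map-then-reduce-by-key pattern. The sketch below is plain Python, not Spark: it imitates the pattern on a single machine so it runs without a cluster. The variant records, the two-way partitioning, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Invented variant calls: (chromosome, position, reference allele, alternate allele).
variant_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr1", 900, "T", "C"),
    ("chr2", 310, "A", "T"),
]


def map_phase(records):
    """Emit one (chromosome, 1) pair per variant record."""
    return [(chrom, 1) for chrom, _pos, _ref, _alt in records]


def reduce_by_key(pairs):
    """Sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_variants_per_chromosome(partitions):
    """Process each partition independently, then merge the partial counts.

    On a cluster, each partition would be handled by a different worker;
    here the partitions are simply processed one after another.
    """
    partial_counts = [reduce_by_key(map_phase(part)) for part in partitions]
    merged_pairs = [item for partial in partial_counts for item in partial.items()]
    return reduce_by_key(merged_pairs)


# Split the data into two partitions to mimic distributed storage.
partitions = [variant_calls[:3], variant_calls[3:]]
counts = count_variants_per_chromosome(partitions)
print(counts)
```

In PySpark, the same idea would typically be written with the RDD methods `map` and `reduceByKey`, for example `rdd.map(lambda v: (v[0], 1)).reduceByKey(lambda a, b: a + b)`, with Spark handling partitioning, scheduling, and recovery from worker failures.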

Conclusion

In recent years, more genetic data has become publicly available as genome sequencing costs have fallen, and recent cohort studies have produced massive datasets covering more than 100,000 people. Genomic medicine, or personalized medicine, tailors diagnostic and therapeutic decisions to a patient's genetic information. Big data analytics examines large volumes of data to uncover patterns, correlations, and other insights. The VA's Million Veteran Program (MVP) illustrates how big data analytics supports genomics and tailored care. Processing genomic data is a daunting endeavor, especially with the rapid improvement of next-generation sequencing technology. Apache Spark, a parallel computing framework, addresses this need: it may be up to 100 times faster than Hadoop for in-memory workloads and up to 10 times faster for disk-based workloads.

References

Department of Veterans Affairs. (2021). VA's Million Veteran Program Publications. Washington, DC: Department of Veterans Affairs. https://www.research.va.gov/MVP/publications.pdf

Gaziano, J. M., Concato, J., Brophy, M., & Fiore, L. (2016). Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of Clinical Epidemiology, 70, 214-223. https://doi.org/10.1016/j.jclinepi.2015.09.016

Guo, R., Zhao, Y., Zou, Q., Fang, X., & Peng, S. (2018). Bioinformatics applications on Apache Spark. GigaScience, 7, 1-10. https://doi.org/10.1093/gigascience/giy098

He, K. Y., Ge, D., & He, M. M. (2017). Big data analytics for genomic medicine. International Journal of Molecular Sciences, 18(2), 1-18. https://doi.org/10.3390/ijms18020412


