Big Data Storage and Networked Clusters

By Nathan B. Smith

The objective of this discussion is to develop the building blocks of a system for healthcare big data analytics and compare them with the DNA networked-cluster systems typically used by genomic sequencing companies.

Discussion

In the field of bioinformatics, high-throughput methods such as next-generation sequencing produce an onslaught of sequences that originate from a variety of sources but arrive uncharacterized. Processing this enormous amount of sequencing data with traditional methods is a laborious and challenging undertaking. In addition, processing vast amounts of diverse and complex data demands significant computational resources, and producing findings from such an analysis can take hours or even days.

To expedite data processing, bioinformatics research generally relies on platforms with high computational power and storage capacity. Cloud computing has traditionally saved the day here by delivering service-oriented architectures, which may take the shape of hardware platforms, application platforms, or operating-environment platforms. Cloud services can, to a degree, help bioinformatics researchers overcome the challenges they face in carrying out their work efficiently and economically. However, to perform analysis, scientists must migrate their data and methods into the cloud environment.

Recently, an emerging trend toward on-the-fly solutions for bioinformatics problems, together with rising privacy concerns, has required bringing computational resources closer to users to reduce the distance between computing platforms and users' data. Fog computing, an extension of cloud computing, brings services closer to the edge of the network. This places the benefits and capabilities of the cloud near the location where data is produced, accelerating the development of on-the-fly solutions for bioinformatics applications.

Fog computing is thus well suited to tackling difficult biological problems on the fly. Cloud computing's building blocks are the physical, virtualization, and service layers. Virtualization is key: it hides the platform's physical properties from front-end users and emulates and abstracts computation. Cloud computing also uses clusters and grids for high-performance applications such as simulations. Other valuable aspects include a service-oriented architecture enabled through web services, such as Amazon Web Services or Microsoft Azure.

Genomics

The genetic material in humans and virtually all other species is DNA, or deoxyribonucleic acid. A person's body has almost identical DNA in every cell. Most DNA is contained in the cell nucleus (where it is known as nuclear DNA), although a small quantity is also found in the mitochondria (where it is called mitochondrial DNA or mtDNA). Mitochondria are cellular organelles that transform the energy from food into a form that cells can use.

Adenine (A), guanine (G), cytosine (C), and thymine (T) are the four chemical bases that make up the code storing the information in DNA. More than 99 percent of the 3 billion bases that make up human DNA are the same in every person. Just as the letters of the alphabet occur in a certain order to create words and sentences, the order, or sequence, of these bases dictates the information available for creating and sustaining an organism.

DNA bases link together to create units referred to as base pairs: A with T and C with G. A sugar and a phosphate molecule are also joined to each base; together, a base, a sugar, and a phosphate make up a nucleotide. Two long strands of nucleotides form a spiral structure called a double helix. In this shape, the base pairs serve as the ladder's rungs while the sugar and phosphate molecules serve as its vertical side rails.
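
Because base pairing is deterministic, one strand fully determines the other. As a minimal illustration (independent of any tool discussed here), a few lines of Python can compute the complementary strand of a sequence:

```python
# Watson-Crick pairing: A binds T and C binds G, so one strand's partner
# can be computed base by base; reversing gives the partner read 5' to 3'.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the sequence of the complementary strand, read 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))

print(reverse_complement("ATCG"))  # -> "CGAT"
```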

By exploiting patients' genetic data, genomic medicine aims to provide personalized approaches to diagnostic and therapeutic decisions. Big data analytics examines large-scale data sets to find hidden patterns, undiscovered relationships, and other insights. Even though integrating, manipulating, and using varied genomic data and extensive electronic health records (EHRs) on a big data infrastructure presents several challenges, it also presents a practical opportunity to create a quick, easy, and accurate method for finding clinically useful genetic variants for individualized diagnosis and treatment. He, Ge, & He (2017) discuss the difficulties involved in using clinical data collected from EHRs together with large-scale next-generation sequencing (NGS) data for genomic medicine, and there are many potential solutions to the problems of handling, managing, and interpreting genetic and clinical data to enable it.

Next-generation sequencing (NGS) requires separating DNA into a large number of segments; each individual segment that is sequenced is called a "read." The distribution of reads across the genome may be uneven due to biases introduced during sample processing, library preparation, sequencing-platform chemistry, and the bioinformatics methods used for genomic alignment and assembly, and read lengths may vary as well. As a result, certain genomic regions are covered by more reads and others by fewer. Read depth refers to the average number of times each base is read: a read depth of 10 indicates that each base is covered by an average of 10 reads. For RNA-seq, read depth is more often expressed as a total number of reads, in millions.

Read alignment is the process of lining up sequence reads against a reference sequence so that data from a sequenced sample can be compared with the reference genome. For distributed big data infrastructures, a variety of alignment tools have been created, including CloudBurst, Crossbow, and SEAL. Alignment makes a variety of quality control (QC) measurements possible, including the percentage of all reads aligned to the reference, the percentage of unique reads aligned to the reference, and the number of reads aligned at a particular locus. These QC measurements affect the accuracy of variant calling.
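
To make the read-depth arithmetic concrete, here is a toy sketch with made-up reads (not drawn from any real pipeline, which would work from BAM or pileup files) that counts how many simulated reads cover each base of a short reference and averages the counts:

```python
# Toy read-depth calculation: each read is a (start, length) pair giving
# its alignment on a short reference; count the reads covering each base.
from collections import Counter

REFERENCE_LENGTH = 20
reads = [(0, 8), (2, 8), (5, 10), (11, 9)]  # hypothetical alignments

coverage = Counter()
for start, length in reads:
    for pos in range(start, min(start + length, REFERENCE_LENGTH)):
        coverage[pos] += 1

mean_depth = sum(coverage.values()) / REFERENCE_LENGTH
print(f"mean read depth: {mean_depth:.2f}")              # 1.75 here
print(f"uncovered bases: {REFERENCE_LENGTH - len(coverage)}")
```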

Big data analytics architectures for bioinformatics

Apache Hadoop is an open-source infrastructure that uses the MapReduce programming model to enable dependable, scalable, distributed processing of massive data sets across clusters of computers. Apache HBase is an open-source, distributed, versioned, non-relational database modeled on Google's Bigtable. Just as Bigtable builds on the Google File System's distributed data storage, HBase offers Bigtable-like functionality on top of Hadoop and the Hadoop distributed file system (HDFS). HBase is designed to offer real-time random read/write access to structured big data and can support very large tables with millions of columns and billions of rows. The 100K Genomes Project, which will soon be a reality, can use this capability to manipulate hundreds of thousands of whole-genome sequencing (WGS) samples. Quality control, alignment, single nucleotide polymorphism (SNP) calling, variant annotation, and general workflow management are just a few of the Hadoop-based tools that have been created and used to analyze sequencing data for NGS investigations. Many additional NGS research projects are built on the Hadoop platform, including the Hadoop-BAM library for manipulating BAM files and the SeqPig library, which uses Apache Pig's high-level scripting functionality to work with aligned and unaligned sequence data. Still urgently needed, however, is a trustworthy computational toolkit, built on a scalable big data infrastructure, that can work effectively with genome-wide collections of variants, their functional annotations, and every-site coverage (read depth), and that can analyze NGS data to find disease-causing genes.
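
Hadoop itself is written in Java, but its Streaming interface runs any executable that reads from standard input and writes tab-separated key/value pairs to standard output, which makes the MapReduce model easy to sketch. The hypothetical mapper/reducer pair below counts variants per chromosome in VCF-like input; it illustrates the programming model only and is not taken from any of the tools named above.

```python
# mapper.py - emit (chromosome, 1) for every data line of a VCF-like input.
import sys

for line in sys.stdin:
    if line.startswith("#"):        # skip VCF header lines
        continue
    chrom = line.split("\t", 1)[0]  # first column is the chromosome
    print(f"{chrom}\t1")
```

```python
# reducer.py - Hadoop delivers mapper output sorted by key, so lines for
# the same chromosome arrive consecutively; sum the counts per chromosome.
import sys

current, count = None, 0
for line in sys.stdin:
    chrom, value = line.rstrip("\n").split("\t")
    if chrom != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = chrom, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Under Hadoop Streaming these would be submitted along the lines of `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input variants.vcf -output counts`; the exact jar path and flags vary by installation.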

SeqHBase

He et al. (2015) developed SeqHBase, a big data toolkit built on Apache Hadoop and HBase, to analyze massive family-based sequencing data and find mutations that may cause diseases. Figure 1 (see attachment) describes the SeqHBase architecture in its most fundamental form. For each pedigree analyzed, SeqHBase manipulates coverage information (covering every site in the genome), genetic variants, and the functional annotations associated with them. Using the MapReduce programming paradigm on Apache Hadoop and HBase, SeqHBase handles, stores, and retrieves this sequencing data efficiently.

According to the architecture presented in Figure 1, users can load three distinct varieties of sequencing data into HBase: BAM or pileup files generated by SAMtools for coverage information, VCF or vcf.gz files for genetic variants, and comma-separated values (CSV) files for annotated variants. SeqHBase then breaks the incoming data set into separate chunks using the MapReduce processing paradigm, and the map tasks handle these chunks entirely in parallel. The reduce tasks, also running in parallel, extract coverage information, genetic variants, and variant annotations, which are combined with a pedigree file. SeqHBase detects de novo, inherited homozygous, or compound heterozygous mutations using an analytical engine designed on top of the big data infrastructure. SeqHBase was written in Java and is freely available for use by the academic community (He, Person, & Hebbring, 2015).
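
SeqHBase's actual loaders and table schema are not reproduced here, but a VCF load step of this kind boils down to turning each data line into a keyed record. The following hypothetical Python sketch assumes a sample-plus-position row key, a common pattern for HBase's sorted, wide-column layout, not SeqHBase's published schema:

```python
# Hypothetical VCF load step: parse a data line into a (row key, columns)
# pair shaped for a wide-column store such as HBase. The row-key layout
# (sample:chromosome:zero-padded position) is an illustrative assumption.
def vcf_line_to_row(sample_id: str, line: str):
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    row_key = f"{sample_id}:{chrom}:{int(pos):010d}"  # pad so scans sort by position
    columns = {"v:ref": ref, "v:alt": alt}            # "v" = a variant column family
    return row_key, columns

print(vcf_line_to_row("NA12878", "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=32"))
# -> ('NA12878:chr1:0000012345', {'v:ref': 'A', 'v:alt': 'G'})
```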

Integration of Hadoop and R for Bioinformatics

It is common knowledge that data is an organization's most important asset, and it would not be an exaggeration to call it the most valuable one. But to deal with such large amounts of structured and unstructured data, we need an effective analysis tool, and we get one by merging the features of the R language with the Hadoop big data framework; the merger increases scalability. Put another way, we cannot get superior insights and results from the data until the two are combined. The paragraphs below discuss approaches that facilitate this integration.

R is a programming language that is freely available and widely used for statistical and graphical analysis. It supports a wide range of statistical and mathematical library functions (linear and nonlinear modeling, classical statistical tests, time-series analysis, data classification, data clustering, and so forth) as well as graphical methods for processing data effectively.

One of R's most notable characteristics is its ability to construct high-quality charts quickly and easily, complete with appropriate mathematical symbols and formulas. For a researcher who needs powerful data-analytics and visualization capabilities, using R with Hadoop is a viable option. R is an extensible, object-oriented language with powerful graphical features. Meanwhile, the expanding scope and complexity of computational workloads for genome-wide association studies (GWAS) are outpacing the capabilities of single-threaded software built for desktop computers. The BlueSNP R package implements GWAS statistical tests in R and executes the calculations across computer clusters running Apache Hadoop, the de facto standard framework for distributed data processing using the MapReduce formalism. For large genotype-phenotype datasets, BlueSNP makes computationally intensive analyses possible, such as estimating empirical p-values through data permutation and searching for expression quantitative trait loci across thousands of genes (Huang, Tata, & Prill, 2013).
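
BlueSNP itself is an R package, but the permutation idea behind empirical p-values is easy to sketch (shown here in Python for consistency with the other sketches, using made-up genotypes and phenotypes and a deliberately simple association statistic): shuffle the phenotypes many times and record how often a shuffled statistic is at least as extreme as the observed one.

```python
# Toy empirical p-value for one SNP by phenotype permutation.
import random

genotypes = [0, 1, 2, 0, 1, 2, 2, 1, 0, 2, 1, 0]  # allele counts (made up)
phenotypes = [1.2, 2.1, 3.0, 0.9, 2.4, 3.3, 2.9, 1.8, 1.1, 3.1, 2.2, 1.0]

def assoc_stat(g, p):
    """Absolute genotype-phenotype covariance (a toy association statistic)."""
    mg, mp = sum(g) / len(g), sum(p) / len(p)
    return abs(sum((gi - mg) * (pi - mp) for gi, pi in zip(g, p)) / len(g))

observed = assoc_stat(genotypes, phenotypes)
random.seed(0)
n_perm, hits = 10_000, 0
shuffled = phenotypes[:]
for _ in range(n_perm):
    random.shuffle(shuffled)
    if assoc_stat(genotypes, shuffled) >= observed:
        hits += 1

# The +1 terms avoid reporting an empirical p-value of exactly zero.
print(f"empirical p-value ~ {(hits + 1) / (n_perm + 1):.4f}")
```

BlueSNP's contribution is distributing exactly this kind of embarrassingly parallel permutation work across a Hadoop cluster rather than running it on a single machine.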

Conclusion

The purpose of this discussion was to construct the fundamental components of a system for healthcare big data analytics and then compare those components with a system of DNA networked clusters, the kind often used by organizations that specialize in genomic sequencing. In the area of bioinformatics, high-throughput technologies such as next-generation sequencing produce an onslaught of sequences that come from a number of sources but arrive uncharacterized, and processing such a massive quantity of sequencing data with conventional approaches is a long and difficult endeavor. Apache Hadoop is an open-source infrastructure that uses the MapReduce programming model to allow consistent, scalable, distributed processing of enormous data sets across clusters of computers. Apache HBase is an open-source, distributed, versioned, non-relational database based on Google's Bigtable; both projects are developed by the Apache Software Foundation. Finally, BlueSNP is a robust platform for big data analytics that blends Hadoop and R: it performs statistical tests for genome-wide association studies in R and runs the computations across computer clusters built with Apache Hadoop.




