Big Data Analytics, Frameworks, Applied Statistics, and Tools: Integration of Hadoop and R for Bioinformatics
By Nathan B. Smith
The purpose of this paper is to review published research on big data analytics in various health care fields and to demonstrate the theories, techniques, and tools involved. The academic and professional communities in this realm have documented these analytic theories and techniques and how they are applied in health care and biomedical research.
"Big data" refers to very large volumes of information that, when properly analyzed, may do amazing things. Because it conceals a significant amount of untapped potential, in the last two decades it has developed into a subject that has garnered a lot of attention. Big data is being generated, stored, and analyzed across a variety of businesses, both in the public and commercial sectors, with the goal of improving the services being offered. Big data may come from a variety of sources in the healthcare business. Some of these sources include hospital records, the medical records of patients, the results of medical exams, and devices that are connected to the internet of things. A sizeable amount of the big data that is pertinent to the field of public healthcare is produced by biomedical research as well. To generate information with any kind of significance from this data, effective administration and analysis are required. If this is not avoided, finding a solution through analyzing large amounts of data soon becomes analogous to looking for a needle in a haystack (Dash et al., 2019).
Each stage of working with large amounts of data presents its own obstacles, which can only be overcome with high-end computer systems designed specifically for big data analysis. For healthcare providers to deliver meaningful solutions for the improvement of public health, they must therefore be fully outfitted with the infrastructure required to systematically gather and analyze large amounts of data. The effective management, analysis, and interpretation of large amounts of data can completely alter the game by introducing novel approaches to contemporary medical practice. For this reason, many organizations, including those in the healthcare industry, are making significant strides to transform this potential into superior services and monetary benefits. Healthcare organizations that successfully integrate biological and healthcare data may be able to transform medical treatments and personalized medicine (Dash et al., 2019).
Discussion
The analysis of vast amounts of data is one of the most significant obstacles that healthcare organizations must surmount. The proliferation of diverse healthcare applications has increased the number of healthcare devices that create a wide range of data, and that data must be processed and accurately analyzed before improved decisions can be made. Cloud computing is a promising technology that can provide on-demand services for storing, processing, and analyzing data. Conventional data processing technologies are unable to handle the volume of data now being collected, so a more advanced distributed system operating in a cloud environment is needed to improve performance and address scalability concerns. Hadoop is a system that can handle enormous amounts of data in a distributed setting, and it may be installed in cloud environments to process massive amounts of healthcare data. Applications for medical care are increasingly being delivered over the internet and cloud computing rather than run as conventional software, and healthcare providers need access to information in real time to offer quality medical treatment (Rallapalli, Gondkar, & Ketavarapu, 2016).
Installing Hadoop in a Docker container
There are numerous commercial Hadoop distributions and web services available from Amazon Web Services, Cloudera, Hortonworks, Oracle, Azure, and so forth. However, to gain an appreciation for standing up a Hadoop cluster from scratch, a plain Apache Hadoop instance will be created here.
A container is a standardized unit of software that packages an application's source code together with all its dependencies so that the program runs quickly and reliably regardless of the computing environment. A Docker container image is a lightweight, standalone, executable software package that contains everything required to run an application: code, runtime, system tools, system libraries, and settings.
Container images become containers when they are executed; in Docker's case, images become containers when they run on Docker Engine. Containerized software always operates the same way regardless of the underlying infrastructure, and it is available for programs running on both Linux and Windows. Containers isolate software from its environment and guarantee that the program continues to function normally despite environmental changes, such as those between development and staging. Docker images are assembled according to an industry standard, enabling their portability anywhere. Additionally, containers share the kernel of the host machine's operating system and therefore do not need a separate operating system for each application, which increases server efficiency and decreases server and licensing costs. Finally, applications are safer when contained in containers, and Docker has the strongest default isolation capabilities of any similar computing framework (Docker, 2022).
For the purposes of creating a functioning Hadoop instance, a custom Ubuntu (Xenial) container is created within another Ubuntu instance hosted on a VMware hypervisor running on a Windows 11 machine. The following steps were followed to create this container (a consolidated command sketch follows the list):
1. Install VMware Player (free) on a Microsoft Windows machine.
2. Download and install the free Ubuntu Xenial image on VMware Player.
3. Update the Xenial image from the terminal using the apt-get update and apt-get upgrade commands.
4. Install the Docker engine on the Ubuntu instance
a. Uninstall any existing Docker software
b. Set up repository
c. Add Docker's official GPG key
d. Install Docker engine
e. Verify Docker Engine is functional using the “hello-world” container
f. Verify Docker Engine version
g. Create Docker user group and add user
5. Run a Xenial Ubuntu container and initiate an interactive terminal session
a. Docker pulls the Xenial image (since the image is not yet locally available)
b. From the command-line interface (CLI), view a list of the default directories using ls command.
6. Install prerequisites for Hadoop
a. Install Open Java Runtime Environment (JRE)
b. Install Open Java Development Kit (JDK)
c. Set $JAVA_HOME system variable
d. Install the Nano text editor (needed to edit various configuration files)
e. Disable SELinux (known to cause issues with Hadoop)
f. Restart container and verify Java is properly installed and configured.
g. Install wget using apt-get install wget command.
7. Download Apache Hadoop using wget command
8. Unpack Hadoop .gz file
9. Configure Hadoop to initially operate in LocalJobRunner mode
a. Set $HADOOP_HOME system variable
b. Configure Hadoop
c. Create users and groups for HDFS and YARN
d. Change group and permissions for the Hadoop release files
e. Test Hadoop installation using the built-in Pi Estimator example included with the Hadoop release
f. Success! Hadoop is functional in LocalJobRunner mode. Pi is calculated at 3.142500000000
10. Configure Hadoop to operate in pseudo-distributed mode
a. Edit the core-site.xml file
b. Edit hdfs-site.xml file
c. Edit yarn-site.xml file
d. Edit mapred-site.xml file
e. Format HDFS on NameNode
f. Start Namenode and DataNode daemons
g. Start ResourceManager and NodeManager (YARN) daemons
h. Verify daemons are running by viewing Java processes using jps command
i. Create user directories
j. Test Hadoop operation in pseudo-distributed mode using the Pi Estimator MapReduce job (as before)
k. Success! Hadoop is running in pseudo-distributed mode. Pi is estimated at 3.142500000000
(Aven, 2017)
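The following consolidated command sketch illustrates the key steps in the list above. It is a minimal sketch rather than a verbatim transcript: the Hadoop version (2.7.1, chosen to match the streaming jar referenced later in this paper), the download mirror, and the installation paths are assumptions that may need adjustment for other environments.

# On the Ubuntu VM: start an interactive Xenial container
# (Docker pulls the image automatically if it is not cached locally)
$ docker run -it --name hadoop-xenial ubuntu:xenial /bin/bash

# Inside the container: install Java, Nano, and wget
$ apt-get update
$ apt-get install -y openjdk-8-jdk nano wget
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Download and unpack the Hadoop release
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ tar -xzf hadoop-2.7.1.tar.gz -C /usr/local
$ mv /usr/local/hadoop-2.7.1 /usr/local/hadoop
$ export HADOOP_HOME=/usr/local/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# (JAVA_HOME must also be set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh)

# Smoke-test in LocalJobRunner mode with the bundled Pi Estimator
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 1000

# After editing core-site.xml, hdfs-site.xml, yarn-site.xml, and
# mapred-site.xml for pseudo-distributed mode, format HDFS and
# start the HDFS and YARN daemons
$ hdfs namenode -format
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$ $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
$ $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
$ jps    # should list NameNode, DataNode, ResourceManager, NodeManager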
Big data analytics using R
The large amount of information now accessible at the molecular level offers great prospects for describing the genetic foundations of complex disorders while also uncovering new biological pathways that contribute to the course of diseases. A wide range of biomedical and healthcare analytic techniques and software tools for data analysis derive from population-based genetic investigations, and these techniques and tools can be implemented in R for the study of genetic variation within populations. Researchers from a variety of fields, such as medicine, public health, epidemiology, statistics, and computer science, commonly conduct statistical genetics with R because it provides a clear and convincing implementation of several fundamental statistical approaches (Foulkes, 2009).
While Hadoop (and later Spark) provides a distributed computing framework consisting of a coordinated system of distributed servers, a statistical programming language such as R or Python (configured with statistical and visualization packages such as pandas, Seaborn, scikit-learn, and so forth) is essential for data analysis. For the purposes of this paper, R is installed and configured to work with the Hadoop pseudo-distributed computing environment operating in an Ubuntu Docker container.
Installing R
The following procedure was used to install R in the Xenial Ubuntu container previously described (a command sketch follows the list).
1. Install all dependencies needed to add a new repository over HTTPS
2. Add the CRAN repository to the container sources list
3. Install R using apt install command
4. Verify R is installed by using the R --version command
5. Success! R version 3.2.3 “Wooden Christmas Tree” is installed
6. Start R from the command-line interface using R command
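A minimal sketch of these commands follows; the CRAN repository line shown is the standard one for Ubuntu Xenial, and the signing-key step should be verified against the current CRAN installation instructions.

# Dependencies for adding a repository over HTTPS
$ apt-get install -y software-properties-common apt-transport-https

# Add the CRAN signing key and the CRAN repository for Xenial
$ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
$ add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu xenial/"

# Install and verify R, then start the REPL
$ apt-get update
$ apt-get install -y r-base
$ R --version
$ R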
Although R is fully functional from the command-line interface (REPL), a far more user-friendly environment is available in the server version of RStudio. Again, the Xenial Ubuntu container created above is used (a command sketch follows the list).
1. Install prerequisites
a. Install gdebi-core package using apt-get command
2. Download the RStudio Server .deb package using the wget command
3. Install the RStudio Server .deb package
4. The RStudio server is now ready to use. It is accessed from outside the container using a web browser.
5. Once authenticated, the RStudio web GUI is available for use.
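A sketch of the RStudio Server installation follows. The package version in the file name is a placeholder; the current download URL is listed on the RStudio site (RStudio, 2022).

# gdebi resolves the .deb package's dependencies automatically
$ apt-get install -y gdebi-core

# Download and install the server package (file name is a placeholder)
$ wget https://download2.rstudio.org/rstudio-server-<version>-amd64.deb
$ gdebi rstudio-server-<version>-amd64.deb

# RStudio Server listens on port 8787 by default; browse to
# http://<host>:8787 from outside the container and log in
# with a local Linux account
$ rstudio-server verify-installation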
Integrating R with Hadoop
RHadoop provides integrated interaction between R and Hadoop. It is a collection of R packages that enable R to use Hadoop's data management and processing facilities.
1. Install the rJava package (available from the Ubuntu repositories as r-cran-rjava).
2. Install the following RHadoop packages, which enable R to use the Hadoop framework and are distributed as source archives by Revolution Analytics rather than through apt-get (an installation sketch follows the session transcript below):
a. plyrmr
b. rmr2
c. rhdfs
d. rhbase
3. Run the following commands to test the R/Hadoop integration:
> # Point R at the Hadoop installation and its streaming jar
> Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
> Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
> library(rJava)
> library(rhdfs)
> hdfs.init()                     # connect to HDFS
> library(rmr2)
> sample <- 1:10
> small.ints <- to.dfs(sample)    # write the vector to HDFS
> # Square each value with a simple MapReduce job
> out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
> from.dfs(out)
> df <- as.data.frame(from.dfs(out))
> print(df)
(University of North Carolina Greensboro Computer Science, 2022)
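The following sketch shows how the packages in steps 1 and 2 above can be installed. The rJava binary is available from the Ubuntu repositories; the RHadoop packages are distributed as source tarballs from the Revolution Analytics GitHub releases, so the file names and version numbers shown are illustrative assumptions.

# rJava from the Ubuntu repositories
$ apt-get install -y r-cran-rjava

# CRAN dependencies required by rmr2
$ Rscript -e 'install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))'

# Install the RHadoop source packages downloaded from the
# Revolution Analytics GitHub releases (versions illustrative)
$ R CMD INSTALL rmr2_3.3.1.tar.gz
$ R CMD INSTALL rhdfs_1.0.8.tar.gz
$ R CMD INSTALL plyrmr_0.6.0.tar.gz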
Conclusion
The term "big data" refers to very enormous amounts of information that, when correctly evaluated, has the potential to perform incredible things. Over the course of the last two decades, it has evolved into a topic that has received a lot of interest as a result of the fact that it hides a sizeable amount of latent potential that has not yet been used. A wide range of organizations, in both the public and private sectors, are collecting, storing, and analyzing large amounts of data with the intention of enhancing the quality of the services that are currently being provided.
One of the most critical challenges that healthcare providers and other businesses in this industry must solve is the analysis of enormous volumes of data. The proliferation of different healthcare applications has led to a rise in the number of healthcare devices that generate a broad variety of data, and that data must be processed and accurately analyzed before better judgments can be made. Cloud computing, which may be thought of as internet-based computing, is an intriguing technology that can deliver on-demand services for storing, processing, and analyzing data. The amount of data now being gathered is too much for the data processing systems that have been in use until now.
Commercial entities including Amazon Web Services, Cloudera, Hortonworks, Oracle, Azure, and many more make a wide variety of Hadoop distributions and web services available to their customers. However, to gain a better understanding of how to set up a Hadoop cluster from the ground up, an Apache Hadoop instance was established here, and the R programming language was then added to the Docker container that had been created. To make R more user-friendly, RStudio Server was installed so that users may access it from outside the Docker container with a standard web browser. Finally, a number of RHadoop packages were added to make it possible to execute R jobs from inside the Hadoop framework.
The RHadoop framework described in this paper provides a useful environment for performing biomedical and healthcare research requiring an enhanced, distributed big data analytics framework.
References
Aven, J. (2017). Sams teach yourself Hadoop in 24 hours. Indianapolis, IN: Sams.
Dash, S., Shakyawar, S. K., Sharma, M., & Kaushik, S. (2019). Big data in healthcare: management, analysis and future prospects. Journal of Big Data, 6, 1-54. https://doi.org/10.1186/s40537-019-0217-0
Docker. (2022). Use containers to build, share and run your applications. Retrieved from https://www.docker.com/resources/what-container/
Foulkes, A. (2009). Applied statistical genetics with R. New York, NY: Springer.
Rallapalli, S., Gondkar, R. R., & Ketavarapu, U. P. (2016). Impact of processing and analyzing healthcare big data on cloud computing environment by implementing Hadoop cluster. Procedia Computer Science, 85, 16-22. https://doi.org/10.1016/j.procs.2016.05.171
RStudio. (2022). Download and install RStudio Server for Debian & Ubuntu. Retrieved from https://www.rstudio.com/products/rstudio/download-server/debian-ubuntu/
University of North Carolina Greensboro Computer Science. (2022). Installation of R, RStudio, and packages for RHadoop. Retrieved from https://home.uncg.edu/cmp/downloads/files/Part%203.pdf