Caveats of Supervised Learning and Big Data

Data mining and analytics are used to test hypotheses and detect trends in large data sets. In statistics, significance is determined in part by sample size. By facilitating process optimization, accelerating insight discovery, and improving decision making, the big data revolution has the potential to transform our daily lives, our workplaces, and our ways of thinking. Machine learning, a subfield of computer science, aims to learn patterns from data in order to improve performance on a range of tasks. Its capacity to learn from data and deliver data-driven insights, judgments, and predictions is central to data analytics and crucial to realizing this enormous promise. Yet while this new context is ideal for some machine learning tasks, traditional methodologies were designed in an earlier era and are therefore built on assumptions, such as the data set fitting entirely in memory, that no longer hold. The mismatch between these assumptions and the realities of big data makes conventional methods difficult to apply (L'Heureux et al., 2017).
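One common workaround for the in-memory assumption is to stream the data rather than load it all at once. The following is a minimal sketch, assuming the pandas library and a hypothetical file named transactions.csv with an "amount" column, of computing a mean without holding the full data set in memory:

```python
import pandas as pd

# Hypothetical large file that may not fit in memory.
CSV_PATH = "transactions.csv"

running_sum = 0.0
running_count = 0

# Stream the file in 100,000-row chunks instead of loading it all at once.
for chunk in pd.read_csv(CSV_PATH, usecols=["amount"], chunksize=100_000):
    running_sum += chunk["amount"].sum()
    running_count += len(chunk)

print("mean amount:", running_sum / running_count)
```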

Discussion

This discussion introduces the fundamentals of supervised machine learning (ML for prediction), including key concepts, terms, algorithms, and methods for model construction, validation, and assessment. Before supervised learning can be discussed, however, the differences between supervised and unsupervised learning, and the kinds of research objectives each can accomplish, must be established. Defining a research question is the first step in employing machine learning techniques. Data science research can be divided into three categories: description, prediction, and causal inference. Machine learning can be applied to each of these three endeavors, although conventional statistical techniques may be sufficient, or even better suited, depending on the nature of the research question (Jiang, Gradus, & Rosellini, 2020).

Machine learning

Machine learning is a term commonly used in applied research to characterize computationally intensive, automated, and highly flexible methods of pattern recognition (e.g., detecting nonlinear associations, interactions, underlying dimensions, or subgroups). The term distinguishes these approaches from "traditional" parametric methods, which make many statistical assumptions and demand upfront specification of the dimensions or subgroups of interest, the functional form of the relationship between predictors and outcomes, and the interactions between predictors.
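To make this distinction concrete, here is a minimal sketch, assuming scikit-learn and synthetic data, contrasting a parametric linear fit with a random forest that learns a nonlinear shape without the functional form being specified in advance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data with a nonlinear relationship plus noise.
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.2, size=2000)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# A parametric model that assumes a linear functional form.
linear = LinearRegression().fit(X_train, y_train)

# A flexible learner that discovers the nonlinearity from the data.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("linear R^2:", r2_score(y_test, linear.predict(X_test)))
print("forest R^2:", r2_score(y_test, forest.predict(X_test)))
```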

Because they focus on discovering associations within a data set without reference to a measurable outcome, unsupervised machine learning approaches excel at description tasks. Since there is no response variable to act as a "supervisor" during the analysis, this type of machine learning is called unsupervised. Unsupervised learning focuses on discovering hidden dimensions, components, clusters, or trajectories within a data set (Jiang, Gradus, & Rosellini, 2020).
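The supervised/unsupervised distinction can be illustrated with a minimal sketch, assuming scikit-learn and synthetic blob data: the supervised model is fit with labels as its "supervisor," while the unsupervised model discovers cluster structure from the features alone:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: feature matrix X and labels y.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Supervised learning: the labels y supervise the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted classes:", clf.predict(X[:5]))

# Unsupervised learning: only X is used; cluster structure is discovered.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("discovered clusters:", km.labels_[:5])
```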

Using supervised learning to address the statistical analysis dilemma that everything is significant

Machine learning (ML) needs a large amount of data to function effectively. To distinguish the important "features" (or "signal") from the surrounding noise in a data set, the algorithms require many examples. In some circumstances, algorithms can incorporate a degree of domain knowledge, so that the essential characteristics and properties of the target data are already known and less data is needed; learning can then focus solely on optimizing output. Machine learning is so data-hungry that it is necessary to "imbue" human knowledge into the algorithm from the beginning, and a significant amount of creativity must go into selecting input data before machine learning can genuinely drive innovation. Overfitting and bias are two prevalent issues brought on by inadequate data curation. A training set that does not accurately reflect the variability of production data leads to overfitting, producing a model that can handle only a fraction of the full data stream (Council, 2019).
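A minimal sketch of this failure mode, using only NumPy and synthetic data: a flexible model is fit to a training sample covering only a narrow slice of the range seen in "production," and its error explodes outside that slice:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Production" data spans a wide range...
x_all = rng.uniform(0, 10, 500)
y_all = 3 * x_all + rng.normal(0, 1, 500)

# ...but the training sample covers only a narrow slice of it.
train_mask = x_all < 2
x_train, y_train = x_all[train_mask], y_all[train_mask]

# A high-degree polynomial fits the narrow slice closely (overfitting).
coeffs = np.polyfit(x_train, y_train, deg=9)

def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("error on narrow training slice:", mse(x_train, y_train))
print("error on the full data range:  ", mse(x_all, y_all))
```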

In large observational studies, effect estimates may reach statistical significance even when the effects themselves are trivially small. Specific scientific best practices, such as stating a hypothesis in a written protocol, preparing detailed analytic plans that note specific methods and safeguards against bias, and reporting transparently with justification for any changes to the plan, can remove or at least reduce the obstacles to drawing valid inferences (Jiang, Gradus, & Rosellini, 2020).

Importance of purpose in supervised learning and random sampling

The term "problem framing" refers to breaking down a complex issue into its constituent parts. The technical viability of a machine learning project may be established, and progress can be measured with the aid of well-defined objectives, both of which can be established through problem framing. Problem framing is crucial when thinking about an ML solution and can influence the success or failure of your product. One may define a problem in ML terms if it is established that ML is the appropriate approach and appropriate data are available. The following steps help define an issue in ML terms:

Specify the desired outcome and the model's objective.

Identify the model's output.

Set benchmarks for achievement (success metrics).
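Writing the framing down explicitly can make these steps actionable. The following is a purely hypothetical illustration (the churn scenario, field names, and metric targets are invented for this sketch, not taken from any source):

```python
from dataclasses import dataclass, field

@dataclass
class ProblemFraming:
    """Hypothetical record of an ML problem framing (illustrative only)."""
    desired_outcome: str   # what the product should achieve
    model_objective: str   # what the model is optimized to do
    model_output: str      # the prediction the model produces
    success_metrics: dict = field(default_factory=dict)  # benchmarks for achievement

# Example framing for a hypothetical customer-churn problem.
framing = ProblemFraming(
    desired_outcome="Reduce subscriber churn next quarter",
    model_objective="Predict which subscribers are likely to cancel",
    model_output="Churn probability per subscriber (0.0 to 1.0)",
    success_metrics={"recall_at_top_decile": 0.60, "auc": 0.80},
)
print(framing)
```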

Sampling can be especially valuable with data sets that are too large to examine exhaustively, such as in big data analytics applications or surveys. Identifying and examining a representative sample is cheaper and faster than processing the entire population or data set. Sampling offers several advantages, including lower cost and faster processing than working with richer or more complete data sets. However, the required sample size and the potential for sampling error are essential considerations. Sometimes a small sample can reveal the most crucial details, while in other cases a larger sample increases the likelihood that the data will be adequately represented, even though the larger size may make the sample harder to manipulate and interpret (Suresha, 2021).
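A minimal sketch of this trade-off, assuming NumPy and a synthetic stand-in for a large data set: the mean is estimated from random samples of increasing size, and the standard error shrinks as the sample grows, at the cost of handling more data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for a data set too large to examine in full.
population = rng.lognormal(mean=3.0, sigma=1.0, size=5_000_000)

for n in (100, 10_000, 1_000_000):
    sample = rng.choice(population, size=n, replace=False)
    estimate = sample.mean()
    std_error = sample.std(ddof=1) / np.sqrt(n)  # shrinks as the sample grows
    print(f"n={n:>9,}  mean~{estimate:8.2f}  std. error~{std_error:6.3f}")

print("true mean:", population.mean())
```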

In large data sets, significant correlations may explain minimal variance

A correlation coefficient measures the strength of the relationship between two measurements, ranging from +1 (perfect positive correlation) through 0 (no correlation) to -1 (perfect negative correlation). Correlations are easy to compute but harder to interpret, since many factors can change their magnitude. Variance is the expected squared deviation of a random variable from its population or sample mean; it measures how widely a set of values is dispersed around its average. The squared correlation coefficient (r²) gives the proportion of variance in one variable explained by the other, so even a statistically significant correlation of, say, 0.1 explains only 1% of the variance.

A relationship is statistically significant when it is strong enough that it would be unlikely to appear in the sample if it did not exist in the population. Establishing cause-and-effect relationships from experimental data depends on whether a finding is unlikely to be due to chance. In a well-planned experiment, randomization makes the treatment groups similar at the outset, differing only in who is assigned to which group.
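The heading's point can be demonstrated with a minimal sketch, assuming NumPy and SciPy and simulating two very weakly related variables: at a million observations the correlation is highly significant, yet it explains almost none of the variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000

# Two variables sharing only a very weak linear relationship.
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.4f}")              # correlation near 0.02
print(f"p-value = {p_value:.2e}")  # statistically significant at this sample size
print(f"variance explained (r^2) = {r**2:.5f}")  # a tiny fraction of the variance
```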

Big data versus little data

Big data is a crucial area of research and a source of fascination for the general public. Small data, however, is still with us: the same social and technological forces that have produced big data have also produced many more small data sets. More data would at first appear to be unquestionably preferable to less, and this is true when all else is equal. But acquiring more data increases costs and makes analysis more challenging, and in the real world of fixed budgets there are trade-offs between quality and quantity. Small data can sometimes outperform big data in the speed, accuracy, and cost of reaching the right conclusions. Big data typically refers to large, observational, machine-analyzed data; small data comes from experiments or purposefully gathered, human-scale collection, with an emphasis on understanding and causation rather than prediction (Faraway & Augustin, 2017).

Conclusion

Small data frequently offers insight and clarity into a phenomenon that complicated analytical wizardry cannot always provide. Big data and predictive analytics often help perform the tasks one already performs more quickly, efficiently, or precisely, while small data can often indicate whether one's initial actions are correct. Small data provides answers to questions that are fundamental to clear strategy and superior execution (Silectis, n.d.).

References

Council, G. (2019, April 15). The machine learning data dilemma. TDWI. https://tdwi.org/articles/2019/04/15/adv-all-machine-learning-data-dilemma.aspx

Faraway, J., & Augustin, N. (2017). When small data beats big data. University of Bath. https://people.bath.ac.uk/jjf23/papers/smallvbig.pdf

Jiang, T., Gradus, J. L., & Rosellini, A. J. (2020). Supervised machine learning: A brief primer. Behavior Therapy, 51(5), 675-687. https://doi.org/10.1016/j.beth.2020.05.002

L'Heureux, A., Grolinger, K., Elyamany, H. F., & Capretz, M. A. (2017). Machine learning with big data: Challenges and approaches. IEEE Access, 5, 7776-7797. https://doi.org/10.1109/ACCESS.2017.2696365

Silectis. (n.d.). Why you should be focused on small data, not big data. https://www.silect.is/blog/why-small-data-matters/

Suresha, H. P. (2021, January 15). Sampling: Statistical approach in machine learning. Analytics Vidhya. https://medium.com/analytics-vidhya/sampling-statistical-approach-in-machine-learning-4903c40ebf86



