Futurists: Mainak Mazumdar’s Thoughts on Data for Innovation

By Nathan B. Smith

As a doctoral computer science student with a particular interest in big data analytics, I have become increasingly aware of the problem of bad data. Shortly before the world was thrust into the current pandemic, I was fortunate enough to visit Denver for the CTU Doctoral Symposium and talk with many faculty members and fellow students about current issues in data science and the problems our community is facing. One of the exercises we participated in was to pair up with a fellow student and give an impromptu talk (not quite TED caliber, but a valuable experience). Over three days, I got to know Chris Tzvetcoff, a global wells engineering and operations lead working for BP in Anchorage, Alaska (and a fellow doctoral student). As it turns out, the oil and natural gas exploration industry depends heavily on big data. During this five-minute impromptu exercise, Chris explained the problem of bad (or dirty) data, which plagues oil and gas exploration. I reciprocated with a similar problem from the commercial aviation maintenance industry.

Artificial intelligence (AI) and machine learning (ML) are tools that help us analyze, learn, and discern what we would struggle to understand with traditional database management techniques. AI and ML are inextricably dependent on data, and lots of it. We need good, clean data to effectively harness the power of AI and ML in our innovative projects.

After perusing the TED Talks website, I discovered a talk by Mainak Mazumdar, Nielsen's Chief Data and Research Officer. The Nielsen ratings reveal which audiences saw certain shows and commercials. Nielsen calculates exposure using a variety of measures, including reach, frequency, averages, and the ubiquitous rating, which represents the fraction of a particular population that saw a given piece of content or advertisement. As you might imagine, Nielsen is all about big data.
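To make these measures concrete, here is a toy calculation of reach, frequency, and a rating from raw exposure records. The record layout, helper names, and numbers are my own illustration, not Nielsen's actual methodology:

```python
# Illustrative sketch only: reach, frequency, and rating computed
# from hypothetical exposure logs (not Nielsen's real method).

# Each tuple: (viewer_id, program) -- one row per exposure event.
exposures = [
    (1, "news"), (1, "news"), (2, "news"),
    (3, "drama"), (2, "drama"), (2, "news"),
]
POPULATION = 10  # total measured population (assumed)

def reach(log, program):
    """Number of distinct viewers exposed at least once."""
    return len({vid for vid, prog in log if prog == program})

def frequency(log, program):
    """Average number of exposures per reached viewer."""
    hits = sum(1 for _, prog in log if prog == program)
    r = reach(log, program)
    return hits / r if r else 0.0

def rating(log, program, population):
    """Fraction of the population that saw the content."""
    return reach(log, program) / population

print(reach(exposures, "news"))               # 2 distinct viewers
print(frequency(exposures, "news"))           # 4 exposures / 2 viewers = 2.0
print(rating(exposures, "news", POPULATION))  # 2 / 10 = 0.2
```

Notice that every one of these numbers silently depends on who is in the log and what the population figure is, which is exactly where the bad-data problems below creep in.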

Discussion

According to Mazumdar (2021), AI might add $16 trillion to the global economy within 10 years, an economy built by computers and algorithms rather than by traditional labor and industry. AI has simplified processes, increased efficiency, and improved our lives, but it has not delivered on its promise of impartial decision-making. AI now decides who gets a job and who gets a loan, and in doing so it reinforces and accelerates our prejudices, with serious societal ramifications. Mazumdar wonders: is AI failing society? Are researchers constructing biased and erroneous algorithms? More likely, skewed data, not the algorithms themselves, drives these outcomes.

Mazumdar suggests that society must reset AI for humanity and civilization, making data, not algorithms, the priority. We are spending time and money scaling AI when we should be creating and gathering high-quality, contextual data. We must stop using skewed data and instead invest in data infrastructure, data quality, and data literacy.

Mazumdar highlights a glaring bias in PULSE, a Duke University AI model that upscales blurry images into recognizable photos of faces. The method made nonwhite individuals look white: African Americans were underrepresented in the training set, which led to incorrect predictions. We have seen AI misidentify Black people before, and despite improvements, underrepresentation of racial and ethnic groups continues to produce biased outcomes. These biases are not merely academic; they cause real harm.

Mazumdar then considers the 2020 Census. The census provides the basis for many social and economic policy choices, so it must count 100 percent of the U.S. population. With the pandemic and the politics of citizenship, minorities risk being undercounted. Mazumdar predicts undercounting of minority groups that are hard to locate, contact, persuade, and interview. Undercounting introduces bias and degrades data quality.

Consider the undercounts in the 2010 census. The final counts missed 16 million individuals, roughly the combined population of Arizona, Arkansas, Oklahoma, and Iowa that year. A million children under the age of five were also undercounted.

Minorities are sometimes undercounted in national censuses because they are hard to contact, mistrust the government, or dwell in a politically unstable area.

The 2016 Australian Census undercounted Indigenous populations by 17.5%. Mazumdar expects the 2020 undercount to be substantially larger than 2010's, with significant repercussions.

Consider the census's ramifications. The census is the most reliable, accessible, and publicly available demographic data. Businesses hold private information on their customers, but the Census Bureau produces official, public counts on age, gender, ethnicity, race, employment, family status, and geographic distribution. When minorities are undercounted, AI models for public transit, housing, healthcare, and insurance may overlook the populations that need those services most.

To improve results, the data must be made representative of age, gender, ethnicity, and race per census statistics. Because the census is so crucial, it must count 100 percent of the population. Data quality and accuracy are vital to making AI possible for everyone, not just the wealthy.
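One standard statistical remedy for a non-representative sample is post-stratification weighting: re-weight each demographic group by its census share divided by its sample share. The group names and numbers below are hypothetical, but they show both the fix and its limit:

```python
# Minimal post-stratification sketch with assumed, illustrative numbers:
# weight each group so the sample mirrors official census shares.

census_share = {"group_a": 0.60, "group_b": 0.40}  # official benchmark
sample_share = {"group_a": 0.75, "group_b": 0.25}  # what the panel drew

# weight = census share / sample share
weights = {g: census_share[g] / sample_share[g] for g in census_share}
print(weights)  # {'group_a': 0.8, 'group_b': 1.6}

# Weighted estimate of some measured rate (e.g., viewership) per group
observed_rate = {"group_a": 0.30, "group_b": 0.50}
weighted = sum(sample_share[g] * weights[g] * observed_rate[g]
               for g in census_share)
print(round(weighted, 3))  # 0.38, the census-balanced mean
```

The limit matters: if a group's sample share is zero, which is exactly what undercounting produces, no weight can recover it, so the benchmark itself must count everyone.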

Most AI systems employ whatever data is already available or cheap to collect. Data quality requires genuine dedication, and defining, collecting, and measuring bias is often overlooked in a world that prizes speed, scale, and convenience.

Mazumdar and his colleagues visited retail outlets outside Shanghai and Bangalore to collect shop sales data for Nielsen. They traveled miles outside the cities to find informal, hard-to-reach businesses. Why these stores specifically? Nielsen could have chosen stores in the metropolis, where electronic data could be readily incorporated into a data pipeline. Why such concern for these stores' data quality and accuracy? Because their data matters. According to the International Labour Organization, 40% of Chinese and 65% of Indians live in rural regions. Imagine the selection bias when 65% of India's consumption is missing from the models, favoring urban over rural.

Without this rural-urban context and its signals about lifestyle, economy, and values, retail brands will make poor pricing, advertising, and marketing decisions. Likewise, urban bias will lead to bad decisions about rural health and investment. AI is not making bad judgments; a data problem is erasing entire measurable areas. The priority is contextual data, not algorithms.
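A toy calculation makes the skew tangible. The spending figures below are hypothetical; only the rural share comes from the ILO figure cited above:

```python
# Toy illustration (assumed spending numbers) of the selection bias
# the talk describes: dropping rural consumers skews national estimates.

rural_share = 0.65                            # India, per the ILO figure
avg_spend = {"urban": 100.0, "rural": 40.0}   # hypothetical spend levels

# True national average weights each segment by its population share.
true_mean = ((1 - rural_share) * avg_spend["urban"]
             + rural_share * avg_spend["rural"])

# An urban-only pipeline reports only the urban figure.
urban_only = avg_spend["urban"]

print(true_mean)   # 61.0
print(urban_only)  # 100.0, overstating national spend by ~64%
```

Under these assumptions, a model trained on urban data alone would overestimate average consumption by nearly two-thirds, exactly the kind of error that drives bad pricing and investment decisions.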

Mazumdar paints another example of this problem. He visited trailer parks in Oregon and apartments in New York City to invite residents to join Nielsen panels, which are statistically representative samplings of households measured over time. To include everyone in the measurement, Nielsen collected data from Hispanic and African-American households that rely on over-the-air TV antennas. According to Nielsen, 15% of U.S. households, about 45 million people, live in such homes. Every attempt was made to obtain data from these hard-to-reach groups.

This large demographic is crucial for marketers, brands, and media firms. Without the data, marketers and their models could not reach and advertise to these vital minority communities. And without that ad money, Telemundo or Univision could not provide free programming, especially news media, which is vital to our democracy.

Artificial intelligence covers a wide variety of issues, and opinions on it vary just as widely, so there is a need to clarify the core principles of the discipline, the potential AI brings, and the obstacles it poses. To that end, Paschen et al. (2020) offer a synopsis of the six building blocks of artificial intelligence in the context of business innovation: structured data, unstructured data, preprocesses, main processes, a knowledge base, and value-added information outputs. The authors then use this foundation to create a typology that managers can apply as a diagnostic tool as they try to understand how AI will affect their particular field.

The typology considers the consequences of AI-enabled innovations along two dimensions: the boundaries of the innovation and its effects on the competencies of the implementing organization. The first dimension separates product-facing from process-facing innovations. The second classifies innovations as either competence-enhancing (improving existing knowledge and abilities) or competence-destroying (rendering them obsolete). This framework gives managers helpful context and structure for crucial strategic decisions as they evaluate their markets, opportunities, and the risks arising from those markets.

Conclusion

Artificial intelligence (AI) is among our most revolutionary technological advances. Businesses are deploying AI applications and leveraging different forms of data (structured, semi-structured, and unstructured) to streamline procedures in every sector.

Although businesses recognize AI's significance and potential influence, many have trouble transitioning from the pilot stage to full production. Costs (e.g., hardware accelerators and compute resources), a lack of skilled personnel, a lack of machine learning operations tools and technologies, a lack of adequate volume and quality of data, and trust and governance issues are the top challenges that organizations must address to scale AI initiatives.

Businesses and society need this data. Data is the key to reducing human bias in AI. Instead of building new algorithms, Nielsen's Mainak Mazumdar is attempting to build an ethical data infrastructure for AI.

References

Mazumdar, M. (2021). How bad data keeps us from good AI [Video]. TED Conferences. https://www.ted.com/talks/mainak_mazumdar_how_bad_data_keeps_us_from_good_ai

Paschen, U., Pitt, C., & Kietzmann, J. (2020). Artificial intelligence: Building blocks and an innovation typology. Business Horizons, 63(2), 147-155. https://doi.org/10.1016/j.bushor.2019.10.004



