Introduction
Hi, I'm Kensuke from the Research and Development team at ZEALS. I work in data analytics within R&D, on tasks ranging from BI consolidation to hypothesis testing for improving the chatbot experience. As a data analyst with three years of experience, I'd like to use this article to discuss the nature of data under one theme: "Don't get fooled by data."
Background
ZEALS end-users are people who moved from a website to the chatbot that we provide, so our data mainly consists of users' actions during their chatbot sessions. As I analyze these actions every day, I often step back to reconsider the data's reliability: what to trust, and what to doubt.
Cases
In this article, I’d like to highlight two cases that show how data can fool us. Here we go...
Case 1: Don't Trust a Questionnaire Survey
There is a famous story about everyone's favorite, McDonald’s. The company’s Product Planning Department surveyed randomly selected customers to gauge the popularity of the "Chicken Tatsuta Burger," which was released by McDonald’s Japan. In the questionnaire results, a large number of people said that it was their favorite burger. However, actual sales of Chicken Tatsuta were stagnant after its release.
In the statistics world, questionnaire surveys are notorious for how poorly they capture actual consumer behavior. One reason is ambiguity in the wording of questions and answer choices: the questioner and the respondent may not read them the same way. Another is that what people say they like is not necessarily what they buy.
So, how do we confront this difficulty? It is necessary to understand what consumer behavior is and to look for a data source that reflects that behavior well. A "receipt" is a good example, as it shows what the customer actually wanted; it is a mirror of their mind, if you will.
As proof, consider ONE, an app through which people can sell any receipt to enterprises, and one of the most successful businesses in Japan. ONE makes daily shopping fun and profitable: users usually receive 1 to 10 JPY in cashback just for taking a picture of a receipt from a supermarket or convenience store. Enterprises, such as JCB, then purchase the receipts sold by consumers to use for market research. The ONE business is successful because supply and demand match between consumers and enterprises; enterprises crave consumer behavioral data.
Case 2: The Truth of the "Raw Data"
There are several definitions for "raw data," such as the three examples below:
##### ① Techtarget.com
Raw data (sometimes called source data, atomic data, or primary data) is data that has not been processed for use. A distinction is sometimes made between data and information to the effect that information is the end product of data processing. Raw data that has undergone processing is sometimes referred to as cooked data.
##### ② Statista.com
Raw data or primary data are collected directly related to their object of study (statistical units). When people are the subject of an investigation, we may choose the form of a survey, an observation, or an experiment.
##### ③ Definitions.net
Raw data is a term for data collected from a source. Raw data has not been subjected to processing or any other manipulation and is also referred to as primary data. Raw data is a relative term.
To get straight to the point, I believe there is no such thing as raw data, because data is always collected through cognitive, cultural, and institutional processes.
For example, it is now accepted that social science data is prone to human subjectivity. In scientific research, it is very important to know what to measure and how to measure it. However, in the field of social science, there is no clear unit of measurement. This is where the interpretation and bias of the person making the measurement come into play. If the data is collected by visiting many homes and interviewing people, who are the interviewers? Do they have the skills to ask the right questions? In addition, the answers given by the respondents reflect their personal intentions. For instance, if you are conducting a survey on income and the respondents assume that the results will be used for tax-related policies, households may not report income on which they are trying to evade taxes. From these points of view, it is not an exaggeration to say that all data is cooked: it is processed through the cognitive, cultural, and institutional backgrounds of both data collectors and respondents.
To cope with this issue, first, you have to understand that all data is cooked and biased in multiple directions. Second, statistically speaking, you should look not at a single value but at a range of values. Don't just look for a number; look for its confidence interval. This will keep you from being deceived by specific values or outliers. Lastly, collect more data before drawing conclusions. You should compare the results of multiple analyses and ask how the data was collected.
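As a minimal sketch of "look for a range, not a single value," the snippet below computes a mean together with an approximate 95% confidence interval using only Python's standard library. The function name `mean_confidence_interval` and the sample numbers are hypothetical and made up for illustration; they are not ZEALS data. It also assumes a normal approximation, which is rough for small samples (a t-distribution would give a somewhat wider interval).

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value (about 1.96 for 95% confidence).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical sample: messages per chatbot session, including one outlier (30).
messages_per_session = [3, 5, 4, 6, 2, 5, 4, 30, 3, 4]

mean, lower, upper = mean_confidence_interval(messages_per_session)
print(f"mean = {mean:.1f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
```

With a sample like this, the single outlier pulls the mean upward, and the wide interval signals that the point estimate alone should not be trusted.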
Closing
In an era where data can be collected in the blink of an eye, the economic gap between companies with high data literacy and those with low data literacy keeps widening. One problem is that if your data literacy is low, you are unlikely to realize that you are using data incorrectly, or even that you are being deceived by others. Therefore, improving your data literacy is crucial in this era if you want to not "get fooled by data."