Table of Contents
- Introduction
- Step 1. Ask
  - 1.1 Ask: Identifying stakeholders and their expectations
  - 1.2 Ask: Statement of business task
- Step 2. Prepare
  - 2.1 Prepare: Dataset introduction
  - 2.2 Prepare: Does the data ROCCC?
  - 2.3 Prepare: Dataset limitations & data selection
- Step 3. Process
  - 3.1 Process: Examining the datasets
  - 3.2 Process: Examining the datasets
  - 3.3 Process: Data cleaning and verification
  - 3.4 The data cleaning and verification details
  - 3.5 Data merging/splitting possibilities
- Step 4. Analyze
  - 4.1 Analyze: General statistics
  - 4.2 Analyze: Identifying trends and relationships
- Step 5. Share
- Step 6. Act
  - 6.1 Act: Insights
  - 6.2 Act: Recommendations
- Skills Used
- Limitations and future scope
- References
Introduction
Bellabeat is a high-tech company that manufactures health-focused smart products. Cofounder Urška Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
About Bellabeat
- Founders: Urška Sršen and Sando Mur
- Founded: 2013
- Core business: Manufacturer of health-focused smart products
- Mission statement: Empowering women to reconnect with themselves, unleash their inner strengths and be what they were meant to be
- Products:
- Bellabeat app
- Leaf (classic wellness tracker)
- Time (wellness watch)
- Spring (smart water bottle)
- Bellabeat membership (membership program for users)
- Growth: By 2016, Bellabeat had opened offices worldwide. Through extensive investments in advertising (traditional and digital, including Google Search and social media), Bellabeat products became available through an increasing number of online retailers.
In this case study, I'm going to define a business task and follow a data analysis approach in order to suggest new growth opportunities for the company.
Bellabeat founders believe that there are even more growth opportunities out there, and an analysis of available consumer data would reveal them. The marketing analytics team has to make it happen by following the steps of the Google data analysis process: ask, prepare, process, analyze, share, and act.
The primary objectives of this analysis are:
- To identify trends in smart device usage, particularly physical activity and sleep behavior.
- To determine how these trends can be applied to Bellabeat’s customer base.
- To inform high-level marketing strategy by presenting data-driven recommendations.
The analysis follows the six phases of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Using tools such as R and ggplot2, the study explores relationships between:
- total steps
- sedentary behavior
- activity intensities
- sleep efficiency
You can find the R script here.
This offers insights into how daily habits may affect wellness outcomes.
Ultimately, this case study not only supports Bellabeat’s mission to empower women through health-focused technology, but also serves as a portfolio-ready project that showcases essential skills in data wrangling, visualization, and strategic thinking.
The data used in this project is publicly available data from the FitBit Fitness Tracker dataset on Kaggle.
Step 1. Ask
1.1 Ask: Identifying stakeholders and their expectations
In this step, I have to define the problem I am trying to solve; however, the first step is to identify the stakeholders and make sure their expectations are understood. Based on the information provided in the case study, the project's stakeholders and their expectations are identified below.
Stakeholders and their expectations

| Stakeholder | Stakeholder Type | Stakeholder Group | Position / Organizational Unit | Expectations |
|---|---|---|---|---|
| Urška Sršen | Primary stakeholder | Executive team | Cofounder and Chief Creative Officer | An analysis of smart device usage data that reveals new growth opportunities and high-level recommendations for Bellabeat's marketing strategy |
| Sando Mur | Primary stakeholder | Executive team | Cofounder and Mathematician | Data-driven insights and recommendations that support decisions about the company's marketing strategy |
| Bellabeat marketing analytics team | Secondary stakeholder | Data analysis team | Organizational unit | Analyze the available consumer data, identify trends, and report insights that guide the marketing strategy |
Bellabeat founders believe that there are even more growth opportunities out there, and an analysis of available consumer data would reveal them. The marketing analytics team has to make it happen by following the steps of the Google data analysis process: ask, prepare, process, analyze, share, and act.
1.2 Ask: Statement of business task
After identifying stakeholder expectations, it is time to state the business task clearly. Before starting the project, it is necessary to ask a couple of questions, create a summary of the key information and communicate it with the stakeholders.
Bellabeat case study information summary
- What is the problem? Bellabeat believes that there are more opportunities for growth. The marketing analytics team should investigate other (non-Bellabeat) smart devices, identify potential trends and consumer behaviors, and influence Bellabeat's marketing strategy using the gained insight.
- Can it be solved with data? If so, what data? Yes. The 'FitBit Fitness Tracker Data'; it is public domain data made available by Mobius that explores smart device users' daily habits.
- Where is this data? Does it exist, or do you need to collect it? Are you using private data that someone will need to give you access to, or publicly available data? It exists. It is public domain data, and the executives encouraged the marketing analytics team to use it.
- Who are the relevant sponsors and stakeholders for this project? Who is involved, and how? The executive team is the project sponsor; the primary and secondary stakeholders are identified in the table above.
- What are the boundaries for your project? The analysis is limited to the suggested dataset and to proposing high-level recommendations about Bellabeat's marketing strategy using the gained insights.
Bellabeat statement of business task
Analyze non-Bellabeat smart device data (in this case, FitBit) and identify customer behavior and potential trends (insights).
Apply the insights to Bellabeat products and propose high-level recommendations to guide the company’s marketing strategy.
At the end of the ‘Ask’ phase:
- ✔ Project key stakeholders and their expectations are identified
- ✔ Business tasks are set, and the data analysis team knows exactly what it is being asked to accomplish
Step 2. Prepare
2.1 Prepare: Dataset introduction
The ‘Prepare’ phase ensures that you have all of the data you need for your analysis and that you have credible and useful data; an introduction to the dataset might be a good starting point.
'FitBit' dataset introduction
- The company founders encouraged the team to use the 'FitBit Fitness Tracker Data' made available by Mobius on Kaggle to address the business tasks.
- The dataset contains personal fitness data from thirty eligible FitBit users who consented to the submission of personal tracker data, including physical activity, steps, heart rate, and sleep monitoring, that can be used to explore users' habits.
- According to Mobius on Kaggle, the dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
- Column information is available in Fitbit's data dictionary.
- The 'FitBit' dataset has a CC0 license, which allows creators to give up their copyright and put their works into the worldwide public domain. CC0 allows users to distribute, remix, adapt, and build upon the material in any medium or format, with no conditions.
- The database contains 18 files in .csv format.
2.2 Prepare: Does the data ROCCC?
'FitBit' dataset ROCCC analysis

| Perspective | Description | Notes | Status |
|---|---|---|---|
| Reliable | Accurate, complete, and unbiased data | The data seems accurate, as it is collected from fitness tracker smart devices. However, it contains only 30 respondents, so it is not complete and could be biased. | Low |
| Original | Validating second- or third-party data against the original source | The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk, which is third-party data (even though reliable). | Medium |
| Comprehensive | All critical information needed to answer the questions is available | An acceptable range of wellness parameters is recorded, including physical activity intensity, steps taken, heart rate, sleep monitoring, calories used, and weight records; however, the sample size is small. | Medium |
| Current | Up-to-date and relevant data | The data dates back to 2016. Therefore, it cannot be considered up to date, and the insights gained could be irrelevant now. | Low |
| Cited | Who created the dataset? Is it from a credible organization? | Data was collected using a survey via Amazon Mechanical Turk. The lack of further information somewhat blurs the credibility of the dataset. | Medium |
What is the ROCCC approach? A good dataset should be Reliable, Original, Comprehensive, Current, and Cited; hence the acronym ROCCC.
2.3 Prepare: Dataset limitations & data selection
The two most important limitations of the 'FitBit' dataset are its small sample size (30 respondents) and its age (the data dates back to 2016).
'FitBit' dataset limitations
- A sample size of 30 is commonly cited as the smallest for which the Central Limit Theorem (CLT) can still be assumed to hold.
- The sample size for some 'FitBit' datasets (e.g., sleep monitoring and weight records) is below 30.
- The 'FitBit' dataset was generated using a survey in 2016 (9 years ago). Considering the potential changes in consumer habits, the insights gained through analysis could be misleading now (outdated data).
- There is missing data in some datasets (e.g., daily activity and sleep monitoring).
- There is no evidence to verify the validity of the survey, and consequently the integrity of the collected data.
Considering the data collection approach (a survey) and the fact that I am unable to communicate with the stakeholders, modifying the dataset is not possible at the moment.
Project data selection
From the 18 provided datasets, 10 were selected and 8 were excluded, mainly because they contained duplicate variables already present in other datasets or did not match the project context (too much detail where high-level insights are required).
- ❌ Excluded datasets:
  - dailyCalories_merged (duplicate variables)
  - dailyIntensities_merged (duplicate variables)
  - dailySteps_merged (duplicate variables)
  - heartrate_seconds_merged (too many details)
  - minuteCaloriesWide_merged (too many details)
  - minuteIntensitiesWide_merged (too many details)
  - minuteSleep_merged (vague data)
  - minuteStepsWide_merged (too many details)
At the end of the ‘Prepare’ phase:
- ✔ Required data is collected and stored
- ✔ The credibility of the data is determined
- ✔ The data limitations are recognized
Step 3. Process
3.1 Process: Examining the datasets
Having decided to continue the project with the current datasets despite their limitations, I need to clean the data so that the analysis will be error-free and able to answer the business tasks. The first step is to import and examine the selected datasets.
Importing the project datasets: RStudio was used to get an overview of the datasets. The following packages were loaded for the analysis: tidyverse, skimr, janitor, lubridate, readr, and dplyr.
library(tidyverse) # For data manipulation
library(skimr) # For data summary
library(janitor) # For cleaning column names
library(lubridate) # For date manipulation
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
The selected datasets were imported using R. Below, you can find the code used to import them.
dailyActivity <- read_csv("data/dailyActivity_merged.csv")
hourlyCalories <- read_csv("data/hourlyCalories_merged.csv")
hourlyIntensities <- read_csv("data/hourlyIntensities_merged.csv")
hourlySteps <- read_csv("data/hourlySteps_merged.csv")
minuteCalories <- read_csv("data/minuteCaloriesNarrow_merged.csv")
minuteIntensities <- read_csv("data/minuteIntensitiesNarrow_merged.csv")
minuteMETs <- read_csv("data/minuteMETsNarrow_merged.csv")
minuteSteps <- read_csv("data/minuteStepsNarrow_merged.csv")
sleepInfo <- read_csv("data/sleepDay_merged.csv")
weightInfo <- read_csv("data/weightLogInfo_merged.csv")
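Note that the raw CSVs use CamelCase column names (e.g., Id, ActivityHour), while the join code later in this notebook refers to snake_case names such as id and activity_hour. The original script does not show the conversion step; a minimal sketch of how it could be done with janitor::clean_names() is shown below (where later steps refer to the original CamelCase names, those steps would run before this conversion).

library(janitor)

# Convert column names to snake_case (e.g., Id -> id, ActivityHour -> activity_hour)
# so that later joins by c("id", "activity_hour") and c("id", "date") work as written
hourlyCalories    <- clean_names(hourlyCalories)
hourlyIntensities <- clean_names(hourlyIntensities)
hourlySteps       <- clean_names(hourlySteps)
minuteCalories    <- clean_names(minuteCalories)
minuteIntensities <- clean_names(minuteIntensities)
minuteMETs        <- clean_names(minuteMETs)
minuteSteps       <- clean_names(minuteSteps)
dailyActivity     <- clean_names(dailyActivity)
sleepInfo         <- clean_names(sleepInfo)
weightInfo        <- clean_names(weightInfo)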
Next, it is necessary to examine the imported datasets and decide how to process them.
3.2 Process: Examining the datasets
A summary of the imported datasets is provided below:
- The dataset collection consists of ten files that track various aspects of users' physical activity, sleep, and weight.
- The dailyActivity_merged file includes 15 variables, such as steps, distance, activity minutes, and calories, for 33 users across 940 daily records.
- Three hourly-level datasets (hourlyCalories_merged, hourlyIntensities_merged, and hourlySteps_merged) contain calorie, intensity, and step data respectively, each with 22,099 records from 33 respondents.
- Four minute-level (narrow/long format) datasets (minuteCaloriesNarrow_merged, minuteIntensitiesNarrow_merged, minuteMETsNarrow_merged, and minuteStepsNarrow_merged) capture fine-grained information on calories, intensity, METs, and steps. Each has 1,325,580 rows covering 33 users.
- The sleepDay_merged file logs sleep details, including total minutes asleep and time in bed, with 5 variables over 413 records from 24 users.
- The weightLogInfo_merged file contains 8 variables related to weight, BMI, and body fat across 67 entries from 8 users.
3.3 Process: Data cleaning and verification
Now that I have learned about the datasets, it is time to focus on data cleaning and verification. These processes ensure that the data is clean and structured in a format that is easy to analyze; the following steps were taken to make that happen.
Data Cleaning Checklist
- Check preliminary requirements by reviewing the business objectives, backing up data beforehand, and fixing any errors at their source when possible.
- Remove unwanted data, such as duplicates, inaccuracies, or incorrect entries.
- Check data accuracy and correctness by validating value ranges (e.g., minimums, maximums, percentages), binary and numeric types, and correcting spelling, variable names, capitalization, punctuation, and extra spaces.
- Manage formatting by standardizing font, size, color, bolding, and italicizing where needed for clarity in reporting or analysis.
- Handle missing values through context-appropriate methods like imputation, removal, or retention.
- Check for merging or splitting opportunities to address inconsistent or misaligned data columns.
- Ensure data is correctly fielded, making sure, for example, that a country name isn’t placed in a city column.
- Verify data length for specific fields (e.g., confirming that year entries are exactly four digits).
3.4 The data cleaning and verification details
Based on the previous checklist, the data cleaning and verification were done step by step.
The following describes the details of each step.
Remove unwanted data:
Considering the business task, there was irrelevant data in some datasets; the related columns were removed. The affected datasets and the removed or modified variables (columns) are listed below:
- weightLogInfo_merged: The logId variable was removed, and the Fat variable was dropped due to insufficient data (only 2 non-empty records). The IsManualReport column was renamed to Report Type, and its boolean values (True/False) were replaced with descriptive labels (Manual/Automatic).
- dailyActivity: The ActivityDate column was renamed to Date, and TotalSteps was renamed to Steps. Three less relevant columns, namely TrackerDistance, LoggedActivitiesDistance, and SedentaryActiveDistance, were removed.
- sleepInfo: The SleepDay column was renamed to Date. A new column, TimeAwake, was created by subtracting TotalMinutesAsleep from TotalTimeInBed, and the TotalSleepRecords column was removed.
Important: All MET values exported from Fitabase are multiplied by 10. Please divide by 10 to get accurate MET values.
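A minimal sketch of how the column drops, renames, and the MET rescaling described above could be implemented with dplyr is shown below. It assumes the original CamelCase column names (LogId, Fat, IsManualReport, ActivityDate, TotalSteps, METs) and that IsManualReport was read as a logical value; the exact names and order of steps in the original script may differ, and the sleepInfo date handling appears later in section 3.5.

library(dplyr)

# weightInfo: drop LogId and the sparsely populated Fat column,
# then relabel the report type (assumes IsManualReport was parsed as logical)
weightInfo <- weightInfo %>%
  select(-LogId, -Fat) %>%
  rename(`Report Type` = IsManualReport) %>%
  mutate(`Report Type` = if_else(`Report Type`, "Manual", "Automatic"))

# dailyActivity: rename the key columns and drop the less relevant distance columns
dailyActivity <- dailyActivity %>%
  rename(Date = ActivityDate, Steps = TotalSteps) %>%
  select(-TrackerDistance, -LoggedActivitiesDistance, -SedentaryActiveDistance)

# minuteMETs: Fitabase exports METs multiplied by 10, so rescale to true MET values
minuteMETs <- minuteMETs %>%
  mutate(METs = METs / 10)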
Identifying duplicate records and removing them is the next step.
The following R code was used to examine how many duplicate rows there are in the sleepDay_merged dataset, and the same check was applied to all selected datasets. Only 3 duplicates were found, all in the sleepDay_merged dataset.
# Check for duplicates
sum(duplicated(sleepInfo))
To remove the duplicates, the unique() function was used, and the duplicate check was then repeated to confirm that none remained.
# Remove duplicate rows, then re-run the duplicate check to verify
sleepInfo <- unique(sleepInfo)
sum(duplicated(sleepInfo)) # returns 0 after removal
To identify NA values, the following code was used for each selected dataset (shown here for the sleepDay_merged dataset); however, no NA value was found.
# To identify NA values
sum(is.na(sleepInfo))
The remaining checklist items were addressed as follows:
- Data accuracy and correctness:
  - The minimum and maximum of each variable were checked (e.g., active minutes, active distance, and weight columns should not contain negative values, and the values should fall within a reasonable range).
  - Dates and times were examined to confirm they are in the proper format. Their data type was "character"; therefore, some modifications were required (discussed below).
  - Spelling and column naming (e.g., using a single name for the date (ActivityDate), time (ActivityTime), and total steps (TotalSteps) columns across datasets), capitalization, and any extra spaces were checked and corrected.
- Data formatting: the datasets were checked for a consistent format (e.g., font type, color, and size).
- Missing and misfielded values: there were missing values in some columns; in some cases the variables were removed (see the table above), and in others they were left as-is, since I did not have access to the respondents to complete the missing values. No misfielded values were found.
- Length of data: the Id columns in all datasets were checked to contain exactly ten digits, and no errors were found.
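As an illustration of the checks above, a minimal sketch for one dataset is shown below (the Id column name is an assumption; it would be id if clean_names() had already been applied).

library(dplyr)

# Range check: numeric variables should fall within plausible, non-negative ranges
dailyActivity %>%
  summarise(across(where(is.numeric),
                   list(min = ~min(.x, na.rm = TRUE),
                        max = ~max(.x, na.rm = TRUE))))

# Length check: every Id should contain exactly ten digits
all(nchar(as.character(dailyActivity$Id)) == 10)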
Check for merging/splitting possibilities: Merging
According to the selected datasets' characteristics (section 3.2), 3 datasets (hourlyCalories_merged, hourlyIntensities_merged, and hourlySteps_merged) have two common variables (Id, ActivityHour) and an equal number of rows. The 3 datasets were therefore merged into a single one (hourlyParameters_merged).
library(dplyr)
hourlyParameters <- left_join(hourlyCalories, hourlyIntensities, by = c("id", "activity_hour"))
hourlyParameters <- left_join(hourlyParameters, hourlySteps, by = c("id", "activity_hour"))
Also, 3 datasets (dailyActivity_merged, sleepDay_merged, and weightLogInfo_merged) contain customers' daily data. Therefore, it is possible to merge them into a single one (dailyParameters_merged) using the R merge() function.
dailyParameters <- merge(dailyActivity, sleepInfo, by = c("id", "date"), all.x = TRUE, all.y = FALSE)
dailyParameters <- merge(dailyParameters, weightInfo, by = c("id", "date"), all.x = TRUE, all.y = FALSE)
Similarly, 4 datasets (minuteCaloriesNarrow_merged, minuteIntensitiesNarrow_merged, minuteMETsNarrow_merged, and minuteStepsNarrow_merged) have two common variables (Id, ActivityMinute) and an equal number of rows. Therefore, to merge the 4 datasets into a single one (minuteParameters_merged), the following R code was used:
minuteParameters <- left_join(minuteCalories, minuteIntensities, by = c("id", "activity_minute"))
minuteParameters <- left_join(minuteParameters, minuteMETs, by = c("id", "activity_minute"))
minuteParameters <- left_join(minuteParameters, minuteSteps, by = c("id", "activity_minute"))
By merging the above datasets, the total number of selected datasets decreased from 10 to the following 3:
- dailyParameters_merged.csv
- hourlyParameters_merged.csv
- minuteParameters_merged.csv
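Since the hourly and minute joins are expected to match every row exactly, and the daily merge keeps all dailyActivity rows, a quick sanity check like the following (not part of the original script) can confirm that no rows were duplicated or dropped by the joins.

# Row counts should be unchanged by the joins
stopifnot(nrow(hourlyParameters) == nrow(hourlyCalories))
stopifnot(nrow(minuteParameters) == nrow(minuteCalories))
stopifnot(nrow(dailyParameters) == nrow(dailyActivity))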
3.5 Data merging/splitting possibilities
Check for merging/splitting possibilities: Splitting.
- All datasets had a column containing date and/or time data. These columns were split into separate Date and Time columns. For example, the following R code was used on the minuteParameters_merged.csv dataset (created in the merging step above) to split the activity_minute column (containing date and time data) into Date and Time columns; an extra column (Weekday) was also added to convert the dates to weekdays.

minuteParameters <- minuteParameters %>%
  mutate(activity_minute = parse_date_time(activity_minute, "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(activity_minute),
         Time = format(activity_minute, "%H:%M:%S"),
         Weekday = weekdays(Date))

- In the sleepDay_merged dataset, the time for all records was 12:00:00 AM. Therefore, before merging it with the dailyActivity_merged dataset (discussed above), the date-time format was changed to date only using the following R code:

sleepInfo <- sleepInfo %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
sleepInfo$Weekday <- weekdays(sleepInfo$Date)
Step 4. Analyze
4.1 Analyze: General statistics
The "Analyze" step puts the data to work: it is necessary to perform calculations, identify trends and relationships, uncover new insights, and discover potential solutions to the business tasks.
Initial analysis of the datasets: Code
First, general statistics were extracted, including the average, standard deviation, minimum, maximum, and percentiles of some of the selected variables in each dataset.
I started with the dailyParameters_merged dataset; a similar process was repeated for the hourlyParameters_merged_splitted and minuteParameters_merged_splitted datasets.
dailyParameters %>%
select(steps,total_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories, total_minutes_asleep, total_time_in_bed, weight_kg, bmi) %>%
summary
Initial analysis of the datasets: Daily parameters
The initial analysis of the selected variables of the dailyParameters_merged dataset is as follows:
[Figure: summary statistics of the dailyParameters_merged dataset (RStudio output)]
The analysis highlights:
- The daily average of total steps is 7,638, which is less than the 10,000 steps that most fitness tracking devices recommend users take daily. However, according to US Department of Health and Human Services guidelines, a range of 7,000 to 13,000 steps a day helps to get the full benefits of exercise, including protection against diseases like cancer and diabetes and help with weight loss.
- Users travel an average distance of approximately 5.5 kilometers (5.49 km) a day.
- The maximum daily steps taken is 36,019, which corresponds to nearly 28 kilometers of daily total distance.
- Users spend nearly 16.5 hours a day in sedentary time, which is surprisingly high; according to the British Heart Foundation, adults of working age in England average about 9.5 hours of sedentary time per day. Users also had an average of 3.2 hours of light activity and only half an hour of active time (fairly and very active).
- The average calories burned (2,304) seems reasonable. According to the Dietary Guidelines for Americans 2020-2025, the average adult woman expends roughly 1,600 to 2,400 calories per day, while the average adult man expends 2,000 to 3,000.
- The data shows that while people spend an average of more than 7.5 hours in bed, they sleep about 7 hours a day, meaning they are awake in bed for half an hour on average. This also seems normal, as the Sleep Foundation suggests 7-9 hours of daily sleep for adults aged 26-64.
- The average BMI (a person's weight divided by the square of their height) is 25.19. According to the Centers for Disease Control and Prevention, a BMI between 25.0 and 29.9 falls within the overweight range.
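The figures quoted above can be reproduced directly from the merged daily dataset; a minimal sketch, using the snake_case column names from the summary code above:

dailyParameters %>%
  summarise(
    avg_steps           = mean(steps, na.rm = TRUE),                  # ~7,638
    avg_distance_km     = mean(total_distance, na.rm = TRUE),         # ~5.5 km
    avg_sedentary_hours = mean(sedentary_minutes, na.rm = TRUE) / 60, # ~16.5 hours
    avg_calories        = mean(calories, na.rm = TRUE),               # ~2,304
    avg_awake_in_bed    = mean(total_time_in_bed - total_minutes_asleep, na.rm = TRUE),
    avg_bmi             = mean(bmi, na.rm = TRUE)                     # ~25.19
  )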
Initial analysis of the datasets: Hourly and minute parameters
The initial analysis of the selected variables of the hourlyParameters_merged_splitted dataset is as follows:
[Figure: summary statistics of the hourlyParameters_merged_splitted dataset (RStudio output)]
The data for Total intensity and Average intensity is not entirely clear, mainly because the unit for these variables is not stated. However, they will be used in the next analysis.
Users took about 320 steps and burned an average of 97 calories per hour, which seems to be in line with the average daily total steps (7,638) and average daily calories burned (2,304).
The initial analysis of the selected variables of the minuteParameters_merged_splitted dataset is as follows:
[Figure: summary statistics of the minuteParameters_merged_splitted dataset (RStudio output)]
The metabolic equivalent of task (MET) is the ratio of the work metabolic rate to the resting metabolic rate. For example, one MET is the rate of energy expenditure while at rest. A four-MET activity expends four times the energy used by the body at rest. If a person does a 4 MET activity for 30 minutes, he or she has done 4 x 30 = 120 MET-minutes (or 2.0 MET-hours) of physical activity.
The data shows that the average MET value recorded per minute is 1.469 (after applying the divide-by-10 correction noted in section 3.4). Considering the definition of MET, this could correspond to a light activity with a MET value of 2 (such as walking slowly at ~3 km/h) performed for about 0.73 minutes (since 2 × 0.73 ≈ 1.469), or to other combinations of very brief light activities. This aligns with the generally low activity levels in the dataset.
The average number of steps taken by users per minute is close to 5.3, and the average number of calories burned per minute is 1.6.
4.2 Analyze: Identifying trends and relationships
1. Activity time vs. Calories
The relationship between activity duration at each activity level (very active, fairly active, lightly active, and sedentary) and the number of calories burned was examined. For the sedentary level, the negative slope of the regression line shows that as users spend more sedentary time, the number of calories burned decreases.
Sedentary time (minutes) vs. number of calories consumed
dailyParameters %>%
ggplot() +
geom_point(aes(x = sedentary_minutes, y = calories), color = "darksalmon") +
geom_smooth(aes(x = sedentary_minutes, y = calories), method = lm, se = FALSE, color = "cornsilk4") +
labs(
title = "Calorie vs. Sedentary Time",
subtitle = "The relationship between the sedentary time of the users and the number of calories burned",
caption = "Data collected from 33 users",
x = "X: Sedentary time (minutes)",
y = "Y: Number of calories burned"
) +
annotate("text", x = 50, y = 2750, label = "Regression line", color = "cornsilk4", size = 3, fontface = "bold") +
theme_minimal()


The p-value is equal to 0.001; a typical threshold is 0.05, and anything smaller is considered statistically significant.
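The p-value quoted above would come from a simple linear model of calories burned on sedentary minutes; a minimal sketch (not part of the original script; the model object name is arbitrary):

# Fit and summarize the regression of calories burned on sedentary minutes;
# summary() reports the slope, its p-value, and the R-squared
sedentary_calories_model <- lm(calories ~ sedentary_minutes, data = dailyParameters)
summary(sedentary_calories_model)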
2. Activity time vs. Calories
Very, fairly, and lightly active time (minutes) vs. number of calories consumed. Sample code for the "Very active" level:
#veryactiveminutes
dailyParameters %>%
ggplot() +
geom_point(aes(x = very_active_minutes, y = calories), color = "darkolivegreen3") +
geom_smooth(aes(x = very_active_minutes, y = calories), method = lm, se = FALSE, color = "cornsilk4", size = 0.5) +
labs(
title = "Calorie vs. very_active_minutes",
subtitle = "Relationship btw the very_active_minutes of the users and the number of calories burned",
caption = "Data collected from 33 users",
x = "X: very_active_minutes (minutes)",
y = "Y: Number of calories burned"
) +
annotate("text", x = 50, y = 2750, label = "Regression line", color = "cornsilk4", size = 3, fontface = "bold") +
theme_minimal()



If you are more active (“very” or “fairly”), you may burn more calories for less activity time.
As mentioned earlier, the average adult woman expends roughly 1,600 to 2,400 calories per day and the average adult man expends 2,000 to 3,000. The diagrams confirm that the majority of data points are concentrated within that range (1,600-3,000 calories/day).
3. Sedentary time vs. Sleep
Data shows that sedentary time has an inverse effect on sleep time. Therefore, it seems that more physical activity can improve users' sleep time.
dailyParameters %>%
drop_na() %>%
ggplot(aes(x = sedentary_minutes, y = total_minutes_asleep)) +
geom_point(color = "darksalmon") +
geom_smooth(method = "lm", se = FALSE, color = "cornsilk4",size=0.5) +
labs(
title = "Sedentary time vs. Total sleep time",
subtitle = "The relationship between users' sedentary time and their total sleep time",
caption = "Data collected from 33 users",
x = "X: Users' sedentary time (minutes)",
y = "Y: Users' total sleep time (minutes)"
) +
theme_minimal()

The diagram shows that as users’ sedentary time increases, total sleep time decreases (negative slope).
In other words, lack of physical activity seems to have a negative effect on users' sleep time.
4. Activity time vs. BMI
Activity duration (minutes) of the users at different levels vs. BMI
A linear regression model describes the relationship between a dependent variable (y), and one or more independent variables (x).
To examine the relationship between users' different activity levels and their weight, BMI was used as a better indicator mainly because it considers the users' height.
Here, p-value and R-squared (R²) were considered in each activity level to determine any potential relationship.
The p-value is used in hypothesis testing to help decide whether to reject the null hypothesis (here, the hypothesis that there is no relationship between the variables). The smaller the p-value (a typical threshold is 0.05; anything smaller is considered statistically significant), the stronger the evidence for rejecting the null hypothesis.
R-squared is a goodness-of-fit measure for linear regression and always lies between 0 and 1. Usually, the larger the R², the better the regression model fits the observations.
Sample code for the "Very active" level:
#veryActiveBMI_correlation
veryActive_BMI_correlation <- lm(very_active_minutes ~ bmi, data = dailyParameters)
summary(veryActive_BMI_correlation)

The relationship can be considered statistically significant at the ‘Fairly active’ and “Lightly active” levels; at other activity levels, no correlation is seen.
5. Share of daily activity levels
While users spend most of their time in a sedentary state, the duration of moderate-to-vigorous physical activity largely meets World Health Organization (WHO) guidelines.
Moderate-intensity activity: WHO recommends 150–300 minutes per week, or 21–42 minutes per day. Among the users, 82% (27 out of 33) met this guideline.
Vigorous-intensity activity: WHO recommends 75–150 minutes per week, or 11–21 minutes per day. Compliance was slightly higher, with 88% (29 out of 33 users) meeting the recommendation.
Over 80% of users got enough daily moderate- and vigorous-intensity physical activity.
The users nevertheless spent the majority of their time in sedentary mode (81% of the time), while they were active (very or fairly) only 3% of the time.
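The compliance percentages above could be derived by averaging each user's daily active minutes and comparing them with the daily equivalents of the WHO thresholds. The sketch below assumes fairly_active_minutes maps to moderate intensity and very_active_minutes to vigorous intensity; the exact definition used in the original analysis is not shown.

library(dplyr)

dailyParameters %>%
  group_by(id) %>%
  summarise(
    avg_moderate_min = mean(fairly_active_minutes, na.rm = TRUE),
    avg_vigorous_min = mean(very_active_minutes, na.rm = TRUE)
  ) %>%
  summarise(
    pct_meeting_moderate = mean(avg_moderate_min >= 21) * 100,  # ~150 min/week
    pct_meeting_vigorous = mean(avg_vigorous_min >= 11) * 100   # ~75 min/week
  )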
6. Total steps vs. Calories
There is a positive relationship between the total steps taken and the number of calories burned (p-value < 0.05). In other words, as the number of steps taken increases, the calorie consumption increases.
Total steps taken by the users vs. number of calories consumed.


The left chart shows that there is a significant positive correlation between users' total steps and the number of calories burned (regression line). The histogram on the right shows the distribution of total daily steps taken. The more daily steps you take, the more calories you burn.
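The significance of the steps-calories relationship can also be checked with a correlation test; a minimal sketch (not part of the original script):

# Pearson correlation between daily steps and calories burned;
# a p-value below 0.05 supports the positive relationship described above
cor.test(dailyParameters$steps, dailyParameters$calories)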
7. Activity time vs. Total steps
As the plots suggest, the longer the sedentary time, the fewer steps users take; all other relationships are positive. In addition, as the activity level increases, less time is required to take a given number of steps.
Activity duration (minutes) at different levels vs. total number of steps taken by the users.




8. Activity time by the weekdays
Monday has the highest sedentary time, and Thursday has the least. On the other hand, Saturday has the highest active time, and Sunday has the least.

The average sedentary time and the average active time (the sum of the 'Very active', 'Fairly active', and 'Lightly active' time averages) by weekday were calculated and plotted using a Microsoft Excel pivot table and charts.
The charts show that sedentary time is among the lowest on Saturday (left chart), while active time is at its peak (right chart).
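The same weekday averages could also be produced in R instead of Excel. A minimal sketch is shown below; it assumes the date column of dailyParameters is either a Date or an m/d/Y character string, and the weekday_activity name is arbitrary.

library(dplyr)

weekday_activity <- dailyParameters %>%
  # as.Date() leaves Date columns unchanged and parses m/d/Y character strings
  mutate(weekday = weekdays(as.Date(date, format = "%m/%d/%Y"))) %>%
  group_by(weekday) %>%
  summarise(
    avg_sedentary_minutes = mean(sedentary_minutes, na.rm = TRUE),
    avg_active_minutes    = mean(very_active_minutes + fairly_active_minutes +
                                   lightly_active_minutes, na.rm = TRUE)
  )
weekday_activity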
9. Number of steps and degree of intensity per hour
Understanding user behavior at different hours of the day could be useful. Therefore, the number of steps taken and the intensity of activity per hour were analyzed and plotted.
Total number of steps and total intensity by hours
#Total number of steps and total intensity by hours
hourlyParameters <- hourlyParameters %>%
mutate(hour = hour(mdy_hms(activity_hour)))
# Then group by hour instead of raw timestamp
hourlyStepsSummary <- hourlyParameters %>%
group_by(hour) %>%
summarise(avgTotalSteps = mean(step_total, na.rm = TRUE))
hourlyIntensitySummary <- hourlyParameters %>%
group_by(hour) %>%
summarise(avgTotalIntensity = mean(total_intensity, na.rm = TRUE))
# Scale factor for secondary axis
scale_factor <- 0.05 * max(hourlyStepsSummary$avgTotalSteps, na.rm = TRUE)
# Plot
ggplot() +
geom_col(data = hourlyStepsSummary, aes(x = factor(hour), y = avgTotalSteps), fill = "darksalmon") +
geom_line(data = hourlyIntensitySummary,
aes(x = factor(hour), y = avgTotalIntensity * scale_factor, group = 1),
color = "dimgrey") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, face = "bold")) +
labs(
title = "Total Steps and Intensity per Hour",
subtitle = "The total number of steps taken and the degree of intensity per hour",
caption = "Data collected from 33 users",
x = "Hour of Day",
y = "Total Steps (bars) / Degree of Intensity (line)"
) +
scale_y_continuous(
sec.axis = sec_axis(~ . / scale_factor, name = "Intensity")
)
As the chart suggests, the total steps taken and the degree of intensity follow a relatively similar pattern. The data also shows that most steps are taken in the afternoon and evening.

TotalSteps ~ TotalIntensity: p-value < 0.05, R²: 0.8028.
The bar chart shows that as the day begins, the number of steps taken by users increases gradually until 5 PM and then decreases until the end of the day (a roughly bell-shaped pattern); interestingly, the degree of intensity follows the same pattern.
Two significant intervals are 12-2 PM and 5-7 PM (probably after working hours), where the number of steps (and, of course, the degree of intensity) is higher than at other hours of the day.
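The regression statistics quoted above (p-value < 0.05, R² ≈ 0.80) would come from a model like the following; a minimal sketch using the hourly data (the model object name is arbitrary):

# Hourly total steps modelled against hourly total intensity
steps_intensity_model <- lm(step_total ~ total_intensity, data = hourlyParameters)
summary(steps_intensity_model)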
10. Relationship between total steps taken and sedentary minutes

- Inverse relationship: More active users (those with higher step counts) are less sedentary, a logically expected pattern. This supports the idea that increasing daily steps can help reduce sedentary behavior.
- Two activity clusters: There appear to be two main bands of sedentary behavior (one higher, one lower) even for the same range of steps, suggesting that some users get their steps in short bursts (e.g., workouts) but still sit for long durations, while others spread their activity more evenly through the day.
- Outliers: A few users have very high step counts (20k-30k+) and still high sedentary minutes, which may reflect unusual routines (e.g., workers with a long commute but active jobs).
11. Relationship between time in bed and minutes asleep
[Figure: time in bed vs. minutes asleep scatter plot]
Observations:
- Strong linear correlation: Most data points lie close to the line, indicating that as minutes asleep increase, time in bed increases almost proportionally. This suggests a consistent pattern of sleep behavior across most users.
- High sleep efficiency: The slope near the main cluster implies high sleep efficiency (users are asleep for most of their time in bed). The closeness of many points to the diagonal suggests users spend only a small portion of their time in bed awake.
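Sleep efficiency, the share of time in bed actually spent asleep, can be computed directly from the daily data; a minimal sketch (the sleep_efficiency name is arbitrary):

library(dplyr)

sleep_efficiency <- dailyParameters %>%
  filter(!is.na(total_minutes_asleep), !is.na(total_time_in_bed)) %>%
  mutate(efficiency = total_minutes_asleep / total_time_in_bed)

# Average efficiency across sleep records; values close to 1 mean little time awake in bed
mean(sleep_efficiency$efficiency)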
Step 5. Share
In this phase, the findings and the supporting visualizations from Step 4 are shared with the stakeholders.
Step 6. Act
6.1 Act: Insights
After data analysis, creating meaningful visualizations, and sharing them with the stakeholders, I have to act on the findings. The deliverables should be prepared, and the business tasks should be answered, including high-level recommendations based on insights gained.
Key insights from the analysis:
- Number of steps:
  - The daily average of total steps taken by users is 7,638, which is less than the 10,000 steps most references recommend.
  - Activity duration and total steps have a negative relationship at the sedentary level. A positive correlation can be seen at the other three activity levels ("very active", "fairly active", and "lightly active").
  - The total steps taken and the degree of intensity are highly correlated (p-value < 0.05, R²: 0.802).
  - Users take more steps in the afternoons (12-2 PM) and evenings (5-7 PM) than at other times of the day.
- Calories:
  - The average calories burned (2,304) seems reasonable compared with the official guidelines (1,600 to 3,000), and most data points are concentrated within that range.
  - There is a significant positive correlation between activity duration and the number of calories burned at the "very active", "fairly active", and "lightly active" levels; more active users can burn more calories in less activity time.
  - At the "sedentary" level, the relationship is negative.
  - Calorie consumption increases as the number of steps taken increases.
- Activity levels:
  - Users spent 81% of their time in sedentary mode (lack of physical activity), which is surprisingly high. They were active (very or fairly) only 3% of the time.
  - Sedentary time seems to have a negative effect on sleep time. Therefore, more physical activity can probably improve users' sleep quality and duration.
  - Users spend a daily average of more than 7.5 hours in bed (about 7 hours asleep and half an hour awake in bed).
  - Monday has the highest sedentary time, and Thursday the least. Saturday has the highest active time, and Sunday the least.
- Weight:
  - The average BMI is 25.19, which falls into the "overweight" category (25.0 to 29.9).
  - The average metabolic equivalent of task (MET) is 14.69 as exported (1.469 after the divide-by-10 correction), which represents light activity (e.g., walking at a slow pace) for a short duration (less than 10 minutes).
  - There appears to be little or no correlation between users' activity time and weight; only at the "Fairly active" and "Lightly active" levels does the correlation seem to be statistically significant (p-value < 0.05).
6.2 Act: Recommendations
Based on insights gained from data analysis, it would be possible to apply the insights to Bellabeat products and propose high-level recommendations to guide the company’s marketing strategy (the business task).
- The following steps describe a general process Bellabeat is recommended to follow for each insight discussed earlier in 6.1:
  - Share each insight with current and potential customers using smart marketing methods (digital marketing, marketing campaigns, classic ads, etc.).
  - Highlight the potential risks that an unhealthy lifestyle can cause, including exposure to various diseases (cancer, diabetes, etc.) and weight gain, preferably with statistics.
  - Improve the Bellabeat app and include a user-friendly notification system that provides users with customized alerts. The objective is to inform users about their health habits and encourage them to improve them.
  - Promote the benefits of using Bellabeat products and the app to avoid negative health habits.
- Bellabeat can focus on the "activity level", "activity duration", and "total steps" of users, and improve the related products and app functions. It should be highlighted that users spent more than 80% of their time in sedentary mode, while they were active (very or fairly) only 3% of the time. They took fewer than the 10,000 steps (7,638 steps) that most guidelines recommend and, therefore, had less than average activity intensity (high positive correlation between "total steps" and "intensity").
- The company can organize social challenges and special events through the Bellabeat app, targeting days when sedentary time peaks (Mondays), days when active time is at its lowest (Sundays), or hours of the day when fewer steps are taken. It can also offer incentives such as company products, special discounts, or free membership to the winners.
- Bellabeat should explain the importance of physical activity for sleep quality and duration. Bellabeat needs to improve the sleep-tracking function in both its products (e.g., required hardware) and the app, and emphasize that using Bellabeat's smart devices can help customers increase their activity levels and monitor their sleep habits.
- The average daily calories burned appears to be in the acceptable range (2,304 calories per day, within the recommended range of 1,600 to 3,000). However, the average BMI (25.19) is in the "overweight" category, and the average MET indicates very light activity for short periods of time. Therefore, it may be a good idea to encourage users to actively manage their weight/BMI (weight management practices).
- The direct positive correlation between "activity duration" and "calorie consumption" means that more active users can burn more calories in less activity time. Bellabeat smart devices and the app can help customers monitor their activity level and try to improve it.

Skills Used
- Data Cleaning: Preparing raw data for analysis by handling missing values, outliers, and inconsistencies.
- Data Visualization: Creating informative and aesthetically pleasing visualizations using ggplot2.
- Statistical Analysis: Performing statistical tests and modeling to draw insights from data.
- ggplot2: Utilizing the ggplot2 package for advanced data visualization.
- dplyr: Using dplyr for efficient data manipulation and transformation.
- Tidying Data: Structuring data in a tidy format for easier analysis and visualization.
- R Markdown: Documenting and presenting data analysis using R Markdown for reproducible research.
- Exploratory Data Analysis (EDA): Analyzing data to summarize its main characteristics, often with visual methods.
- Data Importing and Exporting: Reading data from various sources and writing results to files.
- Date Functions: Handling and manipulating date and time data.

Limitations and future scope
- Further analysis needs to be done to obtain more accurate and reliable results, mainly because of the data limitations (the small sample size of some 'FitBit' datasets, outdated data and potential changes in consumer habits, missing or manually entered data in some cases, etc.) and because I was unable to modify the data due to the collection approach (a survey conducted in 2016).

References
- Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
- Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/
- Kirill Müller (2021). hms: Pretty Time of Day. R package version 1.1.1. https://CRAN.R-project.org/package=hms
- Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales
- Sam Firke (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://CRAN.R-project.org/package=janitor
- Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2021). skimr: Compact and Flexible Summaries of Data. R package version 2.1.3. https://CRAN.R-project.org/package=skimr
- Joshua Kunst (2022). highcharter: A Wrapper for the 'Highcharts' Library. R package version 0.9.4. https://CRAN.R-project.org/package=highcharter
- Brinton, Julia E., et al. (2017). Establishing Linkages between Distributed Survey Responses and Consumer Wearable Device Datasets: A Pilot Protocol. JMIR Research Protocols, 6(4). https://doi.org/10.2196/resprot.6513
- CDC (2020). Assessing Your Weight. https://www.cdc.gov/healthyweight/assessing/index.html
- CDC (2020). How much physical activity do adults need? https://www.cdc.gov/physicalactivity/basics/adults/index.htm
- CDC (2017). How Much Sleep Do I Need? https://www.cdc.gov/sleep/about/?CDC_AAref_Val=https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html
- Fitbit (2022). What should I know about Fitbit sleep stages? https://help.fitbit.com/articles/enUS/Helparticle/2163.htm
- Möbius (2020). Fitbit Fitness Tracker Data. Kaggle. https://www.kaggle.com/arashnic/fitbit
- Asimakopoulos, Stavros, Grigorios Asimakopoulos, and Frank Spillers (2017). Motivation and user engagement in fitness tracking: Heuristics for mobile healthcare wearables. Informatics, 4(1), 5. MDPI. https://www.mdpi.com/2227-9709/4/1/5