For this project, I will be working on understanding the behaviors and characteristics of people who use a digital application. The product offers recommendations on nearby attractions, restaurants, and businesses based on the user's location. This includes a free version for any user along with a subscription model that provides more customized recommendations for users who pay for the service.

CSV file Data: digital application user data

With free installation on a mobile device, digital applications have a low barrier to entry. They also experience high rates of attrition, as users may not continue to log in. With this in mind, the company is interested in better understanding the early experience of users with the application. A time point of 30 days was selected as an important milestone. Which factors might impact whether new users remain active beyond 30 days? Who is likely to subscribe within 30 days? The company would benefit from analyzing the available data to understand the current trends.


A simple random sample of users was taken by gathering information in the company's database. The sample was limited only to users who first installed the application in the last 6 months, when a new version of the application was released. The sample was further limited to users who signed up and had enough time for the company to measure its key milestones. To ensure reasonable comparisons, the data were limited to users in Australia, Canada, the United Kingdom, and the United States, which were deemed appropriately similar in terms of their linguistic and economic characteristics.

For each user, basic information (age group, gender, and country) was collected from the user's profile. Then the following characteristics were measured:

  • daily_sessions: This is the average number of sessions per day in the first 30 days for each user. One session consists of a period of active use as measured by the company's database. The daily sessions for a user is then the total number of sessions for the period divided by 30.
  • subscribed_30: This measure (TRUE/FALSE) indicates whether the user paid for any subscription service within 30 days.
  • active_30: This measure (TRUE/FALSE) indicates whether the user remained active at 30 days. The company decided to measure this by identifying whether the user had at least one active session in the 7-day period after the first 30 days.
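
The definitions above can be sketched in code. This is only an illustration under stated assumptions: the session log, its column names (`user_id`, `day`), and its values are hypothetical and are not the company's actual schema or data.

```r
# Hypothetical session log: one row per session, with the day on which it
# occurred (day 1 = install day). Column names and values are assumptions.
sessions <- data.frame(
  user_id = c(1, 1, 1, 1, 2, 2, 2),
  day     = c(1, 2, 15, 33, 1, 3, 40)
)

users <- sort(unique(sessions$user_id))

# daily_sessions: number of sessions in days 1-30, divided by 30
first_30 <- sessions[sessions$day <= 30, ]
daily_sessions <- as.numeric(table(factor(first_30$user_id, levels = users))) / 30

# active_30: at least one session in the 7 days after day 30 (days 31-37)
followup <- sessions[sessions$day >= 31 & sessions$day <= 37, ]
active_30 <- users %in% followup$user_id

metrics <- data.frame(user_id = users, daily_sessions, active_30)
print(metrics)
```

In this toy log, user 1 has a session on day 33 and so counts as active at 30 days, while user 2's next session on day 40 falls outside the 7-day window.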

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd('c:/Users/janetlin/Downloads/')
userdata <- read.csv('digital application user data.csv')
# Split the sample by the female indicator
female <- filter(userdata, female == TRUE)
nonfemale <- filter(userdata, female == FALSE)

We are interested in the question of whether female users have higher rates of daily sessions than other users do. The parameter selected as the metric for each group is the mean of daily_sessions. The data set therefore needs to be divided into two parts: female == TRUE and female == FALSE.

Using the data, I estimated the values of the selected parameter for female users and for other users.

library(dplyr)
userdata%>%
  group_by(female)%>%
  summarise(count = n(),mean = mean(daily_sessions, na.rm = TRUE),sd = sd(daily_sessions, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   female count  mean    sd
##   <lgl>  <int> <dbl> <dbl>
## 1 FALSE   2137  1.43 0.883
## 2 TRUE    2863  1.47 0.861
# Mean daily session for female users is 1.47 and Mean daily session for other users is 1.43.

The observed difference between the groups is small: the means differ by about 0.04 sessions per day, and the standard deviations are nearly the same. Without performing a statistical test, I would not consider this difference to be meaningful for the business. A two-sample t-test is an appropriate test for comparing the mean daily sessions of the two groups. The test uses all 5,000 sampled users, divided into two groups: 2,863 female users and 2,137 other users. Because the question is whether female users have higher rates, the test is one-tailed.

I then performed my statistical test.

t.test(female$daily_sessions, nonfemale$daily_sessions,alternative = "greater")
##  Welch Two Sample t-test
## 
## data:  female$daily_sessions and nonfemale$daily_sessions
## t = 1.7451, df = 4539.1, p-value = 0.04052
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.002494299         Inf
## sample estimates:
## mean of x mean of y 
##  1.472535  1.428950

Because the one-tailed t-test has a p-value of 0.04052, which is smaller than 0.05, we reject the null hypothesis that the mean daily sessions for female users is less than or equal to the mean for other users. That is to say, the data support the claim that female users have higher rates of daily sessions than other users do, although the estimated difference (about 0.04 sessions per day) is small.
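
As a check on the reasoning, the Welch statistic can be recomputed from the summary statistics printed earlier. This is a minimal sketch, not the original analysis: it assumes the group means from the t-test output and the standard deviations from the summary table, which are rounded to three digits, so the results should land close to, but not exactly on, the reported values.

```r
# Summary statistics as printed in the outputs above (standard deviations are rounded)
m_f <- 1.472535; s_f <- 0.861; n_f <- 2863   # female users
m_o <- 1.428950; s_o <- 0.883; n_o <- 2137   # other users

# Squared standard error contributed by each group
v_f <- s_f^2 / n_f
v_o <- s_o^2 / n_o

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m_f - m_o) / sqrt(v_f + v_o)
df <- (v_f + v_o)^2 / (v_f^2 / (n_f - 1) + v_o^2 / (n_o - 1))

# One-tailed p-value for the alternative "female mean is greater"
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)
print(round(c(t = t_stat, df = df, p = p_one_sided), 4))
```

The statistic depends only on the difference in means and the two standard errors, which is why a difference of 0.04 sessions per day can reach significance with 5,000 users.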


The product's managers are also interested in the age groups that tend to use the product and how they vary by country. I will create a table with the following characteristics:

  • Each row represents an age group.
  • Each column represents a country.
  • Each listed value shows the number of users of that age group within that country.
table(userdata$age_group,userdata$country)
##         Australia Canada  UK USA
##   18-34       282    242 439 894
##   35-49       219    204 363 792
##   50-64       191    128 255 554
##   65+          60     41 101 235

Now, I will convert the previous table of counts by age group and country into percentages. However, we want the percentages to be calculated separately within each country. Below is the resulting table as percentages (ranging from 0 to 100) rounded to 1 decimal place.

round(prop.table(table(userdata$age_group,userdata$country),2)*100,1)
##         Australia Canada   UK  USA
##   18-34      37.5   39.3 37.9 36.1
##   35-49      29.1   33.2 31.3 32.0
##   50-64      25.4   20.8 22.0 22.4
##   65+         8.0    6.7  8.7  9.5
#        Australia  Canada    UK     USA
#  18-34      37.5%   39.3%  37.9%  36.1%
#  35-49      29.1%   33.2%  31.3%  32.0%
#  50-64      25.4%   20.8%  22.0%  22.4%
#  65+         8.0%    6.7%   8.7%  9.5%
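
To make the role of the second argument to prop.table explicit, here is a small self-contained sketch that rebuilds the count table from the printed output. The object names `counts` and `pct` are illustrative choices, not names from the original analysis.

```r
# Counts copied from the age group by country table printed above
counts <- matrix(
  c(282, 242, 439, 894,
    219, 204, 363, 792,
    191, 128, 255, 554,
     60,  41, 101, 235),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    age_group = c("18-34", "35-49", "50-64", "65+"),
    country   = c("Australia", "Canada", "UK", "USA")
  )
)

# margin = 2 divides each cell by its column (country) total
col_share <- prop.table(counts, margin = 2)
pct <- round(col_share * 100, 1)
print(pct)

# Each country's shares add up to 1 before rounding
print(colSums(col_share))
```

With margin = 1 the shares would instead be computed within each age group, which answers a different question (how each age group is spread across countries).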

The percentages suggest that each country has a similar distribution of users across the age groups, because each age group accounts for approximately the same share in every country. For example, users aged 18-34 make up roughly 36% to 39% of each country, and users aged 35-49 make up roughly 29% to 33%. To check this formally, we can use a chi-square test of independence between age group and country. (Chi-square tests are designed to determine whether there is a statistically significant difference between groups of frequencies across categories.)

The value of the chi-square test statistic is computed below, independently of any existing testing function:

x = table(userdata$age_group,userdata$country)
sr <- rowSums(x)
sc <- colSums(x)
n <- sum(x)
E <- outer(sr, sc, "*")/n
x2 <- sum((x - E)^2/E)

The chi-squared statistic is 12.641. The corresponding p-value is 0.1795, as shown in the test output below.

chisq.test(x)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 12.641, df = 9, p-value = 0.1795
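
The p-value can also be obtained without chisq.test by comparing the hand-computed statistic to a chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom. The sketch below assumes the counts printed in the table above and uses pchisq from base R; the variable names are illustrative.

```r
# Counts copied from the age group by country table printed above
x <- matrix(
  c(282, 242, 439, 894,
    219, 204, 363, 792,
    191, 128, 255, 554,
     60,  41, 101, 235),
  nrow = 4, byrow = TRUE
)

# Expected counts under independence: row total * column total / grand total
E <- outer(rowSums(x), colSums(x)) / sum(x)

# Pearson chi-squared statistic and its degrees of freedom
x2 <- sum((x - E)^2 / E)
df <- (nrow(x) - 1) * (ncol(x) - 1)

# Upper-tail probability gives the p-value
p_value <- pchisq(x2, df, lower.tail = FALSE)
print(round(c(x2 = x2, df = df, p = p_value), 4))
```

This should agree with the statistic, degrees of freedom, and p-value reported by chisq.test above.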

In terms of interpreting the findings for the product managers of the digital application:
Because the p-value is 0.1795, which is greater than 0.05, we fail to reject the null hypothesis that age group and country are independent. This does not prove that the distributions are identical, but the data provide no evidence that the distribution of users across the age groups differs between countries.


Canada and the United States are geographically connected and often have overlapping media markets. We can place them in one group and compare them to a second group with Australia and the United Kingdom. I will perform a statistical test to see whether these two groups have similar rates of users who remain active at 30 days.

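# Group Canada and the USA as "America"; Australia and the UK share the label "UK"
datagrp <- userdata %>%
  mutate(grp = case_when(
    country == "Canada" ~ "America",
    country == "USA" ~ "America",
    country == "UK" ~ "UK",
    country == "Australia" ~ "UK"
  ))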
prop.test(table(datagrp$active_30,datagrp$grp))
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(datagrp$active_30, datagrp$grp)
## X-squared = 1.7966, df = 1, p-value = 0.1801
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.047327497  0.008672718
## sample estimates:
##    prop 1    prop 2 
## 0.6105126 0.6298400

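I used a two-sample proportion test here, with H0 that the proportions in the two groups are the same. Because the table has active_30 in its rows, the reported proportions (0.611 and 0.630) are the shares of Canada/USA users among inactive and active users rather than retention rates; for a 2x2 table, the test statistic and the two-sided p-value are the same in either orientation. Since the p-value of the test is 0.1801, which is greater than 0.05, we do not reject H0. The data therefore provide no evidence that the two groups differ in their rates of users who remain active at 30 days.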
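
When the goal is to read the retention rates directly from the output, one option is to put the groups in the rows and the outcome in the columns, since prop.test treats the first column of a two-column table as the count of successes. The sketch below uses hypothetical counts invented for illustration; they are not taken from the user data.

```r
# Hypothetical counts for illustration only (not from the user data):
# rows are the two country groups, columns are active / not active at 30 days.
tab <- matrix(
  c(620, 380,
    590, 410),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    group     = c("Canada/USA", "Australia/UK"),
    active_30 = c("TRUE", "FALSE")
  )
)

# Rows are groups, so the estimates are the share active within each group
by_group <- prop.test(tab)
print(by_group$estimate)

# Transposing swaps the roles of group and outcome: the estimates change,
# but the test statistic and two-sided p-value do not
by_outcome <- prop.test(t(tab))
print(c(by_group = by_group$p.value, by_outcome = by_outcome$p.value))
```

With the real data, a table such as table(datagrp$grp, datagrp$active_30) would need its columns ordered so that TRUE comes first before the estimates could be read as retention rates.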


The application's managers would like to study the relationship between daily sessions and subscriptions. Anecdotally, they think that having at least 1 session per day could be a meaningful indicator. Using the outcome of subscriptions at 30 days, I will compare the rates of subscriptions for users with at least 1 daily session to those with fewer and perform a statistical test.

datagrp$AL1=ifelse(datagrp$daily_sessions>=1,">=1","<1")
prop.test(table(datagrp$subscribed_30,datagrp$AL1),alternative="greater")
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(datagrp$subscribed_30, datagrp$AL1)
## X-squared = 6.1882, df = 1, p-value = 0.00643
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.02761479 1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.3884420 0.3115942

Because I want to compare the rates of subscriptions for users with at least 1 daily session to those with fewer, I used a two-sample proportion test. The null hypothesis is that users with at least 1 daily session have lower or equal rates of subscription compared with users with less than 1 daily session. Because the table has subscribed_30 in its rows, the reported proportions (0.388 and 0.312) are the shares of users with less than 1 daily session among non-subscribers and subscribers, respectively; a larger share among non-subscribers corresponds to a higher subscription rate among users with at least 1 daily session. Since the p-value is 0.006, which is smaller than 0.05, we reject the null hypothesis. Therefore, we conclude that users with at least 1 daily session have higher rates of subscription than users with less than 1 daily session.

This was an observational study, so we can only describe relationships between groups; we cannot establish cause and effect. Female users have slightly higher daily sessions, and users with at least 1 daily session subscribe at higher rates. We could develop more features to attract female users. Because the two country groups showed no statistically significant difference in activity at 30 days, country grouping does not appear to be a priority when trying to improve the outcomes.

Recommendations for Product Managers:

Because more daily sessions are generally associated with more subscriptions, we should encourage users to use the app more often. To do so, we could award credit toward a subscription for each session the user completes. Users who are accustomed to the app and can get a discount on subscriptions may be more likely to subscribe.

As for experimentation, I suggest that the PM test the relationship between age and subscription or activity at 30 days, and also run multi-variable tests, for example, "UK female aged 18-45" versus "America male aged 50-64". That way, we can identify the segments most likely to subscribe.