How to assign Session ID’s to group event data.
This blog post will show how we can calculate user session_id’s and group event data. It will be illustrated with the help of R and SQL.
I’m working in the realm of (online) Data Analytics. That means I’m interested in analyzing the behaviour of people who use our products and how they interact with it. A helpful concept in this domain is that of a User Session. Maybe u have heard about it, maybe not. But here u go:
A session is a group of interactions between a user and an application that take place within a given timeframe. A single session can contain multiple activities (such as page views, events, social interactions, and e-commerce transactions)
see https://auth0.com/docs/manage-users/sessions
This concept leads to further metrics that are of interest such as the number of sessions, the average duration of a session or the probability of a user/customer performing certain actions within a session. Let’s take a simple example to illustrate this idea and narrow it down.
Imagine I’m searching for a new bike and visit Kleinanzeigen.de (a german classifieds site where u can buy and sell pretty much anything). Typically I enter the thing of interest (here a bike) and get already proposals (see screenshot). As soon as I click on any of these, the results load and are displayed. This typically triggers an event that a user has done a search.
However for the process of finding a bike for me, it is probably not sufficient just to specify that I search for a bike. I will also want to include the location and possibly apply other filter options such as a price range.
After having entered the location of e.g. Hamburg-Eimsbüttel and defined a price range from let’s say 200 to 500 €, this resulted in multiple triggered search events. But in a more general sense, we might argue that these events belong together and result in one search (session). At this point we want to group these searches together and assign one session_id. In order to make it easier to comprehend, I am creating dummy data of how these searches may be stored.
At first, I create a dataframe with 3 columns:
search_id
user_id
to define who has done the searchcreated_at
to indicate the time when the search has been done# Create dummy data
searches <- tibble(search_id = 1:10,
user_id = rep(c(1,2), each = 5),
created_at = ymd_hms(
c("2023-06-03 08:10:00", "2023-06-03 08:10:08",
"2023-06-03 08:11:21", "2023-06-03 08:11:24",
"2023-06-03 08:11:33", "2023-06-03 08:10:48",
"2023-06-03 08:11:05", "2023-06-03 08:11:11",
"2023-06-03 08:12:13", "2023-06-03 08:12:20"))) |>
arrange(user_id, created_at)
ABCDEFGHIJ0123456789 |
search_id <int> | user_id <dbl> | created_at <dttm> | ||
---|---|---|---|---|
1 | 1 | 2023-06-03 08:10:00 | ||
2 | 1 | 2023-06-03 08:10:08 | ||
3 | 1 | 2023-06-03 08:11:21 | ||
4 | 1 | 2023-06-03 08:11:24 | ||
5 | 1 | 2023-06-03 08:11:33 | ||
6 | 2 | 2023-06-03 08:10:48 | ||
7 | 2 | 2023-06-03 08:11:05 | ||
8 | 2 | 2023-06-03 08:11:11 | ||
9 | 2 | 2023-06-03 08:12:13 | ||
10 | 2 | 2023-06-03 08:12:20 |
We’ve got 10 individual searches of 2 users. Each user has done 5 searches and we want to know how to assign session_id’s to these searches. Maybe someone in our team has already done some research and came to the conclusion that it is best to define the timeframe of a search session to be 10 seconds. This means if a subsequent search of a user is done within 10 seconds, these belong to the same search session_id.
Therefore we want to first sort our data based on user_id and created_at.
Then we can calculate the time difference between searches.
In base R we’ve got the nice function diff
which as the name suggests calculates the
lagged difference.
[1] 0 2 1 -2
We have to pad the first value to get a vector of the same length.
searches |>
mutate(time_dif = c(0, diff(created_at)),
time_dif_reset = if_else(time_dif >= 10, TRUE, FALSE),
user_reset = if_else(c(1, diff(user_id)) != 0, TRUE, FALSE))
ABCDEFGHIJ0123456789 |
search_id <int> | user_id <dbl> | created_at <dttm> | time_dif <dbl> | |
---|---|---|---|---|
1 | 1 | 2023-06-03 08:10:00 | 0 | |
2 | 1 | 2023-06-03 08:10:08 | 8 | |
3 | 1 | 2023-06-03 08:11:21 | 73 | |
4 | 1 | 2023-06-03 08:11:24 | 3 | |
5 | 1 | 2023-06-03 08:11:33 | 9 | |
6 | 2 | 2023-06-03 08:10:48 | -45 | |
7 | 2 | 2023-06-03 08:11:05 | 17 | |
8 | 2 | 2023-06-03 08:11:11 | 6 | |
9 | 2 | 2023-06-03 08:12:13 | 62 | |
10 | 2 | 2023-06-03 08:12:20 | 7 |
We added a boolean variable indicating whether the time difference between the
following search is greater than 10 seconds. If this is the case the session is
“reset” and starts again (here I use dplyr::if_else()
which would not be necessary
but I feel it makes it easier to follow).
What we also have to account for is the fact that a new session starts for every
user. This can be achieved in a similar way.
Here we pad the first value of user_reset
with a 1 because we want to start our session_id variable with 1.
What is now left is to combine these 2 time_dif_reset
and user_reset
variables
into a general reset
variable (time_dif_reset OR user_reset
). The cumulative sum of this variable will then give us our
desired session_id.
searches |>
mutate(time_dif = c(0, diff(created_at)),
time_dif_reset = if_else(time_dif >= 10, TRUE, FALSE),
user_reset = if_else(c(1, diff(user_id)) != 0, TRUE, FALSE),
reset = if_else(time_dif_reset | user_reset, TRUE, FALSE),
session_id = cumsum(reset))
ABCDEFGHIJ0123456789 |
search_id <int> | user_id <dbl> | created_at <dttm> | time_dif <dbl> | time_dif_reset <lgl> | user_reset <lgl> | |
---|---|---|---|---|---|---|
1 | 1 | 2023-06-03 08:10:00 | 0 | FALSE | TRUE | |
2 | 1 | 2023-06-03 08:10:08 | 8 | FALSE | FALSE | |
3 | 1 | 2023-06-03 08:11:21 | 73 | TRUE | FALSE | |
4 | 1 | 2023-06-03 08:11:24 | 3 | FALSE | FALSE | |
5 | 1 | 2023-06-03 08:11:33 | 9 | FALSE | FALSE | |
6 | 2 | 2023-06-03 08:10:48 | -45 | FALSE | TRUE | |
7 | 2 | 2023-06-03 08:11:05 | 17 | TRUE | FALSE | |
8 | 2 | 2023-06-03 08:11:11 | 6 | FALSE | FALSE | |
9 | 2 | 2023-06-03 08:12:13 | 62 | TRUE | FALSE | |
10 | 2 | 2023-06-03 08:12:20 | 7 | FALSE | FALSE |
This tells us now that the first user has done 2 search sessions (session_id 1 and 2) and the second user has done 3 search sessions (session_id 3, 4 and 5). We would probably not divide this calculation into several steps, but combine them into a function.
# Put it into a function
create_session_id <- function(created_at, user_id, time_dif = 10) {
return(cumsum(c(0, diff(created_at)) >= time_dif |
c(1, diff(user_id)) != 0))
}
searches |>
mutate(session_id = create_session_id(created_at, user_id))
Now we also look at how this can be accomplished in SQL. We’ll create an in memory SQLite database table with our searches data.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# Fill this with our searches data
copy_to(con, searches)
# SQL query to create session_id's
tbl(con,
sql(
"
with
searches_prep as (
select
search_id,
user_id,
created_at,
case when
created_at - lag(created_at) over
(partition by user_id order by created_at) >= 10
or row_number() over
(partition by user_id order by created_at) = 1
then 1
else 0
end as reset
from searches)
select
search_id,
user_id,
created_at,
reset,
sum(reset) over (order by user_id, created_at) as session_id
from searches_prep
"
))
# Source: SQL [10 x 5]
# Database: sqlite 3.38.5 [:memory:]
search_id user_id created_at reset session_id
<int> <dbl> <dbl> <int> <int>
1 1 1 1685779800 1 1
2 2 1 1685779808 0 1
3 3 1 1685779881 1 2
4 4 1 1685779884 0 2
5 5 1 1685779893 0 2
6 6 2 1685779848 1 3
7 7 2 1685779865 1 4
8 8 2 1685779871 0 4
9 9 2 1685779933 1 5
10 10 2 1685779940 0 5
Here we’re using a first CTE to create our reset variable similar to the way we have defined it in R. We calculate the lagged difference by user_id and assign a 1 if it exceeds 10 seconds and for every first row of a new user we assign a 1 as well. In the bottom query we calculate the cumulative sum that gives us our session_id.
We have looked at the important concept of User Sessions and at the logic of how this can be calculated. The implementation was demonstrated with R as well as with SQL and focused on the example of Searches done in an Online Marketplace. This was a basic example and you would often add further conditions for specific cases.
These Session ID’s can be viewed as a cornerstone on which to build further metrics of interest like the average session duration or a count of specific events per session. For our example we might want to look into which search filters are used regularly by our customers and which almost never that could be removed.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gröninger (2023, June 3). Lukas R Blog: Searching for User Sessions. Retrieved from https://lukas-r.blog/posts/2023-06-03-searching-for-user-sessions/
BibTeX citation
@misc{gröninger2023searching, author = {Gröninger, Lukas}, title = {Lukas R Blog: Searching for User Sessions}, url = {https://lukas-r.blog/posts/2023-06-03-searching-for-user-sessions/}, year = {2023} }