Searching for User Sessions

R SQL

How to assign Session ID’s to group event data.

Author

Affiliation

Lukas Gröninger

 

Published

June 2, 2023

Citation

Gröninger, 2023

This blog post will show how we can calculate user session_id’s and group event data. It will be illustrated with the help of R and SQL.

The concept of a User Session

I’m working in the realm of (online) Data Analytics. That means I’m interested in analyzing the behaviour of people who use our products and how they interact with it. A helpful concept in this domain is that of a User Session. Maybe u have heard about it, maybe not. But here u go:

A session is a group of interactions between a user and an application that take place within a given timeframe. A single session can contain multiple activities (such as page views, events, social interactions, and e-commerce transactions)see https://auth0.com/docs/manage-users/sessions

This concept leads to further metrics that are of interest such as the number of sessions, the average duration of a session or the probability of a user/customer performing certain actions within a session. Let’s take a simple example to illustrate this idea and narrow it down.

Searching in an online marketplace

Imagine I’m searching for a new bike and visit Kleinanzeigen.de (a german classifieds site where u can buy and sell pretty much anything). Typically I enter the thing of interest (here a bike) and get already proposals (see screenshot). As soon as I click on any of these, the results load and are displayed. This typically triggers an event that a user has done a search.

Searching for a bike

However for the process of finding a bike for me, it is probably not sufficient just to specify that I search for a bike. I will also want to include the location and possibly apply other filter options such as a price range.

Further Filter Options

After having entered the location of e.g. Hamburg-Eimsbüttel and defined a price range from let’s say 200 to 500 €, this resulted in multiple triggered search events. But in a more general sense, we might argue that these events belong together and result in one search (session). At this point we want to group these searches together and assign one session_id. In order to make it easier to comprehend, I am creating dummy data of how these searches may be stored.

Calculating a session_id in R

# required libraries
library(tidyverse)
library(dbplyr)

At first, I create a dataframe with 3 columns:

# Create dummy data
searches <- tibble(search_id = 1:10, 
                   user_id = rep(c(1,2), each = 5), 
                   created_at = ymd_hms(
                        c("2023-06-03 08:10:00", "2023-06-03 08:10:08",
                          "2023-06-03 08:11:21", "2023-06-03 08:11:24",
                          "2023-06-03 08:11:33", "2023-06-03 08:10:48",
                          "2023-06-03 08:11:05", "2023-06-03 08:11:11",
                          "2023-06-03 08:12:13", "2023-06-03 08:12:20"))) |> 
  arrange(user_id, created_at)
ABCDEFGHIJ0123456789
search_id
<int>
user_id
<dbl>
created_at
<dttm>
112023-06-03 08:10:00
212023-06-03 08:10:08
312023-06-03 08:11:21
412023-06-03 08:11:24
512023-06-03 08:11:33
622023-06-03 08:10:48
722023-06-03 08:11:05
822023-06-03 08:11:11
922023-06-03 08:12:13
1022023-06-03 08:12:20

We’ve got 10 individual searches of 2 users. Each user has done 5 searches and we want to know how to assign session_id’s to these searches. Maybe someone in our team has already done some research and came to the conclusion that it is best to define the timeframe of a search session to be 10 seconds. This means if a subsequent search of a user is done within 10 seconds, these belong to the same search session_id.

Therefore we want to first sort our data based on user_id and created_at. Then we can calculate the time difference between searches. In base R we’ve got the nice function diff which as the name suggests calculates the lagged difference.

diff(c(1, 1, 3, 4, 2))
[1]  0  2  1 -2

We have to pad the first value to get a vector of the same length.

searches |> 
  mutate(time_dif = c(0, diff(created_at)),
         time_dif_reset = if_else(time_dif >= 10, TRUE, FALSE),
         user_reset = if_else(c(1, diff(user_id)) != 0, TRUE, FALSE))
ABCDEFGHIJ0123456789
search_id
<int>
user_id
<dbl>
created_at
<dttm>
time_dif
<dbl>
112023-06-03 08:10:000
212023-06-03 08:10:088
312023-06-03 08:11:2173
412023-06-03 08:11:243
512023-06-03 08:11:339
622023-06-03 08:10:48-45
722023-06-03 08:11:0517
822023-06-03 08:11:116
922023-06-03 08:12:1362
1022023-06-03 08:12:207

We added a boolean variable indicating whether the time difference between the following search is greater than 10 seconds. If this is the case the session is “reset” and starts again (here I use dplyr::if_else() which would not be necessary but I feel it makes it easier to follow). What we also have to account for is the fact that a new session starts for every user. This can be achieved in a similar way. Here we pad the first value of user_reset with a 1 because we want to start our session_id variable with 1.

What is now left is to combine these 2 time_dif_reset and user_reset variables into a general reset variable (time_dif_reset OR user_reset). The cumulative sum of this variable will then give us our desired session_id.

searches |> 
  mutate(time_dif = c(0, diff(created_at)),
         time_dif_reset = if_else(time_dif >= 10, TRUE, FALSE),
         user_reset = if_else(c(1, diff(user_id)) != 0, TRUE, FALSE),
         reset = if_else(time_dif_reset | user_reset, TRUE, FALSE),
         session_id = cumsum(reset)) 
ABCDEFGHIJ0123456789
search_id
<int>
user_id
<dbl>
created_at
<dttm>
time_dif
<dbl>
time_dif_reset
<lgl>
user_reset
<lgl>
112023-06-03 08:10:000FALSETRUE
212023-06-03 08:10:088FALSEFALSE
312023-06-03 08:11:2173TRUEFALSE
412023-06-03 08:11:243FALSEFALSE
512023-06-03 08:11:339FALSEFALSE
622023-06-03 08:10:48-45FALSETRUE
722023-06-03 08:11:0517TRUEFALSE
822023-06-03 08:11:116FALSEFALSE
922023-06-03 08:12:1362TRUEFALSE
1022023-06-03 08:12:207FALSEFALSE

This tells us now that the first user has done 2 search sessions (session_id 1 and 2) and the second user has done 3 search sessions (session_id 3, 4 and 5). We would probably not divide this calculation into several steps, but combine them into a function.

# Put it into a function
create_session_id <- function(created_at, user_id, time_dif = 10) {
  return(cumsum(c(0, diff(created_at)) >= time_dif | 
                c(1, diff(user_id)) != 0))
}

searches |> 
  mutate(session_id = create_session_id(created_at, user_id))

Calculating a session_id in SQL

Now we also look at how this can be accomplished in SQL. We’ll create an in memory SQLite database table with our searches data.

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Fill this with our searches data
copy_to(con, searches)

# SQL query to create session_id's
tbl(con, 
    sql(
        "
         with 
         searches_prep as (
            select 
                search_id,
                user_id,
                created_at,
                case when 
                    created_at - lag(created_at) over 
                        (partition by user_id order by created_at) >= 10 
                    or row_number() over 
                        (partition by user_id order by created_at) = 1
                    then 1 
                    else 0 
                end as reset
            from searches) 
         
         select 
            search_id,
            user_id,
            created_at,
            reset,
            sum(reset) over (order by user_id, created_at) as session_id
         from searches_prep
        "
        )) 
# Source:   SQL [10 x 5]
# Database: sqlite 3.38.5 [:memory:]
   search_id user_id created_at reset session_id
       <int>   <dbl>      <dbl> <int>      <int>
 1         1       1 1685779800     1          1
 2         2       1 1685779808     0          1
 3         3       1 1685779881     1          2
 4         4       1 1685779884     0          2
 5         5       1 1685779893     0          2
 6         6       2 1685779848     1          3
 7         7       2 1685779865     1          4
 8         8       2 1685779871     0          4
 9         9       2 1685779933     1          5
10        10       2 1685779940     0          5

Here we’re using a first CTE to create our reset variable similar to the way we have defined it in R. We calculate the lagged difference by user_id and assign a 1 if it exceeds 10 seconds and for every first row of a new user we assign a 1 as well. In the bottom query we calculate the cumulative sum that gives us our session_id.

Conclusion

We have looked at the important concept of User Sessions and at the logic of how this can be calculated. The implementation was demonstrated with R as well as with SQL and focused on the example of Searches done in an Online Marketplace. This was a basic example and you would often add further conditions for specific cases.

These Session ID’s can be viewed as a cornerstone on which to build further metrics of interest like the average session duration or a count of specific events per session. For our example we might want to look into which search filters are used regularly by our customers and which almost never that could be removed.

Footnotes

  1. see https://auth0.com/docs/manage-users/sessions[↩]

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gröninger (2023, June 3). Lukas R Blog: Searching for User Sessions. Retrieved from https://lukas-r.blog/posts/2023-06-03-searching-for-user-sessions/

BibTeX citation

@misc{gröninger2023searching,
  author = {Gröninger, Lukas},
  title = {Lukas R Blog: Searching for User Sessions},
  url = {https://lukas-r.blog/posts/2023-06-03-searching-for-user-sessions/},
  year = {2023}
}