How and when have I been searching on Google over the last five years?

I wanted to look at the trends in my Google searches over the past few years. Google provides this data through Google Dashboard and Google Takeout. If you have a Gmail account and have been searching on Google while logged in, you can export your search history going back up to five years. (No wonder Google knows us better than we know ourselves!)

Are you interested in knowing how you have been searching on Google? Go to the links above and download your own data, then follow the steps I describe in detail below to produce a report exactly like this one, but based on your data.

Below, I first read the JSON files Google has given me. Before making them public here, I removed all of my search keywords, since I only wanted to show the time trends of “when” I have been searching on Google.

For my own analysis I had a look at the search keywords too, which gave me good insight into what I have been doing. To remove all the search keywords from the JSON files you exported, you can use a text editor with regex search support (like Atom, Sublime Text, or Notepad++): search for "query_text": (["'])(?:(?=(\\?))\2.)*?\1 and replace every match with "query_text": "**" to anonymize your search keywords.
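If you would rather do the anonymization in R than in a text editor, here is a minimal sketch. The function name `anonymize_queries` is hypothetical, and it uses a simpler pattern than the one above: it assumes the values are double-quoted JSON strings, which may not match every export.

```r
# Hypothetical helper: blank out every "query_text" value in a JSON string.
# Assumption: values are double-quoted; escaped quotes inside them are handled.
anonymize_queries <- function(json_text) {
  pattern <- '"query_text"\\s*:\\s*"(?:[^"\\\\]|\\\\.)*"'
  gsub(pattern, '"query_text": "**"', json_text, perl = TRUE)
}

# Toy input, not real exported data
example <- '{"query_text": "how to \\"quote\\" things", "id": 1}'
anonymize_queries(example)
```

Because `gsub` is vectorized, the same function can be applied to the lines returned by `readLines()` for a file, as long as each key and its value sit on the same line.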

Code to read Google data and have a look

You can use the following code (with the modifications I suggest in the comments) to have a look at your own Google search history. Ping me about the results if you do!

# clean the R workspace
rm(list = ls())
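# Load libraries
# if you don't have them installed, run install.packages("tidyverse") (and likewise for the others)
# tidyverse to allow us to manipulate data, clean it, plot it
library(tidyverse)
# jsonlite to allow us to work with json files which google exports
library(jsonlite)
# lubridate for floor_date() and timevis for the interactive timeline, both used further below
library(lubridate)
library(timevis)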
# Here there is going to be a for loop to read all the json files you have downloaded from Google takeout
# Before, we need to build an R list to store the data so:
# an empty list to store all the time data we take out of each file
time_data_list <- list()
# We need to list all the files in the directory which have the ".json" extension; to use this script on your own data, you will need to modify the directory path
json_file_urls <- list.files("./data/google_searches/", pattern = "\\.json$", full.names = TRUE)
# after listing the json files,  we are going to read them one by one and make a data frame of the search time stamps in each of them
for (j in seq_along(json_file_urls)) {
  # fromJSON is a function in Jsonlite package to read json files
  tmp_json_txt <- fromJSON(txt = json_file_urls[j])
  # call to bind_rows to make a dataframe of all the timestamps and store it as j element of our list
  time_data_list[[j]] <- bind_rows(tmp_json_txt[["event"]][["query"]][["id"]])
}
# call bind_rows once more on all the elements of the list we built above, which are the timestamps in each file, to combine them into one complete dataframe
time_data <- bind_rows(time_data_list)
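To make the nested accessors in the loop easier to follow, here is a small sketch on a toy JSON string. The layout (an `event` array whose `query` objects hold an `id` array of `timestamp_usec` values) is an assumption inferred from the accessors used above, not an official description of the export format, and `do.call(rbind, ...)` stands in for `bind_rows`.

```r
library(jsonlite)

# Toy JSON mimicking the layout the loop above assumes (not a real exported file)
toy_json <- '{"event": [
  {"query": {"id": [{"timestamp_usec": "1349016915768563"}], "query_text": "**"}},
  {"query": {"id": [{"timestamp_usec": "1349016882077715"},
                    {"timestamp_usec": "1346830330912343"}], "query_text": "**"}}
]}'

# fromJSON also accepts a JSON string directly
parsed <- fromJSON(txt = toy_json)

# "id" comes back as a list with one small data frame per search event
id_list <- parsed[["event"]][["query"]][["id"]]

# Stack them into a single data frame of timestamps
toy_times <- do.call(rbind, id_list)
nrow(toy_times)
```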

How close are we, Google and I?

Let’s say I want to know how many times I have searched on Google in the last five years.

# I call sapply to get a simple vector with the number of timestamps in each of the dataframes from the previous step; their sum is the total number of searches
sum(sapply(time_data_list, lengths))
## [1] 34630

Wow, that seems like a lot! It means 6926 searches a year, so how many searches a day? The answer is 18.98. Hmm, so many! We will see more later, when we split the searches by actual date to find the most active day.
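The arithmetic behind those averages can be sketched in a few lines. The variable names are only illustrative, and the calculation assumes exactly five years of 365 days each.

```r
total_searches <- 34630  # total reported above
years_covered  <- 5      # approximate span of the export

per_year <- total_searches / years_covered
per_day  <- per_year / 365  # ignoring leap days

per_year
round(per_day, 2)
```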

Converting timestamps to human-readable dates

The timestamps Google puts inside the JSON files are not human readable; they are in so-called Unix or epoch time, here expressed in microseconds. For example, this is the time of the first search I did five years ago (the starting point of this dataset): 1346830330912343.

So how can we convert them to dates and times we recognize? Below I use the as.POSIXct function in R. You can read more about how this function works, and why we give 1970-01-01 as the origin date and GMT as the time zone, by typing ?as.POSIXct in your R console.

# raising the "scipen" option so that R does not show the time stamps in scientific notation
options(scipen = 25)
# converting time stamps from character to double to be able to convert them to date later
time_data$timestamp_usec <- as.double(time_data$timestamp_usec)
# an example of how to convert dates in the json files to human readable based on GMT
as.POSIXct(sort(time_data$timestamp_usec)[[1]]/1000000, origin = "1970-01-01", tz = "GMT")
## [1] "2012-09-05 07:32:10 GMT"
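The same conversion can be written out step by step for the single timestamp quoted earlier. This is only a sketch: the variable names are illustrative, and "Europe/Rome" is just an example of a local time zone, not necessarily the one you want.

```r
# First timestamp in the dataset, in microseconds since 1970-01-01 00:00:00 UTC
first_usec <- 1346830330912343

# as.POSIXct expects seconds, hence the division by one million
first_time <- as.POSIXct(first_usec / 1e6, origin = "1970-01-01", tz = "GMT")

# The same instant printed in GMT and in an example local time zone
format(first_time, "%Y-%m-%d %H:%M:%S", tz = "GMT")
format(first_time, "%Y-%m-%d %H:%M:%S", tz = "Europe/Rome", usetz = TRUE)
```

The instant itself does not change between the two calls; only the way it is displayed does.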

One last step before heading to more interesting questions is to apply the above function to all the timestamps we have, to get a data frame with clearer dates and times.

# from now on I will use the tibble data frame format, which works more smoothly with dplyr
time_data <- tibble::as_tibble(time_data)
# adding a column which will include clear (human readable time and date)
time_data <- time_data %>% 
  mutate(new_date = as.POSIXct(timestamp_usec/1000000, origin = "1970-01-01", tz = "GMT"))
# also adding two other columns to separate day from hours to use in visualizations
time_data <- time_data %>% 
  separate(col = new_date, into = c("day", "hour"), sep = " ", remove = FALSE)

# convert day to "date" format R will understand
time_data$day <- as.Date(time_data$day)

# adding a column which assigns months of activity (to use later for monthly reports)
time_data$month <- floor_date(time_data$day, "month")
# also let's add month names "as words" to another column, it will come in handy
time_data$month_name <- months(time_data$day)
# besides that, let's take "years" out as well and save them as another column, which will be useful for drawing meaningful plots
time_data$year <- format.Date(time_data$day, "%Y")

# Now what does our main data frame of search times look like?
glimpse(time_data)
## Rows: 34,630
## Columns: 7
## $ timestamp_usec <dbl> 1349016915768563, 1349016882077715, 13487…
## $ new_date       <dttm> 2012-09-30 14:55:15, 2012-09-30 14:54:42…
## $ day            <date> 2012-09-30, 2012-09-30, 2012-09-27, 2012…
## $ hour           <chr> "14:55:15", "14:54:42", "11:31:56", "18:4…
## $ month          <date> 2012-09-01, 2012-09-01, 2012-09-01, 2012…
## $ month_name     <chr> "September", "September", "September", "S…
## $ year           <chr> "2012", "2012", "2012", "2012", "2012", "…
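With a `day` column in this shape, finding the most active day comes down to counting rows per date. Here is a minimal base R sketch on made-up dates (they are not my real search data, and the variable names are only illustrative).

```r
# Toy stand-in for the `day` column (made-up dates)
toy_days <- as.Date(c("2012-09-30", "2012-09-30", "2012-09-27",
                      "2012-09-30", "2012-09-27", "2012-09-05"))

# Count searches per day and sort from busiest to quietest
searches_per_day <- sort(table(toy_days), decreasing = TRUE)

# The name of the first entry is the most active day
most_active_day <- names(searches_per_day)[1]
most_active_day
```

On the real data, the same idea would apply to `time_data$day` in place of `toy_days`.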

Timeline of my activities

Use the mouse scroll wheel to zoom in and out, and drag to move along the timeline. I am going to use these activities as a baseline for comparison, to see whether my Google search trends were affected by the type of activity I was doing at the time.

time_line_data <- read.csv("./data/time_line_data.csv")
time_line_groups <- read.csv("./data/time_line_data_groups.csv")

timevis(data = time_line_data, groups = time_line_groups, fit = TRUE)
# require(timeline)
# require(scales)
# require(ggthemes)
# time_data <- data.frame(
#   Person = c("Host: Paul Wouters, CWTS"),
#   Group = c("Leiden University"),
#   StartDate = as.Date(c("2018-01-15")),
#   EndDate = as.Date(c("2018-04-15"))
#   )
# task_data <- data.frame(  
#   Main_task = c("Evolution of sociological community, ANVUR evaluation and replication"),
#     Group = c("Leiden University"),
#   Date = as.Date(c("2018-01-22"))
# )
# ttt <- timeline(df = time_data, text.size = 3, text.color = c('black'), events = task_data, = "Group", event.col = "Date", event.line = F, event.above = F, event.text.size = 3, event.text.color = c('black'), border.color = c('black')) +
#   scale_x_date(date_breaks = "1 month", 
#                  labels = date_format("%b"),
#                  limits = as.Date(c('2018-01-01','2018-05-01')))
# ttt +  theme_minimal() + theme(legend.position = "none")
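The two CSV files read for the timeline are not shown here, so this is only a sketch of the shape timevis expects: items need `id`, `content` and `start` columns (with optional `end` and `group`), and groups need `id` and `content`. The item labels are hypothetical, while the dates and group name are borrowed from the commented-out code above.

```r
# Hypothetical timeline items (labels are made up for illustration)
toy_items <- data.frame(
  id      = 1:2,
  content = c("Visiting period", "Main task"),
  start   = c("2018-01-15", "2018-01-22"),
  end     = c("2018-04-15", NA),
  group   = c(1, 1)
)

# Groups need at least an id and a label
toy_groups <- data.frame(id = 1, content = "Leiden University")

# Build the widget only if the timevis package is installed
if (requireNamespace("timevis", quietly = TRUE)) {
  toy_timeline <- timevis::timevis(data = toy_items, groups = toy_groups, fit = TRUE)
}
```

Printing `toy_timeline` in an interactive session shows the widget; an item with no `end` is drawn as a single point in time rather than a range.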

Closing words

This was a simple exercise, first for myself, to remember to “learn R by doing R”, and second to show how amazingly helpful R can be in getting to know myself and my search behavior over time.