How and when have I been searching on Google over the last five years?!
I wanted to have a look at my Google search trends over the past few years. Google provides this data through Google Dashboard and Google Takeout. If you have a Gmail account and have been searching on Google while logged in, you can export your search history going back up to five years. (No wonder Google knows us better than we know ourselves!)
Are you interested to know how you have been searching on Google? Go to the links above and download your own data, then follow the steps I describe in detail below to get a report exactly like this one, but based on your data.
Below I first read the json files Google has given me. But before putting them here in public, I removed all of my search keywords, since I only want to show the time trends of “when” I have been searching on Google.
For my own analysis I had a look at the search keywords too, which gave me a good sense of what I have been doing. (To remove all the search keywords from the json files you exported, open them in a decent text editor with regex support in its search (like Atom, Sublime Text, Notepad++), search for "query_text": (["'])(?:(?=(\\?))\2.)*?\1 and replace every match with "query_text": "**" to anonymize your search keywords.)
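If you prefer to stay inside R for this step, a minimal sketch like the one below should do the same job; the file paths are just placeholders for your own Takeout exports, and the regex is a slightly simplified version of the one above.
# A sketch of the same anonymization done in R instead of a text editor;
# the input and output paths are placeholders for your own files.
anonymize_queries <- function(in_file, out_file) {
  raw_txt <- readLines(in_file, warn = FALSE)
  # replace every "query_text": "..." value (including escaped quotes) with "**"
  clean_txt <- gsub('"query_text"\\s*:\\s*"(\\\\.|[^"\\\\])*"',
                    '"query_text": "**"',
                    raw_txt, perl = TRUE)
  writeLines(clean_txt, out_file)
}
# anonymize_queries("./data/google_searches/searches.json",
#                   "./data/google_searches_anon/searches.json")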
You can use the following code (applying the modifications I suggest in the comments) to have a look at your own Google search history. Ping me about the results if you do!
# clean the R workspace
rm(list = ls())
# Load libraries
# if you don't have them installed, run install.packages("tidyverse") first
# tidyverse to allow us to manipulate data, clean it, plot it
require(tidyverse)
# jsonlite to allow us to work with json files which google exports
require(jsonlite)
# Below there is a for loop that reads all the json files you downloaded from Google Takeout
# Before that, we need to build an R list to store the data:
# an empty list to store all the time data we take out of each file
time_data_list <- list()
# We need to list all the files in the directory with the ".json" extension; to use this script on your own data, change the directory path
json_file_urls <- list.files("./data/google_searches/", pattern = "\\.json$", full.names = T)
# after listing the json files, we are going to read them one by one and make a data frame of the search time stamps in each of them
for (j in seq_along(json_file_urls)) {
# fromJSON is a function in the jsonlite package to read json files
tmp_json_txt <- fromJSON(txt = json_file_urls[j])
# call bind_rows to make a data frame of all the timestamps and store it as the j-th element of our list
time_data_list[[j]] <- bind_rows(tmp_json_txt[["event"]][["query"]][["id"]])
}
# call bind_rows once more on all the elements of the list we built above (the timestamps from each file) to combine them into one complete data frame
time_data <- bind_rows(time_data_list)
Let’s say I want to know: how many times have I searched on Google in the last 5 years?
# sapply gives the number of time stamps in each of the data frames we built in the previous step; summing them gives the total number of searches
sum(sapply(time_data_list, lengths))
## [1] 34630
Wow, that seems like a lot! It means about 6926 searches a year; and how many searches a day? The answer is 18.98. Hmmm, so many! We will see more later when we split the searches by actual dates to find the most active day.
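The yearly and daily averages above come from a simple division; a small sketch of that arithmetic, assuming the export covers roughly five years:
# rough averages, assuming the export covers about 5 years of searches
total_searches <- nrow(time_data)
searches_per_year <- total_searches / 5
searches_per_day <- searches_per_year / 365
round(c(total = total_searches, per_year = searches_per_year, per_day = searches_per_day), 2)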
We see that the time stamps Google has put inside the json files are not human readable: they are in so-called Unix (epoch) time, here expressed in microseconds. For example, this is the time of the first search I did five years ago (the starting point of this dataset): 1346830330912343.
So what can we do to convert them to dates and times we can actually read? Below I am using the as.POSIXct function in R. You can read more about how this function works, and why we pass 1970-01-01 as the origin date and GMT as the time zone, by typing ?as.POSIXct in your R console.
# increase the number of digits R prints so the time stamps are not shown in scientific notation
options(scipen = 25)
# converting time stamps from character to double to be able to convert them to date later
time_data$timestamp_usec <- as.double(time_data$timestamp_usec)
# an example of how to convert dates in the json files to human readable based on GMT
as.POSIXct(sort(time_data$timestamp_usec)[[1]]/1000000, origin = "1970-01-01", tz = "GMT")
## [1] "2012-09-05 07:32:10 GMT"
One last step before heading to more interesting questions is to apply the above function to all the time stamps and get a data frame of human-readable dates and times.
# from now on I will use the tibble data frame format, which gives more possibilities to work with the data frame
time_data <- tibble::as_tibble(time_data)
# add a column which will contain the human-readable date and time
time_data <- time_data %>%
mutate(new_date = as.POSIXct(timestamp_usec/1000000, origin = "1970-01-01", tz = "GMT"))
# also adding two other columns to separate day from hours to use in visualizations
time_data <- time_data %>%
separate(col = new_date, into = c("day", "hour"), sep = " ", remove = F)
# convert day to "date" format R will understand
time_data$day <- as.Date(time_data$day)
# adding a column which assigns months of activity (to use later for monthly reports)
require(lubridate)
time_data$month <- floor_date(time_data$day, "month")
# also let's add the month names "as words" to another column; it will come in handy
time_data$month_name <- months(time_data$day)
# besides that, let's extract the years as well and save them in another column, which will be useful for drawing meaningful plots
time_data$year <- format.Date(time_data$day, "%Y")
# Now, what does our main search-over-time data frame look like?!
glimpse(time_data)
## Rows: 34,630
## Columns: 7
## $ timestamp_usec <dbl> 1349016915768563, 1349016882077715, 13487…
## $ new_date <dttm> 2012-09-30 14:55:15, 2012-09-30 14:54:42…
## $ day <date> 2012-09-30, 2012-09-30, 2012-09-27, 2012…
## $ hour <chr> "14:55:15", "14:54:42", "11:31:56", "18:4…
## $ month <date> 2012-09-01, 2012-09-01, 2012-09-01, 2012…
## $ month_name <chr> "September", "September", "September", "S…
## $ year <chr> "2012", "2012", "2012", "2012", "2012", "…
The first thing I would like to check is my most active day, the day on which I did the maximum number of searches.
# I first group all the searches by day, then count the searches in each day and sort them in descending order to find the most active day; with dplyr verbs it is easy!
active_days <- time_data %>%
group_by(day) %>%
summarise(searches_in_day = n()) %>%
arrange(desc(searches_in_day)) %>%
print()
## # A tibble: 1,666 x 2
## day searches_in_day
## <date> <int>
## 1 2017-07-31 277
## 2 2015-04-18 184
## 3 2017-08-21 142
## 4 2017-08-10 132
## 5 2015-11-27 112
## 6 2015-11-29 95
## 7 2017-01-31 88
## 8 2015-12-19 87
## 9 2015-12-11 85
## 10 2017-08-17 84
## # … with 1,656 more rows
OK, that means I have been actively searching on 1666 days in the last five years. We also see that my most active day was 2017-07-31, with 277 searches! That comes to 23.08 searches an hour if I was awake for 12 of those hours! Wow! Google has been teaching me a lot!
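Out of curiosity, a small sketch like the following (reusing the hour column we created earlier) breaks that busiest day down by hour of the day, to see whether those 277 searches were spread out or crammed into a few hours:
# a sketch: how the searches on my busiest day were spread over the hours of that day
time_data %>%
  filter(day == as.Date("2017-07-31")) %>%
  mutate(hour_of_day = substr(hour, 1, 2)) %>%
  count(hour_of_day) %>%
  arrange(desc(n))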
Although, I should mention that there have been 82 days with only 1 search! And in the plot below we see that the most active days are the exception, not the rule!
plot(table(active_days$searches_in_day, exclude = NULL), xlab = "# of searches in a day", ylab = "Frequency of days with that many searches")
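As a quick sanity check on that "82 days with a single search" figure, a one-liner on the active_days tibble built above counts those days directly:
# a sketch: how many days had exactly one search?
sum(active_days$searches_in_day == 1)
# the same check in dplyr style
active_days %>% filter(searches_in_day == 1) %>% nrow()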
What about my most active year and month? Is there a year or a month in which I have been searching more? Let’s see!
# Same as what we did for days: group all the searches by month, count the searches in each month, and sort them in descending order to find the most active month; with dplyr verbs it is easy!
active_months <- time_data %>%
group_by(month_name) %>%
summarise(searches_in_month = n()) %>%
arrange(desc(searches_in_month)) %>%
print()
## # A tibble: 12 x 2
## month_name searches_in_month
## <chr> <int>
## 1 April 3586
## 2 December 3492
## 3 August 3342
## 4 January 3171
## 5 March 2972
## 6 May 2845
## 7 November 2844
## 8 July 2840
## 9 September 2835
## 10 February 2547
## 11 October 2290
## 12 June 1866
My most active month has been “April”. Apparently the order of my active months doesn’t follow the order of a normal calendar year, so what about a closer look at each year? If we look at these five years, is there a year with more searches than the others? Am I searching more recently, or less?
# Same as what we did for days and months: group all the searches by year and then month, count the searches in each month, and sort them to compare the months across the years; with dplyr verbs it is easy!
active_months_in_years <- time_data %>%
group_by(year, month_name) %>%
summarise(searches_in_month = n()) %>%
arrange(searches_in_month)
# first a simple plot of how I have been searching in these years
plot(table(time_data$year, exclude = NULL), xlab = "Years", ylab = "# of Searches done")
Oops! It seems my most active year was 2015, and in the last two years I have been searching less. Does this mean Google and I are not as close as before? Or am I learning to search more wisely?!
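The counts behind that plot can also be pulled out as a table; a small sketch in the same dplyr style used above (the name argument of count() just labels the resulting column):
# a sketch: number of searches per year, from the most to the least active year
time_data %>%
  count(year, name = "searches_in_year") %>%
  arrange(desc(searches_in_year))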
Let’s have one last look at the months in each of these years! Does my search follow similar trends in each month in different years? In other words, can I find a month in all these years that is the peak of my searches?
ggplot(active_months_in_years, aes(x = month_name, y = searches_in_month, label = year)) +
geom_point() +
geom_text(aes(label = year), hjust = 0, vjust = 0) +
labs(x = "Months", y = "# Searches done") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
It seems not! Apparently, in each of these years I have been searching in different quantities in different months.
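One way to see all of these year-and-month combinations at a glance is a heatmap; here is a sketch built from the same active_months_in_years summary (the calendar ordering of months via the built-in month.name constant and the default fill scale are just one possible choice):
# a sketch: a year-by-month heatmap of search counts, using the summary built above
active_months_in_years %>%
  mutate(month_name = factor(month_name, levels = month.name)) %>%  # put months in calendar order
  ggplot(aes(x = month_name, y = year, fill = searches_in_month)) +
  geom_tile() +
  labs(x = "Month", y = "Year", fill = "# Searches") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))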
Below is an interactive timeline of my activities during these years (made with the timevis package). Use the mouse scroll to zoom in and out to explore it, and drag to move along the timeline. I am going to use these activities as a baseline of comparison, to see whether the trends in my Google searches have been affected by the type of activity I was doing at the time.
require(timevis)
time_line_data <- read.csv("./data/time_line_data.csv")
time_line_groups <- read.csv("./data/time_line_data_groups.csv")
timevis(data = time_line_data, groups = time_line_groups, fit = T)
This was a simple exercise, first of all for myself, as a reminder to “learn R by doing R”, and besides that to show how amazingly helpful R can be in getting to know myself and my search behavior over time.