Taking Your Facebook Messenger Data Further.
October 28, 2021
Disclaimer: All the visualisations in this project were created using my personal Facebook data. All of the data has been anonymised, and all participants consented to their data being used for this article.
Given the amount of personal data Facebook collects, it probably knows you better than you know yourself. That's a bit concerning, although it is not all doom and gloom. Most of this data is a few clicks away, making it a perfect data source for some personal projects.
Here's a quick refresher on how to retrieve your data:
- Settings & Privacy > Your Facebook Information > Download Your Information
- Change format to
- Create File (this can take a while depending on your date range and media quality)
This will provide you with a range of data that Facebook tracks. It's an immense amount to cover in just one article, so we will focus on Facebook Messenger data. You can find this data in the
Facebook Messenger data is quite diverse, ranging from text messages to many other file types (photos, videos, audio, etc.). Initially, this dataset can be overwhelming. However, after a bit of preprocessing and modelling with SAYN, the dataset becomes more manageable and gives us a lot of creative freedom. So here are a few neat things I have managed to create from my data (full project code here):
Periodic Word Clouds
Most of the data we send through Messenger is text data, so visualising this data is a great starting point. Initially, we could generate a word cloud for each of our conversations and see how they differ. We can take this further by looking at how our topics evolve for each conversation by creating a word cloud timelapse. Here is an example from one of my conversations:
While changes can be subtle, we can see that key topics change over time. In this case, we can see a shift to more gaming-related terms (e.g. grind, GTA, etc.) after 2019.
This is quite a basic example, but it can be taken further by adding more stopwords to filter out some common words. Determining what those common words are will vary for each individual, so this process involves a bit of trial and error.
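The grouping step behind a word cloud timelapse can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes each message has already been reduced to a (timestamp, text) pair, and the function name, the stopword list and the sample messages are all made up. The per-period frequencies it produces could then be handed to a word cloud library, for example the `generate_from_frequencies` method of the `wordcloud` package.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical starter stopword list; extend it by trial and error.
STOPWORDS = {"the", "a", "and", "to", "i", "you", "it", "is", "of", "in", "that"}


def word_counts_by_period(messages, stopwords=STOPWORDS, period_format="%Y"):
    """Group (datetime, text) pairs by period and count the remaining words."""
    counts = defaultdict(Counter)
    for sent_at, text in messages:
        period = sent_at.strftime(period_format)
        # Lowercase the text and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts[period].update(w for w in words if w not in stopwords)
    return dict(counts)


# Made-up sample messages, purely for illustration.
sample = [
    (datetime(2018, 5, 1), "Are you going to the lecture?"),
    (datetime(2019, 7, 3), "Time to grind GTA tonight"),
    (datetime(2019, 7, 4), "GTA again? The grind never stops"),
]
yearly = word_counts_by_period(sample)
print(yearly["2019"].most_common(2))
```

Passing `period_format="%Y-%m"` would give monthly frames instead of yearly ones, at the cost of sparser clouds for quiet conversations.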
We all have that friend who takes an eternity to reply to a message, which can feel quite frustrating. So let us quantify this frustration by calculating a reply time metric. We can then compare this metric between friends to see if this is normal or just a quirk of that individual. Here's an example:
It's clear that User 2 tends to have quicker reply times, but even so, the reply times for both people seem to be quite good. There is a noticeable increase in reply time over the last few months. This increase could be explained by a shift from messages to calls.
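One way to compute a reply time metric is sketched below. It assumes a reply is any message whose sender differs from the sender of the previous message, and measures the gap between the two; that definition, the function name and the sample timestamps are illustrative assumptions, not necessarily how the charts in this article were produced.

```python
from datetime import datetime
from statistics import median


def reply_times(messages):
    """Return {sender: [seconds taken to reply, ...]}.

    `messages` is a chronologically sorted list of (sender, datetime) pairs.
    A reply is counted only when the sender differs from the previous one.
    """
    times = {}
    for (prev_sender, prev_at), (sender, sent_at) in zip(messages, messages[1:]):
        if sender != prev_sender:
            gap = (sent_at - prev_at).total_seconds()
            times.setdefault(sender, []).append(gap)
    return times


# Made-up sample conversation, purely for illustration.
sample = [
    ("User 1", datetime(2021, 1, 1, 12, 0)),
    ("User 2", datetime(2021, 1, 1, 12, 2)),
    ("User 2", datetime(2021, 1, 1, 12, 3)),
    ("User 1", datetime(2021, 1, 1, 12, 13)),
]
per_user = reply_times(sample)
median_reply = {user: median(values) for user, values in per_user.items()}
print(median_reply)
```

The median is used here instead of the mean because overnight gaps and abandoned conversations would otherwise dominate the average.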
These results paint an interesting picture but can be slightly misleading. Longer messages take longer to type out, which can result in a longer reply time. Let's look into the average message length to test this claim:
We can see that User 1 usually sends longer messages than User 2. This could explain why User 1 has a longer reply time, highlighting why we should not rely on a single metric.
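For reference, average message length per person can be computed in a few lines. This sketch assumes messages are available as (sender, text) pairs and measures length in characters; both choices, along with the sample data, are assumptions made for illustration.

```python
from collections import defaultdict


def average_message_length(messages):
    """Return {sender: mean number of characters per message}."""
    totals = defaultdict(lambda: [0, 0])  # sender -> [total characters, message count]
    for sender, text in messages:
        totals[sender][0] += len(text)
        totals[sender][1] += 1
    return {sender: chars / count for sender, (chars, count) in totals.items()}


# Made-up sample messages, purely for illustration.
sample = [
    ("User 1", "See you at eight then"),
    ("User 1", "ok!"),
    ("User 2", "yes"),
]
averages = average_message_length(sample)
print(averages)
```

Counting words instead of characters would be a reasonable alternative if the conversation contains many long links.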
By focusing on reply time, we also lose a lot of context from our conversations. Consider the scenario of a heated argument, where reply times are great, but the content may not be.
So let's see if we can regain some context by looking at the sentiment scores of these conversations. For this analysis, we will use the vaderSentiment package to generate our sentiment scores. VADER is specifically attuned to sentiment expressed on social media, making it well suited to this analysis. Following the typical threshold values, we will treat compound scores of 0.05 and above as positive, scores of -0.05 and below as negative, and anything in between as neutral.
Overall, the conversations seem to be quite positive, although there are some interesting trends. For example, we can see that User 2's sentiment has declined steadily since 2020, implying increased usage of negative vocabulary over time. This trend would have gone unnoticed if we had relied on the previous two metrics, showcasing the value of looking at a range of engagement metrics.
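The scoring and thresholding described above might look something like this. It is a rough sketch, not the project's code: `label_sentiment` and `score_messages` are hypothetical helper names, and the scores fed to the labelling step are invented. The VADER import is kept inside the function so the labelling logic can be tried without the package installed.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a label using the usual cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_messages(texts):
    """Return the VADER compound score for each text.

    Requires the vaderSentiment package (pip install vaderSentiment).
    """
    # Imported lazily so label_sentiment stays usable without the package.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


# Invented compound scores, purely to illustrate the thresholds.
labels = [label_sentiment(score) for score in (0.62, 0.0, -0.4)]
print(labels)
```

Averaging the compound scores per person and per month would then give a trend line comparable to the one discussed above.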
After identifying our key engagement metrics, we can showcase them in a Metabase dashboard and get a better overview of our data. Here's an example of what that could look like:
So far, I have highlighted engagement metrics that I find interesting, but there are many more you could add. Furthermore, we could try combining all of our metrics to create an overall engagement score; this is quite a dense topic and is beyond the scope of this article.
So far we have only focused on the messages we send and receive. However, there are other actions you can take, like reactions. In a way, reactions are a great way to expand on the message sentiment I covered earlier. So let's take a quick look at the distribution:
There seems to be quite a bit of variation between me and User 1. User 1 tends to react more frequently and positively during our conversations. However, arguably the strongest reaction, 😍, is used very rarely by both of us, and at different rates.
We can also see that the ♥️ reaction is extremely rare. This reflects a major overhaul of the reaction system that occurred in 2020. Originally there were only 6 types of reactions (the ones seen in the above graph, excluding ♥️); this changed in 2020 when custom reactions were introduced. In addition, the default reaction selection changed, replacing the 😍 reaction with ♥️. We can see this change reflected in the data below:
While this default can be changed back to the previous emoji, most users will not go beyond the defaults, which demonstrates how important defaults are to the overall user experience.
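A reaction distribution like the one discussed in this section can be built with a simple tally. The sketch below assumes each message is a dictionary with an optional `reactions` list whose items carry `actor` and `reaction` keys; that layout is an assumption about the export and should be checked against your own files, and the sample data is made up.

```python
from collections import Counter, defaultdict


def reaction_counts(messages):
    """Return {actor: Counter({reaction: count})} across all messages."""
    counts = defaultdict(Counter)
    for message in messages:
        # Messages without reactions simply have no "reactions" key.
        for reaction in message.get("reactions", []):
            counts[reaction["actor"]][reaction["reaction"]] += 1
    return counts


# Made-up sample messages, purely for illustration.
sample = [
    {"content": "Passed my exam!", "reactions": [{"reaction": "👍", "actor": "User 1"}]},
    {"content": "New job!", "reactions": [
        {"reaction": "👍", "actor": "User 1"},
        {"reaction": "😍", "actor": "Me"},
    ]},
    {"content": "See you later"},
]
counts = reaction_counts(sample)
print(dict(counts))
```

Adding the message year to the grouping key would make a change in default reactions visible as a shift between emoji over time.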
Most Frequently Shared Sites
Now that we have a better understanding of our interactions on Messenger, why not look beyond it and see which other platforms dominate our lives? We can get a rough idea by looking at the source of each link that we share and plotting their counts as a bar chart. But why not take this further and generate a bar chart race instead?
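Finding the source of each shared link mostly comes down to extracting the domain. The following is a minimal sketch using only the standard library; it assumes the links have already been pulled out of the messages into a plain list, and the function name and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(links):
    """Count shared links per domain, ignoring a leading 'www.'."""
    counts = Counter()
    for link in links:
        host = urlparse(link).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip strings that are not absolute URLs
            counts[host] += 1
    return counts


# Made-up sample links, purely for illustration.
sample_links = [
    "https://www.youtube.com/watch?v=abc",
    "https://youtube.com/watch?v=def",
    "https://github.com/example/repo",
    "not a link",
]
top_sites = domain_counts(sample_links).most_common()
print(top_sites)
```

For a bar chart race, the same count would be repeated per time period and accumulated, so that each frame shows the running total up to that date.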
This was a mere glimpse of what you can do with your Facebook data, with plenty of file types (photos, GIFs, audio, video, etc.) and data categories left untouched. Given the volume of data, the possibilities are endless. So I hope this article has inspired you to do more with your data. If you would like to try using your own data, you can find the code here.