Thursday, February 2, 2017

How to enable/disable privacy protection in Google Analytics (it's easy to get wrong!)

In my survey last year of ARL library web services, I found that 72% of them used Google Analytics. So it's not surprising that a common response to my article about leaking catalog searches to Amazon was to wonder whether the same thing is happening with Google Analytics.

The short answer is "It Depends". It might be OK to use Google Analytics on a library search facility, if the following things are true:
  1. The library trusts Google on user privacy. (Many do.)
  2. Google is acting in good faith to protect user privacy and is not acting under legal compulsion to act otherwise. (We don't really know.)
  3. Google Analytics is correctly doing what their documentation says they are doing and not being circumvented by the rest of Google. (They don't always.)
  4. The library has implemented Google Analytics correctly to enable user privacy.
There's an entire blog post to write about each of the first three conditions, but I have only so many hours in a day.  Given that many libraries have decided that the benefits using of Google Analytics outweigh the privacy risks, the rest of this post concerns only this last condition. Of the 72% of ARL libraries that use Google Analytics, I find that only 19% of them have implemented Google Analytics with privacy-protection features enabled.

So, if you care about library privacy but can't do without Google Analytics, read on!

Google Analytics has a lot of configuration options, which is why webmasters love it. For the purposes of user privacy, however, there are just two configuration options to pay attention to, the "IP Anonymization" option and the "Display Features" option.

IP Anonymization says to Google Analytics "please don't remember the exact IP address of my users". According to Google, enabling this mode masks the least significant bits of the user's IP address before the IP address is used or saved. Since many users can be identified by their IP address, this prevents anyone from discovering the search history for a given IP address. But remember, Google is still sent the IP address, and we have to trust that Google will obscure the IP address as advertised, and not save it in some log somewhere. Even with the masked IP address, it may still be possible to identify a user, particularly if a library serves a small number of geographically dispersed users.

"Display Features" says to Google to that you don't care about user privacy, and it's OK to track your users all to hell so that you can get access to "demographic" information. To understand what's happening, it's important to understand the difference between "first-party" and "third-party" cookies, and how they implicate privacy differently.

Out of the box, Google Analytics uses "first party" cookies to track users. So if you deploy Google Analytics on your "library.example.edu" server, the tracking cookie will be attached to the library.example.edu hostname. Google Analytics will have considerable difficulty connecting user number 1234 on the library.example.edu domain with user number 5678 on the "sci-hub.info" domain, because the user ids are chosen randomly for each hostname. But if you turn on Display Features, Google will connect the two user ids via a third party tracking cookie from its Doubleclick advertising service. This enables both you and Google to know more about your users. Anyone with access to Google's data will be able to connect the catalog searches saved for user number 1234 to that user's searches on any website that uses Google advertising or any site that has Display Features turned on.

IP Anonymization and Display Features can be configured in Google Analytics in three ways, depending on how it's being configured. The instructions here apply to the "Universal Analytics" script. You can tell a site uses Universal Analytics because the pages execute a javascript named "analytics.js". An older "classic" version of Google Analytics uses a script named "ga.js"; its configuration is similar to that of Universal. More complex websites may use Google Tag Manager to deploy and configure Google Analytics.

Google Analytics is usually deployed on a web page by inserting a script element that looks like this:
<script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('send', 'pageview');
</script>
IP Anonymization and Display Features are turned on with extra lines in the script:
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('require', 'displayfeatures');  // starts tracking users across sites
    ga('set', 'anonymizeIp', true); // makes it harder to identify the user from logs
    ga('send', 'pageview');
The Google Analytics Admin allows you to turn on cross site user tracking, though the privacy impact of what you're doing is not made clear . In the "Data Collection" item of the Tracking info pane, look at the toggle switches for "Remarketing" and "Advertising Reporting Features" if these are switched to "ON", then you've enabled cross site tracking and your users can expect no privacy.

Turning on IP anonymization is not quite as easy and turning on cross-site tracking. You have to add it explicitly in your script or turn it on in Google tag manager (where you won't find it unless you know what to look for!).

To check if cross-site tracking has been turned on in your institution's Google Analytics, use the procedures I described in my article on How to check if your library is leaking catalog searches to Amazon.  First, clear the cookies for your website, then load your site and look at the "Sources" tab in Chrome developer tools. If there's a resource from "stats.g.doubleclick.net", then your website is asking google to track your users across sites. If your institution is a library, you should not be telling Google to track your users across sites.

Bottom line: if you use Google Analytics, always remember that Google is fundamentally an advertising company and it will seldom guide you towards protecting your users' privacy.