\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{graphics}
\usepackage{color}
\usepackage{url}

\begin{document}

\author{Karsten Loesing\\{\tt karsten@torproject.org}}
\title{Case study:\\Learning whether a Tor bridge is blocked\\by looking
at its aggregate usage statistics\\-- Part one --}
\maketitle

\section{Introduction}

Tor bridges\footnote{\url{https://www.torproject.org/docs/bridges}} are
relays that are not listed in the main directory.
Clients that cannot access the Tor network directly can try to learn a
few bridge addresses and use these bridges to connect to the Tor network.
Bridges were introduced to make censoring the Tor network harder, but in
the past we have seen bridges blocked successfully in a few countries.

In this report we investigate whether we can learn that a bridge is
blocked in a given country only by looking at its reported aggregate
statistics on usage by country.
Knowing that a bridge is blocked would allow us, for example, to avoid
giving out its address to users from that country.

Learning whether a bridge is blocked is somewhat related to our recent
efforts to detect censorship of direct access to the Tor
network.\footnote{\url{https://metrics.torproject.org/papers/detector-2011-09-09.pdf}}
The main difference is that we want to know which bridges are blocked and
which are not, whereas we don't care which relays are accessible in the
case of blocked direct access.
It's easy to block all relays, but it should be difficult to block all
bridges.

This report can only be seen as a first step towards researching bridge
blocking.
Even if a bridge reports that it had zero users from a country, we lack
confirmation that the bridge was really blocked.
There can be other, completely unrelated reasons for low user numbers.
The results of this analysis should be considered when actively scanning
bridge reachability from inside a country, both to decide how frequently a
bridge should be scanned and to evaluate how reliable an analysis of
passive usage statistics can be.

\section{Bridge usage statistics}

Bridges report aggregate usage statistics on the number of connecting
clients.
Bridges gather these statistics by memorizing unique IP addresses of
connecting clients over 24-hour periods and resolving IP addresses to
country codes using an internal GeoIP database.
Archives of these statistics are available for analysis from the metrics
website.\footnote{\url{https://metrics.torproject.org/data.html#bridgedesc}}
Figure~\ref{fig:bridgeextrainfo} shows an example of bridge usage
statistics.
This bridge observed 41 to 48 connecting clients from Saudi Arabia
(all numbers are rounded up to the next multiple of 8), 33 to 40
connecting clients from the U.S.A., 25 to 32 from Germany, 25 to 32 from
Iran, and so on.
These connecting clients were observed in the 24~hours (86,400 seconds)
before December 27, 2010, 14:56:29 UTC.

\begin{figure}[h]
\begin{quote}
\begin{verbatim}
extra-info Unnamed A5FA7F38B02A415E72FE614C64A1E5A92BA99BBD
published 2010-12-27 18:55:01
[...]
bridge-stats-end 2010-12-27 14:56:29 (86400 s)
bridge-ips sa=48,us=40,de=32,ir=32,[...]
\end{verbatim}
\end{quote}
\caption{Example of aggregate bridge usage statistics}
\label{fig:bridgeextrainfo}
\end{figure}
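
The rounding of the reported numbers can be illustrated with a short
Python sketch.
It is only an illustration under stated assumptions: the function names
\verb+parse_bridge_ips+, \verb+client_interval+ and \verb+round_up+ are
hypothetical and not part of Tor or the metrics tools, and the parser only
handles the simple \verb+cc=count+ format shown in
Figure~\ref{fig:bridgeextrainfo}.

```python
import math

BIN_SIZE = 8  # reported counts are rounded up to the next multiple of 8


def parse_bridge_ips(line):
    """Parse a 'bridge-ips' line into {country_code: reported_count}."""
    keyword, _, rest = line.strip().partition(" ")
    if keyword != "bridge-ips":
        raise ValueError("not a bridge-ips line")
    counts = {}
    for entry in rest.split(","):
        country, sep, value = entry.partition("=")
        # Skip elided or malformed entries such as "[...]".
        if sep and value.isdigit():
            counts[country] = int(value)
    return counts


def client_interval(reported):
    """Return the (low, high) range of observed clients behind a reported value."""
    if reported <= 0:
        return (0, 0)
    return (reported - BIN_SIZE + 1, reported)


def round_up(clients):
    """Round an observed client count up to the next multiple of BIN_SIZE."""
    return BIN_SIZE * math.ceil(clients / BIN_SIZE)


counts = parse_bridge_ips("bridge-ips sa=48,us=40,de=32,ir=32,[...]")
for country, reported in counts.items():
    low, high = client_interval(reported)
    print(f"{country}: {low} to {high} clients")
```

For the example line, the reported value 48 for Saudi Arabia corresponds
to 41 to 48 observed clients, which matches the reading given in the text.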

An obvious limitation of these bridge usage statistics is that we can only
learn about connecting clients from bridges with at least 24 hours of
uptime.
It's still unclear how many bridge users are missing from the statistics
because of this; that question is left for a different analysis.

We further decided to exclude bridges running Tor versions 0.2.2.3-alpha
or earlier.
These bridges report statistics similar to those of the later Tor versions
that we're considering here, but do not enforce a measurement interval of
exactly 24 hours, which would have slightly complicated the analysis.
We don't expect the bridge version to have an influence on bridge usage
or on the likelihood of the bridge being blocked in a given country.

\section{Case study: China in the first half of 2010}

The major limitation of this analysis is that we don't have data
confirming that a bridge was actually blocked.
We can decide on a case-by-case basis whether blocking is a plausible
explanation for a change in observed users from a given country.
Anything more objective requires additional data, e.g., data obtained from
active reachability scans.

We decided to investigate bridge usage from China in the first half of
2010 as a case study.
Figure~\ref{fig:bridge-users} shows estimated daily bridge users from China
since July 2009.
The steep increase in September and October 2009 is very likely a result of
China blocking direct access to the Tor network.
It seems plausible that the drops in March and May 2010 result from
attempts to block access to bridges, too.
We're going to focus only on the interval from January to June 2010, which
promises the most interesting results.
We should be able to detect these blockings in the reported statistics of
single bridges.
Obviously, it may be hard or impossible to transfer the findings from this
case study to other countries or situations.

\begin{figure}
\includegraphics[width=\textwidth]{bridge-users.png}
\caption{Estimated daily bridge users from China}
\label{fig:bridge-users}
\end{figure}

\paragraph{Definition of bridge blocking}

We have a few options for defining when we consider a bridge to be blocked
from a given country on a given day.

\begin{itemize}
\item \textbf{Absolute threshold:}
The absolute number of connecting clients from a country falls below a
fixed threshold.
\item \textbf{Relative threshold compared to other countries:}
The fraction of connecting clients from a country drops below a fixed
percentage.
\item \textbf{Estimated interval based on history:}
The absolute or relative number of connecting clients falls outside an
estimated interval based on the recent history.
\end{itemize}

For this case study we decided to stick with the simplest option, an
absolute threshold.
We define a somewhat arbitrary threshold of 32 users to decide whether a
bridge is potentially blocked.
A blocked bridge does not necessarily report zero users per day.
A likely explanation for a bridge still reporting users from a country
that blocks it is that our GeoIP database is not 100~\% accurate and
attributes a few users to that country who in fact come from other
countries.
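
A minimal Python sketch of such an absolute-threshold check might look as
follows.
The data layout (a dictionary mapping dates to per-country reported
counts), the function names and the example numbers are assumptions made
for illustration only.
The sketch reads the threshold as ``32 or fewer reported clients'' and
treats a country missing from a report as zero clients, which is also an
assumption.

```python
THRESHOLD = 32  # somewhat arbitrary absolute threshold from the case study


def is_potentially_blocked(reported_clients, threshold=THRESHOLD):
    """Return True if the reported count is at or below the threshold."""
    return reported_clients <= threshold


def potentially_blocked_days(daily_counts, country, threshold=THRESHOLD):
    """Return the sorted dates on which a bridge looks blocked for a country.

    daily_counts maps a date string to {country_code: reported_count}.
    A country missing from a day's report counts as zero clients (assumption).
    """
    return sorted(
        day
        for day, counts in daily_counts.items()
        if is_potentially_blocked(counts.get(country, 0), threshold)
    )


# Hypothetical reports of a single bridge, not real data.
example = {
    "2010-03-01": {"cn": 120, "us": 40},
    "2010-03-02": {"cn": 24, "us": 40},
    "2010-03-03": {"us": 48},
}
blocked = potentially_blocked_days(example, "cn")
```

Note that such a check only flags candidate days; without confirmation
from other data it cannot tell a blocked bridge from an unpopular one.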

We decided against a relative threshold because it depends on
developments in other countries.
As we can see in the example of China, bridge usage can depend on the
ability to directly access the Tor network.
A sudden increase in bridge usage from country $A$ could significantly
lower the relative usage from country $B$, even though nothing changed in
country $B$.
We should probably consider both absolute and relative thresholds in
future investigations.
Maybe we also need to take direct usage numbers into account.
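
The effect of country $A$ on the relative usage of country $B$ can be
made concrete with a small Python sketch.
The function name \verb+country_share+ and the numbers are made up for
illustration and do not come from the measured data.

```python
def country_share(counts, country):
    """Return the fraction of a bridge's reported clients from one country."""
    total = sum(counts.values())
    return counts.get(country, 0) / total if total else 0.0


# Hypothetical reports: usage from A quadruples, usage from B is unchanged.
before = {"A": 40, "B": 40}
after = {"A": 160, "B": 40}

share_before = country_share(before, "B")  # 40 / 80
share_after = country_share(after, "B")    # 40 / 200
```

In this made-up example the share of country $B$ falls from one half to
one fifth although its absolute number of clients stays at 40, so a fixed
relative threshold between these two values would flag a blocking that
did not happen.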

We also didn't build our analysis upon an estimated interval based on the
recent history, because it's unclear how fast a bridge will be blocked
after being set up.
If it only takes the censor a few hours, the bridge may never see much use
from a country at all.
An estimate based on the bridge's history may then fail to detect the
censorship, because the bridge simply looks like one with only a few users
from that country.

We plan to reconsider the other options for deciding that a bridge is
blocked once we have data confirming blockings.

\paragraph{Visualization of bridge blockings}

Figure~\ref{fig:bridge-blockings} shows a subset of the raw bridge usage
statistics for clients connecting from China in the first half of 2010.
Possible blocking events are days on which a bridge reports 32 or fewer
connecting clients.
These events are marked with red dots.

We decided to only include bridges in the figure that report at least
100~Chinese clients on at least one day in the whole interval.
Bridges with fewer users than that have a usage pattern that makes it much
more difficult to detect blockings at all.
The figure also shows only bridges reporting statistics on at least 30
days in the measurement interval.
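
The two selection criteria for the figure could be sketched in Python as
follows.
The nested dictionary layout, the function name \verb+select_bridges+ and
the constant names are assumptions for illustration; this is not the code
that produced the figure.

```python
MIN_PEAK_CLIENTS = 100   # at least this many reported clients on one day
MIN_REPORTING_DAYS = 30  # at least this many days with reported statistics


def select_bridges(history, country="cn"):
    """Return fingerprints of bridges that satisfy both selection criteria.

    history maps a bridge fingerprint to {date: {country_code: reported}}.
    """
    selected = []
    for fingerprint, days in history.items():
        # Highest reported count from the country on any single day.
        peak = max((counts.get(country, 0) for counts in days.values()),
                   default=0)
        if peak >= MIN_PEAK_CLIENTS and len(days) >= MIN_REPORTING_DAYS:
            selected.append(fingerprint)
    return sorted(selected)
```

Because reported counts are multiples of 8, a bridge passes the first
criterion in this sketch only if it reports 104 or more clients on some
day.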

\begin{figure}[t]
\includegraphics[width=\textwidth]{bridge-blockings.png}
\caption{Subset of bridge usage statistics for Chinese clients in the
first half of 2010}
\label{fig:bridge-blockings}
\end{figure}

The per-bridge usage plots indicate how difficult it is to detect
blockings from usage statistics alone.
About 10 of the 27 displayed plots have a pattern similar to the expected
pattern from Figure~\ref{fig:bridge-users}.
The best examples are probably bridges \verb+C037+ and \verb+D795+.
Interestingly, bridge \verb+A5FA+ was unaffected by the blocking in March
2010, but affected by the blocking in May 2010.

\paragraph{Aggregating blocking events}

As the last step of this case study we want to compare observed bridge
users to the number of blocked bridges as detected by our simple threshold
approach.
We would expect most of our bridges to exhibit blockings in March 2010 and
from May 2010 on.
Figure~\ref{fig:bridge-users-blockings} plots users and blocked bridges
over time.
The two plots indicate that our detection algorithm is at least not
totally off.
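
Counting assumed blockings per day could be sketched in Python as follows.
The data layout and the function name \verb+blocked_bridges_per_day+ are
hypothetical, and the sketch only counts bridges that reported statistics
on a given day, which is an assumption about how missing reports are
handled.

```python
from collections import Counter


def blocked_bridges_per_day(history, country="cn", threshold=32):
    """Count, per day, the bridges reporting `threshold` or fewer clients.

    history maps a bridge fingerprint to {date: {country_code: reported}}.
    Bridges without a report on a day are not counted for that day.
    """
    blocked = Counter()
    for days in history.values():
        for day, counts in days.items():
            blocked[day] += 0  # make sure every reported day appears
            if counts.get(country, 0) <= threshold:
                blocked[day] += 1
    return dict(sorted(blocked.items()))
```

The resulting per-day counts are what one would plot next to the estimated
user numbers to compare the two series.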

\begin{figure}[t]
\includegraphics[width=\textwidth]{bridge-users-blockings.png}
\caption{Estimated users and assumed bridge blockings in China in the
first half of 2010}
\label{fig:bridge-users-blockings}
\end{figure}

\section{Conclusion}

Passively collected bridge usage statistics seem to be a useful tool for
detecting whether a bridge is blocked from a country.
However, the main conclusion from this analysis is that we lack the data
to conduct it usefully.
One way to obtain the data we need is to run active scans.
When conducting such scans, passively collected statistics may help reduce
the total number and frequency of scans.
For example, when selecting a bridge to scan, the reciprocal of the last
reported number of connecting clients could be used as a probability
weight.
Once we have better data confirming bridge blocking, we shall revisit the
criteria for deriving the blocking from usage statistics.
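
The reciprocal weighting suggested above for selecting a bridge to scan
could be sketched in Python as follows.
The function names, the bridge labels and the handling of bridges that
report zero clients (their count is treated as 1 so that the reciprocal is
defined) are assumptions for illustration, not a tested scanning strategy.

```python
import random


def scan_weights(last_reported):
    """Return normalized weights proportional to 1 / reported clients.

    last_reported maps a bridge label to its last reported client count.
    A count of zero is treated as 1 (assumption) to avoid division by zero.
    """
    raw = {bridge: 1.0 / max(count, 1)
           for bridge, count in last_reported.items()}
    total = sum(raw.values())
    return {bridge: weight / total for bridge, weight in raw.items()}


def pick_bridge_to_scan(last_reported, rng=random):
    """Draw one bridge at random according to the reciprocal weights."""
    weights = scan_weights(last_reported)
    bridges = list(weights)
    return rng.choices(bridges, weights=[weights[b] for b in bridges], k=1)[0]


# Hypothetical last reports for two bridges, not real data.
weights = scan_weights({"bridge-x": 8, "bridge-y": 32})
```

With these made-up numbers the bridge that reported 8 clients is four
times as likely to be scanned as the one that reported 32, so scans
concentrate on bridges whose low usage makes a blocking more plausible.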

\end{document}