\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{graphics}
\usepackage{color}
\usepackage{url}

\begin{document}

\author{Karsten Loesing\\{\tt karsten@torproject.org}}
\title{Case study:\\Learning whether a Tor bridge is blocked\\by looking
at its aggregate usage statistics\\-- Part one --}
\maketitle

\section{Introduction}

Tor bridges\footnote{\url{https://www.torproject.org/docs/bridges}} are
relays that are not listed in the main directory.
Clients that cannot access the Tor network directly can try to learn a
few bridge addresses and use these bridges to connect to the Tor network.
Bridges were introduced to make censoring the Tor network harder, but in
the past we experienced successful blocking of bridges in a few countries.

In this report we investigate whether we can learn that a bridge is
blocked in a given country only by looking at its reported aggregate
statistics on usage by country.
By knowing that a bridge is blocked, we can, for example, avoid giving
out its address to users from that country.

Learning whether a bridge is blocked is somewhat related to our recent
efforts to detect censorship of direct access to the Tor
network.\footnote{\url{https://metrics.torproject.org/papers/detector-2011-09-09.pdf}}
The main difference is that we want to know which bridges are blocked and
which are not, whereas we don't care which relays are accessible in the
case of blocked direct access.
It's easy to block all relays, but it should be difficult to block all
bridges.

This report can only be seen as a first step towards researching bridge
blocking.
Even if a bridge reports that it had zero users from a country, we
lack confirmation that the bridge was really blocked.
There can be other reasons for low user numbers which may be completely
unrelated.
The results of this analysis should be considered when actively scanning
bridge reachability from inside a country, both to decide how frequently a
bridge should be scanned and to evaluate how reliable an analysis of
passive usage statistics can be.

\section{Bridge usage statistics}

Bridges report aggregate usage statistics on the number of connecting
clients.
Bridges gather these statistics by memorizing unique IP addresses of
connecting clients over 24-hour periods and resolving IP addresses to
country codes using an internal GeoIP database.
Archives of these statistics are available for analysis from the metrics
website.\footnote{\url{https://metrics.torproject.org/data.html#bridgedesc}}
Figure~\ref{fig:bridgeextrainfo} shows an example of bridge usage
statistics.
This bridge observed 41 to 48 connecting clients from Saudi Arabia
(all numbers are rounded up to the next multiple of 8), 33 to 40
connecting clients from the U.S.A., 25 to 32 from Germany, 25 to 32 from
Iran, and so on.
These connecting clients were observed in the 24~hours (86,400 seconds)
before December 27, 2010, 14:56:29 UTC.

\begin{figure}[h]
\begin{quote}
\begin{verbatim}
extra-info Unnamed A5FA7F38B02A415E72FE614C64A1E5A92BA99BBD
published 2010-12-27 18:55:01
[...]
bridge-stats-end 2010-12-27 14:56:29 (86400 s)
bridge-ips sa=48,us=40,de=32,ir=32,[...]
\end{verbatim}
\end{quote}
\caption{Example of aggregate bridge usage statistics}
\label{fig:bridgeextrainfo}
\end{figure}

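
The format in Figure~\ref{fig:bridgeextrainfo} is simple enough to parse
with a few lines of code.
The following Python sketch is an illustration only and not part of the
tooling used for this report; the function name \verb+parse_bridge_ips+ is
hypothetical.
It turns a \verb+bridge-ips+ line into the range of clients that each
reported number stands for, assuming that a reported value $n$ means
between $n-7$ and $n$ connecting clients.

```python
def parse_bridge_ips(line):
    """Parse a 'bridge-ips' line into {country: (low, high)} client ranges.

    Assumption: reported numbers are rounded up to the next multiple of 8,
    so a reported value n stands for n-7 to n connecting clients.
    """
    keyword, _, value = line.strip().partition(" ")
    if keyword != "bridge-ips":
        raise ValueError("not a bridge-ips line")
    ranges = {}
    for entry in value.split(","):
        country, sep, count = entry.partition("=")
        if not sep or not count.isdigit():
            continue  # skip anything that is not of the form 'cc=N'
        high = int(count)
        ranges[country] = (max(high - 7, 1), high)
    return ranges


# The four countries spelled out in the example figure.
print(parse_bridge_ips("bridge-ips sa=48,us=40,de=32,ir=32"))
```

For the four countries listed in the figure this yields the ranges quoted
in the text above, for example 41 to 48 for Saudi Arabia.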
An obvious limitation of these bridge usage statistics is that we can only
learn about connecting clients from bridges with at least 24 hours uptime.
It's still unclear how many bridge users are not included in the
statistics because of this, which is left for a different analysis.

We further decided to exclude bridges running Tor versions 0.2.2.3-alpha
or earlier.
These bridges report statistics similar to those of the later Tor versions
that we're considering here, but do not enforce a measurement interval of
exactly 24 hours, which would have slightly complicated the analysis.
We don't expect the bridge version to have an influence on bridge usage
or on the likelihood of the bridge being blocked in a given country.

\section{Case study: China in the first half of 2010}

The major limitation of this analysis is that we don't have the data
confirming that a bridge was actually blocked.
We may decide on a case-by-case basis whether blocking is a plausible
explanation for the change in observed users from a given country.
Anything more objective requires additional data, e.g., data obtained from
active reachability scans.

We decided to investigate bridge usage from China in the first half of
2010 as a case study.
Figure~\ref{fig:bridge-users} shows estimated daily bridge users from China
since July 2009.
The steep rise in September and October 2009 is very likely a result of
China blocking direct access to the Tor network.
It seems plausible that the drops in March and May 2010 result from
attempts to block access to bridges, too.
We're going to focus only on the interval from January to June 2010, which
promises the most interesting results.
We should be able to detect these blockings in the reported statistics of
single bridges.
Obviously, it may be hard or impossible to transfer the findings from this
case study to other countries or situations.

\begin{figure}
\includegraphics[width=\textwidth]{bridge-users.png}
\caption{Estimated daily bridge users from China}
\label{fig:bridge-users}
\end{figure}

\paragraph{Definition of bridge blocking}

We have a few options to define when we consider a bridge to be blocked
from a given country on a given day.

\begin{itemize}
\item \textbf{Absolute threshold:}
The absolute number of connecting clients from a country falls below a
fixed threshold.
\item \textbf{Relative threshold compared to other countries:}
The fraction of connecting clients from a country drops below a fixed
percent value.
\item \textbf{Estimated interval based on history:}
The absolute or relative number of connecting clients falls outside an
estimated interval based on the recent history.
\end{itemize}
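
The three options can be stated more formally.
The notation is introduced here for illustration only: let $n_{b,c,d}$ be
the number of connecting clients that bridge $b$ reports for country $c$ on
day $d$, let $T$ be a fixed absolute threshold, $p$ a fixed fraction, and
$[l_{b,c,d}, u_{b,c,d}]$ an interval estimated from the recent history of
$b$ and $c$.
Bridge $b$ would then be considered blocked in $c$ on day $d$ if,
respectively,
\[
n_{b,c,d} < T, \qquad
\frac{n_{b,c,d}}{\sum_{c'} n_{b,c',d}} < p, \qquad \mbox{or} \qquad
n_{b,c,d} \notin [l_{b,c,d}, u_{b,c,d}].
\]
For the third option the same test could be applied to the fraction
instead of the absolute number.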

For this case study we decided to stick with the simplest option, an
absolute threshold.
We define a somewhat arbitrary threshold of 32 users to decide whether a
bridge is potentially blocked.
A blocked bridge does not necessarily report zero users per day.
A likely explanation for reporting users from a country that blocks a
bridge is that our GeoIP database is not 100~\% accurate and reports a few
users who in fact come from other countries.

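
As a minimal sketch of this rule, the following Python code treats a day as
a possible blocking event when the reported number of clients is 32 or
fewer.
The function names and the example numbers are hypothetical, and counting a
country that is missing from a bridge's statistics as zero users is an
assumption made here.

```python
BLOCKING_THRESHOLD = 32  # the somewhat arbitrary threshold from the text


def is_potentially_blocked(reported_clients, threshold=BLOCKING_THRESHOLD):
    """True if the reported client number is at or below the threshold."""
    return reported_clients <= threshold


def possible_blocking_days(daily_clients, threshold=BLOCKING_THRESHOLD):
    """Return the sorted days on which a bridge looks potentially blocked.

    daily_clients maps a day (ISO date string) to the reported number of
    clients from one country; days without an entry for that country are
    assumed to be stored as 0.
    """
    return sorted(day for day, clients in daily_clients.items()
                  if is_potentially_blocked(clients, threshold))


# Made-up numbers for a single bridge, for illustration only.
example = {"2010-03-01": 120, "2010-03-02": 32,
           "2010-03-03": 0, "2010-03-04": 40}
print(possible_blocking_days(example))
```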

We decided against a relative threshold because it depends on
developments in other countries.
As we can see in the example of China, bridge usage can depend on the
ability to directly access the Tor network.
A sudden increase in country $A$ could significantly lower the relative
usage in country $B$.
We should probably consider both absolute and relative thresholds in
future investigations.
Maybe we also need to take direct usage numbers into account.

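
The effect can be illustrated with a small calculation.
The numbers below are made up and the helper \verb+country_fraction+ is
hypothetical; the point is only that the fraction for country $B$ halves
although its absolute usage stays the same.

```python
def country_fraction(counts, country):
    """Fraction of a bridge's reported clients that come from one country."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.get(country, 0) / total


# Made-up numbers: usage from country A jumps, country B stays constant.
before = {"A": 40, "B": 40, "other": 80}
after = {"A": 200, "B": 40, "other": 80}

print(country_fraction(before, "B"))  # 40 of 160 clients
print(country_fraction(after, "B"))   # 40 of 320 clients
```

A relative threshold between the two resulting fractions would flag the
bridge as blocked in $B$ even though nothing changed there.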

We also didn't build our analysis upon an estimated interval based on the
recent history, because it's unclear how fast a bridge will be blocked
after being set up.
If it only takes the censor a few hours, the bridge may never see much use
from a country at all.
An estimate based on the bridge's history may then fail to detect the
censorship, because the bridge may look like one with only a few users from
that country.

We plan to reconsider other options for deciding that a bridge is blocked
once we have data confirming this.

\paragraph{Visualization of bridge blockings}

Figure~\ref{fig:bridge-blockings} shows a subset of the raw bridge usage
statistics for clients connecting from China in the first half of 2010.
Possible blocking events are those days on which the bridge reports 32 or
fewer connecting clients.
These events are marked with red dots.

We decided to only include bridges in the figure that report at least
100~Chinese clients on at least one day in the whole interval.
Bridges with fewer users than that have a usage pattern that makes it much
more difficult to detect blockings at all.
The figure also shows only bridges reporting statistics on at least 30
days in the measurement interval.

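
The two selection criteria can be sketched as follows.
This is not the script used to produce the figure; the function name, the
data layout (a mapping from bridge to per-day client numbers) and the
example data are assumptions made for illustration.

```python
def select_bridges(stats, min_peak=100, min_days=30):
    """Return the bridges that satisfy both selection criteria.

    stats maps a bridge identifier to a dict {day: reported clients from
    the country of interest}. A bridge is kept if it reports statistics
    on at least min_days days and at least min_peak clients on at least
    one of them.
    """
    return sorted(
        bridge for bridge, daily in stats.items()
        if len(daily) >= min_days
        and max(daily.values(), default=0) >= min_peak
    )


# Made-up data: only 'busy' has enough days and a high enough peak.
days = ["2010-01-%02d" % d for d in range(1, 31)]
stats = {
    "busy": {day: (104 if day == "2010-01-15" else 8) for day in days},
    "short": {day: 200 for day in days[:10]},
    "quiet": {day: 16 for day in days},
}
print(select_bridges(stats))
```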
\begin{figure}[t]
\includegraphics[width=\textwidth]{bridge-blockings.png}
\caption{Subset of bridge usage statistics for Chinese clients in the
first half of 2010}
\label{fig:bridge-blockings}
\end{figure}

The single bridge usage plots indicate how difficult it is to detect
blockings only from usage statistics.
About 10 of the displayed 27 plots have a pattern similar to the expected
pattern from Figure~\ref{fig:bridge-users}.
The best examples are probably bridges \verb+C037+ and \verb+D795+.
Interestingly, bridge \verb+A5FA+ was unaffected by the blocking in March
2010, but affected by the blocking in May 2010.

\paragraph{Aggregating blocking events}

As the last step of this case study we want to compare observed bridge
users to the number of blocked bridges as detected by our simple threshold
approach.
We would expect most of our bridges to exhibit blockings in March 2010 and
from May 2010 on.
Figure~\ref{fig:bridge-users-blockings} plots users and blocked bridges
over time.
The two plots indicate that our detection algorithm is at least not
totally off.

\begin{figure}[t]
\includegraphics[width=\textwidth]{bridge-users-blockings.png}
\caption{Estimated users and assumed bridge blockings in China in the
first half of 2010}
\label{fig:bridge-users-blockings}
\end{figure}

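
The aggregation behind the number of assumed blockings could look roughly
like the following sketch, which counts for every day how many bridges
report 32 or fewer clients.
The function name, the data layout and the example numbers are hypothetical
and are not taken from the analysis scripts.

```python
from collections import Counter


def count_blocked_bridges(stats, threshold=32):
    """Count, per day, the bridges at or below the threshold.

    stats maps a bridge identifier to a dict {day: reported clients}.
    Days on which no bridge is below the threshold are included with 0.
    """
    blocked = Counter()
    for daily in stats.values():
        for day, clients in daily.items():
            blocked[day] += 1 if clients <= threshold else 0
    return dict(sorted(blocked.items()))


# Made-up numbers for three hypothetical bridges.
stats = {
    "bridge1": {"2010-03-01": 120, "2010-03-02": 16},
    "bridge2": {"2010-03-01": 104, "2010-03-02": 24},
    "bridge3": {"2010-03-01": 112, "2010-03-02": 136},
}
print(count_blocked_bridges(stats))
```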
\section{Conclusion}

Passively collected bridge usage statistics seem to be a useful tool to
detect whether a bridge is blocked from a country.
However, the main conclusion from this analysis is that we lack the
data to conduct it usefully.
One way to obtain the data we need is to run active scans.
When conducting such scans, passively collected statistics may help reduce
the total number and frequency of scans.
For example, when selecting a bridge to scan, the reciprocal of the last
reported number of connecting clients could be used as a probability
weight.
Once we have better data confirming bridge blocking we shall revisit the
criteria for deriving the blocking from usage statistics.
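
A minimal sketch of such a weighted selection is given below.
The function names are hypothetical, and treating a bridge that reported no
clients at all as if it had reported one client, to avoid a division by
zero, is an assumption made here.

```python
import random


def scan_weights(last_counts):
    """Map each bridge to the reciprocal of its last reported client number.

    Assumption: a reported number of 0 is treated like 1, so that bridges
    without any reported clients get the largest weight.
    """
    return {bridge: 1.0 / max(clients, 1)
            for bridge, clients in last_counts.items()}


def pick_bridge_to_scan(last_counts, rng=random):
    """Pick one bridge at random, weighted by scan_weights()."""
    weights = scan_weights(last_counts)
    bridges = list(weights)
    return rng.choices(bridges, weights=[weights[b] for b in bridges], k=1)[0]


# Made-up last reported client numbers for three hypothetical bridges.
last_counts = {"bridge1": 8, "bridge2": 128, "bridge3": 0}
print(scan_weights(last_counts))
print(pick_bridge_to_scan(last_counts, random.Random(0)))
```

Bridges with few reported clients are scanned more often under this
scheme, while bridges that are evidently in use are scanned only
occasionally.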

\end{document}