1 # -*- coding: utf-8 ; mode: org -*-
3 Filename: XXX-bridgedb-database-improvements.txt
4 Title: "Scalability and Stability Improvements to BridgeDB: Switching to a
5 Distributed Database System and RDBMS"
6 Author: Isis Agora Lovecruft
8 Related Proposals: XXX-social-bridge-distribution.txt
13 BridgeDB is Tor's Bridge Distribution system, which currently has two major
14 Bridge Distribution mechanisms: the HTTPS Distributor and an Email
17 BridgeDB is written largely in Twisted Python, and uses Python2's builtin
18 sqlite3 as its database backend. Unfortunately, this backend system is
19 already showing strain through increased times for queries, and sqlite's
20 memory usage is not up-to-par with modern, more efficient, NoSQL databases.
22 In order to better facilitate the implementation of newer, more complex
23 Bridge Distribution mechanisms, several improvements should be made to the
24 underlying database system of BridgeDB. Additionally, I propose that a
25 clear distinction in terms, as well as a modularisation of the codebase, be
26 drawn between the mechanisms for Bridge Distribution versus the backend
27 Bridge Database (BridgeDB) storage system.
29 This proposal covers the design and implementation of a scalable NoSQL ―
30 Document-Based and Key-Value Relational ― database backend for storing data
31 on Tor Bridge relays, in an efficient manner that is ammenable to
32 interfacing with the Twisted Python asynchronous networking code of current
33 and future Bridge Distribution mechanisms.
37 BridgeDistributor := A program which decides when and how to hand out
38 information on a Tor Bridge relay, and to whom.
40 BridgeDB := The backend system of databases and object-relational mapping
41 servers, which interfaces with the BridgeDistributor in order
42 to hand out bridges to clients, and to obtain and process new,
43 incoming ``@type bridge-server-descriptors``,
44 ``@type bridge-networkstatus`` documents, and
45 ``@type bridge-extrainfo`` descriptors. [3]
47 BridgeFinder := A client-side program for an Onion Proxy (OP) which handles
48 interfacing with a BridgeDistributor in order to obtain new
49 Bridge relays for a client. A BridgeFinder also interfaces
50 with a local Tor Controller (such as TorButton or ARM) to
51 handle automatic, transparent Bridge configuration (no more
52 copy+pasting into a torrc) without being given any
53 additional privileges over the Tor process, [1] and relies
54 on the Tor Controller to interface with the user for
55 control input and displaying up-to-date information
56 regarding available Bridges, Pluggable Transport methods,
57 and potentially Invite Tickets and Credits (a cryptographic
58 currency without fiat value which is generated
59 automatically by clients whose Bridges remain largely
60 uncensored, and is used to purchase new Bridges), should a
61 Social Bridge Distributor be implemented. [2]
64 ** III.A. Scalability Requirements
66 Databases SHOULD be implemented in a manner which is ammenable to using a
67 distributed storage system; this is necessary because many potential
68 datatypes required by future BridgeDistributors MUST be stored permanently.
69 For example, in the designs for the Social Bridge Distributor, the list of
70 hash digests of spent Credits, and the list of hash digests of redeemed
71 Invite Tickets MUST be stored forever to prevent either from being replayed
72 ― or double-spent ― by a malicious user who wishes to block bridges faster.
73 Designing the BridgeDB backend system such that additional nodes may be
74 added in the future will allow the system to freely scale in relation to
75 the storage requirements of future BridgeDistributors.
77 Additionally, requiring that the implementation allow for distributed
78 database backends promotes modularisation the components of BridgeDB, such
79 that BridgeDistributors can be separated from the backend storage system,
80 BridgeDB, as all queries will be issued through a simplified, common API,
81 regardless of the number of nodes system, or the design of future
84 *** 1. Distributed Database System
86 A distributed database system SHOULD be used for BridgeDB, in order to
87 scale resources as the number of Tor bridge users grows. This database
88 system, hereafter referred to as DDBS.
90 The DDBS MUST be capable of working within Twisted's asynchronous
91 framework. If possible, a Object-Relational Mapper (ORM) SHOULD be used to
92 abstract the database backend's structure and query syntax from the Twisted
93 Python classes which interact with it, so that the type of database may be
94 swapped out for another with less code refactoring.
96 The DDBM SHALL be used for persistent storage of complex data structures
97 such as the bridges, which MAY include additional information from both the
98 `@type bridge-server-descriptor`s and the `@type bridge-extra-info`
101 **** 1.a. Choice of DDBS
103 CouchDB is chosen for its simple HTTP API, ease of use, speed, and official
104 support for Twisted Python applications. [4] Additionally, its
105 document-based data model is very similar to the current archetecture of
106 tor's Directory Server/Mirror system, in that an HTTP API is used to
107 retrieve data stored within virtual directories. Internally, it uses JSON
108 to store data and JavaScript as its query language, both of which are
109 likely friendlier to various other components of the Tor Metrics
110 infrastructure which sanitise and analyse portions of the Bridge
111 descriptors. At the very least, friendlier than hardcoding raw SQL queries
114 **** 1.b. Data Structures which should be stored in a DDBS:
116 - RedactedDB - The Database of Blocked Bridges
118 The RedactedDB will hold entries of bridges which have been discovered to
119 be unreachable from BridgeDB network vantage point, or have been reported
120 unreachable by clients.
122 - BridgeDB - The Database of Bridges
124 BridgeDB holds information on available Bridges, obtained via bridge
125 descriptors and networkstatus documents from the BridgeAuthority. Because
126 a Bridge may have multiple `ORPort`s and multiple
127 `ServerTransportListenAddress`es, attaching additional data to each of
128 these addresses which MAY include the following information on a blocking
130 - Geolocational country code of the reported blocking event
131 - Timestamp for when the blocking event was first reported
132 - The method used for discovery of the block
133 - an the believed mechanism which is causing the block
134 would quickly become unwieldy, the RedactedDB and BridgeDB SHOULD be kept
139 For the Social BridgeDistributor, these are rather complex,
140 increasingly-growing, concatenations (or tuples) of several datatypes,
141 including Non-Interactive Proofs-of-Knowledge (NIPK) of Commitments to
142 k-TAA Blind Signatures, and NIPK of Commitments to a User's current
143 number of Credits and timestamps of requests for Invite Tickets.
145 *** 2. Key-Value Relational Database Mapping Server
147 For simpler data structures which must be persistently stored, such as the
148 list of hashes of previously seen Invite Tickets, or the list of
149 previously spent Tokens, a Relational Database Mapping Server (RDBMS)
150 SHALL be used for optimisation of queries.
152 Redis and Memcached are two examples of RDBMS which are well tested and
153 are known to work well with Twisted. The major difference between the two
154 is that Memcaches are stored only within volatile memory, while Redis
155 additionally supports commands for transferring objects into persistent,
158 There are several support modules for interfacing with both Memcached and
159 Redis from Twisted Python, see Twisted's MemCacheProtocol class [5] [6] or
160 txyam [7] for Memcached, and txredis [8] or txredisapi [9] for
161 Redis. Additionally, numerous big name projects both use Redis as part of
162 their backend systems, and also provide helpful documentation on their own
163 experience of the process of switching over to the new systems. [17] For
164 non-Twisted Python Redis APIs, there is redis-py, which provides a
165 connection pool that could likely be interfaced with from Twisted Python
166 without too much difficultly. [10] [11]
168 **** 2.a. Data Structures which should be stored in a RDBMS
170 Simple, mostly-flat datatypes, and data which must be frequently indexed
171 should be stored in a RDBMS, such as large lists of hashes, or arbitrary
172 strings with assigned point-values (i.e. the "Uniform Mapping" for the
173 current HTTPS BridgeDistributor).
175 For the Social BridgeDistributor, hash digests of the following datatypes
176 SHOULD be stored in the RDBMS, in order to prevent double-spending and
181 These are anonymous, unlinkable, unforgeable, and verifiable tokens
182 which are occasionally handed out to well-behaved Users by the Social
183 BridgeDistributor to permit new Users to be invited into the system.
184 When they are redeemed, the Social BridgeDistributor MUST store a hash
185 digest of their contents to prevent replayed Invite Tickets.
189 These are Credits which have already been redeemed for new Bridges.
190 The Social BridgeDistributor MUST also store a hash digest of Spent
191 Credits to prevent double-spending.
193 *** 3. Bloom Filters and Other Database Optimisations
195 In order to further decrease the need for lookups in the backend
196 databases, Bloom Filters can used to eliminate extraneous
197 queries. However, this optimization would only be beneficial for string
198 lookups, i.e. querying for a User's Credential, and SHOULD NOT be used for
199 queries within any of the hash lists, i.e. the list of hashes of
200 previously seen Invite Tickets. [14]
202 **** 3.a. Bloom Filters within Redis
204 It might be possible to use Redis' GETBIT and SETBIT commands to store a
205 Bloom Filter within a Redis cache system; [15] doing so would offload the
206 severe memory requirements of loading the Bloom Filter into memory in
207 Python when inserting new entries, reducing the time complexity from some
208 polynomial time complexity that is proportional to the integral of the
209 number of bridge users over the rate of change of bridge users over time,
210 to a time complexity of order O(1).
212 **** 3.b. Expiration of Stale Data
214 Some types of data SHOULD be safe to expire, such as User Credentials
215 which have not been updated within a certain timeframe. This idea should
216 be further explored to assess the safety and potential drawbacks to
219 If there is data which SHOULD expire, the PEXPIREAT command provided by
220 Redis for the key datatype would allow the RDBMS itself to handle cleanup
221 of stale data automatically. [16]
223 **** 4. Other potential uses of the improved Bridge database system
225 Redis provides mechanisms for evaluations to be made on data by calling
226 the sha1 for a serverside Lua script. [15] While not required in the
227 slightest, it is a rather neat feature, as it would allow Tor's Metrics
228 infrastructure to offload some of the computational overhead of gathering
229 data on Bridge usage to BridgeDB (as well as diminish the security
230 implications of storing Bridge descriptors).
232 Also, if Twisted's IProducer and IConsumer interfaces do not provide
233 needed interface functionality, or it is desired that other components of
234 the Tor software ecosystem be capable of scheduling jobs for BridgeDB,
235 there are well-tested mechanisms for using Redis as a message
236 queue/scheduling system. [16]
240 [0]: https://bridges.torproject.org
241 mailto:bridges@bridges.torproject.org
242 [1]: See proposals 199-bridgefinder-integration.txt at
243 https://gitweb.torproject.org/torspec.git/tree/proposals/199-bridgefinder-integration.txt
244 [2]: See XXX-social-bridge-distribution.txt at
245 https://gitweb.torproject.org/user/isis/bridgedb.git/tree/doc/proposals/XXX-bridgedb-social-distribution.txt?h=feature/7520-social-dist-design
246 [3]: https://metrics.torproject.org/formats.html#descriptortypes
247 [4]: https://github.com/couchbase/couchbase-python-client#twisted-api
248 [5]: https://twistedmatrix.com/documents/current/api/twisted.protocols.memcache.MemCacheProtocol.html
249 [6]: http://stackoverflow.com/a/5162203
250 [7]: http://findingscience.com/twisted/python/memcache/2012/06/09/txyam:-yet-another-memcached-twisted-client.html
251 [8]: https://pypi.python.org/pypi/txredis
252 [9]: https://github.com/fiorix/txredisapi
253 [10]: https://github.com/andymccurdy/redis-py/
254 [11]: http://degizmo.com/2010/03/22/getting-started-redis-and-python/
255 [12]: http://www.dr-josiah.com/2012/03/why-we-didnt-use-bloom-filter.html
256 [13]: http://redis.io/topics/data-types §"Strings"
257 [14]: http://redis.io/commands/pexpireat
258 [15]: http://redis.io/commands/evalsha
259 [16]: http://www.restmq.com/
260 [17]: https://www.mediawiki.org/wiki/Redis