Introducing BRIC (Bunch of Redundant Independent Clouds)

The Plan

Online storage providers are handing out free storage like candy.  Add them all up and soon you’re talking about a serious amount of space.  So let’s have some fun by turning ten different online storage providers into a single data grid which is secure, robust, and distributed.  We call this grid a BRIC (Bunch of Redundant Independent Clouds).

Why BRIC?

It’s pretty cheap to get storage.  For example, Google Drive offers 5GB for free and you can upgrade to 100GB for just $5/month. However, you might still prefer the BRIC approach, because relying on a single online storage provider means trusting that provider not to change its terms of service, lose your data, or close your account.

By adopting an open-source BRIC you avoid putting all your eggs in one basket, and you gain the transparency to understand exactly what is happening to your data.

Introducing Tahoe-LAFS

The BRIC solution presented here will use the open-source project Tahoe-LAFS to perform the RAID-like function of striping data across different storage providers.  Here’s how Tahoe-LAFS describes itself:

Tahoe-LAFS is a Free and Open cloud storage system. It distributes your data across multiple servers. Even if some of the servers fail or are taken over by an attacker, the entire filesystem continues to function correctly, including preservation of your privacy and security.

More detail is available on the project site, https://tahoe-lafs.org/, and in its documentation.

We’ll create a private Tahoe grid where each node’s storage will be backed by an online storage provider.  Tahoe doesn’t directly support online storage providers as remote back-ends, but we can work around this problem by using sync folders, at the expense of local disk space.

Storage Providers

Here are the providers used. They each offer free plans with desktop apps or daemons which sync local folders. In total we have 92GB of free storage and we could easily obtain more by referring friends or using secondary email addresses.

  • ASUS WebStorage – 5GB
  • Dropbox – 25GB
  • Google Drive – 5GB
  • Jottacloud – 5GB
  • SkyDrive – 25GB
  • SpiderOak – 2GB
  • SugarSync – 5GB
  • Symform – 10GB
  • Ubuntu One – 5GB
  • Wuala – 5GB

Before setting up Tahoe, launch the desktop apps and set up the sync folders.  For this project, it’s a good idea to put them all under the same parent folder, such as ~/syncfolders, to make file management simpler.
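
Something along these lines works (the Dropbox path is just an example): some desktop apps let you choose the sync location during setup, while others create a fixed folder that you can symlink into the common parent.

# Create a common parent for all the provider sync folders
mkdir -p ~/syncfolders
# For apps with a fixed sync folder, symlink it into the common parent
ln -s ~/Dropbox ~/syncfolders/dropbox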

Patching Tahoe

Originally this solution was meant to run entirely on Linux, but some providers only offer sync software for OS X and Windows. So we’ll be adventurous and install Tahoe on two computers on the same internal network, one running OS X and the other Ubuntu.

To get started, download the latest version of Tahoe (1.9.2 at the time of writing).  Before building, we’ll apply some patches.  The patches do two things.

  1. Add a maximum_capacity configuration option to set the storage capacity of each node. Without the patch, nodes will keep storing data until the local hard drive is full. Instead, we want to restrict each node to storing only the amount of data offered by its associated online storage.
  2. Compute the available space by subtracting the actual number of bytes used by the storage folder from the maximum capacity.  Without the patch, the node will simply compute the free space of the local hard drive which is not the behaviour we want.

I’m not a Python programmer and have never looked at the Tahoe-LAFS source code before, so the changes are most likely sub-optimal, but do seem to work fine.
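
Assuming you’ve saved the two diffs below as client.py.patch and server.py.patch (the file names are just examples), applying them and building might look something like this:

cd allmydata-tahoe-1.9.2
patch src/allmydata/client.py < client.py.patch
patch src/allmydata/storage/server.py < server.py.patch
# Build in place; this creates the ./bin/tahoe wrapper used in the steps below
python setup.py build
./bin/tahoe --version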

Patch for allmydata-tahoe-1.9.2/src/allmydata/client.py

--- client.py.orig      2012-10-21 19:47:56.000000000 -0700
+++ client.py   2012-10-22 12:35:12.000000000 -0700
@@ -220,6 +220,19 @@
                     % data)
         if reserved is None:
             reserved = 0
+
+        data = self.get_config("storage", "maximum_capacity", None)
+        capacity = None
+        try:
+            capacity = parse_abbreviated_size(data)
+        except ValueError:
+            log.msg("[storage]maximum_capacity= contains unparseable value %s" % data)
+        if capacity is None:
+            capacity = 0
+
+
+
+
         discard = self.get_config("storage", "debug_discard", False,
                                   boolean=True)
 
@@ -247,6 +260,7 @@
 
         ss = StorageServer(storedir, self.nodeid,
                            reserved_space=reserved,
+                           maximum_capacity=capacity,
                            discard_storage=discard,
                            readonly_storage=readonly,
                            stats_provider=self.stats_provider,

Patch for allmydata-tahoe-1.9.2/src/allmydata/storage/server.py

--- server.py.orig      2012-10-21 19:22:14.000000000 -0700
+++ server.py   2012-10-22 20:56:16.000000000 -0700
@@ -38,7 +38,7 @@
     name = 'storage'
     LeaseCheckerClass = LeaseCheckingCrawler
 
-    def __init__(self, storedir, nodeid, reserved_space=0,
+    def __init__(self, storedir, nodeid, reserved_space=0, maximum_capacity=0,
                  discard_storage=False, readonly_storage=False,
                  stats_provider=None,
                  expiration_enabled=False,
@@ -58,6 +58,7 @@
         self.corruption_advisory_dir = os.path.join(storedir,
                                                     "corruption-advisories")
         self.reserved_space = int(reserved_space)
+        self.maximum_capacity = int(maximum_capacity)
         self.no_storage = discard_storage
         self.readonly_storage = readonly_storage
         self.stats_provider = stats_provider
@@ -167,6 +168,8 @@
         # remember: RIStatsProvider requires that our return dict
         # contains numeric values.
         stats = { 'storage_server.allocated': self.allocated_size(), }
+        stats['storage_server.BRIC_available_space'] = self.get_available_space()
+        stats['storage_server.BRIC_maximum_capacity'] = self.maximum_capacity
         stats['storage_server.reserved_space'] = self.reserved_space
         for category,ld in self.get_latencies().items():
             for name,v in ld.items():
@@ -205,7 +208,16 @@
 
         if self.readonly_storage:
             return 0
-        return fileutil.get_available_space(self.sharedir, self.reserved_space)
+
+        # http://stackoverflow.com/questions/1392413/calculating-a-directory-size-using-python
+        total_size = 0
+        for dirpath, dirnames, filenames in os.walk(self.sharedir):
+            for f in filenames:
+                fp = os.path.join(dirpath, f)
+                total_size += os.path.getsize(fp)
+
+        return self.maximum_capacity - total_size
+#        return fileutil.get_available_space(self.sharedir, self.reserved_space)
 
     def allocated_size(self):
         space = 0

Setting up Tahoe

The next step is for you to read the Tahoe documentation. Setting up Tahoe can be tricky so it’s best to become familiar with the concepts and terminology before proceeding.  A good starting point is: https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/running.rst

1. Linux Configuration

On the Linux machine, create a project directory and then create an introducer.  The introducer is the starting seed of our grid; its URL (furl) is written to the file introducer.furl once the introducer has been started.

tahoe create-introducer introducer
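
The furl is only written once the introducer has run for the first time, so a quick way to grab it is to start the introducer now and read the file:

tahoe start ./introducer
cat ./introducer/introducer.furl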

Next, create client nodes for each of the Linux-friendly storage providers: Ubuntu One, Dropbox, Wuala and Symform.

tahoe create-client node1
tahoe create-client node2
tahoe create-client node3
tahoe create-client node4

For each client node, modify the configuration file tahoe.cfg and set the nickname, the introducer furl, and maximum_capacity to match the space offered by that node’s storage provider. If you run several nodes on the same machine, give each one a different web.port. Here’s an example for Ubuntu One.

[node]
nickname = node1-ubuntuone
web.port = tcp:3456:interface=127.0.0.1
[client]
introducer.furl = pb://p6dkdpfdvh3vpwf5yafir7tv4lwizsid@ubuntu.local:42136,127.0.0.1:42136/introducer
[storage]
enabled = true
maximum_capacity = 5G

Next, in each client node’s directory, create a symbolic link named storage that points to the provider’s sync folder.

storage -> ~/syncfolders/wuala

You might prefer to use a designated folder in each provider’s sync folder.

storage -> ~/syncfolders/Google Drive/bricstuff

If you intend to share the online storage for non-BRIC purposes, this will impact the computation of available space as the Tahoe node won’t know how much space is taken up by non-Tahoe data. It’s best not to.

If you prefer, you can create symbolic links in the other direction e.g.

~/syncfolders/dropbox -> ~/projects/bric/tahoe/node2/storage
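
For example, assuming node3 is the Wuala-backed node and the layout used earlier (if the node has already been started once, move its existing storage directory aside first):

cd ~/projects/bric/tahoe/node3
ln -s ~/syncfolders/wuala storage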

2. Launching Tahoe on Linux

To launch Tahoe, specify the directory of the node you created, or run tahoe from within that node’s directory.  Start the introducer first (if it isn’t already running from the earlier step), then the client nodes:

tahoe start ./introducer
tahoe start ./node1
tahoe start ./node2
tahoe start ./node3
tahoe start ./node4

To check that your grid is up and running, connect to a client node’s web interface in your browser: http://localhost:3456

3. Connecting the Mac

You repeat the same setup on the Mac for the Linux-unfriendly storage providers, but you don’t need to create an introducer. Just create Tahoe client nodes as normal and configure them with the already obtained introducer furl.  If all your Tahoe nodes are up and running, you should see one active Tahoe introducer and ten active Tahoe clients.
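
A minimal sketch of the Mac side (node names are arbitrary; repeat for each provider, edit each node’s tahoe.cfg as shown earlier, and reuse the same introducer.furl value):

tahoe create-client node5
tahoe create-client node6
# ...one node per remaining provider, then start them
tahoe start ./node5
tahoe start ./node6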

If you check your online storage, you should see that the desktop apps have already started syncing Tahoe’s house-keeping files and folders.

Storing Data in Tahoe

Uploading a single file is easy via the web interface of a Tahoe node.  Tahoe will encrypt the file, split it into erasure-coded shares, and store those shares redundantly across the connected nodes.  You can confirm this by checking the sync folders and your online storage accounts.
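
A quick way to confirm that shares are landing in each provider’s folder (assuming the ~/syncfolders layout used earlier):

# Shares live under each node's storage directory, which is synced to the provider
du -sh ~/syncfolders/*
find ~/syncfolders -type f | head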

You can also transfer folders and files in bulk by using Tahoe from the command line.  It’s strongly advised you consult the Tahoe documentation to avoid frustration!

Keep in mind that you must keep track of any file or directory URIs (what Tahoe calls a FILECAP or DIRCAP) that you create, otherwise you won’t be able to retrieve your data.  Creating aliases for frequently used URIs makes things easier.

Here is an example of what you might do.

cd node1
tahoe add-alias -d ./ home URI:DIR2:blahblahblah
tahoe cp -d ./ /tmp/test.txt home:
tahoe backup -d ./ ~/Documents home:docs
tahoe deep-check -d ./ home:
tahoe ls -d ./ home:

How is data stored across the BRIC?

Tahoe will store data using erasure coding rather than simple replication.  A 2002 paper, “Erasure Coding vs. Replication: A Quantitative Comparison” [PDF: oceanstore.cs.berkeley.edu/publications/papers/pdf/erasure_iptps.pdf], demonstrated why:

We show that systems employing erasure codes have mean time to failures many orders of magnitude higher than replicated systems with similar storage and bandwidth requirements. More importantly, erasure-resilient systems use an order of magnitude less bandwidth and storage to provide similar system durability as replicated systems.

The Windows Azure team received a 2012 USENIX Best Paper Award for detailing how they use erasure coding in their storage service.

Tahoe will use encoding parameters defined in the configuration file tahoe.cfg:

# What encoding parameters should this client use for uploads?
# default is 3,7,10
#shares.needed = 3
#shares.happy = 7
#shares.total = 10

What do these default values mean? As explained in the FAQ:

The default Tahoe-LAFS parameters are 3-of-10, so the data is spread over 10 different drives, and you can lose any 7 of them and still recover the entire data. This gives much better reliability than comparable RAID setups, at a cost of only 3.3 times the storage space that a single copy takes. It takes about 3.3 times the storage space because it uses space on each server equal to 1/3 of the size of the data, and there are 10 servers.

So from our 92GB of total online storage space, we can expect to store about 28GB of data (92 ÷ 3.3 ≈ 28). You could store more data by tweaking the parameters, but at the expense of redundancy.
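
For example (a sketch only; changed parameters apply to subsequent uploads, while existing files keep the encoding they were stored with), switching to 5-of-10 halves the expansion factor to 2x: the same 92GB would hold roughly 46GB, at the cost of tolerating the loss of only 5 providers instead of 7.

[client]
shares.needed = 5
shares.happy = 7
shares.total = 10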

Retrieving Data

Getting and verifying data is done via the command line.  Here’s what you might do.

cd node1
tahoe add-alias -d ./ home URI:DIR2:blahblahblah
tahoe deep-check -d ./ home:
tahoe ls -d ./ home:
tahoe cp -d ./ home:test.txt ~/recover/test.txt
tahoe cp -d ./ --recursive home:docs ~/recover/Documents

If you’ve managed to store data, verify it’s been striped across the providers, and retrieve the data without any errors, you’ve successfully built a BRIC.  Congratulations!

Some Thoughts

With the help of Tahoe-LAFS we’ve turned a hodge-podge mix of free online storage into a secure, fault-tolerant, distributed data store. It might be fair to say that a BRIC is truly greater than the sum of its parts.

Obviously there’s plenty of scope for improvement, as the BRIC solution proposed here is not easy to set up, configure, or use.  A few things to ponder:

  • Duplicity supports a Tahoe-LAFS backend, so you can use Duplicity for backups to the BRIC and avoid the Tahoe command line (see the sketch after this list).
  • It might be possible to save local disk space and avoid the use of sync folders by making use of storage providers who support open protocols (only a few do and often only as a paid option). For example, upon login you could auto-mount Box.com via WebDAV, with the Tahoe storage symlinked to the mounted volume. Data is now stored directly on Box.com rather than local disk.
  • You could avoid having to patch the Tahoe source by linking storage to a virtual file system, with the size set to match the online storage provider. On OS X you could use disk images. However, the images have to exist somewhere so you don’t really save any local disk space with this approach.
  • Running multiple closed-source background apps that scan your hard drive is not ideal on your day-to-day computer. There’s a performance hit and a real security issue: ‘Google Drive opens backdoor to Google Accounts’. The current approach is probably best suited to a dedicated backup computer.
  • You’ll need decent upstream speed as you now have multiple sync apps competing for available bandwidth.
  • Tahoe-LAFS is complicated to use and configure correctly. There’s a lot to consider, such as data lease times and maintenance of the grid, e.g. rebalancing data when storage providers change. More here: https://tahoe-lafs.org/~warner/pycon-tahoe.html
  • If rolling your own BRIC is too much effort, you might want to look at Symform and Bitcasa, two user-friendly storage providers based upon a P2P / community storage model.  The creators of Tahoe-LAFS also have a commercial offering which supports an S3 backend, and the code remains open-source.
  • Cloud providers won’t like being reduced to commodity storage by the BRIC approach, so one day they might explicitly forbid such usage in the terms of service.  Thankfully due to redundancy, even if a provider closed your account you shouldn’t suffer any real data loss and you would have plenty of time to add an alternate storage provider.
  • Some folk at Cornell wrote a paper on using a proxy to stripe data across multiple cloud storage providers and call their proxy a RACS (Redundant Array of Cloud Storage). I’ve never seen an array of clouds in the sky, but I have seen a bunch of them, so I prefer the term BRIC :-)
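
A rough sketch of the Duplicity idea from the first bullet above (assuming Duplicity is installed with its Tahoe-LAFS backend, and that a home: alias has already been created on the node you run it from; the paths are illustrative):

# Back up ~/Documents into the grid under the existing home: alias
duplicity ~/Documents tahoe://home/Documents

# Restore it later into a local folder
duplicity restore tahoe://home/Documents ~/recover/Documents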

If you’ve created your own BRIC (or similar) let’s hear about it!

HackerNews: http://news.ycombinator.com/item?id=4689238

Responses to Introducing BRIC (Bunch of Redundant Independent Clouds)

  1. In the Tahoe community, I’ve seen this concept called RAIC – Redundant Array of Independent Clouds (an obvious RAID analogy) – https://tahoe-lafs.org/trac/tahoe-lafs/browser/git/docs/specifications/backends/raic.rst

    There are plans for tahoe to support clouds through API-specific drivers (related tickets: http://is.gd/eFea2c, “cloud-backend” branch).

    Implementing a cloud backend on top of that is actually as easy as making a class that supports get/put methods (interface: http://is.gd/GpnIGr, and it’ll be even simpler upon release). I have a working SkyDrive driver in a GitHub branch (http://is.gd/BQC3IL).

    That said, async synchronization of some local path also works nicely with tahoe, though one naturally might be concerned about proprietary clients and various limitations (incl. the fact that they’re usually made for desktops) they impose.

    Synchronous translation of these APIs to VFS doesn’t seem to work well in practice with tahoe-specific workload (issues a lot of cacheable share-discovery requests), at least some local cache boosts performance (and avoids requests’ rate-limiting) greatly.
    Hybrid approaches (like davfs2 for webdav, maintains on-disk cache) seem to work well, but not all-remote, of course.

    Also, another free (as in beer) cloud that offers WebDAV (and 8-20GB) that you didn’t seem to mention is disk.ya.ru (the Russian Google).

    • bitcartel says:

      Thanks for the info about development of back-ends, didn’t know it was happening.

      Good point about davfs2 caching. Looks like there might be a way to encourage flushing. https://savannah.nongnu.org/support/index.php?107933#comment5

      With 8GB from disk.ya.ru we could hit the magic 100GB mark. There’s a huge list of providers on Wikipedia, but to be honest, it was getting tedious to register and set up each provider.
      Interestingly though, with multiple email addresses, you could really max out with the WebDAV providers, but it wouldn’t work with proprietary sync apps which only allow one login at a time. Of course, you could always run one virtual machine for each email address!

      • While tempting, I think such “maxing out” might be more trouble than it’s worth in the end – it should be easily detectable and all such accounts could get banned at once, turning the huge amount of space you’ve acquired into a huge point of failure.

        I tend to think that if one requires vast amounts of space, a better solution would be to pay some $10/year to each provider to get a 2x-4x net space gain, based on the assumption that the amount of work to maintain the scam and mitigate the risks involved would cost more.

  2. Thanks for writing this up! The core Tahoe-LAFS developers are working on the same concept under the name RAIC (Redundant Array of Independent Clouds). This is going to take some time to finish and review, so we strongly encourage experiments like yours into this idea.

    Our design uses the cloud services’ HTTP APIs, which solves some of the limitations of the approach based on syncing a local directory:
    * it does not duplicate shares on the local disk and on the cloud storage.
    * it has better reliability (under the assumption that we only want to rely on the reliability of the cloud storage for a given share) because a write is only confirmed to the client once the cloud service has confirmed that the share has been stored or updated.
    * it’s not dependent on third-party, closed-source sync apps.

    On the other hand, it requires more work to support a given storage service, even though much of the code is common. We’re also currently focussing more on storage services that are designed for programmatic use (Amazon S3, Google Cloud Storage, etc.) than consumer-oriented services that provide free storage. This isn’t a technical limitation; Google Drive provides an HTTP API (http://blog.programmableweb.com/2012/04/30/google-drive-api-its-kind-of-a-big-deal/), and other services may do so in future, or their current web interfaces might be sufficient in some cases.

    In addition, we’re using a database local to the storage server to store lease and accounting information. Among other benefits, that will allow efficiently calculating used space without needing to walk over all share files, as your patch above needs to.

    Least Authority Enterprises (https://leastauthority.com) is already providing storage on S3 using some of the same code that will be used for RAIC, although the current product does not yet support redundancy across clouds.

    Preliminary design documentation for the cloud backend and leasedb are here:

    * https://github.com/davidsarah/tahoe-lafs/blob/666-accounting/docs/specifications/backends/raic.rst

    * https://github.com/davidsarah/tahoe-lafs/blob/666-accounting/docs/specifications/leasedb.rst

    These changes won’t be in the next release of Tahoe-LAFS (v1.10), but might start being merged quite soon after that release.

  3. Brit says:

    As a user of Tahoe-LAFS this is exactly what I searched for – a possibility to limit the size separately for each node. But the patches are driving me crazy. What am I supposed to do with them? First I thought I had to use the unix command shell “patch” – but it just says “patch: **** malformed patch” – but I so want this patch to be working :-( Any help?

  4. amontero says:

    Hi bitcartel.
    Thanks for this awesome post. Searching the Tahoe Trac tickets I’ve found a “sizelimit” configuration option ticket that may help you with limiting space usage in each node:
    https://tahoe-lafs.org/trac/tahoe-lafs/ticket/671
