<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>Project Voldemort Blog</title>
	<atom:link href="http://project-voldemort.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://project-voldemort.com/blog</link>
	<description>News about the project</description>
	<pubDate>Thu, 17 Dec 2009 03:17:34 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Voldemort 0.60 Released</title>
		<link>http://project-voldemort.com/blog/2009/12/voldemort-060-released/</link>
		<comments>http://project-voldemort.com/blog/2009/12/voldemort-060-released/#comments</comments>
		<pubDate>Thu, 17 Dec 2009 03:17:20 +0000</pubDate>
		<dc:creator>alex</dc:creator>
		
		<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=141</guid>
		<description><![CDATA[
In sync with our plan for regular monthly releases, we&#8217;re excited to announce the release of version 0.60. Downloads are available; you may also browse the updated Javadoc or view the release notes.


In addition to bug fixes, several important new features and enhancements have made it into this release:


Admin Client/Server API: intended for functionality which is required, [...]]]></description>
			<content:encoded><![CDATA[<p>
In sync with our <a href="http://project-voldemort.com/blog/2009/11/release-057/">plan</a> for regular monthly releases, we&#8217;re excited to announce the release of version 0.60. <a href="http://github.com/voldemort/voldemort/downloads">Downloads</a> are available; you may also browse the updated <a href="http://project-voldemort.com/javadoc/all/">Javadoc</a> or view the <a href="http://github.com/voldemort/voldemort/blob/release-060/release_notes.txt">release notes</a>.
</p>
<p>
In addition to bug fixes, several important new features and enhancements have made it into this release:</p>
<ul>
<li>
Admin Client/Server API: intended for functionality which is required, but should be used sparingly (if at all), at the application level. This adds support for retrieval and update of metadata on remote nodes as well as <em>streaming</em> of keys and key/value pairs from one node to another.
</li>
<li>
EC2 testing, contributed by Kirk True: a distributed system requires tests that involve multiple machines. Amazon&#8217;s EC2 web service allows us to provision and de-provision nodes <em>programmatically</em>. The <a href="http://wiki.github.com/voldemort/voldemort/ec2-testing-infrastructure">EC2 Testing Infrastructure</a> allows such tests to run on a regular basis along with other automated tests.
</li>
<li>
Support for large lists and strings in the JSON serializer. Previously, the binary JSON serialization format limited us to a maximum size of 32,767 for strings and lists (i.e. the maximum value of a signed 16-bit integer). The maximum size of a list or string is now 1,073,741,823 bytes.
</li>
<li>
<em>Experimental</em> support for <a href="http://project-voldemort.com/javadoc/all/voldemort/store/views/package-summary.html">views</a>. Views allow computation to be moved close to the data. Suppose, for example, that we&#8217;re storing a serialized list as a value in a key/value pair and would like to append a single element. Normally, we&#8217;d have to transfer the entire list to the client, append the element, then transfer the modified list back to the server. With a view, a &#8220;put&#8221; operation becomes a proxy for appending a value to the list <em>directly on the server</em>. Note that this feature is <b>experimental</b>: we can&#8217;t make any guarantees about its stability or performance, and the API is subject to change.
</li>
<li>Support for LZF compression, contributed by Ismael Juma and Tatu Saloranta.</li>
<li><a href="http://en.wikipedia.org/wiki/Interpolation_search">Interpolation search</a> for read-only stores.</li>
<li>A <a href="http://github.com/voldemort/voldemort/tree/master/contrib/ruby-client/">client</a> for the Ruby programming language (using an experimental Ruby protocol buffers gem), contributed by Claudio Cherubino.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/12/voldemort-060-released/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Introducing AdminClient APIs</title>
		<link>http://project-voldemort.com/blog/2009/12/introducing-adminclient-apis/</link>
		<comments>http://project-voldemort.com/blog/2009/12/introducing-adminclient-apis/#comments</comments>
		<pubDate>Thu, 17 Dec 2009 02:58:46 +0000</pubDate>
		<dc:creator>alex</dc:creator>
		
		<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=129</guid>
		<description><![CDATA[AdminClient is intended for administrative functionality that is useful and often needed, but should not be used at the application level. The key functionality of the APIs is to provide extraction/loading of entries (or of keys only) in batches for offline data management and manipulation and to provide a way to get/set current state and [...]]]></description>
			<content:encoded><![CDATA[<p>AdminClient is intended for administrative functionality that is useful and often needed, but should not be used at the application level. The key functionality of the APIs is to provide extraction/loading of entries (or of keys only) in batches for offline data management and manipulation, and to provide a way to get/set the current state and metadata on individual nodes for cluster management. The AdminClient was initially designed to facilitate the rebalancing (aka dynamic cluster membership) feature, which is still in development. Some of the uses of AdminClient include:</p>
<ul>
<li> Extraction of entries/keys for backups</li>
<li> Daily batch ETL (extraction, transformation, loading) to other systems for analysis (e.g. Hadoop, search)</li>
<li> Bulk loading of entries</li>
<li> Migrating partitions</li>
<li> Getting/Updating the cluster state/metadata</li>
</ul>
<h2>AdminClient APIs</h2>
<p>These are the supported AdminClient APIs with brief descriptions; please refer to the <a href="http://project-voldemort.com/javadoc/all/">javadocs</a> for a full reference. The AdminClient can be constructed from a Voldemort server bootstrap URL or from a cluster object with information about the nodes in the cluster.</p>
<ul>
<li><strong>fetchEntries()</strong> : Provides a way to fetch all entries (key/value pairs) from a remote node belonging to any of the specified partitions.<br />
The call returns an iterator immediately; the iterator internally keeps fetching new entries as new values are requested (aka streaming mode).</li>
<li> <strong>fetchKeys()</strong> : Same as fetchEntries() but only returns the keys. The server side can be more intelligent and optimize at the storage level for better performance than fetchEntries().</li>
<li> <strong>updateEntries()</strong>: Provides a way to bulk load entries at a remote node. It takes an iterator as a parameter and starts updating the remote node with the iterator&#8217;s values as they are streamed.</li>
<li> <strong>migratePartitions()</strong> : Provides a way for a third party to migrate partitions from one remote node to another. This API is used by the rebalancing system to start rebalancing operations at different nodes. The operation is started asynchronously at the remote node being updated. A fetchEntries() request is started on that node, and the fetched values are used to update it. The status of the operation can be checked using the getAsyncRequestStatus() API.</li>
<li> <strong>(Get/Set)RemoteMetadata()</strong>: These two APIs provide a way to get/set node metadata (cluster.xml, stores.xml, states) at individual remote nodes for cluster monitoring, status checks and upgrades.</li>
<li><strong>getAsyncRequestStatus()</strong>: Gets the status of an async operation at the (remote) node. This allows the progress of an async operation (e.g. migratePartitions) to be monitored.</li>
<li><strong>waitForCompletion()</strong>: Uses exponential backoff to await the completion of an asynchronous request, simulating &#8220;blocking&#8221; behaviour.</li>
</ul>
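<p>The polling pattern behind waitForCompletion() can be sketched in plain Java. This is a minimal illustration of exponential backoff under stated assumptions, not Voldemort&#8217;s implementation: the <code>BackoffPoller</code> class, its method signatures, the delay values and the <code>BooleanSupplier</code> (standing in for a getAsyncRequestStatus() check) are all hypothetical, and the code assumes Java 8 or later.</p>
<pre><code>
import java.util.function.BooleanSupplier;

public class BackoffPoller {

    // Delay before the given attempt (0-based): the initial delay doubled
    // once per earlier attempt, capped at maxMillis.
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i &lt; attempt &amp;&amp; delay &lt; maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Polls isComplete until it returns true or the timeout elapses.
    // Returns true if the operation completed in time, false otherwise.
    static boolean waitForCompletion(BooleanSupplier isComplete,
                                     long initialMillis,
                                     long maxMillis,
                                     long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        for (int attempt = 0; ; attempt++) {
            if (isComplete.getAsBoolean()) {
                return true;
            }
            long remaining = deadline - System.currentTimeMillis();
            if (remaining &lt;= 0) {
                return false;
            }
            // Never sleep past the deadline.
            Thread.sleep(Math.min(delayMillis(attempt, initialMillis, maxMillis), remaining));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical async operation that reports completion on the fourth poll.
        int[] polls = {0};
        boolean done = waitForCompletion(() -&gt; ++polls[0] &gt;= 4, 10, 80, 5000);
        System.out.println("completed=" + done + " after " + polls[0] + " polls");
    }
}
</code></pre>
<p>With an initial delay of 10 ms and a cap of 80 ms, the waits between polls grow as 10, 20, 40, 80, 80, &#8230; ms, so a short operation is noticed quickly while a long one is not polled aggressively.</p>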
<h2>Streaming support</h2>
<p>Streaming is the efficient bulk transfer of entire partitions between machines. The intended uses are extraction of all keys (or all key/value pairs) for off-line manipulation (e.g. MapReduce processing) or backups, migration of partitions between machines and deletion of entries. Experimental support also exists for server-side filtering using client-supplied Java bytecode implementing the <a href="http://project-voldemort.com/javadoc/all/voldemort/client/protocol/VoldemortFilter.html">VoldemortFilter</a> interface. The major application of the migration functionality is dynamic deletion and addition of nodes (&#8220;rebalancing&#8221;), which is presently in development.</p>
<p>For efficiency, during streaming, messages (containing key/value pairs) are sent one after another (over a buffered stream) without waiting for acknowledgment (or forcing an early buffer flush); an &#8220;end-of-stream&#8221; message is sent to signify completion, after which the buffer is flushed.</p>
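<p>The idea of back-to-back messages followed by an end-of-stream marker can be illustrated with a small self-contained sketch. The framing used here (a 4-byte length prefix per message, and a length of -1 as the end marker) is an assumption made for the example and is not taken from Voldemort&#8217;s wire protocol; the <code>StreamingSketch</code> class and its methods are likewise hypothetical.</p>
<pre><code>
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

public class StreamingSketch {

    // Hypothetical end-of-stream marker: -1 where a message length would be.
    static final int END_OF_STREAM = -1;

    // Writes every message back-to-back as [length][bytes], without flushing
    // or waiting for an acknowledgment in between; the buffer is flushed once,
    // after the end-of-stream marker.
    static void sendAll(List&lt;byte[]&gt; messages, OutputStream rawOut) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(rawOut));
        for (byte[] message : messages) {
            out.writeInt(message.length);
            out.write(message);
        }
        out.writeInt(END_OF_STREAM);
        out.flush();
    }

    // Reads messages until the end-of-stream marker and returns how many arrived.
    static int receiveAll(InputStream rawIn) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(rawIn));
        int count = 0;
        for (int size = in.readInt(); size != END_OF_STREAM; size = in.readInt()) {
            byte[] message = new byte[size];
            in.readFully(message);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory buffer stands in for the network connection.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        sendAll(Arrays.asList("k1=v1".getBytes("UTF-8"), "k2=v2".getBytes("UTF-8")), wire);
        System.out.println(receiveAll(new ByteArrayInputStream(wire.toByteArray())));
    }
}
</code></pre>
<p>Because the sender never waits for a reply per message, throughput is bounded by the buffer and the network rather than by round-trip latency; the single flush after the marker ensures the tail of the stream is not left sitting in the buffer.</p>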
<p>When the values are streamed from a store backed by the BerkeleyDB JE storage engine, the values are read from a cursor. At the present time, unfortunately, the cursors are only opened in key order. This presents an issue: when the entries aren&#8217;t present in the BDB cache or the operating system page cache (this is guaranteed to occur when the dataset is larger than the machine&#8217;s physical memory), they have to be fetched from disk; if the cursor is opened in key order, then unless the key order matches the disk order (i.e. ordered keys written sequentially), random seeks are made to access the values. Fortunately, BerkeleyDB, the operating system and the disk controller provide a way to schedule the seeks. In our experimental setup, on a machine with eight gigabytes of memory, a single 7200 RPM SATA disk and <tt>bdb.cache.size</tt> set to <tt>3G</tt>, we were able to achieve a rate of ~1,200 entries a second: multiple times higher than would have been possible if random seeks were to be made for each entry. The distributed nature of Voldemort also means fetches of partitions can be performed from multiple nodes in parallel (e.g. fetch partition 1 from node a, partition 2 from b, etc&#8230;), distributing the random seeks across multiple nodes.</p>
<p>In addition, there is support for throttling the streaming operations. There are two settings, <tt>stream.read.byte.per.sec</tt> and <tt>stream.write.byte.per.sec</tt>, responsible for throttling reads and writes respectively. The default for both is 10MB/second. This allows you to limit the amount of read and write operations performed by the admin client API, so as to prevent exhaustion of the seek capacity and starvation of the resources needed to handle routine requests.</p>
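<p>As a rough illustration, lowering both limits might look like the fragment below. It assumes that these settings live in <code>server.properties</code> alongside the other server options and that the values are given in bytes per second; both points are assumptions, and the figures are hypothetical, so check the configuration reference for your version before relying on them.</p>
<pre>    # Hypothetical values: throttle admin streaming to roughly 5 MB/second in each direction
    stream.read.byte.per.sec=5000000
    stream.write.byte.per.sec=5000000</pre>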
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/12/introducing-adminclient-apis/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Voldemort 0.57: Introducing the First Monthly Release</title>
		<link>http://project-voldemort.com/blog/2009/11/release-057/</link>
		<comments>http://project-voldemort.com/blog/2009/11/release-057/#comments</comments>
		<pubDate>Tue, 17 Nov 2009 01:26:21 +0000</pubDate>
		<dc:creator>alex</dc:creator>
		
		<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=122</guid>
		<description><![CDATA[We&#8217;re proud to announce the release of Project Voldemort version 0.57. This is our first monthly release: we&#8217;re moving to a monthly release cycle, whereby regular releases will be made around the fifteenth of every month. Javadocs and tar/zip archives have been updated and may be downloaded from the usual location. You can view [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re proud to announce the release of Project Voldemort version 0.57. This is our first monthly release: we&#8217;re moving to a monthly release cycle, whereby regular releases will be made around the fifteenth of every month. Javadocs and tar/zip archives have been updated and may be downloaded from the <a href="http://github.com/voldemort/voldemort/downloads">usual location</a>. You can view the release notes <a href="http://github.com/voldemort/voldemort/blob/release-057/release_notes.txt">here</a>.</p>
<p>Under the current release plan, patches (containing new features, enhancements to existing features and bug fixes) are accepted on the master branch throughout the cycle; after thorough testing, these patches are merged. Automated testing runs whenever any commits are made, so the master branch remains stable. Once a month, a &#8220;release candidate&#8221; is branched from the master. No further patches are accepted into the release branch. After rigorous testing (unit tests, as well as integration and performance tests, passing), a release is made from this branch (i.e. versions are incremented). Release notes are kept, detailing changes brought on by this release (new features, issues/bugs fixed, new configuration options and especially any incompatibilities with previous releases). Any changes made on this release branch are merged back to the master.</p>
<p>As a young and agile open source project, we ask for and greatly appreciate community participation: we welcome all contributions, including patches and suggestions on how our release process could be tuned to better serve you.</p>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/11/release-057/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Introducing the NIO SocketServer Implementation</title>
		<link>http://project-voldemort.com/blog/2009/08/introducing-the-nio-socketserver-implementation/</link>
		<comments>http://project-voldemort.com/blog/2009/08/introducing-the-nio-socketserver-implementation/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 18:54:53 +0000</pubDate>
		<dc:creator>Kirk True</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=99</guid>
		<description><![CDATA[Users of Voldemort have the option of using a binary protocol for efficient network communication between clients and nodes. This is implemented on the server side using an abstraction known as a SocketServer. Previously the only implementation of the voldemort.server.socket.SocketServer used the classic thread-per-socket blocking I/O approach to handling the network communication.
Recently my NIO implementation [...]]]></description>
			<content:encoded><![CDATA[<p>Users of Voldemort have the option of using a binary protocol for efficient network communication between clients and nodes. This is implemented on the server side using an abstraction known as a <em>SocketServer</em>. Previously the only implementation of the <code>voldemort.server.socket.SocketServer</code> used the classic thread-per-socket blocking I/O approach to handling the network communication.</p>
<p>Recently my NIO implementation of Project Voldemort’s socket server was merged into the <a href="http://github.com/voldemort/voldemort">master repository</a>. I wanted to share the results of some testing I&#8217;ve done comparing the existing blocking I/O (BIO) implementation of the socket server vs. my NIO implementation. Hopefully this will encourage you to give it a try and let us know how it&#8217;s working for your workload.</p>
<h2>Enabling the NIO Implementation</h2>
<p>Enabling the NIO implementation is simply a matter of updating the configuration in Voldemort&#8217;s <code>server.properties</code>:</p>
<blockquote><p><code>enable.nio.connector=true</code></p></blockquote>
<p>This will enable the NIO socket server implementation.</p>
<p>Internally, the NIO implementation uses N threads, each with its own <code>java.nio.channels.Selector</code> object, used for readiness selection. The number of threads (N) is dictated by another configuration parameter:</p>
<blockquote><p><code>nio.connector.selectors=2</code></p></blockquote>
<p><strong>Note</strong>: the <code>nio.connector.selectors</code> property is optional. If omitted, the number of threads used by the NIO socket server implementation defaults to the number of processors on the running system. Empirical testing found that a thread/<code>Selector</code> per processor worked most efficiently.</p>
<h2>Test Environment</h2>
<p>My testing environment is kept deliberately simple in order to encourage reproducibility. I&#8217;m running a single Voldemort instance on a server to which four client machines are pointing. I am simulating multiple clients by using multiple threads from each client.</p>
<p>In terms of the hardware involved, the server is an Intel Core 2 Duo 6700 (2.66 GHz) with 3.3 GB RAM. The client machines are Intel Core 2 Duo 4300s (1.80 GHz) with a full 4 GB of RAM each. The server and clients are separated by Gigabit Ethernet on a dedicated switch.</p>
<p>The server and client machines are running Fedora 10 64-bit. The JVMs are standardized to version 1.6.0_14. The code is version 0.52 from my git-forked &#8220;nio-server&#8221; branch. Besides the changes made to my branch, I have made two small uncommitted changes as detailed in the patch below[1]. Basically the uncommitted changes increase the heap size for both client and server to 3 GB and increase the client timeouts to 10 minutes(!).</p>
<h3>Client Setup</h3>
<p>To perform the tests I&#8217;m using the voldemort-remote-test.sh script on each of the four client machines, updated to use 3 GB heap as stated above. I updated the voldemort.performance.RemoteTest to set some timeouts very long (10 minutes). I simply averaged out the result from all four machines over 100 iterations of the test.</p>
<h3>Server Setup</h3>
<p>The server is set up to use either the BIO or NIO implementation depending on the value of “enable.nio.connector” in server.properties. The value of “storage.configs” was set to  “voldemort.store.noop.NoopStorageConfiguration,voldemort.store.bdb.BdbStorageConfiguration” as some tests used the no-op storage engine. When running the BIO implementation the value of “max.threads” was set as appropriate based on the total number of client threads. For the NIO implementation we run <strong>only two threads</strong> regardless of the fact that we&#8217;re serving tens of thousands of clients.</p>
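<p>Put together, the relevant part of <code>server.properties</code> for these runs would look roughly like the fragment below. The property names and the <code>storage.configs</code> value are the ones quoted in this post; the <code>max.threads</code> figure is only an example for an 8,000-thread BIO run, and whether <code>nio.connector.selectors</code> was set explicitly or left to its per-processor default on the dual-core server is not stated, so treat both lines as assumptions.</p>
<pre>    # Switch between the NIO (true) and BIO (false) socket server implementations
    enable.nio.connector=true

    # NIO runs: two selector threads (assumed; also the default on a dual-core machine)
    nio.connector.selectors=2

    # BIO runs: sized to the total number of client threads (example value)
    max.threads=8000

    storage.configs=voldemort.store.noop.NoopStorageConfiguration,voldemort.store.bdb.BdbStorageConfiguration</pre>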
<p>After each test the Voldemort server was shut down and restarted.</p>
<p>For both the client and the server I executed the command “ulimit -n 40000” prior to starting the JVM to run the tests in order to provide room for the socket file descriptors.</p>
<h2>Test Results</h2>
<p>So far I&#8217;ve run three tests comparing the BIO implementation to the NIO implementation. These all center around measuring the writes/second as reported by the voldemort-remote-test.sh script.</p>
<h3>Transactions/second using No-op Storage Engine</h3>
<p>The first test measured the transactions/second as the number of clients increased. The server is running the no-op storage engine to avoid skewing the results with BDB&#8217;s work (as is shown to be significant in the next test).</p>
<p>For this test, the invocation of voldemort-remote-test.sh provided these options:</p>
<blockquote><p>Number of iterations: 100<br />
Number of requests: 100,000<br />
Operations: write and delete<br />
Value size: 1024 bytes<br />
Threads: as shown in the x-axis; each client ran ¼ of the threads</p></blockquote>
<p>Here is a graph charting the results:</p>
<p><img src="http://www.mustardgrain.com/images/nio.vs.bio.v1/noop.tps.png" alt="Transactions/second" /></p>
<p>While running the BIO implementation simulating 6,000 clients, the client application started throwing lots of connection timeout errors upon starting up. After several minutes the clients would generally recover and complete the number of iterations. However, the reporting output from the client would show that the number of successful operations was often less than expected. For example, the delete operation would report “99,984 things deleted” instead of 100,000. This connection timeout behavior explains the increase in speed shown by the BIO implementation. As a fraction of the clients hung (and eventually timed out), the remaining clients would process their work that much faster, producing a more favorable measurement. These connection timeout errors were not seen when running the NIO implementation until much later.</p>
<p>On the server, the load average (as measured by top) was about ½ the number of server threads, e.g. around 4,000 for 8,000 clients when running the BIO implementation. For NIO, it was always ~2. For 8,000 clients, the heap size went to 2 GB for BIO while NIO was about 700 MB. This is largely attributed to the per-thread stack size.</p>
<p>Individual iteration measurements for clients varied wildly with the BIO implementation. Unfortunately I didn&#8217;t measure this, but I did note several occasions where iterations would yield 15,000 writes/sec and the next iteration only 4,000 writes/sec. Deviation for the NIO implementation was observed to be minimal, however.</p>
<p>The data for the BIO implementation ends abruptly before recording the results for 10,000 connections since the clients hung until they were forcibly stopped (via kill -9). This was attributed to the inability of the BIO implementation to keep up due to the fact that it was pegged in garbage collection.</p>
<p>At 20,000 connections the NIO implementation did cause one client machine to time out on connections, though it did recover. At 32,000 connections a noticeable drop in performance was noted and at 34,000 the clients hung until forcibly stopped. This too was attributed to garbage collection times.</p>
<h3>Transactions/second using BDB Storage Engine</h3>
<p>The next test also measured the transactions/second as the number of clients increased. This time, however, the server is running the BDB storage engine to see how the socket server implementation is affected.</p>
<p>For this test, the invocation of voldemort-remote-test.sh provided these options:</p>
<blockquote><p>Number of iterations: 100<br />
Number of requests: 100,000<br />
Operations: write and delete<br />
Value size: 1024 bytes<br />
Threads: as shown in the x-axis; each client ran ¼ of the threads</p></blockquote>
<p>Here is a graph charting the results:</p>
<p><img src="http://www.mustardgrain.com/images/nio.vs.bio.v1/bdb.tps.png" alt="Transactions/second" /></p>
<p>As with the above test, around 6,000 clients with the BIO implementation the clients started throwing lots of connection timeout errors when starting up. At 8,000 clients this connection timeout behavior occurred for all four clients and explains the increase in speed shown by the BIO implementation. These connection timeout errors were not seen when running the NIO implementation until around 30,000 clients.</p>
<p>The BDB storage engine greatly affects the transactions/second that can be processed. My guess is that because the BDB code (and its threads) is competing for CPU time against thousands of other threads in the BIO implementation, it becomes starved and struggles to make progress. The NIO implementation, however, uses a fixed two threads, which allows BDB to accomplish more work.</p>
<p>So even at a small number of clients, we achieve an increase of ~50-100% more transactions/second with the NIO implementation over the BIO implementation.</p>
<h3>Transactions/second per Request Size</h3>
<p>The next test also measured the transactions/second, but this time as the request size increased. The server is running the no-op storage engine to measure more clearly the network I/O.</p>
<p>For this test, the invocation of voldemort-remote-test.sh provided these options:</p>
<blockquote><p>Number of iterations: 100<br />
Number of requests: 100,000<br />
Operations: write and delete<br />
Value size: as shown in the x-axis<br />
Threads: 100 total, each client ran 25 threads</p></blockquote>
<p>Here is a graph charting the results:</p>
<p><img src="http://www.mustardgrain.com/images/nio.vs.bio.v1/requestsize.tps.png" alt="Transactions/second" /></p>
<p>It&#8217;s pretty easy to see how the request size has an obvious effect on the number of transactions. Please note that the request sizes in the x-axis are doubling rather than linearly increasing. As can be seen in the graph, this appears to be a matter of JVM throughput as opposed to the nuances of the socket server implementation.</p>
<h2>Conclusion</h2>
<p>The implementation of the socket server delivers on NIO&#8217;s promise of better scalability with regard to using asynchronous I/O and readiness selection over the blocking I/O and thread-per-socket approach.</p>
<h3>A Big Surprise</h3>
<p>For a long time my NIO implementation suffered from weaker performance than the BIO implementation. I had scoured the code looking for problems but found nothing obvious. I had used the examples that I&#8217;d seen posted everywhere wherein the server used a single Selector to process readiness selection and, for each ready socket, handed processing off to a thread pool.</p>
<p>What I found during profiling of the server was that ~30% of the runtime was spent in updating the SelectionKey&#8217;s interestOps. Understandably when a SelectionKey updates its interestOps it requires acquiring a <em>shared</em> lock in the Selector implementation (at least for the 1.6 JDK on Linux). This caused a horrible serialization problem.</p>
<p>What I did instead was to move to a thread-per-Selector design wherein all of the I/O is done serially in a loop. There is literally one thread handling the I/O for tens of thousands of sockets. Because it&#8217;s done in the same thread, when the SelectionKey&#8217;s interestOps is updated, the lock is found to already be held, thus no contention.</p>
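<p>The thread-per-Selector loop described above can be sketched roughly as follows. This is an illustrative echo server, not the actual Voldemort code: the <code>SelectorLoopSketch</code> class is hypothetical, echoing the bytes back stands in for processing a real request, and the 1024-byte buffer is an arbitrary choice. The point it shows is that accepting, reading, processing, writing and every interestOps update all happen on the one thread that owns the Selector.</p>
<pre><code>
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorLoopSketch implements Runnable {

    private final Selector selector;
    private final ServerSocketChannel server;

    public SelectorLoopSketch(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(port));
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    public int port() {
        return server.socket().getLocalPort();
    }

    // The whole I/O loop runs serially on the calling thread.
    @Override
    public void run() {
        try {
            while (true) {
                selector.select();
                Iterator&lt;SelectionKey&gt; keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (!key.isValid()) continue;
                    if (key.isAcceptable()) accept();
                    else handle(key);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void accept() throws IOException {
        SocketChannel client = server.accept();
        if (client == null) return;
        client.configureBlocking(false);
        // Each connection carries its own buffer as the key attachment.
        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
    }

    private void handle(SelectionKey key) {
        SocketChannel client = (SocketChannel) key.channel();
        ByteBuffer buffer = (ByteBuffer) key.attachment();
        try {
            if (key.isReadable()) {
                if (client.read(buffer) &lt; 0) {   // peer closed the connection
                    client.close();
                    return;
                }
                buffer.flip();
                // Updated on the selector's own thread, so no other thread contends for it.
                key.interestOps(SelectionKey.OP_WRITE);
            } else if (key.isWritable()) {
                client.write(buffer);
                if (!buffer.hasRemaining()) {
                    buffer.clear();
                    key.interestOps(SelectionKey.OP_READ);
                }
            }
        } catch (IOException e) {
            key.cancel();
            try { client.close(); } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) throws IOException {
        SelectorLoopSketch loop = new SelectorLoopSketch(0); // 0 = any free port
        System.out.println("echo server listening on port " + loop.port());
        loop.run();
    }
}
</code></pre>
<p>Running N selectors, as with <code>nio.connector.selectors</code>, would mean running N such loops, each on its own thread; how new connections are distributed among them is left out of this sketch.</p>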
<p>Even while I&#8217;m writing this I&#8217;m dumbfounded how a single thread that includes I/O and <em>processing the operation</em> could outperform a multi-threaded design. I&#8217;ve looked and looked and tested and tested and it&#8217;s almost always faster. The code for the NIO implementation is there, take a look and let me know what I&#8217;ve overlooked.</p>
<p>1. Here&#8217;s an uncommitted patch that I used to increase the heap size and increase client timeouts:</p>
<pre><code>
diff --git a/bin/run-class.sh b/bin/run-class.sh
index ade56bb..a95d593 100755
--- a/bin/run-class.sh
+++ b/bin/run-class.sh
@@ -35,8 +35,8 @@ done
CLASSPATH=$CLASSPATH:$base_dir/dist/resources

if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
fi

export CLASSPATH
-java $VOLD_OPTS -cp $CLASSPATH $@
\ No newline at end of file
+java $VOLD_OPTS -cp $CLASSPATH $@
diff --git a/bin/voldemort-server.sh b/bin/voldemort-server.sh
index 9878095..66a69e5 100755
--- a/bin/voldemort-server.sh
+++ b/bin/voldemort-server.sh
@@ -42,7 +42,7 @@ done
CLASSPATH=$CLASSPATH:$base_dir/dist/resources

if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
fi

java $VOLD_OPTS -cp $CLASSPATH voldemort.server.VoldemortServer $@
diff --git a/test/integration/voldemort/performance/RemoteTest.java b/test/integration/voldemort/performance/RemoteTest.java
index 2a15955..29e945d 100644
--- a/test/integration/voldemort/performance/RemoteTest.java
+++ b/test/integration/voldemort/performance/RemoteTest.java
@@ -26,6 +26,7 @@ import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import joptsimple.OptionParser;
@@ -132,6 +133,9 @@ public class RemoteTest {

System.out.println("Bootstraping cluster data.");
StoreClientFactory factory = new SocketStoreClientFactory(new ClientConfig().setMaxThreads(numThreads)
+                                                                                    .setConnectionTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setRoutingTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setSocketTimeout(10, TimeUnit.MINUTES)
.setMaxTotalConnections(numThreads)
.setMaxConnectionsPerNode(numThreads)
.setBootstrapUrls(url));
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/08/introducing-the-nio-socketserver-implementation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Building Voldemort read-only stores with Hadoop</title>
		<link>http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/</link>
		<comments>http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/#comments</comments>
		<pubDate>Thu, 18 Jun 2009 04:46:24 +0000</pubDate>
		<dc:creator>elias</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[hadoop]]></category>

		<category><![CDATA[lookery]]></category>

		<category><![CDATA[read-only]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=79</guid>
		<description><![CDATA[A well-known lesson in scalability is that writes are 40x more expensive than reads, and if your application becomes write-intensive, as is easily the case when you are dealing with a sufficiently large number of users, you will be in trouble if you don&#8217;t design to scale.  For example, if you are using MySQL, [...]]]></description>
			<content:encoded><![CDATA[<p>A well-known lesson in scalability is that writes are 40x more expensive than reads, and if your application becomes write-intensive, as is easily the case when you are dealing with a sufficiently large number of users, you will be in trouble if you don&#8217;t design to scale. For example, if you are using MySQL, you will most likely follow the conventional path of scaling by introducing one of many replication schemes, such as establishing a master server with several slave servers. All writes go to a single server, which then replicates out to the read-only slaves. This all works well until you reach the maximum number of writes any one of your servers can handle individually and replication lag rears its ugly head. If we ignore the specifics and brainstorm for a little bit, you might come to the conclusion that updating databases while they are online can be detrimental to the service&#8217;s performance, and that maybe we should find a way to batch these updates and deliver them in bulk to the slaves. This is exactly what the team behind Project Voldemort has done: in addition to the default (read-write) store available today, there is a read-only store that lets you build its index and data files in an offline system (like Hadoop) and, once they are built, provides a mechanism for swapping the current store for a fresher version.</p>
<p>At <a href="http://www.lookery.com/">Lookery</a>, we have been working very hard to transform most of our data processing tasks into batch-oriented workflows in order to deal with growth. For example, we were already using Hadoop to compute our index and data files for our largest database, but the process of serving that information took place over too many network hops (load balancers, reverse proxies and Amazon S3). Therefore, as soon as I learned that Project Voldemort supported offline building of distributed stores, I decided to try it and we&#8217;re now running it in production. Please read the rest for an example walkthrough building a Voldemort read-only index and data files using Hadoop. The goal for this tutorial is to deploy a Voldemort cluster storing words and their counts calculated by the canonical Hadoop example: <a href="http://wiki.apache.org/hadoop/WordCount">WordCount.</a></p>
<h4>Step 1: Build (preferably) or <a href="http://github.com/voldemort/voldemort/tree/master">download</a> Voldemort</h4>
<pre>
    &gt; git clone git://github.com/voldemort/voldemort.git
    &gt; cd voldemort
    &gt; ant</pre>
<h4>Step 2: Configure Cluster</h4>
<p>Now let&#8217;s create a single-node cluster for our wordcounts store. Create a working directory and, inside it, a subdirectory named &#8216;config&#8217; that holds the following files.</p>
<p><strong>server.properties</strong></p>
<pre>    # The ID of *this* particular cluster node
    node.id=0

    max.threads=100

    http.enable=true
    socket.enable=true
    enable.readonly.engine=true

    file.fetcher.class=voldemort.store.readonly.fetcher.HdfsFetcher</pre>
<p><strong>cluster.xml</strong></p>
<pre>    &lt;cluster&gt;
      &lt;name&gt;wordcounts&lt;/name&gt;
      &lt;server&gt;
        &lt;id&gt;0&lt;/id&gt;
        &lt;host&gt;localhost&lt;/host&gt;
        &lt;http-port&gt;8081&lt;/http-port&gt;
        &lt;socket-port&gt;6666&lt;/socket-port&gt;
        &lt;partitions&gt;0, 1&lt;/partitions&gt;
      &lt;/server&gt;
    &lt;/cluster&gt;</pre>
<p><strong>stores.xml</strong></p>
<pre>    &lt;stores&gt;
      &lt;store&gt;
        &lt;name&gt;wordcounts&lt;/name&gt;
        &lt;persistence&gt;read-only&lt;/persistence&gt;
        &lt;routing&gt;client&lt;/routing&gt;
        &lt;replication-factor&gt;1&lt;/replication-factor&gt;
        &lt;required-reads&gt;1&lt;/required-reads&gt;
        &lt;required-writes&gt;1&lt;/required-writes&gt;
        &lt;key-serializer&gt;
          &lt;type&gt;json&lt;/type&gt;
          &lt;schema-info&gt;"string"&lt;/schema-info&gt;
        &lt;/key-serializer&gt;
        &lt;value-serializer&gt;
          &lt;type&gt;json&lt;/type&gt;
          &lt;schema-info&gt;"int32"&lt;/schema-info&gt;
        &lt;/value-serializer&gt;
      &lt;/store&gt;
    &lt;/stores&gt;</pre>
<p>Let&#8217;s now make sure that everything is working as expected. Voldemort will automatically create a blank read-only store in a subdirectory called &#8216;data&#8217; if an existing one is not found.</p>
<pre>    &gt; $VOLDEMORT_HOME/bin/voldemort-server.sh . &amp;&gt; /tmp/voldemort.log &amp;
    &gt; $VOLDEMORT_HOME/bin/voldemort-shell.sh wordcounts tcp://localhost:6666
    Established connection to wordcounts via tcp://localhost:6666
    &gt; get "voldemort"
    null</pre>
<h4>Step 3: Prepare your input data as a sequence of words and counts</h4>
<p>Before you continue, you must already have Hadoop configured in either pseudo-distributed or fully-distributed mode. Unfortunately, I&#8217;m going to punt on this part of the setup because there&#8217;s plenty of good documentation on the <a href="http://wiki.apache.org/hadoop/GettingStartedWithHadoop">Hadoop</a> wiki itself. Once you have your cluster running, let&#8217;s proceed to run the wordcount example.</p>
<pre>    &gt; bin/hadoop dfs -copyFromLocal &lt;local-dir&gt; &lt;hdfs-dir&gt;
    &gt; bin/hadoop jar hadoop-*-examples.jar wordcount [-m &lt;#maps&gt;] [-r &lt;#reducers&gt;] &lt;in-dir&gt; &lt;out-dir&gt;</pre>
<p>We now have tab-separated records that contain each word and its count, &#8220;&lt;word&gt;TAB&lt;count&gt;&#8221;, stored in the &lt;out-dir&gt; directory.</p>
<h4>Step 4: Use your favorite Java IDE/Editor to implement your Hadoop mapper</h4>
<p>Project Voldemort provides you with most of the code necessary to build the store. The only missing part is how to extract the key and value objects from your input files. Since our WordCount example outputs its data in a very simple key/value format, our mapper only needs to split each line on the TAB character and return a String object for the key and an Integer object for the value, just as declared in our store definition inside stores.xml. You will need to update your CLASSPATH to point to the Voldemort jars and their dependencies in order for this to compile. Upon completion, you should export this class into a jar called wordcount-mapper.jar and place it inside a &#8216;lib&#8217; directory right alongside your &#8216;config&#8217; directory.</p>
<pre>    package com.lookery;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

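    public class HadoopStoreMapper extends AbstractHadoopStoreBuilderMapper&lt;LongWritable, Text&gt; {

        // Each input line has the form "&lt;word&gt;TAB&lt;count&gt;"; the word becomes the store key.
        @Override
        public Object makeKey(LongWritable key, Text value) {
            return value.toString().split("\t")[0];
        }

        // The count becomes the store value, parsed to match the "int32" schema in stores.xml.
        @Override
        public Object makeValue(LongWritable key, Text value) {
            return Integer.parseInt(value.toString().split("\t")[1]);
        }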
    }</pre>
<h4>Step 5: Build your read-only store from the command-line</h4>
<p>The next step is to call Project Voldemort&#8217;s shell script as shown in the example below. Most of the parameters should be self-explanatory; if not, you can always call the script with &#8216;--help&#8217; to get full details for all available parameters. However, a couple of them are worth explaining. First, let&#8217;s look at chunk size. The main reason for using Hadoop to build large read-only stores is its ability to run tasks in parallel. To help you take advantage of this, Project Voldemort builds the index and data files for each node in the cluster in chunks, which lets you maximize the number of reducers your Hadoop cluster dedicates to building the distributed store. You should experiment with the chunk size until the resulting number of chunks is within a small multiple of the maximum number of reducers available to you. In this case, I set it to 1 gigabyte. The second option worth mentioning is the replication factor. This setting lets you create redundancy in your storage so that your cluster can continue serving values even after some node failures. Please see the rest of the documentation to choose the best replication factor for your application scenario.</p>
<pre>    &gt; $VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh --input &lt;input_dir&gt; \
        --output wordcounts --tmpdir tmp-build --mapper com.lookery.HadoopStoreMapper \
        --jar lib/wordcount-mapper.jar --cluster config/cluster.xml \
        --storename wordcounts --storedefinitions config/stores.xml \
        --chunksize 1073741824 --replication 2
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Data size = 17934, ...
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Number of reduces: 1
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Building store...
    09/06/17 20:24:38 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/06/17 20:24:38 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/06/17 20:24:38 INFO mapred.JobClient: Running job: job_200906171849_0002
    09/06/17 20:24:39 INFO mapred.JobClient:  map 0% reduce 0%
    09/06/17 20:24:44 INFO mapred.JobClient:  map 100% reduce 0%
    ...</pre>
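<p>As a rough illustration of how the chunk size translates into reducers, here is a minimal sketch. The formula is an assumption made for this example (replicated bytes per node divided by the chunk size, with at least one chunk per node, and one reduce task per node and chunk), and ChunkPlanner and its method names are hypothetical; the builder&#8217;s actual computation may differ.</p>

```java
// Sketch only: the sizing formula is an assumption, not taken from the Voldemort source.
class ChunkPlanner {

    /** Chunks per node: replicated bytes per node divided by the chunk size, at least 1. */
    static int chunksPerNode(long dataSizeBytes, int replicationFactor, int numNodes, long chunkSizeBytes) {
        long replicatedBytesPerNode = dataSizeBytes * replicationFactor / numNodes;
        return (int) Math.max(replicatedBytesPerNode / chunkSizeBytes, 1L);
    }

    /** One reduce task per (node, chunk) pair. */
    static int reduceTasks(long dataSizeBytes, int replicationFactor, int numNodes, long chunkSizeBytes) {
        return numNodes * chunksPerNode(dataSizeBytes, replicationFactor, numNodes, chunkSizeBytes);
    }

    public static void main(String[] args) {
        long oneGiB = 1073741824L; // the --chunksize value used in the command above

        // The tiny wordcount output from the log above (17934 bytes) on a single node.
        System.out.println(reduceTasks(17934L, 2, 1, oneGiB));

        // A hypothetical 400 GiB data set on 10 nodes with replication factor 2.
        System.out.println(reduceTasks(400L * oneGiB, 2, 10, oneGiB));
    }
}
```

<p>Under these assumptions the tiny wordcount output needs a single reduce task, which agrees with the &#8220;Number of reduces: 1&#8221; line in the log, while the hypothetical 400 GiB data set would be split across 800 reduce tasks.</p>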
<h4>Step 6: Ask the cluster to fetch the newly built read-only store from Hadoop</h4>
<p>Now comes the easiest part: ask each node in the cluster to fetch its index and data chunks. Only after every node has succeeded will they all atomically swap their stores to the latest version. This has to be one of the coolest features in the read-only store. We hope you like it too. There are a couple of parameters available in the server.properties file that can help you fine-tune this capability. The first (hdfs.fetcher.tmp.dir) specifies a temporary folder for downloading the files from Hadoop HDFS. Make sure that the read-only data store and the temporary folder are on the same device, in order to avoid an extra copy of the downloaded files and to make the swap as atomic as the underlying filesystem permits. The second (fetcher.max.bytes.per.sec) tells the HdfsFetcher to throttle its download rate in order to avoid interfering with the online requests from Voldemort clients. The swap-store.sh command also has other options, such as a timeout, that can help you deal with long download times on the cluster nodes. Other than that, updating your online read-only store with this command should be straightforward.</p>
<pre>    &gt; $VOLDEMORT_HOME/bin/swap-store.sh --cluster config/cluster.xml \
        --file hdfs://localhost:54310/user/${user.name}/wordcounts --name wordcounts
    [2009-06-17 23:35:23,946] INFO Invoking fetch for node 0 for hdfs://localhost:54310/user/elias/node-0
    [2009-06-17 23:35:24,045] INFO Fetch succeeded on node 0 (voldemort.store.readonly.StoreSwapper)
    [2009-06-17 23:35:24,046] INFO Attempting swap for node 0 dir = /tmp/hdfs-fetcher/hdfs-fetcher/node-0
    [2009-06-17 23:35:24,060] INFO Swap succeeded for node 0 (voldemort.store.readonly.StoreSwapper)
    [2009-06-17 23:35:24,060] INFO Swap succeeded on all nodes in 0 seconds.</pre>
<h4>Step 7: Verify that your data was loaded successfully</h4>
<pre>    &gt; bin/voldemort-shell.sh wordcounts tcp://localhost:6666
    Established connection to wordcounts via tcp://localhost:6666
    &gt; get "voldemort"
    version(): 2</pre>
<p>We have been running at Lookery with this setup for almost a month now and have been very pleased with the results. It allows us to have a Voldemort cluster that is always refreshed with new data without manual intervention. If you have any corrections or suggestions on how we can improve either the batch indexer or this tutorial, please don&#8217;t hesitate to email the <a href="http://groups.google.com/group/project-voldemort">mailing list.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort</title>
		<link>http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/</link>
		<comments>http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 19:28:32 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[hadoop]]></category>

		<category><![CDATA[linkedin]]></category>

		<category><![CDATA[read-only]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=15</guid>
		<description><![CDATA[Many of LinkedIn&#8217;s products are critically dependent on computationally intensive data mining algorithms. Examples of these include some modules like People You May Know, Viewers of This Profile Also Viewed, and much of the Job matching functionality that we give to people who post jobs on the site. To support these data-intensive products we have [...]]]></description>
			<content:encoded><![CDATA[<p>Many of LinkedIn&#8217;s products are critically dependent on computationally intensive data mining algorithms. Examples of these include modules like People You May Know, Viewers of This Profile Also Viewed, and much of the Job matching functionality that we give to people who post jobs on the site. To support these data-intensive products we have begun to move many of the largest offline processing jobs to Hadoop. These jobs form a fairly typical data cycle. Data is moved out of twenty or so online data storage systems (Oracle, MySQL, Voldemort, etc.), as well as from our centralized logging service, into offline systems like <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop</a>, <a title="AsterData" href="http://www.asterdata.com">AsterData</a>, and our Oracle Data Warehouse. Moving all the data into centralized offline processing systems like these dramatically simplifies the implementation of complex algorithms which may use data from dozens of sources. Once data has been extracted, a sequence of offline processing jobs is run on it. Finally, the results are automatically loaded back into the live system to feed parts of the website. All offline data we produce is read-only once it goes live, to avoid the complexity of merging the offline computations with online updates during the next run of this data processing cycle.</p>
<p>The difficulty in these systems comes from the fact that large amounts of data need to be moved around every day. Thus, although hundreds of gigabytes or terabytes of data are not too difficult to handle when sitting still in a storage system, the problem becomes much, much harder when the data must be transformed to support quick lookups and moved between systems on a daily basis.</p>
<p>This post describes the system we built to deploy data to the live site using our key-value storage system, <a title="Project Voldemort" href="http://project-voldemort.com">Project Voldemort</a>.</p>
<p>Why do we end up with so much data? The size of the output is usually determined by the quantity of something on the site: we might compute something for each member profile, each question that gets asked, each news article that is posted, etc. These jobs may process a lot of data, especially if they involve any of the very large logging data streams, but the results, though large, are manageable. We have a second kind of job that is at least as common and produces results for each <em>pair</em> of users, or each <em>pair</em> of companies, or, say, the relationships <em>between</em> users and questions, or between the many other types of content on our site. As you might imagine the number of interesting pairs of items is much larger than the number of actual items (it isn&#8217;t as large as the square of the number of items, since most pairs aren&#8217;t interesting, but it is still huge). This seems to be a natural use case for social networks where the relationships are of central importance. Previously we did not need to confront this problem both because our data size was smaller, and also because our ability to produce large offline datasets was limited by computation constraints. Hadoop has been quite helpful in removing scalability problems in the offline portion of the system; but in doing so it creates a huge bottleneck in our ability to actually deliver data to the site. As is often the case, removing a bottleneck in one area creates a new bottleneck somewhere else.</p>
<p>To solve this problem we spent some time thinking about how to build support for large daily data cycles. Voldemort was designed to support fast, scalable read/write loads, and is already used in a number of systems at LinkedIn. It was not designed specifically with batch computation in mind, but it has a pluggable architecture which allows multiple storage engines to be supported in the same framework. This allows us to integrate our fast, failure-resistant online storage system with the heavy offline data crunching running on Hadoop.</p>
<p>Here is a picture of what our world looks like:</p>
<p><img class="aligncenter size-full wp-image-54" title="linkedin_arch" src="http://project-voldemort.com/blog/wp-content/uploads/2009/06/linkedin_arch.png" alt="linkedin_arch" width="538" height="423" /></p>
<h2>Some existing approaches</h2>
<p>There are plenty of other ways to approach this problem, but no one we talked to had a good solution. We saw a wide variety of things being done, including pushing static text files by hand, FTPing giant XML files, or doing JDBC batch inserts into an (Oracle) DB. None of these are really good approaches to the problem, since they typically have one of two common problems. The first is that the data transfer is centralized, creating an unscalable bottleneck in the delivery of the data. The second is that the process of building the lookup index (generally a btree) happens on the same live server that is serving lookups. This is a big problem, since building a large index is a huge and computationally intense operation that may take hours, and by doing this on the live server we are effectively mixing this huge throughput-oriented operation with short, latency-sensitive lookups, generally with poor results for your users.</p>
<h2>So what alternatives are there?</h2>
<p>The best online system for data lookups right now is <a title="Memcached" href="http://www.danga.com/memcached">memcached</a>. Memcached is stable and has excellent performance for common caching needs. The obvious problems with memcached are the &#8220;mem&#8221; and the &#8220;cache&#8221; parts. Memcached is all in memory, so you need to squeeze all your data into memory to be able to serve it (which can be an expensive proposition if the generated data set is large). In addition memcached is a cache, so if you need to restart your servers then your data will disappear and need to be re-pushed! Another problem is the apparent lack of batch set operations. Without this, the majority of time will inevitably be spent on unnecessary network round-trips no matter how efficiently we implement them. We could easily build a map reduce job to do this in parallel, but that only works around the underlying weakness in the per-node transfer rate.</p>
<p>The next best online system is <a title="MySQL" href="http://www.mysql.com">MySQL</a>. MySQL can avoid the one-round-trip-per-insert problem by doing batch inserts, but even that seems to give rather low throughput. MySQL&#8217;s InnoDB table format has too high a space overhead to make it a real competitor. However, MySQL has a very slim and simple MyISAM format. MyISAM isn&#8217;t used as much for normal read/write usage since it uses a global table write lock and lacks many transactional features, but this isn&#8217;t a problem for read-only usage since it is write-free. MySQL also supports an optimized &#8220;load data local infile&#8221; statement that provides bulk load capability. This is an extremely important feature for a disk-based storage format in this use case: building a 100GB index cannot be done effectively as a sequence of b-tree updates that incrementally re-arrange data as they go, because the total IO caused by all the little updates is extremely high. To avoid this, the tree needs to do a batch build that builds as much of the tree at once as possible, and this is exactly what the &#8220;load data&#8221; statement does. All in all, MySQL is slim, quick, and generally the best off-the-shelf solution to this problem we have seen. Still, to make this build effective you need a ton of memory, and it will lock the table for the duration of the build. This means that if you are running this on your live servers, they will be extremely heavily worked for the duration of the load (which can easily take hours). Not to mention that MySQL provides little in the way of ability to parallelize this, making constructing a system on top of it a difficult proposition.</p>
<p>Clearly building an index like this is an offline operation and should not be done on a server that is serving live traffic, as it will likely take CPU and IO resources away from serving the live requests. In principle this is possible, as MySQL (rather frighteningly) seems to allow you to just copy the files for a database into the database directory of a running server, which will immediately make the table appear available without restarting. But this would mean maintaining a whole separate cluster of MySQL servers just for the purpose of index building, as well as devising some way of parallelising this process. Finally, a practical point is that you will likely have to write the data to disk multiple times: once to copy it to the server as a text file, then again as it is built as a database, and finally a third time if it is copied to a live server (not to mention the fact that MySQL unfortunately seems to make its own internal copy of the data as well when you build the index, to support its transactional requirements). Since the load data statement doesn&#8217;t seem to support compression, storing your data as CSV is a rather large blow-up. These things don&#8217;t seem like they should be a big problem, but when your 400GB data set turns into a 1,200GB data set because all the numbers are in ASCII, and this file is then copied multiple times, that creates a serious problem.</p>
<h2>Requirements for a better solution</h2>
<p>These alternatives weren&#8217;t attractive, so we thought through what would be needed to do a good job with this problem. We came up with the following things:</p>
<ol>
<li><strong>Protect the live servers.</strong> Uploading a new data set can&#8217;t impact the services relying on the data. We want the upload of new results to go as fast as possible but no faster. This means moving as much computation out of the online system as possible, and guaranteeing the live servers are not negatively impacted.</li>
<li><strong>Horizontal scalability at each step</strong>. Hadoop gives us a scalable approach to the build, and Voldemort gives a scalable system for the lookups. The trick is just ensuring that there is no centralized bottleneck in the process.</li>
<li><strong>Ability to roll back</strong>. Like any code, the processes that generate the data may have some kind of error or bug that leads to generating corrupt data, but unlike most code problems, fixing things may not be so quick. Since the processes may take many hours to run, and the data automatically goes live without human perusal, this kind of failure can leave us in a bad position. It may take hours to rerun (or for some very computationally intense processes, days), and we could be stuck with the bad data until we manage to fix the bug, rerun everything, and re-push the fixed data. This is clearly unacceptable. As a result we would like to retain multiple copies of the data set, one for each of the last N pushes (where in the common case N = 1), so that we can revert to a known good state. This gives us a constant-time rollback to a previous data set.</li>
<li><strong>Failure tolerance.</strong> This is where the Voldemort consistent hashing comes into play: a server failure in the live system will redirect 1/K of that server&#8217;s traffic to each of the K remaining servers without impacting the client. We also get similar failure tolerance in the build from using Hadoop.</li>
<li><strong>Support large ratios of data to RAM.</strong> The original problem we are trying to solve is that the data size is very large, so we need to design accordingly. Improving performance in the case where the data is all in memory is not terribly valuable; the focus is on supporting a data size significantly larger than memory on each node.</li>
</ol>
<h2>Our approach</h2>
<p><a title="Bhupesh" href="http://www.linkedin.com/in/bhupeshbansal">Bhupesh</a>, <a title="Elias" href="http://www.linkedin.com/in/eliast">Elias</a>, and <a title="I" href="http://www.linkedin.com/in/jaykreps">I</a> toyed with solutions to these requirements, and here is the design we came up with.</p>
<p>One thing was clear: the Hadoop cluster is the natural place for the index build to occur. Hadoop is where the data is when the processing is done, and the goal of these machines is to run at full utilization, so however computationally intense the build process is, it will not be a problem.</p>
<p>For the live system we wanted to adapt our key-value system, Voldemort. To do this we wanted to add an on-disk structure, optimized for access to very large read-only data sets, that we could deploy in batch. In particular we wanted some kind of simple file-based format we could stream to the servers to avoid doing many network round trips during the deployment. Ideally we should be able to deploy data at the maximum rate permitted by the network or disk system of the Voldemort and Hadoop clusters.</p>
<p>In early versions of the storage engine we toyed with different lookup and caching structures. But some simple benchmarking revealed this to be a rather academic exercise. The fundamental fact of filesystem access is that you may or may not be accessing the underlying disk, depending on whether your request can be served by the OS&#8217;s pagecache or not. A pagecache hit on an mmap&#8217;d file takes less than 250 nanoseconds, but a page miss takes around 5 milliseconds (a mere twenty thousand times slower). Any fancy data structure we build is likely to reside in memory. Hence it would only help the lookups for things that would be in the pagecache anyway (since the process of loading them into memory would put them there), and so lookups on these would be fast no matter what. Worse, this in-process lookup structure will likely steal memory from the pagecache to store its data, and since this will duplicate things in the pagecache it is extremely inefficient. Thus even if we manage to improve the lookup time for the things in our process memory, that time is already quite low, and by doing so we use up memory, which moves more requests out of the ns column and into the ms column. In short, <a title="Amdahl" href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl</a> wins again.</p>
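<p>To make the cost of stealing pagecache memory concrete, the sketch below computes the expected lookup latency for a few hit rates, using only the two figures quoted above (250 ns for a hit, about 5 ms for a miss). The class name and the chosen hit rates are illustrative assumptions, not measurements.</p>

```java
// Sketch only: a two-point latency model built from the figures quoted in the text.
class PageCacheMath {
    static final double HIT_NANOS = 250.0;        // pagecache hit on an mmap'd file
    static final double MISS_NANOS = 5_000_000.0; // page miss, about 5 ms

    /** Expected latency of one lookup for a pagecache hit rate between 0 and 1. */
    static double expectedNanos(double hitRate) {
        return hitRate * HIT_NANOS + (1.0 - hitRate) * MISS_NANOS;
    }

    public static void main(String[] args) {
        System.out.println("miss/hit ratio: " + (MISS_NANOS / HIT_NANOS));
        // Illustrative hit rates; real values depend on the data set and available RAM.
        for (double hitRate : new double[] {0.99, 0.95, 0.90}) {
            System.out.printf("hit rate %.2f: about %.0f ns expected%n", hitRate, expectedNanos(hitRate));
        }
    }
}
```

<p>Under this model, dropping from a 99% to a 90% hit rate raises the expected latency roughly tenfold, from about 50 microseconds to about 500, which dwarfs anything that could be gained by making the in-memory hits faster.</p>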
<p>To take advantage of this we have a very simple storage strategy that exploits the fact that our data doesn&#8217;t change&#8211;all we do is just mmap the entire data set into the process address space and access it there. This provides the lowest overhead caching possible, and makes use of the very efficient lookup structures in the operating system. Since our data is immutable, we don&#8217;t need to leave any space for growth and can tightly pack the data and index. Since the OS maintains the memory it can be very aggressive about this cache, and indeed it will attempt to fill all free RAM at all times with recently used pages. In comparison Java is a very inefficient user of memory since it must leave lots of extra space for garbage collection, etc. Plus anyone who has gotten intimate with Java GC tuning will not object to moving things out of the Java heap space.</p>
<h2>How data is stored</h2>
<p>Sometimes it&#8217;s nice to know what is going on under the covers. The data for a store named my_store would consist of the following files:</p>
<pre>my_store/
  version-0/
    0.index
    0.data
    ...
    n.index
    n.data
  version-1/
    0.index
    0.data
    ...</pre>
<p>As you can see a store is just a directory of simple files. The <em>.data</em> files contain variable length values and the <em>.index</em> files contain the lookup structure necessary to map keys to values. In principle only one <em>.index</em> and <em>.data</em> file would be needed, but since writing a file is inherently single-threaded we break it into chunks numbered 0 through <em>n</em> to allow greater parallelism in the build. These chunks are then grouped into version directories containing a complete version of the data, with <em>version-0</em> containing the current live data set.</p>
<p>Deploying a new version of the data consists of adding a new directory and renaming the existing ones. Storing multiple copies of the data is clearly a huge waste of space, but this is not too important as inactive files use no pagecache space just disk space. Small low latency reads on a huge data set will be largely seek bound, so we are going to need a lot of disk spindles no matter how we store things. Hard drive space is fairly cheap, so getting slightly larger disks to store additional copies is not a big problem.</p>
<p>To reduce the size of the file pointers and to work around limitations in Java&#8217;s mmap implementation, we limit chunk files to a maximum size of 2GB, so a reasonably sized store will consist of tens or hundreds of chunks per Voldemort node.</p>
<p>The order in which the values in the .data files are stored is not important. Each value is prefixed by a 4-byte length indicating how many bytes to read. Each value is uniquely identified by the offset in the file at which its 4-byte length begins. The index contains 16-byte MD5 hashes of the keys along with the associated 4-byte position offset of the value in the data file. Because we hash keys, each key/value pair we store has a fixed overhead of exactly 24 bytes (16 for the hash, 4 for the offset, and 4 for the length prefix) in addition to the length of the value itself. Furthermore, the index can be stored very efficiently, as we can calculate the location of the <em>i</em>th index entry as 20 * <em>i</em>. All lookups are positional; no internal pointers are needed within the index to locate entries.</p>
<p>The question is how to structure the keys in the index for quick lookups. A page- or block-organized tree is a good data structure if the data does not fit in memory. But this complicates both the lookups (which would need to be block-aware) and the build process. In particular we want to perform our build in Hadoop, which means that we will be limited to the amount of memory available to the mapper and reducer tasks, which may leave only a few hundred megabytes for the build; if our index data does not fit in that, we will have to perform some kind of external tree build. However, on consideration we realized that since the index contains only 20 bytes per key, even a very moderate amount of memory can hold several hundred million entries. Given this low overhead, very likely the whole index can (and should) be in memory (pagecache, not Java heap), and so organizing the data by block or page is not really very important. As a result we greatly simplified our design: we just store the index entries in sorted order by the MD5 hash of the key.</p>
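<p>The layout can be sketched with two in-memory buffers. This is an illustration rather than the engine&#8217;s actual code: ChunkLayout and its methods are hypothetical, the buffers stand in for one chunk&#8217;s .index and .data files, and big-endian integers (the ByteBuffer default) are an assumption.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: in-memory stand-ins for one chunk's .index and .data files.
class ChunkLayout {
    static final int HASH_BYTES = 16;                    // MD5 digest length
    static final int INDEX_ENTRY_BYTES = HASH_BYTES + 4; // digest plus 4-byte data offset

    static byte[] md5(String key) {
        try {
            return MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a length-prefixed value and returns the offset at which its length begins. */
    static int appendValue(ByteBuffer data, byte[] value) {
        int offset = data.position();
        data.putInt(value.length);
        data.put(value);
        return offset;
    }

    /** Writes index entry i at byte position 20 * i. */
    static void putIndexEntry(ByteBuffer index, int i, byte[] keyHash, int dataOffset) {
        int base = INDEX_ENTRY_BYTES * i;
        for (int b = 0; b < HASH_BYTES; b++) {
            index.put(base + b, keyHash[b]);
        }
        index.putInt(base + HASH_BYTES, dataOffset);
    }

    /** Reads the value whose 4-byte length prefix begins at the given offset. */
    static byte[] readValue(ByteBuffer data, int offset) {
        byte[] value = new byte[data.getInt(offset)];
        for (int b = 0; b < value.length; b++) {
            value[b] = data.get(offset + 4 + b);
        }
        return value;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(1024);
        ByteBuffer index = ByteBuffer.allocate(INDEX_ENTRY_BYTES * 2);
        // Entries are written in arbitrary order here; a real index is sorted by hash.
        putIndexEntry(index, 0, md5("hadoop"), appendValue(data, "12".getBytes(StandardCharsets.UTF_8)));
        putIndexEntry(index, 1, md5("voldemort"), appendValue(data, "2".getBytes(StandardCharsets.UTF_8)));

        // Positional access: entry 1 starts at byte 20, and its offset follows the 16-byte hash.
        int offset = index.getInt(INDEX_ENTRY_BYTES * 1 + HASH_BYTES);
        System.out.println(new String(readValue(data, offset), StandardCharsets.UTF_8));
    }
}
```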
<p><img class="aligncenter size-full wp-image-25" title="data_format" src="http://project-voldemort.com/blog/wp-content/uploads/2009/05/data_format2.png" alt="data_format" width="397" height="607" /></p>
<p>A lookup in the store proceeds as follows:</p>
<ol>
<li>Calculate the MD5 of the key</li>
<li>The first 4 bytes of this MD5, modulo the number of chunks, is the chunk number to search in</li>
<li>Do a binary search for the key MD5 in that chunk&#8217;s .index file to get the position of the value in the data file</li>
<li>Finally read the appropriate number of bytes for the value from the data file starting at the given position</li>
</ol>
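<p>The four steps above can be sketched as follows. This is a simplified illustration, not the storage engine itself: ReadOnlyLookup is a hypothetical class, heap buffers stand in for the mmap&#8217;d chunk files, values are fixed to 4-byte integers, and the use of floorMod to keep the chunk number non-negative is an assumption.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch only: builds a few chunks in memory, then looks keys up as described above.
class ReadOnlyLookup {
    static final int HASH = 16;        // MD5 digest length in bytes
    static final int ENTRY = HASH + 4; // digest plus 4-byte offset into the data buffer

    final ByteBuffer[] index; // per chunk: entries sorted by digest
    final ByteBuffer[] data;  // per chunk: 4-byte length followed by the value

    ReadOnlyLookup(String[] keys, int[] counts, int numChunks) {
        index = new ByteBuffer[numChunks];
        data = new ByteBuffer[numChunks];
        for (int c = 0; c < numChunks; c++) {
            index[c] = ByteBuffer.allocate(ENTRY * keys.length);
            data[c] = ByteBuffer.allocate(8 * keys.length);
        }
        byte[][] hashes = new byte[keys.length][];
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) {
            hashes[i] = md5(keys[i]);
            order[i] = i;
        }
        // Visiting keys in digest order leaves every chunk's index sorted.
        Arrays.sort(order, (a, b) -> compare(hashes[a], hashes[b]));
        for (int i : order) {
            int c = chunkOf(hashes[i], numChunks);
            index[c].put(hashes[i]).putInt(data[c].position());
            data[c].putInt(4).putInt(counts[i]);
        }
    }

    static byte[] md5(String key) {
        try {
            return MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** First 4 bytes of the digest, modulo the number of chunks. */
    static int chunkOf(byte[] hash, int numChunks) {
        return Math.floorMod(ByteBuffer.wrap(hash).getInt(), numChunks);
    }

    /** Compares two digests byte by byte, treating bytes as unsigned. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    Integer get(String key) {
        byte[] hash = md5(key);                     // step 1
        int c = chunkOf(hash, index.length);        // step 2
        int low = 0;
        int high = index[c].position() / ENTRY - 1; // step 3: binary search
        byte[] probe = new byte[HASH];
        while (low <= high) {
            int mid = (low + high) >>> 1;
            for (int b = 0; b < HASH; b++) {
                probe[b] = index[c].get(mid * ENTRY + b);
            }
            int cmp = compare(probe, hash);
            if (cmp == 0) {
                int offset = index[c].getInt(mid * ENTRY + HASH);
                return data[c].getInt(offset + 4);  // step 4: skip the length prefix
            }
            if (cmp < 0) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "voldemort", "store", "index"};
        int[] counts = {12, 2, 7, 5};
        ReadOnlyLookup store = new ReadOnlyLookup(words, counts, 3);
        System.out.println(store.get("voldemort"));
        System.out.println(store.get("missing"));
    }
}
```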
<p>The code for this storage engine is quite simple, only a few hundred lines, with the distribution and fault tolerance&#8211;the hard problems&#8211;being provided by the rest of Voldemort.</p>
<p>Binary search is not a very efficient algorithm for finding the location of the data. Most of the time this is not important, since the index is in memory and so data access time dominates, but there are two cases that could be improved. The first is the case where all data and index fit entirely in memory. With very small keys, a chunk might have an index with, say, 100 million entries, which means a binary search does 27 key reads and comparisons and a single data read. In this case the cost of the search will dominate. Another suboptimal case is when we have an entirely uncached index. We explicitly transfer index files last in the data deployment to avoid this case; however, in the case of rolling back to a previous index version it is unavoidable. Paging the 100 million entry index for a chunk into memory will require roughly 500k page faults no matter what the structure is. However, it would be desirable to minimize the maximum number of page faults incurred on a given request in order to minimize the variance of the request time. In this case a page-organized tree, where each parent had 204 20-byte children, would need only about log_204(100 million) = 3.5 levels, or at most 4 page faults in the worst case, and would be superior.</p>
<p>To resolve these cases we are working on an improved search algorithm which takes into account the uniformity of the key distribution, which results from the fact that MD5 is (somewhat) cryptographically secure and so its hashes are uniformly distributed. Rather than always beginning with a comparison to the middle entry, such an algorithm would use the uniformity of the key distribution to compute the expected quantile of the key being looked up, attempting to jump immediately to the correct location. If we can get a reliable implementation, this promises to greatly reduce the number of both page faults and comparisons needed in these corner cases.</p>
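<p>These estimates can be checked with a little integer arithmetic. The sketch below assumes 4 KB pages (which is where a fan-out of 204 entries of 20 bytes comes from) and rounds every logarithm up to a whole number of probes or levels; SearchCostMath is a hypothetical name.</p>

```java
// Sketch only: back-of-the-envelope costs for a 100 million entry index chunk.
class SearchCostMath {

    /** Smallest d such that base^d >= n, i.e. the ceiling of log_base(n), using integers only. */
    static int ceilLog(long n, long base) {
        int depth = 0;
        for (long reach = 1; reach < n; reach *= base) {
            depth++;
        }
        return depth;
    }

    /** Number of pages needed to hold the whole index, rounded up. */
    static long pagesFor(long entries, int entryBytes, int pageBytes) {
        long indexBytes = entries * entryBytes;
        return (indexBytes + pageBytes - 1) / pageBytes;
    }

    public static void main(String[] args) {
        long entries = 100_000_000L;
        int entryBytes = 20;
        int pageBytes = 4096; // assumed page size
        int fanout = pageBytes / entryBytes;

        System.out.println("binary search probes: " + ceilLog(entries, 2));
        System.out.println("index pages to fault in: " + pagesFor(entries, entryBytes, pageBytes));
        System.out.println("entries per page: " + fanout);
        System.out.println("page-organized tree levels: " + ceilLog(entries, fanout));
    }
}
```

<p>With these assumptions a binary search needs 27 probes, the full index spans a little under 500k pages, and a page-organized tree with a fan-out of 204 is 4 levels deep.</p>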
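<p>Jumping to the expected quantile of a key is commonly known as interpolation search. The following is a minimal sketch of that general technique, not the algorithm the team is building: it assumes each key has been reduced to a non-negative 63-bit number (a hypothetical stand-in for a prefix of the MD5 hash), and it counts probes so the effect can be observed.</p>

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: interpolation search over sorted, uniformly distributed, non-negative keys.
class InterpolationSearch {

    /** Returns a position of target in the sorted array, or -1; adds the probes made to probes[0]. */
    static int search(long[] sorted, long target, int[] probes) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high && target >= sorted[low] && target <= sorted[high]) {
            int mid;
            if (sorted[high] == sorted[low]) {
                mid = low;
            } else {
                // Estimate the quantile of the target between the two current bounds.
                double fraction = (double) (target - sorted[low]) / (double) (sorted[high] - sorted[low]);
                mid = low + (int) (fraction * (high - low));
            }
            probes[0]++;
            if (sorted[mid] == target) {
                return mid;
            }
            if (sorted[mid] < target) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        long[] keys = new long[1_000_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = random.nextLong() >>> 1; // uniform and non-negative
        }
        Arrays.sort(keys);

        int[] probes = new int[1];
        int lookups = 1000;
        for (int i = 0; i < lookups; i++) {
            search(keys, keys[random.nextInt(keys.length)], probes);
        }
        // For comparison, a binary search over one million keys needs about 20 probes.
        System.out.println("average probes per lookup: " + (double) probes[0] / lookups);
    }
}
```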
<h3>Index Building</h3>
<p>To build these store files we created two programs: a single-process command-line Java program and a distributed Hadoop-based store builder. The single-process program uses a simple external sort to build the index files. Since this is a centralized process it is only useful for small data sets, testing, or one-time builds.</p>
<p>The Hadoop-based store builder is actually substantially simpler than the single-process builder as it leans heavily on Hadoop&#8217;s native capabilities to do its work. The store building process proceeds as follows. A user-extensible Mapper extracts keys from the source data. This mapper can be parametrized to work with different <a title="InputFormat" href="http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/InputFormat.html">InputFormat</a>s, and provides hooks to allow custom ways to construct the key and value from the data. A custom Hadoop <a title="Partitioner" href="http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/Partitioner.html">Partitioner</a> then applies the Voldemort consistent hashing function to the keys, and assigns all keys mapped to a given node and chunk to a single reduce task. The shuffle phase of the map/reduce copies all values with the same destination node and chunk to the same reduce task. Thus each of the reduce tasks will create one .index and one .data file for a given chunk on a particular node, and as a result the number of chunks specified in the configuration acts as a parameter to control the parallelism of the build. These values are then sorted by Hadoop in order to group them by key for reduce. Each reduce task copies the key/value pairs it is given into a pair of .index and .data files in sorted order to build its store chunk.</p>
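<p>To make the partition, shuffle and sort steps concrete, here is a small single-process simulation. It is a sketch under loud assumptions: a plain modulo over an MD5 hash stands in for Voldemort&#8217;s consistent hashing function, and every function name below is hypothetical rather than part of the store builder.</p>
<pre>
```python
import hashlib

def node_and_chunk(key, num_nodes, chunks_per_node):
    """Assumed stand-in for the real consistent hashing: MD5 plus modulo."""
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    node = h % num_nodes
    chunk = (h // num_nodes) % chunks_per_node
    return node, chunk

def reduce_task_for(key, num_nodes, chunks_per_node):
    """Partitioner step: one reduce task per (node, chunk) pair."""
    node, chunk = node_and_chunk(key, num_nodes, chunks_per_node)
    return node * chunks_per_node + chunk

def build_chunks(records, num_nodes, chunks_per_node):
    """Simulated shuffle and reduce: group by task, then sort by hashed key."""
    tasks = {}
    for key, value in records:
        task = reduce_task_for(key, num_nodes, chunks_per_node)
        tasks.setdefault(task, []).append((hashlib.md5(key).digest(), value))
    # Each task's sorted list corresponds to one .index/.data file pair.
    return {task: sorted(pairs, key=lambda p: p[0])
            for task, pairs in tasks.items()}
```
</pre>
<p>With 10 nodes and 70 chunks per node this yields up to 700 independent tasks, which is why the chunk count rather than the node count bounds the parallelism of the build.</p>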
<h2>Data deployment</h2>
<p>It is important that we be able to swap in a complete data set all at once without any downtime or impact to the live cluster. As described above, multiple data versions are kept in the <em>version-</em> subdirectories, but only <em>version-0</em> is used for serving data. Versions 1 through <em>n</em> are effectively backups. When a new data version is deployed, the version number of each existing data set is incremented, and the new set becomes the new <em>version-0</em>. To perform this swap a simple reader/writer lock is used to halt readers, each directory <em>version-i</em> is moved to <em>version-(i+1)</em>, the new data is moved to <em>version-0</em>, and the store is reopened and unlocked using this new dataset. Since only file renames are used, this is an O(1) operation, and in practice the whole procedure seems to complete in a few milliseconds irrespective of file size. Deletion of version <em>n</em>+1 is deferred until after the lock is released, as delete may not be an O(1) operation and may take over a minute on a filesystem that lacks <a title="extents" href="http://en.wikipedia.org/wiki/Extents">extents</a>, such as ext3.</p>
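<p>The rename-based rotation can be sketched in a few lines. This is an illustration only: the class name <em>VersionedStore</em> and its methods are hypothetical, a plain mutex stands in for the reader/writer lock, and reopening the store is omitted.</p>
<pre>
```python
import os
import shutil
import threading

class VersionedStore:
    """Hypothetical sketch of the version-N directory rotation."""

    def __init__(self, base_dir, max_backups):
        self.base_dir = base_dir
        self.max_backups = max_backups
        self.lock = threading.Lock()  # stand-in for the reader/writer lock

    def _version(self, i):
        return os.path.join(self.base_dir, "version-%d" % i)

    def swap(self, new_data_dir):
        with self.lock:
            existing = sorted(
                int(name.split("-")[1])
                for name in os.listdir(self.base_dir)
                if name.startswith("version-")
            )
            # Rename the highest version first so nothing is overwritten.
            for i in reversed(existing):
                os.rename(self._version(i), self._version(i + 1))
            os.rename(new_data_dir, self._version(0))
        # The slow delete happens only after the lock is released.
        stale = self._version(self.max_backups + 1)
        if os.path.isdir(stale):
            shutil.rmtree(stale)
```
</pre>
<p>The new data directory has to live on the same filesystem as the store directory, otherwise the rename is no longer a cheap metadata operation.</p>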
<p>The actual method for transferring data is pluggable. The original prototype used rsync in the hope of efficiently supporting the transfer of diffs. However, this had two practical problems. The first was that the rsync diff calculation appears to be quite expensive, and half of the expensive calculation is done on the live server. Clearly if we want to do diffs, that too should be done on the batch system (Hadoop), not the live system (Voldemort). In fact, due to this heavy calculation rsync was actually slower than just copying the whole file, even when the diff was rather small (though presumably much more network efficient). The more fundamental problem was that using rsync required copying the data out of HDFS to some local unix filesystem&#8211;which had better have enough space!&#8211;to be able to run rsync. This copying took as long as the data transfer to Voldemort, and meant we were copying the data twice.</p>
<p>To avoid these problems we switched from a push model to a pull model. It was important that we could schedule the transfer from the batch system to run automatically when the build completed successfully, so this took the form of a RESTful fetch command which triggers the Voldemort servers to fetch the data directly from HDFS. This mechanism is pluggable and a third party can provide an alternate implementation of the fetch command to support non-HDFS based mechanisms.</p>
<p>HDFS provides great throughput and seems to be able to max out the write capabilities of the Voldemort node. This is a blessing and a curse. Anyone who has lived with JDBC-based data transfer and seen it bottleneck on a measly few hundred KBs/sec will be overwhelmed with joy. But once again, removing performance problems in one area creates performance problems elsewhere: the high rate of data transfer to the live servers, even without any index building, can potentially starve live requests. However, in this model, where the server controls the pull, each Voldemort node can be configured to throttle itself to a fixed MB/sec limit so as not to overwhelm its local I/O capabilities. We have implemented a Voldemort configuration property, fetcher.max.bytes.per.sec, that controls this rate.</p>
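<p>A byte-rate throttle of this kind can be quite small. The sketch below is not the Voldemort fetcher; <em>throttled_copy</em> and its parameters are hypothetical, with <em>max_bytes_per_sec</em> playing the role that fetcher.max.bytes.per.sec plays in the real configuration. The clock and sleep functions are injectable so the behaviour can be checked without waiting.</p>
<pre>
```python
import time

def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, never exceeding max_bytes_per_sec on average.

    Hypothetical helper for illustration; src and dst are file-like objects.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which this many bytes are allowed to have moved.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```
</pre>
<p>Pacing against the total bytes copied since the start, rather than sleeping a fixed amount per chunk, lets the copy catch up after a slow read without ever exceeding the configured average rate.</p>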
<p>We have provided a driver program which initiates this fetch and swap procedure in parallel across a whole Voldemort cluster. In our tests this process can reach the I/O limit of either the Hadoop cluster or the Voldemort cluster.</p>
<p><img class="aligncenter size-full wp-image-23" title="store_build_process2" src="http://project-voldemort.com/blog/wp-content/uploads/2009/05/store_build_process2.png" alt="store_build_process2" width="486" height="274" /></p>
<h2>Some benchmarks</h2>
<p>There are two things to benchmark: the build time for a store in Hadoop and the request rate a node can sustain once live. We completed our benchmarks on EC2, since this is an easy way to get big clusters up and running for a quick test. Hopefully this will aid in making the results reproducible by others interested in testing different scenarios. We used extra large instances for both Hadoop and Voldemort as these most closely match our own hardware, and we used the <a title="Cloudera Hadoop AMI" href="http://www.cloudera.com/hadoop-ec2">Cloudera Hadoop AMI</a> to get the test cluster up and running quickly.</p>
<p>Benchmarking anything that involves disk access is notoriously difficult because of sensitivity to three factors:</p>
<ol>
<li>The ratio of data to memory</li>
<li>The performance of the disk subsystem, and</li>
<li>The entropy of the request stream</li>
</ol>
<p>The ratio of data to memory and the entropy of the request stream determine how many cache misses will be sustained, so these are critical. A random request stream is more or less un-cachable, but fortunately almost no real request streams are random. They tend to have strong temporal locality which is what page cache eviction algorithms exploit. So for our testing we can assume a large ratio of memory to disk, and test against a simulated request stream to get performance information.</p>
<p>The performance is still very sensitive to the quality of the disk subsystem used for the Voldemort nodes. A live system like this will do lots of quick seeks with relatively small reads and will likely be bound by the seek time of the hard drive and the number of drives. The drives on the EC2 machines are fairly weak and not configured with RAID so they are not optimal if you are purchasing hardware, and in our tests all the processes we benchmark are IO bound. To help make sense of all these variables we provide a comparison to MySQL&#8217;s performance on the same tasks on the same hardware.</p>
<p>We are not aware of an existing system that does full build and data deployment in parallel, so there is no direct comparison possible. But any build process will consist of three stages: (1) partitioning the data into separate sets for each destination node, (2) gathering all data for a given node, and (3) building the lookup structure for that node. We can only compare results for the actual build (i.e. part 3) with MySQL as there is no off-the-shelf method for (1) and (2).</p>
<p>For our tests the keys are integers in ascii form. The values are meaningless 1024 byte strings.</p>
<h3>Build Time</h3>
<p>We tested the Hadoop build for a variety of store sizes. This time is the complete build time including mapping the data out to the appropriate node-chunk, shuffling the data to the nodes that will do the build, and finally creating the store files. In general, the time was roughly evenly split between map, shuffle and reduce phases. The number of map and reduce tasks is a very important parameter, as experiments on a smaller data set showed that varying the number of tasks could change the build time by more than 25%. Due to time constraints, however, no attempt was made to optimize these; we just used whatever defaults Hadoop produced. Here are the times taken:</p>
<ul>
<li>100GB: 28mins (400 mappers, 90 reducers)</li>
<li>512GB: 2hrs, 16mins (2313 mappers, 350 reducers)</li>
<li>1TB: 5hrs, 39mins (4608 mappers, 700 reducers)</li>
</ul>
<p>To compare the build time we created a RAID 10 array on a single extra large instance, and did a MySQL build using one node&#8217;s worth of data (100m keys). This process took 6 hours and 3 minutes to build the 100GB table for a single node. Assuming similar performance for partitioning and copying data around, this would indicate a complete build time of almost 8 hours per destination node. But this comparison ignores the time necessary to extract the data from the source system and convert it to CSV format for loading. And, of course, this neglects the additional benefits of Hadoop for handling failures, dealing with slower nodes, etc.</p>
<p>In addition, this process is scalable: it can be run on a number of machines equal to the number of chunks (700 in our 1TB case) not the number of destination nodes (only 10).</p>
<p>Data transfer between the clusters happens at a steady rate bound by the disk or network. For our Amazon instances this is around 40MB/second.</p>
<h3>Online Performance</h3>
<p>Lookup time for a single Voldemort node also compares well to a single MySQL instance. To test this we ran local tests against the 100GB per-node data from the 1 TB test. This test was also run on an Amazon Extra Large instance with 15GB of RAM and the 4 ephemeral disks in a RAID 10 configuration. To run the tests we simulated 1 million requests from a real request stream recorded on our production system against each of the storage systems. We see the following performance for 1 million requests against a single node:</p>
<table style="text-align: center; height: 119px;" border="0" width="301">
<tbody>
<tr>
<td></td>
<td><strong>MySQL</strong></td>
<td><strong>Voldemort</strong></td>
</tr>
<tr>
<td style="text-align: left;"><strong>Reqs per sec.</strong></td>
<td>727</td>
<td>1291</td>
</tr>
<tr>
<td style="text-align: left;"><strong>Median req. time</strong></td>
<td>0.23 ms</td>
<td>0.05 ms</td>
</tr>
<tr>
<td style="text-align: left;"><strong>Avg. req. time</strong></td>
<td>13.7 ms</td>
<td>7.7 ms</td>
</tr>
<tr>
<td>
<p style="text-align: left;"><strong>99th percentile req. time</strong></p>
</td>
<td>127.2 ms</td>
<td style="text-align: center;">100.7 ms</td>
</tr>
</tbody>
</table>
<p>These numbers are both for local requests with no network involved as the only intention is to benchmark the storage layer of these systems.</p>
<h2>How to actually use it</h2>
<p>The code is all checked in to <a title="github" href="http://github.com/voldemort/voldemort/tree/master">the main project repository on github</a>. The commands for building a store, and executing a swap can be found under the bin/ directory. Elias has written a <a href="http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop">blog entry</a> on how to use these, and how he has put this system into action at <a href="http://www.lookery.com">Lookery</a>.</p>
<h2>Future work</h2>
<p>Nothing is ever finished, and below are a few of the ideas we didn&#8217;t quite get to. There are a lot of huge performance wins that exploit the immutable nature of the data that we have not yet taken advantage of. If anyone is interested in playing with one of these problems, here are a few ideas. LinkedIn is also looking for engineers to work on Project Voldemort full time, so if that sounds interesting <a href="http://www.linkedin.com/static?key=jobs_open">send us a resume</a>.</p>
<h4>Incremental data updates</h4>
<p>Despite the problems with rsync, incremental data pushes would be quite a big improvement for the case where the data changes by only 5%. This is a common case for a job that runs daily to recompute a large set of values. Getting efficient incremental performance is a harder problem than it sounds. We have never gotten a production system that will do this well in our past attempts: rsync didn&#8217;t seem to work well, MySQL&#8217;s load data performance is destroyed by pre-existing unique indexes, and Oracle insert/update is slower than a complete transfer and rebuild for anything but the most minor of changes.</p>
<p>There are two ways we can think of to support this. The first is the easiest to implement, and just consists of creating a diff file in much the same way that Unix diff or rsync does, and using it in combination with the existing data on the live server to create the new set (rather than deploying everything each time). Since computing the diff turns out to be a very computationally intensive task for a large file, this work must be done in the offline Hadoop system. There is little point in trying to do incremental updates to the .index files as these are comparatively small, and the changes are liable to be randomly interspersed throughout&#8211;so the diff won&#8217;t be much smaller than the original file. The .data files, however, do not have any inherent order, so all the new data could be placed at the end of the file, allowing for extremely efficient diffs. The &#8220;patch&#8221; could be applied in the process of doing the fetch, so that existing data segments would be read from the current <em>version-0</em> dataset and new segments would be read from the HDFS diff file.</p>
<p>The above strategy reduces the network transfer necessary and could be run in steady state each day. However, it does still require writing the complete data set to disk for each deployment, rather than writing the smaller diff only. One strategy that could avoid this would be to create a separate set of .index and .data files for each day and store each day&#8217;s data separately on the live system. A naive lookup would have to check each of these directories from latest to oldest to retrieve a value. However, a more sophisticated approach could keep a <a title="Bloom filter" href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a> tracking which keys are in each day&#8217;s patch. This would give a quick way to determine which files need not be searched without actually performing the search (with high probability).</p>
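<p>A rough sketch of the per-day lookup with Bloom filters follows. Nothing here exists in Voldemort: <em>BloomFilter</em> and <em>PatchedStore</em> are hypothetical names, a dict stands in for each day&#8217;s .index/.data pair, and the filter uses one byte per bit purely to keep the code short.</p>
<pre>
```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter (one byte per bit for clarity)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        digest = hashlib.md5(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") + 1
        # Double hashing: derive several positions from two base hashes.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

class PatchedStore:
    """One patch per day, searched newest first; dicts stand in for files."""

    def __init__(self):
        self.patches = []  # newest first: (bloom filter, day's data)

    def add_day(self, data, num_bits=65536, num_hashes=4):
        bloom = BloomFilter(num_bits, num_hashes)
        for key in data:
            bloom.add(key)
        self.patches.insert(0, (bloom, dict(data)))

    def get(self, key):
        searched = 0
        for bloom, data in self.patches:
            if not bloom.might_contain(key):
                continue  # definitely not in this day's files, skip the search
            searched += 1
            if key in data:
                return data[key], searched
        return None, searched
```
</pre>
<p>A Bloom filter never reports a false negative, so skipping a day is always safe. A false positive only costs one wasted search of that day&#8217;s files.</p>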
<h4>Improved key hashing</h4>
<p>The consistent hashing algorithm Voldemort uses has the nice property that N copies of a key are more or less randomly distributed over the cluster. As a result a failed node redistributes load evenly to the remaining nodes. This is an essential property to be able to tolerate node failure. However, this algorithm has the unfortunate property that the data on each machine is different, and as a result the data built for each machine is unique. This means that a replication factor of two doubles the size of the data that must be built. A hashing algorithm tuned to this use case could avoid this problem by replicating at the chunk level. This would provide less fine-grained load distribution since each chunk of data would be fully replicated on N machines, but would avoid the blow-up due to replication factor. Voldemort supports pluggable hashing algorithms so this should not be too difficult to implement, but it didn&#8217;t make the cut for our first attempt.</p>
<h4>Compression</h4>
<p>Because the data is known upfront and does not change, it is a good target for compression. LZO compression or another compression algorithm tuned towards fast decompression speed could improve performance by reducing IO.</p>
<h4>Better indexing</h4>
<p>Fancier index structures are another area for improvement. We think the probabilistic binary search will prove to be a very effective approach, but since we haven&#8217;t implemented it yet it is worth considering a few other approaches.</p>
<p>The idea of a 204-way page-aligned tree instead of a binary tree was mentioned above. Each set of 204 20-byte index entries would take 4080 bytes, which with 16 bytes of padding would then be exactly page aligned for a 4k page. This would mean the first 4k page of the index file would contain the hottest entries, the next 204 entries would contain the next hottest, and so forth. Thus, even though the number of comparisons necessary to locate an entry would not asymptotically improve, the maximum number of page faults necessary to do an index lookup would decrease substantially in practical terms (to 4 or 5 for a large index).</p>
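<p>The arithmetic behind these figures is easy to check. The helper names below are made up for illustration; the only inputs are the 4k page size and 20-byte entry size given in the text.</p>
<pre>
```python
PAGE_SIZE = 4096   # bytes per page, as stated in the text
ENTRY_SIZE = 20    # bytes per index entry, as stated in the text

def fanout(page_size=PAGE_SIZE, entry_size=ENTRY_SIZE):
    """Index entries that fit in one page."""
    return page_size // entry_size

def padding(page_size=PAGE_SIZE, entry_size=ENTRY_SIZE):
    """Unused bytes left at the end of each page."""
    return page_size - fanout(page_size, entry_size) * entry_size

def tree_depth(num_entries, children_per_node):
    """Levels (one page read each) needed to address num_entries."""
    depth, capacity = 1, children_per_node
    while capacity < num_entries:
        capacity *= children_per_node
        depth += 1
    return depth

def binary_search_probes(num_entries):
    """Worst-case comparisons for a binary search over num_entries."""
    return num_entries.bit_length()

def index_pages(num_entries, entries_per_page):
    """Pages needed to hold the whole index (rounded up)."""
    return -(-num_entries // entries_per_page)
```
</pre>
<p>For a 100 million entry chunk this gives a fanout of 204 with 16 bytes of padding, 27 worst-case comparisons for a binary search, and a 4-level page-organized tree. The full index occupies a little under half a million pages either way.</p>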
<p>This is not the optimal tree structure for an immutable tree such as ours, though. A much better approach to this problem was brought up by Elias, who was familiar with the literature on <a title="cache-oblivious algorithms" href="http://en.wikipedia.org/wiki/Cache-oblivious_algorithm">cache-oblivious algorithms</a> and was aware of a cache-oblivious structure called a <a title="van Emde Boas tree" href="http://en.wikipedia.org/wiki/Van_Emde_Boas_tree">van Emde Boas tree</a>. Cache-oblivious algorithms use data structures that recurse in on themselves in a way that requires no assumptions or special treatment for page sizes or CPU caches to get optimal cache performance (under some reasonable assumptions). There is a well-developed set of cache-oblivious data structures for dealing with disk-based tree lookups. These algorithms manage to nicely utilize CPU cache as well, all without explicit assumptions about the memory hierarchy.</p>
<p>Still another alternative would be an on-disk hash-based lookup structure. Such a structure can reduce the number of required comparisons in a lookup, though as with a tree, its creation could be difficult. The trade-off between extra space used in the hashing and the collision rate is well known. At the far end of this spectrum is the <a title="Minimal Perfect Hash" href="http://en.wikipedia.org/wiki/Minimal_perfect_hashing">minimal, perfect hash</a> which is a function that hashes a fixed set of <em>N</em> keys to exactly <em>N</em> hash table slots. This structure would seem to be optimal for the lookups since it guarantees that we will require no more than one lookup to find the location of the data (indeed we could entirely avoid storing the hash value itself reducing the index entry size to only 4 bytes for the position). The hash function itself requires only about 3 bits per entry to be stored once it has been found. Computing these hashes can be difficult, though, so only a real implementation would show if the superior lookup time was justified by the possible increase in build time. There is an off-the-shelf <a title="Sux4J" href="http://sux.dsi.unimi.it">MPH implementation</a> from the author of mg4j but we have not yet investigated the feasibility of this in much detail.</p>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Moved to github</title>
		<link>http://project-voldemort.com/blog/2009/05/moved-to-github/</link>
		<comments>http://project-voldemort.com/blog/2009/05/moved-to-github/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 05:24:16 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=27</guid>
		<description><![CDATA[By popular demand. Fork us.
]]></description>
			<content:encoded><![CDATA[<p>By popular demand. <a href="http://github.com/voldemort/voldemort/tree/master">Fork us</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/05/moved-to-github/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Added site documentation to Subversion</title>
		<link>http://project-voldemort.com/blog/2009/03/added-site-documentation-to-svn/</link>
		<comments>http://project-voldemort.com/blog/2009/03/added-site-documentation-to-svn/#comments</comments>
		<pubDate>Sun, 08 Mar 2009 23:21:57 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=5</guid>
		<description><![CDATA[I added the project-voldemort.com site HTML to subversion today.]]></description>
			<content:encoded><![CDATA[<p>I added the <a title="project-voldemort.com" href="http://project-voldemort.com" target="_self">project-voldemort.com</a> site HTML to subversion today. Documentation is the weakness of so many open source projects, and that is definitely true for this one. I think people widely underestimate how much of the value of a project or product is not the directory of code, but rather the human capital in the heads of the people who know how to use and extend the code.</p>
<p>You can check out the site code here: <a title="http://code.google.com/p/project-voldemort/source/browse/#svn/site/www" href="http://code.google.com/p/project-voldemort/source/browse/#svn/site/www" target="_self">http://code.google.com/p/project-voldemort/source/browse/#svn/site/www</a>.</p>
<p>Patches gladly accepted.</p>
]]></content:encoded>
			<wfw:commentRss>http://project-voldemort.com/blog/2009/03/added-site-documentation-to-svn/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
