<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>SNA Projects Blog</title>
	<atom:link href="http://sna-projects.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://sna-projects.com/blog</link>
	<description>LinkedIn's Search Network and Analytics team</description>
	<pubDate>Thu, 26 Aug 2010 19:16:08 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Zookeeper experience</title>
		<link>http://sna-projects.com/blog/2010/08/zookeeper-experience/</link>
		<comments>http://sna-projects.com/blog/2010/08/zookeeper-experience/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 19:16:08 +0000</pubDate>
		<dc:creator>jrao</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=430</guid>
		<description><![CDATA[While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I need to use Zookeeper (ZK) to implement the load-balancing logic. I&#8217;d like to share my experience of using Zookeeper. First of all, for those of you who don&#8217;t know, Zookeeper is an Apache project that implements a consensus service based [...]]]></description>
			<content:encoded><![CDATA[<p>While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I needed to use <a title="Zookeeper" href="http://hadoop.apache.org/zookeeper/">Zookeeper</a> (ZK) to implement the load-balancing logic. I&#8217;d like to share my experience of using Zookeeper. First of all, for those of you who don&#8217;t know, Zookeeper is an Apache project that implements a consensus service based on a variant of <a href="http://en.wikipedia.org/wiki/Paxos_algorithm">Paxos</a> (it&#8217;s similar to Google&#8217;s <a title="Chubby" href="http://labs.google.com/papers/chubby.html">Chubby</a>). ZK has a very simple, file-system-like API. One can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZK also does a couple of more interesting things: (a) one can register a watcher on a path and get notified when the children of the path or the value of the path changes, and (b) a path can be created as ephemeral, which means that if the client that created the path is gone, the path is automatically removed by the ZK server. However, don&#8217;t let the simple API fool you. One needs to understand a lot more than those APIs in order to use them properly. For me, this translated to weeks of asking the ZK mailing list (which is pretty responsive) and our local ZK experts.</p>
<p>To get started, it&#8217;s important to understand the state transitions and the associated watcher events inside a ZK client. A ZK client can be in one of three states: disconnected, connected, and closed. When a client is created, it&#8217;s in the disconnected state. Once a connection is established, the client moves to the connected state. If the client loses its connection to a server, it switches back to the disconnected state. If it can&#8217;t connect to any server within some time limit, it eventually transitions to the closed state. For each state transition, a state-changing event (disconnected, syncconnected, or expired) is sent to the client&#8217;s watcher. As you will see, those events are critical to the client. Finally, if one performs an operation on ZK when the client is in the disconnected state, a ConnectionLossException (CLE) is thrown back to the caller. More detailed information can be found at the ZK <a href="http://wiki.apache.org/hadoop/ZooKeeper/FAQ">site</a>. Many of the subtleties of using ZK lie in dealing with those state-changing events.</p>
<p>The first tricky issue is related to CLE. The problem is that when a CLE happens, the requested operation may or may not have taken place on ZK. If the connection was lost before the request reached the server, the operation didn&#8217;t take place. On the other hand, it can happen that the request did reach the server and got executed there, but the connection was lost before the server could send a response back. If the request is a read or an update, one can just keep retrying until the operation succeeds. It becomes a problem if the request is a create. If you simply retry, you may get a NodeExistsException, and it&#8217;s not clear whether you or someone else created the path. What one can do is set the value of the path to a client-specific value during creation. If a NodeExistsException is thrown, read the value back to check who actually created it. One can&#8217;t use this approach for sequential paths (a ZK feature that creates a path with a generated sequential id) though. If you retry, a different path will be created. You also can&#8217;t check who created the path, since if you get a CLE, you don&#8217;t know the name of the path that was created. For this reason, I think that sequential paths have very limited benefit since it&#8217;s very hard to use them correctly.</p>
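<p>The create-and-verify pattern can be sketched roughly as follows. This is a minimal illustration in Python, not real ZK client code: <code>FlakyStore</code>, <code>ConnectionLoss</code>, <code>NodeExists</code>, <code>create_owned</code> and the path <code>/brokers/ids/1</code> are all hypothetical stand-ins, and the store is an in-memory fake that applies the first write but loses the reply.</p>

```python
import uuid


class ConnectionLoss(Exception):
    """Hypothetical stand-in for ZooKeeper's ConnectionLossException."""


class NodeExists(Exception):
    """Hypothetical stand-in for ZooKeeper's NodeExistsException."""


class FlakyStore:
    """In-memory fake server: the first create() is applied, but its reply is lost."""

    def __init__(self):
        self.data = {}
        self.drop_next_reply = True

    def create(self, path, value):
        if path in self.data:
            raise NodeExists(path)
        self.data[path] = value
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise ConnectionLoss(path)  # the write happened, the caller cannot tell

    def get(self, path):
        return self.data[path]


def create_owned(store, path, owner_id, max_attempts=5):
    """Return True if this client created the path, False if someone else did."""
    for _ in range(max_attempts):
        try:
            store.create(path, owner_id)
            return True
        except ConnectionLoss:
            continue  # outcome unknown, so retry
        except NodeExists:
            # Read the value back to find out who actually created the path.
            return store.get(path) == owner_id
    raise ConnectionLoss(path)


store = FlakyStore()
me = uuid.uuid4().hex  # client-specific value written at creation time
mine = create_owned(store, "/brokers/ids/1", me)
theirs = create_owned(store, "/brokers/ids/1", "another-client")
```

<p>Here <code>mine</code> ends up true even though the first attempt failed with a connection loss, while the second caller learns that the path belongs to someone else.</p>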
<p>The second tricky issue is to distinguish between a disconnect and an expired event. The former happens when the ZK client can&#8217;t connect to the server. This is because either (1) the ZK server is down, or (2) the ZK server is up, but the ZK client is partitioned from the server or it is in a long GC pause and can&#8217;t send the heartbeat in time. In case (1), when the ZK server comes back, the client watcher will get a syncconnected event and everything is back to normal. Surprisingly, in this case, all the ephemeral paths and the watchers are still kept at the server and you don&#8217;t have to recreate them. In case (2), when the client finally reconnects to the server, it will get back an expired event. This implies that the server thinks the client is dead and has taken the liberty to delete all the ephemeral paths and watchers created by that client. It&#8217;s the responsibility of the client to start a new ZK session and to recreate the ephemeral paths and the watchers.</p>
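<p>The difference between the two cases can be sketched as a small state tracker. This is only an illustration under assumptions: <code>SessionTracker</code> and the path are hypothetical, the event names are the lowercase ones used in this post, and starting a new session is reduced to bumping a counter.</p>

```python
class SessionTracker:
    """Tracks what a client has to redo after each state-changing event."""

    def __init__(self, ephemeral_paths):
        self.ephemeral_paths = list(ephemeral_paths)
        self.state = "disconnected"
        self.sessions_started = 1
        # Assumes the paths were already created when the first session started.
        self.needs_recreate = False
        self.recreated = []

    def on_event(self, event):
        if event == "disconnected":
            # The session may still be alive on the server: just wait.
            self.state = "disconnected"
        elif event == "syncconnected":
            self.state = "connected"
            if self.needs_recreate:
                # New session: ephemeral paths and watchers must be set up again.
                self.recreated.extend(self.ephemeral_paths)
                self.needs_recreate = False
        elif event == "expired":
            # The server has dropped the session and everything tied to it.
            self.sessions_started += 1
            self.needs_recreate = True
            self.state = "disconnected"
        else:
            raise ValueError("unknown event: " + event)


tracker = SessionTracker(["/consumers/group1/ids/c1"])

# Case (1): the server goes away and comes back, same session.
for event in ["syncconnected", "disconnected", "syncconnected"]:
    tracker.on_event(event)
after_server_restart = list(tracker.recreated)

# Case (2): the client was partitioned or paused for too long.
for event in ["disconnected", "expired", "syncconnected"]:
    tracker.on_event(event)
after_expiry = list(tracker.recreated)
```

<p>Nothing is recreated after a plain disconnect and reconnect, whereas the expired event forces a second session and a re-registration of the ephemeral path.</p>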
<p>To deal with the above issues, one has to write additional code that keeps track of the ZK client state, starts a new session when the old one expires, and handles the CLE appropriately. For my application, I find the <a href="http://github.com/sgroschupf/zkclient">ZKClient</a> package quite useful. ZKClient is a wrapper of the original ZK client. It maintains the current state of the ZK client, hides the CLE from the caller by retrying the request when the state is transitioned to connected again, and reconnects when necessary. ZKClient has an Apache license and has been used in <a href="http://katta.sourceforge.net/">Katta</a> for quite some time. Even with the help of ZKClient, I still have to handle things like who actually created a path when a NodeExistsException occurs and re-registering after a session expires.</p>
<p>Finally, how do you test your ZK application, especially the various failure scenarios? One can use utilities like &#8220;ifconfig down/up&#8221; to simulate network partitioning. Todd Lipcon&#8217;s <a href="http://github.com/toddlipcon/gremlins">Gremlins</a> seems very useful too.</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/08/zookeeper-experience/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Kamikaze version 3.0.0 is released</title>
		<link>http://sna-projects.com/blog/2010/08/the-kamikaze-version-300-is-released/</link>
		<comments>http://sna-projects.com/blog/2010/08/the-kamikaze-version-300-is-released/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 02:40:04 +0000</pubDate>
		<dc:creator>hyan</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[index compression]]></category>

		<category><![CDATA[kamikaze]]></category>

		<category><![CDATA[linkedin]]></category>

		<category><![CDATA[PForDelta]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/2010/08/the-kamikaze-version-300-is-released/</guid>
		<description><![CDATA[Kamikaze  is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays. 
For example, in search engines, for each term t, the index, or called inverted index, contains an inverted list, which is typically a sequence [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://sna-projects.com/kamikaze"><img src="http://sna-projects.com/blog/wp-content/uploads/2010/08/kamikazelogo.png" alt="Kamikaze" width="54" height="56" class="alignleft size-full wp-image-401" /></a><a href="http://sna-projects.com/kamikaze">Kamikaze</a>  is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays. </p>
<p>For example, in search engines, for each term t, the index (also called an inverted index) contains an inverted list, which is typically a sequence of sorted integer document IDs (and other information that can also be considered as sequences of integers). Thus, inverted index compression techniques are concerned with compressing sequences of sorted integers. </p>
<p>A graph is often implemented as adjacency lists. In many cases, each list can be easily organized as a sorted integer array. For example, for the social graphs in large-scale social networks like LinkedIn or Facebook, each list is, for a particular member, a sequence of all of that member&#8217;s friends (represented as integer member IDs). The performance of many algorithms on such graphs is thus greatly affected by the efficiency of various operations on such lists. For example, in order to find all common friends of two members, we need to intersect their friend lists, that is, find all member IDs that appear in both. </p>
<p>A matrix can be considered as an alternative implementation of a graph, especially when most nodes are directly connected with each other. However, when the matrix is sparse (which is very common for the first or second degree friends in social graphs), it is more efficient to first convert it into adjacency lists and then do various operations on the resulting lists.</p>
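<p>As a rough illustration of why sorted lists help, the common-friends lookup can be written as a single merge-style pass over two sorted arrays. This is a plain Python sketch, not Kamikaze&#8217;s API; <code>intersect_sorted</code> and the member IDs are made up for the example.</p>

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of integer IDs in one linear pass."""
    i, j, common = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1
    return common


friends_of_alice = [2, 5, 9, 14, 21]
friends_of_bob = [1, 5, 14, 30]
common_friends = intersect_sorted(friends_of_alice, friends_of_bob)
```

<p>Each list is read once from left to right, so the cost grows with the sum of the two list lengths rather than their product.</p>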
<p>In the above applications (large scale search engines or social networks), we often need to process a huge amount of data (arrays of integers) within milliseconds. The data often need to be compressed to be held in main memory. Compression also greatly reduces disk traffic and network traffic, since much less data needs to be transferred. We also need to be able to decompress the data very efficiently to maximize, for example, the query throughput of search engines. To achieve these goals, large search engines have tried a lot of methods. For example, Lucene uses variable-byte coding (please refer to <a href="http://books.google.com/books?id=2F74jyPl48EC&amp;dq=managing+gigabytes&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=qMZuTKqyEIuosQOg5ZmiCw&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CCoQ6AEwAw#v=onepage&amp;q&amp;f=false">Managing Gigabytes</a> for various inverted index compression methods) to compress indexes. Google also used variable-byte coding to encode part of its indexes a long time ago and has since switched to <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf">other compression</a> methods (in my opinion, their new method is a variation of PForDelta, which is also implemented in Kamikaze and optimized in Kamikaze version 3.0.0). Therefore, we can see that it is very important to build Kamikaze on top of a good compression method that achieves both a small compressed size and fast decompression speed. </p>
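<p>For readers unfamiliar with variable-byte coding, here is a minimal Python sketch of the idea applied to the gaps between sorted IDs. It follows one common byte convention (7 payload bits per byte, high bit marking the last byte of a number); it is not Lucene&#8217;s exact on-disk format, and the function names are hypothetical.</p>

```python
def vbyte_encode(sorted_ids):
    """Encode ascending non-negative IDs as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in sorted_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)      # high bit set marks the final byte
    return bytes(out)


def vbyte_decode(data):
    """Rebuild the ascending IDs from variable-byte coded gaps."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:
            value |= (byte & 0x7F) << shift
            prev += value
            ids.append(prev)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return ids


doc_ids = [3, 130, 131, 20000]
encoded = vbyte_encode(doc_ids)
decoded = vbyte_decode(encoded)
```

<p>The four IDs take 6 bytes here instead of 16 bytes as 32-bit integers, because three of the four gaps fit in a single byte.</p>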
<p>Kamikaze implements the PForDelta compression algorithm (also called P4Delta), which was recently studied and has been shown by <a href="http://www2008.org/papers/pdf/p387-zhangA.pdf">paper[1]</a> and <a href="http://www2009.org/proceedings/pdf/p401.pdf">paper[2]</a> to achieve the best trade-off between compression ratio and decompression speed for the inverted indexes of search engines. Many other techniques for inverted index compression have been studied in the literature; see <a href="http://books.google.com/books?id=2F74jyPl48EC&amp;dq=managing+gigabytes&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=qMZuTKqyEIuosQOg5ZmiCw&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CCoQ6AEwAw#v=onepage&amp;q&amp;f=false">Managing Gigabytes</a> for a survey, and <a href="http://www2009.org/proceedings/pdf/p401.pdf">paper[2]</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.2695">paper[3]</a> for very recent work, especially the detailed performance comparison between most of those techniques and PForDelta. Unfortunately, Lucene does not currently support PForDelta, although <a href="http://www2009.org/proceedings/pdf/p401.pdf">paper[2]</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.2695">paper[3]</a> have shown that PForDelta can achieve much better performance than variable-byte coding in terms of both compressed size and decompression speed.</p>
<p>Kamikaze builds a platform on top of PForDelta to perform efficient set operations and inverted list compression/decompression. Kamikaze version 3.0.0 inherits the architecture of the first two versions and supports the same APIs. In version 3.0.0, the PForDelta algorithm is highly optimized, so that the <a href="http://sna-projects.com/kamikaze/performance.php">performance</a> of compression/decompression and of the corresponding set operations is improved significantly.</p>
<p>At LinkedIn, Kamikaze has been used by the distributed graph team and the search team. We are also looking forward to contributing to the Lucene community with Kamikaze, especially the optimized PForDelta compression algorithm.</p>
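<p>To give a feel for the idea behind PForDelta, the following Python sketch packs a block of gaps with a bit width chosen so that most values fit, and patches the few larger values as exceptions. It is a simplified illustration only: it is neither Kamikaze&#8217;s implementation nor the exact layout from the papers, the 90% threshold and all names are assumptions, and a Python integer stands in for the packed bit buffer.</p>

```python
def pfor_encode(values, fit_fraction=0.9):
    """Pack non-negative ints in b bits each; keep overflowing high bits as exceptions."""
    ranked = sorted(values)
    cutoff = ranked[max(0, int(len(ranked) * fit_fraction) - 1)]
    b = max(1, cutoff.bit_length())  # width that fits roughly fit_fraction of the block
    mask = (1 << b) - 1
    packed = 0
    exceptions = []  # (position, high bits) for values wider than b bits
    for pos, v in enumerate(values):
        packed |= (v & mask) << (pos * b)
        if v > mask:
            exceptions.append((pos, v >> b))
    return b, len(values), packed, exceptions


def pfor_decode(b, n, packed, exceptions):
    """Unpack every b-bit slot, then patch the exceptions back in."""
    mask = (1 << b) - 1
    values = [(packed >> (pos * b)) & mask for pos in range(n)]
    for pos, high in exceptions:
        values[pos] |= high << b
    return values


gaps = [3, 1, 7, 2, 5, 1, 4, 900, 2, 6]
b, n, packed, exceptions = pfor_encode(gaps)
restored = pfor_decode(b, n, packed, exceptions)
```

<p>In this block nine of the ten gaps fit in 3 bits, so only the value 900 is stored as an exception. The bulk of the data can then be unpacked with the same fixed-width operation for every slot, which is where the decompression speed comes from.</p>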
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/08/the-kamikaze-version-300-is-released/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tech talk: Get your distributed pub/sub on&#8230; Ben Reed talks about Hedwig</title>
		<link>http://sna-projects.com/blog/2010/07/hedwig/</link>
		<comments>http://sna-projects.com/blog/2010/07/hedwig/#comments</comments>
		<pubDate>Tue, 13 Jul 2010 01:18:22 +0000</pubDate>
		<dc:creator>Sam</dc:creator>
		
		<category><![CDATA[tech talks]]></category>

		<category><![CDATA[hedwig]]></category>

		<category><![CDATA[publish/subscribe]]></category>

		<category><![CDATA[talks]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=371</guid>
		<description><![CDATA[
Hedwig
Benjamin Reed (Yahoo! Research)
June 7, 2010
ABSTRACT
Hedwig is a large scale cross data center publish/subscribe service  developed at Yahoo! Research. We needed a scalable, fault tolerant,  publish/subscribe service that has strong delivery and ordering  guarantees to maintain the consistency of replicas of datasets in  different data centers. We found that we could [...]]]></description>
			<content:encoded><![CDATA[<p><center><object width="600" height="400"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=13282102&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=13282102&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="600" height="400"></embed></object></center></p>
<p><em>Hedwig</em><br />
<strong><a href="http://research.yahoo.com/Benjamin_Reed">Benjamin Reed</a> (Yahoo! Research)</strong><br />
June 7, 2010</p>
<p>ABSTRACT</p>
<p>Hedwig is a large scale cross data center publish/subscribe service  developed at Yahoo! Research. We needed a scalable, fault tolerant,  publish/subscribe service that has strong delivery and ordering  guarantees to maintain the consistency of replicas of datasets in  different data centers. We found that we could use ZooKeeper, a  coordination service, and BookKeeper, a distributed write-ahead logging  service to build such a service. To present Hedwig I will first review  two of the services it is built on: ZooKeeper and BookKeeper. I will  then present the motivation for Hedwig, review its design, and present  its current status.</p>
<p>BIOGRAPHY</p>
<p>Benjamin Reed has worked for almost 2 decades in the industry: from an intern working on CAD/CAM systems, to shipping and receiving applications in OS/2, AIX, and CICS, to operations, to system admin research and Java frameworks at IBM Almaden Research (11 years), until finally arriving at Yahoo! Research (3 years ago) to work on distributed computing problems. His main interests are large scale processing environments and highly available and scalable systems. Dr. Reed&#8217;s research project at IBM grew into OSGi, which is now in application servers, IDEs, cars, and mobile phones. While at Yahoo, he has also worked on Pig and ZooKeeper, which are Apache projects for which he is a committer. Benjamin has a Ph.D. in Computer Science from the University of California, Santa Cruz.</p>
<p>[This video is part of LinkedIn's tech talk series.]</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/07/hedwig/feed/</wfw:commentRss>
		</item>
		<item>
		<title>LinkedIn Faceted Search</title>
		<link>http://sna-projects.com/blog/2010/07/linkedin-faceted-search/</link>
		<comments>http://sna-projects.com/blog/2010/07/linkedin-faceted-search/#comments</comments>
		<pubDate>Sun, 04 Jul 2010 01:23:27 +0000</pubDate>
		<dc:creator>jwang</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[faceted search]]></category>

		<category><![CDATA[linkedin]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=352</guid>
		<description><![CDATA[Faceted search has been fully rolled out late last year, we wanted to  give you some insights into how it came to be, some of its challenges  and what is in the future.

At scale and with relevance, faceted  search makes a lot of sense on the rich structured data we have here [...]]]></description>
			<content:encoded><![CDATA[<p>Faceted search was fully rolled out late last year, and we want to give you some insights into how it came to be, some of its challenges, and what is in the future.</p>
<p><img class="alignnone size-large wp-image-364" title="faceted search" src="http://sna-projects.com/blog/wp-content/uploads/2010/07/picture-61-760x1024.png" alt="faceted search" width="760" height="1024" /></p>
<p>At scale and with relevance, faceted search makes a lot of sense on the rich structured data we have here at LinkedIn. The fundamental paradigm was to provide individuals with an easy and natural way to slice and dice through search results or simply content. We thought that if the content being indexed had some structural dimensions (as in eCommerce, say) or could be augmented by implicitly deriving these dimensions with quality, then a faceted search paradigm would be ideal not only for retrieval but also for navigation and discovery. Since a member profile at LinkedIn does have these rich structural dimensions, along with rich text data, it seemed that it would be only a matter of time before we created such an interface.</p>
<p>At LinkedIn we first developed a prototype exhibiting the traditional pattern: a click on a facet value filters the search results by that value. More explicitly, you would search for &#8220;John&#8221; and then click on the &#8220;San Francisco&#8221; facet value to get only people in San Francisco called John, i.e. &#8220;John&#8221; + facet_value(&#8220;San Francisco&#8221;) = &#8220;John AND location:(San Francisco)&#8221;. Since a facet value would be displayed only if it contained results, it would never lead you to a dead end while you were navigating through results.</p>
<p>Our next step was to bring it in front of users through a very nice usability lab (see a forthcoming blog post). Feedback on the idea was really good, but with respect to facet interactions, users really wanted the ability to select multiple values within a facet. More explicitly, they wanted the ability to look through facets for, say, &#8220;John&#8221; in San Francisco or Los Angeles at the same time, i.e. &#8220;John AND (location:(San Francisco) OR location:(Los Angeles))&#8221;, with an interface that would naturally provide that behavior and understandable counts. As you can imagine, we were very excited by the positive feedback.</p>
<p>What we have implemented is essentially a query engine for the following type of query:</p>
<p><strong>SELECT f1,f2&#8230;fn FROM members </strong></p>
<p><strong> WHERE c1 AND c2 AND c3..</strong></p>
<p><strong> MATCH (fulltext query, e.g. &#8220;java engineer&#8221;)</strong></p>
<p><strong> GROUP BY fx,fy,fz&#8230;</strong></p>
<p><strong> ORDER BY fa,fb&#8230;</strong></p>
<p><strong> LIMIT offset,count</strong></p>
<p>Deferring this query to a traditional RDBMS holding tens to hundreds of millions of rows, with a sub-second query latency SLA, is not feasible. Hence we built a distributed system that handles the above query at internet scale.</p>
<p>The  technical challenge is really a caching and counting exercise.  Naively  one can view the operation of determining scores for Facet Values as  iterating over all search results and tabulating the right Facet Value.   Of course when you have more than a hundred thousand Facet Values  spread across tens of Facets, the naive approach doesn&#8217;t really scale.  With multi-select options counting can also be rather tricky.  More  specifically, selected values do not participate in counting their  respective facets. A naive solution would be to do N (the number of  facets) iterations over the result set for each multi-selectable field.  But performance wouldn&#8217;t be good and would degrade as we add more such  facets. To deal with it, we devised a solution that only iterates the  result set once by integrating hit counting and hit validation into a  single step.</p>
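<p>One way to picture single-pass counting with multi-select is the Python sketch below. It is an illustration of the general technique under assumptions, not the code we ship: <code>facet_counts</code> and the sample records are hypothetical. A hit that passes every selection is a result and counts in all facets; a hit rejected by exactly one facet&#8217;s selection still counts in that facet, because selecting its value there would bring it into the results.</p>

```python
from collections import Counter


def facet_counts(hits, facets, selections):
    """Validate and count hits in one pass.

    hits: records matching the text query, as dicts of facet name to value.
    selections: facet name to the set of values the user has selected.
    """
    counts = {facet: Counter() for facet in facets}
    results = []
    for hit in hits:
        failed = [f for f in facets
                  if selections.get(f) and hit[f] not in selections[f]]
        if not failed:
            results.append(hit)
            for f in facets:
                counts[f][hit[f]] += 1
        elif len(failed) == 1:
            # Rejected only by this facet's own selection: it still counts here.
            f = failed[0]
            counts[f][hit[f]] += 1
        # Rejected by two or more facets: contributes nowhere.
    return results, counts


hits = [
    {"name": "John A", "location": "San Francisco", "industry": "Software"},
    {"name": "John B", "location": "Los Angeles", "industry": "Software"},
    {"name": "John C", "location": "New York", "industry": "Finance"},
    {"name": "John D", "location": "New York", "industry": "Software"},
]
selections = {
    "location": {"San Francisco", "Los Angeles"},
    "industry": {"Software"},
}
results, counts = facet_counts(hits, ["location", "industry"], selections)
```

<p>With these selections, New York still shows a count of 1 in the location facet (John D would appear if it were ticked), while Finance shows nothing because John C is also excluded by location.</p>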
<p>One item we wanted to highlight in this post is that not all Facets are  equal.  If you take a closer look at your results you&#8217;d notice that  facets can be:</p>
<p><img class="alignleft size-full wp-image-353" title="location facet" src="http://sna-projects.com/blog/wp-content/uploads/2010/07/picture-1.png" alt="location facet" width="199" height="212" />- Static:  A static Facet is &#8216;<strong>Location</strong>&#8217;; it reflects a static property of the search record and its value is not affected by the searcher or the search query. In this case it simply counts and tabulates, by location, the number of individuals within the entire set of search hits. Furthermore, some static facet values are not mutually exclusive. In the case of &#8216;previous company&#8217;, a member can have multiple values if he had two or more previous employers.</p>
<p><img class="alignright size-full wp-image-356" title="Joined Faceted" src="http://sna-projects.com/blog/wp-content/uploads/2010/07/picture-22.png" alt="Joined Faceted" width="220" height="148" /></p>
<p>- Dynamic: A dynamic Facet is &#8216;<strong>Recently Joined</strong>&#8217;; it reflects a static property of the search record, but its value is affected by the time of the query. In the case of the &#8220;When Joined&#8221; facet, it tracks the difference between the time of the query and when the member joined, rounded to the day. Note here that we are not re-indexing the joined date on a daily basis.</p>
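<p>A small Python sketch of how such a facet value could be derived at query time from a stored join date, so that nothing has to be re-indexed as days pass. The bucket boundaries, labels and the <code>when_joined</code> name are hypothetical and chosen only for illustration.</p>

```python
from datetime import date

# Hypothetical buckets: (upper bound in days, label shown in the facet).
BUCKETS = [(1, "0-1 days ago"), (7, "2-7 days ago"), (14, "8-14 days ago")]


def when_joined(joined_on, query_date):
    """Map the indexed join date to a facet value relative to the query date."""
    days = (query_date - joined_on).days
    for limit, label in BUCKETS:
        if days <= limit:
            return label
    return "15+ days ago"


joined = date(2010, 6, 30)  # stored once in the index
next_day = when_joined(joined, date(2010, 7, 1))
same_week = when_joined(joined, date(2010, 7, 4))
a_month_later = when_joined(joined, date(2010, 8, 1))
```

<p>The same indexed date lands in a different bucket depending on when the query runs.</p>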
<p><img class="alignleft size-full wp-image-357" title="Network Facet" src="http://sna-projects.com/blog/wp-content/uploads/2010/07/picture-3.png" alt="Network Facet" width="196" height="129" /></p>
<p>- Personal: A personal Facet is &#8216;<strong>Relationship</strong>&#8217;; it depends on the distance in the social graph between the member found and the searcher. We are a social network site, hence social distance is a very important facet to provide to our users. Its nature is actually very dynamic, as new connections occur every millisecond, which in turn transform the social network of everyone involved to the 3rd degree. This makes the social graph so dynamic that we do not store the relationship between a member and a potential searcher within the index!</p>
<p>To be good citizens of the community, we have contributed the technology to the open-source world:</p>
<ul>
<li><a title="Bobo - facet engine" href="http://sna-projects.com/bobo">Bobo</a></li>
<li><a title="Realtime search/indexing system" href="http://sna-projects.com/zoie">Zoie</a></li>
<li><a title="Distributed system" href="http://sna-projects.com/sensei">Sensei</a></li>
</ul>
<p>What is next?  Fundamentally it is all about speed, so the next play is to make it even faster and more intelligent <img src='http://sna-projects.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/07/linkedin-faceted-search/feed/</wfw:commentRss>
		</item>
		<item>
		<title>When Pigs Fly: Apache Pig, Open Source and Understanding Systems</title>
		<link>http://sna-projects.com/blog/2010/06/when-pigs-fly-apache-pig-open-source-and-understanding-systems/</link>
		<comments>http://sna-projects.com/blog/2010/06/when-pigs-fly-apache-pig-open-source-and-understanding-systems/#comments</comments>
		<pubDate>Thu, 24 Jun 2010 08:27:23 +0000</pubDate>
		<dc:creator>rjurney</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[hadoop pig buddha consciousness mapreduce linkedin]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=197</guid>
		<description><![CDATA[Pig at LinkedIn
Hadoop drives many of our most powerful features at LinkedIn.  About half of our Hadoop jobs are submitted by Apache Pig.  This means that along with Azkaban and Voldemort, Pig is a large part of LinkedIn&#8217;s data cycle - the process behind features like People You May Know and Who Viewed My Profile.
I have used Pig [...]]]></description>
			<content:encoded><![CDATA[<h2><strong>Pig at LinkedIn</strong></h2>
<p style="text-align: justify;"><a href="http://hadoop.apache.org/" target="_self">Hadoop</a> drives many of our most powerful features at <a href="http://www.linkedin.com/companies/1337?trk=saber_s000001e_1000" target="_self">LinkedIn</a>.  About half of our Hadoop jobs are submitted by <a href="http://hadoop.apache.org/pig/" target="_self">Apache Pig</a>.  This means that along with <a href="http://sna-projects.com/azkaban/" target="_self">Azkaban</a> and <a href="http://project-voldemort.com/" target="_self">Voldemort</a>, Pig is a large part of LinkedIn&#8217;s <a href="http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/" target="_self">data cycle</a> - the process behind features like <a href="http://www.linkedin.com/pymk?showMore=" target="_self">People You May Know</a> and <a href="http://www.linkedin.com/wvmp?showMore=" target="_self">Who Viewed My Profile</a>.</p>
<p style="text-align: justify;"><a href="http://www.linkedin.com/in/russelljurney" target="_self">I</a> have used Pig intensively for about a year.  During that time, I have come to love Pig for what it enables me to do: easily manipulate my data at scale, to turn raw data into data products.  As a recovering <a href="http://search.cpan.org/~rjurney/" target="_self">Perl hacker</a> (see: <a href="http://www.enlightenedperl.org/" target="_self">enlightened perl</a>, <a href="http://www.catalystframework.org/" target="_self">catalyst framework</a>), I always employ the tool with the <a href="http://en.wikipedia.org/wiki/High-level_programming_language" target="_self">highest level abstraction</a> that fits the job - higher level tools being <a href="http://www.paulgraham.com/power.html" target="_self">more powerful</a>.  Because of this, good alternatives to Pig like <a href="http://www.cascading.org/" target="_self">Cascading</a>, and the exciting work with LISP REPLs in <a href="http://github.com/nathanmarz/cascalog" target="_self">Cascalog</a> don&#8217;t do it for me.  If Perl is the <a href="http://www.oreillynet.com/pub/a/oreilly/perl/news/importance_0498.html" target="_self">duct tape of the internet</a>, and <a href="http://hadoop.apache.org/" target="_self">Hadoop</a> is the kernel of the  <a href="http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F09/wharehousesizedcomputers.pdf" target="_self">data center as computer</a>, then Pig is the duct tape of <a href="http://blog.weatherby.net/2009/04/map-reduce-for-the-people.html" target="_self">Big Data</a>.  Pig lets me easily flow my data in parallel with simple commands.  
It lets me flow my data through dynamic languages like Python if I want to use <a href="http://www.scipy.org/" target="_self">SciPy</a>, through simple <a href="http://wiki.apache.org/pig/UDFManual" target="_self">UDFs</a> in Java if I want to use a function repeatedly and <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/" target="_self">share</a> it with others, and <a href="http://research.yahoo.com/files/paper_5.pdf" target="_self">ILLUSTRATE</a> lets me <a href="http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html#ILLUSTRATE" target="_self">check the output</a> of my lengthy batch jobs and their custom functions without having to do a <a href="http://en.wikipedia.org/wiki/Batch_processing" target="_self">lengthy run</a> of a long pipeline.  Taken together, these features enable me to be productive.</p>
<p style="text-align: justify;"><a href="http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf"><img class="alignleft size-full wp-image-265" title="mr" src="http://sna-projects.com/blog/wp-content/uploads/2010/06/mr.png" alt="mr" width="496" height="169" /></a>I learned Pig not because I had a big data problem, but because I wanted to build a <a href="http://github.com/rjurney/Cloud-Stenography" target="_self">better interface</a> for Hadoop (see: <a href="http://wiki.apache.org/pig/PigPen" target="_self">PigPen</a>, <a href="http://javascript.neyric.com/wireit/" target="_self">WireIT</a>, <a href="http://vimeo.com/6032078" target="_self">this demo video</a>).  For a long time, I did not delve very deeply.  There was no reason to do so: I didn&#8217;t have to know how to code in <a href="http://labs.google.com/papers/mapreduce-osdi04.pdf" target="_self">MapReduce</a> - Pig &#8216;just worked.&#8217;  I issue SQLish commands in <a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html" target="_self">Pig Latin</a>, and Pig parses these commands and creates and submits MapReduce jobs for me.  This saves me from having to think too hard about the complexity of <a href="http://www.internetnews.com/xSP/article.php/3618166/Is-Java-EEs-Complexity-Its-Worst-Enemy.htm" target="_self">Java</a>, MapReduce or Hadoop.  I don&#8217;t like to think about anything but the problem I&#8217;m actually solving, and so while I have written <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/Algebraic.java?view=markup" target="_self">Algebraic</a> MapReduce jobs as Pig <a href="http://issues.apache.org/jira/browse/PIG-1150" target="_self">UDFs</a>, I am unlikely to ever write a <a href="http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html" target="_self">Java Hadoop</a> job unless I absolutely have to.</p>
<p style="text-align: justify;">Apache Pig is now fairly <a href="http://hadoop.apache.org/pig/releases.html#13+May%2C+2010%3A+release+0.7.0+available" target="_self">robust</a>, but <a href="http://en.wikipedia.org/wiki/Dataflow_programming" target="_self">data-flows</a> themselves can get <a href="http://www.mail-archive.com/pig-user@hadoop.apache.org/msg02699.html" target="_self">complex fast</a>.  I&#8217;m pretty fluent in Pig Latin, but my code in any language rarely runs on the first try.  With batch computing, running jobs repeatedly to debug them can take a long time and slow development to a crawl.  One must often massage the Pig to command its will.</p>
<p>When I write Pig Latin code beyond a dozen lines, I check it in stages:</p>
<ul>
<li>Write Pig Latin in <a href="http://tommy.chheng.com/index.php/2009/09/pig-textmate-bundle/" target="_self">TextMate</a> (Saved in a <a href="https://github.com/" target="_self">git repo</a>, otherwise I lose code)</li>
<li>Paste the code into the <a href="http://wiki.apache.org/pig/Grunt" target="_self">Grunt</a> shell - Did it parse?</li>
<li><a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#DESCRIBE" target="_self">DESCRIBE</a> the final output and each complex step - Did it still parse?  Is the schema what I expected?</li>
<li><a href="http://research.yahoo.com/files/paper_5.pdf" target="_self">ILLUSTRATE</a> the output - Does it still parse?  Is the schema ok?  Is the example data ok?</li>
<li><a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#SAMPLE" target="_self">SAMPLE</a>/<a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#LIMIT" target="_self">LIMIT</a>/<a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#DUMP" target="_self">DUMP</a> the output - Does it still parse?  Is the schema ok?  Is the sampled/limited data sane?</li>
<li><a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#STORE" target="_self">STORE</a> the final output and see if the job completes.</li>
<li><a href="http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html#cat" target="_self">cat</a> output_dir/part-00000 (followed by a quick ctrl-c to stop the flood) - Is the stored output on <a href="http://hadoop.apache.org/hdfs/" target="_self">HDFS</a> ok?</li>
</ul>
<p style="text-align: justify;">When you first tackle a <a href="http://upload.wikimedia.org/wikipedia/en/5/5a/Complexity-map-overview.png" target="_self">complex</a> task with Pig, that last step rarely happens on the first few tries.  In time, you get more proficient.</p>
<p style="text-align: justify;">As an incurious Pig user, I thought of Pig as a black box: a program with a command line.  Nevertheless, I got to know the idiosyncrasies of each version as Pig matured from version <a href="http://svn.apache.org/viewvc/hadoop/pig/tags/release-0.2.0/" target="_self">0.2</a> to <a href="http://svn.apache.org/viewvc/hadoop/pig/tags/release-0.7.0/" target="_self">0.7</a> - unfixed bugs, unusual behaviors, and undocumented limitations.  I never knew exactly why Pig behaved as it did, but I learned to get along with it.</p>
<h2><strong>Working on Pig</strong></h2>
<p style="text-align: justify;"><a href="http://hadoop.apache.org/pig/"><img class="alignleft size-full wp-image-286" title="pig-logo" src="http://sna-projects.com/blog/wp-content/uploads/2010/06/pig-logo.gif" alt="pig-logo" width="75" height="106" /></a>Several months ago I decided to work on the Pig project.  I don&#8217;t even know Java.  I&#8217;ve been faking it my entire career (ask me to write a Java class without any Java code around it in an IDE - I can&#8217;t do it), so I&#8217;m going after low hanging fruit the committers haven&#8217;t gotten around to and leaving the tough bits to them.  <a href="http://en.wikipedia.org/wiki/Web_log_analysis_software" target="_self">Log analysis</a> is a common use of Pig, and logs usually contain timestamps, so I want to add a <a href="http://joda-time.sourceforge.net/" target="_self">Joda-Time</a> <a href="https://issues.apache.org/jira/browse/PIG-1314" target="_self">DateTime</a> data type to Pig.</p>
<p style="text-align: justify;">But that is way too hard, so I&#8217;m going after <a href="http://issues.apache.org/jira/browse/PIG-1429" target="_self">boolean</a> first.  I checked out the <a href="http://github.com/apache/pig" target="_self">code</a>.  I worked on it all weekend.  I made a <a href="https://issues.apache.org/jira/secure/attachment/12445897/working_boolean.patch" target="_self">patch</a>.  I made many patches, actually.  Time and again, I thought I was done, but I wasn&#8217;t.  Booleans would load in grunt, so I thought it worked - but they wouldn&#8217;t store.  I added physical storage code, so I could load and store.  I emailed the LinkedIn Hadoop users list proclaiming victory&#8230; but it wouldn&#8217;t work on Hadoop.  So I added Hadoop storage code, and it would load and store on Hadoop - but I couldn&#8217;t use operators to check for equality.  I added code for ILLUSTRATE and it would illustrate, but I still couldn&#8217;t use booleans in a real job.  This went on and on, and the patch remains incomplete (I&#8217;ll finish it soon).</p>
<p style="text-align: justify;">During that weekend of long and frustrating hours of Pig hacking, the pattern became familiar.  I was interacting with a different part of Pig each time I got a new kind of error.  The hops from package to package in writing the patch corresponded to the stages of my long hours of stepwise data-flow checks in Grunt, as I had written Pig scripts most days over the course of the last year.</p>
<p style="text-align: justify;">From a user&#8217;s perspective using the Grunt shell, this system seems like a cohesive entity - a single program - a complete (and somewhat irrational) Pig.  It doesn&#8217;t seem that way anymore.  Now that I&#8217;ve read the code, using Grunt is different.  Knowing the way it all fits together at a high level - by tracing exceptions and seeing the package names of classes I&#8217;ve failed to implement because I didn&#8217;t know they existed or were required - I know that Pig is actually segmented into many logical parts, independent arms that verify and process Pig Latin code independently and in different ways.  Grunt&#8217;s interface presents an illusion of wholeness that a deeper understanding of Pig makes transparent - clear as illusion.</p>
<p><a href="http://www.flickr.com/photos/29871022@N03/4729040449/sizes/l/"><img class="alignright size-full wp-image-258" title="word_queue_3001" src="http://sna-projects.com/blog/wp-content/uploads/2010/06/word_queue_3001.gif" alt="word_queue_3001" width="300" height="400" /></a></p>
<h2><strong>Complex Systems: Software and the Brain</strong></h2>
<p style="text-align: justify;">Watching Pig&#8217;s boolean data type&#8217;s slow and stepwise recovery reminded me of something else, something personal.  In February, 2009, I had a minor car accident.  I drove home to the farm, and was fine until the next day when I became mute and started blacking out.  If I hadn&#8217;t been terrified and crying, the 911 call would have been hilarious to hear played back because it took me several minutes to ask for help.  I had a concussion, and the effects of the injury <a href="http://en.wikipedia.org/wiki/Post-concussion_syndrome" target="_self">remained</a> after the concussion passed.  I was punch drunk and irritable for months.  My rate of speech varied from normal to mute, and every range in between.  <strong>My brain was throwing exceptions my body could not catch.</strong> When I could not talk, I couldn&#8217;t think out loud either.  But I could still type.  I could still tweet.  I made a diagram of the way my speech would phase in and out, like a malfunctioning queue that would grow and shrink.  I showed it to my neurologist, but she didn&#8217;t pay much attention.  Call it infographic therapy.</p>
<p style="text-align: justify;">At some point I started to consciously observe the malfunctions, often repeatedly in the case of my speech.  The mind is a complex system, and it can fail in parts.  Over time I came to know, at an experiential level, that consciousness is actually an illusion presented to the user: an illusion made up of many independent processes that only appear cohesive when presented as a unified and intuitive interface.  As my brain healed, I got to contrast functional sub-systems with malfunctioning sub-systems.  As a result, I know my limitations better and can apply myself more effectively.</p>
<h2><strong>The Data Revolution</strong></h2>
<p><a href="http://en.wikipedia.org/wiki/Integrated_circuit"><img class="alignleft size-full wp-image-279" title="1958_intcirc_thumb" src="http://sna-projects.com/blog/wp-content/uploads/2010/06/1958_intcirc_thumb.gif" alt="1958_intcirc_thumb" width="300" height="206" /></a></p>
<p style="text-align: justify;">For me, understanding my work over the last year by understanding Pig was profound.  It gave it more meaning, because strangely enough Pig has become a big part of my life.  By the numbers, I&#8217;ve spent as much time in the last year with Pig as with anything or anyone else in my life excepting my wife.  I&#8217;ve never much contributed to open source before, and I&#8217;m glad to be transitioning from a passive consumer of other people&#8217;s work to an active participant in an open source project.  It is good to create openly, to give back.  Open source is technical righteousness.</p>
<p style="text-align: justify;">But more than that, this is an important time in computer science, and unlike many previous technical revolutions, this one is happening completely in the open.  Like the integrated circuit before it, MapReduce is producing a paradigm shift that opens broad opportunities to produce new kinds of products from our massive collective backlog of data to help people in new and unprecedented ways.  At LinkedIn we&#8217;ve amassed the world&#8217;s premier data-set on the labor of professionals, and it is the mission of LinkedIn Analytics to leverage that deeply meaningful data to provide insight and value to our users.  At LinkedIn Analytics data processing is both personal and meaningful, as the features we create enhance the working lives of tens of millions of people.</p>
<p style="text-align: justify;">The <a href="http://en.wikipedia.org/wiki/Integrated_circuit" target="_self">Integrated Circuit</a> solved the <a href="http://en.wikipedia.org/wiki/Tyranny_of_numbers" target="_self">Tyranny of Numbers</a> and unleashed Moore&#8217;s law, enabling a computerized, networked society.  It did so with the considerable overhead of patent licensing and litigation.  MapReduce is solving the Tyranny of Threads, enabling any company to process data at scale in parallel to extract real value from our most abundant and underutilized resource: information.  It is doing it in the open, through <a href="http://en.wikipedia.org/wiki/Free_and_open_source_software" target="_self">free and open-source software</a>, through the Apache Foundation, Hadoop and its sub-projects.  We&#8217;ve gotten more efficient organizationally this time around.</p>
<p style="text-align: justify;"><a href="http://en.wikipedia.org/wiki/MapReduce"><img class="alignright size-full wp-image-283" title="map-reduce" src="http://sna-projects.com/blog/wp-content/uploads/2010/06/map-reduce.gif" alt="map-reduce" width="300" height="234" /></a></p>
<p style="text-align: justify;">One of the reasons I joined LinkedIn Analytics is its commitment to open source.  At LinkedIn, we love open source.  We&#8217;re committed to contributing to Hadoop and Pig and giving back to the open source community through projects like <a href="http://sna-projects.com/azkaban/" target="_self">Azkaban</a> and <a href="http://project-voldemort.com/" target="_self">Voldemort</a>.  We are determined to provide the open source community with the complete and painless data cycle that we enjoy - to enable even casual Hadoop users to analyze data from their application at scale, to mine it for value and store it easily and reliably so that it can drive use and close the data loop.  Look for new open source tools and projects from LinkedIn Analytics in the coming months that will help make this possible!</p>
<p style="text-align: justify;">If you love open source and you love big, meaningful data - we need you.  Come join us.  <a href="http://www.linkedin.com/static?key=jobs" target="_self">LinkedIn Analytics is hiring</a>!</p>
<p>( Shout outs to <a href="http://www.linkedin.com/in/peterskomoroch" target="_self">Pete Skomoroch</a> for acting as late-night editor, helping me dramatically improve this post! )</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/06/when-pigs-fly-apache-pig-open-source-and-understanding-systems/feed/</wfw:commentRss>
		</item>
		<item>
		<title>JNA</title>
		<link>http://sna-projects.com/blog/2010/06/jna/</link>
		<comments>http://sna-projects.com/blog/2010/06/jna/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 19:29:00 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=184</guid>
		<description><![CDATA[There are two groups of people in CS who want to control your program&#8217;s interaction with the outside world&#8211;the operating system people and programming language people&#8211;and they are always fighting over who will have this honor. The operating system people want to provide a set of functionality that is available to any programming language, and [...]]]></description>
			<content:encoded><![CDATA[<p>There are two groups of people in CS who want to control your program&#8217;s interaction with the outside world&#8211;the operating system people and programming language people&#8211;and they are always fighting over who will have this honor. The operating system people want to provide a set of functionality that is available to any programming language, and the programming language people (generally) want to provide a set of functionality that is available on any operating system.</p>
<p>The OS people made POSIX; they mostly adhere to it, and it is available on almost any system you would use. The language people made Java, the CLR, Common Lisp, and a variety of other platforms, none of which really work together.</p>
<p>I like Java. It is a simple, statically typed language with great performance and solid garbage collector implementations; but unfortunately it was made by the language people. This means that a lot of basic functionality available on every platform you could reasonably run on isn&#8217;t available to you when your program is written in Java.</p>
<p>Want to move a file across volumes, figure out what user you are, get your process id, send a signal, interact with pipes, etc? Too bad. Sun has decided you shouldn&#8217;t, so you can&#8217;t.</p>
<p>The motivation for this is to allow Java to be used in set-top boxes or something like that, but it presents a bit of a problem for server-side software development when the code needs to get anywhere near the operating system.</p>
<p>You are supposed to be able to fix this with JNI. JNI is a standard that lets you wrap native code with Java interfaces and call it from Java. The problem is that it is a massive pain to write, so people rarely do.</p>
<p>We have seen a number of problems that this causes. A couple of them came from attempts to use Runtime.exec to run a Unix command to collect basic file information. This is a major problem since it forks a huge Java process to run an itsy bitsy Unix command, then attempts to parse the output. Who could be so stupid, you would ask? Well, one was buried in an Apache Commons library, the other was in Hadoop (which tries to execute whoami). The latter, in addition to forking a large namenode process, is actually quite non-portable in comparison to getuid().</p>
<p>I didn&#8217;t know this existed, but it turns out there is a library called <a title="JNA" href="https://jna.dev.java.net" target="_self">JNA</a> that provides transparent native access for Java using dynamic proxies. Here is an example of using it to implement kill() to send a signal to a process from Java.</p>
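<pre>import com.sun.jna.Library;
import com.sun.jna.Native;

public class Kill {

    // Usage: java Kill &lt;pid&gt; &lt;signal&gt;
    public static void main(String[] args) {
        // bind the Posix interface below to the C library via a dynamic proxy
        Posix posix = (Posix) Native.loadLibrary("c", Posix.class);
        int pid = Integer.parseInt(args[0]);
        int signal = Integer.parseInt(args[1]);
        posix.kill(pid, signal);
    }

    // each method declared here is mapped to the native function of the same name
    public interface Posix extends Library {
        public int kill(int pid, int signal);
    }
}</pre>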
<p>The only downside of this that I have found is that JNI-bound native libraries (whether using JNA or not) cannot be reloaded. This means that if you are writing something deployed in a servlet container like Tomcat that expects to be able to reload various sub-applications, this will not work. The workaround is to add the JNA jar to the Tomcat system classloader, but this breaks the ability to package the application as a war that works in any container. (Special thanks to a coworker who pointed this out to me before we started converting things right and left.)</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/06/jna/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Beating Binary Search</title>
		<link>http://sna-projects.com/blog/2010/06/beating-binary-search/</link>
		<comments>http://sna-projects.com/blog/2010/06/beating-binary-search/#comments</comments>
		<pubDate>Thu, 17 Jun 2010 07:55:18 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=169</guid>
		<description><![CDATA[A search exponentially faster than binary search, and a use for it.]]></description>
			<content:encoded><![CDATA[<p>Quick, what is the fastest way to search a sorted array?</p>
<p>Binary search, right?</p>
<p>Wrong. There is actually a method called <a href="http://en.wikipedia.org/wiki/Interpolation_search" target="_self">interpolation search</a>, in which, rather than pessimistically looking in the middle of the array, you use a model of the key distribution to predict the location of the key and look there.</p>
<p>Here is a simple example: assume you have an array of length 10, which contains a uniform sample of numbers between 0 and 99. Interpolation search works the same way a person would search. If asked to guess the location of 3 you would probably guess the 1st slot, if asked to guess the location of 85 you might guess the 9th slot, etc.</p>
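<p>The guesses in this example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are made up for this post, and it assumes the keys are spread evenly over the inclusive range [min, max].</p>

```java
// Hypothetical helper, not part of any library: it only illustrates the guess
// a person (or interpolation search) would make for evenly spread keys.
class GuessSlot {

    // Returns the 0-based slot where key is expected to live in a sorted
    // array of the given length whose values are spread evenly over [min, max].
    static int guess(int key, int length, int min, int max) {
        long range = (long) max - min + 1;            // number of possible values
        return (int) (((long) key - min) * length / range);
    }

    public static void main(String[] args) {
        // array of length 10 holding a uniform sample of 0..99
        System.out.println(guess(3, 10, 0, 99) + 1);  // prints 1 (the 1st slot)
        System.out.println(guess(85, 10, 0, 99) + 1); // prints 9 (the 9th slot)
    }
}
```

<p>Binary search would have started at the middle slot for both keys; the interpolated guess starts next to where the key most likely is.</p>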
<p>Okay, so this approach seems to make better guesses, but does it actually require asymptotically fewer iterations on average?</p>
<p>The answer is that in fact it requires exponentially fewer iterations&#8211;on average it runs in lg lg <em>N</em> time where <em>N</em> is the length of the array. The analysis is a little tricky&#8211;it appeared 19 years after the algorithm was formally published (according to Knuth&#8217;s Sorting and Searching book). I had to look it up (and before I could look it up I first had to realize I didn&#8217;t invent it and figure out what the name of it was), but essentially each step reduces the range to <em>N</em>^0.5 instead of 0.5 * <em>N</em>, which yields the better asymptotic runtime. This is an average case result, so it is worth noting that the variance in the number of comparisons is lg lg <em>N</em> as well. This means that, assuming your keys are well distributed, you will almost certainly get the average case time or very close to it (I tried it on random arrays and the theory is scarily accurate).</p>
<p>So why isn&#8217;t this used in practice? Probably because lg <em>N</em> is already really small. After all, if you have an array of length 2^32 this only drops you from ~32 to ~5 comparisons, which in practical terms probably isn&#8217;t a big speedup for searching arrays.</p>
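<p>To see where the ~32 and ~5 come from, here is a small sketch that counts steps under two idealized models: one that halves the range at every step and one that shrinks it to its square root. It is not a search implementation, and the class and method names are made up for illustration.</p>

```java
// Illustrative only: counts how many times a search range of n slots can be
// shrunk before it is down to a slot or two, under two idealized models.
class SearchSteps {

    // Binary search model: every step halves the range (n becomes n / 2).
    static int halvingSteps(double n) {
        int steps = 0;
        while (n > 1) {
            n = n / 2;
            steps++;
        }
        return steps;
    }

    // Interpolation search model: every step shrinks the range to its
    // square root (n becomes n^0.5), which is the average-case behavior.
    static int squareRootSteps(double n) {
        int steps = 0;
        while (n > 2) {
            n = Math.sqrt(n);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        double n = Math.pow(2, 32);
        System.out.println(halvingSteps(n));    // prints 32
        System.out.println(squareRootSteps(n)); // prints 5
    }
}
```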
<p>But we found a really great use for it in <a href="http://project-voldemort.com">Voldemort</a>. One use case we support is serving really large read-only files as Voldemort stores. This allows us to support a big batch data cycle run out of Hadoop as described <a href="http://sna-projects.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort">here</a>. The data structure for these stores uses a large sorted index file to do lookups; what is stored in this file is an MD5 of each key. Since the file is sorted by MD5, its entries are spread essentially uniformly over the key space, which is exactly the case interpolation search handles well. The index can often be many GBs in size. These files are memory mapped to help reduce the cost of a read, but the improved search algorithm can greatly reduce the number of seeks when the index is not fully memory resident. A disk seek comes at a price of around 10ms, so saving even one or two is a huge performance win.</p>
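<p>As a rough sketch of why MD5-sorted entries suit this kind of search, the snippet below turns the first four bytes of a key&#8217;s MD5 into a first guess at its position in a sorted index. This is not Voldemort&#8217;s code: the class, the method names, and the four-byte prefix are assumptions made for illustration, and it assumes the index holds fewer than 2^31 entries.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not Voldemort's implementation: it guesses where a key's
// MD5 would sit in a sorted index of numEntries MD5s, using only the hash prefix.
class Md5Guess {

    // Reads the first four bytes of the MD5 as an unsigned 32-bit number.
    static long md5Prefix(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long prefix = 0;
            for (int i = 0; i < 4; i++) {
                prefix = (prefix << 8) | (digest[i] & 0xFF);
            }
            return prefix;
        } catch (NoSuchAlgorithmException e) {
            // every Java platform is required to ship MD5
            throw new IllegalStateException(e);
        }
    }

    // Scales the prefix from [0, 2^32) down to [0, numEntries).
    // Assumes numEntries is below 2^31 so the multiplication cannot overflow.
    static long guessPosition(String key, long numEntries) {
        return (md5Prefix(key) * numEntries) >>> 32;
    }

    public static void main(String[] args) {
        // MD5("abc") starts with 0x90015098, so in a 16-entry index the
        // first guess is slot 9
        System.out.println(guessPosition("abc", 16)); // prints 9
    }
}
```

<p>Because the hash bytes are effectively uniform, that first guess usually lands close to the real entry, which is what keeps the number of seeks low.</p>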
<p>Sometimes it is nice to see what these things are like in real life.  For uniformly distributed values, given a key to search for,  an array to search in, and a minimum and maximum value it might look  something like this:</p>
<pre>int interpolationSearch(int key, int[] array, int min, int max) {
    int low = 0;
    int high = array.length - 1;
    // keep going while there is a range left and the key can still be in it
    while(low &lt;= high &amp;&amp; key &gt;= min &amp;&amp; key &lt;= max) {
        // make a guess of the location
        int guess;
        if(high == low || max == min) {
            // one slot left, or every remaining value is equal: nothing to interpolate
            guess = low;
        } else {
            int size = high - low;
            int offset = (int) (((size - 1) * ((long) key - min)) / ((long) max - min));
            guess = low + offset;
        }

        // maybe we found it?
        if(array[guess] == key)
            return guess;

        // if we guessed too high, guess lower or vice versa; only tighten
        // min/max while the new bound is still a valid index
        if(array[guess] &gt; key) {
            high = guess - 1;
            if(high &gt;= low)
                max = array[high];
        } else {
            low = guess + 1;
            if(low &lt;= high)
                min = array[low];
        }
    }
    // the range is empty or the key is outside [min, max]
    return -1;
}</pre>
<p>You can see the real deal implementation in Voldemort <a href="http://github.com/voldemort/voldemort/blob/release-0802/src/java/voldemort/store/readonly/InterpolationSearchStrategy.java">here</a>&#8211;it is a little trickier as it uses non-integer keys and is searching a file but the basic outline is the same.</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/06/beating-binary-search/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SOCC 2010 updates</title>
		<link>http://sna-projects.com/blog/2010/06/socc-2010-updates/</link>
		<comments>http://sna-projects.com/blog/2010/06/socc-2010-updates/#comments</comments>
		<pubDate>Wed, 16 Jun 2010 16:45:11 +0000</pubDate>
		<dc:creator>jrao</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=154</guid>
		<description><![CDATA[Just came back from the 1st ACM Symposium on Cloud Computing in Indianapolis. The conference was co-located with SIGMOD and lasted a day and a half. A total of 7 people from LinkedIn were at SOCC and the post below reflects the notes that we took collectively. There were three keynote speeches, all of which were [...]]]></description>
			<content:encoded><![CDATA[<p>Just came back from the <a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/">1st ACM Symposium on Cloud Computing in Indianapolis</a>. The conference was co-located with SIGMOD and lasted a day and a half. A total of 7 people from LinkedIn were at SOCC, and the post below reflects the notes that we took collectively. There were three keynote speeches, all of which were excellent (the slides will be made available on the conference website).</p>
<p>1. Keynote by Jeff Dean from Google:</p>
<div>
<ul>
<li>Google has already started using flash disks in their clusters.</li>
<li>Bigtable: It now runs only as a service within Google (i.e., teams no longer install Bigtable themselves). Each SSTable is validated against its checksum immediately after it&#8217;s written, which helps detect corruption early; this catches about 1 corruption per 5.4 PB of data. A coprocessor daemon runs on each tablet and splits when the tablet splits. It is used to do some processing on a set of rows and seems to me like a low-overhead MapReduce job, since there is no overhead in starting the mappers and reducers.</li>
<li>Google is working on Spanner, a virtualized storage service across data centers. We didn&#8217;t get many details. The key point seems to be the capability of moving storage across data centers.</li>
<li>A few key design patterns that worked well in Google&#8217;s infrastructure: (1) 1 master/1000 workers, for a simplified design; (2) canary requests: to avoid an unknown type of request bringing down every worker, first send the request to 1 worker and, if successful, send it to everyone; (3) distributing requests through a tree of nodes, instead of direct broadcasting; (4) backup requests to improve the performance of stragglers; (5) multiple small units per machine for better load balancing (no need to split a unit; each unit is moved as a whole); (6) range partitioning instead of hash partitioning.</li>
</ul>
</div>
<div>2. Keynote by Jason Sobel from Facebook:</div>
<div>
<ul>
<li>Facebook has 8 data centers.</li>
<li>Core infrastructure based on sharded MySQL. The biggest pain is that it&#8217;s hard to do logical migration (need to split a database). The solution is to over-partition and have multiple databases per node and only move a whole database (similar to the design pattern used at Google). They find that a 3-to-1 file size/RAM ratio is ideal for MySQL (with InnoDB). If the ratio is larger, MySQL&#8217;s performance drops significantly. No distributed transactions across partitions. Multi-node updates (e.g., two people becoming friends) are delivered with best effort.</li>
<li>Heavy use of memcache. Maintain multiple replicas. Want to solve the problem of double caching between memcache and MySQL (the problem is that you can&#8217;t take away too much memory from MySQL even though MySQL mostly handles the write load). Memcache is partitioned differently from MySQL. That way, if a memcache server goes down, the read requests on that server are spread across all MySQL shards.</li>
<li>Facebook is building FB object/association and a system called TAO on top of memcache. TAO is API-aware and supports write-through on updates (instead of an invalidation followed by a read).</li>
<li>Facebook has several specialized services: search, ads, PYMK, and multifeed. It uses the search engine to serve complex &#8220;search-like&#8221; queries over structured data.</li>
<li>Facebook is building Unicorn. Seems like an event publishing system. Today, it receives batch updates from Hive. It&#8217;s moving towards more real-time operation by taking SQL updates and applying them directly in Unicorn.</li>
</ul>
</div>
<div>3. Keynote by Rob Woollen from SalesForce:</div>
<div>
<ul>
<li>Most amazing thing to me: there are 4000 people at SalesForce and only 200 of them are engineers (the rest are sales and marketing people).</li>
<li>It&#8217;s using Resin App Server, Lucene, and (only) an 8-way Oracle RAC.</li>
<li>Dell and Harrah&#8217;s are among the big customers.</li>
<li>It uses Apex Governor for service protection to prevent a particular tenant from using too many resources. Apex limits things like heap and stack size.</li>
<li>Flex schema: everything is stored as varchar; separate accelerator tables (indexes) carry the data types.</li>
<li>It serves both OLTP and reporting on the same database.</li>
<li>It uses Omniscient tracing for debugging: it collects logs on an operational system so that one can debug locally.</li>
<li>It uses Chatter for real-time collaboration.</li>
</ul>
</div>
<div>The papers are kind of mixed, likely because this is the very first symposium and nobody knows exactly what kind of papers really fit in. First of all, there are a bunch of papers on MapReduce related stuff.</div>
<div>
<ul>
<li><em><a href="http://www.se.cuhk.edu.hk/~bshe/socc10.pdf">Comet: Batched Stream Processing for Data Intensive Distributed Computing</a></em><em>.</em> This paper is about sharing work through multi-query optimization. For example, if you schedule a weekly job and a daily job on the same data, the weekly results can be derived from the daily jobs. How to share and what to share is determined through static analysis at a planning phase. The system is built on top of Microsoft&#8217;s Dryad.</li>
<li><em><a href="http://cseweb.ucsd.edu/~kyocum/pubs/socc122-logothetis.pdf">Stateful Bulk Processing for Incremental Analytics</a></em><em>.</em> The motivating example is how to do incremental crawling.</li>
<li><em><a href="http://www.cs.duke.edu/~gang/documents/details.pdf">Towards Automatic Optimization of MapReduce Programs</a></em>. This is a position paper and it argues that many database query optimization techniques (both static and dynamic) can be applied to MapReduce. Actually, some of those optimizations have already been applied in systems like <a href="http://research.microsoft.com/en-us/um/people/jrzhou/pub/pgs.pdf">Scope</a>.</li>
<li><em><a href="http://www.dima.tu-berlin.de/fileadmin/fg131/Publikation/Papers/NephelePACTs.pdf">Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing</a></em><em>.</em> This is an alternative parallel computing engine to MapReduce. It offers many operators (e.g., joins, filter, aggregation) to construct a data flow, instead of just a map and a reduce operator.</li>
</ul>
<p>There are several papers related to key-value stores (aka NoSQL databases).</p>
<ul>
<li><em><a href="http://www.eecs.berkeley.edu/~franklin/Papers/socc10armbrust.pdf">The Case for PIQL: A Performance Insightful Query Language.</a></em> It describes a declarative query language for querying key-value stores. The system automatically maintains and selects secondary indexes.</li>
<li><em><a href="http://research.yahoo.com/files/ycsb.pdf">Benchmarking Cloud Serving Systems with YCSB</a></em><em>.</em> It&#8217;s a benchmark for comparing various key-value stores (e.g., HBase, Cassandra, sharded MySQL and PNUTS). The benchmark currently focuses on performance comparison.</li>
<li><em><a href="http://www.uweb.ucsb.edu/~sudipto/talks/GStore-SOCC10.pptx">G-Store: A Scalable Data Store for Transactional Multi Key Access in the Cloud</a></em>. This paper adds multi-row transaction support on top of a key-value store that only supports single-row transactions. The approach is similar to Google&#8217;s Megastore. The prototype is built on top of HBase, an open source implementation of Bigtable.</li>
</ul>
<p>Some other papers that I took notes of.</p>
<ul>
<li><a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBIQFjAA&amp;url=http%3A%2F%2Fdspace.mit.edu%2Fbitstream%2Fhandle%2F1721.1%2F51381%2FMIT-CSAIL-TR-2010-003.pdf%3Fsequence%3D1&amp;ei=0PYYTOnVO8PsnQeA7OHJCg&amp;usg=AFQjCNEoFYZIp-4PqsqmDo6SSFqiqbOe1A">An Operating System for Multicore and Clouds: Mechanisms and Implementation</a>. This is a paper from MIT on building a virtual Cloud OS (called fos) across multiple cores and multiple machines. This are definitely challenges in how to hide the latency across machines.</li>
<li><a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=2&amp;ved=0CBkQFjAB&amp;url=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fantr%2Fpublications%2Fsocc0023-karagiannis.pdf&amp;ei=6_gYTIziOoTjnAek4cHKCg&amp;usg=AFQjCNF4j8DfWTa7dhFV12r5DxgaP4h5uw"><em>Hermes: Clustering Users in Large-Scale E-Mail Services</em></a>. This is a paper about clustering users based on their email exchanges. The motivating application is to save space by avoiding storing duplicated messages collocated on the same partition of the backend email server.</li>
<li><a href="http://dslab.epfl.ch/pubs/taas"><em>Automated Software Testing as a Service</em></a>. It&#8217;s about using the cloud for testing software.</li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/06/socc-2010-updates/feed/</wfw:commentRss>
		</item>
		<item>
		<title>New docs for Norbert</title>
		<link>http://sna-projects.com/blog/2010/06/new-docs-for-norbert/</link>
		<comments>http://sna-projects.com/blog/2010/06/new-docs-for-norbert/#comments</comments>
		<pubDate>Wed, 16 Jun 2010 00:55:23 +0000</pubDate>
		<dc:creator>jay</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sna-projects.com/blog/?p=150</guid>
		<description><![CDATA[We finally added some documentation for Norbert, our open source cluster management and RPC system.
]]></description>
			<content:encoded><![CDATA[<p>We finally added some documentation for <a href="http://sna-projects.com/norbert">Norbert</a>, our open source cluster management and RPC system.</p>
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2010/06/new-docs-for-norbert/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Voldemort 0.60 Released</title>
		<link>http://sna-projects.com/blog/2009/12/voldemort-060-released/</link>
		<comments>http://sna-projects.com/blog/2009/12/voldemort-060-released/#comments</comments>
		<pubDate>Thu, 17 Dec 2009 03:17:20 +0000</pubDate>
		<dc:creator>alex</dc:creator>
		
		<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://project-voldemort.com/blog/?p=141</guid>
		<description><![CDATA[
In sync with our plan for regular monthly releases, we&#8217;re excited to announce the release of version 0.60. Downloads are available, and you may browse the updated Javadoc or view the release notes.


In addition to bug fixes, several important new features and enhancements have made it into this release:


Admin Client/Server API: intended for functionality which is required, [...]]]></description>
			<content:encoded><![CDATA[<p>
In sync with our <a href="http://project-voldemort.com/blog/2009/11/release-057/">plan</a> for regular monthly releases, we&#8217;re excited to announce the release of version 0.60. <a href="http://github.com/voldemort/voldemort/downloads">Downloads</a> are available, and you may browse the updated <a href="http://project-voldemort.com/javadoc/all/">Javadoc</a> or view the <a href="http://github.com/voldemort/voldemort/blob/release-060/release_notes.txt">release notes</a>.
</p>
<p>
In addition to bug fixes, several important new features and enhancements have made it into this release:</p>
<ul>
<li>
Admin Client/Server API: intended for functionality which is required, but should be used sparingly (if at all), at the application level. This adds support for retrieval and update of metadata on remote nodes as well as <em>streaming</em> of keys and key/value pairs from one node to another.
</li>
<li>
EC2 testing, contributed by Kirk True: a distributed system requires tests which involve multiple machines. Amazon&#8217;s EC2 web service allows us to provision and de-provision nodes <em>programmatically</em>. The <a href="http://wiki.github.com/voldemort/voldemort/ec2-testing-infrastructure">EC2 Testing Infrastructure</a> allows for such tests to run on a regular basis along with other automated tests.
</li>
<li>
Support for large lists and strings in the JSON serializer. Previously, the binary JSON serialization format limited us to a maximum size of 32,767 for strings and lists (i.e. the maximum value of a signed 16-bit integer). The maximum size of a list or string is now 1,073,741,823 bytes.
</li>
<li>
<em>Experimental</em> support for <a href="http://project-voldemort.com/javadoc/all/voldemort/store/views/package-summary.html">views</a>. Views allow for computation to be moved close to the data. Suppose, for example, that we&#8217;re storing a serialized list as a value in a key/value pair and would like to append a single element. Normally, we&#8217;d have to transfer the entire list to the client, append the element, then transfer the modified list back to the server. With views, a &#8220;put&#8221; operation becomes a proxy for appending a value to the list <em>directly on the server</em>. Note that this feature is <b>experimental</b>: we can&#8217;t make any guarantees about its stability or performance, and the API is subject to change.
</li>
<li>Support for LZF compression, contributed by Ismael Juma and Tatu Saloranta.</li>
<li><a href="http://en.wikipedia.org/wiki/Interpolation_search">Interpolation search</a> for read-only stores.</li>
<li>A <a href="http://github.com/voldemort/voldemort/tree/master/contrib/ruby-client/">client</a> for the Ruby programming language (using an experimental Ruby protocol buffers gem), contributed by Claudio Cherubino.</li>
</ul>
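<p>To make the views idea from the list above a little more concrete, here is a minimal sketch of a &#8220;put&#8221; that acts as an append and runs where the data lives. It is written in Python for brevity and is <em>not</em> Voldemort&#8217;s views API: the names <code>InMemoryStore</code> and <code>AppendView</code>, the JSON encoding, and the example key are all assumptions made for illustration.</p>

```python
import json


class InMemoryStore:
    """Toy key/value store holding serialized (JSON) values.

    Hypothetical stand-in for a storage node; not Voldemort's API.
    """

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class AppendView:
    """Hypothetical view: a put() appends one element to the stored list."""

    def __init__(self, store):
        self._store = store

    def get(self, key):
        # Deserialize the stored list, or start from an empty one.
        raw = self._store.get(key)
        return json.loads(raw) if raw is not None else []

    def put(self, key, element):
        # This runs next to the data, so a caller only has to send
        # `element`, not the whole list.
        items = self.get(key)
        items.append(element)
        self._store.put(key, json.dumps(items))


store = InMemoryStore()
view = AppendView(store)
view.put("followers:42", "alice")
view.put("followers:42", "bob")
print(view.get("followers:42"))
```

<p>In this sketch the read-modify-write cycle happens entirely inside <code>AppendView.put</code>, which is the part a view would keep on the storage node.</p>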
]]></content:encoded>
			<wfw:commentRss>http://sna-projects.com/blog/2009/12/voldemort-060-released/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
