<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dragons in the Algorithm &#187; Programming</title>
	<atom:link href="http://mcherm.com/permalinks/1/category/programming/feed" rel="self" type="application/rss+xml" />
	<link>http://mcherm.com</link>
	<description>Adventures in Programming</description>
	<lastBuildDate>Tue, 13 Mar 2012 02:10:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<item>
		<title>Constant Crawl Design &#8211; Part 4</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-4</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-4#comments</comments>
		<pubDate>Tue, 13 Mar 2012 02:10:45 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=634</guid>
		<description><![CDATA[Suppose you wanted to build a tool for anonymously capturing the websites that a user visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../permalinks/1/constant-crawl-design-part-1">Suppose you wanted to build</a> a tool for anonymously capturing the websites that a user  visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was &#8220;public&#8221; and to do so without keeping any record (not even on the user&#8217;s own machine) of the sites visited or even tying together the different sites by one ID (even an anonymous one).<span id="more-634"></span>In this essay I will propose a possible solution to this.</p>
<p>The first issue is to decide your definition of a &#8220;public&#8221; site. Simple solutions like &#8220;anything over SSL is private&#8221; fail in both directions: some content which is not private gets served up over SSL (generally a good thing), and some content which <em>should</em> be private gets served up over unsecured connections (like Facebook pages). I propose the following rule instead: any page which is seen with <em>exactly</em> the same content by at least N different individuals will be considered public. (The value of N can be debated, but something like 6 or 9 could be a starting place.) This means that any pages that personalize to each user (even just a &#8220;Hello Janet Smith&#8221; at the top) will not get shared &#8212; that may well be exactly the behavior that is desired. And it means that any page that IS shared must at least be known to a reasonably large group (so it&#8217;s not THAT secret). This definition works reasonably well for things like Google Docs (stays private unless it is widely shared) but works poorly for things like a corporate intranet (may get shared unintentionally if too many people use the tool). Perhaps a configuration in the tool could allow certain domains to be excluded from sharing; even more likely is that no company with serious security concerns will allow a tool like this to be run on their intranet anyway.</p>
<p>If we accept this definition for when pages should be shared, what we have left is just a technical problem (albeit a difficult one): how can we determine whether the exact content of a certain page has been seen by at least N other users <em>without</em> keeping a record anywhere that it has been seen by any individual user? Storing information is not difficult; we have already accepted that this project will use a P2P storage network and use some local storage so we can store information locally or inject it into a P2P network (perhaps a different network than the one used for document storage). But some obvious approaches won&#8217;t work. We can&#8217;t store a record of what has been seen on the local disk because that would record the entire browser history. We can&#8217;t store it in the P2P network in a form that can be retrieved because then we have <em>already</em> made the page public. And we can&#8217;t store it tied to a particular user-id (even an anonymous one) because tying together different pages visited is <a href="https://www.nytimes.com/2006/08/09/technology/09aol.html?_r=1&amp;ei=5070&amp;en=6c5dfa2a9c1be4ec&amp;ex=1155787200&amp;emc=eta1&amp;pagewanted=all">an effective way</a> to identify an individual.</p>
<p>I will motivate the final solution by showing a series of partial solutions, each of which gets closer to solving the problem. As a first pass, we could use a P2P network which manages a large distributed hash table in which we can anonymously store values (multiple values for a given key) and which allows us to anonymously query the values. When a user viewed a page, they would use its URL as a key, and store a hash of the page as one value for that key, along with a counter. If the counter had already reached N then the page would be considered public and the program could go ahead and store the page itself in the storage network. The contents of non-public pages are not stored (locally <em>or</em> in the network), only their hash value. Old values would be kept in the P2P network for a good long time (a month or so would be good, to be sure of capturing the &#8220;long tail&#8221; of rarely viewed content) but would eventually be cleared out to make space for newer content.</p>
<p>This fails because there may be malicious participants in the P2P network. As soon as a malicious participant saw a value being inserted with a count of 1 they could immediately insert it again N-1 times. Or perhaps they could even lie and return a count of N even though the count was actually 0. In either case, users would be tricked into storing content that had not actually been seen by N distinct users.</p>
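<p>To make the failure concrete, here is a minimal Python sketch of the counter scheme and the replay attack. The <code>NaiveCounterTable</code> class, the example URL, and the threshold of 6 are all hypothetical stand-ins; a real distributed hash table would look nothing like an in-memory dictionary.</p>

```python
import hashlib


class NaiveCounterTable:
    """Toy stand-in for the distributed hash table: url -> {page_hash: count}."""

    def __init__(self):
        self._counts = {}

    def record_view(self, url, page_hash):
        """Increment and return the counter for this (url, hash) pair."""
        per_url = self._counts.setdefault(url, {})
        per_url[page_hash] = per_url.get(page_hash, 0) + 1
        return per_url[page_hash]


def hash_page(content):
    """Hash of the page content (SHA-256 is an assumption, any strong hash works)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


N = 6  # hypothetical threshold
URL = "https://bank.example/account"  # hypothetical private page
table = NaiveCounterTable()
private_hash = hash_page("Hello Janet Smith")

# One honest user views the page, so the count is 1.
first_count = table.record_view(URL, private_hash)

# A malicious peer that only observed the hash replays it N-1 times.
# Note that it never needed the page content to do so.
for _ in range(N - 1):
    replayed_count = table.record_view(URL, private_hash)

# The next honest check sees a count of N and would wrongly publish the page.
looks_public = replayed_count >= N
```

<p>The point of the sketch is that nothing in the stored record ties an increment to someone who actually saw the page.</p>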
<p>To protect against the malicious participants we need to clarify our threat model. At this stage (where we determine what content to make public), a malicious participant is one who is trying to force the user to reveal content that shouldn&#8217;t be public. Since public is defined as &#8220;seen by at least N individuals&#8221;, a malicious participant who already knew the content could just publicize it directly (or push it into the storage network), so it is malicious participants who do <em>not</em> know the content of the page that we are concerned with. That leads immediately to a solution: instead of trusting the count, we can store something that only a user who had actually seen the page could generate: a &#8220;proof&#8221; that the page has been seen.</p>
<p>Simply create N fixed blocks of text. (It could be as simple as &#8220;This is block &lt;N&gt;&#8221; &#8212; the content doesn&#8217;t really matter.) To obtain proof number n, concatenate the page text with block n then take the hash of it. A user will query the P2P network for all hash values stored for a certain URL. Some of them may match the hash codes for proofs 1 through N. If all of those hashes are present, then N other participants have seen this value and it should be pushed into the storage network; if not then the lowest-numbered missing value should be added. A malicious participant, not having the content, cannot generate any of these values, and they cannot determine anything from the hash values themselves (since a hash cannot be inverted).</p>
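<p>The proof scheme just described can be sketched in a few lines of Python. This is only an illustration: the function names, the use of SHA-256, and the threshold of 6 are my assumptions, and a plain <code>set</code> stands in for whatever the P2P network returns for a URL.</p>

```python
import hashlib

N = 6  # hypothetical threshold


def proof(page_text, n):
    """Proof number n: the hash of the page text concatenated with block n."""
    block = f"This is block {n}"
    return hashlib.sha256((page_text + block).encode("utf-8")).hexdigest()


def on_page_view(stored_hashes, page_text, n_proofs=N):
    """Decide what to do after viewing a page.

    stored_hashes is the set of hash values the network returned for the URL.
    Returns ("publish", None) if proofs 1..N are all present, otherwise
    ("add", value) where value is the lowest-numbered missing proof.
    """
    for n in range(1, n_proofs + 1):
        candidate = proof(page_text, n)
        if candidate not in stored_hashes:
            return ("add", candidate)
    return ("publish", None)


# Simulate N + 1 views of the same content against one shared set.
stored = set()
actions = []
for _ in range(N + 1):
    action, value = on_page_view(stored, "same public page")
    actions.append(action)
    if action == "add":
        stored.add(value)

# A junk value injected by someone without the content matches no proof.
junk_action, junk_value = on_page_view({"deadbeef"}, "same public page")
```

<p>The first N views each add one proof and the next view publishes; the injected junk value is simply ignored because it matches none of the proofs.</p>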
<p>There is, unfortunately, one problem remaining. If a single user views the same content N times, they might push all N proofs into the network all by themselves. The values cannot be tagged with an ID of the individual who found it because that would allow multiple page views to be tied together, leaking identity information. And they cannot keep a record of which sites they have visited as that too would  reveal information. The solution I propose solves this problem but replaces the certainty that there were N viewers with a probabilistic solution.</p>
<p>When the application is first installed it will select a random &#8220;user id&#8221;. This is just a string which is used to identify the installation so we can be sure not to double-count it. There needs to be a way for the user to read the &#8220;user id&#8221; and to modify it. That way someone who frequently uses several different computers can set the user id to the same value on all of them and will not double-count views just by moving between home, work, and mobile locations. Of course, it is important that we do not link this user id to the view history.</p>
<p>Now, instead of creating N different blocks to concatenate with the page contents we will create K different blocks (where K is somewhat less than N). For instance, K=8 is a reasonable value; certainly we require that K is <em>much</em> smaller than the number of users of the crawling system. As before, query to see which of the K proofs are present in the P2P system. The system takes a hash of the user id and reduces it mod K to select a random (but consistent) value k (where k is in [0..K-1]), essentially dividing all users up into K groups. If the proof for k is present, do nothing: it is possible that this proof was added by this same user. If the proof for k is missing, then add it. And if the total number of proofs present reaches some threshold n, then enough people have seen the page and it can be published into the storage network.</p>
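<p>Here is a rough Python sketch of the bucketed version. The names, the SHA-256 hash, and the threshold of 5 are assumptions of mine, and I treat &#8220;enough&#8221; as meaning at least <code>THRESHOLD</code> distinct proofs; a <code>set</code> again stands in for the values stored under the URL.</p>

```python
import hashlib

K = 8          # number of buckets, the value suggested in the text
THRESHOLD = 5  # hypothetical value for the threshold n


def bucket_for_user(user_id, k_buckets=K):
    """Map a user id to a consistent bucket k in [0..K-1]."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % k_buckets


def proof(page_text, k):
    """Proof for bucket k: hash of the page text plus a fixed block."""
    block = f"This is block {k}"
    return hashlib.sha256((page_text + block).encode("utf-8")).hexdigest()


def on_page_view(stored_hashes, page_text, user_id,
                 k_buckets=K, threshold=THRESHOLD):
    """Add this user's bucket proof if it is missing, then count proofs.

    Returns (should_publish, m) where m is the number of distinct valid
    proofs now present for this content.
    """
    k = bucket_for_user(user_id, k_buckets)
    stored_hashes.add(proof(page_text, k))  # no effect if already present
    valid = {proof(page_text, i) for i in range(k_buckets)}
    m = len(valid.intersection(stored_hashes))
    return m >= threshold, m


# One user reloading the same page many times only ever fills one bucket.
solo = set()
for _ in range(10):
    solo_publish, solo_m = on_page_view(solo, "some page", "user-a")
```

<p>Because the bucket is derived from the user id rather than stored with the proof, repeated views by one installation cannot push the count past 1, yet nothing in the network links the proof back to that installation.</p>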
<p>What should n be? Well, malicious participants cannot affect anything since they don&#8217;t have the page contents and can&#8217;t generate the proofs. Each actual viewer of the page has what is essentially a randomly chosen value for k. So as people view the page, they are selecting a random value of k (from 0 to K-1). This continues until n distinct values have been chosen (some may have been chosen multiple times). We observe the count of distinct values seen so far (call it m) and want to estimate the number of actual page viewers.</p>
<p>I&#8217;ll skip over the math and jump right to the answer: the expected number of actual viewers <a href="http://math.stackexchange.com/questions/114544/coupon-problem-generalized-or-birthday-problem-backward">is</a>:</p>
<p><a href="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_1.png"><img class="aligncenter size-full wp-image-640" title="K\left( H_{K}-H_{K-m}\right) " src="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_1.png" alt="K\left( H_{K}-H_{K-m}\right) " width="154" height="23" /></a> where H<sub>n</sub> is the n&#8217;th harmonic number:</p>
<p style="text-align: center;"><a href="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_2.png"><img class="aligncenter size-full wp-image-641" title="H_{n}=\sum _{i=1}^{n}\dfrac {1} {i}" src="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_2.png" alt="H_{n}=\sum _{i=1}^{n}\dfrac {1} {i}" width="104" height="60" /></a></p>
<p>So if we use 8 for K and we see 4 entries, that means there <em>must</em> have been at least 4 users seeing the same content and there were probably about 6. If we see 5 entries then there <em>must</em> have been at least 5 different individuals seeing the same content and there were probably 306/35, or about 8.7. (These figures count the views up to just before the next distinct value appears, which is K(H<sub>K-1</sub> - H<sub>K-m-1</sub>); the formula as written above counts views only until the m&#8217;th distinct value first appears and gives about 5.1 and 7.1 respectively.) The threshold can be adjusted to reach the desired expected number of individuals seeing the page, while keeping K small (if K is large enough that the number of participants in one bucket is small, then the bucket can potentially be used to help determine your browsing history).</p>
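<p>For anyone who wants to check the arithmetic, here is a small Python sketch using exact fractions. The function names are mine. It evaluates the formula literally (expected views until m distinct values have first appeared) and also the slightly later quantity, the expected position of the last view at which the count is still m, which is the one that reproduces the 306/35 quoted in the text.</p>

```python
from fractions import Fraction


def harmonic(n):
    """H_n, the n'th harmonic number, as an exact fraction (H_0 is 0)."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def expected_views_to_reach(m, k):
    """K * (H_K - H_{K-m}): expected views until m distinct values appear."""
    return k * (harmonic(k) - harmonic(k - m))


def expected_last_view_at(m, k):
    """Expected index of the last view while the count is still m (m below k).

    This is one less than the expected views needed to reach m + 1.
    """
    return expected_views_to_reach(m + 1, k) - 1


first_4 = expected_views_to_reach(4, 8)  # 533/105, about 5.08
first_5 = expected_views_to_reach(5, 8)  # 743/105, about 7.08
last_4 = expected_last_view_at(4, 8)     # 638/105, about 6.08
last_5 = expected_last_view_at(5, 8)     # 306/35, about 8.74
```

<p>Either way the expected number of real viewers sits comfortably above the number of proofs observed, which is what makes a small K workable.</p>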
<p>Well, that&#8217;s enough essays on this topic for now. If you want to learn more, you may want to check out <a href="https://github.com/titanous/constantcrawl">https://github.com/titanous/constantcrawl</a> where there may eventually be an effort to build this.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-4/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 3</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-3</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-3#comments</comments>
		<pubDate>Sat, 10 Mar 2012 12:01:52 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=631</guid>
		<description><![CDATA[Suppose you were building a tool integrated with web browsers to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network? There are many different types of P2P networks for storing an [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../permalinks/1/constant-crawl-design-part-1">Suppose you were building</a> a tool <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-2">integrated with web browsers</a> to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network?<span id="more-631"></span></p>
<p>There are <a href="https://en.wikipedia.org/wiki/Peer-to-peer#Applications">many different types</a> of P2P networks for storing and retrieving files. Different networks and protocols have been designed for different purposes and thus have different strengths and weaknesses. For instance, <a href="https://freenetproject.org/">Freenet</a> is designed to store and retrieve files with a focus on extremely strong anonymity guarantees and resistance to files being deleted, but with weaknesses such as being particularly slow. The <a href="https://en.wikipedia.org/wiki/Gnutella">Gnutella</a> network is decentralized and efficient but the usage is not anonymized. Rather than fully specifying the design of a storage network, this essay will just attempt to list the requirements. Re-using an existing protocol (or even an existing network) might be the best approach; next best would be to design one by combining well-tested components from existing successful networks.</p>
<p>The P2P storage network would need to support the following functions:</p>
<p>1. Storing values by key with some metadata. The key would be fixed and known (the URL of the content). The metadata would be small &#8212; things like the date the content was viewed and the hash of the content. The value would be the content of the URL and would <em>not</em> be small &#8212; it would be the size of a web page, image, PDF, or whatever was being stored. (Content larger than a certain size could be rejected.)</p>
<p>2. Storing <em>multiple</em> values by the same key. A page may change or may be viewed differently by different users. The storage network needs to support storing these multiple values.</p>
<p>3. Querying the metadata of a certain key. In particular, we need to query to find out whether there already exists an entry for a given URL (key) with a particular value for the hash of the content. To prevent bad actors from responding &#8220;yes&#8221; when it is really absent we may need to return the value. We would expect this query to be by FAR the most common query performed &#8212; hundreds of times more frequent than any other function the storage network would perform.</p>
<p>4. Retrieving the value (and metadata) for a given key. If multiple values were stored we would either return all of them or select one by metadata.</p>
<p>5. Querying to find out what keys had new values stored within some recent time period. This would not be used by normal users, but by organizations like the Internet Archive or Duck Duck Go that actually wanted to download the contents of this storage, thus making it into a means for crawling the entire web. It would be acceptable if the only way to perform this query were to contribute significant resources to the P2P network and also if the query were likely to be accurate instead of certain to be accurate.</p>
<p>Those are the only functions that would be required, but there are some broad characteristics of the P2P network that are also essential:</p>
<p>A. Anonymity: it should be quite difficult to determine who is performing any of calls 1 through 4. (It is OK if call 5 is not anonymous.) Otherwise, the network could be used to determine a user&#8217;s browsing history. This could be accomplished via onion routing, the elaborate approach used by Freenet, or any other solution that makes it impossible for any participant in the network to tell for sure what other members of the network are searching for.</p>
<p>B. Legal Protection for Stored Content. Legal liability for storing content on a P2P network can be a genuine problem. But in essentially all jurisdictions this can be avoided if the individuals participating in the network are not able to determine just what content they have. For instance, some systems break each file up into chunks which are encrypted so they cannot be understood without possessing all chunks, then store different chunks in different locations. No one person has any file on their system.</p>
<p>C. Reliability. The P2P system must retain its data even if peers join and drop out with some regularity. This means keeping redundant copies of everything. It must also function if some minority of the members of the P2P network are antagonists who will abuse the protocol in an attempt to harm the network. These are standard features on most P2P networks.</p>
<p>D. Purging of old values. The size of the P2P storage network cannot simply grow forever. So values need to be evicted eventually. Evicting values that are older than a certain age should work fairly well &#8212; if the content is still being viewed frequently it will be re-loaded rapidly and by keeping for a certain amount of time we would provide an opportunity to download the content for those using this to crawl the web. Also acceptable, but perhaps less ideal, would be automatically evicting the least frequently requested values.</p>
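<p>To pin down the five functions and the age-based purging, here is a toy single-process Python sketch of the interface. Everything in it is hypothetical: the class and method names, the 30-day default age, and the size limit are mine, and it deliberately ignores anonymity, redundancy, and legal protection, which are the hard parts. It also answers the existence query with a plain yes or no rather than returning the value.</p>

```python
import hashlib
import time


class ToyStorage:
    """In-memory stand-in for the storage network (not a real P2P API)."""

    def __init__(self, max_age_seconds=30 * 24 * 3600, max_size=5_000_000):
        self.max_age_seconds = max_age_seconds
        self.max_size = max_size
        self._entries = {}  # url -> list of {"hash", "content", "stored_at"}

    def store(self, url, content, now=None):
        """Functions 1 and 2: store one of possibly many values for a key."""
        if len(content) > self.max_size:
            raise ValueError("content too large")
        content_hash = hashlib.sha256(content).hexdigest()
        if not self.has(url, content_hash):
            stored_at = time.time() if now is None else now
            self._entries.setdefault(url, []).append(
                {"hash": content_hash, "content": content, "stored_at": stored_at}
            )
        return content_hash

    def has(self, url, content_hash):
        """Function 3: is there an entry for this URL with this content hash?"""
        return any(e["hash"] == content_hash for e in self._entries.get(url, []))

    def retrieve(self, url):
        """Function 4: return every stored value (with metadata) for the key."""
        return list(self._entries.get(url, []))

    def recent_keys(self, since):
        """Function 5: keys that had a value stored at or after `since`."""
        return {
            url
            for url, entries in self._entries.items()
            if any(e["stored_at"] >= since for e in entries)
        }

    def purge(self, now=None):
        """Characteristic D: evict values older than max_age_seconds."""
        now = time.time() if now is None else now
        evicted = 0
        for url in list(self._entries):
            kept = [e for e in self._entries[url]
                    if self.max_age_seconds >= now - e["stored_at"]]
            evicted += len(self._entries[url]) - len(kept)
            if kept:
                self._entries[url] = kept
            else:
                del self._entries[url]
        return evicted
```

<p>The optional <code>now</code> arguments exist only so the eviction behavior can be exercised without waiting a month.</p>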
<p>That&#8217;s about it for requirements for the storage network, except that given these constraints we would like for it to be as fast as possible, especially for frequently requested values. This implies that the number of network hops should be minimized (despite this, anonymity will require a fairly large number) and may favor solutions that store frequently-requested information in multiple locations.</p>
<p>The <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">next (and final) installment</a> of this series will discuss the interesting (and challenging) question of how to tell when a viewed page can be made public.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-3/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 2</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-2</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-2#comments</comments>
		<pubDate>Tue, 06 Mar 2012 03:18:54 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=624</guid>
		<description><![CDATA[Suppose you were building a tool for anonymously capturing the (public) websites that a user visited. What would the UI requirements be? The basic experience would be a perfectly normal browsing experience: users would launch their favorite web browser normally, would browse around the web normally and everything would &#8220;just work&#8221;. This means that clearly [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-1">Suppose you were building</a> a tool for anonymously capturing the (public) websites that a user visited. What would the UI requirements be?<span id="more-624"></span></p>
<p>The basic experience would be a perfectly normal browsing experience:  users would launch their favorite web browser normally, would browse  around the web normally and everything would &#8220;just work&#8221;. This means  that clearly the system would function as a browser plug-in.  Fortunately, nearly all modern browsers <a href="https://developer.mozilla.org/en/Extensions">support</a> <a href="https://code.google.com/chrome/extensions/docs.html">some</a> <a href="https://developer.apple.com/programs/safari/">form</a> of plug-ins. In principle, one could also develop this as a proxy, but  it would be much more difficult to develop an effective UI.</p>
<p>An easy-to-use installation process is important if one is seeking a large user base. This means using the standard mechanism for each platform (an installer for Windows; RPM, yum, etc. for Linux). It also means that the installer sets up the browser plugin, allocates the disk space needed for storage, and creates the services needed to join the P2P network.</p>
<p>The plugin itself offers two basic pieces of functionality. One would be that it captures the content of the web as it is viewed, and (where appropriate) archives it for the crawl. The other is the benefit for the user of the plugin: it allows them to view content from the crawled archive when the normal site is slow or unavailable. (For instance, the &#8220;slashdot effect&#8221; where a smaller site is featured on a popular news site like Slashdot or Reddit and becomes overwhelmed.)</p>
<p>The plugin should have three basic &#8220;modes&#8221;. Any well-behaved plugin should provide an easy way to turn it off, so one mode is &#8220;disabled&#8221;. Another mode would be for loading all content (or perhaps just re-loading the current page) from the archive. And of course there would be the normal mode (more on this in a moment). The mode affects the page rather drastically (changes where we are getting it from or whether we are potentially sharing it with the world) so the plugin should probably provide an indicator of some sort in or near the URL bar, and this indicator might as well provide the means for switching the mode as well.</p>
<p>What behavior would we want in &#8220;normal mode&#8221;? Pages that get viewed are eligible for sharing, but <em>only</em> if that page is determined to be a &#8220;public&#8221; page (see other essays for details on this). So the plugin would need to capture the content of the page and immediately after rendering it (perhaps in a separate thread) begin to process it for possible sharing. I&#8217;ve used the term &#8220;page&#8221;, but essentially all content should be treated this way, including images, CSS and JavaScript files, even AJAX calls: any content downloaded by the browser.</p>
<p>The next question is when content should be downloaded from the archive. Unlike Google Web Accelerator, I think it is unlikely that this design involving anonymous P2P technology will ever be <em>faster </em>than ordinary browsing to a normally functioning web site. But it can be available in those cases where the ordinary site no longer is, where the HTTP request times out, or a 404 (page not found), 410 (page gone), 503 (server overloaded) or some other error is returned. The simplest solution would be to attempt to load the page from the storage network whenever these conditions occur. Always trying to load every page from both web and storage (and displaying the one that arrives first) would put far too much load on the storage network for pages that would never be viewed.</p>
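<p>The fallback rule above, combined with the three modes, can be sketched as a small decision function. This is Python purely for illustration (a real plugin would live in the browser), the names are hypothetical, and I have limited the fallback to a timeout plus the three status codes named above even though other errors might reasonably qualify.</p>

```python
from enum import Enum


class Mode(Enum):
    DISABLED = "disabled"  # plugin turned off
    NORMAL = "normal"      # browse the web, fall back to the archive on failure
    ARCHIVE = "archive"    # always load from the archive


# Status codes that trigger the fallback (an assumption, the list could grow).
FALLBACK_STATUSES = {404, 410, 503}


def choose_source(mode, status=None, timed_out=False):
    """Return "web" or "archive" for a request.

    status is the HTTP status of the ordinary web request (None if there was
    no response), and timed_out says whether that request timed out.
    """
    if mode is Mode.DISABLED:
        return "web"
    if mode is Mode.ARCHIVE:
        return "archive"
    if timed_out or status in FALLBACK_STATUSES:
        return "archive"
    return "web"


normal_ok = choose_source(Mode.NORMAL, status=200)
normal_gone = choose_source(Mode.NORMAL, status=410)
normal_timeout = choose_source(Mode.NORMAL, timed_out=True)
disabled_404 = choose_source(Mode.DISABLED, status=404)
```

<p>Note that in normal mode the storage network is only consulted after the ordinary request has already failed, which keeps the load on it low.</p>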
<p>It is worth noting that this storage network can store <em>multiple</em> versions of each page. This could be because the page has changed over time, because it is served up differently to different classes of user (perhaps by geographic region), or for stranger reasons like a malicious user injecting false versions of a page into the network. This property might lead to some very interesting and powerful uses (&#8220;See old versions of any page!&#8221;) and it might pose new technical challenges (&#8220;Allow trusted entities like the EFF to somehow flag which version of the page is &#8216;real&#8217;&#8221;). Such considerations are an excellent subject for future consideration, but the <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-3">next essay</a> will address technical challenges with the storage and the <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">final part</a> will show how to anonymously determine whether to share a page.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 1</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-1</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-1#comments</comments>
		<pubDate>Mon, 05 Mar 2012 02:42:08 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=620</guid>
		<description><![CDATA[Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google&#8217;s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more [...]]]></description>
			<content:encoded><![CDATA[<p>Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google&#8217;s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more reliably; the advantage to Google was that they got to crawl the web &#8220;as the user sees it&#8221; instead of just what Googlebot gets&#8230; and that they got to see <em>every single page</em> you viewed, thus feeding even more into the giant maw of information that is Google.</p>
<p>Well, Google eventually dropped Google Web Accelerator (I wonder why?), but the idea is interesting. Suppose you wanted to build a similar tool that would capture the web viewing experience of thousands of users (or more). For users it could provide a reliable source for sites that go down or that get hit with the &#8220;slashdot&#8221; effect. For the Internet Archive or a smaller search engine like Duck Duck Go, it would provide a means of performing a massive web crawl. For someone like the EFF or human-rights groups it would provide a way to monitor whether some users (such as those in China) are being &#8220;secretly&#8221; served different content. But unlike Google Web Accelerator, a community-driven project would have to solve one very hard problem: how to do this while keeping the user&#8217;s browsing history secret &#8212; the exact <em>opposite</em> of what Google&#8217;s project did.<span id="more-620"></span></p>
<p>This topic came up at a meeting of the Philly Startup Hackers group, and after an entire evening of vigorous discussion, we think that such a project would be technically feasible. In this series of essays I will attempt to outline the technical architecture of this solution. This first one will explain the major components and how they fit together.</p>
<p>Broadly, I&#8217;ll describe three different problems to be solved and we&#8217;ll assume that the solution to each one is a layer in the architecture. Problem (A) is the user interface, problem (B) is deciding what information is public (surprisingly, this turns out to be the most difficult part), and problem (C) is storing the pages.</p>
<p>The solution to problem (A) (user interface) is quite straightforward. That is not to minimize it: implementing the user interface well is by far the most work of the whole project and the piece most likely to contribute to the success or failure of it. But the approach to take is clear. This should integrate into the customer&#8217;s browser, and with modern browsers that means implementing it as a browser plug-in. Also essential to the user experience is the installation experience: to be successful, this needs to be extremely easy to install and very simple to configure (preferably with <em>no</em> configuration required for use).</p>
<p>Problem B is to decide what pages should be public, and which are private to the user. The UI can help here if users can easily flip into modes where everything is captured or where nothing is. But one cannot expect the user to click something before (or after) every page &#8212; there also needs to be a &#8220;normal&#8221; browsing mode. We&#8217;d like to (anonymously) record the majority of pages visited (after all, that&#8217;s the point of the tool), but a page showing your bank account probably shouldn&#8217;t be shown, nor should a Google Docs essay you&#8217;ve been writing. Assuming that everything viewed with HTTPS is private and everything viewed with HTTP is public seems much too simple a rule, particularly as privacy sensitive sites are beginning to default to HTTPS for all users.</p>
<p>So the approach I am proposing is that we assume that if several people see the <em>exact </em>same page then it must be a public one. My bank&#8217;s logged-in view of my accounts won&#8217;t be seen by <em>any</em> other users, while the Google Doc essay I share with a friend will only be seen by a couple of people. If we set the threshold to something like 6 or 9 users then we can be fairly confident that the content was public. To capture rarely-seen sites we&#8217;d want the count to last for some time: 6-9 users within the same month, perhaps. Now the technical challenge is to figure out how to tell whether several people have seen the content <em>without</em> revealing it (since it&#8217;s private) and without leaving any trace that <em>we</em> have viewed it (for privacy reasons).</p>
<p>Problem C is storing the content. Spotify is a popular music player which has been installed by <a href="http://mashable.com/2011/09/21/spotify-2-million-subscribers/">millions</a> of users. Yet they don&#8217;t need huge servers to transmit all those data streams. Instead, they <a href="http://www.csc.kth.se/~gkreitz/spotify/">use P2P technology</a> and each user provides a certain amount of storage and a certain amount of bandwidth. Other projects like <a href="https://freenetproject.org/papers.html">Freenet</a> have proven that P2P sharing can store data and keep all the participants anonymous. So I propose leveraging fairly standard P2P approaches (or better yet, an existing P2P storage network) for storing, finding, and retrieving the content.</p>
<p>Well, that&#8217;s enough for one essay. Check back in <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-2">part 2</a> for more details about the UI, <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-3">part 3</a> for a discussion of the storage network, and <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">part 4</a> for analysis of how to anonymously determine whether something can be shared.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-1/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Host Error 2</title>
		<link>http://mcherm.com/permalinks/1/host-error-2</link>
		<comments>http://mcherm.com/permalinks/1/host-error-2#comments</comments>
		<pubDate>Fri, 03 Feb 2012 20:21:09 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=617</guid>
		<description><![CDATA[Another posting on how to understand Profile errors. If you ever see &#8220;Host error number XXX&#8221;, it means that this was the XXX&#8217;th error of the day that this Profile instance wrote to the logs. Get someone to look it up in the Profile logs. Also, Calling mrpc ZWRAP with [925, 8864, ""44758220"", &#124;!&#124;] will [...]]]></description>
			<content:encoded><![CDATA[<p>Another posting on how to understand Profile errors.<span id="more-617"></span></p>
<p>If you ever see &#8220;Host error number XXX&#8221;, it means that this was the XXX&#8217;th error of the day that this Profile instance wrote to the logs. Get someone to look it up in the Profile logs.</p>
<p>Also, <em>Calling mrpc ZWRAP with [925, 8864, ""44758220"", |!|]</em> will fail if 8864 is not a valid profile userid (which is the case for me).</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/host-error-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Namespace for a valid SOAP message</title>
		<link>http://mcherm.com/permalinks/1/namespace-for-a-valid-soap-message</link>
		<comments>http://mcherm.com/permalinks/1/namespace-for-a-valid-soap-message#comments</comments>
		<pubDate>Mon, 12 Dec 2011 14:35:29 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=606</guid>
		<description><![CDATA[A brief hint: if you see an error message like this: InputStream does not represent a valid SOAP 1.1 Message check the namespace of the SOAP envelope SOAP 1.1: http://schemas.xmlsoap.org/soap/envelope/ SOAP 1.2: http://www.w3.org/2003/05/soap-envelope]]></description>
			<content:encoded><![CDATA[<p>A brief hint: if you see an error message like this:</p>
<p style="padding-left: 30px;">InputStream does not represent a valid SOAP 1.1 Message</p>
<p>check the namespace of the SOAP envelope. It has to match the SOAP version character for character:</p>
<p>SOAP 1.1: <a rel="nofollow" href="http://schemas.xmlsoap.org/soap/envelope/" target="_blank">http://schemas.xmlsoap.org/soap/envelope/</a></p>
<p>SOAP 1.2: <a rel="nofollow" href="http://www.w3.org/2003/05/soap-envelope" target="_blank">http://www.w3.org/2003/05/soap-envelope</a> (note: no trailing slash)</p>
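<p>If you want to check a message programmatically, something like the following Python sketch may help. It uses only the standard library; the function and helper names are mine, not part of any SOAP toolkit. Namespaces are compared as exact strings, and the SOAP 1.2 namespace as published by the W3C has no trailing slash while the SOAP 1.1 namespace does.</p>

```python
import xml.etree.ElementTree as ET

SOAP_VERSIONS = {
    "http://schemas.xmlsoap.org/soap/envelope/": "SOAP 1.1",
    "http://www.w3.org/2003/05/soap-envelope": "SOAP 1.2",
}


def soap_version(xml_text):
    """Return "SOAP 1.1", "SOAP 1.2", or None for an unrecognized envelope."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{namespace}localname".
    if not root.tag.startswith("{"):
        return None
    namespace, _, local_name = root.tag[1:].partition("}")
    if local_name != "Envelope":
        return None
    return SOAP_VERSIONS.get(namespace)


def empty_envelope(namespace):
    """Build a minimal envelope document in the given namespace (demo helper)."""
    envelope = ET.Element("{%s}Envelope" % namespace)
    ET.SubElement(envelope, "{%s}Body" % namespace)
    return ET.tostring(envelope, encoding="unicode")


detected = soap_version(empty_envelope("http://schemas.xmlsoap.org/soap/envelope/"))
```

<p>A result of None for a message you believed to be valid usually means the envelope namespace is misspelled or belongs to the other SOAP version.</p>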
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/namespace-for-a-valid-soap-message/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Binary Backward Compatibility</title>
		<link>http://mcherm.com/permalinks/1/binary-backward-compatibility</link>
		<comments>http://mcherm.com/permalinks/1/binary-backward-compatibility#comments</comments>
		<pubDate>Thu, 08 Dec 2011 03:00:12 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=601</guid>
		<description><![CDATA[I saw this interesting article about a weakness in the Scala language. The weakness applies not just to Scala, but to pretty much any language: the community using the language cannot grow past a certain point until it somehow solves the problem of libraries depending on other libraries in a large (deep) tree. Why is [...]]]></description>
			<content:encoded><![CDATA[<p>I saw this <a href="http://lift.la/scalas-version-fragility-make-the-enterprise">interesting article</a> about a weakness in the Scala language. The weakness applies not just to Scala, but to pretty much any language: the community using the language cannot grow past a certain point until it somehow solves the problem of libraries depending on other libraries in a large (deep) tree.<span id="more-601"></span> Why is this a problem? Because when the language moves forward (to the next version) a deep dependency tree means you can&#8217;t move forward until <em>every</em> library in the tree is moved to the new version, and making every library in the community do that simultaneously is extremely difficult. You can see the problem right now in the Python community: Python 3 was released THREE YEARS ago, but today many major libraries still don&#8217;t support it.</p>
<p>What I found most interesting was something that David Pollak (the post&#8217;s author) alluded to but did not emphasize: an example of a language that <em>has</em> solved this problem. Surprisingly, it is the much-maligned Java. (And perhaps this feature is one of the reasons for Java&#8217;s success in &#8220;the enterprise&#8221;, where backward compatibility with old or unmaintained libraries is often a very big deal.) The Java solution is to provide an incredibly strong degree of backward compatibility at the binary level (not just the source). As far as I know, essentially all code written under Java 1.0 (16 years ago) will still compile under the most recent Java release, and code <em>compiled</em> by that Java 1.0 compiler will still run under the most recent JVM. The price paid is some real ugliness in the name of backward compatibility (like old APIs that still return Hashtable or ArrayList instead of Map or List, and type erasure that makes typed collections less powerful than they could be). But however much you may scoff at Java for poor language design, this feat of backward compatibility is something quite impressive.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/binary-backward-compatibility/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Story Points</title>
		<link>http://mcherm.com/permalinks/1/story-points</link>
		<comments>http://mcherm.com/permalinks/1/story-points#comments</comments>
		<pubDate>Thu, 29 Sep 2011 01:27:08 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=585</guid>
		<description><![CDATA[If you have complete and accurate requirements for your project which won&#8217;t change, and your development team is spot-on in estimating and highly consistent in their development pace, and there are no surprises, then you can produce highly accurate project timeline estimates up front. Such accurate estimates are (or, more accurately, would be) quite useful [...]]]></description>
			<content:encoded><![CDATA[<p>If you have complete and accurate requirements for your project which won&#8217;t change, and your development team is spot-on in estimating and highly consistent in their development pace, and there are no surprises, then you can produce highly accurate project timeline estimates up front. Such accurate estimates are (or, more accurately, would be) quite useful and well worth the effort it takes to produce them because of how nicely you can schedule everything. But how about the rest of us, for whom none of this is true?<span id="more-585"></span></p>
<p>There really isn&#8217;t much benefit to putting in lots of hours developing a detailed estimate if the project isn&#8217;t going to proceed according to plan <em>anyway</em> (and it rarely does). This is why most agile development approaches, including Scrum, use a less precise but also less time-consuming approach. By going with rough requirements and a simple, imprecise estimation process, a team can produce rough estimates in a surprisingly short amount of time. The time saved writing requirement documents and producing estimates can be used to build something useful instead.</p>
<p>The process that I have found to be most useful starts out with requirements that are simple: just a paragraph or two written down for a feature and a few minutes discussion to make sure everyone understands it. The team meets, making sure to include someone from the &#8220;business side&#8221; who can answer questions about what is needed, the developers, QA, DBAs, and whatever other specialists are needed. The business person explains what is needed; the team talks through how they will code and how it will be tested. Then we&#8217;re ready to estimate.</p>
<p>Everyone just says how long they think it will take. To avoid &#8220;groupthink&#8221; where everyone just agrees with the first person to speak, it&#8217;s good to have each person come up with their idea independently before comparing: selecting cards and all revealing at the same time is one way to do this. Everyone estimates: yes, that means the DBA may estimate a Java coding task, but that&#8217;s OK. To avoid long useless debates over whether it&#8217;s 23.2 or 23.4 we usually limit the estimated sizes to some discrete values: 1, 2, 3, 5, 8, 13, 20, and &#8220;more&#8221; are a widely used set of values (the values are chosen to make it easy to split a task). If, after hearing what was said, we all agree on the size, then we&#8217;re done (this is where we all discount the DBA&#8217;s estimate of the Java task); if not, then we discuss for a few more minutes: maybe someone realized an extra step the others missed or knows where to find test data without having to enter it. If we still disagree after that, just take the larger estimate.</p>
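<p>The decision rule is simple enough to write down. The Python sketch below is only an illustration of the rule as described above; the function name is made up, and the &#8220;more&#8221; card is left out to keep the values numeric.</p>

```python
ALLOWED_SIZES = (1, 2, 3, 5, 8, 13, 20)


def settle(first_round, second_round=None):
    """Return the story size, or None if the team still has to discuss.

    first_round holds the cards revealed simultaneously; second_round holds
    the cards revealed after a few minutes of discussion, if that happened.
    """
    for size in list(first_round) + list(second_round or []):
        if size not in ALLOWED_SIZES:
            raise ValueError("not an allowed size: %r" % (size,))
    if len(set(first_round)) == 1:
        return first_round[0]        # everyone agreed straight away
    if second_round is None:
        return None                  # disagreement: talk for a few minutes
    return max(second_round)         # still split? take the larger estimate


needs_discussion = settle([3, 5, 5, 13])
final_size = settle([3, 5, 5, 13], [5, 5, 5, 8])
```

<p>Here the first call returns None (the team disagrees), and the second returns 8, the larger of the estimates still on the table after discussion.</p>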
<p>That&#8217;s it! It takes only a few minutes to produce estimates this way. Of course, the estimates are worth what you put into them: the business MUST realize that these are only rough numbers. A common way to do that is to estimate in &#8220;Story Points&#8221; instead of &#8220;hours&#8221; or &#8220;days&#8221;. Speaking in terms of a unit that is less concrete seems to help remind everyone that this is only a rough value. But it is a rough value that did NOT take weeks of preparation, and thus well worth it.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/story-points/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Even Immutables are Hard with Threads</title>
		<link>http://mcherm.com/permalinks/1/how-even-immutables-are-hard-with-threads</link>
		<comments>http://mcherm.com/permalinks/1/how-even-immutables-are-hard-with-threads#comments</comments>
		<pubDate>Wed, 24 Aug 2011 03:09:52 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=574</guid>
		<description><![CDATA[Armin Rigo has a blog posting (worthy of an article of its own) proposing using STM (Software Transactional Memory) in PyPy. In a discussion on reddit someone suggested that you could have weaker threading guarantees and just use locks manually. It wouldn&#8217;t be so hard, they explained, because: You really only have to do it [...]]]></description>
			<content:encoded><![CDATA[<p>Armin Rigo has <a href="http://morepypy.blogspot.com/2011/08/we-need-software-transactional-memory.html">a blog posting</a> (worthy of an article of its own) proposing using STM (Software Transactional Memory) in PyPy. In <a href="http://www.reddit.com/r/Python/comments/jrm0t/pypy_status_blog_we_need_software_transactional/">a discussion on reddit</a> someone suggested that you could have weaker threading guarantees and just use locks manually.<span id="more-574"></span> It wouldn&#8217;t be so hard, they explained, because:</p>
<blockquote><p>You really only have to do it for data that is not read-only. I would for example say that it&#8217;s pretty rare for classes to change after they have been set up for the first time (presumably before any threads are even started), making the class basically read-only, which could be safely shared across threads.</p></blockquote>
<p>I wanted to give a detailed response explaining why this approach is naive. Actually, it has been tried before and failed. It may work OK with certain kinds of languages (mostly &#8220;functional&#8221; languages), but fails with other kinds of languages, and Python is an extreme example of the kind of language where it won&#8217;t work.</p>
<p>For an example, consider Java. The JVM (Java Virtual Machine) has special features that were added to support exactly this behavior, but in practice few programmers use them. Let&#8217;s take a really simple example: suppose you create some data structure and a function to initialize it. In thread A you create the object, then initialize it, then pass it off to existing threads B and C. Threads B and C simultaneously read stuff from the data structure in ways that WOULD be dangerous except that the data structure is immutable after initialization.</p>
<p>The problem is that the guarantees provided in threading are MUCH weaker than you think. It&#8217;s not just that there are different threads all working at the same time and reading and writing from the same memory locations; the architecture of modern CPUs makes that impossible. You see, it takes hundreds of times longer to read something from memory or write it to memory than it takes to process something in the registers. So to execute &#8220;X = Y + 1&#8221;, the computer COULD spend 100 cycles reading Y, then 1 cycle adding 1, then 100 cycles writing X for a total of 201 cycles to execute. But that would be unbearably slow. Instead, it takes 100 cycles to do a bulk read of the whole memory area around where Y is stored into high-speed caches. It takes another 100 cycles to do a bulk read of the whole memory area around where X is stored. It takes 1 cycle to add, then takes 100 cycles to do a bulk write of the memory area containing X. That&#8217;s 301 cycles&#8230; which sounds even worse.</p>
<p>But it&#8217;s NOT worse if the compiler cheats. Instead, it spends 100 cycles reading Y and 100 cycles reading X. Then it executes the +1 for one cycle. Then, BEFORE writing out X it does some OTHER calculations on the chunks of memory that have been read in. If the program has good cache locality (active objects are near each other in memory) it may get 75 cycles of useful work done before it needs to spend 100 cycles to &#8220;flush the cache out&#8221; (write X and the other things that were updated). That would be a total of 375 cycles to do 75 steps of work, or just 5 cycles per step, which is a LOT better than 201!</p>
<p>But in order to do this, the compiler has to &#8220;cheat&#8221;. It has to execute pieces of work out of order, although it can take special precautions to make sure that it gets the same answer as if it executed them in the order written. As seen by THIS thread. But as seen by a DIFFERENT thread, the steps may appear to happen in a very different order. The other thread won&#8217;t see the effects until they get flushed to main memory, and that won&#8217;t happen after every computation (unless it is running 100x too slow!!!).</p>
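<p>For anyone who wants to check the arithmetic, here it is spelled out in Python. The 100-cycle and 1-cycle costs are the same round illustrative figures used above, not measurements of any real CPU.</p>

```python
MEMORY_CYCLES = 100   # one trip to or from main memory (illustrative figure)
COMPUTE_CYCLES = 1    # one step of work on data already loaded


def uncached_cycles():
    # Read Y, add 1, write X, with every step going straight to memory.
    return MEMORY_CYCLES + COMPUTE_CYCLES + MEMORY_CYCLES


def cached_cycles(useful_steps):
    # Two bulk reads (around Y and around X), the work itself, one bulk flush.
    return 2 * MEMORY_CYCLES + useful_steps * COMPUTE_CYCLES + MEMORY_CYCLES


single_statement = uncached_cycles()        # 201
cache_one_step = cached_cycles(1)           # 301
cache_many_steps = cached_cycles(75)        # 375
cycles_per_step = cache_many_steps / 75     # 5.0
```

<p>The fixed 300 cycles of memory traffic only pay off when they are spread across many steps of work, which is why the reordering matters so much.</p>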
<p>WHEW!! Big wall of text there, but the story should explain why one thread in a program may see the computations by another thread happen in a different order. So imagine this:</p>
<p>&#8220;In thread A you create the object, then initialize it, then pass it off to existing threads B and C.&#8221;</p>
<p>But imagine that from thread C&#8217;s point of view, A created it, then passed it off, and only initializes it LATER. In fact, perhaps C will start using it at the same time that A is initializing it &#8212; so it&#8217;s not really immutable, and terrible errors result. This is NOT just a theoretical risk: I have written real code that exhibited this behavior when running on a multi-core machine.</p>
<p>In order to help protect against this, the Java language added a special exception to the Java threading model. Despite all other threading rules, if a field is declared &#8220;final&#8221; (it can never be reassigned), then the value assigned to it within the constructor is guaranteed to be in place before the constructor ends EVEN AS SEEN BY OTHER THREADS (provided the object is not handed to another thread before the constructor finishes). In theory, this is a great tool for creating data structures ahead of time and then reading them after initialization from other threads.</p>
<p>But in <strong>practice</strong> it isn&#8217;t so good. Initializing everything within the constructor of an immutable object turns out to be a real pain. Often you really want to use a hashtable (not immutable), or use Spring injection to populate your objects after the constructor, or a slew of other choices that make it hard to stuff all your setup code inside of constructors. Some languages support this better: Scala and Clojure are examples of languages on the JVM that use this feature well, but in Java it is awkward because there&#8217;s little special support for working with immutable objects. Python, as a language, is even worse: apart from a few built-in types such as tuples and strings, there are NO immutable objects in Python, and nothing stops code from modifying an instance of your class after __init__. So while it might be possible to do as you suggest (create special locks and use them around every __init__ method, then carefully make sure nothing is modified outside of __init__), the resulting language wouldn&#8217;t really read like Python.</p>
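<p>To give a feel for what &#8220;make sure nothing is modified outside of __init__&#8221; would look like, here is a rough Python sketch. The class names are invented for illustration, and the sketch covers only the freezing convention; it does no locking and says nothing about what other threads are guaranteed to see.</p>

```python
class FrozenAfterInit:
    """Instances reject attribute assignment once _freeze() has been called."""

    _frozen = False  # class-level default, used until an instance is frozen

    def _freeze(self):
        # Bypass our own __setattr__ so that the flag itself can be stored.
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        if self._frozen:
            raise AttributeError("cannot set %r: object is frozen" % name)
        object.__setattr__(self, name, value)


class Account(FrozenAfterInit):
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance
        self._freeze()  # every subclass has to remember to make this call


account = Account("alice", 100)
try:
    account.balance = 0
    rejected = False
except AttributeError:
    rejected = True

# The check is only a convention: a determined caller can still bypass it.
object.__setattr__(account, "balance", 0)
```

<p>Ordinary assignment is rejected, yet the last line still changes the balance. The &#8220;immutability&#8221; is a polite agreement rather than something the language enforces, and every class has to opt in by hand.</p>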
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/how-even-immutables-are-hard-with-threads/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When to Wrap a Library</title>
		<link>http://mcherm.com/permalinks/1/when-to-wrap-a-library</link>
		<comments>http://mcherm.com/permalinks/1/when-to-wrap-a-library#comments</comments>
		<pubDate>Sun, 03 Jul 2011 16:20:21 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=562</guid>
		<description><![CDATA[I find that this comes up fairly frequently. You find some useful library: perhaps it does logging, or enforces design-by-contract, or it provides an API for calling web services. But someone on the team suggests that instead of using the library directly, we should create a wrapper: &#8220;that way, if we ever decide to switch [...]]]></description>
			<content:encoded><![CDATA[<p>I find that this comes up fairly frequently. You find some useful library: perhaps it does logging, or enforces design-by-contract, or it provides an API for calling web services. But someone on the team suggests that instead of using the library directly, we should create a wrapper: &#8220;that way, if we ever decide to switch to a different library instead it will be easy to switch&#8221;. Is this a good idea?<span id="more-562"></span></p>
<p>There are a few really good reasons for wrapping a library. The most important of these is to add functionality or simplify use of the library. For instance, in a recent project we used Spring&#8217;s library for web service calls in order to make calls to our own company&#8217;s collection of web services. But when calling <em>our</em> web services, there are a bunch of things that would be nice to do. We always want the same value for the address to connect to, the timeout for the calls, and the set of headers to provide. We want additional special handling for errors wrapped around every call. Adding these features in a wrapper makes the wrapper <em>less</em> powerful (now it&#8217;s good only for calling <em>our</em> services whereas Spring&#8217;s original library could call any web service), but at the same time makes it much more useful for that one specific purpose.</p>
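<p>As a rough illustration of that kind of wrapper, here is a sketch in Python rather than the Java and Spring code we actually used. Everything in it is hypothetical: the class names, the placeholder address and header, and the assumed signature of the underlying <em>send</em> function, which stands in for whatever general-purpose library is being wrapped.</p>

```python
class ServiceCallError(Exception):
    """Raised by the wrapper for any failure of the underlying call."""


class CompanyServiceClient:
    """Hypothetical wrapper: same address, timeout and headers on every call."""

    BASE_URL = "https://services.example.invalid"   # placeholder address
    TIMEOUT_SECONDS = 10
    HEADERS = {"X-Caller": "example-app"}           # placeholder header

    def __init__(self, send):
        # `send` stands in for the general-purpose library being wrapped.
        # Assumed signature: send(url, payload, headers, timeout).
        self._send = send

    def call(self, operation, payload):
        url = self.BASE_URL + "/" + operation
        try:
            return self._send(url, payload, dict(self.HEADERS), self.TIMEOUT_SECONDS)
        except Exception as exc:
            # One place for the special error handling every call needs.
            raise ServiceCallError("call to %s failed" % operation) from exc


def fake_send(url, payload, headers, timeout):
    # Stand-in transport for the demo: echoes back what it was given.
    return {"url": url, "payload": payload, "headers": headers, "timeout": timeout}


client = CompanyServiceClient(fake_send)
reply = client.call("accounts", {"id": 7})
```

<p>Callers now say only which operation they want and what to send. The price is that this client can no longer talk to anything except the one configured set of services.</p>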
<p>I have also seen cases where the existing library had a terrible interface (API), and the wrapper attempts to make it palatable. The &#8220;<a title="Slick" href="http://slick.cokeandcode.com/">Slick</a>&#8221; library is a <del>Python</del>[ed] Java wrapper around <a href="http://www.lwjgl.org/">LWJGL</a> adding no real functionality but making it decent enough to use. This is a rare use case: most libraries that have a terrible interface also have lousy features and you&#8217;re better off finding a different library instead.</p>
<p>The most common argument that I hear is neither of these cases: it is that we should wrap the library so we can easily switch to a different library. In fact, I most often hear this from people developing in a language with <a title="strong typing defined" href="http://www.artima.com/weblogs/viewpost.jsp?thread=7590">strong typing</a>, such as Java. I find this argument completely unpersuasive, for two reasons. First of all, when you switch libraries, the new library typically does <em>not</em> have exactly the same API. For example, when we <a title="My previous article on why we switched" href="http://mcherm.com/permalinks/1/logging-apis-evaluating-options">switched</a> to SLF4J for logging one of the reasons for doing so was that it offered a better API that allowed functionality not possible with the previous API. Secondly, if you DO switch to a library with an API that is equivalent, in a strongly-typed language you can use standard refactoring tools to perform the switch with very little risk of introducing bugs. (If the APIs are close enough a simple search-and-replace for an import statement may do it.)</p>
<p>There are advantages to using a library directly. Developers who have encountered the library elsewhere may already be familiar with it. The documentation for the library is likely to be far more extensive than the documentation for your wrapper. It is often safe to assume that the designers of the library are better at designing an API for this feature than you are. As the library is upgraded, newer features will automatically be available. Most of all, having one fewer layer means there is simply less to learn to understand the system.</p>
<p>There are still a few advantages to wrapping without extra features. It gives you a place to add some logging code, or timers around an external call (for profiling), or validation checks. And there are some cases where you want to be able to use <em>different</em> libraries with the same codebase &#8212; then a wrapper is indispensable. <a href="http://commons.apache.org/logging/">Commons Logging</a> is an example of this: it allows a library to use different logging frameworks depending on what application it has been embedded in.</p>
<p>So my approach to the &#8220;to wrap or not to wrap&#8221; question goes like this. First of all, will my wrapper add functionality, or will it remove functionality and thereby simplify the interface? If so, then wrapping makes sense. Secondly, if I haven&#8217;t yet chosen which library to use or if I want to switch back and forth between libraries, then a wrapper will be required. If neither of these applies, then I begin with a strong presumption that I should use the library on its own, and only a real need to add logging, monitoring, or other wrapped behavior will persuade me otherwise.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/when-to-wrap-a-library/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


