<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dragons in the Algorithm</title>
	<atom:link href="http://mcherm.com/feed" rel="self" type="application/rss+xml" />
	<link>http://mcherm.com</link>
	<description>Adventures in Programming</description>
	<lastBuildDate>Thu, 18 Apr 2013 10:34:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>I dream of Satoshi Nakamoto</title>
		<link>http://mcherm.com/permalinks/1/i-dream-of-satoshi-nakamoto</link>
		<comments>http://mcherm.com/permalinks/1/i-dream-of-satoshi-nakamoto#comments</comments>
		<pubDate>Thu, 18 Apr 2013 10:34:53 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=679</guid>
		<description><![CDATA[&#8220;Satoshi Nakamoto&#8221; is the alias of the anonymous person who invented and published the protocol for Bitcoin. So far, no one knows for sure who it is, although attempts have been made to unmask the person (or people) by an analysis of their writing style and similar indicators. Now, in a blogpost, Sergio Demian Lerner [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://mcherm.com/blog/wp-content/uploads/2013/04/bitcoin_license_plate.jpg"><img class="alignright size-full wp-image-680" alt="bitcoin_license_plate" src="http://mcherm.com/blog/wp-content/uploads/2013/04/bitcoin_license_plate.jpg" width="240" height="180" /></a>&#8220;Satoshi Nakamoto&#8221; is the alias of the anonymous person who invented and published the protocol for Bitcoin. So far, no one knows for sure who it is, although attempts have been made to unmask the person (or people) by an analysis of their writing style and similar indicators. Now, in <a href="https://bitslog.wordpress.com/2013/04/17/the-well-deserved-fortune-of-satoshi-nakamoto/">a blogpost</a>, Sergio Demian Lerner has found a way to recognize coins mined by the same computer and has picked out the distinctive pattern of a certain individual who began mining almost from block one and continued mining at a consistent rate with regular restarts for a long time, without spending any of those coins.</p>
<p>This, he says, is Satoshi, and I applaud Sergio for this clever way to recognize an individual miner. Like Sergio, I am pleased that Satoshi&#8217;s fortune in Bitcoins is now apparently worth around $100 million USD. But Sergio also suggests that he expects this will lead to the unmasking of Satoshi once others track this to a Bitcoin somewhere which HAS been spent. (Bitcoin has many advantages, but it is NOT fully anonymous: in fact,  anyone can track a payment back to see which (anonymous) account it came from previously.)</p>
<p>I hope he is wrong about the unmasking. I prefer to imagine that Satoshi Nakamoto is living and working a normal job, still haunting cryptography boards in the evenings and on weekends, and occasionally checking the news to see how that Bitcoin thing is progressing. I imagine that someday, many years from now, when she dies her husband will open that envelope she left in the safe-deposit-box and it will contain a hard drive and stack of papers labeled &#8220;Now that I am gone, please publish this for the world to read.&#8221;</p>
<p>Okay, it&#8217;s just a romantic dream, but I&#8217;m hanging onto it as long as I can.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/i-dream-of-satoshi-nakamoto/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How NOT to do technical recruiting: Sunil Kumar of Panzer Solutions</title>
		<link>http://mcherm.com/permalinks/1/how-not-to-do-technical-recruiting-sunil-kumar-of-panzer-solutions</link>
		<comments>http://mcherm.com/permalinks/1/how-not-to-do-technical-recruiting-sunil-kumar-of-panzer-solutions#comments</comments>
		<pubDate>Fri, 01 Feb 2013 03:25:28 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Software Development]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=674</guid>
		<description><![CDATA[So, &#8220;Sunil Kumar&#8221; of Panzer Solutions wrote to me ten days ago offering a position. Normally, I appreciate hearing from recruiters. As it happens, I have no interest in a new job; I am happy with my current position and have had plenty of new challenges there recently. But it is nice to hear the signs [...]]]></description>
				<content:encoded><![CDATA[<p>So, &#8220;Sunil Kumar&#8221; of Panzer Solutions wrote to me ten days ago offering a position. Normally, I appreciate hearing from recruiters. As it happens, I have no interest in a new job; I am happy with my current position and have had plenty of new challenges there recently. But it is nice to hear the signs that my industry is doing well, and keeping up contacts with recruiters in my area is a good idea.</p>
<p>But Mr. Kumar didn&#8217;t write me about a position commensurate with my specific skills; he wrote to tell me &#8220;We have more than 100 W2 working currently with successful hit.&#8221; (That&#8217;s not quite English, but it&#8217;s fairly close.) There are recruiters who work hard to match up a particular applicant with a position where their skills and their career/environment preferences are a good fit. When I am doing the hiring (and just to note, Capital One <em>is</em> hiring right now in the Wilmington area), I love working with these recruiters: they bring me just 3 resumes and I end up wanting to bring in all 3 for further interviews. That&#8217;s a much more pleasant experience than digging through a stack of resumes from applicants most of whom can&#8217;t pass <a title="FizzBuzz" href="http://imranontech.com/2007/01/24/using-fizzbuzz-to-find-developers-who-grok-coding/">the FizzBuzz test</a>.</p>
<p>Mr. Kumar is in a different category altogether: he clearly thinks recruiting is a numbers game: if he just sends enough applicant names to enough open positions then he&#8217;ll be successful. He won&#8217;t be, because he&#8217;s not adding value. So I politely wrote back to Mr. Kumar explaining this and asking that he not send me &#8220;blind mailing&#8221; style job offers. A week later I have received TWO other emails from Mr. Kumar stating that &#8220;Panzer Solutions is looking to hire 10-20 New H1b&#8217;s and OPT EAD&#8217;s in coming one month.&#8221; (Still, not quite English.) Besides being a violation of federal employment law (I&#8217;m not a lawyer, but I was under the impression that companies were not permitted to favor H1B holders over citizens), this is no better than spam, either for the recipient (me) or the employer to whom the names are offered.</p>
<p>So I am Naming and Shaming Mr. Sunil Kumar of Panzer Solutions, and I will never do business with him or his company. Here&#8217;s hoping this article jumps to the top of the search rankings for those names so that others will recognize their uselessness sooner and Panzer and Mr. Kumar can quickly go out of business and leave space for better recruiters who actually make the hiring process easier, not harder.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/how-not-to-do-technical-recruiting-sunil-kumar-of-panzer-solutions/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Things my next phone should have</title>
		<link>http://mcherm.com/permalinks/1/things-my-next-phone-should-have</link>
		<comments>http://mcherm.com/permalinks/1/things-my-next-phone-should-have#comments</comments>
		<pubDate>Fri, 21 Sep 2012 17:23:21 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=667</guid>
		<description><![CDATA[After looking at the iPhone 5, I see people saying that phones already have everything they need&#8230; nothing new will happen. That&#8217;s completely absurd. There are tons of little things, for instance I want a browser that can do everything the PC browsers can do. But there are also HUGE changes needed. Here are some [...]]]></description>
				<content:encoded><![CDATA[<p>After looking at the iPhone 5, I see people <a href="http://thecodist.com/article/i_39_m_sorry_but_the_revolution_in_smartphones_is_long_over">saying</a> that phones already have everything they need&#8230; nothing new will happen. That&#8217;s completely absurd. There are tons of little things, for instance I want a browser that can do everything the PC browsers can do. But there are also HUGE changes needed. Here are some things that I want for my phone:</p>
<ul>
<li><strong>Talk to it.</strong> Today I *almost* have this: my Jellybean-based phone can perform near real-time voice recognition (without a network connection). But the error rate is still high enough that after adding in the time to go back and correct errors the whole process takes longer than typing it in on the device&#8217;s keyboard. But not much longer&#8230; I expect this one very soon.</li>
</ul>
<ul>
<li><strong>Context aware.</strong> My phone should know when it&#8217;s OK to ring, and when it isn&#8217;t (if I&#8217;m in a meeting or a movie). While I&#8217;m driving, it shouldn&#8217;t send me texts. When I start asking for directions it should guess (with a degree of accuracy) where I might want to go to.</li>
</ul>
<ul>
<li> <strong>Expandable screen.</strong> I want an iPad sized screen, but I want it to fit in my pocket. The only way to do that is to have an expandable or pull-out screen of some sort, or perhaps a projector.</li>
</ul>
<ul>
<li> <strong>Keyboard.</strong> Something real that I can type on &#8212; keyboards are SO amazingly effective. But I don&#8217;t like carrying around a bluetooth keyboard (they&#8217;re either too small to type on or too big to carry comfortably).</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/things-my-next-phone-should-have/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Story Points Aren&#8217;t Accurate &#8211; That&#8217;s Why They&#8217;re Good</title>
		<link>http://mcherm.com/permalinks/1/story-points-arent-accurate-thats-why-theyre-good</link>
		<comments>http://mcherm.com/permalinks/1/story-points-arent-accurate-thats-why-theyre-good#comments</comments>
		<pubDate>Thu, 30 Aug 2012 13:53:20 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=663</guid>
		<description><![CDATA[Ben Northrop wrote to complain that story points are not accurate. They don&#8217;t (always) map linearly to hours spent, so adding up story points over a large project won&#8217;t accurately give hours for the project. In the spirit of expressing controversial opinions, I will agree, and explain why I think that&#8217;s a good thing. I [...]]]></description>
				<content:encoded><![CDATA[<p>Ben Northrop <a href="http://www.bennorthrop.com/Essays/2012/velocity-and-story-points-they-dont-add-up.php">wrote</a> to complain that story points are not accurate. They don&#8217;t (always) map linearly to hours spent, so adding up story points over a large project won&#8217;t accurately give hours for the project. In the spirit of <a href="http://prog21.dadgum.com/149.html">expressing controversial opinions</a>, I will agree, and explain why I think that&#8217;s a good thing.</p>
<p>I believe that story points serve as a &#8220;rough&#8221; estimate. In the teams I work with, story point estimates are made quickly (a few minutes to be sure we understand the story, then quickly discuss and reach a consensus estimate). They are quantized (must round off to some Fibonacci number) which means that any given estimate is necessarily imperfect.</p>
<p>As such, they provide a cheap (didn&#8217;t take long to generate) but rough (not perfectly accurate) estimate, and they have to be respected as such. Story point estimates would not be useful to answer questions like &#8220;Will this project deliver in October or November?&#8221;, but they ARE useful for questions like &#8220;Would this be a 3-month project or a 1 year project?&#8221; For some purposes, a more precise estimate is needed, and then it may be necessary to invest a few hours to a few weeks to perform detailed work to generate a more precise estimate. However, I think that such situations are rare: people *want* perfect estimates ahead of time but rarely *need* them. Also I think that people are usually fooling themselves: most (usually waterfall) projects with precise up-front estimates later discover that those estimates are not accurate.</p>
<p>One of the strengths of story points is that everyone (including the customer) REALIZES that they are rough and don&#8217;t correspond to a precise delivery date &#8212; something that can be difficult to explain for estimates expressed in hours.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/story-points-arent-accurate-thats-why-theyre-good/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 4</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-4</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-4#comments</comments>
		<pubDate>Tue, 13 Mar 2012 02:10:45 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=634</guid>
		<description><![CDATA[Suppose you wanted to build a tool for anonymously capturing the websites that a user visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was [...]]]></description>
				<content:encoded><![CDATA[<p><a href="../../permalinks/1/constant-crawl-design-part-1">Suppose you wanted to build</a> a tool for anonymously capturing the websites that a user  visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was &#8220;public&#8221; and to do so without keeping any record (not even on the user&#8217;s own machine) of the sites visited or even tying together the different sites by one ID (even an anonymous one).<span id="more-634"></span>In this essay I will propose a possible solution to this.</p>
<p>The first issue is to decide your definition of a &#8220;public&#8221; site. Simple solutions like &#8220;anything over SSL is private&#8221; fail in both directions: some content which is not private gets served up over SSL (generally a good thing), and some content which <em>should</em> be private gets served up over unsecured connections (like Facebook pages). I propose the following rule instead: any page which is seen with <em>exactly</em> the same content by at least N different individuals will be considered public. (The value of N can be debated, but something like 6 or 9 could be a starting place.) This means that any pages that personalize to each user (even just a &#8220;Hello Janet Smith&#8221; at the top) will not get shared &#8212; that may well be exactly the behavior that is desired. And it means that any page that IS shared must at least be known to a reasonably large group (so it&#8217;s not THAT secret). This definition works reasonably well for things like Google Docs (stays private unless it is widely shared) but works poorly for things like a corporate intranet (may get shared unintentionally if too many people use the tool). Perhaps a configuration in the tool could allow certain domains to be excluded from sharing; even more likely is that no company with serious security concerns will allow a tool like this to be run on their intranet anyway.</p>
<p>If we accept this definition for when pages should be shared, what we have left is just a technical problem (albeit a difficult one): how can we determine whether the exact content of a certain page has been seen by at least N other users <em>without</em> keeping a record anywhere that it has been seen by any individual user. Storing information is not difficult; we have already accepted that this project will use a P2P storage network and use some local storage so we can store information locally or inject it into a P2P network (perhaps a different network than the one used for document storage). But some obvious approaches won&#8217;t work. We can&#8217;t store a record of what has been seen on the local disk because that would record the entire browser history. We can&#8217;t store it in the P2P network in a form that can be retrieved because then we have <em>already</em> made the page public. And we can&#8217;t store it tied to a particular user-id (even an anonymous one) because tying together different pages visited is <a href="https://www.nytimes.com/2006/08/09/technology/09aol.html?_r=1&amp;ei=5070&amp;en=6c5dfa2a9c1be4ec&amp;ex=1155787200&amp;emc=eta1&amp;pagewanted=all">an effective way</a> to identify an individual.</p>
<p>I will motivate the final solution by showing a series of partial solutions, each of which gets closer to solving the problem. As a first pass, we could use a P2P network which manages a large distributed hash table in which we can anonymously store values (multiple values for a given key) and which allows us to anonymously query the values. When a user viewed a page, they would use its URL as a key, and store a hash of the page as one value for that key, along with a counter. If the counter had already reached N then the page was public and the program could go ahead and store the value in the storage network. The contents of non-public pages are not stored (locally <em>or</em> in the network), only their hash value. Old values would be kept in the P2P network for a good long time &#8212; a month or so would be good, to be sure of capturing the &#8220;long tail&#8221; of rarely viewed content, but would eventually be cleared out to make space for newer content.</p>
<p>This fails because there may be malicious participants in the P2P network. As soon as a malicious participant saw a value being inserted with a count of 1 they could immediately insert it again N-1 times. Or perhaps they could even lie and return a count of N even though the count was actually 0. In either case, users would be tricked into storing content that had not actually been seen by N distinct users.</p>
<p>To protect against the malicious participants we need to clarify our threat model. At this stage (where we determine what content to make public), a malicious participant is one who is trying to force the user to reveal content that shouldn&#8217;t be public. Since public is defined as &#8220;seen by more than N individuals&#8221;, a malicious participant who already knew the content could just publicize it directly (or push it into the storage network), so it is malicious participants who do <em>not</em> know the content of the page that we are concerned with. That leads immediately to a solution: instead of trusting the count, we can store something that only a user who had actually seen the page could generate: a &#8220;proof&#8221; that the page has been seen.</p>
<p>Simply create N fixed blocks of text. (It could be as simple as &#8220;This is block &lt;n&gt;&#8221; &#8212; the content doesn&#8217;t really matter.) To obtain proof number n, concatenate the page text with block n and then take the hash of the result. A user will query the P2P network for all hash values stored for a certain URL. Some of them may match the hash codes for proofs 1 through N. If all of those hashes are present, then N other participants have seen this page and it should be pushed into the storage network; if not then the lowest-numbered missing value should be added. A malicious participant, not having the content, cannot generate any of these values, and they cannot determine anything from the hash values themselves (since a hash cannot be inverted).</p>
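<p>The proof scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified hash, N = 6 is one of the values suggested earlier, a plain set stands in for the values returned by the P2P network, and the names <code>block</code>, <code>proof</code> and <code>next_action</code> are hypothetical.</p>

```python
import hashlib

N = 6  # assumed threshold; the text suggests "something like 6 or 9"


def block(n):
    """Fixed, publicly known text for proof number n (the content is arbitrary)."""
    return "This is block {}".format(n).encode("utf-8")


def proof(page_bytes, n):
    """Proof number n: hash of the page content concatenated with block n.

    SHA-256 is an assumption; the text only says "the hash".
    """
    return hashlib.sha256(page_bytes + block(n)).hexdigest()


def next_action(page_bytes, stored_hashes):
    """Decide what to do given the hashes the network returned for this URL.

    Returns ("publish", None) when all N proofs are present, otherwise
    ("add", value) where value is the lowest-numbered missing proof.
    """
    for n in range(1, N + 1):
        p = proof(page_bytes, n)
        if p not in stored_hashes:
            return ("add", p)
    return ("publish", None)
```

<p>Calling <code>next_action</code> repeatedly for the same page, adding the returned proof each time, yields "add" N times and then "publish". Hashes that do not match any of the N proofs for the actual page content are simply ignored, which is the property that keeps out participants who have not seen the page.</p>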
<p>There is, unfortunately, one problem remaining. If a single user views the same content N times, they might push all N proofs into the network all by themselves. The values cannot be tagged with an ID of the individual who found it because that would allow multiple page views to be tied together, leaking identity information. And they cannot keep a record of which sites they have visited as that too would  reveal information. The solution I propose solves this problem but replaces the certainty that there were N viewers with a probabilistic solution.</p>
<p>When the application is first installed it will select a random &#8220;user id&#8221;. This is just a string which is used to identify the installation so we can be sure not to double-count it. There needs to be a way for the user to read the &#8220;user id&#8221; and to modify it. That way someone who frequently uses several different computers can set the user id to the same value on all of them and will not double-count views just by moving between home, work, and mobile locations. Of course, it is important that we do not link this user id to the view history.</p>
<p>Now, instead of creating N different blocks to concatenate with the page contents we will create K different blocks (where K is somewhat less than N). For instance, K=8 is a reasonable value; certainly we require that K is <em>much</em> smaller than the number of users of the crawling system. As before, query to see which of the K proofs are present in the P2P system. The system takes a hash of the user id, mods by K to select a random (but consistent) value, k (where k is in [0..K-1]) essentially dividing all users up into K groups. If the proof for k is present, do nothing: it is possible that this proof was added by this same user. If the proof for k is missing, then add it. And if the total number of proofs is more than some threshold n, then enough people have seen the page and it can be published into the storage network.</p>
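<p>A minimal sketch of this bucketed variant follows. The assumptions are mine: SHA-256 is used both for the proofs and for mapping a user id to a bucket, K = 8 as in the example above, the threshold is an arbitrary 4 and is treated as met once that many distinct proofs are present, and a mutable set stands in for the P2P store. The function names are hypothetical.</p>

```python
import hashlib

K = 8          # number of buckets, as in the example in the text
THRESHOLD = 4  # hypothetical value for the threshold n


def bucket_for(user_id):
    """Map a user id to a consistent bucket k in [0..K-1]."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % K


def proof(page_bytes, k):
    """Proof for bucket k: hash of the page content plus a fixed block of text."""
    fixed_block = "This is block {}".format(k).encode("utf-8")
    return hashlib.sha256(page_bytes + fixed_block).hexdigest()


def record_view(user_id, page_bytes, stored_hashes):
    """Add this user's proof if it is missing; return True if the page may be published.

    stored_hashes is mutated in place and stands in for the P2P store.
    """
    k = bucket_for(user_id)
    stored_hashes.add(proof(page_bytes, k))  # no effect if already present
    present = sum(1 for j in range(K) if proof(page_bytes, j) in stored_hashes)
    return present >= THRESHOLD
```

<p>Because a given user id always lands in the same bucket, one user reloading a page any number of times contributes at most one proof. The price is that two different users who share a bucket are also counted only once, which is why the count becomes an estimate.</p>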
<p>What should n be? Well, malicious participants cannot affect anything since they don&#8217;t have the page contents and can&#8217;t generate the proofs. Each actual viewer of the page has what is essentially a randomly chosen value for k. So as people view the page, they are selecting a random value of k (from 0 to K-1). This continues until n distinct values have been chosen (some may have been chosen multiple times). We observe the count of distinct values seen so far (call it m) and want to estimate the number of actual page viewers.</p>
<p>I&#8217;ll skip over the math and jump right to the answer: the expected number of actual viewers <a href="http://math.stackexchange.com/questions/114544/coupon-problem-generalized-or-birthday-problem-backward">is</a>:</p>
<p><a href="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_1.png"><img class="aligncenter size-full wp-image-640" title="K\left( H_{K}-H_{K-m}\right) " src="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_1.png" alt="K\left( H_{K}-H_{K-m}\right) " width="154" height="23" /></a>where Hn is the n&#8217;th harmonic number:</p>
<p style="text-align: center;"><a href="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_2.png"><img class="aligncenter size-full wp-image-641" title="H_{n}=\sum _{i=1}^{n}\dfrac {1} {i}" src="http://mcherm.com/blog/wp-content/uploads/2012/03/formula_2.png" alt="H_{n}=\sum _{i=1}^{n}\dfrac {1} {i}" width="104" height="60" /></a></p>
<p>So if we use 8 for K and we see 4 entries, that means there <em>must</em> have been at least 4 users seeing the same content and there were probably about 6. If we see 5 entries then there <em>must</em> have been 5 different individuals seeing the same content and there were probably 306/35 or about 8.7. The values required can be adjusted to reach the desired value for expected number of individuals seeing the page, while keeping K small (if it is large enough that the number of participants in one bucket is small then it can potentially be used to help determine your browser history).</p>
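<p>For anyone who wants to try other values of K and m, here is a small sketch that evaluates the harmonic-number formula exactly using Python's <code>fractions</code> module. The function names are hypothetical. One caveat worth hedging: read literally, K(H<sub>K</sub> - H<sub>K-m</sub>) is the expected number of views needed for m distinct entries to first appear, which comes to roughly 5.1 for m = 4 and 7.1 for m = 5. The figures quoted above (about 6, and 306/35) appear instead to match the expected number of views just before an (m+1)-th entry would show up, so both quantities are computed.</p>

```python
from fractions import Fraction


def harmonic(n):
    """H_n = 1/1 + 1/2 + ... + 1/n as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def views_to_reach(K, m):
    """Expected views until m distinct buckets first appear: K * (H_K - H_{K-m})."""
    return K * (harmonic(K) - harmonic(K - m))


def views_while_at(K, m):
    """Expected views just before bucket number m+1 first appears (needs m below K)."""
    return views_to_reach(K, m + 1) - 1
```

<p>With K = 8, <code>views_while_at(8, 5)</code> evaluates to exactly 306/35, and <code>views_while_at(8, 4)</code> to 638/105, a little over 6.</p>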
<p>Well, that&#8217;s enough essays on this topic for now. If you want to learn more, you may want to check out <a href="https://github.com/titanous/constantcrawl">https://github.com/titanous/constantcrawl</a> where there may eventually be an effort to build this.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-4/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 3</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-3</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-3#comments</comments>
		<pubDate>Sat, 10 Mar 2012 12:01:52 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=631</guid>
		<description><![CDATA[Suppose you were building a tool integrated with web browsers to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network? There are many different types of P2P networks for storing an [...]]]></description>
				<content:encoded><![CDATA[<p><a href="../../permalinks/1/constant-crawl-design-part-1">Suppose you were building</a> a tool <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-2">integrated with web browsers</a> to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network?<span id="more-631"></span></p>
<p>There are <a href="https://en.wikipedia.org/wiki/Peer-to-peer#Applications">many different types</a> of P2P networks for storing and retrieving files. Different networks and protocols have been designed for different purposes and thus have different strengths and weaknesses. For instance, <a href="https://freenetproject.org/">Freenet</a> is designed to store and retrieve files with a focus on extremely strong anonymity guarantees and resistance to files being deleted, but with weaknesses such as being particularly slow. The <a href="https://en.wikipedia.org/wiki/Gnutella">Gnutella</a> network is decentralized and efficient but the usage is not anonymized. Rather than fully specifying the design of a storage network, this essay will just attempt to list the requirements. Re-using an existing protocol (or even an existing network) might be the best approach; next best would be to design one by combining well-tested components from existing successful networks.</p>
<p>The P2P storage network would need to support the following functions:</p>
<p>1. Storing values by key with some metadata. The key would be fixed and known (the URL of the content). The metadata would be small &#8212; things like the date the content was viewed and the hash of the content. The value would be the content of the URL and would <em>not</em> be small &#8212; it would be the size of a web page, image, PDF, or whatever was being stored. (Content larger than a certain size could be rejected.)</p>
<p>2. Storing <em>multiple</em> values by the same key. A page may change or may be viewed differently by different users. The storage network needs to support storing these multiple values.</p>
<p>3. Querying the metadata of a certain key. In particular, we need to query to find out whether there already exists an entry for a given URL (key) with a particular value for the hash of the content. To prevent bad actors from responding &#8220;yes&#8221; when it is really absent we may need to return the value. We would expect this query to be by FAR the most common query performed &#8212; hundreds of times more frequent than any other function the storage network would perform.</p>
<p>4. Retrieving the value (and metadata) for a given key. If multiple values were stored we would either return all of them or select one by metadata.</p>
<p>5. Querying to find out what keys had new values stored within some recent time period. This would not be used by normal users, but by organizations like the Internet Archive or Duck Duck Go that actually wanted to download the contents of this storage, thus making it into a means for crawling the entire web. It would be acceptable if the only way to perform this query were to contribute significant resources to the P2P network and also if the query were likely to be accurate instead of certain to be accurate.</p>
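<p>To make the five functions concrete, here is a toy, single-process stand-in for the storage network. It is only a sketch of the interface: it offers no anonymity, distribution, or replication, the class and method names are hypothetical, and function 5 is approximated using the "date viewed" metadata as the time a value was stored.</p>

```python
import time
from collections import defaultdict


class InMemoryStore:
    """Toy stand-in for the storage network (no anonymity, no replication)."""

    def __init__(self):
        # key (URL) maps to a list of (metadata, value) pairs
        self._entries = defaultdict(list)

    def store(self, key, value, content_hash, viewed_at=None):
        """Functions 1 and 2: store a value under a key; a key may hold several values."""
        when = viewed_at if viewed_at is not None else time.time()
        meta = {"hash": content_hash, "viewed_at": when}
        self._entries[key].append((meta, value))

    def has_hash(self, key, content_hash):
        """Function 3: is there an entry for this key with this content hash?"""
        return any(meta["hash"] == content_hash
                   for meta, _ in self._entries.get(key, []))

    def retrieve(self, key):
        """Function 4: return every (metadata, value) pair stored for the key."""
        return list(self._entries.get(key, []))

    def keys_updated_since(self, timestamp):
        """Function 5: keys with at least one value viewed at or after the timestamp."""
        return sorted(key for key, pairs in self._entries.items()
                      if any(meta["viewed_at"] >= timestamp for meta, _ in pairs))
```

<p>In a real network <code>has_hash</code> would be the hot path, since it is expected to be called far more often than anything else, so it is the call most worth optimizing.</p>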
<p>Those are the only functions that would be required, but there are some broad characteristics of the P2P network that are also essential:</p>
<p>A. Anonymity: it should be quite difficult to determine who is performing any of calls 1 through 4. (It is OK if call 5 is not anonymous.) Otherwise, the network could be used to determine a user&#8217;s browsing history. This could be accomplished via onion routing, the elaborate approach used by Freenet, or any other solution that makes it impossible for any participant in the network to tell for sure what other members of the network are searching for.</p>
<p>B. Legal Protection for Stored Content. Legal liability for storing content on a P2P network can be a genuine problem. But in essentially all jurisdictions this can be avoided if the individuals participating in the network are not able to determine just what content they have. For instance, some systems break each file up into chunks which are encrypted so they cannot be understood without possessing all chunks, then store different chunks in different locations. No one person has any file on their system.</p>
<p>C. Reliability. The P2P system must retain its data even if peers join and drop out with some regularity. This means keeping redundant copies of everything. It must also function if some minority of the members of the P2P network are antagonists who will abuse the protocol in an attempt to harm the network. These are standard features on most P2P networks.</p>
<p>D. Purging of old values. The size of the P2P storage network cannot simply grow forever. So values need to be evicted eventually. Evicting values that are older than a certain age should work fairly well &#8212; if the content is still being viewed frequently it will be re-loaded rapidly and by keeping for a certain amount of time we would provide an opportunity to download the content for those using this to crawl the web. Also acceptable, but perhaps less ideal, would be automatically evicting the least frequently requested values.</p>
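<p>Age-based eviction, the first option in item D, is simple to sketch. The function below is a hypothetical illustration operating on a plain dictionary rather than a real P2P store; the name <code>purge_older_than</code> and the data layout are assumptions.</p>

```python
def purge_older_than(entries, max_age_seconds, now):
    """Return a new mapping with values older than max_age_seconds dropped.

    entries maps key to a list of (stored_at, value) pairs; keys left with
    no values are removed entirely.
    """
    cutoff = now - max_age_seconds
    kept = {}
    for key, values in entries.items():
        fresh = [(t, v) for (t, v) in values if t >= cutoff]
        if fresh:
            kept[key] = fresh
    return kept
```

<p>Content that is still being viewed would be stored again with a newer timestamp, so it survives the purge without any explicit access counting.</p>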
<p>That&#8217;s about it for requirements for the storage network, except that given these constraints we would like for it to be as fast as possible, especially for frequently requested values. This implies that the number of network hops should be minimized (despite this, anonymity will require a fairly large number) and may favor solutions that store frequently-requested information in multiple locations.</p>
<p>The <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">next (and final) installment</a> of this series will discuss the interesting (and challenging) question of how to tell when a viewed page can be made public.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-3/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 2</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-2</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-2#comments</comments>
		<pubDate>Tue, 06 Mar 2012 03:18:54 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=624</guid>
		<description><![CDATA[Suppose you were building a tool for anonymously capturing the (public) websites that a user visited. What would the UI requirements be? The basic experience would be a perfectly normal browsing experience: users would launch their favorite web browser normally, would browse around the web normally and everything would &#8220;just work&#8221;. This means that clearly [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-1">Suppose you were building</a> a tool for anonymously capturing the (public) websites that a user visited. What would the UI requirements be?<span id="more-624"></span></p>
<p>The basic experience would be a perfectly normal browsing experience:  users would launch their favorite web browser normally, would browse  around the web normally and everything would &#8220;just work&#8221;. This means  that clearly the system would function as a browser plug-in.  Fortunately, nearly all modern browsers <a href="https://developer.mozilla.org/en/Extensions">support</a> <a href="https://code.google.com/chrome/extensions/docs.html">some</a> <a href="https://developer.apple.com/programs/safari/">form</a> of plug-ins. In principle, one could also develop this as a proxy, but  it would be much more difficult to develop an effective UI.</p>
<p>An easy-to-use installation process is important if one is seeking a large user base. This means using the standard installation mechanism for the platform (an installer for Windows, but RPM, yum, etc. for Linux). It means that the installer sets up the browser plugin, allocates the disk space needed for storage, and creates the services needed to join the P2P network.</p>
<p>The plugin itself offers two basic pieces of functionality. One would be that it captures the content of the web as it is viewed, and (where appropriate) archives it for the crawl. The other is the benefit for the user of the plugin: it allows them to view content from the crawled archive when the normal site is slow or unavailable. (For instance, the &#8220;slashdot effect&#8221; where a smaller site is featured on a popular news site like Slashdot or Reddit and becomes overwhelmed.)</p>
<p>The plugin should have three basic &#8220;modes&#8221;. Any well-behaved plugin should provide an easy way to turn it off, so one mode is &#8220;disabled&#8221;. Another mode would be for loading all content (or perhaps just re-loading the current page) from the archive. And of course there would be the normal mode (more on this in a moment). The mode affects the page rather drastically (it changes where we are getting it from or whether we are potentially sharing it with the world) so the plugin should probably provide an indicator of some sort in or near the URL bar, and this indicator might as well provide the means for switching the mode as well.</p>
<p>What behavior would we want in &#8220;normal mode&#8221;? Pages that get viewed are eligible for sharing, but <em>only</em> if that page is determined to be a &#8220;public&#8221; page (see other essays for details on this). So the plugin would need to capture the content of the page and immediately after rendering it (perhaps in a separate thread) begin to process it for possible sharing. I&#8217;ve used the term &#8220;page&#8221;, but essentially all content should be treated this way, including images, CSS and JavaScript files, even AJAX calls: any content downloaded by the browser.</p>
<p>The next question is when content should be downloaded from the archive. Unlike Google Web Accelerator, I think it is unlikely that this design involving anonymous P2P technology will ever be <em>faster </em>than ordinary browsing to a normally functioning web site. But it can be available in those cases where the ordinary site no longer is, where the HTTP request times out, or a 404 (page not found), 410 (page gone), 503 (server overloaded) or some other error is returned. The simplest solution would be to attempt to load the page from the storage network whenever these conditions occur. Always trying to load every page from both web and storage (and displaying the one that arrives first) would put far too much load on the storage network for pages that would never be viewed.</p>
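<p>The fallback rule just described is small enough to sketch in Python. The function name and the exact set of status codes are assumptions; the paragraph above names 404, 410 and 503 and leaves &#8220;some other error&#8221; open, so the set would probably grow in practice.</p>

```python
# Status codes named in the text as reasons to try the archive.
FALLBACK_STATUSES = {404, 410, 503}


def should_use_archive(status, timed_out=False):
    """Decide whether to fetch a page from the storage network.

    status is the HTTP status code of the normal request, or None if
    no response arrived at all.
    """
    if timed_out or status is None:
        return True
    return status in FALLBACK_STATUSES
```

<p>A successful response never touches the storage network, which keeps the load limited to pages that the ordinary web could not deliver.</p>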
<p>It is worth noting that this storage network can store <em>multiple</em> versions of each page. This could be because the page has changed over time, because it is served up differently to different classes of user (perhaps by geographic region), or for stranger reasons like a malicious user injecting false versions of a page into the network. This property might lead to some very interesting and powerful uses (&#8220;See old versions of any page!&#8221;) and it might pose new technical challenges (&#8220;Allow trusted entities like the EFF to somehow flag which version of the page is &#8216;real&#8217;&#8221;). Such questions are an excellent subject for future consideration, but the <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-3">next essay</a> will address technical challenges with the storage and the <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">final part</a> will show how to anonymously determine whether to share a page.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constant Crawl Design &#8211; Part 1</title>
		<link>http://mcherm.com/permalinks/1/constant-crawl-design-part-1</link>
		<comments>http://mcherm.com/permalinks/1/constant-crawl-design-part-1#comments</comments>
		<pubDate>Mon, 05 Mar 2012 02:42:08 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=620</guid>
		<description><![CDATA[Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google&#8217;s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more [...]]]></description>
				<content:encoded><![CDATA[<p>Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google&#8217;s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more reliably; the advantage to Google was that they got to crawl the web &#8220;as the user sees it&#8221; instead of just what Googlebot gets&#8230; and that they got to see <em>every single page</em> you viewed, thus feeding even more into the giant maw of information that is Google.</p>
<p>Well, Google eventually dropped Google Web Accelerator (I wonder why?), but the idea is interesting. Suppose you wanted to build a similar tool that would capture the web viewing experience of thousands of users (or more). For users it could provide a reliable source for sites that go down or that get hit with the &#8220;slashdot&#8221; effect. For the Internet Archive or a smaller search engine like Duck Duck Go, it would provide a means of performing a massive web crawl. For someone like the EFF or human-rights groups it would provide a way to monitor whether some users (such as those in China) are being &#8220;secretly&#8221; served different content. But unlike Google Web Accelerator, a community-driven project would have to solve one very hard problem: how to do this while keeping the user&#8217;s browsing history secret, the exact <em>opposite</em> of what Google&#8217;s project did.<span id="more-620"></span></p>
<p>This topic came up at a meeting of the Philly Startup Hackers group, and after an entire evening of vigorous discussion, we think that such a project would be technically feasible. In this series of essays I will attempt to outline the technical architecture of this solution. This first one will explain the major components and how they fit together.</p>
<p>Broadly, I&#8217;ll describe three different problems to be solved and we&#8217;ll assume that the solution to each one is a layer in the architecture. Problem (A) is the user interface, problem (B) is deciding what information is public (surprisingly, this turns out to be the most difficult part), and problem (C) is storing the pages.</p>
<p>The solution to problem (A) (user interface) is quite straightforward. That is not to minimize it: implementing the user interface well is by far the most work of the whole project and the piece most likely to contribute to the success or failure of it. But the approach to take is clear. This should integrate into the customer&#8217;s browser, and with modern browsers that means implementing it as a browser plug-in. Also essential to the user experience is the installation experience: to be successful, this needs to be extremely easy to install and very simple to configure (preferably with <em>no</em> configuration required for use).</p>
<p>Problem B is to decide what pages should be public, and which are private to the user. The UI can help here if users can easily flip into modes where everything is captured or where nothing is. But one cannot expect the user to click something before (or after) every page &#8212; there also needs to be a &#8220;normal&#8221; browsing mode. We&#8217;d like to (anonymously) record the majority of pages visited (after all, that&#8217;s the point of the tool), but a page showing your bank account probably shouldn&#8217;t be shown, nor should a Google Docs essay you&#8217;ve been writing. Assuming that everything viewed with HTTPS is private and everything viewed with HTTP is public seems much too simple a rule, particularly as privacy sensitive sites are beginning to default to HTTPS for all users.</p>
<p>So the approach I am proposing is that we assume that if several people see the <em>exact </em>same page then it must be a public one. My bank&#8217;s logged-in view of my accounts won&#8217;t be seen by <em>any</em> other users, while the Google Doc essay I share with a friend will only be seen by a couple of people. If we set the threshold to something like 6 or 9 users then we can be fairly confident that the content was public. To capture rarely-seen sites we&#8217;d want the count to last for some time: 6-9 users within the same month, perhaps. Now the technical challenge is to figure out how to tell whether several people have seen the content <em>without</em> revealing it (since it&#8217;s private) and without leaving any trace that <em>we</em> have viewed it (for privacy reasons).</p>
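<p>The threshold rule itself (as opposed to the much harder problem of applying it anonymously) can be sketched in a few lines of Python. This sketch deliberately ignores the privacy requirements described above: it keeps viewer identifiers in one place, which the real design must not do. The class name, the use of SHA-256, and the <code>viewer_id</code> parameter are all assumptions made for illustration.</p>

```python
import hashlib

THRESHOLD = 6                      # "something like 6 or 9 users"
WINDOW_SECONDS = 30 * 24 * 3600    # roughly one month


class SightingCounter:
    """Counts distinct viewers of byte-identical content. NOT anonymous."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.sightings = {}   # content digest -> {viewer_id: last_seen}

    def record(self, content, viewer_id, now):
        """Record one viewing; return True once the content looks public."""
        # Only a digest is kept, so the content itself is never stored here.
        digest = hashlib.sha256(content).hexdigest()
        viewers = self.sightings.setdefault(digest, {})
        viewers[viewer_id] = now
        # Count distinct viewers whose latest sighting is inside the window.
        recent = [v for v, seen in viewers.items()
                  if self.window >= now - seen]
        return len(recent) >= self.threshold
```

<p>Because the digest covers the exact bytes of the page, a logged-in bank page that differs per user never accumulates more than one viewer, while a genuinely public page crosses the threshold quickly.</p>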
<p>Problem C is storing the content. Spotify is a popular music player which has been installed by <a href="http://mashable.com/2011/09/21/spotify-2-million-subscribers/">millions</a> of users. Yet they don&#8217;t need huge servers to transmit all those data streams. Instead, they <a href="http://www.csc.kth.se/~gkreitz/spotify/">use P2P technology</a> and each user provides a certain amount of storage and a certain amount of bandwidth. Other projects like <a href="https://freenetproject.org/papers.html">Freenet</a> have proven that P2P sharing can store data and keep all the participants anonymous. So I propose leveraging fairly standard P2P approaches (or better yet, an existing P2P storage network) for storing, finding, and retrieving the content.</p>
<p>Well, that&#8217;s enough for one essay. Check back in <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-2">part 2</a> for more details about the UI, <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-3">part 3</a> for a discussion of the storage network, and <a href="http://mcherm.com/permalinks/1/constant-crawl-design-part-4">part 4</a> for analysis of how to anonymously determine whether something can be shared.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/constant-crawl-design-part-1/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Host Error 2</title>
		<link>http://mcherm.com/permalinks/1/host-error-2</link>
		<comments>http://mcherm.com/permalinks/1/host-error-2#comments</comments>
		<pubDate>Fri, 03 Feb 2012 20:21:09 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=617</guid>
		<description><![CDATA[Another posting on how to understand Profile errors. If you ever see &#8220;Host error number XXX&#8221;, it means that this was the XXX&#8217;th error of the day that this Profile instance wrote to the logs. Get someone to look it up in the Profile logs. Also, Calling mrpc ZWRAP with [925, 8864, ""44758220"", &#124;!&#124;] will [...]]]></description>
				<content:encoded><![CDATA[<p>Another posting on how to understand Profile errors.<span id="more-617"></span></p>
<p>If you ever see &#8220;Host error number XXX&#8221;, it means that this was the XXX&#8217;th error of the day that this Profile instance wrote to the logs. Get someone to look it up in the Profile logs.</p>
<p>Also, <em>Calling mrpc ZWRAP with [925, 8864, ""44758220"", |!|]</em> will fail if 8864 is not a valid profile userid (which is the case for me).</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/host-error-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Removing the &#8220;Macros&#8221; warning in PowerPoint</title>
		<link>http://mcherm.com/permalinks/1/removing-the-macros-warning-in-powerpoint</link>
		<comments>http://mcherm.com/permalinks/1/removing-the-macros-warning-in-powerpoint#comments</comments>
		<pubDate>Mon, 30 Jan 2012 13:16:20 +0000</pubDate>
		<dc:creator>mcherm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mcherm.com/?p=614</guid>
		<description><![CDATA[When you open any PowerPoint presentation made from my company&#8217;s default presentation format, you get a warning that it contains macros and a prompt asking whether the macros should be disabled. The macros are useless, but removing this is somewhat awkward and difficult to remember so I&#8217;m writing down the instructions. Launch PowerPoint (these instructions work for [...]]]></description>
				<content:encoded><![CDATA[<p>When you open any PowerPoint presentation made from my company&#8217;s default presentation format, you get a warning that it contains macros and a prompt asking whether the macros should be disabled. The macros are useless, but removing this is somewhat awkward and difficult to remember so I&#8217;m writing down the instructions.<span id="more-614"></span></p>
<ol>
<li>Launch PowerPoint (these instructions work for Office 2010).</li>
<li>Open the presentation to be fixed.</li>
<li>Go to the File menu, and select &#8220;Options&#8221;.</li>
<li>Select &#8220;Customize Ribbon&#8221;.</li>
<li>On the right-hand side of the complex dialog find the &#8220;Main Tabs&#8221; section. Check the checkbox next to the &#8220;Developer&#8221; tab.</li>
<li>Click &#8220;OK&#8221; and return. You should now have a &#8220;Developer&#8221; menu above the ribbon.</li>
<li>Select the &#8220;Developer&#8221; menu to display the developer ribbon.</li>
<li>On the ribbon, click the button labeled &#8220;Visual Basic&#8221;.</li>
<li>In the upper-left-hand side of the screen there is something labeled &#8220;VBAProject&#8221; with sub-folders labeled &#8220;Modules&#8221; some of which have entries with names like &#8220;Module1&#8221;. You may have to expand the tree-view widget to see some of these.</li>
<li>For each module, double-clicking will open it. Those that are completely blank (all of them in my company&#8217;s template) are useless and can clearly be deleted.</li>
<li>Delete a module by closing it (if you had opened it), then right-clicking on the module and selecting &#8220;Remove Module1&#8230;&#8221;. It will offer to save, but you won&#8217;t need to do that.</li>
<li>After doing this for all unwanted modules, go to the &#8220;File&#8221; menu and select &#8220;Close and Return to Microsoft PowerPoint&#8221;.</li>
<li>Save your newly-changed document.</li>
</ol>
<p>By the way, in case anyone couldn&#8217;t tell, after experience with it I <em>really</em> hate Microsoft&#8217;s &#8220;ribbon&#8221;. The old approach &#8220;Menus&#8221; required people to look around through lots of menus to find the commands they needed (although if they used a command frequently, they could read the menu to see what keyboard command would execute that). Power users could modify the menus if they wanted to (but hardly anyone did). In the new &#8220;Ribbon&#8221; interface, the commands that people use a lot are just one click away &#8212; as long as you know what arcane icon represents the action that you want to perform. If you need to find a new command you no longer have to look through the menus to find it&#8230; instead you simply perform a web search to find someone else who ran that command and follow the arcane set of clicks that they wrote in order to locate the mysterious and well-hidden button that performs the action. Power-users are people who know how to perform an action without looking it up on the web.</p>
]]></content:encoded>
			<wfw:commentRss>http://mcherm.com/permalinks/1/removing-the-macros-warning-in-powerpoint/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

