Constant Crawl Design – Part 4

Filed Under Programming

Suppose you wanted to build a tool for anonymously capturing the websites that a user visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was “public” and to do so without keeping any record (not even on the user’s own machine) of the sites visited or even tying together the different sites by one ID (even an anonymous one). Read more

Constant Crawl Design – Part 3

Filed Under Programming

Suppose you were building a tool integrated with web browsers to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network? Read more

Constant Crawl Design – Part 2

Filed Under Programming

Suppose you were building a tool to anonymously capture the (public) websites that a user visited. What would the UI requirements be? Read more

Constant Crawl Design – Part 1

Filed Under Programming

Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google’s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more reliably; the advantage to Google was that they got to crawl the web “as the user sees it” instead of just what Googlebot gets… and that they got to see every single page you viewed, thus feeding even more into the giant maw of information that is Google.

Well, Google eventually dropped Google Web Accelerator (I wonder why?), but the idea is interesting. Suppose you wanted to build a similar tool that would capture the web viewing experience of thousands of users (or more). For users it could provide a reliable source for sites that go down or that get hit with the “slashdot” effect. For the Internet Archive or a smaller search engine like Duck Duck Go, it would provide a means of performing a massive web crawl. For someone like the EFF or human-rights groups it would provide a way to monitor whether some users (such as those in China) are being “secretly” served different content. But unlike Google Web Accelerator, a community-driven project would have to solve one very hard problem: how to do this while keeping the user’s browsing history secret, the exact opposite of what Google’s project did. Read more

Host Error 2

Filed Under Programming

Another posting on how to understand Profile errors. Read more

Namespace for a valid SOAP message

Filed Under Programming

A brief hint: if you see an error message like this:

InputStream does not represent a valid SOAP 1.1 Message

check the namespace of the SOAP envelope:

SOAP 1.1: http://schemas.xmlsoap.org/soap/envelope/

SOAP 1.2: http://www.w3.org/2003/05/soap-envelope
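If you want to check a message by hand, here is a minimal sketch in Python that reads the namespace of the root element and reports which SOAP version it corresponds to. The function name `soap_version` and the lookup table are my own illustration, not part of any SOAP toolkit, and the SOAP 1.2 namespace is written as the W3C publishes it, with no trailing slash.

```python
import xml.etree.ElementTree as ET

# Envelope namespaces for each SOAP version. The match must be exact:
# the 1.1 namespace ends with a slash, the 1.2 namespace does not.
SOAP_NAMESPACES = {
    "http://schemas.xmlsoap.org/soap/envelope/": "1.1",
    "http://www.w3.org/2003/05/soap-envelope": "1.2",
}


def soap_version(message):
    """Return "1.1" or "1.2" based on the Envelope namespace, else None."""
    root = ET.fromstring(message)
    # ElementTree renders a namespaced tag as "{namespace}localname".
    if not root.tag.startswith("{"):
        return None  # the root element has no namespace at all
    namespace, _, local_name = root.tag[1:].partition("}")
    if local_name != "Envelope":
        return None  # the root element is not a SOAP envelope
    return SOAP_NAMESPACES.get(namespace)


if __name__ == "__main__":
    sample = (
        '<soapenv:Envelope '
        'xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">'
        '<soapenv:Body/></soapenv:Envelope>'
    )
    print(soap_version(sample))
```

If the function returns something other than the version your server expects (or None), the envelope namespace is the first thing to fix.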

Binary Backward Compatibility

Filed Under Programming

I saw this interesting article about a weakness in the Scala language. The weakness applies not just to Scala, but to pretty much any language: the community using the language cannot grow past a certain point until it somehow solves the problem of libraries depending on other libraries in a large (deep) tree. Read more

Story Points

Filed Under Programming

If you have complete and accurate requirements for your project which won’t change, and your development team is spot-on in estimating and highly consistent in their development pace, and there are no surprises, then you can produce highly accurate project timeline estimates up front. Such accurate estimates are (or, more accurately, would be) quite useful and well worth the effort it takes to produce them because of how nicely you can schedule everything. But how about the rest of us, for whom none of this is true? Read more

How Even Immutables are Hard with Threads

Filed Under Programming

Armin Rigo has a blog posting (worthy of an article of its own) proposing using STM (Software Transactional Memory) in PyPy. In a discussion on reddit someone suggested that you could have weaker threading guarantees and just use locks manually. Read more

When to Wrap a Library

Filed Under Programming

I find that this comes up fairly frequently. You find some useful library: perhaps it does logging, or enforces design-by-contract, or it provides an API for calling web services. But someone on the team suggests that instead of using the library directly, we should create a wrapper: “that way, if we ever decide to switch to a different library instead it will be easy to switch”. Is this a good idea? Read more