Posts filed under 'developers'

SxSWi Day Four: Considerations for scalable web ventures (How to scale)

[Image: panelists of the Scalable web ventures panel]

Panelists
Kevin Rose | Digg
Cal Henderson | Flickr
Joe Stump | Digg
Chris Lea | Media Temple
Garrett Camp | StumbleUpon
Matt Mullenweg | WordPress

{Discussion: Kevin seems to be moderating}

  • Consensus is that you think about scaling when you get there
  • Joe Stump says that it (not worrying about scaling initially) helped them concentrate on building cool features
  • Software load balancers suck … Squid is highly recommended
  • Pound is good for http load balancing
  • Joe Stump says that at 15 million pageviews you should start thinking of specializing servers (images, db, static files … that sort of thing)
  • WordPress uses mainly rented server boxes … 1000 of them (didn’t know that)
  • Cal from Flickr says that engineering time is expensive and that if you can just solve the problem by throwing money at it … then you should.
  • They recommend that when your development staff grows past 2 people that someone be appointed to standardize code (underscores vs Camel Case)
  • TRAC is brought up by Stump … use it if you
  • Lea says to document your code. It’s actually a monetary issue: time spent figuring out code is time not coding, and if you can reduce that you save money.
  • Cal cracks a really great joke “What is this documentation thing you speak of?” … “seriously … just hire people that agree with you”
  • Question comes up about remote workers … Kevin says that at one point they didn’t care … they hired a guy from the East to help digg scale.
  • Matt says his people don’t see each other for months, and when they get together it’s usually for social purposes more than anything
  • Says they are trying to get to a stage where they have users clustered in certain cities.

{Floor opened for questions}
Q: What bottlenecks do you guys usually encounter?

  • If it’s not your db then it’s your file storage (NFS etc) … Cal concurs … says it’s almost always IO … “With a teeeny bit of foresight … you can avoid that” Love this guy!
  • Talking about digg architecture … Joe Stump says digg db has 200 tables but only about 2 are problem children
  • says one of them has about 200 million records … and those two tables get the most read and write requests
  • Stump says that your language never matters … Cal chimes in “Unless it’s Ruby” … the crowd roars in appreciation!
  • PS: not sure where this fits, but there is a discussion about how their admin tools are usually not very well done. Joe says that whenever something goes wrong with digg, they usually check with the Admins first to see if they’ve screwed with anything.

Q: Recommended software

  • Cal says use Ganglia for graphing, Puppet for admin. Lea suggests Munin for graphing. Cal says Ganglia and Munin are almost the same. LVS comes up.

Q: How do you keep the community from becoming obnoxious?

  • Kevin talks about giving the community the tools to moderate themselves, talks about the success of the “bury” option.

[Image: the panel]

Q: About source control

  • They do pushes at digg from once a day to 45 times a day, but Rose says officially they’ll be pushing twice a month.
  • Joe suggests that when you’re ready to push live, you should freeze your code and create a new branch
  • This way when something breaks, you can fix the code in the branch and push it live from there.

Q: asks about what they do when they can’t have a local development environment

  • Cal says they can’t support local dev environments because it’s just too complex, too many moving parts. They just use dev servers. Everyone on the panel concurs.
  • Cal: Talks about how Flickr lives on memcached and Squid, suggests taking a look at Varnish if all your stuff is in memory. Says that they have 32,000 requests per second for images.
  • Joe: Says they cache things like user objects containing user data (because users barely alter their info after they first get set up)
  • Matt: Talks about the use of output caching … says that if you’re getting 20 million page views then caching pages for, say one second, can take you down to about 8 million or was it 800,000 …
  • Joe: Talks about queuing. Says that when a user diggs something, they just cache it for that user so it shows up dugg, but they queue it. So a digg might not get into the databases for a few minutes.
  • Digg uses Gearman for queuing. Cal talks about a “ghetto queue” of cron and MySQL to implement queuing. Joe says that’s “housing projects” bad.
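
To make the caching and queuing ideas above a bit more concrete, here is a minimal sketch in Python. It is not how WordPress or Digg actually do it. The class names, the one second TTL default and the in-memory deque (standing in for a real queue like Gearman) are all my own assumptions.

```python
import time
from collections import deque


class OutputCache:
    """Cache rendered pages for a short time-to-live (TTL), in seconds."""

    def __init__(self, ttl=1.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so the sketch is easy to test
        self.store = {}         # url -> (expiry time, rendered page)
        self.renders = 0        # how often the expensive render actually ran

    def get(self, url, render):
        now = self.clock()
        hit = self.store.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]       # still fresh: skip rendering entirely
        page = render(url)
        self.renders += 1
        self.store[url] = (now + self.ttl, page)
        return page


class DiggRecorder:
    """Show a digg to the user right away, write it to the database later."""

    def __init__(self):
        self.user_cache = set()  # (user, story) pairs shown as "dugg"
        self.pending = deque()   # queued writes, oldest first
        self.database = []       # hypothetical stand-in for the real table

    def digg(self, user, story):
        self.user_cache.add((user, story))
        self.pending.append((user, story))

    def has_dugg(self, user, story):
        return (user, story) in self.user_cache

    def flush(self, batch_size=100):
        """Meant for a background worker: drain up to batch_size writes."""
        written = 0
        while self.pending and written < batch_size:
            self.database.append(self.pending.popleft())
            written += 1
        return written
```

With the output cache, every request for the same url inside the TTL window costs a dictionary lookup instead of a full page render, which is where the drop in load comes from. With the recorder, the user sees the digg immediately even though the database only catches up when the worker calls flush().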

Q: about APIs

  • I love Cal! He says that there’s something about APIs that brings out the stupidity in people. They just try to suck down all your data as fast as possible … where they wouldn’t try to do that with a regular web page. He says that implementing a throttling system is a good idea, to avoid getting creamed by idiots. Did I mention that I love this guy!
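
Cal didn’t say how Flickr throttles, so the following is just one common way to do it: a token bucket per API key, sketched in Python. The class names, the rate and capacity parameters and the idea of keying on an API key are my assumptions, not anything from the panel.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Top the bucket up for the time that has passed, never past capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttle:
    """Keep one bucket per API key so one greedy client can't starve the rest."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, api_key):
        bucket = self.buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[api_key] = bucket
        return bucket.allow()
```

A request handler would call allow(api_key) first and answer with an error instead of data whenever it returns False.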

Q: any good documentation for using the tools they recommended

  • Answer: A resounding NO. Joe says 90% of what he’s learned with open source is by trial and error.

Definitely worth the price of admission!

March 11th, 2008

Coding Horror On Beautiful code

I got around to reading this very insightful piece about truly beautiful code on Coding Horror today.

In it, Jeff Atwood talks about the problem with the book Beautiful Code, which is that it actually talks about code and not the ideas behind the code.

To put it succinctly in Jeff’s words …

“Ideas are beautiful. Algorithms are beautiful. Well executed ideas and algorithms are even more beautiful. But the code itself is not beautiful. The beauty of code lies in the architecture, the ideas, the grander algorithms and strategies that code represents”

[Image: Beautiful Code, the book]

I just remember thumbing through the book at my local Barnes & Noble and not being enamored of it … I went on to spend 2 hours reading Obie Fernandez’s The Rails Way instead.

February 26th, 2008

Why query strings in urls drive Googlebot and other search engine crawlers insane

“We have to reinvent the wheel every once in a while, not because we need a lot of wheels; but because we need a lot of inventors.”
- Bruce Joyce

I wrote about my experience writing a site crawler in php in an earlier post, and I’m going to use some of the background there to make my point here. So it might help to go read it if you haven’t already.

[Google’s crawler [Googlebot] isn’t that sophisticated/writing a crawler in php]

From my casual observation of the way Googlebot crawls some of the sites I work on, I have reached the conclusion that it works in much the same way that a crawler I wrote a year ago worked.

Googlebot goes page to page, gathering links from your page and tacking them onto the url that it is at right then. So why do query strings give it such a problem?

The answer is simple. Imagine this url for an item that doesn’t exist anymore.

www.example.com/store.php?buyid=29&catid=12

When a crawler encounters this url and tests it to see if it returns a 404 … it doesn’t.

Why?

Because www.example.com/store.php is usually still a valid page. It won’t give the crawler an error, unless you explicitly code it to.

So the crawler now tosses www.example.com/store.php?buyid=29&catid=12 onto its list of pages to be crawled. Can you see the disaster waiting to happen?

www.example.com/store.php?buyid=29&catid=12 and any other non-existent urls like it are basically just mapping to the still valid www.example.com/store.php, but in the crawler’s mind they are all different urls.

Now, if there are other urls on that page (store.php), for related products for example, Google just takes each url and tacks it on to the url (it thinks) it’s at right now. So it winds up with

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11

It does that for every invalid query string url that has store.php in its base. It then goes back and crawls them again and now it has

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11store.php?buyid=39&catid=11

The crawler is now in a tailspin … going around in circles trying to crawl your site. Chewing up your cpu cycles and generally being a nuisance.
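
Here is a small Python sketch of that tailspin. I’m not claiming this is Googlebot’s actual code; naive_join is a hypothetical function that mimics the “tack it onto the current url” behavior described above, and it is compared with urljoin from Python’s standard library, which resolves relative links the standard way.

```python
from urllib.parse import urljoin


def naive_join(current_url, href):
    """Hypothetical crawler logic: tack the link straight onto the current url."""
    return current_url + href


current = "http://www.example.com/store.php?buyid=29&catid=12"
href = "store.php?buyid=39&catid=11"

# The naive crawler keeps producing a new, longer url on every pass.
first_pass = naive_join(current, href)
second_pass = naive_join(first_pass, href)

# Standard resolution replaces the last path segment and the query string,
# so following the same link again lands on the same url.
resolved = urljoin(current, href)
resolved_again = urljoin(resolved, href)
```

Since the naive version never produces a url it has seen before, a “have I crawled this already” check can’t stop it. The standard resolution settles on one url after the first pass.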

I hope this helps you understand why Googlebot hates query strings so much.

I haven’t tried this yet, but I think it should be clear that making the base url of a query string resolve to a 404 error will help it out a lot.

So as an example,

www.example.com/store.php?buyid=29&catid=12

should return a code 200/OK

and

www.example.com/store.php

should give a 404 error.

This is just my theory, I don’t know that it’d be practical.
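
As a rough sketch of what “explicitly coding it” might look like, here is the decision logic in Python. The CATALOG set and the store_status function are hypothetical, and the sketch goes one step further than the theory above by also returning a 404 for items that no longer exist.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical (buyid, catid) pairs for items that are still on sale.
CATALOG = {(39, 11)}


def store_status(url):
    """Return the HTTP status code the store page should send for this url."""
    params = parse_qs(urlsplit(url).query)
    try:
        key = (int(params["buyid"][0]), int(params["catid"][0]))
    except (KeyError, ValueError):
        return 404  # bare store.php, or missing/garbled parameters
    return 200 if key in CATALOG else 404
```

With this in place, a crawler that tests a stale or mangled url gets a real error back and has no reason to add it to its list.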

PS: I hope this further helps you understand why search engine crawlers also hate PHP session ids on your content.

February 8th, 2008

Google’s crawler [Googlebot] isn’t that sophisticated/writing a crawler in php

I spent a lot of time early last year, trying to write a crawler in php (I know, I know).

It was supposed to sit on the server so that when you went to its url, it’d generate a Google sitemap for your entire site.

What I found out was that writing a good crawler is very hard work. Not because of the recursion involved, but because of the infinite ways link tags appear.

Now Google has validated my experience (more on this in a second).

Just a couple of things I had to consider with my crawler were

  • I had to program it to look for a base tag so that I’d know whether to treat the links as relative or absolute.
  • I had to check each link to figure out if it was an internal or external link so I’d know whether to crawl it.
  • Then I had to keep a running list of links crawled, so that I’d know if I had crawled a link before or not
  • I had to teach the program that a relative link starting with “/” means substituting the domain for it.
  • knowing how to deal with “../” … this was a pain and a half
  • I had to let it know how to deal with mailtos, javascript, and improperly written urls like <a href="www.concept47.com"> (more on this later) … etc
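
Most of the checks in that list can be sketched in a few lines of Python using the standard library’s urljoin, which already handles “/” and “../”. This is not my original PHP crawler; classify_link and SKIP_SCHEMES are names made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

SKIP_SCHEMES = {"mailto", "javascript"}


def classify_link(page_url, href, base_href=None):
    """Return ("skip" | "internal" | "external", absolute url or None)."""
    href = href.strip()
    if not href or href.startswith("#"):
        return ("skip", None)           # empty link or same-page anchor
    if urlsplit(href).scheme.lower() in SKIP_SCHEMES:
        return ("skip", None)           # mailto: and javascript: links
    # A <base> tag, when present, replaces the page url for resolution.
    absolute = urljoin(base_href or page_url, href)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return ("skip", None)
    same_site = parts.netloc.lower() == urlsplit(page_url).netloc.lower()
    return ("internal" if same_site else "external", absolute)
```

For example, classify_link("http://www.example.com/blog/post.html", "../about.html") resolves to the internal url http://www.example.com/about.html. It does nothing clever about improperly written urls, which is the hard part discussed further down.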

My crawler worked by keeping a list of urls to be crawled, which it continually added newly found links to. As it crawled each url, it put it in another array, and each new url had to be crosschecked against that array before being added to the list to be crawled.
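
A minimal Python sketch of that loop, assuming a hypothetical get_links function that returns the links found on a page. One small difference from what I described: it tracks every url it has already seen (queued or crawled) in one set, so a url never gets queued twice.

```python
from collections import deque


def crawl(start_url, get_links):
    """Breadth-first crawl; returns urls in the order they were crawled."""
    to_crawl = deque([start_url])   # the list of urls still to be crawled
    seen = {start_url}              # every url ever queued or crawled
    crawled = []
    while to_crawl:
        url = to_crawl.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:    # the crosscheck before queuing
                seen.add(link)
                to_crawl.append(link)
    return crawled


# A tiny in-memory "site" with links that point back at each other.
SITE = {"/": ["/a", "/b"], "/a": ["/", "/b"], "/b": ["/a"]}
order = crawl("/", lambda url: SITE.get(url, []))
```

The loop ends once no unseen links turn up, even though the pages link to each other in circles.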

The problem, though, is the way people write html markup. As many of you know, there are some nastily written pages out there … so if someone wrote <a href="www.concept47.com"> or <a href="concept47.com"> or even <a href="screwyougoogle"> my crawler had to know not to add it to the list to be crawled.

This is very difficult to do correctly and for all the time I spent on it, there is no real way to deal with it. You could write special cases for <a href="www.concept47.com"> but what about <a href="ww.concept47.com"> or <a href="w.concept47.com"> … see the problem?

Even though the urls give you a 404, they’ll make it onto your list to be crawled and waste the crawler’s time. I felt like such a loser for not being able to figure out this issue, but it seems the Googlebot has the same problem.

Look at this.

[Image: Googlebot has problems with bad link tags]

This is from my webmaster tools console.

The problem here was that I had a link tag on one of my blog posts that went like this

<a href="www.unfuddle.com">

As you can see, even the mighty Googlebot didn’t pick up on the error. It just tacked the url onto the current url it was at and went on about its business.
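
To be fair to Googlebot, an href with no scheme is, by the url rules, a relative path, so resolving it against the current page is the standard behavior. A quick Python check with the standard library shows the same result (the post url here is hypothetical):

```python
from urllib.parse import urljoin

page = "http://www.example.com/blog/my-post/"   # hypothetical post url

# No "http://" in front, so the href is resolved as a relative path.
resolved = urljoin(page, "www.unfuddle.com")

# With the scheme in place, the link points where it was meant to.
fixed = urljoin(page, "http://www.unfuddle.com")
```

So the real fix is in the markup: write the full url, scheme included.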

Validation!

Read the next in this series [Why query strings in urls drive Googlebot and other search engine crawlers insane]

February 8th, 2008

The right way to update software for your customers by Firefox

I consider myself a power user of windows xp, so why haven’t I upgraded from winamp 5.35 to winamp 5.52?

After all, every single time I start winamp it bugs me to.

[Image: winamp update available!]

The answer is simple … It’s because I’m lazy.

I’m not going to go to winamp.com, try to figure out which version to download and then actually install it over again, just so winamp runs exactly the same as it did before! No way.

But if the program went out, got the update and installed it for me … I wouldn’t object.

Firefox does this right.

[Image: an update for Firefox is available]

When an update is available, it goes out and finds it for me. If I okay it, it installs the update for me and restarts my browser, putting me back viewing the page I was looking at before … like nothing happened. All I have to do is hit “Download & Install Now”. How easy is that?

[Image: downloading and updating Firefox]

Nag screens/prompts/dialogs are very annoying. My natural instinct is to close them and get on with my life.

In that scenario, everybody loses.

So if you write software, you should strive to have it update automatically, if you possibly can. That would definitely be a selling point for me as your customer. (Hear that Blumenthals software?)

PS: Most software (including WordPress) does require you to go download and install the newest versions. Since automatically updating software is so rare, it could be a killer feature if you incorporated it into your software.

February 7th, 2008

Apparently the creator of Ruby on Rails doesn’t comment his code … kinda

Here are some excerpts from DHH’s post and comments yesterday on 37signals

  • The short answer is that we don’t document our projects. At least not in the traditional sense of writing a tome that exists outside of the code base that somebody new to a project would go read …
  • Further more, I don’t really find it necessary for the kind of work that we do. Our biggest product, Basecamp, is about 10,000 lines of code. That really isn’t a whole lot in the grand scheme of things. Everything we do is build is also using Ruby on Rails, which means that most Rails programmers would know their way around our applications straight away. It’s the same conventions and patterns used throughout.
  • Finally, we write our application in a wonderfully expressive and succinct programming language like Ruby that lends itself very well to a programming style like the one Kent Beck preaches in Smalltalk Best Practice Patterns. Keep your methods short and expressive. On average, our models have methods just four lines long. Adding documentation to a method should usually only be done when you’re doing something non-obvious that can’t be rewritten in an obvious way.
  • [comment] Wim, yes there’s RDoc. I just generally don’t use it for projects. When methods are only an average of 4 lines long written in a language like Ruby, it’s often faster and better to merely browse the code base rather than rely on explicit commenting.

Keep in mind that I’m no Ruby on Rails genius, and from the little I’ve done I can see where DHH is going with this. But I’ve always thought that this argument of a language being so succinct and clear that you don’t have to write comments is just a bit silly for a couple of reasons.

  • I believe that you don’t write code for machines, you write code for people (other developers). So any help you can give them in navigating your code is typically good to have. It saves them time and their employers money … that is what being a great consultant is about, you have to be thinking in terms of how to help your clients’ business and saving them money falls in that category.
  • People who use this line of argument are either too lazy to comment and are trying to justify it …
  • … or don’t understand that there are developers of all skill levels in the industry. So whereas someone with your skill level would be able to navigate your code quickly, someone who wasn’t as good might take longer … why not avoid that?

Note that I’m not of the school of thought of commenting just for the sake of it, like I’ve heard some “blub programmers” do. However, I do think that you should always be thinking of other developers when you code, and if commenting can get them to a point where they can modify your code in 1 minute instead of a minute and a half … then you should comment.

In the end, I guess it’s a bit unfair to criticize DHH, because it’s not clear that he doesn’t comment his code much … though it’s easy to infer that. I just know from my experience that people who say things like he says have a tendency to have 3 lines of comments in some piece of code 500 lines long.

But if you’re a “rockstar developer”, I guess everyone has to dance to your tune, wherever you are, right?

February 6th, 2008

Why php_value directives for php.ini set in .htaccess fail when php is running as cgi or fcgi

I was trying to take advantage of PHP’s auto_prepend_file directive today, by using the php_value directive to set it in a .htaccess file. But as soon as I did that, my cheerfully running application puked and died, with the familiar message.

“Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request”

I had seen this behavior before, when I was writing an app for a client a few months ago, but I hadn’t had time to investigate it. Today I decided to go a-googling and I promptly found my answer:

Those are Apache directives, but in CGI mode Apache calls the php binary, which in turn reads php.ini. Since the binary doesn’t read httpd.conf it has no effect on PHP. As PHP isn’t loaded into Apache, Apache doesn’t know what to do with the directives and borks.

February 5th, 2008

A lesson in PHP’s design and why its deployment model “just works”

Ian Bicking’s “What PHP Deployment Gets Right”

This is a wonderfully written article on how PHP works, and the funny thing is that Ian seems to be more of a Python guy than anything. Needless to say, I learned a few things from reading this …

  • What people don’t realize is that PHP is effectively a CGI model of execution. People don’t appreciate this because PHP is implemented with mod_php, an Apache module. There are many other modules like mod_perl (the first of these mod_language modules), mod_python, mod_ruby, etc. None of these other modules are like mod_php. This has led many a commentator astray because they don’t get this. This is because the PHP language was written for mod_php. Perl, Python, Ruby — none of them were written to be used as an Apache module. You can’t take one of these existing languages and just retrofit it to be like PHP or like mod_php.

  • PHP processes can leak memory like crazy. It doesn’t matter because they only leak memory for one request.

  • This one helped me understand why Ruby on Rails and its fast-cgi implementation is so slow:

    Most of the language (PHP) is implemented in C, in a shared library. In comparison Python has “batteries included”, but those batteries are largely written in Python. Python code is not shareable, and can take time to load up. So while a single Python CGI script might be small, it probably imports lots of code which would have to be loaded each request. PHP scripts actually are small. (Stuff like PEAR changes this by adding substantial libraries written in PHP, but also seriously effects PHP performance.)

January 20th, 2008

Is rails a ghetto? developments in the ROR space.

Although I haven’t done as much work with Rails as I’d like, I follow it very closely because I like the language and the platform, plus I’m sure to write a web application in it in the next month or two.

In the last few weeks, though,  there have been some interesting developments in the ROR space. Ace programmer Zed Shaw fired two broadsides against the Rails community a few weeks back titled Rails is a ghetto. [aside: there is this very interesting O'Reilly interview with Zed that might help you understand his accomplishments with Ruby]

It’s an interesting read, albeit unprofessional and rather profane. I came away from the two-part rant feeling that if Mr. Shaw were as unprofessional as he sounded, then it might be the reason he kept bumping into the types of characters he “railed” against. I also didn’t get enough evidence from the rant to really justify (in my mind at least) that the Rails community was a ghetto. It just seems that the immediate space that Zed works in was a bit crappy, but the job postings I see for Rails are an order of magnitude more coherent and well-mannered than the ones I see for php. Anecdotal … I know.

Obie Fernandez, another big time Rails person (author of the book “The Rails Way”), posted a very balanced counterpoint to the ghetto rant … About Rails and Ghettos. I felt this was a better assessment of the community. This is my perspective as a one-foot-in-one-foot-out guy in relation to Rails.

The second episode concerns the mounting concern with Ruby on Rails performance and ease of deployment. In case you don’t know, Ruby is slow and painful to deploy. (I helped a client with a Ruby on Rails application deployment late last year and the experience was not pleasant). Dreamhost offers Ruby on Rails hosting for its shared server accounts, and in this post “How Ruby on Rails could be much better” they outline the problems they have had hosting Ruby on Rails and give suggestions for making it better.

To this, the creator of Ruby on Rails responded with “The deal with shared hosts”, which I think shows a disregard for the importance of shared hosting accounts and a bit of his brashness. In summary:

In exchange, I’ll ask a few, small favors. Don’t treat the current Rails community as your unpaid vendor. Wipe the wah-wah tears off your chin and retract the threats of imminent calamity if we don’t drop everything we’re doing to pursue your needs. Stop assuming that it’s either a “complete lack of understanding of how web hosting works, or an utter disregard for the real world” that we’re not working on issues that would benefit your business.

[aside: I kind of agree with this, Dreamhost should hire someone to actually get under the hood and make Rails work for them, after all they are making money off of it. I just think the tone is a tad arrogant]. To continue, Dreamhost wrote another post, “Rails is as Rails does”, softening their tone but still making the excellent point that making Rails easy to deploy can only win it converts. Somehow, I don’t think the Rails folks really care about converts that much (the interesting question is “Do they need to?”). Apple thrives on that air of exclusivity that its products’ high prices give it; maybe Rails is targeting elite programmers who don’t complain but fix what they don’t like and get on with it.

At the end of all this though, I am left with a weird taste in my mouth about Ruby on Rails. I’ve always felt that Rails folks are a bit arrogant and condescending. I think there is the feeling that everyone should develop with ROR a certain way and if you don’t, you’re not worth wasting time on. In fact here is the creator’s message to people who have a problem with the way he handles Rails, (with some background so you don’t think I’m being unfair) …

if you don’t like the way I’m creating Rails then fuck you.

I see where they’re coming from, but I’m not sure I like it. I love Ruby, I have the best Ruby on Rails IDE installed and hope to be knee deep in it soon. It will be interesting to see if the impressions I have formed about the community hold up.

January 20th, 2008

How shady companies steal domain names you search for

[Image: domain thief]

I worked as a developer with a Search Engine Optimization firm for some time, where I learned that some shady companies are able to buy domain names that you search for online.

It wasn’t exactly clear to me how this was happening until I chanced across this excellent article on my new favorite blog:

How firms steal domain names you research

If you’re too lazy to read the entire thing then here’s a brief summary of how to avoid losing a domain you just searched for.

  • Avoid address bar guessing.
  • Avoid search engines that don’t make a billion dollars a year in revenue.
  • Avoid browser plug-ins that send data back to the Internet.
  • Go directly to trusted registrars and whois companies.

PS: I am now up to 80 feeds in my RSS reader :P

January 4th, 2008

Nifty email application feature: Windows Live Mail

This is a neat little feature in the new Windows Live Mail.

[Image: Windows Live Mail feature]

For those who don’t know, Windows Live Mail is Microsoft’s replacement for Outlook Express 6.

I love this feature so much because now, my email client won’t corrupt my contacts with a ton of useless entries (if you use craigslist a lot you’ll know what I mean).

It’s excellent logic, because if I exchange emails with a person a couple of times, then it’s a pretty good bet that I want that person in my contact list but am too lazy to do it manually. Now Live Mail does it automagically, so you don’t have to … that’s called an “intelligent default”. Makes for excellent application design because it “doesn’t make me think”.

What would be even better would be the ability to set the threshold number manually.
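
I have no idea how Live Mail implements this internally, but the logic with a user-settable threshold could be sketched in Python roughly like so. ContactSuggester and its default threshold of 2 are pure guesswork on my part.

```python
from collections import Counter


class ContactSuggester:
    """Add an address to the contact list once enough emails are exchanged."""

    def __init__(self, threshold=2):
        self.threshold = threshold      # the number I'd like to set manually
        self.exchanges = Counter()      # address -> emails exchanged so far
        self.contacts = set()

    def record_exchange(self, address):
        """Count one exchanged email; return True if the address is a contact."""
        address = address.lower()
        self.exchanges[address] += 1
        if self.exchanges[address] >= self.threshold:
            self.contacts.add(address)
        return address in self.contacts
```

A one-off craigslist reply never reaches the threshold, so it never clutters the contact list.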

December 19th, 2007

jscript 5.7: Minor update to IE6’s javascript engine

Microsoft just released an update to IE6’s JavaScript engine. It was in response to problems with its (jscript 5.6) garbage collection that would cause poor performance with large Ajaxy applications … like Gmail. It probably also helps their Hotmail web ui, since that uses ajax as well.

I would personally have liked to see more done with this update (it’s only a “minor” update), but I suppose you don’t want to give people a reason to hold on to IE 6, right? Hopefully this stops FeedDemon (my RSS reader) from freezing on CNN’s pages?

Read more about the update here

Download IE6’s jscript 5.7

December 19th, 2007
