Learning Python to build a web crawler

(Image from XKCD)

Although I shouldn’t really be procrastinating, writing for hours makes me depressed, and learning a new programming language makes me happy. So for the past two or three weeks, I have spent two or three hours each Saturday or Sunday building a web crawler, something I had never done before. The crawler captures posts from my Korean blog and imports them into this WordPress blog. To do this, I learned a new language: Python.

This is what I did:
1. I opened an HTTP connection using the urllib2 module.
2. To parse the content of interest, I used the BeautifulSoup module. It is built on top of regular expressions and an SGML parser. With it I can traverse the HTML tree very easily and search for a node using regular expressions.
3. I dumped the result to a text file in Movable Type format, which I then fed into the WordPress import system.
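
The three steps above can be sketched roughly as follows. This is not my original script: it is written for Python 3, where urllib2 became urllib.request, and it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The class names "title" and "body", the PostParser class, and the sample HTML are all made up for illustration; a real blog page would need its own selectors.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PostParser(HTMLParser):
    """Collects text inside elements whose class is "title" or "body".

    Hypothetical markup: a real page will use different tags and classes.
    Nested tags inside the body are not handled in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "body": ""}
        self._current = None  # name of the field being filled, or None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.fields:
            self._current = css_class

    def handle_endtag(self, tag):
        if tag in ("h2", "div"):
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data


def fetch(url):
    """Download a page as text. The utf-8 encoding is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_post(html):
    """Return the title and body text found in one page of HTML."""
    parser = PostParser()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.fields.items()}


# Sample input standing in for a downloaded page.
sample = '<h2 class="title">안녕하세요</h2><div class="body">첫 번째 글</div>'
post = parse_post(sample)
```

Here parse_post(sample) returns a dictionary with the title and body text. fetch(url) is defined but not called, since it needs a live server.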

I got stuck on these:
1. I had to solve many encoding and decoding errors. For instance, when you open a file the default way, text is written with the “ascii” encoding. Korean is usually expressed in one of two encodings, ‘utf-8’ or ‘euc-kr’. When I tried to write a Korean string using the default f.write() method, the interpreter stopped with an error. Hence, the file had to be opened using the codecs module.

I had to use many encode and decode calls. Furthermore, the order of the calls was another source of frustration. When I called common string manipulation functions such as replace or strip, they somehow changed the strings in a way that made them impossible to decode afterwards. My guess is that these functions also assume they are working on ASCII characters, but I haven’t figured it out yet.

2. The Korean blog server terminated the connection during crawling. I solved this by simply running the crawler again from another IP address at home. Someone told me that this kind of problem can easily be solved by faking the headers of the HTTP request, but I have not tried that yet.

3. The WordPress import module reports an error if there is even a mild formatting error. Getting rid of redundant whitespace was not that difficult. However, subtle details such as the number of dashes in a delimiter mattered too. For example, the delimiter between posts had to be exactly eight dashes (five dashes between the different sections of a post), no more and no less. For a while I thought the delimiter was seven dashes, and during that time I was convinced that the WordPress import module was broken.
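
On the encoding errors in item 1, the pattern usually recommended is to decode the bytes once on the way in, do all string work on text, and encode once on the way out. Here is a minimal sketch of that pattern in Python 3, where the built-in open() takes an encoding argument and plays the role codecs.open played for me. I can't say for sure that this is what was wrong with my replace and strip calls; the sample string and file name are made up.

```python
import os
import tempfile

# Bytes as they might arrive from a server that uses euc-kr (assumed).
raw = "한국어 글  ".encode("euc-kr")

# Pushing Korean text through the ascii codec fails.
try:
    raw.decode("euc-kr").encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decode first, then strip and replace on text rather than on raw bytes.
text = raw.decode("euc-kr").strip().replace("글", "post")

# Encode once, when writing, by naming the encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), "post.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

The point of the ordering is that strip and replace only ever see decoded text, so they cannot cut a multi-byte character in half.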
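
On the dropped connections in item 2, the "faking the header" suggestion would look something like the sketch below. I have not tried it against the Korean blog server, so whether it would have helped is unknown. It uses Python 3's urllib.request; the User-Agent string, the build_request helper, and the example.com URL are placeholders.

```python
from urllib.request import Request, urlopen

# Placeholder value; a real crawler might copy a browser's User-Agent.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return the raw response bytes."""
    with urlopen(build_request(url), timeout=30) as response:
        return response.read()


# Building the request does not touch the network.
request = build_request("http://example.com/post/1")
```

Only fetch() opens a connection, so the header can be inspected before anything is sent.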
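
On the delimiters in item 3, keeping the dash counts in named constants would have saved me the seven-dash detour. Here is a minimal sketch of a writer for the export file. The TITLE, DATE and BODY field names and the date layout are my understanding of the Movable Type import format and should be checked against its documentation; format_post, format_export, and the sample posts are made up for illustration.

```python
SECTION_DELIMITER = "-" * 5  # separates sections within one post
POST_DELIMITER = "-" * 8     # separates one post from the next


def format_post(title, date, body):
    """Format one post, stripping stray whitespace from each field."""
    lines = [
        "TITLE: " + title.strip(),
        "DATE: " + date.strip(),
        SECTION_DELIMITER,
        "BODY:",
        body.strip(),
        SECTION_DELIMITER,
        POST_DELIMITER,
    ]
    return "\n".join(lines) + "\n"


def format_export(posts):
    """Concatenate all posts into one import file's worth of text."""
    return "".join(format_post(**post) for post in posts)


export = format_export([
    {"title": " First post ", "date": "01/31/2009 03:31:05 PM", "body": "Hello\n"},
    {"title": "Second post", "date": "02/01/2009 10:00:00 AM", "body": "World"},
])
```

Every dash-only line in the result is either five or eight characters long, which is the property the importer was so strict about.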

For now, I’ve uploaded all the posts to a blog that I created purely for testing purposes:
http://hcigirl.wordpress.com/ (I kinda like this URL!! I am thinking about turning it into my personal webpage.) I realized that there is still a lot of post-processing to be done. Many hyperlinks and categories can be rewritten to reduce the number of links pointing to my old blog. I also found that, because I didn’t do a rigorous job of stripping out style-related tags, the layout of the blog looks horrible. As the Movable Type import format doesn’t distinguish between categories and tags, I haven’t figured out how to categorize or tag the posts yet.

All this work is for another day, and now I should go and do some laundry. =)

9 Comments on “Learning Python to build a web crawler”

  1. Hey if you are into crawling with Python, you should take a look at Scrapy, it works like a charm and is really efficient. You will write a spider in no time.

    But what you did is probably best for your case 🙂 and easiest to implement, although Scrapy also seems to handle encoding issues.

    Ana hasee yoo!

  2. I am now not certain where you’re getting your information, but great topic. I needs to spend a while finding out much more or understanding more. Thanks for magnificent information I used to be looking for this info for my mission.

  3. newbs guide says:

    I do trust all the concepts you have introduced on your post. They are really convincing and can definitely work. Still, the posts are too brief for beginners. Could you please prolong them a bit from next time? Thank you for the post.

  4. Somebody essentially help to make seriously posts I might state. This is the first time I frequented your website page and thus far? I amazed with the research you made to create this particular submit incredible. Excellent process!

  5. nana says:

    wow….I’m trying to learn this on Udacity right now and honestly, I feel as if I’ve learned more from your single post than I did from the 30 two – four minute long videos I watched. To me the videos don’t seem to show much really and the test aren’t what I’d expect, it feels as if its missing something (well at least to me!) it’s not as informative as I hoped.

    But be assured, I will come back here for more!

  6. You realize therefore significantly when it comes to this matter, made me for my part believe it from so many varied angles. Its like men and women are not involved until it is something to accomplish with Woman gaga! Your personal stuffs nice. All the time deal with it up!

  7. Great goods from you, man. I have be aware your stuff prior to and you are just too great. I really like what you’ve obtained here, really like what you are stating and the best way in which you say it. You make it enjoyable and you continue to care for to stay it wise. I can not wait to read far more from you. That is actually a tremendous web site.

  8. kenju254 says:

    Reblogged this on ikenju254 and commented:
    Writing a simple Web Crawler

  9. Nice weblog right here! Also your web site quite a bit up fast!
    What host are you using? Can I get your associate link to your host?
    I want my site loaded up as quickly as yours lol
