Learning Python to build a web crawler
Posted: 2011/02/19
Although I shouldn’t really be procrastinating, writing for hours makes me depressed, and learning a new programming language makes me happy. So for the past two or three weeks, I spent two or three hours each Saturday or Sunday building a web crawler, something I had never done before. The crawler captures posts from my Korean blog and imports them into this WordPress blog. To do this, I learned a new language, Python.
This is what I did:
1. I opened an HTTP connection using the urllib2 module.
2. To parse the content of interest, I used the BeautifulSoup module. It is built on top of regular expressions and SGML parsing. It lets me traverse the HTML tree very easily and search for nodes using regular expressions.
3. I dumped the result to a text file in Movable Type format, which I then fed into the WordPress import system.
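The first two steps above can be sketched roughly as follows. This is a sketch under assumptions, not the original script: it assumes Python 3, where urllib.request replaces urllib2, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs. The names build_request, fetch, parse_page, and TitleAndLinks are made up for illustration, as is the choice to pull out the page title and the links.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html_text):
    """Return (title, links) for an already-decoded HTML string."""
    parser = TitleAndLinks()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links


def build_request(url, user_agent=None):
    """Build a request, optionally with a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, headers=headers)


def fetch(url, encoding="utf-8", user_agent=None):
    """Download a page and decode it to text.

    The encoding is an assumption: a Korean blog may serve euc-kr instead.
    """
    with urlopen(build_request(url, user_agent)) as response:
        return response.read().decode(encoding)
```

Calling parse_page(fetch(url)) on a post's address would then give the title and links to work with. Keeping the download and the parsing in separate functions means the parsing can be tried on a saved HTML string without touching the network.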
I got stuck on these:
1. I had to solve many encoding and decoding errors. For instance, when you open a file the default way, writes go through the “ascii” encoding. Korean text is usually in one of two encodings, ‘utf-8’ or ‘euc-kr’. When I tried to write a Korean string using the default f.write() method, the interpreter stopped with an error. Hence, the file had to be opened through the codecs module with an explicit encoding.
I had to use many encode and decode calls, and the order of those calls was another source of frustration. When I called common string manipulation functions such as replace or strip, they somehow changed the strings in a way that made them impossible to decode afterwards. My guess is that these functions also assume they are working on ASCII characters, but I haven’t figured it out yet.
2. The Korean blog server terminated the connection during crawling. I solved this by simply running the crawl again from another IP address at home. Someone told me that this kind of problem can easily be solved by faking the headers of the HTTP request. I have not tried this yet, though.
3. The WordPress import module reports an error if there is even a mild formatting mistake. Getting rid of redundant whitespace was not that difficult. However, subtle details such as the number of dashes in a delimiter mattered too. For example, the delimiter between posts had to be exactly eight dashes (five dashes between the sections within a post), no more and no less. For a while I thought the delimiter was seven dashes, and during that time I was convinced that the WordPress import module was broken.
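The encoding and delimiter points above can be sketched like this. It is a sketch under assumptions rather than the original script: it is written for Python 3, the function names are made up, and the field layout (TITLE, DATE, BODY) and the date format follow the Movable Type import format as I understand it, so check them against the importer before relying on them. Decoding the bytes to text first and only then calling strip or replace is one common way to avoid the ordering problem; whether that was the actual cause in the original script is a guess.

```python
import io
from datetime import datetime

SECTION_SEP = "-" * 5  # five dashes between the sections of one post
ENTRY_SEP = "-" * 8    # eight dashes between posts


def clean_text(raw_bytes, source_encoding="euc-kr"):
    """Decode bytes to text first, then do string cleanup on the text."""
    return raw_bytes.decode(source_encoding).strip()


def format_entry(title, date, body):
    """Render one post as a Movable Type style entry (assumed layout)."""
    lines = [
        "TITLE: " + title.strip(),
        "DATE: " + date.strftime("%m/%d/%Y %I:%M:%S %p"),
        SECTION_SEP,
        "BODY:",
        body.strip(),
        SECTION_SEP,
        ENTRY_SEP,
    ]
    return "\n".join(lines) + "\n"


def write_entries(path, entries, encoding="utf-8"):
    """Write (title, date, body) tuples with an explicit file encoding."""
    with io.open(path, "w", encoding=encoding) as f:
        for title, date, body in entries:
            f.write(format_entry(title, date, body))
```

Building the two delimiters from a dash count instead of typing them out makes a seven-dash mistake harder to make, and naming the encoding when the file is opened means Korean text is never pushed through an ASCII default.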
For now, I’ve uploaded all the posts to a blog that I created purely for testing purposes.
http://hcigirl.wordpress.com/ (I kinda like this URL!! I am thinking about turning it into my personal webpage.) I realized that there is still a lot of post-processing to do. Many hyperlinks and categories can be substituted to reduce the number of links pointing to my old blog. I also found that, because I didn’t do a rigorous job of stripping out style-related tags, the layout of the blog looks horrible. As the Movable Type import format doesn’t distinguish between categories and tags, I haven’t yet figured out how to categorize or tag the posts.
All this work is for another day, and now I should go and do some laundry. =)