Learning Python to build a web crawler

Posted: 2011/02/19 | Author: Guava | Filed under: ENGLISH, python | Tags: python | 9 Comments

(Image from XKCD)

Although I shouldn’t really be procrastinating, writing for hours makes me depressed. Learning a new programming language makes me happy. Hence, for the past 2~3 weeks, I spent about 2~3 hours on Saturdays or Sundays building a web crawler, which I had never done before. The crawler was implemented to capture posts from my Korean blog and import them into this WordPress blog. In order to do this, I learned a new language, Python.

This is what I did:
1. I opened up an HTTP connection using the urllib2 module.
2. In order to parse the content of interest, I used the BeautifulSoup module. It is built on top of regular expressions and sgml. I can traverse the HTML tree very easily and can search for a node using regular expressions.
3. I dumped it out to a text file in Movable Type format, which was then fed into the WordPress import system.
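
A minimal sketch of step 3, assuming Python 3 and only the most basic fields of the Movable Type import format. The function name format_post and the sample values are hypothetical, not the original script. In this format, five dashes close a multi-line section such as BODY, and eight dashes close a whole post.

```python
from datetime import datetime

SECTION_END = "-" * 5  # closes a multi-line section such as BODY
ENTRY_END = "-" * 8    # closes a whole post


def format_post(title, author, posted, body):
    """Return one post as Movable Type import text (hypothetical helper)."""
    # Build AM/PM by hand so the output does not depend on the locale
    am_pm = "AM" if posted.hour < 12 else "PM"
    lines = [
        "TITLE: " + title.strip(),
        "AUTHOR: " + author.strip(),
        "DATE: " + posted.strftime("%m/%d/%Y %I:%M:%S ") + am_pm,
        SECTION_END,
        "BODY:",
        body.strip(),  # stray leading/trailing white space can upset the importer
        SECTION_END,
        ENTRY_END,
    ]
    return "\n".join(lines) + "\n"


# Made-up sample post
sample = format_post(
    title="Hello",
    author="Guava",
    posted=datetime(2011, 2, 19, 15, 4, 5),
    body="  First imported post.  \n",
)
print(sample)
```

Keeping the two delimiters as named constants makes it harder to mistype the number of dashes by hand.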

I got stuck on these:
1. I had to solve many encoding and decoding errors. For instance, when you open a file the default way, text written to it is encoded as “ascii”. Korean is usually expressed in one of two encodings, ‘utf-8’ or ‘euc-kr’. When I tried to write a Korean string using the default f.write() method, the interpreter stopped with an encoding error. Hence, the file had to be opened using the codecs module.

I had to use many encode and decode functions. Furthermore, the order of the function calls was another source of frustration. When I called common string manipulation functions such as replace or strip, these functions somehow changed the strings in a way that made it impossible to decode them afterwards. I am assuming that this is because these functions also assume that string manipulation is being done on ascii characters. I haven’t figured it out yet.

2. The connection was terminated by the Korean blog server during crawling. I solved this problem by simply running the crawler again from another IP address at home. Someone told me that such a problem can easily be solved by faking the header of the HTTP request. I have not tried this yet, though.

3. The WordPress import module reports an error if there is even a mild formatting error. Getting rid of redundant white space was not that difficult. However, subtle details such as the number of dashes in a delimiter mattered too. For example, the delimiter for each post had to be exactly eight dashes (five dashes for different contents), no more and no less. I thought the delimiter was seven dashes for a period of time, during which I was convinced that the WordPress import module was broken.
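
For the encoding problem in item 1, one common way to avoid this kind of trouble is to decode the downloaded bytes once, do all string manipulation on the decoded text, and encode only when writing the file. The sketch below assumes Python 3 and a source page in euc-kr; the helper names clean_text and write_text and the sample string are made up for illustration. In Python 2, which urllib2 implies, codecs.open or io.open takes the place of open with an encoding argument.

```python
import os
import tempfile


def clean_text(raw_bytes, source_encoding="euc-kr"):
    """Decode the bytes first, then clean up the resulting text."""
    text = raw_bytes.decode(source_encoding)
    # replace() and strip() now work on characters, not on raw bytes
    return text.replace("\r\n", "\n").strip()


def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding instead of relying on a default."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


# Simulated page content as it might arrive from a server, in euc-kr
raw = "  안녕하세요  \r\n".encode("euc-kr")
cleaned = clean_text(raw)

path = os.path.join(tempfile.mkdtemp(), "post.txt")
write_text(path, cleaned)

# Read the file back to check that the Korean text survived the round trip
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

This does not explain why replace or strip broke the strings in the original script; it only shows an order of calls that sidesteps the issue.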

For now, I’ve uploaded all the posts to a blog that I created merely for testing purposes.
http://hcigirl.wordpress.com/ (I kinda like this URL!! I am thinking about turning it into my personal webpage.) I realized that there is still a lot of post-processing that needs to be done. Many hyperlinks and categories can be substituted to reduce the number of links referring to my old blog. I also found out that, because I didn’t do a rigorous job of taking out style-related tags, the layout of the blog looks horrible. As the Movable Type import format doesn’t distinguish between categories and tags, I haven’t figured out how to categorize or tag the posts yet.

All this work is for another day and now I should go and do some laundry. =)

HCI Talks Video Archives

Posted: 2011/02/15 | Author: Guava | Filed under: ENGLISH, HCI | Tags: hci, talks, video | 2 Comments

I am currently collecting web links that contain high quality research talks.

Georgia Tech GVU Center brown bag video archive
http://www.gvu.gatech.edu/bb_archive
Talks from UIST 2006 were recorded. I happened to present there as well. I haven’t changed much in the past 4 years...
http://www.idiap.ch/mmm/talk-webcast/uist-06/
There are 15 talks on this website under the “human computer interaction” category.
http://videolectures.net/Top/Computer_Science/Human_Computer_Interaction/
Stanford Lecture Series (Thanks~ Juho)
http://scpd.stanford.edu/search/publicCourseSearchDetails.do?method=load&courseId=11894
MIT CSAIL HCI Seminar (Thanks~ Hwajung)
http://www.csail.mit.edu/taxonomy/term/34

If you know more useful sites, please let me know. I will add them to the list and acknowledge you for the tip~! ^_^

Google art project http://www.googleartproject.com/

Posted: 2011/02/01 | Author: Guava | Filed under: ENGLISH, MISC | 5 Comments

I enjoy going to art museums. At one point, I thought that museum curators have the most amazing job ever. Google Art Project relieves web surfers of the need to be physically present at an art museum.
On this website http://www.googleartproject.com/, you can go through just the collections inside an art gallery or navigate around the museum, similar to how you navigate around the Google Street View application.

Here are demonstration videos.

Here is what people are saying about this project.
http://www.engadget.com/2011/02/01/google-art-project-offers-gigapixel-images-of-art-classics-ind/

Comment> The navigation controls are really unintuitive. It brings up the frustration that I experienced when using GoogleTV.

Regardless, I am adding this to my list of cool Google websites.
The last website that was added was Google Demo Slam (http://www.demoslam.com/)

Faculty candidate talks in Cornell InfoSci

Posted: 2011/02/01 | Author: Guava | Filed under: cornell, ENGLISH | Tags: cornell, infosci, talk | 4 Comments

Last academic year and this one, the Cornell Information Science Department has been trying to find a faculty candidate for a tenure-track position in the Policy subfield. In general, Policy is not my interest, and I was rather lukewarm about going to the talks and participating in the discussions. Recently, I became more active in going to these talks and joining the discussions as I became interested in learning about the hiring process in an academic institute. So far, two candidates have been invited: Katie Shilton and Laura DeNardis. It was very interesting to attend these two talks as I always appreciate good female presenters. I also learned a lot about interviewing techniques from them even though their topics have nothing to do with what I do. Interestingly, they were strong in different ways. Laura had written numerous books and had a variety of research contacts. Katie had admirable presentation skills. I especially learned a lot from her question-answering skills. While I do not remember details of Katie’s talk other than the general topic and some snippets of personal interest, I took notes on Laura’s talk, which made me reflect on her research topics and candidate talks in general.

The title of Laura’s talk was “Arrangement of technical architecture is an arrangement of political influence and power”. Laura started her talk by giving us a general picture of what “Internet Governance” (Protocols, Critical Internet Resources, Communication Rights, Intellectual Property Rights, Internet Security & Infrastructure management) is. I really appreciated how she started her talk. She gave the audience a very nice reference frame to understand the rest of her talk. This made me realize that in my current presentation deck (for my will-it-happen? interview), I do not have any slide that gives an overview of my field (HCI) or subfield (devices and interaction techniques) to the general audience.

Then she narrowed down to the first topic, protocols, and gave three protocol examples: ODF, IPv6, and MAC addresses. While she talked about these examples, issues such as open source licenses, commercial versus non-commercial standards, and socio-economic constraints to participation (thx Karin) were raised. She spent a lot of her time talking about IPv6 versus IPv4. She argued that it is a moral imperative for the US to quickly upgrade to IPv6 to maintain dominance in the global Internet market. It was very evident that she knew her topic and her field very well. By comparison, I still struggle to show such confidence when giving talks.

It is always debatable whether an academic should also pursue popularity. This is the question that I kept asking myself for the rest of the day. Laura’s main example, the political power play in the transition between IPv4 and IPv6, is not as hot these days as Katie’s main topic, personal privacy in designing sensor networks. P raised a question that made me think about this issue even longer. When Laura talked about MAC addresses invading privacy, he asked why we should care about MAC addresses when our cell phones are more intrusive. In my personal opinion, the privacy implications of MAC addresses and those of sensors (e.g., GPS) are orthogonal topics. Hence, when this question was raised, I personally felt that the popularity of the topic (IPv6 vs. sensors in a mobile phone) was more in question than anything else. This was actually a sad realization because my research topic is also considered a fading star. The market for digital pens has shrunk so much thanks to Apple and Steve Jobs. When people from industry ask me questions, they ask, “We do not have any plan to support styluses. If so, how do you plan to make your research useful to us?”.

Lastly, attending these policy talks is a really new experience for me because I find it very difficult to internalize the basic premises of this work. I am a strong believer in “regulations make things slow”. The entire time Katie was advocating that we should impose security concerns on designers, I sincerely wanted to ask, “Did you observe any scenarios where the security concerns limited how much cellular network applications can innovate?”. Thankfully I didn’t really have to raise these “I don’t believe your research” questions because someone else always did. It was a truly fruitful experience to see how the speakers responded to these questions. Some candidates become very defensive and try to convince the opponent, which usually does more harm to the speaker than anything else. Some candidates politically avoid the touchy topic, beat around the bush, and suggest that “maybe you and I can agree that we disagree?”. From my observations, the former seemed to make the speaker more satisfied. The latter seemed to make the rest of the crowd happier.

How to write an ACM SIGCHI rebuttal

Posted: 2010/12/18 | Author: Guava | Filed under: HCI | Tags: CHI, writing | 15 Comments

I originally wanted to post this during the SIGCHI rebuttal period. However, I was afraid that it may hurt the acceptance rate of my paper somehow. (Maybe by revealing my identity to the external reviewers? I don’t know. Anything is possible. No?) Hence, I decided to write this immediately after the notification, while everything is still hot in my head. Maybe students that have to write a rebuttal for CHI2012 may find my post useful.

First, I will start with my opinion on why it is difficult to write a convincing rebuttal.

If 90% of writing a paper requires knowing how to explain a research idea and 10% knowing how to convince another researcher, writing a rebuttal requires 10% of the former and 90% of the latter. As a junior researcher (like myself, a graduate student), the only senior researcher you talk to on a daily basis is your advisor. Yet, your advisor is only one sample point among the pool of senior researchers. How on earth would you know how to convince another senior researcher when you do not even know how to initiate a serious conversation with them?

It is always difficult to take criticism. This is especially the case for people like us Ph.D. students who always strive to produce results that are flawless. We seldom hear to our face that our work is crap. Additionally, the criticism that we have to take in the review seems unjust, especially when you believe that the person who’s making the criticism spent maybe a day or two reading a document that you spent endless nights editing over and over again for several months. Some reviewers are actually nasty too. Although it is written in fancy and erudite terms, sometimes the reason that they are rejecting your paper is simply “you did not do a good job in impressing me” or “I do not buy your research story”. How would you possibly stay sane when you read these comments?

Despite the many reasons to yell at your reviewers and say, “you are full of s**t~!!”, we should not express this even indirectly while writing our rebuttal. In the past, I always had to discard the very first draft that I produced after reading the harsh reviews. My rebuttal was bitter, smirking, cynical and mean. This was not obvious until I had slept on it for at least two days, crying over it, and returned to my calm and reasonable self.

This year, due to my post (in my Korean blog) about my experience in writing CHI rebuttals in the past years, I’ve been asked by several junior students (outside my institute) to help out in writing their rebuttals. While doing so, there were several tips that I kept repeating. Here are some of them.

[Understanding and analyzing the review]
Read your reviews with another coauthor and have an in-depth discussion. It is important to address the most important issues first and address only the problems that reviewers raise. Sometimes, I realize that I misunderstood what the reviewer meant and was addressing something that was totally unnecessary. Sometimes, I structured my arguments in the wrong order: from least important to most important. Many authors actually make these mistakes during the writing process, not because they are careless, but because the reviews are somewhat encrypted. Not all reviewers are kind enough to tell us “A is unconvincing and A’ is my opinion. B and C are what I do not understand but authors should only address B because C is not as important”. It comes more like this: “A is weird, B is weird, C is weird”. Usually, the meta-reviewer tries to help us by decrypting the dialect of the external reviewers so that the authors are not at a loss. However, not all meta-reviewers are nice either. For this reason, I always talk with my advisor or one of my coauthors for 2~3 hours about the reviews before writing anything. This usually helps a lot.

[Writing process]
Agree with your reviewers. Last year, one of the rebuttals for a paper that I reviewed basically stated that “R1 (myself) is wrong because so and so, and our paper is awesome”. This rebuttal didn’t acknowledge some of the important problems that I pointed out and tried to challenge my judgement. I was offended and became stricter in finding faults in the paper during the discussion period. This is the last thing that the authors want: making an enemy among the reviewers. To make an ally, you have to tell them how useful their feedback is, and you have to sincerely mean it too.

Specify how the camera-ready version will reflect the reviewer’s request. Often, there are rebuttals that just say, “I understand R2’s point”. This is only a half-baked response. The goal of the rebuttal is to demonstrate how the camera-ready version will be changed according to the issues raised. Hence, the response should be more specific and go one step further. Like this: “the question A is raised by R2 because we only explained B in section C when it is also needed in section D. In the camera-ready version we will clarify B in section D”.

Do not say that the draft will be improved with a major change. I have seen several authors say in their rebuttal, “After the submission we did A, B, and C, which address all of R1, R2, and R3’s questions and will be included in the final draft”. This approach is very bad. First, you are admitting that the current draft has many issues. Second, during the PC meeting, papers are discussed “as is”. If it is concluded in the PC meeting that the paper requires major revision, PC members advise that it should be submitted to a future venue. A better approach is to figure out what the reviewers misunderstood. Explain why there was a miscommunication and offer ways for reviewers to resolve those misunderstandings. Point to a paragraph or a figure in the paper. If needed, direct them to a reference that is not cited in your paper. This is what a rebuttal is for: to clarify.

[Formatting and Style]
Although the 5000-character limit may seem insufficient to explain everything, do not hesitate to allocate some of those characters to white space and to phrases such as (in response to R1, as mentioned by R2, in our RELATED WORK section, in p8~9). At a glance, reviewers want to see which of the major points raised have been addressed and which part of the paper they should read again. Sometimes, as a reviewer, I use the web browser search tool (namely Ctrl+F) to locate my reviewer id (RX) in the rebuttal and read the accompanying paragraph more carefully to make sure that I didn’t miss anything.

Last but not least, write short and direct sentences. Any sentence that you write in your rebuttal has a 50/50 chance of either helping or hurting your paper. The longer and more indirect a sentence, the higher the chance of misinterpretation. On top of that, reviewers have a very short attention span. If a sentence becomes convoluted, they will read what they think the sentence is saying as opposed to what the sentence is actually saying.

The biggest question behind all this is, “Do reviewers actually change their score after reading a rebuttal?”.

And the answer is “YES”~!! Among the 7 papers that I reviewed this year, I increased the score by 0.5 for one paper because I was happy to learn something that I didn’t know from the rebuttal. Among my 4 papers (2 in previous years and 2 in the current year) that have been accepted, the scores of 3 papers actually increased (by +0.4, +0.1, +0.4) after the rebuttal period.

Although it is a painful and tedious process to write a good rebuttal, it is very rewarding once you DO write a good one.

HYUNYOUNG SONG