Learning Python to build a web crawler


from XKCD

Although I shouldn’t really be procrastinating, writing for hours makes me depressed. Learning a new programming language makes me happy. Hence, for the past 2~3 weeks, I spent about 2~3 hours on Saturdays or Sundays building a web crawler, which I had never done before. The crawler captures posts from my Korean blog and imports them into this WordPress blog. To do this, I learned a new language, Python.

This is what I did:
1. I opened an HTTP connection using the urllib2 module.
2. To parse the content of interest, I used the BeautifulSoup module. It is built on top of regular expressions and sgml. I can traverse the HTML tree very easily and search for a node using regular expressions.
3. I dumped the result out to a text file in Movable Type format, which I then fed into the WordPress import system.
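
Step 2 could look roughly like the sketch below. It assumes Python 3 and swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages. The h2 tag and the "title" class are made-up stand-ins, since the post doesn't say what markup the Korean blog used.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside every <h2 class="title"> element.

    The tag name and class are hypothetical; adjust them to the real page.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of each matching title in document order."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

Calling extract_titles() on an already-decoded page string returns a plain list of titles; the same pattern would work for post bodies or dates with a different tag check.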

I got stuck on these:
1. I had to solve many encoding and decoding errors. For instance, when you open a file the default way, text written to it is encoded as “ascii”. Korean is usually expressed in one of the following encodings: ‘utf-8’ or ‘euc-kr’. When I tried to write a Korean string using the default f.write() method, the interpreter stopped with an error. Hence, the file had to be opened using the codecs module.

I had to use many encode and decode functions. Furthermore, the order of function calls was another source of frustration. When I called common string manipulation functions such as replace or strip, they somehow changed the strings in a way that made them impossible to decode afterwards. I am assuming that this is because these functions also assume that they are manipulating ascii characters. I haven’t yet figured it out.

2. The Korean blog server terminated the connection during crawling. I solved this problem by simply running the crawler again from another IP address at home. Someone told me that this kind of problem can easily be solved by faking the header of the HTTP request. I have not tried this yet, though.

3. The WordPress import module reports an error if there is even a mild formatting error. Getting rid of redundant white space was not that difficult. However, subtle details such as the number of dashes in a delimiter mattered too. For example, the delimiter between posts had to be exactly eight dashes (five dashes between the different sections within a post), no more and no less. For a while I thought the delimiter was seven dashes, and during that time I was convinced that the WordPress import module was broken.
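
The three snags above can be sketched together in a few lines. This is only an illustration under assumptions, not the crawler from this post: it is written for Python 3 (where urllib2 became urllib.request), the User-Agent string is a made-up value, and whether a browser-like header would have stopped the server from dropping the connection is untested. Decoding once and doing every replace or strip on the decoded text is one common way to avoid the ordering problem described in point 1.

```python
import urllib.request

# Hypothetical header value for illustration; not taken from the original crawler.
USER_AGENT = "Mozilla/5.0 (compatible; blog-import-sketch)"

ENTRY_DELIMITER = "-" * 8    # exactly eight dashes between posts
SECTION_DELIMITER = "-" * 5  # exactly five dashes between sections of a post


def build_request(url):
    """Build (but do not send) a request with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def clean_text(raw, encoding="euc-kr"):
    """Decode the raw bytes first, then manipulate the decoded string."""
    text = raw.decode(encoding)
    return text.replace("\r\n", "\n").strip()


def format_entry(title, date, body):
    """Format one post as a Movable Type import entry."""
    lines = [
        "TITLE: " + title,
        "DATE: " + date,
        SECTION_DELIMITER,
        "BODY:",
        body,
        SECTION_DELIMITER,
        ENTRY_DELIMITER,
    ]
    return "\n".join(lines) + "\n"


def write_export(path, entries):
    """Write (title, date, body) tuples to a UTF-8 file, one entry each."""
    # An explicit encoding avoids falling back to a default that cannot
    # represent Korean text.
    with open(path, "w", encoding="utf-8") as f:
        for title, date, body in entries:
            f.write(format_entry(title, date, body))
```

Keeping the two delimiters as named constants means the "seven dashes" mistake can only happen in one place.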

For now, I’ve uploaded all the posts to a blog that I created merely for testing purposes.
http://hcigirl.wordpress.com/ (I kinda like this URL!! I am thinking about turning it into my personal webpage.) I realized that there is still a lot of post-processing that needs to be done. Many hyperlinks and categories can be substituted to reduce the number of links referring to my old blog. I also found out that, because of style-related tags that I didn’t do a rigorous job of taking out, the layout of the blog looks horrible. As the Movable Type import format doesn’t distinguish between categories and tags, I haven’t figured out how to categorize or tag the posts yet.

All this work is for another day, and now I should go and do some laundry. =)


HCI Talks Video Archives

I am currently collecting web links that contain high quality research talks.

If you know more useful sites, please let me know. I will add to the list and acknowledge you for the tip~! ^_^


Google art project http://www.googleartproject.com/

I enjoy going to art museums. At one point, I thought that museum curators have the most amazing job ever. Google art project relieves web surfers of the need to be physically present at an art museum.
On this website http://www.googleartproject.com/, you can go through just the collections inside an art gallery or navigate around the museum, similar to how you navigate around the google street view application.

Here are demonstration videos.

Here is what people are saying about this project.

http://www.engadget.com/2011/02/01/google-art-project-offers-gigapixel-images-of-art-classics-ind/

Comment> The navigation controls are really unintuitive. It brings up the frustration that I experienced when using GoogleTV.

Regardless, I am adding this to my list of cool google websites.
The last website that was added was google demo slam (http://www.demoslam.com/)


Faculty candidate talks in Cornell InfoSci

Last year and this academic year, the Cornell Information Science Department has been trying to find a faculty candidate for a tenure-track position in the Policy subfield. In general, Policy is not my interest, and I was rather lukewarm about going to the talks and participating in the discussions. Recently, I became more active in going to these talks and joining the discussions as I became interested in learning about the hiring process at an academic institution. So far, two candidates have been invited: Katie Shilton and Laura DeNardis. It was very interesting to attend these two talks as I always appreciate good female presenters. I also learned a lot about interviewing techniques from them even though their topics have nothing to do with what I do. Interestingly, they were strong in different ways. Laura had written numerous books and had a variety of research contacts. Katie had admirable presentation skills. I especially learned a lot from the way she answered questions. While I do not remember details of Katie’s talk other than the general topic and some snippets of personal interest, I took notes on Laura’s talk, which made me reflect on her research topics and on candidate talks in general.

The title of Laura’s talk was “Arrangement of technical architecture is an arrangement of political influence and power”. Laura started her talk by giving us a general picture of what “Internet Governance” (Protocols, Critical Internet Resources, Communication Rights, Intellectual Property Rights, Internet Security & Infrastructure management) is. I really appreciated how she started her talk. She gave the audience a very nice frame of reference for understanding the rest of her talk. This made me realize that in my current presentation deck (for my will-it-happen? interview), I do not have any slide that gives a general audience an overview of my field (HCI) or subfield (devices and interaction techniques).

Then she narrowed down to the first topic, protocols, and gave three protocol examples: ODF, IPv6, and MAC addresses. While she talked about these examples, issues such as open source licenses, commercial versus non-commercial standards, and socio-economic constraints to participation (thx Karin) were raised. She spent a lot of her time talking about IPv6 versus IPv4. She argued that it is a moral imperative for the US to quickly upgrade to IPv6 to maintain dominance in the global internet market. It was very evident that she knew her topic and her field very well. By comparison, I still struggle to show such confidence when giving talks.

It is always debatable whether an academic should also pursue popularity. This is the question that I kept asking myself for the rest of the day. Laura’s main example, the political power play in the transition between IPv4 and IPv6, is not as hot these days as Katie’s main topic, personal privacy in designing sensor networks. P raised a question that made me think about this issue even longer. When Laura talked about MAC addresses invading privacy, he asked why we should care about MAC addresses when our cell phones are more intrusive. In my personal opinion, the privacy implications of MAC addresses and those of sensors (e.g., GPS) are orthogonal topics. Hence, when this question was raised, I personally felt that the popularity of the topic (IPv6 vs sensors in a mobile phone) was more in question than anything else. This was actually a sad realization because my research topic is also considered a fading star. The market for digital pens has shrunk so much thanks to Apple and Steve Jobs. When people from industry ask me questions, they ask “We do not have any plans to support styluses. Given that, how do you plan to make your research useful to us?”.

Lastly, attending these policy talks is a really new experience for me because I find it very difficult to internalize the basic premises of this work. I am a strong believer in “regulations make things slow”. The entire time Katie was advocating that we should impose security concerns on the designers, I sincerely wanted to ask, “Did you observe any scenarios where the security concerns limited how much cellular network applications can innovate?”. Thankfully I didn’t really have to raise these “I don’t believe your research” questions because someone else always did. It was a truly fruitful experience to see how the speakers responded to these questions. Some candidates become very defensive and try to convince the opponent, which usually does more harm to the speaker than anything else. Some candidates diplomatically avoid the touchy topic, beat around the bush, and show that “maybe you and I can agree that we disagree?”. From my observations, the former seemed to make the speaker more satisfied. The latter seemed to make the rest of the crowd happier.


How to write an ACM SIGCHI rebuttal

I originally wanted to post this during the SIGCHI rebuttal period. However, I was afraid that it may hurt the acceptance rate of my paper somehow. (Maybe by revealing my identity to the external reviewers? I don’t know. Anything is possible. No?) Hence, I decided to write this immediately after the notification, while everything is still hot in my head.  Maybe students that have to write a rebuttal for CHI2012 may find my post useful.

First, I will start with my opinion on why it is difficult to write a convincing rebuttal.

If 90% of writing a paper requires knowing how to explain a research idea and 10% knowing how to convince another researcher, writing a rebuttal requires 10% of the former and 90% of the latter. As a junior researcher (like myself, a graduate student), the only senior researcher you talk to on a daily basis is your advisor. Yet, your advisor is only one sample point among the pool of senior researchers. How on earth would you know how to convince another senior researcher when you do not even know how to initiate a serious conversation with them?

It is always difficult to take criticism. This is especially the case for people like us Ph.D. students who always strive to produce flawless results. We seldom hear to our face that our work is crap. Additionally, the criticism in the review seems unjust, especially when you believe that the person making it spent maybe a day or two reading a document that you spent endless nights editing over and over again for several months. Some reviewers are actually nasty too. Although it is written in fancy and erudite terms, sometimes the reason that they are rejecting your paper is simply “you did not do a good job in impressing me” or “I do not buy your research story”. How could you possibly stay sane when you read these comments?

Despite the many reasons to yell at your reviewers and say, “you are full of s**t~!!”, we should not express this even indirectly while writing our rebuttal. In the past, I always had to discard the very first draft that I produced after reading the harsh reviews. My rebuttal was bitter, smirking, cynical and mean. This was not obvious until I had slept on it for at least two days, crying over it, and returned to my calm and reasonable self.

This year, because of a post (in my Korean blog) about my past experience of writing CHI rebuttals, I’ve been asked by several junior students (outside my institute) to help them write their rebuttals. While doing so, there were several tips that I kept repeating. Here are some of them.

[Understanding and analyzing the review]
Read your reviews with another coauthor and have an in-depth discussion. It is important to address the most important issues first and address only the problems that reviewers raise. Sometimes, I realize that I misunderstood what the reviewer meant and was addressing something that was totally unnecessary. Sometimes, I structured my arguments in the wrong order: from least important to most important. Many authors actually make these mistakes during the writing process, not because they are careless, but because the reviews are somewhat encrypted. Not all reviewers are kind enough to tell us “A is unconvincing and A’ is my opinion. B and C are what I do not understand but authors should only address B because C is not as important”. It comes more like this: “A is weird, B is weird, C is weird”. Usually, the meta-reviewer tries to help us by decrypting the dialect of the external reviewers so that the authors are not at a loss. However, not all meta-reviewers are nice either. For this reason, I always talk with my advisor or one of my coauthors for 2~3 hours about the reviews before writing anything. This usually helps a lot.

[Writing process]
Agree with your reviewers. Last year, the rebuttal for one of the papers that I reviewed basically stated that “R1 (myself) is wrong because so and so, and our paper is awesome”. This rebuttal didn’t acknowledge some of the important problems that I pointed out and tried to challenge my judgement. I was offended and became stricter in finding faults in the paper during the discussion period. This is the last thing that the authors want: making an enemy among the reviewers. To make an ally, you have to tell them how useful their feedback is, and you have to sincerely mean it too.

Specify how the camera-ready version will change in response to the reviewer’s request. Often, there are rebuttals that just say, “I understand R2’s point”. This is only a half-baked response. The goal of the rebuttal is to demonstrate how the camera-ready version will be changed according to the issues raised. Hence, the response should be more specific and go one step further. Like this: “the question A is raised by R2 because we only explained B in section C when it is also needed in section D. In the camera-ready version we will clarify B in section D”

Do not say that the draft will be improved with a major change. I have seen several authors say in their rebuttal “After the submission we did A, B, and C, which address all of R1, R2, and R3’s questions and which will be included in the final draft”. This approach is very bad. First, you are admitting that the current draft has many issues. Second, during the PC meeting, papers are discussed “as is”. If it is concluded in the PC meeting that the paper requires major revision, PC members advise that it should be submitted to a future venue. A better approach is to figure out what reviewers misunderstood. Explain why there was a miscommunication and offer ways for reviewers to resolve those misunderstandings. Point to a paragraph or a figure in the paper. If needed, direct them to a reference that is not cited in your paper. This is what a rebuttal is for: to clarify.

[Formatting and Style]
Although the 5000-character limit may seem insufficient to explain everything, do not hesitate to spend some of those characters on white space and on phrases such as (in response to R1, as mentioned by R2, in our RELATED WORK section, in p8~9). At a glance, reviewers want to see which of the major points raised have been addressed and which part of the paper they should read again. Sometimes, as a reviewer, I use the web browser search tool (namely Ctrl+F) to locate my reviewer id (RX) in the rebuttal and read the accompanying paragraph more carefully to make sure that I didn’t miss anything.

Last but not least, write short and direct sentences. Any sentence that you write in your rebuttal has a 50/50 chance of helping or hurting your paper. The longer and more indirect a sentence is, the higher the chance of misinterpretation. On top of that, reviewers have a very short attention span. If a sentence becomes convoluted, they will read what they think the sentence is saying as opposed to what the sentence is actually saying.

The biggest question behind all this is, “Do reviewers actually change their score after reading a rebuttal?”.

And the answer is “YES”~!!. Among the 7 papers that I reviewed this year, I increased the score by 0.5 in one paper because I was happy to learn something that I didn’t know from the rebuttal. Among my 4 papers (2 in previous years and 2 in current year) that have been accepted, 3 paper scores actually increased (by +0.4, +0.1, +0.4) after the rebuttal period.

Although it is a painful and tedious process to write a good rebuttal, it is very rewarding once you DO write a good one.


Cornell Hockey Games

This month, I attended two hockey games against other Ivy League schools, Dartmouth and Princeton.

The game against Dartmouth (6th) was with Adam and Hronn. Adam made an effort to come to Ithaca after his submission to WWW. Originally Hronn managed to get two hockey tickets from a CS faculty member. As the Gilmores have three season tickets, I was hoping that Adam could come along even if we had to sit in different sections. On the day of the hockey game, Jim proposed the brilliant idea of exchanging our 2 tickets for their 3 tickets, so that Adam, Hronn and I could enjoy the game in their seats while two of the Gilmores watched the game from the CS faculty seats. Between the 1st and 2nd periods, Rhonda came over to our seats to tell us how impressed she was with Jim’s “intellectual girth” of the day. =)

The game against Dartmouth was relatively relaxing. I had already researched before the game that they are ranked the lowest among the Ivy League teams. Not to my surprise, we won 5-to-1. Because it was obvious that we would win, I was able to pay attention to other aspects of the game. Before, I was occupied with keeping track of which player had crashed against the wall and where the puck went. (Visually following the puck requires quite a bit of concentration. It feels almost like one of those magic tricks where you have to visually track which one of the three cups is hiding a ball.) I noticed that some of the hand signals that the referees made for penalties were not self-explanatory. One time, Cornell pulled our goalie even though our score was already 4 points or so. One of the Cornell players was knocked out on the ice for about 5 minutes after tripping over a Dartmouth player, but I couldn’t figure out what kind of penalty the Dartmouth player got. That night, Adam and I sat across from each other on my living room sofa with our legs braided together, looking up penalty rules on Youtube. (http://en.wikipedia.org/wiki/Penalty_%28ice_hockey%29).

This week, Hronn rushed to my office shouting “I have tickets for the hockey game!”. It was against Princeton, which I had very bad feelings about. Last spring, they beat us 2-to-1 at my very first hockey game. It was quite tragic how we lost. In the very last minute of the last period, they pulled their goalie to play with 6 offensive players and scored a goal with 38 seconds left in the game. The game went to sudden-death overtime and the result wasn’t pretty. Bottom line, I was very excited to be in the ice rink to see the game, hoping to see revenge.

When Hronn showed me the tickets for the Princeton game, I was slightly puzzled. One was in row 14 and the other was in row 2 of section B. She replied, “We will be fine, it’s in the student section and it’s going to be crazy anyways. We will be able to swap seats or squeeze you in to row 2.” And yes, she was right on both points. We were both able to enjoy the game in row 2, which was spectacular. At the same time, the entire section was totally out of control.

The student section cheers during the ENTIRE game. Everybody screams their lungs out so hard that I often saw droplets of spittle hit the plastic wall in front of us. One guy behind us was eating popcorn. From time to time, the popcorn would fly and hit the plastic wall too. Hronn and I felt like we were being baptized by the Cornell undergrads’ spit from behind. We told each other that next time we will remember to wear a cap and clothes that are due for the laundry. Furthermore, some undergrads were clearly drinking alcohol during the game. When the student section shouted “Let’s go Red”, the alcohol fumes from behind made us quite dizzy.

Nevertheless, I learned so many cheers (http://www.elynah.com/?cheers), some of which were very funny, some of which were very brutal and atrocious. Whenever one of our players scored a goal, the entire student section shouted “Sieve, Sieve, Sieve, Sieve, You SUCK!”, followed by “It’s all your fault~, It’s all your fault”. When the other team’s goalie managed to block a puck, the students shouted “Lucky sieve, Lucky sieve~”. This was all cute and funny. However, whenever a period ended and the Princeton coach was leaving the bench, the entire student section shouted “Bald, Bald, Bald x10”, which I thought was too insulting.

When 2 minutes were left, Hronn and I both remembered to pull out our key chains and jingle them. I told Hronn, “Thank you for sharing your ticket with me~!”. Hronn replied, “Thanks to you, this time I know that the key chain jingle in the third period means it’s the end of the game!”


Google search engine – Hyunyoung Song

My personal web page http://www.cs.umd.edu/~hsong inherits a high page rank from the high page rank of its host site, http://www.cs.umd.edu/. Hence, when “Hyunyoung” or “Hyunyoung Song” is entered as a keyword, my webpage is the first item in the search results.

Another identity competing for the hyunyoung keyword is a Korean singer named “Hyunyoung” who’s known for her song “Nuna’s dream” (Nuna is a generic term for “older sister”). Usually, google is very good at distinguishing whether a keyword phrase is entered to look for one of my publications or to search for the Korean singer. For example, if it is “Maryland Hyunyoung”, Google directs the user to my webpage. If it is “Hyunyoung singer”, Google directs the user to the singer’s webpage.

Today, as I was looking at my visitor log and some of the searched keywords, I found a phrase that was very obviously intended to find the Korean singer, but the google engine thought that my webpage was more relevant.

The phrase was: “Hyunyoung Nuna Song”

Hahahaha

