Wednesday, April 6, 2011

www2011 conference - posters

This is my third post on the www2011 conference. The first and second posts cover the conference overview and the papers; this one covers the posters.

Posters

Text Sentiment Analysis using Stop Words takes a very innovative approach: it uses the stop words, and the gaps between them, to analyze the sentiment of a text. This is the complete opposite of most text processing/information extraction algorithms, which throw the stop words away. A nice piece of out-of-the-box thinking.
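
To make that concrete, here's a minimal sketch of how such features could look (my own illustration, not the authors' method): reduce each text to its stop words plus the gap since the previous stop word, then train an ordinary classifier on that representation.

    # Minimal sketch: reduce each text to stop words and the gaps between them,
    # then train an off-the-shelf classifier on that representation.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    STOP_WORDS = {"the", "a", "an", "is", "was", "not", "but", "and", "of", "to"}

    def stopword_pattern(text):
        tokens, pattern, gap = text.lower().split(), [], 0
        for tok in tokens:
            if tok in STOP_WORDS:
                pattern.append("%s_gap%d" % (tok, gap))  # stop word + words since last one
                gap = 0
            else:
                gap += 1
        return " ".join(pattern)

    # Hypothetical toy corpus; the real work would use a labeled review dataset.
    texts = ["the movie was not good but the music was great",
             "a wonderful film and a moving story to remember"]
    labels = [0, 1]
    X = CountVectorizer().fit_transform(stopword_pattern(t) for t in texts)
    clf = LogisticRegression().fit(X, labels)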

Evaluation of Valuable UGC in Social News websites analyzes the value of UGC to social news sites. It finds that current news and events get a lot more value from UGC than technology news does.

Hierarchical Organization of Unstructured Consumer Reviews tries to organize consumer reviews into a hierarchy of product aspects (iPhone -> software, speaker, battery, ...) capturing what consumers liked and disliked.

ReadAlong: Reading Articles and Comments Together uses bag-of-words and topic models (extracted using LDA) to attach comments to the parts of an article they refer to (typically a comment is attached to the complete article, even though it may be about just one part of it). This is joint work between Yahoo! and IISc.
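
As a rough sketch of the matching step (my interpretation; plain tf-idf stands in here for the paper's LDA topic vectors), one could score each comment against each paragraph and attach it to the best match:

    # Rough sketch: attach each comment to its best-matching article paragraph.
    # Plain tf-idf cosine similarity stands in for the paper's LDA topic vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    paragraphs = ["The phone's battery easily lasts two days of normal use.",
                  "The camera, however, struggles badly in low light."]
    comments = ["Battery life is great, I only charge every other day.",
                "Low light photos come out grainy for me too."]

    vec = TfidfVectorizer().fit(paragraphs + comments)
    sims = cosine_similarity(vec.transform(comments), vec.transform(paragraphs))
    for comment, row in zip(comments, sims):
        print(comment, "->", paragraphs[row.argmax()])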

Web Information Extraction using Markov Logic Networks proposes the use of MLNs for general-purpose extraction of structured information from websites. They demonstrate this for specific domains (like restaurants and books). This is joint work between Yahoo!, IISc and Microsoft.

Detecting Group Review Spam targets the very specific problem of review spam committed by groups of people (as opposed to spam by individuals). They identify groups of reviewers and, based on a set of features, determine whether the patterns the members follow together indicate abnormally uniform group behavior.

Classification Based Framework for Concept Summarization is another work from Yahoo!. It groups images into concepts using a classification-based framework and uses LDA to get category information.

Spammer Networks in Twitter analyzes the collaborative strategies used by groups of spammers on Twitter to avoid detection and increase reach. Repeated URLs in recent tweets were used to initially identify a set of suspected spammers. The spammers get legitimate users to follow them via follow-backs, and they also follow each other. An interesting analysis of spammer behavior patterns.
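
A toy version of that seeding step (my reading of it, with a made-up data shape) could look like:

    # Toy version of the seeding step: flag users whose recent tweets
    # repeat the same URL at least `min_repeats` times.
    from collections import Counter

    def suspected_spammers(urls_by_user, min_repeats=3):
        # urls_by_user: {user: [URLs found in that user's recent tweets]}
        suspects = set()
        for user, urls in urls_by_user.items():
            counts = Counter(urls)
            if counts and counts.most_common(1)[0][1] >= min_repeats:
                suspects.add(user)
        return suspects

    print(suspected_spammers({"u1": ["http://spam.example"] * 5,
                              "u2": ["http://a.example", "http://b.example"]}))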

www2011 conference - part 2

This is the second post on the www2011 conference. The first post is here.

Social Network Algorithms

The Network Bucket Testing paper was from Facebook. It solves the interesting problem of how to select a set of users who are connected (friends) but still form a sample representative of the overall population. This really matters when you want to do A/B (bucket) testing of a social feature. The paper uses a novel walk-based method to do this.
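
To show the flavor of a walk-based sample (a toy version; the paper's actual walk is much more careful about sampling bias), one could grow a connected sample by randomly walking the friendship graph from a seed user:

    # Toy illustration: grow a connected user sample via a random walk.
    # The paper's actual method corrects for the bias such walks introduce.
    import random

    def walk_sample(graph, seed, size):
        # graph: {user: [friends]}; returns a connected set of about `size` users
        sample, current = {seed}, seed
        while len(sample) < size:
            current = random.choice(graph[current])
            sample.add(current)
        return sample

    friends = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
    print(walk_sample(friends, "a", 3))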

Information Credibility

Limiting the spread of misinformation in social networks deals with the problem of identifying the nodes in a social graph that will help stop a "bad campaign" and save the remaining nodes. The paper also deals with the case where the state of a node (affected/unaffected) is not known.

Information Credibility on Twitter is another paper that Yahoo! Research is part of. It tries to automatically classify tweets as credible or not credible based on a set of features: characteristics of the tweet itself (size, URLs, hashtags, etc.), the network (author, friends, followers), propagation (retweets, number of tweets), popularity, and so on. Lots of useful ideas that carry over to any user-generated content.
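
A hedged sketch of the setup (the features here merely paraphrase the paper's categories; they are not its exact list):

    # Hedged sketch: represent each tweet by a few credibility features and
    # train a standard classifier (the paper uses many more features).
    from sklearn.tree import DecisionTreeClassifier

    def tweet_features(tweet):
        text = tweet["text"]
        return [len(text),               # tweet size
                int("http" in text),     # contains a URL
                text.count("#"),         # hashtag count
                tweet["followers"],      # author's network
                tweet["retweets"]]       # propagation

    # Hypothetical labeled examples: 1 = credible, 0 = not credible.
    tweets = [{"text": "Quake confirmed by USGS http://t.co/x", "followers": 9000, "retweets": 150},
              {"text": "OMG everyone panic now!!!", "followers": 12, "retweets": 0}]
    labels = [1, 0]
    clf = DecisionTreeClassifier().fit([tweet_features(t) for t in tweets], labels)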

Diffusion

Who says what to whom on twitter is from Cornell and Yahoo! Research. Some interesting claims: 50% of the URLs consumed on Twitter are generated by 20K elite users; URLs broadcast by different categories of users have different lifespans; most users get their content from other ordinary users (who are well connected and follow the elite) in a two-step process; and news URLs are short-lived, blog URLs are long-lived, and music/video URLs persist nearly forever.

Information Spread

Information spreading in context reaches a surprising conclusion: how many people a user forwards information to, and the total coverage the information achieves, can be captured by a simple stochastic branching model and are largely independent of context.
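
A toy simulation in the spirit of such a branching model (parameters are mine and purely illustrative): each recipient forwards to a random number of others, with the content playing no role.

    # Toy branching-process simulation: each recipient independently
    # forwards to a random number of others; content plays no role.
    import random

    def cascade_size(p_forward=0.3, max_fanout=4, cap=100000):
        reached, frontier = 0, 1
        while frontier and reached < cap:
            reached += frontier
            frontier = sum(random.randint(1, max_fanout)
                           for _ in range(frontier) if random.random() < p_forward)
        return reached

    # Average coverage over many runs (mean offspring 0.3 * 2.5 < 1, so cascades die out).
    print(sum(cascade_size() for _ in range(1000)) / 1000.0)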


I'll cover the posters in the next blog.

Tuesday, April 5, 2011

www2011 conference

I attended the www2011 conference in Hyderabad (Mar 29 - Apr 1). The conference was held in India for the first time and was attended by more than 800 people from 50 countries. There were 3 keynotes, 3 panels, 81 papers (spread across 27 sessions), 88 posters and 25 demos.

Yahoo! (who I work for) was well represented, with more than 26 papers/posters. We had a booth where we demoed YQL, using its console to show how one can get real-time tweets from Twitter translated into many languages (including Hindi) via the Google Translate API. Yahoo! Clues was also demoed. These demos were well received by those who stopped by; YQL wowed them with the power it provides for developing mashup applications.

About the conference itself: most of the sessions I attended had lots of papers related to Twitter and social media. It seemed as if the entire research community is doing free research for Twitter (and no one from Twitter was at the conference, AFAIK). That is the power of Twitter's open data, and a measure of how enamored the web community is with it.

The best poster was about predicting popular messages on Twitter. The best paper proposed a new model for product search based on maximizing the value for money for the user.

Dr. Abdul Kalam gave the first keynote, on the web for societal transformation. He urged the WWW research community to help break all barriers, especially language, and make the web accessible to each and every one.

Tim Berners-Lee, inventor of the WWW, gave the second keynote, on designing the web for an open society. He touched upon openness and net neutrality, balancing accountability against anonymity (in expressing opinions), and democracy and transparency through the web.

Christos Papadimitriou gave the final keynote, on Games, Algorithms and the Internet. He discussed Nash equilibria (in non-zero-sum multi-player games) and the price of anarchy: the equilibrium cost of selfish routing can be 4/3 of the optimal (about 33% more), and in general the cost of agents each optimizing for their own interest can grow unbounded.
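
The classic worked instance of that 4/3 bound is Pigou's example from the selfish-routing literature (my reconstruction, not a quote from the talk): route one unit of traffic over link A with latency x or link B with constant latency 1.

    # Pigou's example: selfish users all take link A (latency x), since its
    # latency never exceeds link B's constant latency of 1.
    def total_cost(flow_on_a):
        return flow_on_a * flow_on_a + (1.0 - flow_on_a)  # flow * latency, summed over links

    equilibrium = total_cost(1.0)                             # everyone on A: cost 1
    optimum = min(total_cost(f / 100.0) for f in range(101))  # ~0.75 at a 50/50 split
    print(equilibrium / optimum)                              # ~4/3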

Papers

Monetization:

Incentivizing high quality UGC is one of the papers from Yahoo! Research. It proposes a game-theoretic model that balances quantity against quality, encouraging users to contribute content at optimal quality (driven by a simple viewer-rating model).

Buy it Now or take a chance analyzes the problem with the second-price auction (SPA) in scenarios like highly targeted content. With much more precise targeting, the number of advertisers interested in an impression shrinks, and it can become attractive to just a single advertiser. In that case the SPA is a losing proposition for the content owner/auctioneer. The paper proposes offering the advertiser a buy-it-now price (or a chance to bid instead), which does better than a pure SPA.
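
A toy illustration of why (the numbers are mine, not the paper's model): with a single serious bidder, a second-price auction collapses to the reserve price, while a posted buy-it-now price can capture much more of the bidder's value.

    # Toy comparison: one bidder per auction, values uniform on [0.5, 1.0].
    import random

    def spa_revenue(bids, reserve=0.1):
        live = sorted(b for b in bids if b >= reserve)
        if not live:
            return 0.0
        return live[-2] if len(live) >= 2 else reserve  # winner pays 2nd price or reserve

    random.seed(0)
    auctions = [[random.uniform(0.5, 1.0)] for _ in range(10000)]
    spa = sum(spa_revenue(b) for b in auctions) / len(auctions)
    buy_it_now = 0.5                    # every bidder's value exceeds this posted price
    print(spa, buy_it_now)              # reserve-level revenue vs. 0.5 per sale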

Spatio Temporal Analysis:

Unified Analysis of Streaming News is joint work between Yahoo! Research and CMU. The paper unifies the clustering, categorization and analysis (all three) of news articles to identify key entities and topics in the stories and to reveal the temporal structure of stories as they evolve. The approach uses the recurrent Chinese restaurant process for story clustering, Latent Dirichlet Allocation (LDA) for topic extraction, and Sequential Monte Carlo for inference.

Temporal Dynamics:

We know who you followed last summer (on Twitter) uses bounding methods to estimate when one user followed another, which is very useful for studying a celebrity's followers. The idea is to combine the Twitter follower list, which is returned in time-sorted order (latest follower first), with account-creation times. One can derive time bounds for each user (their follow time has to be later than their account-creation time, but earlier than that of the next, more recent follower), and across a set of users these bounds give a reasonable approximation.
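
Here's a sketch of the lower-bound half of that idea as I understood it (timestamps hypothetical): walking the newest-first follower list from the oldest end, each account-creation time pushes up the earliest possible follow time of everyone who followed later.

    # Sketch of the lower-bound propagation over a newest-first follower list.
    def follow_time_lower_bounds(creation_times):
        # creation_times: follower account-creation times, newest follower first
        bounds, running = [], 0
        for t in reversed(creation_times):      # walk from the oldest follower
            running = max(running, t)           # a follow can't predate the account
            bounds.append(running)
        return bounds[::-1]

    # Hypothetical timestamps (in days): the account created on day 150
    # raises the bound for everyone who followed after it.
    print(follow_time_lower_bounds([120, 80, 150, 30]))  # -> [150, 150, 150, 30]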

Like Like Alike is a joint effort from Yahoo!. The paper proposes using users' interests and their friendships together (rather than in isolation) for both interest targeting and friend recommendation.

The second and third posts cover the rest of the papers and the posters.