Query Log Analysis:

Social and Technological Challenges

A workshop at the 16th International World Wide Web Conference
May 8, 2007 - Banff, Alberta, Canada


Workshop Program

The workshop will be held in three parts. The first part will give practitioners and researchers an opportunity to share recent research using query log data. Speakers will be selected from submitted papers. The second part will include discussions on legal and ethical challenges associated with query logs and privacy. Speakers will be invited from relevant organizations. The third part will include a panel-led discussion focusing on how to move forward in query log research.

See the Schedule for details on the day and links to slides.

Confirmed Speakers / Panel Members

Here is a list of just some of the speakers expected to present:

 

Accepted full length papers

User 4XXXXX9: Anonymizing Query Logs  
Eytan Adar — University of Washington

Abstract: The recent release of the America Online (AOL) Query Logs highlighted the remarkable amount of private and identifying information that users are willing to reveal to a search engine. The release of these types of log files therefore represents a significant liability and compromise of user privacy. However, without such data the academic community greatly suffers in its ability to conduct research on real search engines. This paper proposes two specific solutions (rather than an overly general framework) that attempt to balance the needs of certain types of research while preserving individual privacy. The first solution, based on a threshold cryptography system, eliminates highly identifying queries, in real time, without preserving history or statistics about previous behavior. The second solution attempts to deal with sets of queries that, when taken in aggregate, are overly identifying. Both are novel and represent additional options for data anonymization.  [  paper  |  slides  ]


A Study of Mobile Search Queries in Japan  
Ricardo Baeza-Yates, Georges Dupret and Javier Velasco — Yahoo!

Abstract: In this paper we study the characteristics of search queries on mobile phones in Japan, comparing them with previous results of generic search queries in Japan and mobile search queries in the USA. We confirm some results while finding some interesting differences in the query distribution, the use of the different script languages, and query topics.  [  paper  |  slides  ]


Web Search Engine Evaluation using Clickthrough Data and a User Model  
Georges Dupret, Vanessa Murdock and Benjamin Piwowarski — Yahoo!

Abstract: Traditional search engine evaluation relies on a list of query document pairs along with a score reflecting the document relevance to the query. The score is generally a human assessment, but nothing is said explicitly about the actual user behavior. In this paper we illustrate with a toy model that once the user behavior is agreed upon, the human assessment can be eliminated and the engine performance can be evaluated based on the clickthrough data of past users.  [  paper  |  slides  ]


Query Logs Alone are not Enough  
Carrie Grimes, Diane Tang and Daniel M Russell — Google Research

Abstract: The practice of guiding a search engine based on query logs observed from the engine's user population provides large volumes of data but potentially also sacrifices the privacy of the user. In this paper, we ask the following question: Is it possible, given rich instrumented data from a panel and usability study data, to observe complete information without routinely analyzing query logs? What unique benefits to the user could hypothetically be derived from analyzing query logs? We demonstrate that three different modes of collecting data, the field study, the instrumented user panel, and the raw query log, provide complementary sources of data. The query log is the least rich source of data for individual events, but has irreplaceable information for understanding the scope of resources that a search engine needs to provide for the user.  [  paper  |  slides  ]


Functional Faceted Web Query Analysis  
Viet Bang Nguyen and Min-Yen Kan — National University of Singapore

Abstract: We propose a faceted classification scheme for web queries. Unlike previous work, our functional scheme ties its classification to actionable strategies for search engines to take. Our scheme consists of four facets: ambiguity, authority sensitivity, temporal sensitivity and spatial sensitivity. We hypothesize that the classification of queries into such facets yields insight into user intent and information needs. To validate our classification scheme, we asked users to annotate queries with respect to our facets and obtained high agreement. We also assess the coverage of our faceted classification on a random sample of queries from logs. Finally, we discuss the algorithmic approaches we take in our current work to automate such faceted classification.  [  paper  |  slides  ]


Can We Find Common Rules of Browsing Behavior?  
Ganesan Velayathan and Seiji Yamada — NII Tokyo, Japan

Abstract: This paper describes our efforts to investigate factors in a user's browsing behavior in order to automatically evaluate web pages that the user shows interest in. To evaluate web pages automatically, we developed a client-side logging/analyzing tool: the GINIS Framework. We do not focus on just clicking, scrolling, navigation, or duration of visit alone, but we propose integrating these patterns of interaction to recognize and evaluate user response to a given web page. Unlike most previous web studies that have analyzed access seen at proxies or servers, this work focuses primarily on client-side user behavior using a customized web browser. First, GINIS unobtrusively gathers logs of user behavior through the user's natural interaction with the web browser. Then it analyzes the logs and extracts effective rules to evaluate web pages using the C4.5 machine learning system. Eventually, GINIS becomes able to automatically evaluate web pages using these learned rules, after which the evaluation can be utilized for a variety of user profiling applications. We successfully confirmed, for example, that time spent on a web page is not the most important factor in predicting interest from behavior, which conflicts with the findings of most previous studies.  [  paper  |  slides  ]


 

Accepted position papers

Access to Query Logs — An Academic Researcher’s Point of View  
Judit Bar-Ilan — Bar-Ilan University

Abstract: Position Paper. Academic researchers have very limited access to query logs of major web search engines. Studying and analyzing large-scale query logs is essential for advancing Web IR. We propose setting up review boards with clear rules for appropriate conduct, and allowing researchers access to logs within this framework.  [  paper  |  slides  ]


Preserving the Collective Expressions of the Human Consciousness  
Bernard J Jansen — Penn State

Abstract: Position Paper. Web search engines use transaction log files to record a copious number of interactions that occur between the user, the Web search engine, and Web content. Search engine companies use these records of interactions to improve system design and online marketing. In order to address privacy concerns, some question whether it is wise for search engine companies to preserve these query logs. However, not preserving the query logs from Web search engines would be (and is) a critical loss of a temporal record of the expression of the collective human consciousness. In this paper, an outline of an action plan to preserve these records is proposed to generate discussion of such a course of action.  [ paper | slides ]


People's Query Logs: Personal Information Management  
Amanda Spink and Bernard J Jansen — Queensland University of Technology, Penn State

Abstract: Position Paper. In this position paper we propose that the challenge for query log analysis goes beyond mapping navigation patterns to provide interaction analysis tools to help people understand their own Web search and information behaviors. The relationship between personal information management and Web logs is also discussed. Further research issues are outlined.  [ paper ]


Towards Privacy-Preserving Query Log Publishing  
Li Xiong and Eugene Agichtein — Emory University

Abstract: It's an open secret that search engines collect detailed query logs, and sometimes release these data to third parties. Making this wealth of information available raises serious concerns about the privacy of individuals. This paper describes some important applications of query log analysis and discusses requirements on the degree of granularity of query logs. The authors then analyze the sensitive information in query logs and classify it from the privacy perspective. Two orthogonal dimensions are described for anonymizing query logs and a spectrum of approaches along those dimensions is presented. The authors discuss whether existing privacy guidelines such as HIPAA can apply to query logs directly.  [ paper | slides ]


Comparing Click Logs and Editorial Labels for Training Query Rewriting  
Wei Vivian Zhang and Rosie Jones — Yahoo! Research

Abstract: Clicks on web advertisements are one of the major sources of revenue for search companies. Query rewrites significantly increase the coverage of web advertisements available. In previous work we focused on optimizing the relevance between the query issued by the web searcher, and rewritten queries used to place advertisements. In this study, we identify the features which are predictive of the click through rate during query rewriting by mining web search-click logs. We also compare the features which are predictive of relevance (judged by human editors) and the clicks in user query logs during query rewriting. Our preliminary results suggest that similar features are predictive, and so we may be able to train our models on click log data in place of human editorial judgement.  [ paper | slides ]
