TY - GEN
T1 - Search by multiple examples
AU - Zhu, Mingzhu
AU - Wu, Yi Fang Brook
PY - 2014
Y1 - 2014
N2 - It is often difficult for users to express their information needs as keywords. Search-By-Multiple-Examples (SBME), a promising method for overcoming this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords. Most studies on SBME adopt Positive Unlabeled learning (PU learning) techniques, treating the user-provided examples (denoted as query examples) as the positive set and the entire data collection as the unlabeled set. However, it is inefficient to treat the entire data collection as the unlabeled set, as its size can be huge. In addition, the query examples are treated as being relevant to a single topic, but they are often relevant to multiple topics. As the query examples are much fewer than the unlabeled data, system performance may degrade dramatically because of the class imbalance problem. Moreover, the experiments conducted in these studies have not taken into account the settings of online search, which are very different from a controlled experimental scenario. This proposed research seeks to improve SBME by exploring: (1) how to predict users' information needs by modeling the content of the documents using probabilistic topic models; (2) how to deal with the class imbalance problem by reducing the size of the unlabeled data and adopting machine learning techniques. We will also conduct extensive experiments to better evaluate SBME using different sizes of query examples to simulate users' information needs.
AB - It is often difficult for users to express their information needs as keywords. Search-By-Multiple-Examples (SBME), a promising method for overcoming this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords. Most studies on SBME adopt Positive Unlabeled learning (PU learning) techniques, treating the user-provided examples (denoted as query examples) as the positive set and the entire data collection as the unlabeled set. However, it is inefficient to treat the entire data collection as the unlabeled set, as its size can be huge. In addition, the query examples are treated as being relevant to a single topic, but they are often relevant to multiple topics. As the query examples are much fewer than the unlabeled data, system performance may degrade dramatically because of the class imbalance problem. Moreover, the experiments conducted in these studies have not taken into account the settings of online search, which are very different from a controlled experimental scenario. This proposed research seeks to improve SBME by exploring: (1) how to predict users' information needs by modeling the content of the documents using probabilistic topic models; (2) how to deal with the class imbalance problem by reducing the size of the unlabeled data and adopting machine learning techniques. We will also conduct extensive experiments to better evaluate SBME using different sizes of query examples to simulate users' information needs.
KW - information retrieval
KW - positive unlabeled learning
KW - search by multiple examples
KW - transductive inference
UR - http://www.scopus.com/inward/record.url?scp=84906850352&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906850352&partnerID=8YFLogxK
U2 - 10.1145/2556195.2556206
DO - 10.1145/2556195.2556206
M3 - Conference contribution
AN - SCOPUS:84906850352
SN - 9781450323512
T3 - WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining
SP - 667
EP - 671
BT - WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery
T2 - 7th ACM International Conference on Web Search and Data Mining, WSDM 2014
Y2 - 24 February 2014 through 28 February 2014
ER -