Knowledge bases are very good sources for knowledge extraction, the ability to create knowledge from structured and unstructured sources and use it to improve automatic processes as query expansion.
Wikipedia, in particular, could be analyzed to see how articles and categories relate to each other and how these relationships can support a query expansion technique. In particular, the authors of this article show that the structures in the form of dense cycles with a minimum amount of categories tend to identify the most relevant information.
Query expansion is the process of expanding a query issued by a user, introducing new terms, called expansion features, in order to improve the quality of the retrieved results.
Query expansion is motivated by the assumption that the query introduced by the user is not the best to express its real intention. For example, vocabulary mismatch between queries and documents is one of the main causes of a poor precision in information retrieval systems. Poor results also arise from the topic inexperience of the users. The challenge is to properly select the best expansion features.
Wikipedia has been proven to be a good source for query expansion, but the innovation in this paper lies in the fact of considering the differences between a social network and a knowledge base by:
- Creating a ground truth consisting of those articles in Wikipedia that provide good results for each of the queries that are the baseline in the experiments.
- Analysing how the articles and categories of the ground truth are structured within the Wikipedia graph.
- Identifying cycles of articles and categories as an important structure and also trends within them. 30% of the dense cycles with minimum ratio of categories, are tagged as the best expansion features.
- Identifying challenging and open problems for graph processing technologies when it comes to exploit structures of large graphs such as Wikipedia
A quick analysis of the query graphs reveals that they are, in general, disconnected graphs composed by a moderately large connected component. This is an interesting observation as it means that, in general, the terms users introduce in a search engine are semantically related either directly or by means of extra articles or categories. This suggests that Wikipedia, contains this semantic relation encoded within its structure, and therefore, can be exploited. Also, we observe that the largest connected component is clearly dominated by categories.
If you are using graphs for your research don’t hesitate to request being part of our Research program where we grant free licenses of Sparksee.