Similar literature
 20 similar documents found (search time: 15 ms)
1.
n-grams have been used widely and successfully for approximate string matching in many areas. s-grams have been introduced recently as an n-gram based matching technique, where di-grams are formed of both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). However, s-grams lack a precise definition, as does their similarity comparison. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, only assuming character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure because Jaccard is insensitive to the counts of each n-gram in the strings to be compared. However, due to the popularity of Jaccard in IR experiments, we define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions.
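The contrast between count-sensitive L1 distance and count-insensitive Jaccard similarity can be illustrated with a minimal sketch. It is not the paper's formal definition: the function names, the `skips` parameter, and the choice to pool all skip distances into a single profile are assumptions made for illustration only.

```python
from collections import Counter


def sgram_profile(text, skips=(0, 1)):
    """Count di-grams whose two characters are separated by k skipped
    characters, for every k in `skips` (k = 0 gives ordinary bigrams).
    Pooling all skip distances into one profile is a simplification."""
    profile = Counter()
    for k in skips:
        for i in range(len(text) - k - 1):
            profile[text[i] + text[i + k + 1]] += 1
    return profile


def l1_distance(p, q):
    """Sum of absolute count differences over all grams in either profile."""
    return sum(abs(p[g] - q[g]) for g in set(p) | set(q))


def jaccard_similarity(p, q):
    """Jaccard similarity of the binary profiles (counts are ignored)."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0  # convention chosen here for two empty profiles
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    p = sgram_profile("aaa", skips=(0,))
    q = sgram_profile("aa", skips=(0,))
    # The profiles differ in counts only, so L1 sees a difference
    # while Jaccard on the binary profiles does not.
    print(l1_distance(p, q), jaccard_similarity(p, q))
```

With `skips=(0,)` the profile reduces to an ordinary bigram profile, which mirrors the abstract's remark that n-gram computations are special cases of the generalized definitions.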

2.
With the increase of information on the Web, it is difficult to find desired information quickly among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. Genre, or style, is another view of a document, distinct from its subject or topic, and is also a criterion for classifying documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, which has been proposed in previous studies, is derived from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features, and to discuss their characteristics. Finally, we identify which set of features is appropriate for automatic genre classification of web documents.

3.
Propaganda is a mechanism to influence public opinion, which is inherently present in extremely biased and fake news. Here, we propose a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of certain keywords. We experiment thoroughly with different variations of such a model on a new publicly available corpus, and we show that character n-grams and other style features outperform existing alternatives to identify propaganda based on word n-grams. Unlike previous work, we make sure that the test data comes from news sources that were unseen during training, thus penalizing learning algorithms that model the news sources used at training time as opposed to solving the actual task. We integrate our supervised model into a public website, which organizes recent articles covering the same event on the basis of their propagandistic contents. This allows users to quickly explore different perspectives of the same story, and it also enables investigative journalists to dig further into how different media use stories and propaganda to pursue their agenda.

4.
This paper reports on an approach to the analysis of form (layout and formatting) during genre recognition, recorded using eye tracking. The researchers focused on eight different types of e-mail, such as calls for papers, newsletters and spam, which were chosen to represent different genres. The study involved the collection of oculographic behavior data using metrics based on scanpath duration and scanpath length, to highlight the ways in which people view the features of genres. We found that genre analysis based on purpose and form (layout features, etc.) was an effective means of identifying the characteristics of these e-mails.

5.
This paper describes OntoNotes, a multilingual (English, Chinese and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes involves word senses that are grouped into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous senses of words, there is still no knowledge about whether two words in a pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that incorporates Google n-grams and a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features are used to measure the degree of context mismatch for a substitution. The statistical test is then applied to determine whether the substitution is adequate based on the degree of mismatch. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method can achieve performance comparable to that of the supervised method.

6.
The use of non-English Web search engines has become prevalent. Given the popularity of Chinese Web searching and the unique characteristics of the Chinese language, it is imperative to conduct studies focused on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of general online information in Chinese. After studying the Zipf distribution of n-grams with different values of n, we found that the unigram curve is the most curved of all while the bigram curve follows the Zipf distribution best, and that the curves of n-grams with larger n (n = 3–6) had similar structures with β-values in the range of 0.66–0.86. The distribution of combined n-grams was also studied. All analyses were performed on the data both before and after the removal of function terms and incomplete terms, and yielded similar findings. We believe the findings from this study provide some insights for further research in non-English Web searching and will assist in the design of more effective Chinese Web search engines.
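A rank-frequency analysis of this kind can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how β was estimated, so the sketch assumes β is the exponent in frequency ∝ rank^(-β) and estimates it by ordinary least squares on the log-log curve; the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngram_counts(queries, n):
    """Count character n-grams over a collection of query strings."""
    counts = Counter()
    for query in queries:
        for i in range(len(query) - n + 1):
            counts[query[i:i + n]] += 1
    return counts


def zipf_exponent(frequencies):
    """Estimate beta in frequency ~ C / rank**beta by least squares
    on log(frequency) against log(rank). Assumes beta is the slope
    magnitude of the log-log rank-frequency curve."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


if __name__ == "__main__":
    # Toy data; a real study would use n-gram counts from a query log.
    counts = char_ngram_counts(["abab", "abba", "baba"], 2)
    print(zipf_exponent(counts.values()))
```

A curve that bends noticeably on the log-log plot, as the abstract reports for unigrams, is poorly described by a single fitted slope, so the residuals of the fit are worth inspecting alongside the estimate.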

7.
This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams, combining contiguous and non-contiguous. The two techniques were tested using three experiments involving different levels of textual word stemming, a textual corpus containing about 25 thousand words (with a total size of about 160KB), and a set of 100 query textual words. The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used. The present results and the inconsistent findings of previous studies raise some questions regarding the efficiency of pure conventional N-gram matching and the ways in which it should be used in languages other than English.

8.
The goal of the study presented in this article is to investigate to what extent the classification of a web page by a single genre matches the users’ perspective. The extent of agreement on a single genre label for a web page can help understand whether there is a need for a different classification scheme that overrides the single-genre labelling. My hypothesis is that a single genre label does not account for the users’ perspective. In order to test this hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects) asking them to assign only a single genre label to each of the web pages. Users could choose from a list of 21 genre labels, or select one of the two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale was to observe the level of agreement on a single genre label per web page, and draw some conclusions about the appropriateness of limiting the assignment to only a single label when doing genre classification of web pages. Results show that users largely disagree on the label to be assigned to a web page.

9.
This case study analyzes the Internet-based resources that a software engineer uses in his daily work. Methodologically, we studied the web browser history of the participant, classifying all the web pages he had seen over a period of 12 days into web genres. We interviewed him before and after the analysis of the web browser history. In the first interview, he spoke about his general information behavior; in the second, he commented on each web genre, explaining why and how he used them. As a result, three approaches allow us to describe the set of 23 web genres obtained: (a) the purposes they serve for the participant; (b) the role they play in the various work and search phases; and (c) the way they are used in combination with each other. Further observations concern the way the participant assesses the quality of web-based resources, and his information behavior as a software engineer.

10.
This study investigates how resource genres affect the specificity or level of abstraction of user-generated tags. The study found significant variations in the frequency of assignment of superordinate, subordinate and basic level terms representing news, blog and ecommerce resource genres. The study observed users’ preferences to represent news and blog resources with basic or subordinate level tags and ecommerce resources with superordinate and basic level tags. The study also observed multifaceted representation of resource genres, suggesting that the use of genre tags is “situated” and grounded in language. This study suggests that representation of knowledge based on resource genres and levels of abstraction of user-generated tags may improve representation, organization, and findability of resources in distributed knowledge environments.

11.
Social question-and-answer (Q&A) sites have the potential to serve as a useful source of online information based on their content-focused and collaborative nature. Although previous research has examined various attributes of high-quality information on social Q&A sites (e.g., best answers), relatively little attention has been paid to what affects users’ credibility assessments of information in the social Q&A context. The present study developed a social Q&A platform-specific framework for web credibility assessment, including 21 criteria under six types of web credibility, based on a literature analysis and case study of two online Q&A communities, Stack Exchange and Wikipedia Reference Desk. Using the selected sites’ policies and guidelines (n = 46) as the source of evidence, the case study revealed that content-related attributes (e.g., evidence-based, pertinence) were most frequently identified (12 of 21 criteria) as potential cues and heuristics for web credibility assessments of social Q&A sites, followed by author-related (five of 21; e.g., reputation) and design-related (four of 21; e.g., engaging design) factors. Design-related criteria were rarely included in previous models of web credibility on social Q&A or similar peer-knowledge production platforms. However, our findings showing that both Stack Exchange and Wikipedia Reference Desk have policies regarding all four design-related criteria in our framework (engaging design, moderation, design appropriateness, and ease of use) indicate the potential influence of design features on users’ web credibility assessment on social Q&A sites. Some differences emerged between the two cases, such as policies regarding the answerer's credentials or semantic accuracy that are present on Wikipedia Reference Desk but absent on Stack Exchange. Such differences in the sites’ policies reflect how they position themselves as social Q&A communities: Wikipedia, of which Wikipedia Reference Desk is a part, as an encyclopedia, and Stack Exchange as a community-based platform for learning, sharing knowledge, and building users’ careers.

12.
The widespread availability of the Internet and the variety of Internet-based applications have resulted in a significant increase in the number of web pages. Determining the behaviors of search engine users has become a critical step in enhancing search engine performance. Search engine user behaviors can be determined by content-based or content-ignorant algorithms. Although many content-ignorant studies have been performed to automatically identify new topics, previous results have demonstrated that spelling errors can cause significant errors in topic shift estimates. In this study, we focused on minimizing the number of wrong estimates that were based on spelling errors. We developed a new hybrid algorithm combining character n-gram and neural network methodologies, and compared the experimental results with results from previous studies. For the FAST and Excite datasets, the proposed algorithm improved topic shift estimates by 6.987% and 2.639%, respectively. Moreover, we analyzed the performance of the character n-gram method in different aspects, including a comparison with the Levenshtein edit-distance method. The experimental results demonstrated that the character n-gram method outperformed the Levenshtein edit-distance method in terms of topic identification.
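The two string measures being compared can be sketched as below. This is not the paper's hybrid algorithm (no neural network is involved), and the use of Jaccard overlap on character bigram sets is an assumption, since the abstract does not specify how n-gram similarity was scored; the function names are hypothetical.

```python
def char_ngrams(text, n=2):
    """Set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_ngram_similarity(a, b, n=2):
    """Jaccard overlap of the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # convention chosen here for two strings shorter than n
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), computed
    row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A misspelled query term compared with the intended one.
    print(char_ngram_similarity("engine", "engien"))
    print(levenshtein("engine", "engien"))
```

In a topic-shift setting, either score could be thresholded to decide whether two consecutive queries share a (possibly misspelled) term; the threshold would have to be tuned on data.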

13.
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
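A bit-sliced signature file in which each n-gram sets a single bit can be sketched as follows. The class name, the CRC32 hash, the signature width, and the `*` wildcard syntax are all assumptions for illustration; the sketch also omits the compression that the paper evaluates.

```python
import zlib
from fnmatch import fnmatchcase


def ngrams(text, n=2):
    """Set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class BitSlicedSignatureFile:
    """One bit slice per signature position; slice b has bit t set
    when term number t contains an n-gram that hashes to position b."""

    def __init__(self, lexicon, width=64, n=2):
        self.lexicon = list(lexicon)
        self.width = width
        self.n = n
        self.slices = [0] * width  # Python ints used as bitmaps over terms
        for t, term in enumerate(self.lexicon):
            for gram in ngrams(term, n):
                self.slices[self._bit(gram)] |= 1 << t

    def _bit(self, gram):
        # Each n-gram sets exactly one bit (hash choice is an assumption).
        return zlib.crc32(gram.encode("utf-8")) % self.width

    def search(self, pattern):
        """Match a partially specified query such as 'lab*' or '*able'.
        Only '*' is intended as a wildcard in this sketch."""
        grams = set()
        for fragment in pattern.split("*"):
            grams |= ngrams(fragment, self.n)
        candidates = (1 << len(self.lexicon)) - 1
        for gram in grams:
            candidates &= self.slices[self._bit(gram)]  # AND the slices
        # Hash collisions cause false matches, so verify each candidate.
        return [term for t, term in enumerate(self.lexicon)
                if (candidates >> t) & 1 and fnmatchcase(term, pattern)]


if __name__ == "__main__":
    index = BitSlicedSignatureFile(["label", "labor", "table", "cable"])
    print(index.search("lab*"))
    print(index.search("*able"))
```

A narrower `width` shrinks the index but increases collisions, and therefore the number of candidates that must be verified; this is the size/speed trade-off the abstract describes.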

14.
Search systems are limited by their inability to distinguish between information that is on topic and information that is useful, i.e. suitable and applicable to the tasks at hand. This paper presents the results of two studies that examine a possible approach to identifying more useful documents through the relationships between searchers’ tasks and the document genres in the collection. A questionnaire and an experimental user study conducted in two domains provide evidence that perceptions of usefulness are dependent upon information task type, document genre, and the relationship between these two factors. Expertise is also found to have an effect on usefulness. These results further our understanding of the role of task and genre in interactive information retrieval.

15.
Modern web search engines are expected to return the top-k results efficiently. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them are prone to ignoring some especially important factors in ranking functions, such as term-proximity (the distance relationship between query terms in a document). In our recent work [Zhu, M., Shi, S., Li, M., & Wen, J. (2007). Effective top-k computation in retrieving structured documents with term-proximity support. In Proceedings of 16th CIKM conference (pp. 771–780)], we demonstrated that, when term-proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. To solve this problem, we built the inverted index based on web page structure and proposed the query processing strategies accordingly. The experimental results indicate that the proposed index structures and query processing strategies significantly improve the top-k efficiency. In this paper, we study the possibility of adopting additional techniques to further improve top-k computation efficiency. We propose a Proximity-Probe Heuristic to make our top-k algorithms more efficient. We also test the efficiency of our approaches on various settings (linear or non-linear ranking functions, exact or approximate top-k processing, etc.).

16.
17.
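With the rapid development of the Web, the number of web pages has expanded sharply, growing exponentially in recent years. Search engines face increasingly severe challenges, and it is difficult to find pages that meet users' needs accurately and quickly among massive numbers of web pages. Web page classification is one effective means of addressing this problem. Topic-based and genre-based classification are the two mainstream approaches to web page classification, and both effectively improve the retrieval efficiency of search engines. Web page genre classification refers to classifying web pages according to their form of presentation and their purpose. This paper introduces the definition of web page genre and the classification features commonly used in web page genre classification research, and presents several commonly used feature selection methods, classification models, and classifier evaluation methods, giving researchers an overview of web page genre classification.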

18.
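Research on the Division of Schools of Thought and Development Trends in the Knowledge Management Discipline   Total citations: 1, self-citations: 0, citations by others: 1
Using the multivariate statistical methods of author co-citation analysis, this paper analyzes papers on knowledge management in the SSCI citation database, attempting to reveal the knowledge structure of the knowledge management discipline and to delineate its schools of thought by drawing a knowledge map. Through this analysis, knowledge management is divided into the strategy school, the process school, the organizational change school, the application school, and the system model school, and the development trends of the schools are analyzed.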

19.
Many traditional works on off-line Thai handwritten character recognition used a set of local features including circles, concavity, endpoints and lines to recognize hand-printed characters. However, in natural handwriting, these local features are often missing due to rough or quick writing, resulting in dramatic reduction of recognition accuracy. Instead of using such local features, this paper presents a method called multi-directional island-based projection to extract global features from handwritten characters. As the recognition model, two statistical approaches, namely interpolated n-gram model (n-gram) and hidden Markov model (HMM), are proposed. The experimental results indicate that the proposed scheme achieves high accuracy in the recognition of naturally-written Thai characters with numerous variations, compared to some common previous feature extraction techniques. Another experiment with English characters also displays quite promising results.

20.
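Analysis of the Characteristics of Knowledge Management Schools and Definition of Their Connotation   Total citations: 2, self-citations: 0, citations by others: 2
Using author co-citation and multivariate statistical analysis, this paper conducts an empirical study of domestic and international knowledge management papers and identifies five schools of knowledge management: the strategy school, the organizational change school, the process school, the technology school, and the application school. Using meta-analysis, it reveals the research characteristics and future directions of each school in terms of research content, theoretical basis, research methods, and level of analysis. It summarizes the main content of research in the knowledge management discipline and tentatively proposes a definition and connotation of knowledge management, with the aim of grasping the overall state of development of knowledge management and providing a reference for future research.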

