首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
2.
3.
In 2004, the Scottish Parliament commissioned an independent review of abuse in children’s residential establishments between 1950 and 1995. In 2007, the review’s findings were published in a report entitled Historical Abuse Systemic Review: Residential Schools and Children’s Homes in Scotland 1950 to 1995, also known as the Shaw Report. In this article, the Shaw Report provides the jumping off point for a case study of the social justice impact of records. Drawing on secondary literature, interviews, and care-related records, the study identifies narratives that speak to the social justice impact of care records on care-leavers seeking access to them; it also assesses the potential of the surviving administrative records to serve as a foundation on which to construct historical narratives that speak more generally to the experience of children in residential care.  相似文献   

4.
政府电子文件协同管理:美国经验及其启示   总被引:1,自引:0,他引:1  
加强政府电子文件协同管理是促进政府电子文件高效管理和价值实现的有效路径。采用协同创新管理理论框架对近五年美国国家档案与文件署发布的电子文件管理相关政策进行文本分析,并对五个部门工作人员进行访谈,深入了解政策现状,归纳总结出美国在政府电子文高效协同管理方面的措施,揭示其在目标协同、主体协同、客体协同、过程协同、要素协同五个方面的协同经验。最后,结合我国实践需求,提出了借鉴五个方面经验,促进政府电子文件协同管理路径高效的启示。  相似文献   

5.
This paper reviews the current status of the Anglophone (Anglo-American) publishing business and draws some comparisons with publishing in other languages. It then critically reviews the impact of the Harry Potter phenomenon and the questionable progress of e-books in the trade sector, using the example of Stephen King’s Riding the Bullet. It also comments on Amazon’s introduction of the Kindle e-book reader.  相似文献   

6.
In this paper, we present Waves, a novel document-at-a-time algorithm for fast computing of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results which we call waves. Each wave traverses the index, starting from a specific tier level i. Each wave i may insert only those documents that occur in that tier level into the answer. After processing a wave, the algorithm checks whether the answer achieved might be changed by successive waves or not. A new wave is started only if it has a chance of changing the top-k scores. We show through experiments that such lazy query processing strategy results in smaller query processing times when compared to previous approaches proposed in the literature. We present experiments to compare Waves’ performance to the state-of-the-art document-at-a-time query processing methods that preserve top-k results and show scenarios where the method can be a good alternative algorithm for computing top-k results.  相似文献   

7.
Web spam pages exploit the biases of search engine algorithms to get higher than their deserved rankings in search results by using several types of spamming techniques. Many web spam demotion algorithms have been developed to combat spam via the use of the web link structure, from which the goodness or badness score of each web page is evaluated. Those scores are then used to identify spam pages or punish their rankings in search engine results. However, most of the published spam demotion algorithms differ from their base models by only very limited improvements and still suffer from some common score manipulation methods. The lack of a general framework for this field makes the task of designing high-performance spam demotion algorithms very inefficient. In this paper, we propose a unified score propagation model for web spam demotion algorithms by abstracting the score propagation process of relevant models with a forward score propagation function and a backward score propagation function, each of which can further be expressed as three sub-functions: a splitting function, an accepting function and a combination function. On the basis of the proposed model, we develop two new web spam demotion algorithms named Supervised Forward and Backward score Ranking (SFBR) and Unsupervised Forward and Backward score Ranking (UFBR). Our experiments, conducted on three large-scale public datasets, show that (1) SFBR is very robust and apparently outperforms other algorithms and (2) UFBR can obtain results comparable to some well-known supervised algorithms in the spam demotion task even if the UFBR is unsupervised.  相似文献   

8.
Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models).  相似文献   

9.
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.  相似文献   

10.
Traditional pooling-based information retrieval (IR) test collections typically have \(n= 50\)–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.  相似文献   

11.
This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.  相似文献   

12.
Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of shards to search for each query, fewer postings are evaluated. In this paper we extend the study of selective search into new areas using a fine-grained simulation, examining the difference in efficiency when term-based and sample-based resource selection algorithms are used; measuring the effect of two policies for assigning index shards to machines; and exploring the benefits of index-spreading and mirroring as the number of deployed machines is varied. Results obtained for two large datasets and four large query logs confirm that selective search is significantly more efficient than conventional distributed search architectures and can handle higher query rates. Furthermore, we demonstrate that selective search can be tuned to avoid bottlenecks, and thus maximize usage of the underlying computer hardware.  相似文献   

13.
Search engines are increasingly going beyond the pure relevance of search results to entertain users with information items that are interesting and even surprising, albeit sometimes not fully related to their search intent. In this paper, we study this serendipitous search space in the context of entity search, which has recently emerged as a powerful paradigm for building semantically rich answers. Specifically, our work proposes to enhance an explorative search system that represents a large sample of Yahoo Answers as an entity network, with a result structuring that goes beyond ranked lists, using composite entity retrieval, which requires a bundling of the results. We propose and compare six bundling methods, which exploit topical categories, entity specializations, and sentiment, and go beyond simple entity clustering. Two large-scale crowd-sourced studies show that users find a bundled organization—especially based on the topical categories of the query entity—to be better at revealing the most useful results, as well as at organizing the results, helping to discover novel and interesting information, and promoting exploration. Finally, a third study of 30 simulated search tasks reveals the bundled search experience to be less frustrating and more rewarding, with more users willing to recommend it to others.  相似文献   

14.
15.
16.
A recent “third wave” of neural network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, work in this area is often referred to as deep learning. Recent years have witnessed an explosive growth of research into NN-based approaches to information retrieval (IR). A significant body of work has now been created. In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research.  相似文献   

17.
This paper explores the nature of the archival body and the ways in which it is temporally situated and yet also always in motion. Applying transdisciplinary logics, it argues that the affective nature of archival productions follows the machinations of metamorphoses and (un)becoming. Using two queer/ed and transgender archives as sites of inquiry, the paper explores the erotic and affective nature of accessing the archival body in its multimodal forms. Although touching, smelling and stroking what remains of distinct material lives might elucidate arousal and certain other affective and haptic responses within the visitor to the archives, the records themselves hold and cradle their creators and their storytelling techniques along with their relationships to longing for and belonging in the archival body of knowledge. This approach suggests that understanding of the record and its affects can be enriched by temporal perspectives that acknowledge distinct and diverse temporalities and promote generative understandings of potentially meaningful progressions of time and everyday rhythms embodied within archival materials.  相似文献   

18.
We address the feature extraction problem for document ranking in information retrieval. We then propose LifeRank, a Linear feature extraction algorithm for Ranking. In LifeRank, we regard each document collection for ranking as a matrix, referred to as the original matrix. We try to optimize a transformation matrix, so that a new matrix (dataset) can be generated as the product of the original matrix and a transformation matrix. The transformation matrix projects high-dimensional document vectors into lower dimensions. Theoretically, there could be very large transformation matrices, each leading to a new generated matrix. In LifeRank, we produce a transformation matrix so that the generated new matrix can match the learning to rank problem. Extensive experiments on benchmark datasets show the performance gains of LifeRank in comparison with state-of-the-art feature selection algorithms.  相似文献   

19.
This study is devoted to detection of the lexical environment and demonstration of the thematic medium of the words MEMORY and MEMORIES in the social sciences on the basis of the bibliographic database Social Science Citation Index (SSCI) of the Institute for Scientific Information (USA). The amount of studied material is over 3000 documents in English. Corresponding corpora and subcorpora of summary texts are formed, general frequency dictionaries and frequency dictionaries of binary combinations for each corpus and subcorpus are constructed, words and combinations specific for each subcorpus are found, and corresponding factors (lexical markers) are calculated for them. The general statistical information on the usage of the words under study is given, the obtained results of lexical analysis are represented in a tabulated form, and the corresponding semantic maps are discussed.  相似文献   

20.
For the definition of electronic records, the use of new terms, like literary warrant, is not necessary, and for the European perspective even not understandable. If this expression simply means best practice and professional culture in recordkeeping, we only to know what creators did for centuries and still do today and probably will do also in the future, by referring to the archival science, diplomatics and archival practice for clarifying definitions in the recordkeeping environment. A multi-disciplinary approach is still required for the electronic recordkeeping system as it was in the past for traditional records, but the theory and the terminology should be consistent and based on the deep understanding of essential characteristics of records and essential requirements of good recordkeeping to produce in the first place and maintain reliable and authentic records. Of course, a record is more than recorded information created in the course of business activity: a record is the recorded representation of an act produced in a specific form – the form prescribed by the legal system – by a creator in the course of its activity.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号