
1.

Improvement of building field association term dictionary using passage retrieval

Uddin Md. Elmarhomy Elsayed Masao Kazuhiro Jun-ichi 《Information processing & management》2007,43(6):1793

Field Association (FA) terms are a limited set of discriminating terms that can specify document fields. A document's field can be decided efficiently if the document contains many relevant FA terms. An earlier approach built an FA term dictionary using a WWW search engine, but that dictionary contained irrelevant FA terms because the approach extracted FA terms from whole documents. This paper proposes a new approach that extracts FA terms from passages (portions of a document's text) rather than from whole documents, and it extracts FA terms more accurately than the earlier approach. The proposed approach is evaluated on 38,372 articles from a large tagged corpus. According to the experimental results, with the new approach about 24% more relevant FA terms are appended to the earlier FA term dictionary and around 32% irrelevant FA terms are deleted. Moreover, the new approach achieves a precision of 98% and a recall of 94%.
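The idea that a document's field can be decided from the FA terms it contains can be sketched as follows. This is only an illustration, not the paper's extraction method: the dictionary entries, the field names, the naive sentence splitting, and the function names `split_passages`, `field_votes`, and `decide_field` are all assumptions.

```python
from collections import Counter

# Hypothetical FA term dictionary (term -> field); entries are invented for illustration.
FA_TERMS = {
    "home run": "sports/baseball",
    "pitcher": "sports/baseball",
    "penalty kick": "sports/soccer",
    "goalkeeper": "sports/soccer",
}


def split_passages(text, size=2):
    """Split text into passages of `size` sentences (naive split on '.')."""
    sentences = [s.strip().lower() for s in text.split(".") if s.strip()]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def field_votes(passage):
    """Count, per field, how many dictionary FA terms occur in the passage."""
    votes = Counter()
    for term, field in FA_TERMS.items():
        if term in passage:
            votes[field] += 1
    return votes


def decide_field(text):
    """Return the field with the most FA term hits over all passages, or None."""
    total = Counter()
    for passage in split_passages(text):
        total.update(field_votes(passage))
    return total.most_common(1)[0][0] if total else None
```

For example, `decide_field("The pitcher threw hard. A home run followed. The goalkeeper watched on TV.")` finds two baseball terms and one soccer term, so the baseball field wins.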

2.

Compression of double array structures for fixed length keywords

Masao Fuketa Hiroya Kitagawa Takuki Ogawa Kazuhiro Morita Jun-ichi Aoe 《Information processing & management》2014

A trie is one of the data structures for keyword matching. It is used in natural language processing, IP address routing, and so on. It can be represented by the matrix form, the link form, the double array, and LOUDS. The double array representation combines the retrieval speed of the matrix form with the compactness of the list form. LOUDS is a succinct data structure based on bit strings. The retrieval speed of LOUDS is not faster than that of the double array, but its space usage is smaller. This paper proposes a compressed version of the double array that divides the trie into multiple levels and removes the BASE array from the double array. Moreover, a retrieval algorithm and a construction algorithm are proposed. According to the experimental results for pseudo and real data sets, the retrieval speed of the presented method is almost the same as that of the double array, and its space usage is compressed to 66% compared with LOUDS for a large set of keywords with fixed length.
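For readers unfamiliar with the double array, here is a minimal sketch of the ordinary (uncompressed) BASE/CHECK representation. It does not implement the paper's compressed variant. The end marker `END`, the first-fit slot search, and the function names `build_double_array` and `contains` are assumptions chosen for brevity.

```python
END = "\0"  # assumed end-of-key marker; keys must not contain it


def build_double_array(keys):
    """Build BASE/CHECK arrays for a set of keys using naive first-fit placement."""
    # Step 1: build an ordinary nested-dict trie.
    trie = {}
    for key in keys:
        node = trie
        for ch in key + END:
            node = node.setdefault(ch, {})
    # Step 2: give every character a positive integer code.
    alphabet = sorted({ch for key in keys for ch in key + END})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}
    base, check = [0], [0]  # state 0 is the root; check == -1 marks a free slot

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)

    # Step 3: place the children of each state at base[state] + code.
    stack = [(0, trie)]
    while stack:
        state, node = stack.pop()
        if not node:
            continue  # leaf state: nothing to place
        codes = [code[ch] for ch in node]
        b = 0
        while True:
            ensure(b + max(codes) + 1)
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[state] = b
        for ch, child in node.items():
            t = b + code[ch]
            check[t] = state  # slot t now belongs to `state`
            stack.append((t, child))
    return base, check, code


def contains(key, base, check, code):
    """Follow transitions t = base[s] + c, accepting only if check[t] == s."""
    state = 0
    for ch in key + END:
        c = code.get(ch)
        if c is None:
            return False
        t = base[state] + c
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True
```

Each transition costs one addition and one comparison, which is where the matrix-like retrieval speed comes from, while unused slots are shared between states to keep the arrays compact.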

3.

Context constraint disambiguation of word semantics by field association schemes

Li Wang Masao Fuketa Kazuhiro Morita Jun-ichi Aoe 《Information processing & management》2011

Word sense disambiguation is important in various aspects of natural language processing, including Internet search engines, machine translation, text mining, etc. However, the traditional methods using case frames are not effective for solving context ambiguities that require information beyond sentences. This paper presents a new scheme for solving context ambiguities using a field association scheme. Generally, the scope of case frames is restricted to one sentence; however, the field association scheme can be applied to a set of sentences. In this paper, a formal disambiguation algorithm is proposed that controls the scope over a variable number of sentences containing ambiguities and resolves the ambiguities by calculating the weight of fields. In the experiments, 52 English and 20 Chinese words are disambiguated by using 104,532 Chinese and 38,372 English field association terms. The accuracy of the proposed field association scheme for context ambiguities is 65% higher than that of the case frame method. The proposed scheme shows better results than three other known methods, namely UNED-LS-U, IIT-2, and Relative-based, on the SENSEVAL-2 corpus.
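One way to picture a scope that spans a variable number of sentences is to widen a window around the ambiguous word until one field clearly outweighs the others. The sketch below is a simplified illustration under that assumption and is not the paper's formal algorithm. The weights, the sense inventory, and the names `field_weights` and `disambiguate` are hypothetical.

```python
from collections import Counter

# Hypothetical FA terms with (field, weight) and a hypothetical sense inventory.
FA_WEIGHTS = {
    "loan": ("finance", 2.0),
    "interest": ("finance", 1.0),
    "river": ("geography", 2.0),
    "fishing": ("geography", 1.0),
}
SENSES = {"bank": {"finance": "financial institution", "geography": "river side"}}


def field_weights(sentences):
    """Sum the weights of the FA terms found in the given sentences, per field."""
    weights = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word in FA_WEIGHTS:
                field, weight = FA_WEIGHTS[word]
                weights[field] += weight
    return weights


def disambiguate(word, sentences, index, max_radius=3):
    """Widen the window around sentences[index] until one candidate field leads."""
    candidates = SENSES[word]
    for radius in range(max_radius + 1):
        window = sentences[max(0, index - radius): index + radius + 1]
        weights = field_weights(window)
        scored = sorted(((weights[f], f) for f in candidates), reverse=True)
        best_is_unique = len(scored) == 1 or scored[0][0] > scored[1][0]
        if scored[0][0] > 0 and best_is_unique:
            return candidates[scored[0][1]]
    return None  # no field dominated within the allowed scope
```

A single-sentence method corresponds to `max_radius=0`; it would return `None` whenever the deciding FA term sits in a neighbouring sentence.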

4.

A method of extracting malicious expressions in bulletin board systems by using context analysis

Hiroshi Hanafusa Kazuhiro Morita Masao Fuketa Jun-ichi Aoe 《Information processing & management》2011

Bulletin board systems are well-known basic services on the Internet for frequent information exchange. The convenience of bulletin boards enables us to communicate with other people and to read the communication contents at any time. However, malicious postings about crimes are serious problems for service companies and users. The extraction schemes of traditional methods depend on words or sequences of words without considering the contexts of articles and, therefore, flagging malicious articles takes a lot of human effort. In order to reduce this effort, this paper presents a new filtering algorithm that uses context analysis to reduce the false-positive error rate for non-malicious articles. The presented scheme builds detection knowledge by introducing multi-attribute rules. Experimental results on 11,019 test data show that the sensitivity and specificity of the presented method are 38.7 and 24.1 percentage points higher, respectively, than those of the traditional method.
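The two reported measures follow the standard definitions: sensitivity is the share of malicious articles that are detected, and specificity is the share of non-malicious articles that are correctly passed. A minimal sketch follows; the function names and the counts used in the usage note are invented for illustration and are not the paper's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actually malicious articles that the filter flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-malicious articles that the filter leaves alone."""
    return true_neg / (true_neg + false_pos)


def point_gain(new_rate, old_rate):
    """Difference between two rates expressed in percentage points."""
    return (new_rate - old_rate) * 100
```

With made-up counts, a filter that flags 80 of 100 malicious articles has a sensitivity of 0.8; if an older filter reached 0.5, the gain is 30 percentage points, which is how a statement such as "38.7 points higher" should be read.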

5.

Word classification and hierarchy using co-occurrence word information

Kazuhiro Morita El-Sayed Atlam Masao Fuketa Kazuhiko Tsuda Masaki Oono Jun-ichi Aoe 《Information processing & management》2004,40(6):9325

With the development of computers in recent years, complex, advanced processing at high speed has become possible. Moreover, a lot of linguistic knowledge is used in natural language processing (NLP) systems to improve them. The need for co-occurrence word information in NLP systems is therefore increasing, and various studies using co-occurrence word information have been carried out. In addition, a dictionary is indispensable in NLP because the capability of the entire system is determined by the size and quality of the dictionary. This paper describes the importance of co-occurrence word information in NLP systems. It describes a classification technique based on the co-occurrence word (receiving word) and the co-occurrence frequency, and expresses the classified groups hierarchically. It also proposes a technique for an automatic construction system and a complete thesaurus. The system is tested experimentally and the effectiveness of the proposed technique is verified.
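Co-occurrence word information is, at its simplest, a table of how often two words appear together. The sketch below counts sentence-level co-occurrences and lists each word's partners; it is an assumed starting point for grouping words, not the classification technique of the paper. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of distinct words."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))  # pairs come out in sorted order
    return counts


def partners(counts, min_count=1):
    """Map each word to the set of words it co-occurs with at least min_count times."""
    result = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_count:
            result[a].add(b)
            result[b].add(a)
    return result
```

Words that share many partners (here, "water" and "tea" both co-occur with "drink" in a toy corpus) are natural candidates for the same group.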

6.

A compact static double-array keeping character codes

Susumu Yata Masaki Oono Kazuhiro Morita Masao Fuketa Toru Sumitomo Jun-ichi Aoe 《Information processing & management》2007

A trie represented by a double-array enables us to search for a key quickly in a small space. However, the double-array uses extra space so that it can be updated dynamically. This paper presents a compact structure for a static double-array. The new structure keeps character codes instead of indices in order to compress the elements of the double-array. In addition, the new structure unifies common suffixes and consists of fewer elements than the old structure. Experimental results for English keys show that the new structure reduces the space usage of the double-array by up to 40%.

7.

A new method for selecting English field association terms of compound words and its knowledge representation

El-Sayed Atlam K. Morita M. Fuketa Jun-ichi Aoe 《Information processing & management》2002,38(6)

This paper presents a strategy for building a morphological machine dictionary of English that infers the meaning of derivations by considering morphological affixes and their semantic classification. Derivations are grouped into a frame that is accessible to the semantic stem and the knowledge base. This paper also proposes an efficient method for selecting compound Field Association (FA) terms from a large pool of single FA terms for some specialized fields. For single FA terms, five levels of association and two ranks are defined, based on stability and inheritance. About 85% of redundant compound FA terms can be removed effectively by using the levels and ranks proposed in this paper. Recall averages of 60–80% are achieved, depending on the type of text. The proposed methods are applied to 22,000 relationships between verbs and nouns extracted from a large tagged corpus.

8.

A fast retrieving algorithm of hierarchical relationships using trie structures

Masafumi Koyama Kazuhiro Morita Masao Fuketa Jun-Ichi Aoe 《Information processing & management》1998,34(6):761-773

Case structures are useful for natural language systems, such as word selection in machine translation systems, query understanding in natural language interfaces, meaning disambiguation of sentences, context analysis, and so on. The case slot is generally constrained by hierarchical concepts because they are simple knowledge representations. As hierarchical structures grow, they become deeper and the number of concepts corresponding to one word increases. For these reasons, it is costly to determine whether a concept for a given word is a sub-concept of a concept in the case slot. This paper presents a faster method to determine the hierarchical relationships by using trie structures. The worst-case time complexity of determining relationships by the presented method is remarkably better than that of linear (or sequential) searching, which depends on the number of concepts in the slot. The simulation results show that the presented algorithm is 6 to 30 times faster than linear searching, while keeping the tries small.
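To see why a trie can beat a linear scan over the slot's concepts, consider storing each slot concept as its path from the root of the hierarchy. A candidate concept is then checked by walking its own path once, so the cost depends on the depth of the hierarchy rather than on the number of concepts in the slot. The sketch below illustrates this idea only; the toy hierarchy, the `MARK` key, and the function names are assumptions, not the paper's data structure.

```python
# Hypothetical concept hierarchy given as child -> parent links.
PARENT = {
    "animal": "thing",
    "dog": "animal",
    "cat": "animal",
    "tool": "thing",
    "hammer": "tool",
    "plant": "thing",
}
MARK = "$slot"  # assumed key marking the end of a slot concept's path


def path_from_root(concept):
    """Return the concept's ancestors from the root down to the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return tuple(reversed(path))


def build_slot_trie(slot_concepts):
    """Insert the root-to-concept path of every slot concept into a nested-dict trie."""
    trie = {}
    for concept in slot_concepts:
        node = trie
        for step in path_from_root(concept):
            node = node.setdefault(step, {})
        node[MARK] = True
    return trie


def satisfies_slot(trie, concept):
    """True if the concept equals, or is a sub-concept of, some slot concept."""
    node = trie
    for step in path_from_root(concept):
        node = node.get(step)
        if node is None:
            return False  # the path leaves every slot concept's path
        if MARK in node:
            return True  # a slot concept lies on the path to `concept`
    return False
```

With a slot constrained to `animal` or `tool`, `satisfies_slot` accepts `dog` and `hammer` and rejects `plant`, in one pass over at most three path steps each.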

9.

New methods for compression of MP double array by compact management of suffixes

Tshering C. Dorji El-sayed Atlam Susumu Yata Mahmoud Rokaya Masao Fuketa Kazuhiro Morita Jun-ichi Aoe 《Information processing & management》2010

The Minimal Prefix (MP) double array is an efficient data structure for a trie. However, its space efficiency is degraded by the non-compact management of suffixes. This paper presents three methods to compress the MP double array. The first two methods compress the MP double array by accommodating short suffixes inside the leaf nodes and by pruning leaf nodes corresponding to the end marker symbol. These methods achieve a size reduction of up to 20% and make insertion and deletion faster, while maintaining the retrieval time of O(1). The third method eliminates empty spaces in the array that holds suffixes, and improves the maximum size reduction further by about 5% at the cost of increased insertion time. Compared to a Ternary Search Tree, the key retrieval of the compressed MP double array is 50% faster and its size is 3–5 times smaller.
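In a minimal-prefix trie, only as many leading characters as are needed to tell a key apart from all other keys become trie nodes; the rest of the key is kept as a plain suffix string. The sketch below computes that split for a small key set. It illustrates the prefix/suffix division only, not the three compression methods; `END`, `common_prefix_len`, and `split_minimal_prefix` are assumed names.

```python
END = "\0"  # assumed end marker; keys must not contain it


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_minimal_prefix(keys):
    """Map each key to (minimal distinguishing prefix, remaining suffix)."""
    ordered = sorted(set(keys))
    result = {}
    for i, key in enumerate(ordered):
        term = key + END
        longest = 0
        # In sorted order, the longest shared prefix is always with a neighbour.
        for j in (i - 1, i + 1):
            if 0 <= j < len(ordered):
                longest = max(longest, common_prefix_len(term, ordered[j] + END))
        # One character past the shared part is enough to make the key unique.
        result[key] = (term[:longest + 1], term[longest + 1:])
    return result
```

For the keys `baby`, `bachelor`, `badge`, and `jar`, the trie part holds only `bab`, `bac`, `bad`, and `j`; the suffixes such as `helor` plus the end marker are what the abstract's suffix array has to manage compactly.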

10.

Ranking of field association terms using Co-word analysis

Mahmoud Rokaya Elsayed Atlam Masao Fuketa Tshering C. Dorji Jun-ichi Aoe 《Information processing & management》2008

Information retrieval involves finding some desired information in a store of information or a database. In this paper, Co-word analysis is used to rank a selected sample of FA terms. Based on this ranking, a better arrangement of search results can be achieved. Experimental results were obtained using 41 MB of data (7660 documents) in the field of sports. The corpus was collected from CNN (sports field) and was chosen to be distributed over 11 sub-fields of sports. According to the experimental results, the average precision increased by 18.3% after applying the proposed arranging scheme with term weights computed from absolute frequency, and by 17.2% after applying the proposed arranging scheme with term weights computed from a formula based on “TF∗IDF”.
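The abstract does not give the exact weighting formula, so the sketch below uses the common textbook form of TF∗IDF, weight = tf × log(N / df), as an assumption. It shows how such weights order terms within a document; the function names `tf_idf_weights` and `rank_terms` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # absolute frequency of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def rank_terms(weight_map):
    """Order terms by descending weight, breaking ties alphabetically."""
    return sorted(weight_map, key=lambda t: (-weight_map[t], t))
```

A term that appears in every document gets weight zero under this formula, whereas a ranking by absolute frequency alone would still count it, which is one reason the two weighting schemes can order the same terms differently.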