Jibiki dictionary project: information

General presentation of the Jibiki project

Motivations

Although French and Japanese are each considered well resourced in terms of tools and linguistic resources, French-Japanese is an under-resourced language pair. There is no royalty-free, high-quality bilingual lexicon. Unsurprisingly, aligned French-Japanese bilingual corpora and machine translation systems are also rare.

The existing high-quality Japanese-French dictionaries are published dictionaries that exist only on paper or in compiled electronic dictionaries (denshi-jishou). None of them offers an online interface for consultation.

For English and German, good-quality bilingual dictionaries are available online and can be downloaded freely. For English, the JMdict project, led by Jim Breen, contains about 160,000 entries; for German, WaDokuJiTen, led by Ulrich Apel, contains about 280,000 entries.

Method

Based on this observation, we defined a project to build a multilingual lexical system, with high priority given to the French-Japanese language pair.

The lexical system will be built from an aligned French-Japanese bilingual corpus and a dictionary with a pivot structure, bilingual at first.

The bilingual corpus consists of texts enriched by automatic analysis tools (lemmas and grammatical categories). Its first purpose is to serve as a source of examples that will enrich the dictionary entries. It can also be used for other purposes: construction of a statistical machine translation system, lexicometry, text studies, etc.

The construction of the dictionary will start by reusing existing resources (English-Japanese dictionaries, Wiktionary) and applying automatic processing (reification of translation links, word sense disambiguation). Volunteer contributors working on the web will then complete the data. They will be asked to contribute to dictionary entries according to their level of expertise and knowledge in lexicography or bilingual translation.

Microstructure

The microstructure of the entries gathered into monolingual volumes is a simplification of the one used in Explanatory and Combinatorial Lexicography. Each entry is based on a vocable, which groups lexies; a lexie is either a word sense or an idiom.

A lexie consists of a name, grammatical properties, a semantic formula that can be seen as a formal definition, a list of lexical-semantic functions, a list of examples and finally a list of idioms. For a predicative lexie, the formula describes the predicate, its arguments and the syntactic realization of the arguments. There are 56 basic lexical functions, applicable to any language, which can be combined with one another.
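The components listed above can be sketched as a small data structure. This is only an illustration in Python: the class and field names, the identifier "rain#1" and the placeholder semantic formula are assumptions, not the actual Jibiki schema. "Magn" is the standard intensifier lexical function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Lexie:
    """One lexie: a word sense or an idiom (field names are illustrative)."""
    name: str
    grammatical_properties: Dict[str, str] = field(default_factory=dict)
    semantic_formula: Optional[str] = None  # formal definition, may be missing
    # Lexical function name -> values, e.g. "Magn" (intensifier) -> ["heavy"]
    lexical_functions: Dict[str, List[str]] = field(default_factory=dict)
    examples: List[str] = field(default_factory=list)
    idioms: List[str] = field(default_factory=list)


@dataclass
class Vocable:
    """A vocable groups the lexies that share one headword."""
    headword: str
    lexies: List[Lexie] = field(default_factory=list)


rain = Lexie(
    name="rain#1",  # hypothetical identifier
    grammatical_properties={"pos": "noun"},
    semantic_formula="water falling from clouds",  # placeholder, not a real formula
    lexical_functions={"Magn": ["heavy"]},
    examples=["Heavy rain fell all night."],
)
vocable = Vocable(headword="rain", lexies=[rain])
```

Optional and empty-by-default fields leave room for partially described lexies, which fits a dictionary that is completed step by step.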

To cope with different contributor skill levels, the editing interface adapts itself and displays only the appropriate information. For example, a beginner contributor will be prompted for a simple gloss to characterize a lexie, while an expert linguist will describe a complete semantic formula. Similarly, some contributors only have access to the list of lexical functions to fill in.
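One possible way to drive such an adaptive form is a simple mapping from contributor profile to the fields shown. The profile names, the field names and the fallback to the beginner view are all assumptions made for this sketch; the text only states that beginners give a gloss, experts a full semantic formula, and some contributors only fill in lexical functions.

```python
from typing import Dict, List

# Hypothetical profiles and field names, chosen for illustration only.
EDITABLE_FIELDS: Dict[str, List[str]] = {
    "beginner": ["gloss"],
    "lexical_functions_only": ["lexical_functions"],
    "expert_linguist": ["semantic_formula", "lexical_functions"],
}


def fields_for(profile: str) -> List[str]:
    """Return the fields the editing form displays for a given profile.

    Unknown profiles fall back to the beginner view (an assumption).
    """
    return list(EDITABLE_FIELDS.get(profile, EDITABLE_FIELDS["beginner"]))
```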

Macrostructure

The macrostructure is a pivot structure, with one monolingual volume per language and a central pivot volume.

When a new entry is added in a language A, it must be connected to the interlingual (pivot) volume. These links are created either by reusing existing bilingual dictionaries language A → language B, or by adding them manually from a translation. The link language A → language B then becomes language A → pivot → language B. If the entry of language B is already connected to an entry of a language C, then the entry of language A also benefits from these links.

However, to avoid confusing users, contributions are made through an interface that presents a classic bilingual dictionary view. Each bilingual link language A → language B added through this interface is converted in the background into two interlingual links and a pivot entry (an axie) that represents the original translation link, giving language A → pivot axie → language B.
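The way links propagate through the pivot can be sketched as follows. This is a minimal Python illustration, not the Jibiki implementation: the class name, the method names and the language codes are assumptions, and each entry is attached to a single pivot node, whereas a real entry with several senses would need one per sense.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

Entry = Tuple[str, str]  # (language code, headword)


class PivotDictionary:
    """Toy pivot structure: every entry points to one pivot node."""

    def __init__(self) -> None:
        self.pivot_of: Dict[Entry, int] = {}
        self.members: Dict[int, Set[Entry]] = defaultdict(set)
        self._next_id = 0

    def link(self, a: Entry, b: Entry) -> None:
        """Store the link a -> b as a -> pivot -> b."""
        pivot_a: Optional[int] = self.pivot_of.get(a)
        pivot_b: Optional[int] = self.pivot_of.get(b)
        if pivot_a is None and pivot_b is None:
            pivot = self._next_id  # neither entry is linked yet: new pivot node
            self._next_id += 1
        elif pivot_a is None:
            pivot = pivot_b  # reuse the pivot node that b already has
        elif pivot_b is None:
            pivot = pivot_a
        else:
            pivot = pivot_a
            if pivot_b != pivot_a:
                # Both were linked elsewhere: merge b's pivot node into a's.
                for entry in self.members.pop(pivot_b):
                    self.pivot_of[entry] = pivot
                    self.members[pivot].add(entry)
        for entry in (a, b):
            self.pivot_of[entry] = pivot
            self.members[pivot].add(entry)

    def translations(self, entry: Entry, target_lang: str) -> Set[str]:
        """Headwords of target_lang that share the entry's pivot node."""
        pivot = self.pivot_of.get(entry)
        if pivot is None:
            return set()
        return {word for lang, word in self.members[pivot]
                if lang == target_lang and (lang, word) != entry}


d = PivotDictionary()
d.link(("jpn", "猫"), ("eng", "cat"))   # link reused from an existing dictionary
d.link(("fra", "chat"), ("jpn", "猫"))  # new French entry linked to Japanese
french_to_english = d.translations(("fra", "chat"), "eng")
```

In this sketch the French entry was only ever linked to the Japanese one, yet it reaches the English entry through the shared pivot node.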
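The background conversion of one bilingual link can be illustrated like this. The record type, the function name and the identifier format are hypothetical; the sketch only shows that a single A → B link entered by a contributor yields one pivot axie and two interlingual links pointing to it.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple


@dataclass(frozen=True)
class InterlingualLink:
    """Link from a monolingual entry to a pivot axie (illustrative record)."""
    lang: str
    headword: str
    axie_id: str


_axie_counter = count(1)  # hypothetical identifier scheme


def store_bilingual_link(
    source: Tuple[str, str], target: Tuple[str, str]
) -> Tuple[str, List[InterlingualLink]]:
    """Turn one bilingual link into a new axie plus two interlingual links."""
    axie_id = f"axie{next(_axie_counter)}"
    links = [
        InterlingualLink(source[0], source[1], axie_id),
        InterlingualLink(target[0], target[1], axie_id),
    ]
    return axie_id, links


axie_id, links = store_bilingual_link(("fra", "chat"), ("jpn", "猫"))
```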

The data

Levels of quality

Each piece of information in each entry is assigned a level of quality. The levels range from 1 star for a draft (recovered data whose quality is not known) to 5 stars for information certified by an expert (e.g. a translation link validated by a sworn translator).

Likewise, contributors are assigned a skill level, also from 1 to 5 stars. 1 star denotes a beginner who is not yet known in the community, and 5 stars a recognized expert.
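A minimal sketch of how both star scales might be represented, assuming (this is not stated in the project description) that new data and new contributors start at 1 star. All names are illustrative.

```python
from dataclasses import dataclass

MIN_STARS = 1  # draft data or unknown beginner
MAX_STARS = 5  # expert-certified data or recognized expert


def check_stars(stars: int) -> int:
    """Reject values outside the 1 to 5 star scale."""
    if not MIN_STARS <= stars <= MAX_STARS:
        raise ValueError(
            f"stars must be between {MIN_STARS} and {MAX_STARS}, got {stars}"
        )
    return stars


@dataclass
class RatedInfo:
    """A piece of entry information together with its quality level."""
    content: str
    stars: int = MIN_STARS  # assumption: recovered data starts as a draft

    def __post_init__(self) -> None:
        check_stars(self.stars)


@dataclass
class Contributor:
    """A contributor together with a skill level."""
    name: str
    stars: int = MIN_STARS  # assumption: newcomers start at 1 star

    def __post_init__(self) -> None:
        check_stars(self.stars)


draft_link = RatedInfo("chat -> 猫")
certified_link = RatedInfo("chat -> 猫", stars=5)
newcomer = Contributor("new_user")
```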

Distribution

The resources produced will be royalty-free and designed for use both by humans, through bilingual dictionaries, and by natural language processing tools (analysis, machine translation, etc.).

Publications

This research project is described in the following article:

Mathieu Mangeot (2016). Collaborative Construction of a Good Quality, Broad Coverage and Copyright Free Japanese-French Dictionary. International Journal of Lexicography, 2016; doi: 10.1093/ijl/ecw035; 35 p.

The dictionary

The dictionary uses the Jibiki platform (released under the LGPL license), which is based on Enhydra, a Java application server, and on the PostgreSQL database. The platform has already been used in several dictionary projects (DiLAF, GDEF, LexAlp, MotàMot, Pivax, etc.).
