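How to Correctly Morphologically Analyze Text Containing a Mixture of Old- and New-Style Kanji Scripts (新漢字と旧漢字が混在したテキストからの短単位形態素の抽出について)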

松田, 謙次郎; MATSUDA, Kenjiro

https://doi.org/10.15084/00003440

File: papers2108.pdf (850.9 kB)

Item type

Departmental Bulletin Paper(1)

Publication date

2021-07-16

Title

新漢字と旧漢字が混在したテキストからの短単位形態素の抽出について

Title (English)

How to Correctly Morphologically Analyze Text Containing a Mixture of Old- and New-Style Kanji Scripts

Language

jpn

Keywords (Japanese; subject scheme: Other)

形態素解析
国会会議録
旧字体
新字体
当用漢字字体表

Keywords (English; subject scheme: Other)

morphological analysis
Minutes of the National Diet
kyūjitai
shinjitai
the Table of Script Styles of Tōyō Kanji

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

departmental bulletin paper

ID registration

10.15084/00003440

ID registration type

JaLC

Author

松田, 謙次郎
MATSUDA, Kenjiro

Author affiliation

Kobe Shoin Women’s University (神戸松蔭女子学院大学)

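Abstract

Text that mixes old-form (kyūjitai) and new-form (shinjitai) kanji is a frequent cause of errors in morphological analysis. Possible countermeasures include adding variant characters to the entries of the morphological analysis dictionary, replacing kanji with their new forms in advance, and switching between multiple dictionaries. This paper examines how accurately morphemes containing the target kanji can be segmented by analyzing text that contains three old/new pairs (國/国, 會/会, 關/関) under 18 combinations, obtained by crossing six character-replacement methods with three dictionary-selection methods. The data were the Minutes of the National Diet for the 1st through 10th sessions. Short-unit morphemes containing the target kanji were segmented most accurately by combining two things. The first is a replacement method that converts a kanji to its old form when an adjacent kanji is in the old form and to its new form otherwise (opportunistic replacement with the new form as the default). The second is either using the Modern Literary UniDic (近代文語UniDic) throughout, or using the Modern Literary UniDic before the 1949 public notification of the Table of Script Styles of Tōyō Kanji (当用漢字字体表) and the Contemporary Written Japanese UniDic (現代語書き言葉UniDic) after it. The paper argues that adding variant characters to dictionary entries cannot cope with morphemes whose variants have not been registered, whereas combining kanji replacement with dictionary selection handles such cases flexibly.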
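The opportunistic replacement rule described in this abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The three target pairs come from the abstract, but `KNOWN_OLD_FORMS` is a hypothetical, deliberately tiny list of old-form kanji, and the choice to judge neighbours on the original (not yet replaced) text is an assumption.

```python
# Target pairs named in the abstract: old form -> new form.
OLD_TO_NEW = {"國": "国", "會": "会", "關": "関"}
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}

# Hypothetical, illustrative set of old-form kanji used to judge neighbours.
# A real experiment would need a far more complete table.
KNOWN_OLD_FORMS = set("國會關學經濟黨權條總")


def opportunistic_replace(text: str) -> str:
    """Use the old form of a target kanji next to an old-form kanji,
    and the new form (the default) everywhere else."""
    out = []
    for i, ch in enumerate(text):
        if ch not in OLD_TO_NEW and ch not in NEW_TO_OLD:
            out.append(ch)  # not one of the three target pairs
            continue
        # Immediate left and right neighbours, taken from the original text
        # (an assumption of this sketch).
        neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        has_old_neighbour = any(n in KNOWN_OLD_FORMS for n in neighbours)
        new_form = OLD_TO_NEW.get(ch, ch)
        out.append(NEW_TO_OLD[new_form] if has_old_neighbour else new_form)
    return "".join(out)
```

With this rule, `関` in `經濟関係` becomes `關` because its left neighbour `濟` is an old form, while `國` in `國民` falls back to the default `国`.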

Abstract (English)

Japanese texts containing a mixture of old- (kyūjitai) and new- (shinjitai) style kanji scripts pose a serious problem for an automatic morphological analyzer. However, recent developments in various dictionaries by era, undertaken by the corpora project at NINJAL, have brought about a new opportunity to solve this problem. Another promising solution is to replace the script in the text in some way, so that the analyzer can correctly identify the characters/morphemes. We designed an experiment with three dictionary selection methods and six replacement methods using three pairs of old/new kanji scripts (國/国, 會/会 and 關/関) to determine which combination would result in the most precise analysis. An analysis of the text data from the Minutes of the National Diet between 1947 and 1951 demonstrated that, of the 18 combinations, two dictionary selection methods gave the best results: using the Modern Literary UniDic throughout, or using the Modern Literary UniDic up to the public notification of the Table of Script Styles of Tōyō Kanji on April 29, 1949, and the Contemporary Written Japanese UniDic thereafter. With these, we coupled a replacement of a kanji script with its old counterpart when its immediate neighbor was also an old one, and with its new counterpart when it was not. Although adding the different scripts to the dictionary entries would be another viable solution, our method is more desirable in that it is applicable to a wider range of texts without dictionary entry modifications.
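The dictionary-selection side of the experiment can be sketched in the same hedged way. It follows the Japanese-language abstract, in which the Modern Literary UniDic is used before the notification and the Contemporary Written Japanese UniDic after it. The method names, the `contemporary_only` option (assumed to be the third selection method), the placeholder replacement labels, and the treatment of the notification date itself are all assumptions of this sketch, not details from the paper.

```python
from datetime import date
from itertools import product

# Notification date as given in the English abstract.
NOTIFICATION_DATE = date(1949, 4, 29)

MODERN_LITERARY = "Modern Literary UniDic"
CONTEMPORARY_WRITTEN = "Contemporary Written Japanese UniDic"

# Hypothetical names for the three dictionary-selection methods.
DICTIONARY_METHODS = ["literary_only", "contemporary_only", "switch_at_notification"]

# The six replacement methods are not enumerated in the abstract,
# so placeholder labels are used here.
REPLACEMENT_METHODS = [f"replacement_{i}" for i in range(1, 7)]


def select_dictionary(meeting_date: date, method: str) -> str:
    """Pick a dictionary name for a Diet meeting held on meeting_date."""
    if method == "literary_only":
        return MODERN_LITERARY
    if method == "contemporary_only":
        return CONTEMPORARY_WRITTEN
    if method == "switch_at_notification":
        # Assumption: the notification date itself counts as "after".
        if meeting_date < NOTIFICATION_DATE:
            return MODERN_LITERARY
        return CONTEMPORARY_WRITTEN
    raise ValueError(f"unknown method: {method}")


# Crossing 3 selection methods with 6 replacement methods gives 18 conditions.
GRID = list(product(DICTIONARY_METHODS, REPLACEMENT_METHODS))
```

Each pair in `GRID` corresponds to one experimental condition. The chosen dictionary name would then be passed to whatever morphological analyzer is in use, which this sketch leaves out.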

Publisher

National Institute for Japanese Language and Linguistics (国立国語研究所)

Bibliographic information

国立国語研究所論集 (NINJAL Research Papers)

No. 21, pp. 123-132, issued 2021-07

ISSN

2186-1358

Format

application/pdf

Author version flag

Publication type

VoR

Publication type resource

http://purl.org/coar/version/c_970fb48d4fbd8a85


Versions

Ver.1

2023-05-15 14:46:22.829774
