Title: 應用自然語言處理於自動化資訊擷取--以資訊產品規格之擷取為例
Apply Natural Language Process in Automatic Information Extraction--The Example of IT Product Specification Extraction system
Authors: 彭俊彥
Peng, Chun-Yen
楊千
Chyan Yang
College of Management, Information Management Program (管理學院資訊管理學程)
Keywords: information extraction; natural language processing (NLP); ontology; JAPE; grammar rule
Issue date: 2004
Abstract: After more than a decade of development, the World Wide Web (WWW) has become a vast repository of knowledge that holds text documents, images, multimedia, and other resources. Most of this information is expressed in natural language and dispersed across different Web sites. Natural-language documents are easy for people to read but cannot be understood directly by computers, so automatically extracting the knowledge implicit in large numbers of web pages has become an important topic in knowledge engineering.

Research on information extraction focuses on obtaining pieces of information from unstructured documents and converting them into structured formats that computers can process and analyze more easily. There are two main approaches: text mining and web mining. The former is mostly combined with natural language processing (NLP) technology, whereas the latter applies data mining techniques to web content.

This research combines natural language processing with the concept of ontology to build a system that extracts IT product specifications from web pages, as a prototype for automated web information extraction. We use NLP tools to process HTML documents, apply JAPE (Java Annotation Patterns Engine) grammar rules to extract information fragments that have specific syntactic structures, and then refer to a predefined ontology to output the extracted information as a DAML file that conforms to the RDF (Resource Description Framework) specification.

We wrote JAPE grammar rules tailored to the HP and IBM web pages and sampled 36 pages from the two companies' four product lines: personal computers (desktop and notebook), Unix servers, monitors, and printers. The average recall and precision of specification extraction both exceeded 90%. This indicates that JAPE grammar rules perform well for information extraction when they are optimized for a specific knowledge domain.
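
The extraction and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: plain regular expressions stand in for JAPE grammar rules, the ontology and RDF/DAML output step is omitted, and the attribute names (`cpuSpeed`, `memory`, `hardDisk`, `displaySize`), the sample HTML, and the gold-standard set are all hypothetical. It only shows how rule-based extraction from an HTML page can be scored with precision and recall.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical rules: attribute name -> pattern with one capture group.
# These regexes only stand in for JAPE grammar rules; they are not JAPE syntax.
RULES = {
    "cpuSpeed": re.compile(r"(\d+(?:\.\d+)?)\s*GHz", re.I),
    "memory": re.compile(r"(\d+)\s*MB\s+(?:RAM|memory)", re.I),
    "hardDisk": re.compile(r"(\d+)\s*GB\s+hard\s+d(?:isk|rive)", re.I),
}


def extract_specs(html):
    """Return a set of (attribute, value) pairs matched by RULES."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    found = set()
    for attribute, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.add((attribute, match.group(1)))
    return found


def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up product page and hand-labelled answers, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td>Processor</td><td>Intel Pentium 4, 2.8 GHz</td></tr>
  <tr><td>Memory</td><td>512 MB RAM</td></tr>
  <tr><td>Storage</td><td>80 GB hard drive</td></tr>
  <tr><td>Display</td><td>15-inch TFT</td></tr>
</table>
"""

GOLD = {
    ("cpuSpeed", "2.8"),
    ("memory", "512"),
    ("hardDisk", "80"),
    ("displaySize", "15"),  # no rule covers this, so recall drops
}

if __name__ == "__main__":
    extracted = extract_specs(SAMPLE_HTML)
    precision, recall = precision_recall(extracted, GOLD)
    print(sorted(extracted))
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted pair is correct, but the display size has no matching rule, so precision is higher than recall. The same trade-off applies when rules are hand-tuned to particular vendors' pages.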
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009264523
http://hdl.handle.net/11536/77642
Appears in collections: Theses


Files in this item:

  1. 452301.pdf

If the file is a zip archive, download and extract it, then open index.html in the extracted folder with a browser to view the full text.