Full metadata record
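| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | 彭俊彥 | en_US |
| dc.contributor.author | Peng, Chun-Yen | en_US |
| dc.contributor.author | 楊千 | en_US |
| dc.contributor.author | Chyan Yang | en_US |
| dc.date.accessioned | 2014-12-12T02:50:00Z | - |
| dc.date.available | 2014-12-12T02:50:00Z | - |
| dc.date.issued | 2004 | en_US |
| dc.identifier.uri | http://140.113.39.130/cdrfb3/record/nctu/#GT009264523 | en_US |
| dc.identifier.uri | http://hdl.handle.net/11536/77642 | - |
| dc.description.abstract | After more than a decade of development, the Internet has become a vast carrier of knowledge, yet most of its information is presented in natural language. Information expressed in natural language is easy for people to read but cannot be understood by computers. How to automatically extract the implicit knowledge from large numbers of web pages has therefore become a major topic in knowledge engineering research. Research on information extraction focuses on obtaining fragments of information from unstructured documents and converting them into structured document formats that computers can process and analyze more easily. There are currently two main approaches: text mining and web mining. The former is mostly combined with natural language processing (NLP) techniques, while the latter applies data mining techniques to extracting information from web pages. This study attempts to combine natural language processing techniques with the concept of ontology to build a system that extracts IT product specifications, as a prototype study of automated web information extraction. We use natural language processing tools to preprocess HTML documents, then apply JAPE (Java Annotation Patterns Engine) grammar rules to extract information fragments with specific syntactic structures, and finally refer to a predefined ontology to output the extracted information as files conforming to the RDF (Resource Description Framework) specification. We wrote JAPE grammar rules tailored to HP and IBM web pages and sampled 36 web pages from four major product lines of these two companies, namely personal computers (desktops and notebooks), Unix servers, monitors, and printers, to extract product specifications. The average recall and precision both exceeded 90%, which indicates that JAPE grammar rules perform well for information extraction in a specific domain. | zh_TW |
| dc.description.abstract | The World Wide Web (WWW) is a large repository of information that includes many kinds of resources, such as text documents, images, and multimedia. Most of this information is presented in natural language and dispersed across different Web sites. Natural language documents are written for human readers, not for computers. How to extract information from natural language documents is therefore an important topic in knowledge engineering. The major task of information extraction is to extract pieces of information from unstructured documents and convert them into structured documents for computer processing and analysis. There are two major methodologies for information extraction: text mining and web mining. The former is mostly linked to natural language processing (NLP) technology, whereas the latter applies data mining to web content. Our research builds a prototype system that combines natural language processing with the concept of ontology to extract IT product specification information from web pages. We use NLP tools to process HTML documents and extract information entities with JAPE grammar rules, then refer to a predefined ontology to convert the extracted information into DAML files that follow the RDF specification. We developed JAPE grammar rules optimized for IBM and HP web pages that describe IT product specifications and downloaded 36 IT product web pages to test extraction performance. The test results show that average recall and precision are both over 90%, which indicates that JAPE grammar rules achieve good extraction performance when optimized for a specific knowledge domain. | en_US |
| dc.language.iso | en_US | en_US |
| dc.subject | 資訊擷取 (Information extraction) | zh_TW |
| dc.subject | 自然語言處理 (Natural language processing) | zh_TW |
| dc.subject | 本體論 (Ontology) | zh_TW |
| dc.subject | NLP | en_US |
| dc.subject | Information extraction | en_US |
| dc.subject | JAPE | en_US |
| dc.subject | grammar rule | en_US |
| dc.title | 應用自然語言處理於自動化資訊擷取--以資訊產品規格之擷取為例 (Applying Natural Language Processing to Automated Information Extraction: The Case of IT Product Specification Extraction) | zh_TW |
| dc.title | Apply Natural Language Process in Automatic Information Extraction--The Example of IT Product Specification Extraction system | en_US |
| dc.type | Thesis | en_US |
| dc.contributor.department | 管理學院資訊管理學程 (College of Management, Information Management Program) | zh_TW |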
Appears in Collections: Thesis


Files in This Item:

  1. 452301.pdf
