Semantic Partitioner: Transformation of HTML Pages into
Semi-structured XML Documents
Abstract:
The World Wide Web is becoming the largest information
resource, making information extraction (IE) from the Web
an important and challenging problem. In this paper, we present an
automated IE system that is domain independent and that can
automatically transform a given Web page into a semi-structured
hierarchical document using presentation regularities. The resulting
documents are weakly annotated in the sense that they might contain
many incorrect annotations and missing labels. We also describe how
to improve the quality of weakly annotated data by using
domain knowledge in the form of a statistical domain model. We
demonstrate that such a system can recover from ambiguities in the
presentation and boost the overall accuracy of a base information
extractor by up to 20%. Our experimental evaluations with TAP data,
computer science department Web sites, and RoadRunner document sets
indicate that our algorithms can scale up to very large data sets.
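As a rough illustration of what "using presentation regularities" can mean, the sketch below groups consecutive text nodes that share the same root-to-leaf tag path and emits them as a small XML tree. This is not the Semantic Partitioner algorithm described in the papers listed here; the class LeafCollector, the function partition, the output element names (page, group, item), and the sample HTML are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby
import xml.etree.ElementTree as ET


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def partition(html):
    """Group consecutive text nodes with identical tag paths (hypothetical helper)."""
    collector = LeafCollector()
    collector.feed(html)
    root = ET.Element("page")
    # A run of leaves rendered through the same tag path is treated as one group.
    for path, run in groupby(collector.leaves, key=lambda leaf: leaf[0]):
        group = ET.SubElement(root, "group", {"path": path})
        for _, text in run:
            ET.SubElement(group, "item").text = text
    return root


# Made-up sample page: two headings, each followed by a list.
SAMPLE = (
    "<html><body>"
    "<h2>Faculty</h2><ul><li>Alice</li><li>Bob</li></ul>"
    "<h2>Courses</h2><ul><li>Databases</li></ul>"
    "</body></html>"
)

tree = partition(SAMPLE)
print(ET.tostring(tree, encoding="unicode"))
```

The output of such a grouping is only weakly annotated in the sense used above: a heading and the list that follows it end up in separate groups, and nothing yet says that one labels the other. Resolving that kind of ambiguity is where the abstract brings in domain knowledge.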
Demo:
Publications:
Conceptual Papers:
- Srinivas Vadrevu, Fatih Gelgi,
Hasan Davulcu. Semantic Partitioning of Web Pages. (Accepted
for Publication) In World Wide Web Journal: Internet and
Information Systems, 2006.
- Srinivas Vadrevu, Fatih Gelgi,
Hasan Davulcu.
Semantic Partitioning of Web Pages. In The
6th International Conference on Web Information Systems
Engineering (WISE), New York City, NY, November 2005.
(Acceptance Rate: 12%)
Applications of Semantic Partitioner:
- Srinivas Vadrevu, Fatih Gelgi,
Saravanakumar Nagarajan, Hasan Davulcu. Gathering Metadata and
Instances from Object Referral Lists on the Web. (To appear)
In Online Information Review Journal, 2006.
- Srinivas Vadrevu, Fatih Gelgi,
Saravanakumar Nagarajan, Hasan Davulcu.
METEOR: Metadata and Instance Extraction from Object Referral
Lists on the Web. (To appear) In The First Online
Metadata and Semantics Research Conference (MTSR),
Approaches to Advanced Information Systems, 2005.
- Hasan Davulcu, Srinivas Vadrevu,
Saravanakumar Nagarajan.
OntoMiner: Automated Metadata and Instance Mining from News
Websites. In The International Journal of
Web and Grid Services (IJWGS), Volume 1, Number 2, 2005,
Inderscience Publishers.
- Fatih Gelgi, Srinivas Vadrevu,
Hasan Davulcu.
Improving Web Data Annotations with Spreading Activation.
In The 6th International Conference on Web
Information Systems Engineering (WISE), New York City,
NY, November 2005. (Acceptance Rate: 12%)
- Srinivas Vadrevu, Saravanakumar Nagarajan,
Fatih Gelgi, Hasan Davulcu.
Automated Metadata and Instance Extraction from News Web Sites.
In The 2005 IEEE/WIC/ACM International Conference
on Web Intelligence, France, September 2005.
(Acceptance Rate: 18%)
- Hasan Davulcu, Srinivas Vadrevu,
Saravanakumar Nagarajan, Fatih Gelgi.
METEOR: Metadata and Instance
Extraction from Object Referral Lists on the Web.
In The 14th International World Wide Web Conference (WWW 2005),
Chiba, Japan, 2005. (Poster Presentation)
- Hasan Davulcu, Srinivas Vadrevu,
Saravanakumar Nagarajan.
OntoMiner: Bootstrapping Ontologies from
Overlapping Domain Specific Web Sites. In The Thirteenth International
World Wide Web Conference (WWW 2004), New York, 2004. (Poster
Presentation)
- Hasan Davulcu, Srinivas Vadrevu,
Saravanakumar Nagarajan, I.V. Ramakrishnan.
OntoMiner: Bootstrapping and Populating
Ontologies from Domain Specific Web Sites.
In IEEE Intelligent Systems, Volume 18, Number 5,
September/October 2003.
- Hasan Davulcu, Srinivas Vadrevu,
Saravanakumar Nagarajan. OntoMiner:
Bootstrapping and Populating Ontologies from Domain Specific Web
Sites. In First
International Workshop on Semantic Web and Databases, September
2003, Berlin, Germany.