Multilingual document clustering, topic extraction and data transformations

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

6 Citations (Scopus)

Abstract

This paper describes a statistics-based approach for clustering documents and extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as base features for clustering. These features are transformed and then, using an approach based on Principal Components Analysis, reduced to a small set of document classification features. The best number of clusters is found by Model-Based Clustering Analysis. Data transformations that bring the features closer to a normal distribution are applied, and the results are discussed. The most important REs are extracted from each cluster and taken as the cluster topics.
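
The abstract does not say which transformations are used to approximate a normal distribution. As a minimal sketch of the general idea only, the Python snippet below assumes a log(1 + x) transformation and uses sample skewness as a rough check of how far a feature is from symmetry. The feature values, the choice of transformation, and the function names are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


def log_transform(values):
    """Apply log(1 + x); an assumed transformation for non-negative features."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed values of one feature across ten documents.
feature = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 3.0, 9.0]
transformed = log_transform(feature)

skew_before = skewness(feature)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero after the transformation suggests the feature is closer to symmetric, which is one necessary (but not sufficient) property of a normal distribution.
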
Original language: English
Title of host publication: Progress in Artificial Intelligence
Editors: P. Brazdil, A. Jorge
Place of Publication: Berlin
Publisher: Springer
Pages: 74-87
Number of pages: 14
ISBN (Print): 3-540-43030-X
Publication status: Published - 1 Jan 2001

Keywords

  • Document Clustering
  • Model-based clustering
  • Clustering documents
  • Data transformation
  • Document Classification
