In this paper on automatic text categorization, we extensively compare several aspects: document representation, feature selection, classifiers, and their application to text collections in two languages. For the computational representation of documents, we compare the traditional bag-of-words representation with four alternatives: bag of multiwords and bag of word prefixes of N characters (for N = 4, 5 and 6). For feature selection, we compare the well-known metrics Information Gain and Chi-Square with a new one, based on third-moment statistics, that enhances rare terms. For classification, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on Mahalanobis distance. Finally, the study is language independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
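To make the difference between the bag-of-words and bag-of-word-prefixes representations concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `bag_of_words`, `bag_of_prefixes`), the lowercasing, and the letter-only tokenization are all illustrative assumptions, since the abstract does not specify preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of Unicode letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def bag_of_words(text):
    # Traditional representation: count each distinct word.
    return Counter(tokenize(text))


def bag_of_prefixes(text, n):
    # Prefix representation: truncate each word to its first n characters.
    # Words shorter than n characters are kept whole.
    return Counter(token[:n] for token in tokenize(text))


doc = "Classifiers classify; classification classifies documents."
words = bag_of_words(doc)
prefixes = bag_of_prefixes(doc, 5)
```

In this example the four inflected forms sharing the stem "class" are distinct features under bag of words but collapse into one feature under 5-character prefixes, which suggests why prefixes can serve as a cheap, language-independent substitute for stemming.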
|Title of host publication||Proceedings of the 15th Portuguese Conference in Artificial Intelligence, EPIA 2011|
|Pages||660 to 674|
|Publication status||Published - 1 Jan 2011|
|Event||EPIA 2011, Portuguese Conference on Artificial Intelligence (Duration: 1 Jan 2011 → …)|
|Conference||EPIA 2011, Portuguese Conference on Artificial Intelligence|
|Period||1/01/11 → …|