Finding and evaluating information on automatic document classification, becoming
familiar with the Perl language and the LWP package for working with text documents,
finding classifiers in the WEKA program, and comparing different methods of
classification and text parametrization.
Annotation in English
The aim of this diploma thesis is to find a suitable procedure for classifying unlabeled text documents. This requires preparing a large set of training data for classifier learning. The performance of the classifier is then evaluated on test data. Newspaper articles from the server zpravy.atlas.cz are used as the test data. The first part of the thesis covers the theory of automatic classification. The second part deals with finding a classifier using the WEKA program.
The data are processed using the Perl programming language and the LWP package. Plain text is not suitable for further processing, so a global dictionary is created and the documents are converted into feature vectors. These vectors can be expressed in different representations, several of which are tested in the thesis. WEKA is used for training classifiers, cluster analysis, and attribute selection. Different feature-vector representations and classification algorithms are tested in this program.
Perl, Weka, automatic classification, classifier, feature vector, sort out documents
Length of the covering note
102
Language
CZ