The dump_corpus_2csv utility dumps your entire corpus to an Excel-compatible CSV file, where you can analyze or manipulate it.
The following data items are dumped for each corpus entry: BucketName, Word, BucketCount, WordCount, %Bucket, %Total, Score, and Probability.
This utility has been tested in a Windows environment with POPFile versions 0.19.x, 0.20.x, and 0.21.0. It is not compatible with earlier versions of POPFile. The author believes that the utility is platform independent and will work properly on non-Windows POPFile installs, but has not tested it on those platforms.
POPFile is an automatic email classification tool authored by John Graham-Cumming and available from SourceForge.
Download the script to your POPFile install directory, normally c:\Program Files\Popfile.
Run the script from your POPFile installation directory:
Open a DOS box and change to your POPFile directory.
Run dump_corpus_2csv:
perl dump_corpus_2csv.pl
View the results in Excel. Either browse to your POPFile directory and open dump_corpus.csv, or type
start dump_corpus.csv
at the DOS prompt to start Excel and load dump_corpus.csv.
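If Excel is not at hand, the dump can also be inspected with a short script. The following is a minimal Python sketch, not part of the utility itself. It assumes the default comma separator and double-quote character, and it uses a handful of rows in the format the utility produces in place of a real dump_corpus.csv. The helper names load_rows and words_by_bucket are illustrative only.

```python
import csv
import io

# A few rows in the utility's default output format.
# In practice, replace io.StringIO(SAMPLE) with open("dump_corpus.csv", newline="").
SAMPLE = '''"BucketName","Word","BucketCount","WordCount","%Bucket","%Total","Score","Probability"
"normal","sourceforge.net","1","546","1.10989145","0.71799592","0.9997843334","0.999763094072891"
"normal","freshmeat.net","1","153","0.31101354","0.20119666","0.9992306003","0.99915508515539"
"spam","#6a6=","1","94","0.35008007","0.12361102","0.9993164278","0.999249301072207"
"spam","cc:pacbell.net","1","82","0.30538900","0.10783089","0.9992164359","0.999139537221576"
"spam","dll'","1","69","0.25697367","0.09073575","0.9990688833","0.99897758679565"
'''


def load_rows(handle):
    """Parse dump rows into dicts keyed by the header row, with WordCount as an int."""
    rows = []
    for row in csv.DictReader(handle):
        row["WordCount"] = int(row["WordCount"])
        rows.append(row)
    return rows


def words_by_bucket(rows):
    """Group words under their bucket name, preserving file order."""
    buckets = {}
    for row in rows:
        buckets.setdefault(row["BucketName"], []).append(row["Word"])
    return buckets


rows = load_rows(io.StringIO(SAMPLE))
buckets = words_by_bucket(rows)
for name, words in buckets.items():
    print(name, words)
```

The same two helpers should work unchanged on a full dump, since csv.DictReader takes the column names from the first row of the file.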
The following is a sample of the CSV file created by dump_corpus_2csv when run against the author's POPFile installation on June 29, 2003.
"BucketName","Word","BucketCount","WordCount","%Bucket","%Total","Score","Probability"
"normal","sourceforge.net","1","546","1.10989145","0.71799592","0.9997843334","0.999763094072891"
"normal","freshmeat.net","1","153","0.31101354","0.20119666","0.9992306003","0.99915508515539"
"spam","#6a6=","1","94","0.35008007","0.12361102","0.9993164278","0.999249301072207"
"spam","cc:pacbell.net","1","82","0.30538900","0.10783089","0.9992164359","0.999139537221576"
"spam","dll'","1","69","0.25697367","0.09073575","0.9990688833","0.99897758679565"
The script accepts command-line options to override the separator character or the quote character used in producing the CSV file.
-csv_separator: overrides the default comma separator with another character.
-csv_quote: overrides the default field-quoting character (double quote) with another character.
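Changing the default comma separator to a semicolon (in a Unix-style shell, quote or escape the semicolon so that the shell does not treat it as a command separator):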
perl dump_corpus_2csv.pl -csv_separator ;
Changing the default quote character to a single quote (most shells will require you to escape it as shown in the example):
perl dump_corpus_2csv.pl -csv_quote \'
Changing the default comma separator to a colon and the default quote character to a single quote:
perl dump_corpus_2csv.pl -csv_separator : -csv_quote \'
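A file written with non-default characters has to be read back with the same characters. The sketch below uses Python's csv module to do that for the colon and single-quote combination. The sample text is hypothetical: it assumes the options simply swap the two characters in the output, and it is shortened to four columns for brevity. The function name read_dump is illustrative and not part of the utility.

```python
import csv
import io

# Hypothetical output for "-csv_separator : -csv_quote \'", assuming the
# options simply substitute the separator and quote characters.
SAMPLE = (
    "'BucketName':'Word':'BucketCount':'WordCount'\n"
    "'normal':'sourceforge.net':'1':'546'\n"
    "'spam':'cc:pacbell.net':'1':'82'\n"
)


def read_dump(handle, separator=",", quote='"'):
    """Read a dump that was written with the given separator and quote characters."""
    reader = csv.reader(handle, delimiter=separator, quotechar=quote)
    return list(reader)


table = read_dump(io.StringIO(SAMPLE), separator=":", quote="'")
for record in table:
    print(record)
```

Because the fields are quoted, a word that itself contains the separator (such as cc:pacbell.net here) is still read as a single field. How the utility handles a word that contains the quote character is not covered by this sketch.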
I noticed a temp subdirectory was created in my POPFile folder. Why is this?
This occurs only with v0.19.x or v0.20.x of POPFile. The program uses the POPFile API to gather all of the corpus data. The API calls automatically create a couple of files: popfile.pid and a popfile#.log file. To ensure that running this program does not interfere with your running POPFile installation, the copies of those files created by this program are diverted to a safe place, the temp subdirectory, where they are harmless. You can delete the subdirectory and its contents at will.
Copyright (C) 2003 - 2007 Scott W. Leighton
Licensed under the terms of the GNU General Public License.
Contributed to the POPFile project under the terms of the POPFile License Agreement.