POPFile dump_corpus_2csv Utility

The dump_corpus_2csv utility dumps your entire corpus to an Excel-compatible CSV file, where you can analyze or manipulate it.

The following data items are dumped for each corpus entry: BucketName, Word, BucketCount, WordCount, %Bucket, %Total, Score, and Probability.

The utility has been tested in a Windows environment with POPFile versions 0.19.x, 0.20.x, and 0.21.0. It is not compatible with earlier versions of POPFile. The author believes that the utility is platform independent and will work properly on non-Windows POPFile installs, but has not tested it on those platforms.

POPFile is an automatic email classification tool authored by John Graham-Cumming and available from SourceForge.

Instructions for use

  1. Download the script to your POPFile install directory, normally c:\Program Files\Popfile.

  2. Run the script from your POPFile installation directory.

    • Open a DOS box and change to your POPFile directory.

    • Run dump_corpus_2csv:

      perl dump_corpus_2csv.pl
      

    • View the results in Excel: either browse to your POPFile directory and open dump_corpus.csv, or type

      start dump_corpus.csv
      
      at the DOS prompt to start Excel and load dump_corpus.csv.
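
If Excel is not at hand, the dump can also be inspected with a short script. The following is a minimal Python sketch, not part of the utility: it assumes the column names shown in the header row of the sample CSV file in this document, and the function name top_words is made up for illustration. It reads from an in-memory sample so that it runs as-is; for a real dump, pass an open handle on dump_corpus.csv instead.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the sample CSV file in this document.
# For a real run, replace io.StringIO(SAMPLE) with
# open("dump_corpus.csv", newline="").
SAMPLE = '''\
"BucketName","Word","BucketCount","WordCount","%Bucket","%Total","Score","Probability"
"normal","sourceforge.net","1","546","1.10989145","0.71799592","0.9997843334","0.999763094072891"
"normal","freshmeat.net","1","153","0.31101354","0.20119666","0.9992306003","0.99915508515539"
"spam","#6a6=","1","94","0.35008007","0.12361102","0.9993164278","0.999249301072207"
'''


def top_words(stream, n=10):
    """Return {bucket: [(word, count), ...]} holding the n highest
    WordCount entries for each bucket (hypothetical helper)."""
    by_bucket = defaultdict(list)
    for row in csv.DictReader(stream):
        by_bucket[row["BucketName"]].append((row["Word"], int(row["WordCount"])))
    return {
        bucket: sorted(words, key=lambda wc: wc[1], reverse=True)[:n]
        for bucket, words in by_bucket.items()
    }


print(top_words(io.StringIO(SAMPLE), n=1))
```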

Sample CSV File

The following is a sample of the CSV file created by dump_corpus_2csv when run against the author's POPFile installation on June 29, 2003.

"BucketName","Word","BucketCount","WordCount","%Bucket","%Total","Score","Probability"
"normal","sourceforge.net","1","546","1.10989145","0.71799592","0.9997843334","0.999763094072891"
"normal","freshmeat.net","1","153","0.31101354","0.20119666","0.9992306003","0.99915508515539"
"spam","#6a6=","1","94","0.35008007","0.12361102","0.9993164278","0.999249301072207"
"spam","cc:pacbell.net","1","82","0.30538900","0.10783089","0.9992164359","0.999139537221576"
"spam","dll'","1","69","0.25697367","0.09073575","0.9990688833","0.99897758679565"

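The page does not say how the %Bucket and %Total columns are computed. One plausible reading, assumed here rather than taken from the script, is that %Bucket is WordCount as a percentage of all word occurrences in that bucket, and %Total is WordCount as a percentage of all word occurrences in the corpus. The Python sketch below checks that reading against the five sample rows by working backwards to the totals each row implies; the function name implied_totals is made up for illustration.

```python
import csv
import io

# The sample rows from this document, copied verbatim.
SAMPLE = '''\
"BucketName","Word","BucketCount","WordCount","%Bucket","%Total","Score","Probability"
"normal","sourceforge.net","1","546","1.10989145","0.71799592","0.9997843334","0.999763094072891"
"normal","freshmeat.net","1","153","0.31101354","0.20119666","0.9992306003","0.99915508515539"
"spam","#6a6=","1","94","0.35008007","0.12361102","0.9993164278","0.999249301072207"
"spam","cc:pacbell.net","1","82","0.30538900","0.10783089","0.9992164359","0.999139537221576"
"spam","dll'","1","69","0.25697367","0.09073575","0.9990688833","0.99897758679565"
'''


def implied_totals(stream):
    """For each row, return (bucket, implied bucket total, implied corpus total).

    Assumes percent = 100 * WordCount / total, so total = WordCount / (percent / 100).
    """
    results = []
    for row in csv.DictReader(stream):
        count = int(row["WordCount"])
        bucket_total = round(count / (float(row["%Bucket"]) / 100))
        corpus_total = round(count / (float(row["%Total"]) / 100))
        results.append((row["BucketName"], bucket_total, corpus_total))
    return results


for item in implied_totals(io.StringIO(SAMPLE)):
    print(item)
```

Under this assumption, both "normal" rows imply a bucket total of 49194, all three "spam" rows imply 26851, and every row implies a corpus total of 76045, which equals 49194 + 26851. The sample is therefore consistent with the assumed meaning of the two columns, although that is an inference from the data and not a description of the script's internals.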
Command-line Options

The script accepts command-line options that override the separator character or the quote character used in producing the CSV file.

Usage Examples

Changing the default comma separator to a semicolon (in Unix-like shells, quote or escape the semicolon, since an unquoted ; ends the command).

perl dump_corpus_2csv.pl -csv_separator ;

Changing the default quote character to a single quote (most shells will require you to escape it, as shown in the example).

perl dump_corpus_2csv.pl -csv_quote \'

Changing the default comma separator to a colon and the default quote character to a single quote.

perl dump_corpus_2csv.pl -csv_separator : -csv_quote \'
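
A file produced with non-default characters has to be read back with the same settings. The Python sketch below shows one way to do that with the standard csv module. The sample text is hypothetical: it is two rows from the sample CSV file in this document rewritten by hand on the assumption that the script simply wraps each field in the chosen quote character and joins fields with the chosen separator. The function name read_dump is also made up for illustration.

```python
import csv
import io

# Hypothetical output for: -csv_separator : -csv_quote '
# (hand-written from the sample rows; the script's real output may differ)
SAMPLE = """\
'BucketName':'Word':'BucketCount':'WordCount':'%Bucket':'%Total':'Score':'Probability'
'normal':'sourceforge.net':'1':'546':'1.10989145':'0.71799592':'0.9997843334':'0.999763094072891'
'spam':'cc:pacbell.net':'1':'82':'0.30538900':'0.10783089':'0.9992164359':'0.999139537221576'
"""


def read_dump(stream, separator=",", quote='"'):
    """Parse a dump using the same separator and quote character
    that were passed to the script (hypothetical helper)."""
    return list(csv.DictReader(stream, delimiter=separator, quotechar=quote))


rows = read_dump(io.StringIO(SAMPLE), separator=":", quote="'")
# The quoting keeps the colon inside the word from splitting the field.
print(rows[1]["Word"], rows[1]["WordCount"])
```

The second row is included because its word contains the separator itself; without matching quote handling it would split into nine fields instead of eight. The page does not describe how the script handles a word that contains the chosen quote character (such as dll' in the sample), so such rows may need checking by hand.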

Copying

Copyright (C) 2003 - 2007 Scott W. Leighton

Licensed under the terms of the GNU General Public License.

Contributed to the POPFile project under the terms of the POPFile License Agreement.

