This utility is deprecated and no longer supported. It will not function properly with versions higher than 0.19.x.
The Top Ten utility is a tool for use with POPFile that lists the top ten words (or some other quantity you select) in each bucket's corpus, ranked from highest to lowest word count.
This version has been tested in a Windows environment with versions 0.18.1 and 0.19.0 of POPFile. This program is not compatible with Version 0.20.0 or higher of POPFile and the author has no plans to update it. Use the enhanced version, Top Ten Enhanced, instead. The author believes that the utility is platform independent and will work properly on non-Windows POPFile installs, but has not tested on those platforms.
POPFile is an automatic email classification tool authored by John Graham-Cumming available from SourceForge.
Download the script (by clicking here) to your POPFile install directory, normally c:\Program Files\Popfile.
Open a DOS command box: click the DOS icon on your desktop, or choose Start/Run, type command in the Open box, and click OK.
Change to your POPFile installation directory, e.g.,
cd "\program files\popfile"
Run topten.pl using Perl.
perl topten.pl > report.txt
The resulting diagnostic report will be in the file named 'report.txt'. Open it with a text editor such as Notepad.
start notepad.exe report.txt
Note: to select more (or fewer) words, simply place an integer value on the command line when you execute topten.pl, e.g.,
perl topten.pl 50 > report.txt
The above would list the top 50 words in each bucket.
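How topten.pl reads that integer is not shown on this page. The following is only a minimal Python sketch (the utility itself is Perl) of one way such an optional count could be handled. The helper name parse_limit, the default of 10, and the fallback for invalid input are all assumptions made for illustration.

```python
DEFAULT_LIMIT = 10  # assumed default, suggested by the "Top Ten" name


def parse_limit(args):
    """Return how many words to list per bucket.

    args is the list of command-line arguments after the script name.
    Hypothetical helper; not taken from topten.pl.
    """
    if not args:
        return DEFAULT_LIMIT
    try:
        limit = int(args[0])
    except ValueError:
        # Assumption: silently fall back to the default on non-integer input.
        return DEFAULT_LIMIT
    # Assumption: zero or negative counts are treated as "use the default".
    return limit if limit > 0 else DEFAULT_LIMIT
```

Under these assumptions, parse_limit(["50"]) gives 50 and parse_limit([]) gives 10.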
The following is a sample of the output from topten run against the author's corpus on May 25, 2003 with the command-line option of 50 to show the top 50.
Top Ten Utility for POPFile diagnostics complete, no errors found in corpus

Top 50 for Bucket corpus/normal word count = 45189 (64.6%) words = 8647

Rank  Word From Corpus          Word Count  % Bucket  % Total
   1  sourceforge.net                  546    1.2083   0.7807
   2  html:td                          528    1.1684   0.7549
   3  forum                            348    0.7701   0.4976
   4  pacbell.net                      272    0.6019   0.3889
   5  php                              250    0.5532   0.3575
   6  email                            210    0.4647   0.3003
   7  freshmeat.net                    153    0.3386   0.2188
   8  pop                              151    0.3342   0.2159
   9  atomz.com                        150    0.3319   0.2145
  10  www.atomz.com                    136    0.3010   0.1945
  11  popfile                          134    0.2965   0.1916
  12  web                              132    0.2921   0.1887
  13  dsl.xx.xx.xx.xx.xx.xxx           131    0.2899   0.1873
  14  xx.xx.xx.xx.xx.xx.xx             131    0.2899   0.1873
  15  visit                            131    0.2899   0.1873
  16  xxxxxx.xx.xxxx.net               131    0.2899   0.1873
  17  more                             130    0.2877   0.1859
  18  php.net                          129    0.2855   0.1844
  19  prodigy.net                      126    0.2788   0.1802
  20  monitor                          125    0.2766   0.1787
  21  html:comment                     123    0.2722   0.1759
  22  postoffice.pacbell.net           123    0.2722   0.1759
  23  from:sourceforge.net             123    0.2722   0.1759
  24  66.35.250.xxx                    123    0.2722   0.1759
  25  palm                             121    0.2678   0.1730
  26  what                             106    0.2346   0.1516
  27  new                              106    0.2346   0.1516
  28  about                            104    0.2301   0.1487
  29  when                             104    0.2301   0.1487
  30  file                             103    0.2279   0.1473
  31  support                          103    0.2279   0.1473
  32  because                          101    0.2235   0.1444
  33  now                              100    0.2213   0.1430
  34  please                           100    0.2213   0.1430
  35  slashdot.org                      98    0.2169   0.1401
  36  unsubscribe                       97    0.2147   0.1387
  37  using                             96    0.2124   0.1373
  38  list                              95    0.2102   0.1358
  39  how                               94    0.2080   0.1344
  40  one                               94    0.2080   0.1344
  41  line                              91    0.2014   0.1301
  42  information                       90    0.1992   0.1287
  43  lists.php.net                     89    0.1970   0.1273
  44  local                             89    0.1970   0.1273
  45  which                             89    0.1970   0.1273
  46  number                            88    0.1947   0.1258
  47  osdn.com                          88    0.1947   0.1258
  48  other                             88    0.1947   0.1258
  49  use                               88    0.1947   0.1258
  50  businessweek.com                  85    0.1881   0.1215

Top 50 for Bucket corpus/spam word count = 24750 (35.4%) words = 6790

Rank  Word From Corpus          Word Count  % Bucket  % Total
   1  html:comment                     558    2.2545   0.7978
   2  html:td                          520    2.1010   0.7435
   3  click                            170    0.6869   0.2431
   4  email                            156    0.6303   0.2231
   5  here                             150    0.6061   0.2145
   6  trick:invisibleink               146    0.5899   0.2088
   7  pop                              138    0.5576   0.1973
   8  information                      120    0.4848   0.1716
   9  please                           117    0.4727   0.1673
  10  free                             106    0.4283   0.1516
  11  216.109.73.xxx                   100    0.4040   0.1430
  12  #ffffff                           95    0.3838   0.1358
  13  #6a6=                             94    0.3798   0.1344
  14  one                               89    0.3596   0.1273
  15  cc:pacbell.net                    82    0.3313   0.1172
  16  found                             82    0.3313   0.1172
  17  receive                           81    0.3273   0.1158
  18  #ff0000                           79    0.3192   0.1130
  19  msn                               76    0.3071   0.1087
  20  now                               75    0.3030   0.1072
  21  mailapps                          71    0.2869   0.1015
  22  about                             71    0.2869   0.1015
  23  loaded                            70    0.2828   0.1001
  24  matching                          70    0.2828   0.1001
  25  internet                          69    0.2788   0.0987
  26  dll'                              69    0.2788   0.0987
  27  symbolic                          68    0.2747   0.0972
  28  time                              67    0.2707   0.0958
  29  unsubscribe                       66    0.2667   0.0944
  30  just                              64    0.2586   0.0915
  31  system                            64    0.2586   0.0915
  32  get                               62    0.2505   0.0886
  33  only                              61    0.2465   0.0872
  34  money                             59    0.2384   0.0844
  35  more                              59    0.2384   0.0844
  36  #000000                           56    0.2263   0.0801
  37  offers                            52    0.2101   0.0744
  38  list                              52    0.2101   0.0744
  39  report                            51    0.2061   0.0729
  40  new                               50    0.2020   0.0715
  41  software                          48    0.1939   0.0686
  42  encoding:quotedprintable          48    0.1939   0.0686
  43  over                              47    0.1899   0.0672
  44  bin                               47    0.1899   0.0672
  45  business                          47    0.1899   0.0672
  46  take                              46    0.1859   0.0658
  47  like                              45    0.1818   0.0643
  48  wish                              45    0.1818   0.0643
  49  common                            45    0.1818   0.0643
  50  want                              44    0.1778   0.0629
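The two percentage columns appear to be the word's count divided by the bucket's word count and by the word count of all buckets combined (45189 + 24750 = 69939 in the sample). That reading is inferred from the sample figures, not taken from topten.pl. A minimal Python sketch of the ranking and the two percentages under that assumption follows; the function name rank_bucket and its parameters are hypothetical.

```python
def rank_bucket(word_counts, bucket_total, corpus_total, limit=10):
    """Rank one bucket's words by count and compute both percentage columns.

    word_counts  -- dict mapping word -> count for a single bucket
    bucket_total -- word count of that bucket (e.g. 45189 for corpus/normal)
    corpus_total -- word count of all buckets combined (e.g. 69939)
    Hypothetical helper; the formulas are inferred from the sample report.
    """
    # Sort from highest to lowest count; order among tied counts is unspecified here.
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    for rank, (word, count) in enumerate(ranked[:limit], start=1):
        pct_bucket = round(100.0 * count / bucket_total, 4)
        pct_total = round(100.0 * count / corpus_total, 4)
        rows.append((rank, word, count, pct_bucket, pct_total))
    return rows


# A few counts copied from the corpus/normal sample above.
sample = {"forum": 348, "sourceforge.net": 546, "html:td": 528}
for row in rank_bucket(sample, bucket_total=45189, corpus_total=69939, limit=2):
    print("%4d  %-26s%10d  %8.4f  %7.4f" % row)
```

With the sample figures, the first row works out to 546 / 45189 = 1.2083% of the bucket and 546 / 69939 = 0.7807% of the total, which matches the report.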
Windows users who have Tim Charron's Blat utility can easily set up topten to run automatically and email the results.
Obtain and install Blat from Tim Charron's page here.
Install Blat in a directory in your path, or in the POPFile directory.
Run Blat -install <server address> <senders address> to get Blat configured correctly. Make sure that <server address> points to an SMTP server that you are permitted to relay mail through. Usually this will be the same SMTP server you set up in your mail client.
Create a batch file as follows:
perl topten.pl | blat - -t youremail@address.here -s "Top Ten Report"
cls
@exit
Save the batch file in your POPFile directory and name it topten.bat.
Use the Wizard to browse to your POPFile installation directory, usually "c:\Program Files\Popfile", and select the batch file topten.bat
Change the name of the task to top ten
Select the frequency to run it
Select the time and day(s) to run it
Click finish
Close the task scheduler (or test it by right clicking on the new entry you made and selecting run)
You're done. The task scheduler will run the batch file at the time(s) you scheduled. The batch file will run the Top Ten report and email it off to you. No muss, no fuss <g>
Copyright (C) 2003 Scott W. Leighton
Licensed under the terms of the GNU General Public License.
Contributed to the POPFile project under the terms of the POPFile License Agreement.