POPFile topten Utility

This utility is deprecated and no longer supported. It will not function properly with versions higher than 0.19.x.

The Top Ten utility is a tool for use with POPFile that lists the top ten words (or some other quantity you select) in each bucket's corpus, ranked from highest to lowest word count.
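The ranking described above can be illustrated with a minimal Python sketch. This is not the code of topten.pl; the function name top_words and the dictionary normal_bucket are hypothetical, and the counts are borrowed from the sample report further down purely for illustration.

```python
from collections import Counter

def top_words(word_counts, limit=10):
    """Return the `limit` highest-count (word, count) pairs, highest first."""
    # Counter.most_common sorts entries by count in descending order
    return Counter(word_counts).most_common(limit)

# Hypothetical bucket contents (word -> number of occurrences)
normal_bucket = {"forum": 348, "php": 250, "sourceforge.net": 546, "email": 210}

# Print a small ranked table, similar in spirit to the report layout
for rank, (word, count) in enumerate(top_words(normal_bucket, limit=3), start=1):
    print(f"{rank:4d}  {word:<30}{count:8d}")
```

The limit parameter plays the role of the quantity you select; how the real script stores and reads a bucket's corpus is not shown here.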

This version has been tested in a Windows environment with versions 0.18.1 and 0.19.0 of POPFile. This program is not compatible with version 0.20.0 or higher of POPFile, and the author has no plans to update it. Use the enhanced version, Top Ten Enhanced, instead. The author believes that the utility is platform independent and will work properly on non-Windows POPFile installs, but has not tested it on those platforms.

POPFile is an automatic email classification tool authored by John Graham-Cumming and available from SourceForge.

Instructions for use

  1. Download the script (topten.pl) to your POPFile install directory, normally c:\Program Files\Popfile.

  2. Open a DOS command prompt (click the DOS icon on your desktop, or choose Start/Run, type command in the Open box, and click OK).

  3. Change to your POPFile installation directory, e.g.,

    cd  "\program files\popfile"

  4. Run topten.pl using Perl.

    perl topten.pl > report.txt

  5. The resulting diagnostic report will be in the file named 'report.txt'. Open it with a text editor such as Notepad.

    start notepad.exe report.txt

Note: to select more (or fewer) words, place an integer value on the command line when you execute topten.pl, e.g.,

    perl topten.pl 50 > report.txt

The above would list the top 50 words in each bucket.
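As a rough illustration of how an optional count on the command line can be handled, here is a short Python sketch. It is not taken from topten.pl: the names DEFAULT_LIMIT and parse_limit are hypothetical, and falling back to ten when the argument is missing, non-numeric, or not positive is an assumption about sensible behavior, not a statement of what the real script does.

```python
import sys

DEFAULT_LIMIT = 10  # ten words per bucket when no number is given

def parse_limit(argv):
    """Return the word limit from argv, falling back to DEFAULT_LIMIT."""
    if len(argv) > 1:
        try:
            value = int(argv[1])
        except ValueError:
            # Assumption: ignore a non-numeric argument and use the default
            return DEFAULT_LIMIT
        if value > 0:
            return value
    return DEFAULT_LIMIT

if __name__ == "__main__":
    print(f"Listing the top {parse_limit(sys.argv)} words in each bucket")
```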

Sample Output Report

The following is a sample of the output from topten, run against the author's corpus on May 25, 2003, with the command-line option of 50 to show the top 50 words.

Top Ten Utility for POPFile
diagnostics complete, no errors found in corpus

Top 50 for Bucket corpus/normal  word count = 45189 (64.6%) words = 8647
Rank  Word From Corpus              Word Count   % Bucket    % Total
   1  sourceforge.net                    546       1.2083     0.7807
   2  html:td                            528       1.1684     0.7549
   3  forum                              348       0.7701     0.4976
   4  pacbell.net                        272       0.6019     0.3889
   5  php                                250       0.5532     0.3575
   6  email                              210       0.4647     0.3003
   7  freshmeat.net                      153       0.3386     0.2188
   8  pop                                151       0.3342     0.2159
   9  atomz.com                          150       0.3319     0.2145
  10  www.atomz.com                      136       0.3010     0.1945
  11  popfile                            134       0.2965     0.1916
  12  web                                132       0.2921     0.1887
  13  dsl.xx.xx.xx.xx.xx.xxx             131       0.2899     0.1873
  14  xx.xx.xx.xx.xx.xx.xx               131       0.2899     0.1873
  15  visit                              131       0.2899     0.1873
  16  xxxxxx.xx.xxxx.net                 131       0.2899     0.1873
  17  more                               130       0.2877     0.1859
  18  php.net                            129       0.2855     0.1844
  19  prodigy.net                        126       0.2788     0.1802
  20  monitor                            125       0.2766     0.1787
  21  html:comment                       123       0.2722     0.1759
  22  postoffice.pacbell.net             123       0.2722     0.1759
  23  from:sourceforge.net               123       0.2722     0.1759
  24  66.35.250.xxx                      123       0.2722     0.1759
  25  palm                               121       0.2678     0.1730
  26  what                               106       0.2346     0.1516
  27  new                                106       0.2346     0.1516
  28  about                              104       0.2301     0.1487
  29  when                               104       0.2301     0.1487
  30  file                               103       0.2279     0.1473
  31  support                            103       0.2279     0.1473
  32  because                            101       0.2235     0.1444
  33  now                                100       0.2213     0.1430
  34  please                             100       0.2213     0.1430
  35  slashdot.org                        98       0.2169     0.1401
  36  unsubscribe                         97       0.2147     0.1387
  37  using                               96       0.2124     0.1373
  38  list                                95       0.2102     0.1358
  39  how                                 94       0.2080     0.1344
  40  one                                 94       0.2080     0.1344
  41  line                                91       0.2014     0.1301
  42  information                         90       0.1992     0.1287
  43  lists.php.net                       89       0.1970     0.1273
  44  local                               89       0.1970     0.1273
  45  which                               89       0.1970     0.1273
  46  number                              88       0.1947     0.1258
  47  osdn.com                            88       0.1947     0.1258
  48  other                               88       0.1947     0.1258
  49  use                                 88       0.1947     0.1258
  50  businessweek.com                    85       0.1881     0.1215

Top 50 for Bucket corpus/spam  word count = 24750 (35.4%) words = 6790
Rank  Word From Corpus              Word Count   % Bucket    % Total
   1  html:comment                       558       2.2545     0.7978
   2  html:td                            520       2.1010     0.7435
   3  click                              170       0.6869     0.2431
   4  email                              156       0.6303     0.2231
   5  here                               150       0.6061     0.2145
   6  trick:invisibleink                 146       0.5899     0.2088
   7  pop                                138       0.5576     0.1973
   8  information                        120       0.4848     0.1716
   9  please                             117       0.4727     0.1673
  10  free                               106       0.4283     0.1516
  11  216.109.73.xxx                     100       0.4040     0.1430
  12  #ffffff                             95       0.3838     0.1358
  13  #6a6=                               94       0.3798     0.1344
  14  one                                 89       0.3596     0.1273
  15  cc:pacbell.net                      82       0.3313     0.1172
  16  found                               82       0.3313     0.1172
  17  receive                             81       0.3273     0.1158
  18  #ff0000                             79       0.3192     0.1130
  19  msn                                 76       0.3071     0.1087
  20  now                                 75       0.3030     0.1072
  21  mailapps                            71       0.2869     0.1015
  22  about                               71       0.2869     0.1015
  23  loaded                              70       0.2828     0.1001
  24  matching                            70       0.2828     0.1001
  25  internet                            69       0.2788     0.0987
  26  dll'                                69       0.2788     0.0987
  27  symbolic                            68       0.2747     0.0972
  28  time                                67       0.2707     0.0958
  29  unsubscribe                         66       0.2667     0.0944
  30  just                                64       0.2586     0.0915
  31  system                              64       0.2586     0.0915
  32  get                                 62       0.2505     0.0886
  33  only                                61       0.2465     0.0872
  34  money                               59       0.2384     0.0844
  35  more                                59       0.2384     0.0844
  36  #000000                             56       0.2263     0.0801
  37  offers                              52       0.2101     0.0744
  38  list                                52       0.2101     0.0744
  39  report                              51       0.2061     0.0729
  40  new                                 50       0.2020     0.0715
  41  software                            48       0.1939     0.0686
  42  encoding:quotedprintable            48       0.1939     0.0686
  43  over                                47       0.1899     0.0672
  44  bin                                 47       0.1899     0.0672
  45  business                            47       0.1899     0.0672
  46  take                                46       0.1859     0.0658
  47  like                                45       0.1818     0.0643
  48  wish                                45       0.1818     0.0643
  49  common                              45       0.1818     0.0643
  50  want                                44       0.1778     0.0629
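The percentage columns in the report can be reproduced from the figures it prints. The sketch below assumes that "% Bucket" is a word's count divided by the bucket's total word count, and that "% Total" is the same count divided by the combined word count of all buckets. That reading is inferred from the sample numbers, not from the script's source, and the function name percentages is hypothetical.

```python
def percentages(word_count, bucket_total, corpus_total):
    """Return (percent of bucket, percent of whole corpus) for one word."""
    bucket_pct = 100.0 * word_count / bucket_total
    total_pct = 100.0 * word_count / corpus_total
    return bucket_pct, total_pct

# Totals taken from the two bucket header lines of the sample report
normal_total = 45189
spam_total = 24750
corpus_total = normal_total + spam_total  # assumed: the corpus is just these two buckets

# Rank 1 in the normal bucket: sourceforge.net, 546 occurrences
bucket_pct, total_pct = percentages(546, normal_total, corpus_total)
print(f"{bucket_pct:.4f} {total_pct:.4f}")

# Share of the whole corpus held by the normal bucket (shown in its header line)
print(f"{100.0 * normal_total / corpus_total:.1f}%")
```

Under these assumptions the first line printed matches the 1.2083 and 0.7807 shown for sourceforge.net, and the second matches the 64.6% in the normal bucket's header.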

Running topten Automatically

Windows users who have Tim Charron's Blat utility can easily set up topten to run automatically and email the results.

  1. Obtain Blat from Tim Charron's page.

  2. Install Blat in a directory on your path, or in the POPFile directory.

  3. Run Blat -install <server address> <sender's address> to configure Blat. Make sure that <server address> points to an SMTP server that you are permitted to relay mail through; usually this will be the same SMTP server you set up in your mail client.

  4. Create a batch file as follows:

    perl topten.pl | blat - -t youremail@address.here -s "Top Ten Report"

  5. Save the batch file in your POPFile directory and name it topten.bat.

  6. Open your task scheduler and add a scheduled task that runs topten.bat.

You're done. The task scheduler will run the batch file at the time(s) you scheduled. The batch file will run the Top Ten report and email it off to you. No muss, no fuss <g>


Copyright (C) 2003 Scott W. Leighton

Licensed under the terms of the GNU General Public License.

Contributed to the POPFile project under the terms of the POPFile License Agreement.
