Well, if you have no pen-pal in Russia, you might wish to eliminate any Russian E-mails.
First, edit your local.cf to set the "ok_locales" option to the list of acceptable locales. Possible choices are: "en" for English and other European languages, "ja" for Japanese, "ko" for Korean, "ru" for Cyrillic, "th" for Thai, "zh" for Chinese, or "all" for any encoding. Items in the list are space-separated. The default value is "all", so SA doesn't consider any locale undesirable. The Mail::SpamAssassin::Conf(3pm) manpage states that the "CHARSET_FARAWAY_*" rules are disabled when "ok_locales" is set to "all".
Note, however, that there is a "subtle issue" about the way some charsets are treated by Locales.pm. An excerpt from discussion on Bug 4078:
... there are a number of character sets that have the Roman alphabet as the 0x20 to 0x7e ASCII characters and some other language in the high-bit characters. Anyone with a Hebrew Windows machine, for example, is likely to send all mail, including English, in the Windows-1255 charset. That's why every charset that begins with "WINDOWS" is whitelisted.
Version 3.3.2 (the current one as of this writing) still lists charset names starting with "WINDOWS", "CP125", "UTF" and "ISO" as "always OK". The grim humor is that the KOI8-R charset also has Latin letters in its "lower" half but is not whitelisted, so legitimate E-mails sent from Russia by humans and written entirely in English would get some extra score points, unlike those labeled with CP1251.
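To make the "ok_locales" setting discussed above concrete, here is a minimal sketch of the relevant local.cf line. It assumes a site that only expects mail in English and other European languages; adjust the list to match your own correspondents.

```
# local.cf: accept only the "en" locale group, so that the
# CHARSET_FARAWAY_* rules are no longer disabled
ok_locales en
```

A site that also corresponds in Japanese would list both values, space-separated: "ok_locales en ja".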
A somewhat more detailed method is offered by the TextCat plugin (see the Mail::SpamAssassin::Plugin::TextCat(3pm) manpage): you can specify a list of accepted languages in the "ok_languages" option (which defaults to "all"). The TextCat plugin is disabled by default, so you should enable it (using the "loadplugin" option in your "local.cf").
Note: SA uses the Encode::Detect Perl module to guess the encoding of the message text. If the detection fails, no score points will be assigned to the message by the rules that depend on that module. In Russian spam messages, Cyrillic letters are often replaced with similar-looking Latin ones and digits, producing text that looks like garbage to automated processing tools but still makes sense to humans. That's why Russian spam can often slip through when "ok_locales" and "ok_languages" are set to "en" only. There is also a problem with identifying Russian text in UTF-8 encoded messages (Bug 6364).
You may also find it useful to read the "Problems with Cyrillic spam" thread on the SA mailing list.
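The TextCat setup described above might look like the following sketch. It assumes that English is the only language you want to accept; as far as I know, the rule that the stock ruleset attaches to this plugin is UNWANTED_LANGUAGE_BODY, but check your own installation before relying on that name.

```
# Enable language guessing (in some installations this line is already
# present, commented out, in one of the *.pre files)
loadplugin Mail::SpamAssassin::Plugin::TextCat

# Accept only English; text guessed to be in any other language is
# then treated as unwanted
ok_languages en
```

Several languages can be listed space-separated, e.g. "ok_languages en de fr".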
The next step in blocking all Russian E-mails is to add rules that fire on various header fields.
First, a good deal of messages in Russian use either the KOI8-R or the Microsoft CP-1251 encoding. The following header fields may contain Russian text: "From:", "To:" and "Subject:". When they do, the encoded text starts with an equals sign followed by a "?koi8-r?" or "?windows-1251?" charset declaration, so we can use it to make up rules like the following:
header __HDR_FROM_CYR From:raw =~ /=\?((((cs)?)koi8(-?)r)|(windows-1251))\?/i
header __HDR_TO_CYR ToCc:raw =~ /=\?((((cs)?)koi8(-?)r)|(windows-1251))\?/i
header __HDR_SUBJECT_CYR Subject:raw =~ /=\?((((cs)?)koi8(-?)r)|(windows-1251))\?/i
(The IANA character set assignment states that "csKOI8R" is a valid alias for "KOI8-R", though I have never seen it in E-mail headers.) A charset declaration can also be found in the "Content-Type" header field. However, the message can be a MIME "multipart" one, with charset declarations preceding the parts within the body, so we need to use a "mimeheader" rule (falling back to "header" should the MIMEHeader plugin be unavailable):
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
ifplugin Mail::SpamAssassin::Plugin::MIMEHeader
# MIME allows the charset value to be quoted or unquoted, so the quote is optional here
mimeheader __MIME_CONTENT_CYR Content-Type:raw =~ /charset=\"?((cs)?koi8-?r|windows-1251)\b/i
else
header __MIME_CONTENT_CYR Content-Type:raw =~ /charset=\"?((cs)?koi8-?r|windows-1251)\b/i
endif
(For these rules to work, the MIMEHeader plugin must be present on your system. See the Mail::SpamAssassin::Plugin::MIMEHeader(3pm) manpage for details.)
Note, however, that these rules will seldom fire, since spam messages usually have MIME parts named in US-ASCII. Russian-named MIME parts are mostly file attachments, which are usually found only in human-sent messages.
Thanks to John Hardin for his advice about using the MIMEHeader plugin.
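Subrules whose names start with a double underscore carry no score of their own, so the charset tests above only take effect once a scoring rule refers to them. A minimal sketch of such a rule follows; the name HDR_CYR_CHARSET and its description are hypothetical and are not necessarily what the author's ruleset file uses.

```
# Hypothetical rule name; fires if any of the charset subrules matched
meta     HDR_CYR_CHARSET  (__HDR_FROM_CYR || __HDR_TO_CYR || __HDR_SUBJECT_CYR || __MIME_CONTENT_CYR)
describe HDR_CYR_CHARSET  KOI8-R or CP1251 charset declared
# Deliberately tiny score, for observation only
score    HDR_CYR_CHARSET  0.01
```

Keeping the score at 0.01 lets you watch how often the rule hits in your own mail before deciding on a real value.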
Header analysis in UTF-8 encoded messages requires examination of their contents. Russian letters occupy the byte sequences \xD0\x81, \xD0\x90-\xD0\xBF, \xD1\x80-\xD1\x8F and \xD1\x91. We can try using these as range expressions inside square brackets (the Perl RE tutorial states that the backslash retains its special meaning in a character class definition). If you are in doubt (like me), try running the following script:
foreach my $test_string ( "\xD0\x90","\xD1\x91","Z","1" ) {
if ( $test_string =~ /(\xD0[\x81\x90-\xBF])|(\xD1[\x80-\x8F\x91])/ ) { print "Yes\n"; }
else { print "No\n"; } }
The output must be "Yes", "Yes", "No", "No", one per line. (I am not an experienced Perl hacker, but I have run into a lot of pitfalls with incompatibilities in various other software, so forgive me this paranoia.)
If the script finishes as desired, add the following lines to your ruleset:
header __HDR_FROM_CYR_UTF8_CHARSET_DEFINITION From:raw =~ /=\?utf-8\?/i
header __HDR_TO_CYR_UTF8_CHARSET_DEFINITION To:raw =~ /=\?utf-8\?/i
header __HDR_FROM_CYR_UTF8_CONTENT From =~ /(((\xD0[\x81\x90-\xBF])|(\xD1[\x80-\x8F\x91]))([A-Za-z[:digit:][:blank:][:punct:]])?){3}/
header __HDR_TO_CYR_UTF8_CONTENT To =~ /(((\xD0[\x81\x90-\xBF])|(\xD1[\x80-\x8F\x91]))([A-Za-z[:digit:][:blank:][:punct:]])?){3}/
meta __HDR_FROM_CYR_UTF8 __HDR_FROM_CYR_UTF8_CHARSET_DEFINITION && __HDR_FROM_CYR_UTF8_CONTENT
meta __HDR_TO_CYR_UTF8 __HDR_TO_CYR_UTF8_CHARSET_DEFINITION && __HDR_TO_CYR_UTF8_CONTENT
meta HDR_ADDR_CYR_UTF8 __HDR_FROM_CYR_UTF8 || __HDR_TO_CYR_UTF8
score HDR_ADDR_CYR_UTF8 0.01
These rules check that either the "From:" or the "To:" header field is declared as UTF-8 and contains at least three Cyrillic letters, possibly interspersed with Latin letters, digits, blanks or punctuation (remember that spammers often replace Cyrillic letters with similar-looking Latin letters and digits, and also use "gappy" typing).
Other valuable data gained from message headers are the domain part of the sender/recipient addresses and the names of SMTP relays found in the "Received:" chain:
header __HRD_SENDER_RU From:addr =~ /@((([a-zA-Z0-9])|\.|\-)+)\.ru(\.?)$/i
header __HRD_RECIPIENT_RU To:addr =~ /@((([a-zA-Z0-9])|\.|\-)+)\.ru(\.?)$/i
header __HDR_ENVFROM_RU EnvelopeFrom:addr =~ /@((([a-zA-Z0-9])|\.|\-)+)\.ru(\.?)$/i
header __HDR_RCVD_RU Received:raw =~ /from([[:blank:]]+((([a-zA-Z0-9])|\.|\-)+)\.ru(\.?)[[:blank:]])/i
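As with any double-underscore subrules, these four tests need a scoring rule before they affect the result. One possible way to combine them is sketched below; the name HDR_ADDR_OR_RELAY_RU is hypothetical, and the subrule names are spelled exactly as they appear above.

```
# Hypothetical rule name; fires if any address or relay is in .ru
meta     HDR_ADDR_OR_RELAY_RU  (__HRD_SENDER_RU || __HRD_RECIPIENT_RU || __HDR_ENVFROM_RU || __HDR_RCVD_RU)
describe HDR_ADDR_OR_RELAY_RU  Address or relay in .ru domain
score    HDR_ADDR_OR_RELAY_RU  0.01
```

If a single signal proves too noisy on its own, the "||" operators can be replaced with a stricter combination, for example requiring both a .ru sender and a .ru relay.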
The rules mentioned above all use the functionality of a basic SA installation. You can go further by using the RelayCountry plugin, which offers more advanced treatment of "Received:" header fields. Stefan Luetje has published (see SA maillist message) a custom ruleset (FTP link) that utilizes this functionality (see the *BADRELAY rules section).
So much for the message header. Now let's take a close look at the message body. Ned Slider suggested (see SA maillist message) a body rule that fires on hyperlinks in the ".ru" domain. I have tested the following form of this rule:
uri URI_IN_RU /^(http(s?)\:\/\/)((([a-zA-Z0-9])|\.|\-)+)\.ru(\.?)($|\/)/i
(Thanks to Jens Schleusener for noting the bug in this rule and suggesting the fix)
If the message is UTF-8 encoded, we can also search for Russian letters in it. The following rules check that either the subject or the body of the message (or any textual part of a MIME multipart message) is declared as UTF-8 and that there are at least three Cyrillic letters (possibly interspersed with Latin letters, digits, blanks or punctuation) in the subject and all textual parts of the message concatenated together:
header __HDR_SUBJ_CYR_UTF8_CHARSET_DEFINITION Subject:raw =~ /=\?utf-8\?/i
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
ifplugin Mail::SpamAssassin::Plugin::MIMEHeader
# The charset value is often quoted (charset="utf-8"), so the quote is optional here
mimeheader __HDR_CONTENTTYPE_CYR_UTF8_CHARSET_DEFINITION Content-Type:raw =~ /text\/(.*)charset=\"?utf-8/i
else
header __HDR_CONTENTTYPE_CYR_UTF8_CHARSET_DEFINITION Content-Type:raw =~ /text\/(.*)charset=\"?utf-8/i
endif
meta __CONTENT_CYR_UTF8_CHARSET_DEFINITION __HDR_SUBJ_CYR_UTF8_CHARSET_DEFINITION || __HDR_CONTENTTYPE_CYR_UTF8_CHARSET_DEFINITION
body __CONTENT_CYR_UTF8 /(((\xD0[\x81\x90-\xBF])|(\xD1[\x80-\x8F\x91]))([A-Za-z[:digit:][:blank:][:punct:]])?){3}/
meta CONTENT_CYR_UTF8 __CONTENT_CYR_UTF8_CHARSET_DEFINITION && __CONTENT_CYR_UTF8
score CONTENT_CYR_UTF8 0.01
All the rules described above are gathered in the 99_no_russian_mail.cf ruleset file. You can download it and add it to your ruleset. Be aware that the default scores in this file are set to 0.01, so you may need to increase them to suit your setup (no bulk test has been performed to find optimal values).
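Scores can be raised from your own local.cf without editing the ruleset file itself. The values below are purely illustrative assumptions, not the result of any testing; pick yours after watching how the rules behave on your own mail.

```
# local.cf: illustrative values only, not tuned against any corpus
score HDR_ADDR_CYR_UTF8  1.5
score CONTENT_CYR_UTF8   1.5
score URI_IN_RU          1.0
```

Running "spamassassin --lint" after any change is a quick way to catch syntax errors before they reach production.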
Although the above list of signs aims to be as exhaustive as possible, I am currently seeing an increasing number of spam messages in Russian that are:
For such messages your only hope may be language detection by the TextCat plugin (unless you use my rules to detect Russian spam by message text matching).
There are some more rules to detect Russian spam, e.g. rules that fire on the often-abused "The Bat!" MUA. I avoid them since they are highly prone to false positives. These rules are present in 99_no_russian_mail.cf but are commented out.