What you get when you buy a spam cd

Image: eMail by Esparta Palma | License: CC BY

In the first half of 2003 two particular spamruns attracted my attention; this incidentally coincided with the start of the non-profit anti-spam foundation spamvrij.nl by several people, including myself. These runs were spamvertizing two sets of CD’s with millions of Dutch and Belgian addresses. I happened to be able to get one of these CD’s. So, here’s my analysis, with numbers and notes. I can’t give a clear answer on whether these CD’s can be considered a breach of the Dutch Data Protection Act.

The spamruns spamvertizing the CD’s

On March 16th and May 5th 2003 spamruns were sent by the notorious Dutch spammer Patrick de Bruin.

Some details for the first spamrun:

  • It was sent via Net In Net / Supermail and was spamvertizing the website emailcd.nu. At first this website was referring to odinsrage.com, a link that was cancelled shortly thereafter. After that the site was referring to a page at patrickdebruin.nl. It was also spamvertizing the e-mail address info@emailcd.nu.
  • It was spamvertizing a set of two CD’s:
    1. The first CD contains what the spammer calls “generated addresses”
    2. The second CD contains “addresses found using search engines”.
  • According to the spam, about 80% of the addresses were addresses of individuals and 20% of companies. The CD’s also contain a couple of applications that would make it easier for the buyer to categorise the addresses and to send bulk e-mail.
  • This set of two CD’s was priced at 49 euro.

And for the second spamrun:

  • It was sent via ISD Holland and was spamvertizing the e-mail address emailcd@beer.com only. No website.
  • It spamvertized three CD’s. The spam claims:
    1. The first CD contains about 500,000 business addresses and millions of addresses of individuals. These addresses are spread randomly over textfiles of about 25,000 addresses each. The CD also contains a couple of applications to send out bulk e-mail quickly and easily.
    2. The second CD contains 573 files with 10,000 unsorted addresses each. All e-mail addresses have been acquired using “Dutch search queries”. Doubles have been removed, which is said to leave five million addresses.
    3. The third CD contains several applications to comfortably manage your e-mail lists and to easily send e-mail to 10,000 addresses from within Outlook (Express). All applications are supplied with registration codes that “have been bought legally”. This CD also contains about 170,000 fresh, mostly Dutch, addresses, harvested in April 2003.
    4. The spam also offers you customized lists of addresses, focused on particular segments (e.g. car shops, clothing shops, etc).
  • This set of three CD’s was priced at 69 euro.

According to Patrick de Bruin himself:

  • Both runs were sent to about 1,000 to 2,000 addresses. The second run was sent to a different set of addresses than the first one, but there could easily have been doubles.
  • For each of the runs about two or three sets of CD’s were sold.

I suspect the figure of 1,000 to 2,000 addresses is an underestimate. Apart from that, in less than eight months Patrick de Bruin evolved from a small spammer into one of the few Dutch professional spammers by the end of 2003. He is now using bulletproof hosting in India, open proxies in Asia and forged addresses, and does spamruns on commission from time to time. The spamruns in the last couple of months of 2003 were massive, especially by Dutch standards.

In one of his more recent runs he sells CD’s claiming to have 275,000 business addresses and 4,000,000 addresses of individuals, for 399 euro for both or 299 euro for each of the CD’s.

Analysis

I have been sent one of these CD’s by Patrick de Bruin for what I dubbed “research purposes”. He sent me this one himself (and I didn’t pay for it).

One would expect to receive a CD with lists of addresses that are cleansed of role-accounts, doubles, spamtraps, spamblocks and so on, in order to make a good impression. None of this cleaning was performed, which shows how polluted a collection of e-mail addresses one receives when taking up a spammer on such an offer. The address lists on these CD’s are of extremely low quality.

The CD contains (listing) three directories with a number of textfiles containing addresses and three applications for sending bulk e-mail. The names of the directories tell what type of addresses to expect.

The 1526490BEL directory contains 32 files with a total of 1,526,418 addresses, spread evenly among them. The 9103445NED directory contains 138 files with a total of 9,103,439 addresses, also spread evenly among them. The B2BGRATI directory contains 8 files with varying numbers of addresses, 366,772 in total. That makes a grand total of 10,996,629 addresses, which does not imply 10,996,629 unique addresses.

Yes, remember: Rule #1 applies. When all doubles (and triples, and …) are removed from the lists, only 6,220,454 unique addresses remain, which is 57% of the number of addresses the spammer claims. Two addresses even appear 14 times on the CD.

[It seems some people need some clarification on the counting of the addresses.]

Another notable thing is that more addresses appear on the CD twice than appear only once. Over 60% of the unique addresses appear twice, while fewer than 29% appear only once.
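
As a rough illustration of how such numbers can be obtained, here is a minimal Python sketch of the duplicate counting. It is not the script that was used for this analysis; the function name and the sample addresses are made up, and it assumes that everything is lowercased before counting.

```python
from collections import Counter


def multiplicity_histogram(addresses):
    """Return (total, unique, histogram) for an iterable of addresses.

    histogram maps "appears k times" to the number of distinct addresses.
    """
    # Lowercase before counting so that case variants count as doubles.
    per_address = Counter(a.strip().lower() for a in addresses if a.strip())
    total = sum(per_address.values())
    unique = len(per_address)
    histogram = Counter(per_address.values())
    return total, unique, histogram


# Made-up sample data, not taken from the CD.
sample = [
    "alice@example.nl",
    "Alice@Example.NL",
    "bob@example.be",
    "bob@example.be",
    "bob@example.be",
    "carol@example.com",
]

total, unique, histogram = multiplicity_histogram(sample)
print(f"total {total}, unique {unique} ({unique / total:.2%})")
for times in sorted(histogram, reverse=True):
    share = histogram[times] / unique
    print(f"{times} times: {histogram[times]} ({share:.2%})")
```

The shares are computed relative to the number of unique addresses, which is also how the "doubles, triples and more" figures in the Numbers section add up.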

And while we are at it, Rule #3 applies as well. Spammers love to dig their own graves. You can find a number of addresses you don’t want to send spam to. The spammer didn’t even remove the abuse@ and postmaster@ addresses, 175 and 561 respectively. Both of them have doubles themselves. These role accounts include respectable providers that have a widely known anti-spam policy: the abuse desk of XS4ALL appears 5 times, the abuse desk of Planet three times and their postmasters will receive the spam three times each.

Role accounts not only encompass abuse desks or Network Operation Centers, but also operational accounts like ‘hostmaster’ and ‘postmaster’, who have to deal with requests from customers and feedback from key institutions like ARIN and RIPE or domain registrars. Spamming those accounts has several drawbacks for a spammer, not least that the spam is most definitely not wanted by the recipient. It isn’t too farfetched to state that online businesses (that mostly have to rely on e-mail for direct customer contact) might be facing increasing difficulties coping with the loss incurred by spam, both technically and financially.

When looking at the addresses themselves, other stuff comes up. Why would you send e-mail to addresses that are invalid, or that have a chance of not working as a result of faulty mailservers? Why would you add addresses to your database that do not conform to the regular data model? Especially if some of these problems could be fixed with some basic search and replace. The CD includes 1,739 addresses that are invalid and 360 with an unusual syntax.

The addresses with the unconventional syntax are most likely a result of bad harvesting. Probably, these addresses have been used on websites at the end of a line. The harvester picked up all the words with an @ in them from space to space.

The non-existent toplevel domains are sometimes also the result of bad harvesting (anything with an “@” must be an e-mail address) and of self-imposed limits on the maximum number of characters an e-mail address can have. I can’t explain the fact that there are quite a lot of addresses with their toplevel domain stripped.
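
To illustrate how this kind of pollution can arise, the following Python sketch mimics a harvester that takes every space-delimited word containing an @ and optionally cuts it off at a fixed length. It is only a guess at the behaviour, not the spammer's actual tool; the function name, the sample text and the limit of 40 characters are assumptions (several of the truncated examples listed below happen to be exactly 40 characters long).

```python
def naive_harvest(text, max_length=None):
    """Collect every whitespace-delimited token that contains an '@'.

    max_length is a hypothetical field-width limit: longer tokens are
    simply cut off, which can strip part of the domain.
    """
    found = []
    for token in text.split():
        if "@" in token:
            if max_length is not None:
                token = token[:max_length]
            found.append(token.lower())
    return found


# Made-up page text; the addresses are placeholders.
page = (
    "Mail the editor at webmaster@example.nl. "
    "Download report 02@01n.pdf or write to "
    "secretariaat@basisschool.gemeenschapsonderwijs.be today."
)

for address in naive_harvest(page, max_length=40):
    print(address)
```

The sentence-ending dot stays attached to the first address, the file name is mistaken for an address, and the long address loses the end of its domain.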

A couple of the addresses with non-existent toplevel domains:

irene.veerman@id.bib.wau
02@01n.pdf
mwd.avisser@chello.nl--
e.schultz@bwd.rws.minvenw
ikautostelen@van.jouw
nightmare@elmstreet.666
heb@ik.niet
i.m.tieken@let.leidenuniv
you@your.address
ap@159.148.109.88
interkabel@wu.html
--abigail@ny.fnx.com--
--devet@iaehv.nl--
w32.parol@mm.html
w32.assarm@mm.1.gif
w32.chir.b@mm.html
w32.yaha.e@mm.html
list-handler@k9.dds
bs.diest.centrum@gemeenschapsonderwijs.b
bs.sintagathaberchem.gkoornstraat@rago.b
buitenschoolse.kinderopvang@stad.antwerp
d.saelens@autostar-ieper.mercedes-benz.b
danny.de_beuckelaer.debeuck@dealer.renau
directiesecretares@ka2-sporthumaniora.te
directiesecretares@ka2sporthumaniora.tel
notarissen@cornelissens.jongenelen.knb.n
notarissen@dijkstra.jansen.bergman.knb.n
notarissen@vankeulen.destigter.capelle.k
marloc@tref.nl0314391626
metamorf@xs4all.nl0355380974
f74dew.3p1@a3.xs4all
f35s3k.e44@a3.xs4all
reageer.in@de.nieuwsgroep
958426909.4192@x86.local
7af978d9@cc182023a.gif
rob@wavedata.demon.nl................

And some with an unconventional syntax:

webmaster@groenlinks.nl.
webmaster@fom.nl.
info@laboratorium.nl.
transtec@transtec.nl.
info@mediation.nl.
c.vankoeverden@amnesty.nl.
info@denhelder.nl.
solliciteren@brandweer.amsterdam.nl.
majordomo@hamnet.demon.nl.
account@xs4all.nl.
fax@uwdomein.nl.

And of course there are addresses with spamblocks on the CD. Spammers must be too clueless to do a search and replace or to remove them completely:

REMOVETHISargtango@euronet.nl
REMOVETHISh.kerker@cable.a2000.nl
famtie@remove-this-xs4all.nl
geenspam@vet.uu.nl
rr@k9.dds.nl.ReMoVeThIs
rcameszREMOVETHIS@dds.nl
onkiNOSPAM@linux.gelrevision.nl
tjoen@dds.NOSPAM.nl
tomvh@RemoveThisSpamBlock.edc.xs4all.nl
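
The search and replace mentioned above could look something like the Python sketch below. The marker pattern and the function name are hypothetical and only modelled on the samples listed here; the addresses in the code are made up, and the cleaned result is still only a guess at the real address.

```python
import re

# Hypothetical marker pattern, loosely based on the samples above.
MARKER = re.compile(
    r"remove-?this(?:spamblock)?|(?:no|geen)[-._]?spam",
    re.IGNORECASE,
)


def strip_spamblock(address):
    """Return the address without its spamblock, or None if nothing usable remains."""
    cleaned = MARKER.sub("", address)
    if cleaned == address:
        return address  # no marker found, leave the address untouched
    local, _, domain = cleaned.partition("@")
    # Tidy up the separators that the removed marker leaves behind.
    local = local.strip("-._")
    domain = re.sub(r"\.{2,}", ".", domain).strip("-._")
    if not local or "." not in domain:
        return None
    return f"{local}@{domain}"


# Made-up addresses in the style of the samples.
for sample in [
    "REMOVETHISjan@example.nl",
    "piet@remove-this-example.nl",
    "kees@example.NOSPAM.nl",
    "anna@example.nl.ReMoVeThIs",
    "geenspam@example.nl",
]:
    print(sample, "->", strip_spamblock(sample))
```

An address such as geenspam@… consists of nothing but the marker, so the sketch drops it rather than inventing a username.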

I thought dictionary attacks were no longer being used (I haven’t seen one recently), but I guess that’s just a misperception. Among the addresses on the list there are many with usernames like these:

[…], institute, instituted, instituter, instituters, institutes, instituting, institution, institutional, institutionalize, institutionalized, institutionalizes, institutionalizing, institutionally, institutions, instruct, instructed, instructing, instruction, instructional, instructions, instructive, instructively, instructor, instructors, instructs, instrumental, instrumentalist, instrumentalists, instrumentally, instrumentals, instrumentation, instrumented, instrumenting, instruments, insubordinate, insubstantial, insufferable, insufficient, insufficiently, insular, insulate, insulated, insulates, insulating, insulation, insulator, insulators, insulin, insult, insulted, insulting, insults, […]

Other addresses are interesting as well. One address in one of my (sub)domains which appears on the CD is “F74DEw.3p1@sisterray.xs4all.nl”. It’s definitely (part of) a Usenet message-ID, not an e-mail address. But there are more addresses that I don’t expect to be used to subscribe to mailing lists, nor to be interested in unsolicited bulk e-mail at that particular address. They include the addresses of a couple of embassies…

[…], info AT danishembassy.nl, info AT ghanaembassy.nl, info AT hungarianembassy.nl, consulate AT iranianembassy.nl, postbus AT irish.embassy.demon.nl, info AT jordanembassy.nl, info AT nigerianembassy.nl, […]

a couple of the Dutch airports…

[…], info AT denhelderairport.nl, info AT eindhovenairport.nl, info AT lelystad-airport.nl, info AT rotterdam-airport.nl, info AT teuge-airport.nl, […]

other famous Dutch spammers…

[…], anton AT abfab.nl, artur AT abfab.nl, cobben AT abfab.nl, e-mailjuliette AT abfab.nl, hostmaster AT abfab.nl, info AT abfab.nl, jan AT abfab.nl, jean AT abfab.nl, jeanet AT abfab.nl, jessica AT abfab.nl, juliette AT abfab.nl, laurens AT abfab.nl, mark AT abfab.nl, reacties AT abfab.nl, spam AT abfab.nl, thijs AT abfab.nl, webmaster AT abfab.nl, yerk AT abfab.nl, […]

and 471 addresses of Dutch politicians…

[…], bpeper AT tweedekamer.nl, wim.kok AT tweedekamer.nl, wimkok AT tweedekamer.nl, […], 070-3183649d.leuvelink AT tk.parlement.nl, 070-3183653r.baatenburgdejong AT tk.parlement.nl, 070-3183654m.veen AT tk.parlement.nl, 070-3183659m.flierman AT tk.parlement.nl, 20b.dittrich AT tk.parlement.nl, 20tijsma AT tk.parlement.nl, abgroenlinksa.harrewijn AT tk.parlement.nl, a.cdaa.mosterd AT tk.parlement.nl, a.dswart AT tk.parlement.nl, a.duivesteijn AT tk.parlement.nl, […]

These addresses are interesting. Apart from the fact that Wim Kok is no longer active in Dutch politics, he’s widely known as being computer illiterate. His mailbox must be full by now. The same goes for Ab Harrewijn, who passed away a year before these CD’s were sold. Another interesting thing is that some of the addresses are clearly harvesting mistakes: the numbers in front of some of the usernames are the telephone numbers of these politicians. This makes it more than clear that spammers work neither accurately nor decently. I hope Dutch politicians will draw the same conclusions that I drew.

When all addresses with an invalid domain or syntax, all administrative role accounts and all the duplicates are removed, only 56.55% of all addresses remain. That does not imply one will end up with 6,218,344 working addresses. E-mail to a significant number of these accounts will probably bounce for a variety of reasons, most likely because the mailbox no longer exists.

This leaves the question whether this CD is a breach of the Dutch Data Protection Act. I simply don’t know, although I do have a suspicion.

Numbers

Grand total

For the grand total, all three directories put together:

 

number of addresses
total addresses 10,996,629
unique addresses 6,220,454 56.56%

 

doubles, triples and more (addresses appearing more than just once)
14 times 2 0.00%
13 times 2 0.00%
12 times 2 0.00%
11 times 9 0.00%
10 times 4 0.00%
9 times 9 0.00%
8 times 47 0.00%
7 times 97 0.00%
6 times 697 0.01%
5 times 1,830 0.03%
4 times 27,191 0.44%
3 times 287,685 4.62%
2 times 4,107,246 66.03%
1 time 1,795,633 28.87%

 

role-accounts (minimum 100 addresses)
info 126,017 1.14%
webmaster 3,950 0.03%
sales 3,799 0.03%
support 641 0.00%
postmaster 561 0.00%
abuse 175 0.00%
ftp 147 0.00%
noc 127 0.00%
other role-accounts 114 0.00%
non-role-accounts 10,861,098 98.77%

 

addresses with obvious spamblocks
with obvious spamblocks 64 0.00%
without obvious spamblocks 10,996,565 100.00%

 

top level domain names (minimum 1,000 addresses)
nl 7,060,715 64.21%
com 2,110,212 19.19%
be 1,430,859 13.01%
nu 200,588 1.82%
net 182,359 1.66%
org 1,682 0.02%
de 1,505 0.01%
other valid TLD’s (less than 1,000 addresses) 6,610 0.06%
non-existent TLD’s 1,739 0.02%
unconventional syntax (address ending in dot) 360 0.00%

 

Belgian addresses

For the 1526490BEL directory (“the Belgian addresses”):

 

number of addresses
total addresses 1,526,418
unique addresses 858,195 -43.77%

 

doubles, triples and more (addresses appearing more than just once)
4 times 1,152 0.13%
3 times 41,355 4.81%
2 times 582,057 67.82%
1 time 233,631 27.22%

 

role-accounts (minimum 100 addresses)
other role-accounts 153 0.01%
non-role-accounts 1,526,265 99.99%

 

addresses with obvious spamblocks
with obvious spamblocks 0 0.00%
without obvious spamblocks 1,526,418 100.00%

 

top level domain names (minimum 1,000 addresses)
be 1,356,816 88.89%
net 169,602 11.11%

 

Dutch addresses

For the 9103445NED directory (“the Dutch addresses”):

 

number of addresses
total addresses 9,103,439
unique addresses 5,190,269 -42.98%

 

doubles, triples and more (addresses appearing more than just once)
14 times 2 0.00%
13 times 1 0.00%
12 times 2 0.00%
11 times 8 0.00%
10 times 2 0.00%
9 times 7 0.00%
8 times 28 0.00%
7 times 33 0.00%
6 times 406 0.01%
5 times 951 0.02%
4 times 18,849 0.36%
3 times 243,681 4.69%
2 times 3,362,819 64.79%
1 time 1,563,480 30.12%

 

role-accounts (minimum 100 addresses)
info 21,510 0.23%
webmaster 3,941 0.04%
sales 997 0.01%
postmaster 543 0.00%
support 529 0.00%
abuse 157 0.00%
ftp 129 0.00%
noc 109 0.00%
other role-accounts 95 0.00%
non-role-accounts 9,075,429 99.69%

 

addresses with obvious spamblocks
with obvious spamblocks 64 0.00%
without obvious spamblocks 9,103,375 100.00%

 

top level domain names (minimum 1,000 addresses)
nl 6,813,466 74.84%
com 2,072,753 22.76%
nu 200,177 2.19%
be 4,306 0.04%
net 2,854 0.03%
org 1,681 0.01%
de 1,271 0.01%
other valid TLD’s (less than 1,000 addresses) 6,716 0.07%
non-existent TLD’s 4,833 0.05%
unconventional syntax (address ending in dot) 360 0.00%

 

Business addresses

For the B2BGRATI directory (“the business addresses”):

 

number of addresses
total addresses 366,772
unique addresses 182,059 -50.36%

 

doubles, triples and more (addresses appearing more than just once)
6 times 11 0.00%
5 times 50 0.03%
4 times 1,525 0.84%
3 times 798 0.44%
2 times 178,287 97.93%
1 time 1,388 0.76%

 

role-accounts (minimum 100 addresses)
info 104,480 28.49%
sales 2,766 0.75%
other role-accounts 95 0.03%
non-role-accounts 259,431 70.73%

 

addresses with obvious spamblocks
with obvious spamblocks 0 0.00%
without obvious spamblocks 366,772 100.00%

 

top level domain names (minimum 1,000 addresses)
nl 247,249 67.41%
be 69,737 19.01%
com 37,459 10.21%
net 9,903 2.70%
other valid TLD’s (less than 1,000 addresses) 1,596 0.44%
non-existent TLD’s 828 0.23%

 

Notes

Regular expressions that have been used:

  • role-accounts: “^(info|sales|((host|post|web|news)master)|abuse|noc|support|ftp|usenet|uucp)@”
  • spamblocks: “((remove-?(this|me))|((no|geen)[-._]?spam|spamb))”

All addresses have been converted to lowercase before counting, otherwise toplevel domains such as .be and .BE would have been counted as two separate domains. Strictly speaking the LHS (the username) is case sensitive and the RHS case insensitive. I have neglected the case sensitivity of the LHS as I assume that in practice there’s no difference. If some of you are interested, I will redo the counting.
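
For those who want to repeat the counting, the following Python sketch shows one way to apply the two expressions after lowercasing. It is a reconstruction under assumptions, not the original script: the function name and the sample addresses are made up, and whether the expressions were originally applied to whole lines in this way is a guess.

```python
import re
from collections import Counter

# The two expressions quoted in the notes, copied as-is.
ROLE_ACCOUNT = re.compile(
    r"^(info|sales|((host|post|web|news)master)|abuse|noc|support|ftp|usenet|uucp)@"
)
SPAMBLOCK = re.compile(r"((remove-?(this|me))|((no|geen)[-._]?spam|spamb))")


def classify(addresses):
    """Count role accounts and obvious spamblocks in an iterable of addresses."""
    counts = Counter()
    for raw in addresses:
        address = raw.strip().lower()  # lowercase before matching
        if not address:
            continue
        counts["total"] += 1
        role = ROLE_ACCOUNT.match(address)
        if role:
            counts["role:" + role.group(1)] += 1
        if SPAMBLOCK.search(address):
            counts["spamblock"] += 1
    return counts


# Made-up sample data.
sample = [
    "Info@Example.NL",
    "postmaster@example.be",
    "jan@example.nl",
    "pietNOSPAM@example.nl",
    "information@example.nl",
]

counts = classify(sample)
for key, value in sorted(counts.items()):
    print(key, value)
```

Because the role-account expression is anchored at the start and ends in @, a username like "information" is not counted as "info".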

The addresses ending in one dot are technically valid addresses. If handled correctly by the software that is used, they should cause no problems. However, when sending bulk e-mail your goal would be to reach as many recipients as possible, so one would prefer to play it safe.
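
A small Python sketch of how addresses could be bucketed by toplevel domain, with the trailing-dot case kept apart, is given below. The bucket names, the function names and the sample addresses are made up for illustration, and checking whether a toplevel domain really exists is left out.

```python
from collections import Counter


def tld_bucket(address):
    """Classify an address by the last label of its domain."""
    address = address.strip().lower()
    if "@" not in address:
        return "invalid"
    domain = address.rpartition("@")[2]
    if not domain or domain.endswith(".."):
        return "invalid"
    if domain.endswith("."):
        return "trailing-dot"  # technically valid, but kept separate
    return domain.rsplit(".", 1)[-1]


def drop_trailing_dot(address):
    """Remove a single trailing dot so that picky software accepts the address."""
    if address.endswith(".") and not address.endswith(".."):
        return address[:-1]
    return address


# Made-up sample data.
sample = [
    "jan@example.nl",
    "piet@example.be",
    "info@example.nl.",
    "kees@example.COM",
    "rob@example.nl.....",
]

buckets = Counter(tld_bucket(a) for a in sample)
for name, count in buckets.most_common():
    print(name, count)
print(drop_trailing_dot("info@example.nl."))
```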

Thanks to #linux.nl, #spamvrij.nl and especially JPV and Niels Vestergaard Jensen for their contributions.