Jason R. Bowling, Priscilla Hope, Kathy J. Liszka
The University of Akron
Akron, Ohio 44325-4003
{bowling, ph11, liszka}@uakron.edu
Abstract
We propose a method for identifying image spam by training an artificial neural network. A detailed process for preprocessing spam image files is given, followed by a description of how to train an artificial neural network to distinguish between ham and spam. Finally, we exercise the trained network by testing it against unknown images.
1. Introduction
Select – delete – repeat. It’s what we spend the first ten minutes of every day doing: purging spam from our inboxes. In the first month after the National Do Not Call Registry went into effect, we noticed about a 30% increase in spam. No, that’s not backed by scientific process, just personal observation. And then, it got worse.
Clearly spam isn’t going away, at least not in the foreseeable future. People still respond to it, buy products from it, and are scammed by it.
Filters are available to combat these unsolicited nuisances, but spammers continually develop new techniques to avoid detection. See [1] for a current and comprehensive list of spam techniques. This paper focuses on one specific category of unsolicited bulk email – image spam. This is a fairly recent phenomenon: in 2005 it comprised roughly 1% of all email, then grew to an estimated 21% by mid-2006 [2]. These messages arrive as image attachments that contain text, accompanied by what looks like a legitimate subject line and from address. They successfully get past traditional spam filters and optical character recognition (OCR) systems, and as a result are often referred to as OCR-evading spam images. A common example is shown in Figure 1. They take many forms: different file types, multipart images in which the picture is split across multiple image files, and even images that are angled or twisted.
References: vol. 2, pp. 914-918, 2005. Conference on Email and Anti-Spam (CEAS), 2007. Li, “Filtering Image Spam with Near-Duplicate Detection,” Fourth Conference on Email and Anti-Spam (CEAS), 2007. 7, 2699–2720, 2006.