SANS DFIR Summit in Austin, TX

Sarah Edwards will be presenting two topics at the summit on June 26 & 27.

  • “When Macs Get Hacked”
  • “Analysis & Correlation of Macintosh Logs”

Rumor is the presentations will be streamed live if you can’t make it to the summit!

Come see us at CEIC in Vegas!

Next week, May 21-24, Paul Nichols and Brian Hussey will be presenting at the CEIC Forensic Conference in Summerlin, Nevada, on the topic of dynamic malware analysis of a current banking Trojan. Join them for their session, where they will dissect a common credential-stealing Trojan belonging to the Ursnif family. In the lab, they will walk participants through monitoring file system, registry, and network activity to determine the functionality of this kernel-level rootkit and how it hides itself from the user and the operating system, using only freely available tools.

Reading Mac BSM Audit Logs

By: Sarah Edwards

The audit trail logs provide security-related information, in particular user login/logoff data. By default, these logs record logins and logoffs via the login screen and SSH, user credential authentication performed for software programs, and failed logins. They also record when a user account is created on or removed from a system.

McAfee created the OpenBSM implementation that OS X uses; these logs support compliance with the Common Criteria standards. The audit log format is based on the Basic Security Module (BSM) developed by Sun Microsystems.

The logs are located in /private/var/audit and are accessible only on a live system (if you enable the root user) or when extracted from a forensic image. Each log is named StartTime.EndTime (in UTC) using the format YYYYMMDDHHMMSS.YYYYMMDDHHMMSS (see Figure 1). Each of these files is known as a “trail file.”

Figure 1 – Audit Log Files in /private/var/audit/
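
A quick way to triage a directory of trail files is to decode the start/end timestamps straight from the filenames. Below is a minimal Python sketch of that idea; the directory path is the live-system default and would be swapped for wherever the files were exported from an image.

from datetime import datetime, timezone
from pathlib import Path

AUDIT_DIR = Path("/private/var/audit")    # or a directory exported from a forensic image

def parse_stamp(stamp):
    """Decode a YYYYMMDDHHMMSS stamp; return None for markers like 'not_terminated'."""
    try:
        return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    except ValueError:
        return None

for trail in sorted(AUDIT_DIR.iterdir()):
    if trail.name == "current":           # symlink to the active trail file
        continue
    start, _, end = trail.name.partition(".")
    print(trail.name, parse_stamp(start), parse_stamp(end))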

Other files might have the following labels in their filename:
.crash_recovery – Log file that was not terminated due to a crash and was later recovered. The following audit file will have an “Audit recovery” record as its first record.
current – Symlink to the currently active trail file.
.not_terminated – The active audit trail file, or one left behind when auditd was not shut down gracefully.

While audit log expiration can be set in the audit_control file (see Audit Configuration Files below) with the expire-after setting, this is not configured by default on OS X. It is unclear exactly how and when these log files are removed, but the system appears to keep roughly the past six months’ worth of log files.

Carving for Audit Logs

Carving free space for these files can be accomplished by keyword searching. Vital data appears immediately before and after these keywords (see Manual File Parsing below), as they belong to the starting and ending records of a trail file. Trail files that end in .not_terminated or .crash_recovery will not have the “Audit shutdown” record at the end. A file may also begin with “launchctl::Audit recovery” if it was recovered after a crash.
• File Start (Figure 2) – launchctl::Audit startup
• File End (Figure 3) – launchctl::Audit shutdown

Figure 2 – Audit startup Record

Figure 3 – Audit shutdown Record
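
To script the keyword search, something like the following rough Python sketch can scan a raw free-space dump (the file name is a placeholder) and print the offset of every start, shutdown, and recovery marker so the surrounding records can be examined or carved out.

MARKERS = [b"launchctl::Audit startup",
           b"launchctl::Audit shutdown",
           b"launchctl::Audit recovery"]

def find_markers(path, chunk_size=16 * 1024 * 1024):
    overlap = max(len(m) for m in MARKERS) - 1
    offset = 0                            # file offset where `buf` begins
    buf = b""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            buf += data
            for marker in MARKERS:
                pos = buf.find(marker)
                while pos != -1:
                    print(f"{offset + pos:#014x}  {marker.decode()}")
                    pos = buf.find(marker, pos + 1)
            # keep a small tail so a marker split across chunk boundaries is still found
            offset += max(len(buf) - overlap, 0)
            buf = buf[-overlap:]

find_markers("freespace.dd")              # placeholder dump name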

Reviewing Audit Logs Using praudit

The log files are in a binary format that is not easily human-readable. This is where the command-line tool praudit comes in handy; praudit can output these files in a variety of formats.

The default format is shown below; this example contains a single record from a log.

header,139,11,user authentication,0,Sat Apr 21 22:02:14 2012, + 940 msec
subject,oompa,oompa,staff,root,staff,69,100005,69,0.0.0.0
text,Verify password for record type Users 'oompa' node '/Local/Default'
return,success,0
trailer,139

Audit Log File Format

Each record contains tokens. In the example above, there are five tokens.

  • Header
  • Subject
  • Text
  • Return
  • Trailer

Each log record may contain a variety of tokens; detailed information about the tokens can be found in the man page for audit.log. In general, each record starts with a ‘header’ token and ends with a ‘trailer’ token.

The ‘header’ token contains data such as number of bytes in the record (139), event type (user authentication), and timestamp.

The ‘subject’ and ‘subject_ex’ tokens are also of value, as they contain data about the user account performing the action:

  • Audit ID
  • Effective User ID
  • Effective Group ID
  • Real User ID
  • Real Group ID
  • Process ID
  • Session ID
  • Terminal Port ID
  • Terminal Machine Address

praudit Output Formats

As stated above, praudit has the ability to output in several different formats; the available options are documented in the praudit man page.

Below is the output of the -l option, which prints each record on a single line, delimited by commas.

header,139,11,user authentication,0,Sat Apr 21 22:02:14 2012, + 940 msec,subject,oompa,oompa,staff,root,staff,69,100005,69,0.0.0.0,text,Verify password for record type Users 'oompa' node '/Local/Default',return,success,0,trailer,139,

Below is the output of the -r option, which prints each record in raw (numeric) format.

20,139,11,45023,0,1335060134,940
36,501,501,20,0,20,69,100005,69,0.0.0.0
40,Verify password for record type Users 'oompa' node '/Local/Default'
39,0,0
19,139

Below is the output of the -s option, which prints each record in short format.

header,139,11,AUE_auth_user,0,Sat Apr 21 22:02:14 2012, + 940 msec
subject,oompa,oompa,staff,root,staff,69,100005,69,0.0.0.0
text,Verify password for record type Users 'oompa' node '/Local/Default'
return,success,0
trailer,139

Below is the output of the -x option, which prints each record in XML format.


<record version="11" event="user authentication" modifier="0" time="Sat Apr 21 22:02:14 2012" msec=" + 940 msec" >
<subject audit-uid="oompa" uid="oompa" gid="staff" ruid="root" rgid="staff" pid="69" sid="100005" tid="69 0.0.0.0" />
<text>Verify password for record type Users &apos;oompa&apos; node &apos;/Local/Default&apos;</text>
<return errval="success" retval="0" />
</record>

Below is the output using the -x and -n options together; this prints each record in XML format without resolving user and group names. Use these options when you are not analyzing the logs on the original system (for example, audit logs extracted from a forensic image), since names would otherwise be resolved against the analysis machine’s own account database. I find the XML format to be the easiest to read if you are not familiar with the token formats.


<record version="11" event="user authentication" modifier="0" time="Sat Apr 21 22:02:14 2012" msec=" + 940 msec" >
<subject audit-uid="501" uid="501" gid="20" ruid="0" rgid="20" pid="69" sid="100005" tid="69 0.0.0.0" />
<text>Verify password for record type Users &apos;oompa&apos; node &apos;/Local/Default&apos;</text>
<return errval="success" retval="0" />
</record>

To output all audit files in a directory to a file called audit_log_output.txt (XML format, without user/group names resolved), use this command:

praudit -xn /example/directory/path/* > audit_log_output.txt

Manual File Parsing

For those of us who like to parse these files by hand, I would highly recommend reviewing the audit.log man page, which documents the format of each token record.

Other files that may be of use are located in /usr/include/bsm/:

Token ID Types:
audit_record.h

Event Types:
audit_kevents.h – Kernel Events
audit_uevents.h – User Events
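
As a starting point for hand parsing, here is a minimal Python sketch that walks a trail file record by record using only the 32-bit ‘header’ token layout (token ID 0x14) documented in audit_record.h. Extended and 64-bit header variants would need additional handling, and the trail file name is a placeholder.

import struct
from datetime import datetime, timezone

def walk_records(path):
    data = open(path, "rb").read()
    off = 0
    while off + 18 <= len(data):
        token_id, rec_len, version, event, modifier, secs, msec = \
            struct.unpack_from(">BIBHHII", data, off)
        if token_id != 0x14:              # only the plain 32-bit header token is handled here
            break
        yield {"offset": off,
               "length": rec_len,         # the byte count covers the entire record
               "version": version,
               "event": event,            # e.g. 45023 = AUE_auth_user
               "modifier": modifier,
               "time": datetime.fromtimestamp(secs, tz=timezone.utc),
               "msec": msec}
        off += rec_len                    # jump to the next record's header

for rec in walk_records("trail_file.bin"):    # placeholder name
    print(rec)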

Audit Configuration Files

The audit configuration files are located in /etc/security/. Each file has a specific purpose; run the man command on each file in that directory (for example, audit_control, mentioned above) for the specifics.

Other Tools

Apple developed a tool called Audit Log Viewer (Figure 4) for analyzing these audit files; however, it has not been updated since 10.5. I should note that while it does install and work on 10.7, it limits what information is available. I should also warn that it overwrites a newer version of praudit and its associated man pages – I suggest installing it in a VM. (Yes, I found this out the hard way.)

Figure 4 – Audit Log Viewer

For $20, you can purchase Audit Explorer from the Mac App Store, which provides a nice GUI for analyzing these files.

References
http://www.nycbsdcon.org/2010/presentations/nycbsdcon-freebsd-audit.pdf
http://www.freebsd.org/doc/handbook/audit.html

File Type Identification and Its Application for Reversing XOR Encryption

By: John Ortiz

After reading Brian Hussey’s blog on “Decoding Data Exfiltration – Reversing XOR Encryption”, I wanted to share some basic statistical techniques for identifying the type of data that may have been exfiltrated and proceeding to decipher it. Data types that are easily statistically identifiable include:

  • plain text
  • html
  • compressed data
  • strongly encrypted data
  • weakly encrypted data (including XOR)
  • base64 encoded
  • base32 encoded
  • executable code
  • source code
  • wave files
  • 24-bit bitmap images
  • More …

The first step in identifying the file type of the suspected exfiltrated data is to open the file in a hex editor and look for the “magic” number. For instance, files zipped with WinZip have the letters “PK” as the first two bytes of the file. Microsoft Office 2007 and beyond also use this compression and have that same signature. Portable Executable files begin with the letters “MZ”, and so forth. Some magic numbers are not letters at all, such as that of gzip files, which begin with the three hexadecimal bytes “1F 8B 08”. Text files have no magic number. You can easily find a more complete list with a simple Google search.
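
As a tiny illustration of that first step, the sketch below checks a file’s first bytes against a few of the signatures just mentioned (the file name is a placeholder and the signature table is deliberately minimal).

SIGNATURES = {
    b"PK":            "ZIP container (WinZip, Office 2007+ documents, ...)",
    b"MZ":            "Portable Executable / DOS executable",
    b"\x1f\x8b\x08":  "gzip-compressed data",
}

def identify(path):
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "no known magic number (remember, plain text has none)"

print(identify("unknown.bin"))    # placeholder file name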

In some cases, the data to be exfiltrated may include custom encoding or obfuscation such that no magic number appears in the data. We can use histograms and entropy to attempt to identify the type of data which composes the file, or at the very least, verify that a file is not of a certain type.

I use a program I wrote called “Write Bitmap Histogram” or “WBH” to examine unknown file types or even to verify a particular file type.

WBH produces several useful outputs: graphical histogram, textual histogram, the byte entropy of the file, and a bitmapped image of the unknown file. All four data points are instrumental in determining a file type assuming there is sufficient data. For instance, given a single byte, is it compressed? Encrypted? There is no way to tell. We’ll leave the detailed analysis on the accuracy of the statistics vs. data size for another time, but for now, we’ll assume at least 4 KB.
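
WBH is the author’s own tool, but the two numeric outputs it reports are easy to reproduce independently. The sketch below computes the 256-bin byte histogram and the Shannon entropy (in bits per byte) for any file; the file name is a placeholder.

import math
from collections import Counter

def byte_stats(path):
    data = open(path, "rb").read()
    counts = Counter(data)
    hist = [counts.get(b, 0) for b in range(256)]
    entropy = -sum((c / len(data)) * math.log2(c / len(data)) for c in hist if c)
    return hist, entropy

hist, e = byte_stats("sample.txt")        # placeholder file name
print(f"entropy = {e:.5f} bits/byte")     # ~4.5 for English text, ~8.0 for AES output
print("most common byte:", hex(max(range(256), key=hist.__getitem__)))   # 0x20 for text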

So let’s take a look at a text file. Below, the left image is the graphical histogram of a text file (771 KB) and on the right is the bitmapped view of the same text file.

Figure 1 - Histogram and Image of a Text File (E=4.45665)

For each histogram, each vertical line represents the count (or frequency) of a particular byte value in the file. The left side is 0 and the right side is 255. For the text file, the largest count belongs to the “space” character, which has a value of 32 (0x20). Note the two equal-height lines just to the left of the “space” character – you got it, Carriage Return/Line Feed. It is typical for those two characters to match in a text file. (Note that some text formats use only a single LF character – 0x0A.) The other significant grouping is lowercase English text characters, with “e” being the most prevalent and “t” not far behind. For an exact count, we can view the textual histogram.

The bitmapped image is dark because black represents a value of zero and white is a value of 255, and everything in between is a shade of gray. Text is concentrated in the lower half of that spectrum and so the image is dark gray. The entropy reported is 4.5 which is typical of English text. One thing you can determine from the bitmap is that this entire file looks like text.

In the images below, I concatenated the WBH executable to the end of the text file. Adding one file to the end of another is a common low-level obfuscation technique. You can see immediately that the histogram has some non-text characters such as zero and 255. Using the textual histogram (or the zoom feature), you could also see that the count of characters between 128 and 255 is significant, even though it is not readily apparent from the histogram below. (Looking closely you can see small values in the upper half, particularly for 255.) The reason is that the added data is small with respect to the text file size. (97K vs. 771K)

However, from the bitmap image, you can see that anomalous data occurs at the end of the file. (Note: bitmaps are displayed from bottom to top, so the beginning of the file is at the bottom of the image.)

Figure 2 - Histogram and Image of a Text File with a PE File Appended (E=5.08655)
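
A simple, non-graphical stand-in for the bitmap view is to compute entropy over fixed-size windows, so an appended region such as the PE file above shows up as a jump in the profile. This is only a sketch (the window size and file name are arbitrary), not part of WBH.

import math
from collections import Counter

def entropy(chunk):
    counts = Counter(chunk)
    return -sum((c / len(chunk)) * math.log2(c / len(chunk)) for c in counts.values())

def entropy_profile(path, window=4096):
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(window)
            if not chunk:
                break
            print(f"{offset:#010x}  {entropy(chunk):.2f}")
            offset += len(chunk)

entropy_profile("text_plus_pe.bin")   # placeholder; expect ~4.5 early on, higher near the end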


Let’s take a quick peek at another text-based file type: HTML. Using histograms, you can identify the difference between HTML and a pure text file. Any type of source code tends to have similar properties in that there is a prevalence of paired values such as ‘{‘and ‘}’, ‘[‘ and ‘]’, ‘(‘ and ‘)’, and ‘<’ and ‘>’. Additionally, there tends to be a higher use of mathematical symbols such as ‘*’, ‘+’, ‘-‘, and ‘/’. There may also be a higher percentage of capital letters than is normally found in written text.

Figure 3 - Histogram and Image of an HTML File (E=4.70042)

The images above are of a 190 KB HTML file. It still has a text-like quality, but note that the characters “<” and “>” are prevalent and balanced. Parentheses and braces have low counts but are also balanced. The image shows much more pattern than what you would find in a typical text-only file.

Compressed and encrypted files are also easily identified by visual inspection of the histogram. In many cases, they can even be distinguished from each other (which we will see is important in identifying XOR’d files).

Below is a picture (jpeg) I took of a Tasmanian devil in Hobart, Tasmania (562 KB), and the associated histogram files. The entropy is 7.98698 – fairly close to the maximum of 8.000.

Figure 4 - Tasmanian Devil in Jpeg Image

Figure 5 - Histogram and Image of Tasmanian Devil Jpeg File (E=7.98698)

Except for the count of zeros, the histogram is fairly uniform. (I omitted the red box outline so you can clearly see the large number of zeros in this jpeg file.) This typical characteristic can be used to distinguish it from other compressed formats. From the bitmap image, you can see that the zeros (black) are uniformly distributed throughout the file.

Now taking a look at the AES encrypted version of this same file, you can see a clear difference. The entropy is 7.99968, which is also distinguishable.

Figure 6 - Histogram and Image of AES Encrypted File (E = 7.99968)

There is no large count of zeros and the entropy is closer to the maximum by an order of magnitude. The bitmap image is good for recognizing that this is encrypted or compressed, but does not really help in discriminating between the two.

Since base32 and base64 encoding are sometimes used to obfuscate data, we’ll take a look at a few examples as well. Interestingly enough, not only are base32/base64 encoded files easily identified in their own right, but in some cases you can tell what type of data has been encoded without decoding them!

The respective entropies in the following images are 4.84594, 4.99830, and 4.99998. (Remember, for base 32 the entropy will approach 5 as opposed to 8.) So what do you think? Which one is text? Compressed? Encrypted?

Figure 7 - Base 32 Encoded Text (E=4.84594) File

Figure 7 - Jpeg (E=4.99830) File

Figure 7 - Encrypted (E=4.99998) File

Base 64 encoded files share a similar pattern except they have lowercase letters, carriage return/line feed, and a few other characters.

Finally, since we are dealing with malicious executables, which are often packed, you might wonder … can we use this tool to tell? Yes! Next are the histograms and bitmapped representations of the WBH program itself and its UPX-packed version.

Figure 8 - Histogram and Image of Standard PE File (E=6.58289)

Figure 9 - Histogram and Image of UPX Packed PE File (E=7.60086)

With that background in place, let’s take a look at the simple XOR encoding that is often found in malicious programs or data queued for exfiltration. This encryption technique is used because it is simple, fast, effective, and lightweight. We will explore three levels of XOR encryption: 1) XOR with a single character, 2) XOR with the short English word “hidden”, and 3) XOR with binary data (hex value 0xCA15DF9A), each progressively more difficult to decrypt.

A few observations to keep in mind:

  • Something XOR’d with itself is zero.
    • Whenever you find a zero in the target file, the original character is equal to the XOR key used.
  • Something XOR’d with zero will be itself.
    • Knowing that a file type has a large number of zeros, particularly if the location is known, can yield the key.
  • A letter XOR’d with the space character (0x20) will change the case
    • In an English text file, the space is typically the most common character
  • XORing with a single character will not affect the entropy

You have acquired 3 files which you suspect may have been exfiltrated. File A is 277 KB with an entropy of 4.700. Opened in a text editor, a snippet of it looks like:

ö¢¾§¦ôÇÀêêêö¢¯«®ôÇÀêêêêêêö‡ž‹ê¢¾¾ºç¯»¿£¼÷艥¤¾¯¤¾çž³º¯èê©¥¤¾¯¤¾÷辯²¾å¢¾§¦ñê©¢«¸¹¯¾÷ƒ™…çòòÿóçûèôÇÇÀêêêêêêö¦£¤¡ê¸¯¦÷è¹¾³¦¯¹¢¯¯¾èê¾³º¯÷辯²¾å©¹¹èꢸ¯¬÷踯¦¤¥¾¯¹ä©¹¹èôÇÇÀêêêêêêö¹¾³¦¯ôÇÀêêêê䦣¾

A snippet opened in a hex editor results in the following:

F6 A2 BE A7 A6 F4 C7 C0 EA EA EA F6 A2 AF AB AE F4 C7 C0 EA EA EA EA EA EA F6 87 8F 9E 8B EA A2 BE BE BA E7 AF BB BF A3 BC F7 E8 89 A5 A4 BE AF A4 BE E7 9E B3 BA AF E8 EA A9 A5 A4 BE AF A4 BE

You can tell right away that it is not strongly encrypted and likely not compressed since there are numerous repeating characters.  Is the alphabet limited to 32 or 64? It is hard to tell, particularly for the 64 character alphabet. (Hey, you can always write them down and count!)

So, you apply WBH to get the histogram and file picture shown next.

Figure 10 - Histogram and Image of XOR'd File (E=4.70042)

You can see that there are more than 32 characters, but not necessarily more than 64. (The textual histogram will tell you the exact counts – there are 95 different characters represented.) You can see there is one prevalent character. With this information, you can guess it is some type of text file.

The bitmap image is uniform throughout, so this file is likely entirely of a single type. As observed earlier, text tends to look very spotty but this has a texture, i.e. a pattern. Therefore it is likely some type of programming language, such as html.

You can also assume the prevalent character in an html file is a space. The value here is 0xEA which when XOR’d with 0x20 yields 0xCA. Use that to decrypt the rest of the file and you’re done.
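
That last step is trivial to script. Here is a minimal sketch, assuming the most common ciphertext byte really does correspond to a plaintext space (the file names are placeholders standing in for File A):

from collections import Counter

def recover_single_byte_key(data, assumed_plain=0x20):
    most_common = Counter(data).most_common(1)[0][0]
    return most_common ^ assumed_plain

data = open("file_a.bin", "rb").read()            # placeholder name for File A
key = recover_single_byte_key(data)               # 0xEA ^ 0x20 -> 0xCA in this example
plain = bytes(b ^ key for b in data)
open("file_a_decoded.html", "wb").write(plain)
print(f"guessed key: {key:#04x}")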

File B is 771 KB with entropy of 5.27539. In this case, the hex editor view is quite revealing.

Figure 11 - Hex Editor View of Unknown File Type (E=5.27539)

The capital letters “I”, “E”, “D”, “H”, and “N” really stand out. The separation between these characters could easily coincide with English word lengths (you will not see that positional correlation using WBH).

The histogram confirms a similarity to a text file, but the character frequencies are displaced. There are six prevalent characters, which gives an indication of the key length. It should be pretty apparent that this is a text file XOR-encrypted with the keyword “hidden” – that keyword contains only five distinct characters, so counting prevalent histogram values does not yield the six-character key length exactly, but it gets you close.

Figure 12 - Histogram of Unknown File (E=5.27539)

File C is 97 KB with entropy of 7.30216. This entropy fits the profile of a compressed file. The hex editor view is quite revealing in this case too. The hex values 0xCA, 0x15, 0xDF, 0x9A repeat quite frequently in the beginning which indicates that whatever characters are in those positions are the same as each other. However, the positional correlation does not match well with English text as these repeated characters are side-by-side.

Figure 13 - Hex Editor View of Unknown File (E=7.30216)

Figure 14 - Histogram and Image of Unknown File (E=7.30216)

There are 4 characters with similar frequency and another 5 or 6 characters with a different similar frequency. Also, this type of distribution does not match a compressed file very well – you would expect the distribution to be more uniform than this.

It is difficult to notice on the bitmap at this size, but at the very bottom you can see a slight pattern (that’s the repeating hex digits seen in the hex editor at the beginning of the file), then you see a large section with a different pattern, and finally, at the top of the image, you see multiple small patterns. Portable Executable (PE) files have different sections that match this profile, and they also tend to have zeros and 255 as prevalent characters. I’m making a bit of a leap here based on experience, but in practice you will need to do that too!

So this is actually the WBH executable file XOR’d with 0xCA15DF9A.
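
The same idea extends to the repeating four-byte key: PE files contain long runs of 0x00, and zero XOR’d with the key is the key itself, so the most common key-aligned group in the ciphertext is a strong key candidate. Below is a rough sketch, assuming the key length has already been guessed from the repeat spacing and that the key is aligned to the start of the file (file names are placeholders standing in for File C).

from collections import Counter

def recover_repeating_key(data, key_len=4):
    groups = (data[i:i + key_len] for i in range(0, len(data) - key_len + 1, key_len))
    return Counter(groups).most_common(1)[0][0]

data = open("file_c.bin", "rb").read()            # placeholder name for File C
key = recover_repeating_key(data)                 # expected here: ca15df9a
plain = bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
open("file_c_decoded.exe", "wb").write(plain)
print("guessed key:", key.hex())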

We’ve seen that different file types often have distinct statistical signatures. By knowing the statistical characteristics, you can determine the likely content of unknown file types by comparing simple statistics such as entropy and histograms. All file types have some variability, and some more than others. For instance, executable files can be packed, contain various resources such as bitmaps and have multiple sections, so the statistics vary accordingly.

You might ask, “Can this be automated?”  Yes, and I have already done so, to a certain level of accuracy. There is still more work to be done on that front, but it is certainly possible.

We have also seen that while simple XOR encryption may provide obfuscation to bypass signature detectors and the human eye, it is not exceedingly difficult to crack with just a small set of tools.

On the Difficulty of Autonomous Pornography Detection

By: John Ortiz

INTRODUCTION:
I was watching the news the other day and saw a news report about a new product that claimed to be able to detect pornography on a PC. Fascinated, and knowing what a difficult task this actually is, I decided to buy it and check it out.

There is no computer that will ever be able to detect porn 100% of the time, since the definition is subjective to the person seeing it. One man’s porn is another man’s art, or medical image, etc. Clearly, there are depictions that most of the population would agree are pornographic in nature, and the goal is to identify these with the smallest possible number of false positives. The best we can hope to do is to reduce the volume of images that require further investigation. In other words, we want to eliminate images that definitely do NOT need additional scrutiny. Even this is not an easy problem.

The objective of the investigator is a critical consideration as well. For instance, a parent, employer, or school may need to detect only a single instance in order to be successful. This task is not difficult as in any cluster of illicit images at least one will likely meet some basic search criteria. Here, we will view the problem from the perspective of law enforcement, whose objective is to quickly find all images of interest given a large collection of storage media.

This discussion will focus on images, since that is what the product claimed to detect. Video support is on the way, but was not available in time for my evaluation. (Video can be modeled as a collection of still images, perhaps with some temporal motion correlation or audio/motion indicators, but that is for another day.)

There are a number of broad classifications of images as well: full color, black and white, grayscale, animated, and computer generated. Here we will limit the scope to full color images, including Computer Generated Images (CGI). The U.S. Supreme Court has ruled that CGIs that do not depict real people (i.e., children) cannot be banned, as that would violate artistic expression. The reason to include these generated images in this discussion is to illustrate the difficulty of autonomously distinguishing between CGIs and real images – that is a different problem space.

DETECTION APPROACHES:
A brief literature review shows that a number of papers exist addressing the challenges of skin detection, facial recognition, nudity detection, etc. I performed a cursory review to glean some detail on the basic approaches and their success rates.

While there are variations in detection approaches, the primary method is some form of skin detection. The second most prevalent technique is to detect human faces and limbs. Multiple color spaces have been tried, such as RGB, YCrCb, and HSV. Edge detection, neural network pattern analysis techniques, and even contextual approaches such as checking file names and visited websites have also been explored.
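
To make the skin-detection idea concrete, here is one frequently cited RGB rule for daylight-lit skin, written as a small Python sketch. The thresholds are illustrative only and are not the commercial product’s algorithm.

def looks_like_skin(r, g, b):
    """A classic RGB heuristic for skin-colored pixels (illustrative thresholds)."""
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

def skin_ratio(pixels):
    """Fraction of (r, g, b) tuples that pass the rule."""
    hits = sum(1 for r, g, b in pixels if looks_like_skin(r, g, b))
    return hits / max(len(pixels), 1)

# e.g. with Pillow: skin_ratio(list(Image.open("photo.jpg").convert("RGB").getdata()))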

I could not find any technique or set of techniques that could identify human skin with 100% accuracy. There are simply too many variations in skin tone, texture, lighting, shadows, etc., and too many other objects that can have the appearance of skin and human limbs. Consider the inside of a cavern for instance.

There are four possible results to the question, “Is this image pornographic?” (Certainly, degrees of confidence can be introduced, but the 4 basic results are the same.)

True Positive: The image is correctly identified as porn
True Negative: The image is correctly not identified as porn
False Positive: The image is falsely identified as porn
False Negative: The image is not identified as porn, but it is

COMMERCIAL PRODUCT RESULTS:
So, looking at the commercial product and its claims, I can say that most of the claims are accurate. If you are a parent or employer just trying to quickly find a single instance of illicit content, this product may work for you. After seeing the results, I am quite confident that if pornography exists in any significant amount on a computer, then it will detect at least some instances.

The accuracy claim, however, is misleading at best. It claims to detect facial features, body parts, etc., to include flesh-tone analysis, and achieve a false positive rate of less than 1%.

I was not able to ascertain how well it actually found pornographic content since, when I ran it on my laptop, it DoS’d itself by flooding itself with detected images and became unusable. (I suspect this bug will be fixed in the future?) For those of you laughing, keep in mind that for my analysis ALL the images it detected were false positives, as I ran the software on a clean laptop. It divides the detected images into three categories: 1) Highly Suspect; 2) Suspect; 3) Low Suspect. Figure 1 shows three images identified as “Highly Suspect.”

Figure 1 - Three highly suspect images - Image 1

Figure 1 - Three highly suspect images - Image 2

Figure 1 - Three highly suspect images - Image 3

Clearly, you would expect image number 3 to be flagged. In order to discount that image, the algorithm would have to identify the texture of the clothing, eliminate it as a valid skin texture, realize it is covering critical areas, and discount it. But what about the other two highly suspect images? Where are the body part identifications in picture 2? And the following image, which was also identified as “Highly Suspect”, made me wonder.

Figure 2 - A highly suspect baboon

For this analysis I converted the images to 256-color paletted bitmaps so that the broad color composition could easily be evaluated via the palette colors. This will reduce the actual number of colors, but since very similar colors will be represented as a single color it will ease the analysis without affecting the objective.

Using a tool I developed, I randomized the image pixel indices. (In a paletted image, each pixel is an index into a color table that identifies the actual 24-bit color to use at that position in the image.) The effect of this randomization is that each color in the palette is represented more or less uniformly. It is important to emphasize that the pixel positions themselves were not scrambled, just the color to which each pixel refers. This does not represent the frequency of occurrence of an individual color, but if there are many shades of a color in the same range, that becomes readily apparent. Removing the context makes it easier for us to see the colors present. For instance, in the baboon image our mind sees blurred brush in the background, yet when you focus on just the actual colors, you can see how they are valid skin colors.
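
The randomization tool is the author’s own, but the effect is easy to approximate. A rough sketch using Pillow (the file name is a placeholder): quantize the image to a 256-color palette, then give every pixel position a random palette index so the palette’s colors are displayed roughly uniformly, with no spatial context.

import random
from PIL import Image

img = Image.open("baboon.jpg").convert("RGB")     # placeholder file name
pal = img.quantize(colors=256)                    # 256-color paletted ("P" mode) version

w, h = pal.size
random_indices = bytes(random.randrange(256) for _ in range(w * h))
scrambled = Image.frombytes("P", (w, h), random_indices)
scrambled.putpalette(pal.getpalette())            # keep the original palette colors

scrambled.save("baboon_randomized.png")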

The three randomized images in Figure 3 correspond to those in Figure 1. Now, on the basis of skin tone alone, it is much easier to understand why these images were identified.

Figure 3 - Randomized highly suspect images - Image 1

Figure 3 - Randomized highly suspect images - Image 2

Figure 3 - Randomized highly suspect images - Image 3

The same thing can be said for the randomized baboon as shown below in Figure 4. The majority of palette colors are valid skin tones. Looking at the baboon image again, discounting the black and green and everything else could be human skin.

Figure 4 - Randomized baboon image

There was a slightly larger collection of “Suspect” images, most of which were facial images. I would have to guess, in the case of the Kenny G album cover (Image 1), that the detection algorithm considered the spatial grouping of the skin tones of his face, since there are many colors that are not skin.

Figure 5 - Image 1

Figure 5 - Image 1 - Randomized

Figure 5 - Image 2

Figure 5 - Image 2 Randomized

Figure 5 - Image 3

Figure 5 - Image 3 Randomized

But when looking at the “Suspect” images in Figure 6, particularly the lightning strike, this hypothesis does not hold up. The skin tone colors are not grouped in anything like the shape of a human face. Where is the human body part correlation and background noise elimination as claimed?

Figure 6 - Image 1

Figure 6 - Image 1 Randomized

Figure 6 - Image 2

Figure 6 - Image 2 Randomized

There were thousands of “Low Suspect” images which I was not able to retrieve due to the device locking up, but again, they were all false positives. Many were very small images that are automatically downloaded when browsing the web.

So here is a test for you: below are some actual randomized images of

  1. Pornography
  2. Computer generated pornography
  3. Everyday pictures (a falcon, a dog, a frog, and me eating a lobster)
  4. Beach scenes to include people and/or animals (non-pornographic)

By skin tone alone, can YOU determine which of the original images were pornographic? Which images were computer generated pornography? Or even, which are definitely NOT pornographic? (Ok, the green frog should be easy.) There are 3 from each category.

Example Image 1

Example Image 2

Example Image 3

Example Image 4

Example Image 5

Example Image 6

Example Image 7

Example Image 8

Example Image 9

Example Image 10

Example Image 11

Example Image 12

CONCLUSION:

The valid colors and textures of human skin are vast but not infinite. Skin color detection is easy, but actually confirming that a group of pixels belongs to human skin is not. And skin detection alone is not enough, as there are many ways skin colors can appear in images that are not in any way connected to human skin. Identifying human limbs and faces can help, but it is also not a complete solution. Context could certainly help, but that too is difficult for a computer to discern.

Context that is orthogonal to the image such as location on the disk, file name, clustering with other like files, visited websites, etc., is currently much more likely to produce better results than autonomous identification.

The porn detection challenge will never be completely solved, but eventually technology may allow us to identify it accurately enough for quick forensic analysis. Based on my research we are not there yet.

Harris @ DoD Cyber Crime Conference 2012

For all those readers attending the DoD Cyber Crime Conference, please don’t forget to visit us at booth #509. We love to talk nerd and you can meet some of this blog’s authors. We’re always looking to hire smart people too! See our current job openings at http://www.harris.com/harris/careers/.

Our very own Brian Hussey and John Ortiz will be presenting:

Brian will present “Decoding Data Exfiltration – Techniques to Understand What Was Taken” on Wednesday at 9:30AM in Learning Center in the Forensics Track. If you enjoyed his post “Decoding Data Exfiltration – Reversing XOR Encryption” this presentation is for you!

For some stego goodness, John will present “An Introduction to More Advanced Steganography” on Wednesday at 9:30AM in Courtland in the Research and Development Track.

Port-Independent SSL Detection

By: Ben Williams

Many network-based applications are gaining support for native end-to-end transport-layer encryption using Secure Sockets Layer (SSL). Secure web connections over HTTPS have been a standard for online merchants, webmail, and financial sites for a number of years. As SSL-based encryption becomes more common for legitimate uses, it has also become more common for malicious purposes. Practically any protocol can be placed in an SSL wrapper to help avoid detection and logging. Some examples of this include IRC and Jabber chat, reverse command shells, and malware beaconing activity.

When monitoring a network or sifting through a large amount of packet data, it is important to identify what network connections were made and what the underlying protocol was for every unique conversation. Many COTS products provide simple protocol detection by keying off the server’s port number and will not provide an indicator that SSL may have been used on a non-standard port.

This entry will focus on the current versions of SSL: SSLv3 and TLSv1. In my experience, these versions account for the majority of SSL traffic across numerous applications. So my usage of ‘SSL’ refers to SSLv3 and TLSv1 throughout this entry.

SSL payloads should always begin with a well-defined five-byte header:

Byte 0: Content Type

  • 0x14: SSL_Change_Cipher_Spec
  • 0x15: SSL_Alert
  • 0x16: SSL_Handshake
  • 0x17: SSL_Application_Data

Once an SSL session is established, the majority of SSL records should be of the SSL_Application_Data content type. This simply indicates that SSL-encrypted data is being transferred between the authenticated hosts.

Bytes 1-2: SSL Version

  • 0x0300: SSLv3
  • 0x0301: TLSv1

Independent of SSL content type, the SSL version should always be indicated.

Bytes 3-4: SSL record length

This SSL record length excludes the five-byte header.
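
Putting the three fields together, the check itself is tiny. Below is a Python sketch of the same test the tshark filters that follow will express; the payload bytes are assumed to come from your own packet parsing.

def looks_like_ssl_record(payload):
    """True if a TCP payload starts with a plausible SSLv3/TLSv1 record header."""
    if len(payload) < 5:
        return False
    content_type = payload[0]
    version = payload[1:3]
    # Bytes 3-4 are the record length; parsed here for completeness, but as noted
    # above it is not needed just to flag the traffic.
    record_len = int.from_bytes(payload[3:5], "big")
    return 0x14 <= content_type <= 0x17 and version in (b"\x03\x00", b"\x03\x01")

# Example: the first bytes of a TLSv1 handshake record (ClientHello)
print(looks_like_ssl_record(bytes.fromhex("16030100a1010000")))   # True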

For this scenario, the SSL record length is not relevant. I am only interested in determining whether SSL data exists within the packet capture, and specifically if it was used on a port other than the standard TCP 443. The search will match against the four valid SSL Content Types at offset 0 of a data payload, as well as the two valid SSL Version types at offset 1 of a data payload. The quickest and most direct way I’ve found to search for SSL data across all network ports is to utilize standard tshark filters.

To begin, start with a tshark filter that will match against the above conditions and print out the default one-line summary of each matching packet:

tshark -r dec20.pcap -R "data.data[0:1] >= 14 && data.data[0:1] <= 17 && (data.data[1:2] == 0300 || data.data[1:2] == 0301)"

An explanation of each option is as follows:

-r         Read packet data from infile
-R        Apply the following display filter before printing or writing the packets

Inspect one byte of each packet’s data payload, at offset 0, to see if the byte falls within the range 0x14 to 0x17 (valid SSL Content Types):

data.data[0:1] >= 14 && data.data[0:1] <= 17

Also inspect two bytes of each packet’s data payload, at offset 1, to see if the bytes are equal to 0x0300 or 0x0301 (valid SSL Versions):

data.data[1:2] == 0300 || data.data[1:2] == 0301

The result of this query against the above packet capture revealed the following:

This is a quick way of identifying specific TCP sessions that require further inspection. To view the full hex and ASCII content of these packets, add the -x flag to the tshark command:

tshark -r dec20.pcap -R "data.data[0:1] >= 14 && data.data[0:1] <= 17 && (data.data[1:2] == 0300 || data.data[1:2] == 0301)" -x

-x         Print a hex and ASCII dump of the packet data

The result of this query against the same packet capture revealed something interesting about the TCP session identified earlier:

If SSL data does exist within the packet capture, most of the content would of course appear to be garbage at this point. But, the above packet shows that 192.168.1.200 has sent what appears to be an SSL server certificate to 192.168.1.168 and provides multiple clues about what’s actually going on between these hosts:

  1. 192.168.1.200 is most likely an SSL server for something on TCP 6697
  2. 192.168.1.200 is likely running SSL-based IRC services for darknetz.irc
  3. 192.168.1.168 is likely running an SSL-enabled IRC client

A quick open-source lookup shows that TCP port 6697 isn’t the official port for IRC SSL, but it is often used as such. The combination of this information provides pretty high confidence that this is SSL-encrypted IRC. Depending on your environment, the next steps of analysis or investigation would greatly vary. If the IRC server’s private key file could be obtained, this traffic could then be properly decoded and decrypted into clear-text IRC traffic.

If used on a large corporate network, the last tshark filter would probably return a high volume of packet data for legitimate encrypted web traffic. To filter out the more common HTTPS traffic and begin the hunt for anomalous or unknown SSL connections, add one more clause to the display filter in the last query:

!(tcp.port == 443)

This technique may lead to the identification of protocols or applications that were previously unknown to a security analysis, forensics, or engineering team. As stated before, it is critical to identify and understand the protocols in play in a network monitoring or pcap analysis situation. Advanced attackers and threats purposefully hide in ‘network noise’ under the notion that most organizations won’t recognize the single outbound encrypted connection that contains malicious activity.