Lab 09: Processing Captured Network Traffic

Goals

Credits

The material for this lab was created by Professor L. Felipe Perrone based on previous work by Professors Xiannong Meng (Bucknell University) and Ralph Droms (Cisco). Permission to reuse this material in parts or in its entirety is granted provided that the credits note is not removed. Additional student files associated with this lab, as well as any existing solutions, can be provided upon request by e-mail to perrone[at]bucknell[dot]edu.

Setup

Go to your CSCI 363 lab directory. Copy all files for this lab by doing the following.

cp -r ~cs363/Spring16/student/labs/lab09 .

Then add the newly copied lab09 to your gitlab repository.

Background

Where does the data come from?

The data files contain packets captured from live network sessions at different times, using software such as Wireshark and Ethereal (the previous generation of Wireshark). Most of these files were generated by the textbook authors. Some were generated locally at Bucknell. The captured files follow a known packet capture format called pcap, which stands, well, for packet capture. You will write program(s) to read and analyze these packet files.

The format of the packet traces

Packet trace files are binary files, unless someone has kindly parsed their contents and converted all the information to human-readable text. If you have ever tried to cat a binary file to your terminal, you know what kind of mess you get. (You end up having to reset the terminal with /usr/bin/reset, in Linux, or alternatively, you may kill the terminal and open another one.) The proper way to examine the contents of a binary file is to use an editor that supports that kind of format or, instead, use some kind of utility that converts each group of 4 bits (what is sometimes called a nibble) into a hexadecimal digit. By now you may have used the Linux hex-dump utility xxd several times (if you haven't, check out its man page.)

The directory for this lab, located at ~cs363/Spring16/student/labs/lab09, contains a trace file called trace-dec21-2005, which was generated by Ethereal in Dana in 2005. We will work with this file to explain what you will accomplish in this lab. Later on, after you have written your new application for analyzing frames, you can try your program on other trace files generated by the textbook authors, found in the directory ~cs363/traces/, and on the data you collected yourself.

Try to view the contents of trace-dec21-2005 with xxd. Note that xxd is nice enough to display, to the right of the hex dump, the interpretation of the same binary data as ASCII text. This makes it particularly easy for a human to spot any occurrence of text within the binary file. The first column of the output contains a byte address relative to the beginning of the file; this address is displayed in hexadecimal and advances in increments of 16 from line to line. The next eight columns, each showing two bytes as four hexadecimal digits, correspond to a group of 16 consecutive bytes in the binary file (remember that each pair of hexadecimal digits corresponds to the value of one byte). The last column, all the way on the right, shows the ASCII representation of those 16 bytes. If a byte value does not correspond to any printable ASCII character, it is displayed as a '.', that is, a period. In the file trace-dec21-2005 you should see text strings such as fedora.redhat.com and www.cnn.com, which probably means that packets in this collection traveled to these websites. In addition, can you spot some HTTP protocol related text?

Try the following command at your terminal.

xxd trace-dec21-2005 | less

What you see should correspond to:

0000000: d4c3 b2a1 0200 0400 0000 0000 0000 0000 ................
0000010: ffff 0000 0100 0000 47c5 a943 fd14 0300 ........G..C....
0000020: 2a00 0000 2a00 0000 ffff ffff ffff 0001 *...*...........
0000030: 039c ffbd 0806 0001 0800 0604 0001 0001 ................
0000040: 039c ffbd c0a8 0166 0000 0000 0000 c0a8 .......f........
0000050: 0101 49c5 a943 5eed 0d00 4000 0000 4000 ..I..C^...@...@.
0000060: 0000 0001 039c ffbd 000c 41a1 f5da 0806 ..........A.....
0000070: 0001 0800 0604 0002 000c 41a1 f5da c0a8 ..........A.....
.
.
.

The format of this data file is the libpcap standard. Information for developers is available on the Wireshark project website (http://wiki.wireshark.org/Development/LibpcapFileFormat). Instead of duplicating information from the project site here, we just ask you to get the instructions on how to understand the file from the project's website. The basic idea is that the file starts out with a global header, which is followed by a sequence of tuples <captured packet header, packet data>.

The key is to understand the structure of a captured packet file, which is represented in the following graphic (from http://wiki.wireshark.org/Development/LibpcapFileFormat)

Global Header | Packet Header | Packet Data | Packet Header | Packet Data | Packet Header | Packet Data | ...

The Global Header structure is defined in the include file /usr/include/pcap/pcap.h which is extracted as follows.

The Packet Header structure is also defined in the include file /usr/include/pcap/pcap.h which is extracted as follows. Note that in the pcap.h definition, the packet header contains a struct timeval instead of two separate 32-bit integers. Do not use that version, because the captured packet file contains only 8 bytes (64 bits) for the time field, that is, two 32-bit integers. If struct timeval is used, which occupies 16 bytes for its two values on a 64-bit system, the header boundaries come out wrong.

The type bpf_u_int32 used in both headers is just an unsigned 32-bit integer.

The packet data, for our purposes in this lab, will be an Ethernet frame, which is structured as in the figure below.

Ethernet frame format

The preamble of the frame contains a fixed bit pattern that is used to mark the beginning of the frame during transmission. It is important to note that this preamble is used by Layer 2 to identify frame boundaries. The preamble is not part of the data captured for each frame by Wireshark.

The Ethernet destination and source addresses are each 48 bits long. (Note that they are not IP addresses.) These values are stored in binary in the frame, obviously, but in order to make them more human-friendly, they are normally written as the hexadecimal representation of 6 bytes separated by colons, as in:

00:0A:B7:4C:C1:33

To understand the body of an Ethernet frame, one needs to know which protocol generated it; this is recorded in the 16-bit type field of the header. The values of this field are standardized. A commonly used collection can be found at <http://en.wikipedia.org/wiki/EtherType>, and a complete list can be found at <http://www.cavebear.com/archive/cavebear/Ethernet/type.html>.

The body of the frame contains the data in which we are interested, that is, the packet pushed down from protocols higher up in the protocol stack.

The CRC is used only for error detection by the Layer 2 protocol. This field isn't terribly interesting to one who captures frames for later analysis because, after all, who would be interested in studying a corrupted frame? The CRC is not part of the data captured for each frame.

The data structures you will need to use in this assignment have already been defined in an include file in the Linux system. Look at /usr/include/pcap/pcap.h to find the definition of the following types among others:

pcap_file_header // a struct for the global file header
pcap_pkthdr // a struct for individual packet headers

If you read this header file carefully, you will notice that the library libpcap defines a number of functions, which can be useful to you in this or in future packet analysis assignments. Use your best resources to find documentation on these functions to understand what they do. The first resource to try is the manual pages; try man pcap. Although libpcap contains a number of useful functions, we don't have to use the library. Instead, we can define our own functions, as long as we follow the structure of the captured traffic file, which is defined in pcap.h.

Other header files that may define data types that will be useful to you are listed below. You should browse them before you attempt to redefine a data type that is already defined in these standard header files.

Parsing a packet trace

You are given the skeleton of a C program to analyze the captured data, that is, to parse a packet trace. The file is called etherTrace.c and needs to be copied from ~cs363/Spring16/student/labs/lab09/ along with the other files. Note that there is almost nothing implemented in this file; it contains only type definitions and guidelines on how to flesh it out into a full program. Read it carefully before you proceed. After reading it, look over a second file called etherTrace-skeleton.c, which implements a few steps beyond what is given in etherTrace.c. The purpose of the second file is to give you some ideas on how to proceed. You can rename etherTrace-skeleton.c to etherTrace.c, or revise etherTrace.c based on what you see in etherTrace-skeleton.c. You will be asked to submit the completed etherTrace.c.

You will expand this skeleton program into a complete C program that prints the frame length as received (caplen in the packet trace header), the arrival time of each frame, the source and destination Ethernet address, and the protocol type, all in human readable form.

Reporting a frame's arrival time

When you report the capture time of a frame, use the tv_sec field in the pcaprec_hdr_s structure along with the library call ctime(). If you don't remember how ctime() works, check it out in the Linux manual pages. The printed time should be a string in the format generated by ctime(), as follows:

Wed Dec 21 16:12:39 2005

Reporting Ethernet addresses

Ethernet, or MAC, addresses are 48 bits (6 bytes) long. You could come up with several different ways to transform the 48-bit binary address into a printable string of hexadecimal numbers, but there is a standard way to accomplish such a feat. Read the manual page for the library function ether_ntoa, which stands for "ether number to ASCII", to discover exactly what you need to convert 48 bits in network byte order to a string in the standard hex-digits-and-colons notation (which omits leading zeros, if there are any). Note that you should copy the string returned by this function into your own buffer; since what is returned is a pointer to a statically allocated buffer, subsequent calls to the same function will overwrite it. Make sure you include the proper header file to use the function ether_ntoa().

Alternatively, you could also use the following function to convert the Ethernet address into a readable character string. This is really how functions such as ether_ntoa() work.

What should be the proper value for the constant BUFFSIZE?

Reading binary files

The key to reading any binary file is to figure out the length of the appropriate chunk to read, read the data into a byte buffer (an array of unsigned char), and then cast what is read into the appropriate format. For example, if one knows that the file contains an Ethernet frame that has a header of the type struct cap_frame_header_t, followed by the data in an Ethernet packet, then one can read the frame and align it with the Ethernet header using the following statements.

Access to information within a trace file

The Wireshark captured network traffic data file consists of a fixed-size file header, as defined in struct pcap_hdr_s, followed by a collection of <frame-header, frame-data> pairs. Each frame header has the format defined in struct pcaprec_hdr_s, while the frame data contains the actual Ethernet frame header and a portion of the payload (e.g., an IP header followed by a TCP header). The definitions of both headers can be seen in /usr/include/pcap/pcap.h, or in the program skeleton etherTrace.c that you copied (or are about to copy). To read the trace file, one first reads the file header and then enters a loop in which each cycle reads one frame header and one section of frame data, as outlined in the program skeleton etherTrace.c. The amount of data to read as the file header is given by sizeof(struct pcap_hdr_s). For each frame, the frame header size is given by sizeof(struct pcaprec_hdr_s), while the amount of data in the frame varies and is given by the caplen field of that frame header. The Ethernet header is defined in /usr/include/net/ethernet.h. Once the data is aligned with an Ethernet header structure, one can access the information within the Ethernet header, such as ether_type, ether_dhost, and ether_shost.

Your work

Complete the program etherTrace.c and run it on the trace ~cs363/Spring16/student/labs/lab09/trace-dec21-2005. Your program should read the name of the trace file to work with as a command line parameter, then process every frame in the captured data file.

You will expand this skeleton program into a complete C program that prints the following pieces of information from the trace header.

For each Ethernet frame in the trace, you are asked to report the following:

If the Ethernet frame type is not IP, ignore it and continue to analyze the next frame. If the frame contains an IP packet (read /usr/include/net/ethernet.h to find out the value of the type field for IP), print the following:

Remember the IP packet header is defined as follows in the file /usr/include/netinet/ip.h.

If the actual header file is a bit hard to read, here is a graphical representation of it.

IP Packet Header

Here is a sample output from a trace:

Once the program works properly, you can try it on any of the data sets in the directory ~cs363/traces/. In particular, we made a copy of http1.dat in your lab09 directory that you can try. This trace is one of the traces provided by the textbook authors. Can you figure out the symbolic names of some of the hosts in the trace, for example, 63.240.76.19 or 128.119.245.12?

You should also run the program using the data you collected from your VM using Wireshark. Save and submit a copy of this output. Name the output file "mypacket-trace-output.txt".

After finishing all the work, submit all program files and your trace output file to your gitlab repository.

Congratulations, you have just finished the lab!