Small photo of Mark Meiss

Home Page for Mark Meiss

L597: Structural Data Mining and Modeling
Professor Katy Börner, Fall 2004

Mark Meiss
mmeiss@indiana.edu
Department of Computer Science
Ph.D. student, 4th year

Programming Experience | Software Experience | Data Mining Experience | Interest in Data Mining | Course Expectations | Structural Data Sets | Summary of Data Sets

New: My Class Presentation


Programming Experience

I've been programming since I was a child, and I've held a number of professional programming positions. My area of expertise has been in low-level systems programming, especially network applications, but I've done a fair amount of Web development and GUI design as well. My preferred languages are C, Java, Perl, and Python, though I can get by in a number of others, such as Matlab, Scheme, and various flavors of assembly. I'm a Unix coder at heart and don't have much developer experience with Microsoft Windows (other than porting Java code). I do have some practical experience with OpenGL and relational databases.

Software Experience

Because I am employed as a programmer and spend most of my time developing applications, there aren't a lot of existing applications that I really consider myself fluent with. I spend most of my computing life at a shell prompt. The closest candidates would probably be Matlab, Emacs, Word, Access, and Powerpoint. I've done some minimal animation in Flash (which was quite entertaining). I have very little practical experience with any spreadsheet or photo manipulation applications.

Data Mining Experience

The only data mining course I have taken is Professor Menczer's "Mining the World Wide Web" class, which I took in spring of 2004. In that course, Heather Roinestad and I developed a system called Miner/Event that used Google as a high-latency database for assigning relevant calendar years to arbitary queries.

My first taste of data mining came years earlier, when I was employed as a network applications developer at IU's University Information Technology Services. There I had to develop applications capable of sifting through large amounts of network usage data to determine the identities of abusive applications and users. That was my first exposure to multi-gigabyte data sets.

Interest in Data Mining

My intended thesis research is directly related to data mining. I am trying to analyze the properties of the networks implicitly defined by network traffic data coming from the border routers of Indiana University and the Abilene network. This data contains information about individual conversations on the network, such as the hosts involved, the protocol and port they used, and the duration and volume of the conversation -- but not any of the actual content exchanged. I hope that by examining the structural properties of this flow data, I will be able to distinguish different types of applications from their real-world behavior and detect more varieties of anomalous behavior than are visible to standard network intrusion detection systems.

In short, I'm interesting in structural data mining as a way of investigating data network usage from a behaviorial point of view. Though I'm interested in the actual physical structure of the Internet, I'm even more interested in the structure of the network implicitly defined by what people actually do when they're connected to the Internet. I believe that structural data mining can help me to answer many interesting questions about that operational network.

Course Expectations

I expect to learn a lot from this course. I would like to gain a solid understanding of the common metrics that various fields use to characterize the structure of networks and effective ways of calculating those metrics. I'd also like to learn about the ways in which structural data mining is being applied to fields other than my own so that I can draw analogies to problems I face in my own research. I would like to know approximately where the frontiers of research are at the moment so that I can have an intuition for whether a particular problem has already been solved years ago or it's time to roll up my sleeves and try to invent something new. Finally, I would like to develop an understanding of which tools and techniques are practical for application to very large data sets. The data sets I work with involve networks with hundreds of thousands of nodes and millions of edges, so I'm very sensitive to scalability issues.


Structural Data Sets

Several of the data sets I describe below are representative of a particular type of data that I know has been researched extensively, but are not the exact data that has been used in published results. I tried to pick a variety of problem domains in the hopes that I can pique the interests of more people.

  1. Network Flow Data

    Network flow data is generated by routers in the Internet and contains information about what computer talked to what other computer, with what protocol, and for how long, but does include any of the content of the information exchanged. This sort of data is the subject of a massive amount of ongoing research, especially in the domain of Internet security.

    For this project, I filtered out a fairly small set that contains all the connections made between Indiana University and Internet Relay Chat servers during a one-hour period in the early evening of September 5, 2004. This is a directional graph in which all edges point from a server system to a client system.

    Interesting structural properties:

  2. Internet AS Connectivity

    The Internet is both one large worldwide data network and a collection of many smaller networks. Each of these smaller networks may contain even smaller networks, but Internet routing protocols generally deal with only the top layer of this hierarchy. At this level, the networks are called "autonomous systems", or ASes. ASes can be vary large and are typically associated with a large organization or Internet Service Provider. For example, Indiana University has an AS, as do ISPs like AOL, UUNET, and Sprint.

    The Route Views Project at the University of Oregon regularly publishes data from which we can infer which ASes are connected directly to which other ASes. For example, Indiana University and Purdue University are directly connected, but the same is not true of Indiana University and Microsoft. I used a snapshot of the routing data from early 2004 to generate the data in the summary table.

    Interesting structural properties:

  3. Call Graphs

    The call graphs generated by an application are a source of perpetual research interest because when weighed with information about call frequency and duration, they provide invaluable information on how to optimize a program for speed or simply identify code paths. Virtually all standard profiling tools are able to generate a call graph for an application, but I haven't seen much discussion on structual properties of the graph itself.

    For an example data set, I compiled one of my more complex data analysis tools (a packet dump viewer) with profiling options enabled and extracted the call graph from a brief run. In this particular case, the graph is actually a tree, but that will not be true in general.

    Interesting structural properties:

  4. Message Forums

    Internet message forums are lively places, and forums that perform threading allow us to establish a clear picture of what people replied to what other people. Thus, a lively discussion thread actually gives us an operational view of a social network; if someone responds to someone else, they "know" them at least in the sense of having interacted with them on the Internet.

    Among message forums, it's difficult to find one more lively or contentious than Slashdot. I took a recent discussion from Slashdot and derived a directed graph from the thread structure. There is an edge from node A to node B in the graph if person B has responded to a post from person A. Thus, the edges imply that information has at some point passed from person A to person B, assuming that person B read before posting.

    Interesting structural properties:

  5. Path Topology

    I was interested in finding data sets on the Web that represented topology data for the streets of a town or the highways of a nation -- in short, the sort of data that may serve as a basis for one of the many route-finding programs now available online. Unfortunately, I wasn't able to find anything readily usable.

    On the other hand, it is easy to generate artificial data that simulates a network of roads simply by converting a simple two-dimensional maze into a network. Draw the map on a grid; each space in the grid is a node, and each node may have from zero to four neighbors depending on which sides there are walls. For this data set, I constructed a 10x10 maze and converted it into a network; the entrance is node 1 and the exit is node 100.

    Interesting structural properties:

Summary of Data Sets

(1) Flow Data (2) AS Connectivity (3) Call Graphs (4) Message Forums (5) Path Topology
Data: Here Here Here Here Here
# Nodes: 1,186 17,404 80 153 100
# Edges: 1,497 42,250 305 193 99
Directed? Yes No Yes Yes No
Connected? No Yes Yes No Yes
Other Attributes Edges can be weighed by volume of data Loops indicate use of BGP inside an AS Edges can be weighed by frequency, nodes by duration Edges can be weighed by repetition or content Edges can be weighed by distance or capacity
Diameter: -- -- -- -- 54.0
Average Degree: 1.2622 4.6058 3.8125 1.2614 1.9800
Characteristic Path Length: -- -- -- -- 19.2557
Scale Free Exponent: -- 0.0012 -- -- 0.0066
Mutual Connection Coefficient: 0.0 1.0 0.0 -- 1.0
Incoming Clustering Coefficient: -0.0020 0.6578 0.0000 0.6007 0.1800
Outgoing Clustering Coefficient: -- 0.6578 -- -- 0.1800

Last modified: Mon Sep 27 18:36:49 EST 2004