Why NetFlow Isn't A Web Usage Tracker

2013 Update:
As I mentioned in an update to the original post, HTTP tracking is now available via custom IPFIX exports in a variety of products. Cisco's ISR G2 and ASR 1K routers now have this export available as part of their MACE feature set (Data license required). You'll still need a collector capable of dealing with this export record, however. The content below still applies to NetFlow v5 and traditional NetFlow v9.

Here's a question I find myself answering frequently on the Solarwinds NetFlow forum:

How can I use NetFlow to track the websites being accessed from my network?

The short answer that I usually give on the forum is this: you can't, because NetFlow v5 doesn't track HTTP headers. With this blog post, though, I'll go into the answer in more detail so that I can refer people to it in the future.

First, a quick review of what NetFlow is, and how it works:
  • When NetFlow is enabled on a router interface, the router begins to track information about the traffic that transits the interface. This information is stored in a data structure called the flow cache.
  • Periodically, the contents of the flow cache can be exported to a "collector", which is a process running on an external system that receives and stores flow data. This process is called "NetFlow Data Export", or NDE. Typically the collector is tied into an "analyzer", which massages the flow data into something useful for human network analysts.
    •  NDE is optional. One can gather useful information from NetFlow solely from the command-line without ever using an external collector.
  • Data that can be tracked by NetFlow depends on the version. The most commonly deployed version today is NetFlow version 5, which tracks the following key fields:
    • Source interface
    • Source and destination IP address
    • Layer 4 protocol (e.g., ICMP, TCP, UDP, OSPF, ESP)
    • Source and destination port number (if the layer 4 protocol is TCP or UDP)
    • Type of service value
  • These "key fields" are used to define a "flow"; that is, a unidirectional conversation between a pair of hosts. Because flows are unidirectional, an important feature in NetFlow analysis software is the ability to pair the two sides of a flow to give a complete picture of the conversation.
  • Other "non-key" fields are also tracked. In NetFlow version 5, the other fields are as follows. Note that not all collector software preserves all the fields.
    • TCP flags (used by the router to determine the beginning and end of a TCP flow)
    • Egress interface
    • Timestamps
    • Packet and byte count for the flow
    • BGP origin AS and peer AS
    • IP next-hop
    • Source and destination netmask
  • NetFlow v9, Cisco Flexible NetFlow, and IPFIX (the IETF flow protocol, which is very similar to NetFlow v9) allow user-defined fields that can track any part of the packet headers. IPFIX offers enough flexibility to track information about HTTP sessions, and many vendors are starting to implement this capability.
  • Many vendors have defined other flow protocols that offer more or fewer capabilities, but virtually all of them duplicate at least the functions of NetFlow v5.
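The key fields and the idea of pairing the two sides of a conversation, as described in the list above, can be sketched in a few lines of Python. This is only an illustration, not how any particular analyzer does it: the names FlowKey, conversation_id, and pair_flows are made up for this example, and the client address 192.0.2.10 is a documentation address chosen arbitrarily.

```python
from collections import namedtuple

# The NetFlow v5 key fields from the list above, as a hashable tuple.
# (FlowKey and its field names are hypothetical, chosen for this sketch.)
FlowKey = namedtuple(
    "FlowKey",
    "input_if src_ip dst_ip protocol src_port dst_port tos",
)

def conversation_id(key):
    """Return a direction-independent identifier for a flow.

    The two endpoints are sorted so that A->B and B->A give the same
    value. Input interface and ToS are left out because they can differ
    between the two directions.
    """
    a = (key.src_ip, key.src_port)
    b = (key.dst_ip, key.dst_port)
    return (key.protocol,) + tuple(sorted([a, b]))

def pair_flows(keys):
    """Group unidirectional flow keys by the conversation they belong to."""
    conversations = {}
    for key in keys:
        conversations.setdefault(conversation_id(key), []).append(key)
    return conversations

# One TCP conversation seen as two unidirectional flows.
request = FlowKey(1, "192.0.2.10", "91.189.89.88", 6, 51000, 80, 0)
reply = FlowKey(2, "91.189.89.88", "192.0.2.10", 6, 80, 51000, 0)
paired = pair_flows([request, reply])
```

Here paired ends up with a single entry holding both flows. Note that nothing in the key says which website was requested.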
For reference, here's a snapshot from a packet capture of a NetFlow v5 export packet (the destination public IP address has been disguised as an RFC 1918 address):

    pdu 1/30
        SrcAddr: 203.79.123.118 (203.79.123.118)
        DstAddr: 10.118.218.102 (10.118.218.102)
        NextHop: 0.0.0.0 (0.0.0.0)
        InputInt: 1
        OutputInt: 0
        Packets: 3
        Octets: 144
        [Duration: 1.388000000 seconds]
            StartTime: 3422510.740000000 seconds
            EndTime: 3422512.128000000 seconds
        SrcPort: 3546
        DstPort: 445 <-- probably a port scan for open Microsoft services
        padding
        TCP Flags: 0x02
        Protocol: 6  <-- this is the layer 4 protocol; i.e. TCP
        IP ToS: 0x00
        SrcAS: 4768 <-- this particular router is tracking BGP Origin-AS
        DstAS: 0
        SrcMask: 22 (prefix: 203.79.120.0/22)
        DstMask: 30 (prefix: 10.118.218.100/30)
        padding
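To make the fixed layout of that record concrete, here is a minimal Python sketch that packs and then unpacks one 48-byte NetFlow v5 flow record using the values from the capture above. The field names and the parse_v5_record helper are my own labels for this example, and the start and end times are written as milliseconds of router uptime, which is how v5 carries them.

```python
import socket
import struct

# NetFlow v5 flow record: 48 bytes, network byte order.
# "x" marks the padding bytes seen in the capture.
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHxBBBHHBBxx")

# Hypothetical field names, in wire order.
FIELDS = (
    "src_addr", "dst_addr", "next_hop", "input_if", "output_if",
    "packets", "octets", "first_ms", "last_ms", "src_port", "dst_port",
    "tcp_flags", "protocol", "tos", "src_as", "dst_as",
    "src_mask", "dst_mask",
)

def parse_v5_record(data):
    """Decode one v5 flow record into a dict of its fields."""
    values = list(V5_RECORD.unpack(data))
    for i in range(3):  # the first three fields are IPv4 addresses
        values[i] = socket.inet_ntoa(values[i])
    return dict(zip(FIELDS, values))

# Rebuild the record from the capture, then decode it again.
sample = V5_RECORD.pack(
    socket.inet_aton("203.79.123.118"),
    socket.inet_aton("10.118.218.102"),
    socket.inet_aton("0.0.0.0"),
    1, 0, 3, 144, 3422510740, 3422512128,
    3546, 445, 0x02, 6, 0x00, 4768, 0, 22, 30,
)
record = parse_v5_record(sample)
```

Every byte of the record is accounted for by the format string, and none of it is available for application-layer data such as a hostname or URL.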
Returning to our original question:

NetFlow v5 isn't a good web usage tracker because nowhere in the list of fields above do we see "HTTP header". The HTTP header is the part of the application layer payload that actually specifies the website and URL that's being requested. Here's a sample from another packet capture:

    GET / HTTP/1.1
    User-Agent: curl/7.21.6 (i686-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
    Host: www.ubuntu.com
    Accept: */*

This is the request sent by the HTTP client (in this case the "curl" command-line HTTP utility) when accessing http://www.ubuntu.com. The request line "GET / HTTP/1.1" asks for the root ("/") of the website named in the "Host:" header; i.e., www.ubuntu.com.
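For comparison, this is roughly the work a tool has to do to learn the website name: look inside the TCP payload and read the Host header. The sketch below is a simplified illustration in Python, not a full HTTP parser; extract_host is a hypothetical helper, and the User-Agent line is shortened from the capture above.

```python
def extract_host(request_bytes):
    """Return the Host header value from a raw HTTP/1.x request, or None."""
    # Headers end at the first blank line.
    head = request_bytes.split(b"\r\n\r\n", 1)[0]
    lines = head.decode("iso-8859-1").split("\r\n")
    for line in lines[1:]:  # lines[0] is the request line, e.g. "GET / HTTP/1.1"
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return value.strip()
    return None

request = (
    b"GET / HTTP/1.1\r\n"
    b"User-Agent: curl/7.21.6\r\n"
    b"Host: www.ubuntu.com\r\n"
    b"Accept: */*\r\n"
    b"\r\n"
)
host = extract_host(request)
```

A NetFlow v5 exporter never performs this step; it only looks at the IP and TCP/UDP headers.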

The IP address used in this request was 91.189.89.88. However, if we do a reverse lookup on this address, the record returned is different:

    $ dig -x 91.189.89.88 +short
    privet.canonical.com.

A little search-engine-fu shows that several other websites are hosted at the same IP address:

    kubuntu.org
    canonical.com

If we do the same trick with other websites (like unroutable.blogspot.com, hosted by Google), we can easily find cases in which there are dozens of websites hosted at the same IP address.

Because NetFlow doesn't extract the HTTP header from TCP flows, we have only the IP address to go on. As we've seen here, many different websites can be hosted at the same IP address; there's no way to tell just from NetFlow whether a user visited www.canonical.com or www.ubuntu.com. Furthermore, with the most popular sites hosted on content distribution caches or cloud service providers, the reverse DNS lookups for high-bandwidth port 80 flows frequently resolve to names in networks like Akamai, Limelight, Google, Amazon Web Services, Rackspace, etc., even if those content distribution networks have nothing to do with the content of the actual website that was visited.
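The ambiguity can be shown with a toy example. In the Python sketch below, the visits list is invented for illustration: the client addresses are documentation addresses, and the hostnames are the ones found above at 91.189.89.88. The two helper functions, netflow_view and proxy_view, are hypothetical names that mimic what a v5 collector and a web proxy would each be able to report.

```python
from collections import defaultdict

# (client IP, server IP, Host header) for three made-up page visits.
visits = [
    ("192.0.2.10", "91.189.89.88", "www.ubuntu.com"),
    ("192.0.2.10", "91.189.89.88", "canonical.com"),
    ("192.0.2.11", "91.189.89.88", "kubuntu.org"),
]

def netflow_view(visits):
    """Count visits by (client, server IP); the hostname is never seen."""
    seen = defaultdict(int)
    for client, server_ip, _host in visits:
        seen[(client, server_ip)] += 1
    return dict(seen)

def proxy_view(visits):
    """Count visits by (client, hostname), as a proxy log could."""
    seen = defaultdict(int)
    for client, _server_ip, host in visits:
        seen[(client, host)] += 1
    return dict(seen)

by_ip = netflow_view(visits)
by_host = proxy_view(visits)
```

The IP-only view collapses three distinct websites into a single destination, while the hostname view keeps them apart.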

The bottom line is this: if you want to track what websites are visited by users on a network, NetFlow v5 isn't the best tool, or even a good one. A web proxy (e.g., Squid) or a web content filter (e.g., Websense or Cisco WSA) is probably the best tool, since these track not only HTTP host headers but also (usually) the Active Directory username associated with the request.

Other tools that could do the job are security-related tools like httpry or Bro-IDS, both of which have features for HTTP request tracking. These tools are both available in the excellent Security Onion Linux distribution.

[Edited to add] The anonymous commenter below observes that nProbe exports HTTP header information via IPFIX, and notes that some vendors have firewalls that do so as well. nProbe is an excellent free tool that takes a raw packet stream and converts it to NetFlow or IPFIX export format.

Published: April 05 2012
