2.1. lars - Introduction

A typical lars script opens a log source (usually a file) and uses the source and target wrappers provided by lars to convert the log entries into another format, optionally filtering or modifying the entries along the way. A trivial script to convert IIS W3C-style log entries into a CSV file is shown below:

import io
from lars import iis, csv

with io.open('webserver.log', 'r') as infile, io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            target.write(row)

Taking the script section by section:

  1. The first couple of lines import the modules we need: the standard Python io module for opening files, and the iis and csv modules from lars for converting the data.
  2. Using io.open we open the input file (mode 'r', for reading) and the output file (mode 'wb', which creates a new file and writes to it in binary mode).
  3. We wrap infile (the input file) with IISSource to parse the input file, and outfile (the output file) with CSVTarget to format the output file.
  4. Finally, we use a simple loop to iterate over the rows in the source file, and the write() method to write them to the target.

This is the basic structure of most lars scripts. Most extra lines for filtering and manipulating rows appear within the loop at the end of the file, although sometimes extra module configuration lines are required at the top.
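
To make the source/loop/target shape concrete without depending on lars, the following is a minimal sketch of the same pattern using only the standard library. The helper names read_w3c and convert are hypothetical, the csv module here is Python's own (not lars.csv), and every value stays a plain string, so this only approximates what IISSource and CSVTarget do:

```python
import csv  # the standard-library csv module, not lars.csv
import io

SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 6.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2002-05-24 20:18:01 172.224.24.114 GET /Default.htm 200
2002-05-24 20:18:02 172.224.24.114 GET /missing.htm 404
"""


def read_w3c(lines):
    """Yield one dict per log entry, keyed by the names in #Fields (hypothetical helper)."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith('#Fields:'):
            # The directive lists the column names, separated by spaces
            fields = line[len('#Fields:'):].split()
        elif line and not line.startswith('#'):
            # Data line: pair each whitespace-separated value with its column
            yield dict(zip(fields, line.split()))


def convert(infile, outfile):
    """Copy every entry from a W3C-style log to CSV and return the row count."""
    writer = csv.writer(outfile)
    count = 0
    for entry in read_w3c(infile):
        # Filtering or modification of entries would go here
        writer.writerow(entry.values())
        count += 1
    return count


output = io.StringIO()
rows_written = convert(io.StringIO(SAMPLE_LOG), output)
print(output.getvalue())
```

The loop body in convert is where filtering and row manipulation would be added, mirroring the loop at the end of a lars script.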

2.1.1. Filtering rows

The row object declared in the loop has attributes named after the columns of the source (with characters that cannot appear in Python identifiers replaced with underscores). To see the structure of a row you can simply print one and then terminate the loop:

import io
from lars import iis, csv

with io.open('webserver.log', 'r') as infile, io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            print(row)
            break

Given the following input file (long lines indented for readability):

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2002-05-24 20:18:01
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
    cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
    cs(Referrer)
2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm -
    200 7930 248 31
    Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server)
    http://64.224.24.114/

The script prints the following on the command line (again wrapped for readability):

Row(date=Date(2002, 5, 24), time=Time(20, 18, 1), 
    c_ip=IPv4Address(u'172.224.24.114'), cs_username=None,
    s_ip=IPv4Address(u'206.73.118.24'), s_port=80, cs_method=u'GET',
    cs_uri_stem=Url(scheme='', netloc='', path=u'/Default.htm', params='',
    query_str='', fragment=''), cs_uri_query=None, sc_status=200,
    sc_bytes=7930, cs_bytes=248, time_taken=31.0,
    cs_User_Agent=u'Mozilla/4.0 (compatible; MSIE 5.01; Windows 2000
    Server)', cs_Referrer=Url(scheme=u'http', netloc=u'64.224.24.114',
     path=u'/', params='', query_str='', fragment=''))

From this one can see that field names like c-ip have been converted into c_ip (- is an illegal character in Python identifiers). Furthermore it is apparent that instead of simple strings being extracted, the data has been converted into a variety of appropriate datatypes (Date for the date field, Url for the cs-uri-stem field, and so on). This significantly aids in filtering rows based upon sub-attributes of the extracted data.
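
The renaming can be approximated with a short regular expression. This is only a sketch that reproduces the names seen in the output above; the function name to_identifier is hypothetical and the exact rule lars applies may differ:

```python
import re


def to_identifier(field_name):
    """Approximate the column renaming seen above (hypothetical helper)."""
    # Replace each run of characters that cannot appear in a Python
    # identifier with a single underscore, then drop any underscores
    # left at either end (e.g. from the closing parenthesis).
    return re.sub(r'[^0-9A-Za-z_]+', '_', field_name).strip('_')


for name in ('c-ip', 'time-taken', 'cs(User-Agent)'):
    print(name, '->', to_identifier(name))
```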

For example, to filter on the year of the date:

if row.date.year == 2002:
    target.write(row)

Alternatively, you could filter on whether the client IP belongs to a particular network (this requires importing the datatypes module from lars):

if row.c_ip in datatypes.network('172.0.0.0/8'):
    target.write(row)
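
To see what such a containment test means, here is a sketch using the standard library's ipaddress module rather than the lars datatypes; the two addresses are taken from the sample log entry above:

```python
import ipaddress

# 172.0.0.0/8 covers every address whose first octet is 172
network = ipaddress.ip_network('172.0.0.0/8')

client = ipaddress.ip_address('172.224.24.114')  # c-ip from the sample entry
server = ipaddress.ip_address('206.73.118.24')   # s-ip from the sample entry

print(client in network)  # True
print(server in network)  # False
```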

Or use Python’s string methods to filter on any string:

if row.cs_User_Agent.startswith('Mozilla/'):
    target.write(row)

Or any combination of the above:

if row.date.year == 2002 and 'MSIE' in row.cs_User_Agent:
    target.write(row)
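
When conditions accumulate, it can be tidier to move them into a predicate function. The sketch below uses a namedtuple called FakeRow as a hypothetical stand-in for a lars row so that it runs on its own. It also assumes that a missing user agent would be None (as cs_username is in the output above), in which case a bare 'MSIE' in row.cs_User_Agent test would raise a TypeError:

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for a lars row, carrying only the columns we test
FakeRow = namedtuple('FakeRow', ['date', 'cs_User_Agent'])


def wanted(row):
    """Return True for rows from 2002 whose user agent mentions MSIE."""
    # Treat a missing user agent (assumed to be None) as an empty string
    agent = row.cs_User_Agent or ''
    return row.date.year == 2002 and 'MSIE' in agent


rows = [
    FakeRow(datetime.date(2002, 5, 24), 'Mozilla/4.0 (compatible; MSIE 5.01)'),
    FakeRow(datetime.date(2003, 1, 1), 'Mozilla/4.0 (compatible; MSIE 5.01)'),
    FakeRow(datetime.date(2002, 5, 24), None),
]
kept = [row for row in rows if wanted(row)]
print(len(kept))
```

Inside a lars loop the equivalent would be an if wanted(row): line placed before target.write(row).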

2.1.2. Manipulating row content

If you wish to modify the output structure, the simplest method is to declare the row structure you want at the top of the file (using the datatypes.row() function) and then construct rows with the new structure in the loop (by calling the class that function returns):

import io
from lars import datatypes, iis, csv

NewRow = datatypes.row('date', 'time', 'client', 'url')

with io.open('webserver.log', 'r') as infile, io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            new_row = NewRow(row.date, row.time, row.c_ip, row.cs_uri_stem)
            target.write(new_row)

There is no need to convert column data back to strings for output; all datatypes produced by lars source adapters have built-in string conversions which all target adapters know to use.
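
The same idea can be illustrated with standard-library stand-ins: a namedtuple in place of the class returned by datatypes.row(), and datetime and ipaddress objects in place of the lars datatypes. This is a sketch of the principle (typed values are converted to text only at output time), not of how the lars target adapters are implemented:

```python
import csv  # the standard-library csv module, not lars.csv
import datetime
import io
import ipaddress
from collections import namedtuple

# Stand-in for datatypes.row('date', 'time', 'client', 'url')
NewRow = namedtuple('NewRow', ['date', 'time', 'client', 'url'])

new_row = NewRow(
    datetime.date(2002, 5, 24),
    datetime.time(20, 18, 1),
    ipaddress.ip_address('172.224.24.114'),
    '/Default.htm',
)

buffer = io.StringIO()
# csv.writer calls str() on each non-string value as it writes the row
csv.writer(buffer).writerow(new_row)
line = buffer.getvalue().strip()
print(line)
```

Until the row is written, new_row.date.year and similar attributes remain available for filtering.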
