2. Introduction
A typical lars script opens a log source (usually a file) and uses the source and target wrappers provided by lars to convert the log entries into another format, potentially filtering or modifying the entries along the way. A trivial script to convert IIS W3C-style log entries into a CSV file is shown below:
import io
from lars import iis, csv

with io.open('webserver.log', 'r') as infile, \
        io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            target.write(row)
Going through this section by section we can see the following:
- The first couple of lines import the modules that we'll need: the standard Python io module for opening files, and the iis and csv modules from lars for converting the data.
- Using io.open() we open the input file (with mode 'r' for reading) and the output file (with mode 'wb' for creating a new file and writing to it in binary mode).
- We wrap infile (the input file) with IISSource to parse the input file, and outfile (the output file) with CSVTarget to format the output file.
- Finally, we use a simple loop to iterate over the rows in the source file, and the write() method to write them to the target.
This is the basic structure of most lars scripts. Most extra lines for filtering and manipulating rows appear within the loop at the end of the file, although sometimes extra module configuration lines are required at the top.
2.1. Filtering rows
The row object declared in the loop has attributes named after the columns of the source (with characters that cannot appear in Python identifiers replaced with underscores). To see the structure of a row you can simply print one and then terminate the loop:
import io
from lars import iis, csv

with io.open('webserver.log', 'r') as infile, \
        io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            print(row)
            break
Given the following input file (long lines indented for readability):
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2002-05-24 20:18:01
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
    cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
    cs(Referrer)
2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm -
    200 7930 248 31
    Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server)
    http://64.224.24.114/
The script produces the following output on the command line:
Row(date=Date(2002, 5, 24), time=Time(20, 18, 1),
    c_ip=IPv4Address(u'172.224.24.114'), cs_username=None,
    s_ip=IPv4Address(u'206.73.118.24'), s_port=80, cs_method=u'GET',
    cs_uri_stem=Url(scheme='', netloc='', path=u'/Default.htm', params='',
    query_str='', fragment=''), cs_uri_query=None, sc_status=200,
    sc_bytes=7930, cs_bytes=248, time_taken=31.0,
    cs_User_Agent=u'Mozilla/4.0 (compatible; MSIE 5.01; Windows 2000
    Server)', cs_Referrer=Url(scheme=u'http', netloc=u'64.224.24.114',
    path=u'/', params='', query_str='', fragment=''))
From this one can see that field names like c-ip have been converted into c_ip (- is an illegal character in Python identifiers). Furthermore, it is apparent that instead of simple strings being extracted, the data has been converted into a variety of appropriate datatypes (Date for the date field, Url for the cs-uri-stem field, and so on). This significantly aids in filtering rows based upon sub-attributes of the extracted data.
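The renaming rule can be illustrated with a short standard-library sketch. This is not how lars itself is implemented; the helper name to_identifier and the regular expression are assumptions, chosen only because they reproduce the attribute names shown in the output above.

```python
import re


def to_identifier(field_name):
    """Hypothetical helper approximating the renaming seen in the output."""
    # Replace every character that cannot appear in an identifier with "_",
    # then drop trailing underscores (e.g. from the ")" in "cs(User-Agent)").
    return re.sub(r'[^A-Za-z0-9_]', '_', field_name).rstrip('_')


fields = ['date', 'c-ip', 'cs-uri-stem', 'cs(User-Agent)', 'cs(Referrer)']
names = [to_identifier(f) for f in fields]
print(names)
```

When in doubt about the attribute name for a particular column, printing a row as shown earlier remains the reliable check.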
For example, to filter on the year of the date:
if row.date.year == 2002:
    target.write(row)
Alternatively, you could filter on whether or not the client IP belongs to a particular network (this requires importing the datatypes module from lars):
if row.c_ip in datatypes.network('172.0.0.0/8'):
    target.write(row)
Or use Python’s string methods to filter on any string:
if row.cs_User_Agent.startswith('Mozilla/'):
    target.write(row)
Or any combination of the above:
if row.date.year == 2002 and 'MSIE' in row.cs_User_Agent:
    target.write(row)
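These filters can be tried out without a log file by using standard-library stand-ins. The sketch below is an approximation and does not use lars: FakeRow is a hypothetical namedtuple, datetime.date stands in for the lars Date type, and the ipaddress module stands in for the lars IP address and network types. It also assumes that a missing user agent would appear as None (as cs_username does in the output above), so the string test is guarded.

```python
import datetime
import ipaddress
from collections import namedtuple

# Hypothetical stand-in for a lars row; only the fields used below.
FakeRow = namedtuple('FakeRow', ['date', 'c_ip', 'cs_User_Agent'])


def keep(row):
    """Return True for rows from 2002, in 172.0.0.0/8, with an MSIE agent."""
    agent = row.cs_User_Agent or ''  # treat a missing agent as empty text
    return (
        row.date.year == 2002
        and row.c_ip in ipaddress.ip_network('172.0.0.0/8')
        and 'MSIE' in agent
    )


rows = [
    FakeRow(datetime.date(2002, 5, 24),
            ipaddress.ip_address('172.224.24.114'),
            'Mozilla/4.0 (compatible; MSIE 5.01; Windows 2000 Server)'),
    FakeRow(datetime.date(2002, 5, 24),
            ipaddress.ip_address('10.0.0.1'),
            None),
]
kept = [row for row in rows if keep(row)]
print(len(kept))
```

Moving the condition into a function such as keep() keeps the main loop short once several tests are combined.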
2.2. Manipulating row content
If you wish to modify the output structure, the simplest method is to declare the row structure you want at the top of the file (using the row() function) and then construct rows with the new structure in the loop (using the result of the function):
import io
from lars import datatypes, iis, csv

NewRow = datatypes.row('date', 'time', 'client', 'url')

with io.open('webserver.log', 'r') as infile, \
        io.open('output.csv', 'wb') as outfile:
    with iis.IISSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            new_row = NewRow(row.date, row.time, row.c_ip, row.cs_uri_stem)
            target.write(new_row)
There is no need to convert column data back to strings for output; all datatypes produced by lars source adapters have built-in string conversions which all target adapters know to use.
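The same idea can be sketched with the standard library alone. This is an analogy and not lars code: it assumes that the class returned by row() behaves much like a collections.namedtuple, and it uses Python's own csv module (not the lars csv module) together with an in-memory buffer in place of CSVTarget and an output file.

```python
import csv
import datetime
import io
import ipaddress
from collections import namedtuple

# Stand-in for datatypes.row(); assumed here to behave like a namedtuple.
NewRow = namedtuple('NewRow', ['date', 'time', 'client', 'url'])

new_row = NewRow(
    datetime.date(2002, 5, 24),
    datetime.time(20, 18, 1),
    ipaddress.ip_address('172.224.24.114'),
    '/Default.htm',
)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
# csv.writer converts non-string values with str(), so no manual
# conversion is needed before writing.
writer.writerow(new_row)
output = buffer.getvalue()
print(output, end='')
```

Each value is written using its own string form, which mirrors the behaviour described above for lars target adapters.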