Developer | |
---|---|
First appeared | 2003 |
License | Apache License 2.0 |
Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003,[1] and the szl runtime was open-sourced in August 2010.[2] However, since the MapReduce table aggregators have not been released,[3] the open-sourced runtime is not useful off the shelf for large-scale data analysis of multiple log files. Sawzall has been replaced by Lingo (logs in Go) for most purposes within Google.[4]
Google's server logs are stored as large collections of records (Protocol Buffers) that are partitioned over many disks within GFS. To perform calculations involving the logs, engineers can write MapReduce programs in C++ or Java. MapReduce programs need to be compiled and may be more verbose than necessary, so writing a program to analyze the logs can be time-consuming. To make it easier to write quick scripts, Rob Pike et al. developed the Sawzall language. A Sawzall script runs within the Map phase of a MapReduce and "emits" values to tables. The Reduce phase, which the script writer does not have to be concerned about, then aggregates the tables from multiple runs into a single set of tables.
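The division of labour described above can be sketched in Python (not Sawzall). This is only a toy model under stated assumptions: `run_script`, `map_phase`, and `reduce_phase` are hypothetical names, records are plain strings rather than Protocol Buffers, and the only table kind modelled is a sum.

```python
from collections import defaultdict


def run_script(record, emit):
    """Hypothetical per-record script: it sees one record and can only emit."""
    emit("records", 1)
    emit("bytes", len(record))


def map_phase(shard):
    """Run the script once per record of one shard; return partial sum tables."""
    tables = defaultdict(int)

    def emit(table, value):
        tables[table] += value

    for record in shard:
        run_script(record, emit)
    return dict(tables)


def reduce_phase(partials):
    """Merge the per-shard partial tables into a single set of tables."""
    final = defaultdict(int)
    for partial in partials:
        for table, value in partial.items():
            final[table] += value
    return dict(final)


# Made-up input: two shards standing in for log files on different disks.
shards = [["GET /a", "GET /bb"], ["POST /c"]]
result = reduce_phase(map_phase(shard) for shard in shards)
print(result)  # {'records': 3, 'bytes': 20}
```

Because the script only ever sees one record and only ever emits, the shards can be processed in any order or in parallel without changing the merged result.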
Currently, only the language runtime (which runs a Sawzall script once over a single input) has been open-sourced; the supporting program built on MapReduce has not been released.[3]
Some interesting features include the following output table types:

- collection: saves every value emitted.
- sum: saves the sum of every emitted value.
- maximum(n): saves only the highest n values on a given weight.
- sample(n): gives a random sample of n values from all the emitted values.
- quantile(n): calculates a cumulative probability distribution of the given numbers.
- top(n): gives n values that are probably the most frequent of the emitted values.
- unique(n): estimates the number of unique values emitted.

Sawzall's design favors efficiency and engine simplicity over power.
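To make some of these table kinds concrete, here is a rough Python sketch of collection, sum, maximum(n), and sample(n). It is an illustration only, not the szl implementation: the class and method names are hypothetical, reservoir sampling is simply one well-known way to obtain a uniform sample and is assumed here, and the estimating tables (quantile, top, unique) are not modelled.

```python
import heapq
import random


class Collection:
    """Saves every value emitted."""
    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)


class Sum:
    """Saves the sum of every emitted value."""
    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value


class Maximum:
    """Saves only the n values with the highest weights."""
    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (weight, sequence number, value)
        self._seq = 0    # tie-breaker so values themselves are never compared

    def emit(self, value, weight):
        entry = (weight, self._seq, value)
        self._seq += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif weight > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def result(self):
        return [value for _, _, value in sorted(self._heap, reverse=True)]


class Sample:
    """Keeps a uniform random sample of n values (reservoir sampling, assumed)."""
    def __init__(self, n, seed=None):
        self.n = n
        self.seen = 0
        self.values = []
        self._rng = random.Random(seed)

    def emit(self, value):
        self.seen += 1
        if len(self.values) < self.n:
            self.values.append(value)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.n:
                self.values[j] = value


# Made-up data: keep the two URLs with the most hits.
top_urls = Maximum(2)
for url, hits in [("a", 5), ("b", 9), ("c", 1), ("d", 7)]:
    top_urls.emit(url, weight=hits)
print(top_urls.result())  # ['b', 'd']
```

Each class keeps only the state its table needs, which is the point of the design: a sum table holds one number no matter how many values are emitted, while a collection grows with its input.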
This complete Sawzall program will read the input and produce three results: the number of records, the sum of the values, and the sum of the squares of the values.
    count: table sum of int;
    total: table sum of float;
    sum_of_squares: table sum of float;
    x: float = input;
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;
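
For readers unfamiliar with Sawzall, the program above corresponds roughly to the following Python loop. This is an analogue under stated assumptions, not a translation produced by any tool: the function name `run` and the input values are made up, and the three sum tables are modelled as entries in a dictionary.

```python
def run(values):
    """Apply the per-record logic to each input and accumulate three sum tables."""
    tables = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}
    for raw in values:
        x = float(raw)                       # x: float = input;
        tables["count"] += 1                 # emit count <- 1;
        tables["total"] += x                 # emit total <- x;
        tables["sum_of_squares"] += x * x    # emit sum_of_squares <- x * x;
    return tables


print(run(["1", "2", "3"]))
# {'count': 3, 'total': 6.0, 'sum_of_squares': 14.0}
```

The explicit loop is the main difference: in Sawzall the program describes the work for a single record, and the surrounding system supplies the iteration over all records.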