While tools like Baobab help analyze disk usage by file size, analyzing disk usage by file type is not as simple. Here’s how to do it using the GNU tools ‘du’ and ‘gawk’ together with OpenOffice.
Create File List
This command generates a list of files, sizes, and types for the current directory and all subdirectories:
du -a --time | gawk 'BEGIN {FS="\t"} {where = match($0, /\.([^\.\/]+)$/)
if (where != 0)
print $0 "\t" tolower(substr($0, where+1))
else
print $0
}' > filelist.csv
Alternatively, create a gawk script (e.g. disk.awk):
BEGIN {FS="\t"} {where = match($0, /\.([^\.\/]+)$/)
if (where != 0)
print $0 "\t" tolower(substr($0, where+1))
else
print $0
}
Then execute it as:
du -a --time | gawk -f disk.awk > filelist.csv
The resulting file is a tab-separated list (size in KB, date, path, extension) that can be imported into OpenOffice. The command pipes the output of the disk usage utility (du) into an awk (gawk) script, which reads the file extension and appends it as an additional lowercased column for the file type, so that grouping by type is case-insensitive. The regular expression reads:
Search each line for a period, followed by one or more characters that are neither periods nor forward slashes, followed by the end of the line.
Since file names can contain more than one period, the [^\.\/]+ ensures that only the last period is considered the start of the file’s extension; excluding the forward slash keeps a period in a directory name from being mistaken for one. Awk then strips the period from the extension, converts it to lower case, and prints it in a new tab-delimited column. The script isn’t perfect and will no doubt generate some file-type anomalies.
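To see what the regular expression does without scanning a whole disk, here is a minimal sketch that applies the same match to a few made-up paths. The function name file_ext, the sample paths, and the (none) placeholder are assumptions for illustration and are not part of the original script; plain awk is used on the assumption that it behaves like gawk for match, substr, and tolower.

```shell
# Hypothetical helper: print the lowercased extension of each path
# passed as an argument, or "(none)" when the pattern does not match.
file_ext() {
    printf '%s\n' "$@" | awk '{
        # Position of the last period that is followed only by
        # characters other than "." and "/" up to the end of the line.
        where = match($0, /\.[^.\/]+$/)
        if (where != 0)
            print tolower(substr($0, where + 1))   # skip the period itself
        else
            print "(none)"
    }'
}

file_ext ./notes/report.final.PDF ./src.d/Makefile ./archive.tar.gz ./.bashrc
# pdf
# (none)
# gz
# bashrc
```

The last path shows one of the anomalies mentioned above: a hidden file with no extension is reported as having the type bashrc.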
Manipulate the Data
Import the data into OpenOffice. The output file has a CSV (comma-separated values) extension so that it shows up in the OpenOffice file list, even though the content is TSV (tab-separated values). Create headers for each column (Size, Date, Path, Type). The Path column can be deleted to speed things up if you don’t want to filter data by filename. Create a pivot table to display the number of files and the total size for each file type. Go to Data -> Data Pilot -> Create -> Current Selection. Then:
- drag ‘Type’ into the Column Fields and ‘Size’ into the Data Fields
- drag a second ‘Type’ into the Data Fields, then click Options and change the Function to Count
- expand the More button
- change the Results To to -new sheet-
- tick Ignore Empty Rows (important so that directory names are ignored)
- click OK
The resulting pivot table summarizes each file type (based on extension) by size and count. This data can be filtered or graphed depending on the analysis.
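As a quick cross-check of the pivot table, the same per-type count and total size can be computed on the command line. This is a rough sketch rather than part of the original workflow: the function name summarize_types is hypothetical, and it assumes the file list has no header row and that the type is the fourth tab-separated column.

```shell
# Hypothetical helper: summarize a tab-separated file list
# (size, date, path, type) into "type<TAB>count<TAB>total size" rows,
# sorted with the largest total first.
summarize_types() {
    awk -F '\t' 'NF >= 4 {          # skip rows that have no type column
        count[$4]++
        total[$4] += $1
    }
    END {
        for (t in count)
            print t "\t" count[t] "\t" total[t]
    }' "$1" | sort -t "$(printf '\t')" -k3,3nr
}

# Usage: summarize_types filelist.csv
```

Rows without a fourth column (directories and files with no extension) are skipped, which mirrors the Ignore Empty Rows option in the pivot table.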
Graph the Data
A pie chart similar to Baobab’s can be produced from the same OpenOffice file this way:
- Insert -> Chart. Select a type of chart (3D Pie)
- Data range – highlight the pivot table
- Select “Data series in rows”
- Select “First row as label”
- Data range name – set to the row containing the names of the file extensions / types
- Y-Values – set to the row containing the sum size
Once the chart is drawn, resize it so both the chart and the legend are visible. Hovering over any pie slice will show a tooltip that indicates the total size and file type.

Gah! I cannot get it to run.
I’ve tried it in a file with a #!/bin/sh and I’ve tried pasting it straight into bash.
Are you using any particular shell? Is that a single backtick before gawk?
Thanks anyway.
You’re right – the command as shown had a missing single quote and a forward slash in the regular expression was not escaped. The command has been updated and tested. It won’t work if you cut-n-paste, but entering each line as shown will produce the desired result. I recommend using the alternative method: download the disk.awk script and run it that way.