I have just uploaded version 0.4.0 of TxtSushi with the following changes:
- General code improvements: fixed any -Wall warnings, improved help messages and SQL parser error messages
- External sort: added an -external-sort option to tssql which tells tssql to sort on disk for JOIN and ORDER BY. This was tested on a 10G CSV file and it worked but it took 10 hours on my macbook! I think that this can be improved.
- Nested select: TxtSushi now allows nested select statements
- Classic joins: added support for older style join syntax. These joins are still unoptimized though so their time complexity is row_count^2 rather than the row_count(log row_count) complexity when using the INNER JOIN syntax
- Transpose utilities: added transposecsv and transposetab utilities which will print a transposed version of the given table
I think TxtSushi fills a unique niche in the *NIX toolchain for tabular data. On one side you have python, perl, sed, awk which are all great for general purpose text processing but it takes a lot of hand rolled code to do table filtering/joining/transformation and it gets even worse if you want to be able to deal correctly with quoted fields. On the other side you have real databases which allow you to do all of these things but which tend to be heavyweight solutions that require data import/export and don't want anything to do with your other *NIX commands. TxtSushi is different because it takes your text tables (or event STDIN) as they are and doesn't want to be anything more than an SQL link in your *NIX tool chain.
The two long term goals that I have for TxtSushi are to improve it to the point where it is generally available on *NIX distro package managers and that it is "owned" and developed by a group instead of just me. These two goals really go together since I may not be able to do all of the work required to mature TxtSushi by myself. I thought about keeping these goals to myself until I could get to the point where the core architecture are solidified and the remaining work would amount to developing and integrating modules into the existing architecture, but I think things are just moving too slowly with me hacking alone. So, if you are interested in hacking TxtSushi with me let me know!
Off the top of my head here are some relatively self-contained contributions that can be made:
- Extensible SQL Functions: This one should be fun. The way I implemented SQL functions isn't very pretty. The function parsers are in SQLParser.hs toward the bottom and the SQL Function execution code is in a big function guard toward the bottom of SQLExecution. I think we should have an SQLFunction type which would bundle up both the parsing and execution for a single SQL function/operator. The real payoff is that we could then have an XMonad-like configuration file that allows users to define their own SQL functions (Eg: if a user wants to define a STD_DEV aggregate function, now they can as long as they know haskell).
- Improve table parsing: the table parsing code was the first thing I wrote and it shows. It should be replaced by something more like what you find in Real World Haskell: Using Parsec which is much more concise and tolerant of any end-of-line encoding
- External sort: The external sort algorithm works but could probably use a more expert Haskell hacker's eye to become efficient (there were a couple of reasons I couldn't just use the external-sort cabal library as-is).
- Efficient classic joins: The classic style WHERE joins should be just as efficient as the INNER JOIN joins. If you're interested in this I have an idea for what I think may be the easiest way to do this. See issue 7
- Support for SQL Case Statement: See issue 8
- Implement window functions like RANK() and ROW_NUMBER(): See issue 11
- Test cases? Bug reports? Some other idea you have? ...
If you're curious, the simplest way to browse the source is here http://patch-tag.com/r/keithshep/txt-sushi.
6 comments: