I have just uploaded version 0.4.0 of TxtSushi with the following changes:
- General code improvements: fixed any -Wall warnings, improved help messages and SQL parser error messages
- External sort: added an -external-sort option to tssql which tells tssql to sort on disk for JOIN and ORDER BY. This was tested on a 10G CSV file and it worked but it took 10 hours on my macbook! I think that this can be improved.
- Nested select: TxtSushi now allows nested select statements
- Classic joins: added support for older style join syntax. These joins are still unoptimized though so their time complexity is row_count^2 rather than the row_count(log row_count) complexity when using the INNER JOIN syntax
- Transpose utilities: added transposecsv and transposetab utilities which will print a transposed version of the given table
I think TxtSushi fills a unique niche in the *NIX toolchain for tabular data. On one side you have python, perl, sed, awk which are all great for general purpose text processing but it takes a lot of hand rolled code to do table filtering/joining/transformation and it gets even worse if you want to be able to deal correctly with quoted fields. On the other side you have real databases which allow you to do all of these things but which tend to be heavyweight solutions that require data import/export and don't want anything to do with your other *NIX commands. TxtSushi is different because it takes your text tables (or event STDIN) as they are and doesn't want to be anything more than an SQL link in your *NIX tool chain.
The two long term goals that I have for TxtSushi are to improve it to the point where it is generally available on *NIX distro package managers and that it is "owned" and developed by a group instead of just me. These two goals really go together since I may not be able to do all of the work required to mature TxtSushi by myself. I thought about keeping these goals to myself until I could get to the point where the core architecture are solidified and the remaining work would amount to developing and integrating modules into the existing architecture, but I think things are just moving too slowly with me hacking alone. So, if you are interested in hacking TxtSushi with me let me know!
Off the top of my head here are some relatively self-contained contributions that can be made:
- Extensible SQL Functions: This one should be fun. The way I implemented SQL functions isn't very pretty. The function parsers are in SQLParser.hs toward the bottom and the SQL Function execution code is in a big function guard toward the bottom of SQLExecution. I think we should have an SQLFunction type which would bundle up both the parsing and execution for a single SQL function/operator. The real payoff is that we could then have an XMonad-like configuration file that allows users to define their own SQL functions (Eg: if a user wants to define a STD_DEV aggregate function, now they can as long as they know haskell).
- Improve table parsing: the table parsing code was the first thing I wrote and it shows. It should be replaced by something more like what you find in Real World Haskell: Using Parsec which is much more concise and tolerant of any end-of-line encoding
- External sort: The external sort algorithm works but could probably use a more expert Haskell hacker's eye to become efficient (there were a couple of reasons I couldn't just use the external-sort cabal library as-is).
- Efficient classic joins: The classic style WHERE joins should be just as efficient as the INNER JOIN joins. If you're interested in this I have an idea for what I think may be the easiest way to do this. See issue 7
- Support for SQL Case Statement: See issue 8
- Implement window functions like RANK() and ROW_NUMBER(): See issue 11
- Test cases? Bug reports? Some other idea you have? ...
If you're curious, the simplest way to browse the source is here http://patch-tag.com/r/keithshep/txt-sushi.
Would an in-memory SQLite DB fulfil your SQL (and efficiency) needs? http://hackage.haskell.org/package/sqlite
ReplyDeleteI've been using sqlite3 on the command line on a binary SQLite DB over my usual Unix plumbing on plain text DBs lately, and my god it's such a joy to work with. (Insomuch as one can be joyful when dealing with SQL.)
As for the question of efficiency: frankly, when one is dealing with a sufficiently large dataset where efficiency becomes an issue, using a tabular textual format for intermediate data isn't perhaps the best of ideas.
I do support your notion of extensible SQL functions though.
A pragmatic implementation of the relational calculus, anyone? (Rather than the bastardisation that is SQL.)
Where by ‘the relational calculus’, I—of course—am in fact referring to the relational algebra.
ReplyDeleteHi LiYang
ReplyDelete> Would an in-memory SQLite DB fulfil your SQL (and efficiency) needs? http://hackage.haskell.org/package/sqlite
I don't think so. Mostly because my goal is to be able to work with any valid rectangular CSV input stream with no other restrictions on the input. I have to work with pretty wacky (and huge) CSV files on a regular basis.
Here's an example that I think would be very painful in a real DB: http://patch-tag.com/r/keithshep/txt-sushi/snapshot/current/content/pretty/examples/query-imputed-SNPs.bash
> I've been using sqlite3 on the command line on a binary SQLite DB over my usual Unix plumbing on plain text DBs lately, and my god it's such a joy to work with. (Insomuch as one can be joyful when dealing with SQL.)
SQLite sounds cool! I use hsqldb for my java stuff and it sounds pretty similar. Real DBs are a great way to go when you're in control of the data format and you're keeping the data around for a while, but I love being able to use command-line directly on flat files in many cases.
> As for the question of efficiency: frankly, when one is dealing with a sufficiently large dataset where efficiency becomes an issue, using a tabular textual format for intermediate data isn't perhaps the best of ideas.
Except that the time isn't dominated by the CSV parsing step (I'm very willing to accept the parsing performance hit)... it's the sorting step which is using a binary on disk representation for the sort which is killing me. If the implementation of external-sort was good enough it could be as efficient as any other DB I think. I hope no one is holding their breath for that though :-)
Mathematics of Linear Algebra
ReplyDeleteKeith, TxtSushi has been a big help to me in a recent project! Keep up the good work. If I knew haskel I would help. Don't you want to rewrite it in your favorite language, Lisp?
ReplyDeleteAwesome :-) Thanks for the encouragement G. Lisp... isn't that the language they use to torture poor AI students at UMaine who don't know what they're getting themselves into ;-)
ReplyDelete