Archive for the ‘tech’ Category

HBase vs Cassandra: why we moved (via Bits and Bytes | Dominic Williams)

Passing along an interesting post from Bits and Bytes; Dominic’s take is (in part) that the two take different approaches to Big Data: Cassandra is more amenable to online, interactive data operations, while HBase (and the Hadoop stack it runs on) is geared more towards data warehousing, offline index building and analytics.

My team is currently working on a brand new product – the forthcoming MMO. This has given us the luxury of building against a NoSQL database, which means we can put the horrors of MySQL sharding and expensive scalability behind us. Recently a few people have been asking why we seem to have changed our preference from HBase to Cassandra. I can confirm the change is true and that we have in fact almost completed porting our c …

via Bits and Bytes | Dominic Williams


What Would the Holy Grail of ORM Look Like?

Recent experiences and articles I’ve read have got me thinking about ORM again, and trying to conceive what the perfect one would look like (when it’s not custom matched to a specific set of patterns that I control).

The Microsoft Data Access Block was one of the first frameworks I used to make boilerplate data operations easier. Incidentally, it also led me down the evil path of exposing data access methods as static methods. I evaluated both the Entity Framework and LINQ to SQL for a large green-field project and neither was up to snuff at the time. I’ve recently migrated to Java development on Linux and gotten my fingers into Hibernate, enough to conclude that I hate it with a vengeance. Come to think of it, I’ve never seen an ORM framework that I thought fully did the job, so I ended up going with the roll-your-own approach on that last project. That’s fine, maybe even superior, when you fully control the access/retrieval/update/delete patterns. But what criteria would make a new tool stand out for general adoption?

First, the short shopping list: stored procedure support is a biggie for me, as are batch saves, client-side filtering/sorting, awareness of new/clean/dirty/deleted objects (optimize wire traffic by not sending clean objects, and let me process multiple insert/update/delete operations in the same batch), intelligent awareness and automatic management of datetime properties like created/modified, and the ability to do soft deletes (set a ‘deleted’ or persistence status property, and omit those from standard fetches).

A few additional things I look for, some a little unorthodox:

Put nullability checks in get methods that return nullable collection types. I’m sick of seeing null reference exceptions when people try to render a child list that’s not populated — they’re ugly and they disrupt debugging sessions.

Let me generate extended enums (Java) or enum-type classes (.NET, bad idea to inherit from enums there) from stored data (e.g. in tables somehow flagged as being an application enum). Look at the classic Java “Planets” enum example for a use-case. This helps keep typo-prone string-based lookups out of the codebase.

Don’t push me into the entity:table paradigm. Maybe some entities are more easily used by exposing a few foreign properties on them (like names/labels that correspond to foreign keys). That facilitates much terser code and reduced IO. It’s not that hard to handle this, either; make those properties read-only and omit them from saves. Voila!

Give me smart “GetBy” parameter inference. Good candidates are primary keys, foreign keys, indexes/unique keys (including compound ones), and primary keys of other entities that have a foreign key to this one. Bonus points for letting me browse the ancestor hierarchy and create GetBy methods for, e.g., grandchildren by grandparent, without having to fetch the intermediate (parent) first if I’m not going to show it. Similarly, give me delete-by-id and delete-by-instance methods.

Add “stale instance” checks to prevent overwriting more recent changes by others. (Huge bonus points if you can actually fetch the newer remote changes and merge them with the local ones when no conflicts exist.)

Provide an easily swapped-out data provider interface – don’t tie me to any specific backing store. This is a tall order, since it requires multi-way type mapping, plus decoupling and isolation of all provider-specific options and settings, and a backing-store-agnostic controller layer on top of the data layer. Controllers deal with business intentions, but often must translate those into provider-specific language. This means controllers must pluggably or dynamically support data providers, without built-in knowledge of all of the types or options they use (probably via the Mediator or Adapter patterns).

Do not introduce any dependencies into POCOs/POJOs – for example, Hibernate forces its own annotations/attributes into the persistable classes, which makes them unusable in, e.g., GWT client code. Now I need to duplicate entity code in DTOs, and to create converter classes, for no other reason than to have a dependency-free clone of my entities. It’s wasteful, it promotes code bloat, and it introduces opportunity for error.

Similarly, facilitate serialization-contract injection – I’m sick of being unable to use the same entity for e.g. XML, binary, JSON and protobuf just because I need to serialize it in different ways (e.g. deep vs shallow, or using/skipping setters that contain logic). Why do my serialization preferences need to be written in stone into my entities? (Nobody does this well yet, IMO, and it’s not easy either.)

Those last two are biggies: putting control statements into annotations/attributes is an egregious violation of separation of concerns (SoC). Serialization, data access and RPC frameworks all want you to embed their control flags into your entity layer. Enough already! My entity layer is just that… a collection of dumb objects. Give me an imperative way to tell your framework what to do with my objects, or go home.

All code generation should be done at design time (as opposed to during build or at runtime) – for Pete’s sake, stop slowing down my builds and adding more JIT operations to my running app. (Do I need to mention that dynamically generated SQL is evil? And have you seen what ugly dynamic SQL Hibernate spits out?) Also, give me code where I can see the fetch/save/ID-generation/default-value-on-instantiation semantics without looking through 8 different files to trace it. The longer I code and the bigger the projects & teams I work on, the more I favor imperative approaches over declarative or aspect-based ones; whether I want the 3rd generation descendants to be fetched — whether lazily or eagerly — is a function of where I am in the app and what I’m doing, not of the entities themselves.

Don’t force a verbose new configuration syntax on me; use enumerations and flags that are in visible, static code, and write them with inline documentation so that explanations are visible in javadoc popups and Visual Studio mouseover tips. Pass those enum/flag values to DAO constructors and methods to control, for example, whether to re-fetch after save, what descendants to save or fetch along with the parent, etc.

Am I being too demanding? Am I missing some biggies? Programmers, let me know your thoughts!

Discarding or Rolling Back Changes in Git

When moving to Git for version control, I was amazed at how much trouble people have trying to revert a file or project to a previous state, and even more so at the variety of solutions I saw. People try (and recommend) everything from surgical to nuclear approaches: git checkout …, git rebase …, git revert …, git stash or branch and then discard, or even deleting your entire working directory and re-cloning the repository! Yet with many of these, people would still end up with unwanted changes left in their working copy! One problem is that certain commands are only appropriate for changes that have already been committed, while others are for changes that are merely staged in the index or still sitting uncommitted in the working copy.
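To make that distinction concrete, here is a minimal sketch that creates a throwaway repository and discards one unwanted edit in each of the three states. The file name, commit messages and demo identity are made up for illustration, and the commands shown are just one common choice per state, not the only one.

```shell
#!/bin/sh
# Sketch: one throwaway repo, three kinds of unwanted change.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "first version"

# 1. Unstaged edit: restore the working copy from the index.
echo "oops" >> notes.txt
git checkout -- notes.txt
unstaged_result=$(cat notes.txt)

# 2. Staged edit: unstage it first, then restore the working copy.
echo "oops" >> notes.txt
git add notes.txt
git reset -q HEAD -- notes.txt
git checkout -- notes.txt
staged_result=$(cat notes.txt)

# 3. Committed edit: add a new commit that undoes the bad one.
echo "oops" >> notes.txt
git commit -q -am "bad change"
git revert --no-edit HEAD >/dev/null
committed_result=$(cat notes.txt)

echo "$unstaged_result / $staged_result / $committed_result"
```

Each state needed a different command, which is exactly why advice written for one state leaves junk behind when applied to another.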

When I have a version I want to roll back to, I don’t like having to sort through what’s committed and what’s uncommitted; I just want to get back to that version. I’m all about finding something that works reliably and repeatedly in a way that I understand. git checkout <start_point> <path> is the “something” that seems easiest to me for reverting specific files back to specific previous states, and so far this approach has never left me with undesired changes remaining in my working copy.

Here’s the skinny…

First, get a simple list of the last few commits (7 in this example) to the file in question:

~/projects/myproj$ git log -7 --oneline src/main/java/settings/datasources.xml

Output (newest to oldest):

74106b9 Renamed PROD database
db05364 Changed root password
0d56c8b Renamed QA database
efc7eb0 Changed some hibernate mappings
97e68fe Added comments
a2c492f Fixed xml indentation
c1b0310 Wrecked xml indentation

Let’s say the two most recent commits (74106b9 and db05364) were erroneous. Then using the syntax “git checkout <start_point> <path>” you would just check out the file as it was at the last good commit, 0d56c8b:

~/projects/myproj$ git checkout 0d56c8b src/main/java/settings/datasources.xml

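All done! The file in your working copy now matches commit 0d56c8b, and the restored version is already staged, so commit it to record the rollback.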
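If you want to try this without risking a real project, the following sketch replays the same idea in a temporary repository. Everything in it (the settings.xml file, the three commits, the demo identity) is invented for the test; the only command taken from the post is git checkout <start_point> <path>, written here with the optional “--” separator so Git cannot mistake the path for a branch name.

```shell
#!/bin/sh
# Sketch: roll a single file back to an earlier commit in a scratch repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo"

# Three commits to one file; pretend the last two were mistakes.
for text in good bad1 bad2; do
    echo "$text" > settings.xml
    git add settings.xml
    git commit -q -m "set $text"
done

# Hash of the last good commit (two commits before HEAD).
good_commit=$(git rev-parse --short HEAD~2)

# Add an uncommitted edit as well, to show that it gets overwritten too.
echo "uncommitted junk" > settings.xml

# The rollback itself: <start_point> is the commit, <path> is the file.
git checkout "$good_commit" -- settings.xml

restored=$(cat settings.xml)
status=$(git status --porcelain)
echo "file now contains: $restored"
echo "status: $status"

# The restored content is staged; committing records the rollback in history.
git commit -q -m "Roll settings.xml back to $good_commit"
```

Note that the checkout silently overwrote the uncommitted edit. That is what makes the approach reliable, but it also means there is no undo for work you had not committed.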

Have other tricks for making “rollbacks” easier? Let me know in the comments!

Happy coding.

Commandline Fu: Find and Replace strings in files without opening them using ‘sed’

One thing you discover when moving from Windows to Linux is just how much you can accomplish from the console [aka commandline / terminal] in Linux. There are withdrawal pains at first, of course. Things seem arduous and difficult, you have to look up the syntax of different commands over and over, and you want your GUI back. Little by little though, it strikes you just how much time you’re saving.

Consider this scenario: You’re working on an app to automate data migration for a MySQL database, for example to update a QA database with data from the production instance. You need to extract the data as ‘INSERT’ statements, probably using another command-line tool, mysqldump. Some of that same data will already exist in the QA copy though, causing conflicts when you try to load the PROD extract. Fortunately the MySQL developers thought of this sort of thing and provided a ‘REPLACE INTO’ statement; it works just like ‘INSERT INTO’ except that when a row with the same primary or unique key already exists in the destination, the old row is deleted and the new one takes its place instead of causing a duplicate-key error. However, mysqldump writes out ‘INSERT’ statements by default, not ‘REPLACE’ statements. (Newer versions of mysqldump have a --replace switch for this, but the technique below works on any text stream.)

Enter the ‘sed’ command. sed is a stream editor for filtering and transforming text. Using sed in conjunction with mysqldump and bash’s powerful piping and redirection capabilities, you can do all 3 of these things in one fell swoop:

  1. Use mysqldump to extract your data in a format that’s easily loaded into another database;
  2. Find every occurrence of the phrase ‘INSERT INTO’ in the extract and replace it with ‘REPLACE INTO’ using sed;
  3. Redirect the modified output from sed into a file (it would normally go to the screen, which is probably less than useful).

Using these commands you can do this all with one line of text at the command prompt (the trailing backslash just continues the line; DATABASE_NAME stands for the name of your own database):

$ mysqldump --skip-opt -n -t -e -c --hex-blob DATABASE_NAME \
  | sed -e 's/INSERT INTO/REPLACE INTO/g' > data_extract.sql

Pretty cool, huh?

Here’s what’s going on. First, mysqldump extracts the data (I’ll explain the switches further down). Next, bash’s pipe operator ( “|” ) tells the command interpreter to send the output of the preceding command to another program instead of displaying it on the console. We sent it to sed, and gave sed an expression telling it to replace every ‘INSERT INTO’ occurrence with ‘REPLACE INTO’. Lastly, bash’s redirect operator ( “>” ) sends sed’s output into a file named data_extract.sql instead of showing it on the screen. Voilà! You have a file you can import conflict-free into your QA database.

Using ‘-e’ with sed means an expression will immediately follow. The pattern for find-and-replace expressions is ‘s/pattern/replacement/[flags]’. We used ‘g’ for flags, which means replace every occurrence of pattern on each line rather than only the first. (See ‘man sed’ for more depth.)
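You can watch the pipe, the sed expression and the redirect work together without a database at hand. In this sketch a small shell function stands in for mysqldump; the function name, the users table and its rows are all made up for the demo.

```shell
#!/bin/sh
# Stand-in for mysqldump output: a made-up table and rows.
sample_dump() {
    printf '%s\n' \
        "INSERT INTO users (id, name) VALUES (1, 'ann');" \
        "INSERT INTO users (id, name) VALUES (2, 'bob');" \
        "-- a comment line that sed passes through unchanged"
}

out=$(mktemp)

# Same shape as the real command: producer | sed > file
sample_dump | sed -e 's/INSERT INTO/REPLACE INTO/g' > "$out"

cat "$out"

# Count what changed and what is left over.
replaced=$(grep -c '^REPLACE INTO' "$out")
remaining=$(grep -c 'INSERT INTO' "$out" || true)
echo "replaced=$replaced remaining=$remaining"
```

One thing to keep in mind: the expression rewrites the phrase wherever it appears, including inside your data. If a text column might contain the words INSERT INTO, anchoring the pattern to the start of the line ('s/^INSERT INTO/REPLACE INTO/') is the safer variant.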

Lastly, here’s a bit of explanation on what all those arguments to mysqldump were all about. mysqldump can extract a database’s structure, data, or both. You control the specifics with arguments, some examples being:

# -c = --complete-insert (INSERTs use explicit column names)
# -d = --no-data (structure only, no rows)
# -e = --extended-insert (multiple rows per INSERT statement,
#      instead of one-by-one INSERTs)
# -n = --no-create-db (don't create the db in the destination,
#      i.e. use the existing one)
# -t = --no-create-info (skip CREATE TABLE statements)
# -p = ask for password; -psecret = --password=secret
# --skip-opt: see below; gets rid of the MyISAM-only "disable keys"
#      (ALWAYS put BEFORE -c and -e!!!)
# --skip-triggers = don't dump triggers
# -q = --quick (stream rows, don't buffer the entire dataset;
#      good for large tables)
# -uroot = connect as the MySQL user root
# --hex-blob = convert binary to 0xHEX notation,
#      same format as:
#      select CONCAT('0x', HEX(UNHEX(REPLACE(UUID(), '-', ''))));
# --single-transaction is a much better option than locking for InnoDB,
#      because it does not need to lock the tables at all.
#      To dump big tables, you should combine this option with --quick.
# --opt, --skip-opt (PUT --skip-opt BEFORE -c and -e: it switches off
#      --extended-insert, so a later -e turns that back on)
#      --opt is shorthand; it is the same as specifying
#      --add-drop-table --add-locks --create-options --disable-keys
#      --extended-insert --lock-tables --quick --set-charset. It should
#      give you a fast dump operation and produce a dump file that can be
#      reloaded into a MySQL server quickly. As of MySQL 4.1, --opt is on
#      by default, but can be disabled with --skip-opt. To disable only
#      certain of the options enabled by --opt, use their --skip forms; for
#      example, --skip-add-drop-table or --skip-quick.

Happy terminals!

Finally, a Linux alternative for Jing!

I finally found a reasonably complete Linux replacement for Jing, at least for still caps. A few not-too-difficult setup steps and you get easy hotkey-based rectangular screenshots with two-click short-URL uploads.

Try it. Here’s the Shutter Quickstart for Kubuntu (will vary for Gnome users):

$ sudo add-apt-repository ppa:shutter/ppa
$ sudo apt-get update
$ sudo apt-get install shutter

Run the Shutter app, dink with preferences as you see fit, then…

Gnome: Shutter preferences can set your keybindings
KDE: K menu -> System Settings -> Shortcuts and Gestures -> Custom Shortcuts

  • Right click “Preset Actions” -> New -> Global Shortcut -> Command URL
  • Trigger = your key combo preference (I used Ctrl+Shift+J since it’s the same as Jing)
  • Action = shutter -s (for Selection based capture, i.e. rectangular region, RTFM if you want a different default)


  • Create an Ubuntu One account
  • Install the Ubuntu One client (available in KPackage Manager)
  • Run the client and enter your account details
  • Find the tab with the “Connect” button, click it, tell it to share/sync files at least

Now you’re ready…

  • Take a Shutter screen cap using your previously configured hotkey
  • Right click the image in the Shutter window that follows, select Export
  • Select the Ubuntu One tab
  • “Choose folder” dropdown -> “other…” -> navigate to ~/Ubuntu One/
  • Create ‘img’ or ‘pic’ or ‘My Beautiful Digital Pictures’ or whatever you want to call your shared pics directory
  • Save in that folder

The upload will happen automatically*, and when complete a short URL will be on your clipboard (you’ll get a toaster message).

*as it will with any content placed underneath ‘~/Ubuntu One’

There are other sharing options available, but the configuration for them in Shutter is still rough around the edges (to be polite).
