Friday, February 17, 2012

Language-in-Language and Extensible Literal Types

It is a pretty common scenario these days to create an application using a general purpose programming language (like Java or Dart) and then sprinkle in more specialized languages like SQL or HTML as needed.
Martin Fowler uses the terms host language and domain specific language (or DSL). He wrote a whole book on this topic.

DSLs are a fascinating topic, and much has been written on the subject. But what interests me is the question of how we integrate all of these DSLs into our application. That's what this post is about.

DSL as String Literal

Probably the most common way to embed a DSL into a Java app is with string literals, where the stuff inside the quotes is the DSL. Here are a few examples:
sql
String q = "SELECT id,firstName from person";
html
String h = "<div>Hello</div>";
regex
Pattern.compile("[dD]ave");
jpa
String q = "SELECT p.id,p.firstName from Person p";
css in html
String h = "<div style='color:red'>Hello</div>";
The problem with putting a language inside of a string is that the IDE (or other tools) cannot help you very much, particularly in terms of:
  1. catching your errors
  2. refactoring
  3. auto-completion
  4. syntax highlighting
  5. debugging
Now, if your host language has no static types (like JavaScript) or you don't use an IDE, then you are probably accustomed to not getting help in these areas.
But my primary host language does have static types (Java) and I do use an IDE (IntelliJ). So I always try to code in a way that maximizes the IDE's ability to help me. But for string literal DSLs this is difficult.
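
To make the first item on that list concrete, here is a minimal sketch using only the JDK's java.util.regex package. The class and method names (StringDslErrors, checkRegex) are made up for this example. Both strings compile fine as Java; the broken regex is only discovered when the code actually runs.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class StringDslErrors {

    // Returns null if the regex compiles, otherwise the error description.
    // The Java compiler accepts any string here; only Pattern.compile,
    // at runtime, knows whether it is a valid regular expression.
    static String checkRegex(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(checkRegex("[dD]ave")); // valid regex
        System.out.println(checkRegex("[dD ave")); // unclosed character class
    }
}
```

The first call returns null and the second returns an error description, but nothing in the type system distinguishes the two strings.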

DSL as API

Another solution is to replace the string literal DSL with an API. This is what jOOQ does.

This addresses some of the problems mentioned above with "DSL in String Literal". But it is usually more verbose.
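
As a rough illustration of the API approach, here is a tiny fluent builder. To be clear, this is a hypothetical mini-API written for this post, not jOOQ's actual API; SelectBuilder, PersonColumn and toSql are all invented names. The point is only that method calls and constants give the compiler and the IDE something to check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-API for illustration only; this is NOT jOOQ's actual API.
public class SelectBuilder {

    // Columns of a hypothetical "person" table as constants, so a
    // misspelled column name is a compile error rather than a runtime one.
    enum PersonColumn {
        ID("id"), FIRST_NAME("firstName");

        final String sqlName;

        PersonColumn(String sqlName) {
            this.sqlName = sqlName;
        }
    }

    private final List<String> columns = new ArrayList<>();
    private String table;

    static SelectBuilder select(PersonColumn... cols) {
        SelectBuilder builder = new SelectBuilder();
        for (PersonColumn col : cols) {
            builder.columns.add(col.sqlName);
        }
        return builder;
    }

    SelectBuilder from(String table) {
        this.table = table;
        return this;
    }

    // Renders the query as SQL text.
    String toSql() {
        return "SELECT " + String.join(",", columns) + " FROM " + table;
    }

    public static void main(String[] args) {
        String q = select(PersonColumn.ID, PersonColumn.FIRST_NAME)
                .from("person")
                .toSql();
        System.out.println(q); // SELECT id,firstName FROM person
    }
}
```

Auto-completion and refactoring now work on PersonColumn.FIRST_NAME, at the cost of noticeably more code than the one-line string.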

APIs that look like a DSL

In some languages (particularly Ruby) there is great latitude in the way APIs can be defined and called. So much so that a Ruby API can look very much like a DSL. And it's not in a string literal. It's actually part of the language. And the distinction between API and DSL starts to become blurry. Martin Fowler calls these internal DSLs. Here is an SQL select statement in a Ruby internal DSL:
statement = Select[:id, :firstName, :age].from[:people].where do
    equal :lastName, 'Ford'
    greater_than :age, 21
end
An Internal DSL is really just an API that looks like a DSL.

IDEs and DSL String Literals

Some modern IDEs are smart enough to figure out that a string literal is actually a DSL and provide some extra help. This is a really cool feature! Below are screenshots of IntelliJ's awesomeness with string literal DSLs:
[Screenshots: sql, jpa, regex, css/html]
I can't overstate how useful this functionality is. It is so useful that, in my opinion, this should be a major consideration in any new general purpose programming language (GPL). That is: how well does the GPL (and its tooling) deal with language-in-language?

Multi-line String Literals

Many languages allow string literals to span multiple lines. For example, in Dart you can use the triple quote:
String html = '''
  <table>
    <tr><td>First Name</td></tr>
    <tr><td>Last Name</td></tr>
  </table>
''';

String Templates/Interpolation

Many languages that support multi-line string literals also support string interpolation. For example, in the Dart multi-line string example above, you can embed a Dart expression inside the HTML using the ${ } syntax:
Person p = getPersonFromSomewhere();
String html = '''
  <table>
    <tr><td>First Name</td><td>${p.firstName}</td></tr>
    <tr><td>Last Name</td><td>${p.lastName}</td></tr>
  </table>
''';
In this case, what we really have is a String template.

DSL Inversion

Another solution to this problem is to flip it. For example, instead of embedding HTML in Java, embed Java in HTML. This is what JSP is all about. HTML is the host language and Java is the embedded language.

DSL Literals

Java Literal

In JSP, Java is not embedded like this:
'''
int x = 4;
int y = 2;
int z = x + y;
'''

Rather, JSP supports a Java literal:
<% 
int x = 4;
int y = 2;
int z = x + y;
%>

Reg Ex Literal

JavaScript supports the regular expression literal:
var r = /[dD]ave/;

Extensible Literal Types

Given the prevalence of language-in-language, I think any new language should have some mechanism for supporting this. I propose a modifier to the triple quote syntax that adds some suggestion (to tooling and readers) as to the type of string contained:
Sql q = sql'''
  SELECT id,firstName 
  from person 
  where id = 7
'''
Html h = html'''<div>Hello</div>'''
Regex r = regex'''[dD]ave [fF]ord'''
Jpa q = jpa'''
  SELECT p.id,p.firstName 
  from Person p
  where p.age > 21
'''
Html h = html'''<div style='color:red'>Hello</div>'''

Templating

I want the above-mentioned extensible literals. But I also want to use embedded ${} expressions:

Sql q = sql'''
  SELECT id,firstName 
  from person 
  where id=${p.id}'''
Html h = html'''<div>Hello ${msg}</div>'''
Regex r = regex'''[dD]ave ${lastName}'''

Jpa q = jpa'''
  SELECT p.id,p.firstName 
  from Person p
  where p.age > ${p.id}
'''
Html h = html'''<div style='color:${color}'>${msg}</div>'''


This creates some extra complexities. For example, should the string be parsable before the embedded Dart expressions are evaluated? I think so. Otherwise, IDEs and tools would not be able to help much, defeating the whole point. Therefore, there must be some restrictions on where in the string ${} expressions are allowed, and these would be different for each DSL.

For example, the following would not be a valid template, even though it evaluates to a valid SQL statement:

var s = 'ECT';
var q = sql'''
  SEL${s} id,firstName 
  from person
''';
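
Here is a minimal sketch, in Java, of how a tool might enforce that kind of restriction without evaluating anything. The rule it applies is an assumption I picked for illustration (a ${} hole may not touch a letter, digit or underscore, so it can never split a keyword or identifier), and TemplateCheck, literalParts and holesAreWholeTokens are made-up names. A real SQL-aware check would need an actual grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class TemplateCheck {

    // Splits a template into its literal parts, dropping the ${...} holes.
    // "a${x}b" yields ["a", "b"]. Assumes holes are not nested.
    static List<String> literalParts(String template) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = template.indexOf("${", pos);
            if (start < 0) {
                parts.add(template.substring(pos));
                return parts;
            }
            int end = template.indexOf('}', start);
            if (end < 0) {
                throw new IllegalArgumentException("unclosed ${ at index " + start);
            }
            parts.add(template.substring(pos, start));
            pos = end + 1;
        }
    }

    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Assumed rule: a hole must stand where a whole value could stand,
    // so the literal text on either side may not be a word character.
    static boolean holesAreWholeTokens(String template) {
        List<String> parts = literalParts(template);
        for (int i = 0; i < parts.size() - 1; i++) {
            String before = parts.get(i);
            String after = parts.get(i + 1);
            if (!before.isEmpty() && isWordChar(before.charAt(before.length() - 1))) {
                return false;
            }
            if (!after.isEmpty() && isWordChar(after.charAt(0))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hole splits the SELECT keyword: rejected.
        System.out.println(holesAreWholeTokens("SEL${s} id,firstName from person"));
        // Hole stands in for a whole value: accepted.
        System.out.println(holesAreWholeTokens("SELECT id from person where id=${p.id}"));
    }
}
```

The check only looks at the literal parts of the template, which is exactly what an IDE has available before the program runs.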

Bottom Line

A few things are certain:
  • Language-in-language will always be needed, especially on the web. Whether it is called templates, DSLs, or polyglot programming, it's here to stay.
  • We need something more than just a multi-line string with interpolation for these embedded DSLs. We want our tools to help us with finding errors early, refactoring, auto-completion, syntax highlighting, and debugging. We want self-documenting code.
The two possible solutions I can think of are:
  1. Extensible literals as proposed above
  2. Internal DSLs
I prefer the extensible literal approach, primarily based on how awesome this is in IntelliJ with Java. And that's without any special language support at all. 


6 comments:

Eric Leese said...

Another problem with these languages in strings is that they are a frequent source of security problems -- SQL injection, XSS, etc. So having interpolation in these cases do context appropriate escaping would be beneficial.

But this still isn't quite right for me. When you say regex'''\w+''' do you want the result to be a String or a RegEx? Also, if you interpolate an Element in the middle of some HTML, do you want it to be turned into a string, or do you want it inserted into a DOM tree and still be the same Element that you already have a variable bound to that you might be referring to in handlers?

How about this: why don't we add a special constructor syntax that uses these literals? I'm thinking something along these lines:

SQLQuery q = new SQLQuery'''SELECT id,firstName from person WHERE lastName=${lastName} ORDER BY firstName''';

would be a shorthand way of writing:

q = new SQLQuery.fromStringLiteral( ['''SELECT id,firstName from person WHERE lastName=''', null, ''' ORDER BY firstName'''], [lastName]);

So a special constructor can be written that gets the string in parts representing the literal template (the null above represents where an interpolated value goes -- the constructor could use the first argument to compile a template and cache it) and also gets the raw interpolated values separately, so it can parse according to the literal parts and then insert the interpolated parts in a context-appropriate way.

Anyone can add new literals. IDEs could recognize certain constructors and provide syntax highlighting, and they wouldn't have to guess about what strings might end up being used as literals: if a string isn't immediately preceded by new Type, then it won't be understood as code. It will be understood as a value being inserted into the code.

Of course if you really need to build up code through string manipulation, you can always call new SQLQuery.fromStringLiteral(...) directly.
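
A minimal sketch of that constructor idea in Java, under some assumptions: the class name SqlQuery and the choice to turn each hole into a ? placeholder (the style java.sql.PreparedStatement uses) are illustrative, not something specified above. The literal parts and the interpolated values arrive separately, so a value can never be parsed as SQL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SqlQuery {
    final String sql;          // query text with ? placeholders
    final List<Object> params; // values to bind, in placeholder order

    private SqlQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    // parts: the literal pieces of the template, with null marking each hole.
    // values: the raw interpolated values, one per hole, in order.
    static SqlQuery fromStringLiteral(List<String> parts, List<Object> values) {
        StringBuilder sql = new StringBuilder();
        int holes = 0;
        for (String part : parts) {
            if (part == null) {
                sql.append('?'); // the value is bound later, never concatenated
                holes++;
            } else {
                sql.append(part);
            }
        }
        if (holes != values.size()) {
            throw new IllegalArgumentException(
                    holes + " holes but " + values.size() + " values");
        }
        return new SqlQuery(sql.toString(), new ArrayList<>(values));
    }

    public static void main(String[] args) {
        String lastName = "O'Brien"; // the quote stays data, not SQL
        SqlQuery q = SqlQuery.fromStringLiteral(
                Arrays.asList("SELECT id,firstName from person WHERE lastName=",
                        null, " ORDER BY firstName"),
                Arrays.<Object>asList(lastName));
        System.out.println(q.sql);
        System.out.println(q.params);
    }
}
```

Because the template parts are fixed at compile time, the resulting SQL text could also be cached per call site.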

Dave Ford said...

Eric: When you say regex'''\w+''' do you want the result to be a String or a RegEx?

A RegEx object.

Dave Ford said...

Eric said: if you interpolate an Element in the middle of some HTML, do you want it to be turned into a string, or do you want it inserted into a DOM tree

Ideally, the system would determine that the expression evaluated to an object of type Element and somehow make use of that fact. It would depend on where within the template the ${} is used. Only certain places within a template string would allow parameters. And depending on the place, it may be only a specific type of parameter.

As an example, in the following JSP snippet, productList must evaluate to a List* whereas msg must evaluate to a String*:

<c:forEach items="${productList}">
...
</c:forEach>

<p>${msg}</p>

* Technically, in JSP, it is a bit more involved than List and String, but the point is the same.



However, if it only supported strings, that would still be useful.

Eric Leese said...

Okay, with your post clarified it's clear we're thinking along the same lines. But who gets to define the meaning of html, sql, and all the other string prefixes? Do they all have to be part of the Dart language spec? Especially when they're so close to the names of the classes they're compiling to? As many DSLs as there are, just create a special constructor syntax so that anyone can extend the language and define what happens to interpolated values for their DSL.

Lukas Eder said...

Nice overview over domain specific languages. Might be worth mentioning Scala with its internalised XML DSL...

Ruudjah said...

Very nice. Imho, a serious downside to the solution of a combined multiline string denotifier with a language abbreviation is that it makes code more ugly. A possible solution for this might be to include icons of languages in a new Unicode table (like a monochrome favicon), and then allow this Unicode character to be used interchangeably with the multiline denotifier/language abbreviation combo.