Regex magic for JAXB mappings

I am working on a project with 412 classes generated from a bunch of XSDs. This works well, but the names annoy me a bit. I don’t like most of the classes having names ending in Type, for example.

A binding.xjb file solves the problem, but I didn’t want to write 412 <bindings> tags. Regular expressions to the rescue!

The problem

Here are (some of) the generated classes and their corresponding element names:

Java name                     XSD element type name         Desired Java name
AbodeDataType                 abodeDataType                 AbodeData
RequestBase                   requestBase                   RequestBase
AppInfo                       appInfo                       AppInfo
AppType                       appType                       App
CapacityValueUnitEnumeration  capacityValueUnitEnumeration  CapacityValueUnit
ChainageType                  chainageType                  Chainage
ChainageUnitEnumeration       chainageUnitEnumeration       ChainageUnit
ClientHeaderLines             clientHeaderLines             ClientHeaderLines

As you can see, several of the generated classes end in Type or Enumeration, which is what I want to remove.

The general binding.xjb formula for doing this is:

<bindings scd="/~xsdElementTypeName">
	<class name="DesiredJavaName"/>
</bindings>

(Yes, of course I’m using SCD and not XPath here!)

So, I need a list of all the generated classes, to put as the class name bit, and then I need to “reverse-generate” the XSD element type names to put as the scd attribute. Very hacky, and very fun!

Getting a list of all the generated classes was easy enough: I just did ls and removed the .java bit 🙂
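The same list can also be produced without a shell. Here is a minimal Python sketch, assuming the generated sources all sit in one directory; the function name class_names and the "." placeholder path are made up for illustration and are not from the project:

```python
from pathlib import Path


def class_names(directory):
    """Return the names of all .java files in `directory`, without the extension."""
    # Path.stem drops the final suffix, so "AppType.java" becomes "AppType".
    return sorted(p.stem for p in Path(directory).glob("*.java"))


if __name__ == "__main__":
    # "." is a placeholder; point it at the directory holding the generated classes.
    print("\n".join(class_names(".")))
```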

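Now I have a text file containing (using the names from the table):

AbodeDataType
RequestBase
AppInfo
AppType
CapacityValueUnitEnumeration
ChainageType
ChainageUnitEnumeration
ClientHeaderLines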

Solution

Using your favourite editor, a regular expression search and replace is the way to go here.

The regex needs to do three things:

  1. Change the first character to lower case.
  2. Check if the string ends with Type or Enumeration.
  3. Output the full name and the name without those suffixes.

Changing the first character is easy; you just need to put it in its own capture group: ^(\w)
Then, in the output, you put \l\1. In editors that support Perl-style case operators, \l converts only the next character to lower case, while \L would convert everything that follows (until \E) to lower case. Group 1 is a single character, so \l is what we want here.
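Not every regex engine has \l. Python's re module, for instance, does not, so the same step there needs a replacement function instead of a replacement string. A small sketch of that idea (the helper name lower_first is hypothetical):

```python
import re


def lower_first(name):
    """Lower-case the first word character, like \\l\\1 in an editor that supports it."""
    # The callable receives the match object and returns the replacement text.
    return re.sub(r"^(\w)", lambda m: m.group(1).lower(), name)


print(lower_first("AppType"))      # appType
print(lower_first("RequestBase"))  # requestBase
```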

Matching

Now for the fun bit!

To capture either a word followed by Type or Enumeration, or any other word, we do this:
(\w+(?=(Type|Enumeration))|\w+)

What?! Why not just use (\w+)(Type|Enumeration)?, or something similar?

Well, the problem with that is that the \w+ bit would greedily gobble up the Type suffix as well, and we need it separate so we can handle it properly.

No, doing (\w+?)(Type|Enumeration) (a non-greedy match) does not work either. Well, it sort of works, since it catches the names ending with the suffix properly, but it does not match the other strings, the ones not ending with the suffix. Of course, we could handle those separately, but where is the fun in that™?
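The behaviour of the two rejected patterns is easy to check. Here is a quick sketch in Python, whose re module treats greedy and non-greedy quantifiers the same way for these patterns; the helper name split_with is made up for this example:

```python
import re

GREEDY = re.compile(r"(\w+)(Type|Enumeration)?")
LAZY = re.compile(r"(\w+?)(Type|Enumeration)")


def split_with(pattern, name):
    """Return the (stem, suffix) groups, or None if the pattern does not match."""
    m = pattern.match(name)
    return m.groups() if m else None


print(split_with(GREEDY, "AppType"))  # ('AppType', None): \w+ swallowed the suffix
print(split_with(LAZY, "AppType"))    # ('App', 'Type'): suffix split off correctly
print(split_with(LAZY, "AppInfo"))    # None: no suffix, so no match at all
```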

So, we need to catch (and split into groups) both all strings with a suffix and all strings without a suffix. At the same time.

This is a perfect match (pun intended) for a lookahead group!

The result of the lookahead (the (?=…) stuff) is not put in the capture group, just the first bit (e.g. AppType will have only App in the capture group). However, since the expression inside the lookahead has () around it, that in itself becomes a capture group.
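To see that the lookahead captures the suffix without consuming it, here is a small Python sketch. It uses only the alternation, so the suffix group is number 2 here rather than number 3, and the helper name stem_and_suffix is hypothetical:

```python
import re

# Only the alternation from the text; without the leading (\w) group,
# the suffix group is number 2 here rather than number 3.
STEM = re.compile(r"(\w+(?=(Type|Enumeration))|\w+)")


def stem_and_suffix(name):
    """Return (stem, suffix, end), where end is how far the match consumed the input."""
    m = STEM.match(name)
    return m.group(1), m.group(2), m.end()


print(stem_and_suffix("AppType"))  # ('App', 'Type', 3)
print(stem_and_suffix("AppInfo"))  # ('AppInfo', None, 7)
```

For AppType the match ends at position 3, right after App, even though Type has been captured. That leftover text is what the next step has to deal with.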

This is important, because we need to get rid of that bit too. This is done like so:
(\3)?

This means “take any occurrence of whatever is in the third group – if anything – and put it in its own group”, or in plain language: “eat the suffix you just found” 🙂

But we don’t want to create yet another capture group, we just want to get rid of it, so instead we do this:
(?:\3)?

Same thing, but no group created. It just matches the text. This is necessary because we are doing a search and replace, and if it weren’t matched it would be left as-is in the text file!
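The effect of that trailing group shows up when running the replacement with and without it. A rough Python sketch, where the replacement simply keeps groups 1 and 2 (the names WITHOUT, WITH and strip_suffix are made up for illustration):

```python
import re

WITHOUT = r"^(\w)(\w+(?=(Type|Enumeration))|\w+)"
WITH = WITHOUT + r"(?:\3)?"


def strip_suffix(pattern, name):
    """Replace the matched text with groups 1 and 2 only."""
    return re.sub(pattern, r"\1\2", name)


print(strip_suffix(WITHOUT, "AppType"))  # AppType: "Type" was never matched, so it stays
print(strip_suffix(WITH, "AppType"))     # App
print(strip_suffix(WITH, "AppInfo"))     # AppInfo
```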

Now, we have this expression:
^(\w)(\w+(?=(Type|Enumeration))|\w+)(?:\3)?

This expression can be illustrated like this (image made with Debuggex):
[Image: diagram of the expression]

Results

The capture groups we get from this look as follows:

Example                       Group 1  Group 2           Group 3
AbodeDataType                 A        bodeData          Type
RequestBase                   R        equestBase
AppInfo                       A        ppInfo
AppType                       A        pp                Type
CapacityValueUnitEnumeration  C        apacityValueUnit  Enumeration
ChainageType                  C        hainage           Type
ChainageUnitEnumeration       C        hainageUnit       Enumeration
ClientHeaderLines             C        lientHeaderLines

Perfect!

Replacement string

We need a replacement string too, and it looks like this:

<bindings scd="/~\l\1\2\3">
	<class name="\1\2"/>
</bindings>

Or, as a one-liner:
<bindings scd="/~\l\1\2\3">\n\t<class name="\1\2"/>\n</bindings>

Final results

Et voilà, the end results:
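The whole search and replace can also be reproduced outside an editor. Below is a minimal Python sketch of it, not the way it was done for this post: since Python's re module has no \l, a replacement function builds the string instead, and the names to_binding and generate_bindings are hypothetical.

```python
import re

# re.MULTILINE lets ^ match at the start of every line of the class list.
PATTERN = re.compile(r"^(\w)(\w+(?=(Type|Enumeration))|\w+)(?:\3)?", re.MULTILINE)


def to_binding(m):
    """Build one <bindings> element from the three capture groups."""
    first, rest, suffix = m.group(1), m.group(2), m.group(3) or ""
    scd_name = first.lower() + rest + suffix  # equivalent of \l\1\2\3
    java_name = first + rest                  # equivalent of \1\2
    return f'<bindings scd="/~{scd_name}">\n\t<class name="{java_name}"/>\n</bindings>'


def generate_bindings(class_list):
    """Replace every class name (one per line) with its <bindings> element."""
    return PATTERN.sub(to_binding, class_list)


print(generate_bindings("AppType\nRequestBase"))
```

For AppType this yields a <bindings> element with scd="/~appType" wrapping <class name="App"/>, while RequestBase keeps its name and only gets the lower-cased scd value.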


There are probably alternate ways to do this, but this is the way I did it and it’s easy enough 🙂
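One caveat: the lookahead is not anchored to the end of the name, so a name with Type or Enumeration in the middle would be split too early. None of the names above are affected, but adding $ inside the lookahead is one possible safeguard. A Python sketch of the difference, using the hypothetical name AppTypeInfo (not one of the generated classes):

```python
import re

ORIGINAL = re.compile(r"^(\w)(\w+(?=(Type|Enumeration))|\w+)(?:\3)?")
# Same expression, but the suffix must be followed by the end of the name.
ANCHORED = re.compile(r"^(\w)(\w+(?=(Type|Enumeration)$)|\w+)(?:\3)?")


def groups(pattern, name):
    """Return the three capture groups for `name`."""
    return pattern.match(name).groups()


# "AppTypeInfo" is a hypothetical name, not one of the generated classes.
print(groups(ORIGINAL, "AppTypeInfo"))  # ('A', 'pp', 'Type'): the match stops before "Info"
print(groups(ANCHORED, "AppTypeInfo"))  # ('A', 'ppTypeInfo', None)
print(groups(ANCHORED, "AppType"))      # ('A', 'pp', 'Type')
```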

Yes, I admit: I used this method to auto-generate the tables in this blog post too, but this was a bit too meta to include above 🙂 Here is the replacement string I used:
<tr><td><code>\1\2<span style="color:red">\3</span></code></td><td><code>\l\1\2\3</code></td><td><code>\1\2</code></td></tr>

What are computers for, after all?!
