Hidden costs of using Java String methods
What can be easier than to replace a fragment of a string with another fragment in Java? Right, this sounds extremely simple. But, as it frequently happens, it turns out to be not that simple. Especially if you imagine that you need to perform the same replacement hundreds of thousands times in row. Like with any operation that is repeated many times in row, there are two important aspects of the quality:
- The performance of the operation. Can it be done 10% faster? Sometimes relatively long fragments of code work much faster than something that looks very short and simple.
- Heap usage. Of course, this kind of the operation must not cause any memory leaks. However, the creation of every temporary object in the heap is not free - it is not only about creating it but also about the eventual garbage collection.
Lets consider the following task: I want to replace a single quote in the string with a double one. Here are 5 different solutions I can offer:
String.replaceAll()
newString = original.replaceAll("\"", "\"\"");
What can be simple? Not so fast. Simple to write does not mean simple to run. Lets look under the hood of the implementation of java.lang.String class (JDK 1.5):
public String replaceAll(String regex, String replacement) { return Pattern.compile(regex).matcher(this).replaceAll(replacement); }
Actually, the javadocs do clearly say that this method does and how. As you can see, it creates two additional objects (Pattern and the Matcher) in order to perform the replacement. Unless you really need to replace a regular expression, this is definitely not our choice. Even if the performance is acceptable, the amount of objects created will impact the performance.
String.replace()
newString = original.replace("\"", "\"\"");
Unlike in case of replaceAll(), the javadocs for String.replace() do not say anything about the regular expressions. But, in fact….
public String replace(CharSequence target, CharSequence replacement) { return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( this).replaceAll(Matcher.quoteReplacement(replacement.toString())); }
Not that much different from the replaceAll(), except that the compilation time of the regular expression (and most likely the execution time of the matcher) will be a bit shorter when in the case of replaceAll().
Using the precompiled Pattern
Since we are talking about the same replacement operation repeated many times, it is much better to compile the pattern once and then just use it. It saves us one temporary object per operation (minus one, strictly speaking :) ). However we still need a Matcher for each sample.
Pattern pattern = Pattern.compile("\"", Pattern.LITERAL); ... Matcher m = pattern.matcher(original); newString = m.replaceAll("\"\"");
Own method
How is about writing the method performing the replacement ourselves? It may look a bit over-complicated, but in fact (see the end of the post) it is the fastest option! The code below is probably not the best but it explains the idea:
int pos = 0; StringBuffer sb = new StringBuffer(original); while((pos = sb.indexOf("\"", pos)) != -1) { sb.insert(pos + 1, '\"'); pos += 2; } newString = sb.toString();
No need to reinvent the wheel - StringUtils.replace() from Apache Commons Lang
Apache Commons Lang library offers a class called StringUtils. This class offers a bunch of static methods for manipulating the strings in various ways. One of them allows you to replace all occurrences of a substring with another one:
newString = org.apache.commons.lang.StringUtils.replace(sample, "\"", "\"\"");
The implementation of this method looks similar to our own one using the StringBuffer.
Even faster?
Yes, we can make it even faster. Assuming that not all the input strings contain the target fragment, we can put a condition that does not perform the replacement if the input string does not contain the target fragment. Potentially, we could combine this call with the custom implementation using the StringBuffer, that would be the ideal implementation (saving both the CPU cycles and the heap).
Conclusion
The table below summarizes the result of a sample test program that processes 100,000 input strings half of which contain the target pattern. Although the absolute numbers do not mean anything, the preferred approach becomes obvious.
Applied to all samples (time, s) | Applied only to matching samples (time, s) | |
---|---|---|
String.replaceAll() | 0.514 | 0.429 |
String.replace() | 0.509 | 0.425 |
Precompiled pattern | 0.417 | 0.402 |
Java implementation | 0.324 | 0.251 |
Commons-Lang StringUtils.replace() | 0.266 | 0.267 |
As you can see, it is always advantageous to test the input string first (except the StringUtils - the implementation of their replace() method actually does it itself and returns the original sample if it does not contain the substring). Java implementation that does not touch the regular expressions works faster. And even if the custom implementation is a bit faster, personally I prefer to use the one from Commons Lang along with other useful objects and helper methods offered by this package.
blog comments powered by Disqus