The Oft Forgotten Middle Trim
Two Most Popular Ways to Trim
Trimming whitespace from data has become standard practice throughout the industry. Data should almost never have whitespace at the front or at the end.
- Front Trim (also called left trim) = Remove leading whitespace (space, tab, new line, carriage return) from the front of data.
- Back Trim (also called right trim) = Remove trailing whitespace (space, tab, new line, carriage return) from the back of data.
What does this mean? Look at the following data example:
" White space at front" <-- space
"\tWhite space at front" <-- tab
"\nWhite space at front" <-- new line or carriage return
"White space at back " <-- space
"White space at back\t" <-- tab
"White space at back\n" <-- new line or carriage return
When extra whitespace is added to the front or back of data, it should almost always be trimmed.
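As a minimal sketch of these two trims in C# (the language used later in this post), the built-in String.TrimStart(), String.TrimEnd(), and String.Trim() methods remove leading and trailing whitespace when called with no arguments. The TrimBasics class and its method names are made up for illustration.

```csharp
using System;

// Hypothetical helper class; only the String methods it calls are built in.
public static class TrimBasics
{
    // Front (left) trim: removes leading whitespace only.
    public static string FrontTrim(string value) => value.TrimStart();

    // Back (right) trim: removes trailing whitespace only.
    public static string BackTrim(string value) => value.TrimEnd();

    // Front and back trim together.
    public static string FrontAndBackTrim(string value) => value.Trim();
}
```

For example, TrimBasics.FrontAndBackTrim("\tWhite space at front and back \n") returns the text with the leading tab and the trailing space and new line removed. Whitespace between the words is left untouched.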
The Third Way to Trim – Middle Trim
There is a third type of trimming that should be done for many fields. It is less popular, and many developers forget about it.
- Front Trim (also called left trim) = Remove whitespace (space, tab, new line, carriage return) from the front of data.
- Back Trim (also called right trim) = Remove whitespace (space, tab, new line, carriage return) from the back of data.
- Middle Trim (also called center trim) = Remove extra whitespace (space, tab, new line, carriage return) from between words of data.
Note: Extra whitespace could mean different things depending on the field. In this post, it means more than one space. However, if we were dealing with names of objects in code that should not have any middle spaces at all, then even one middle space could be considered an extra space.
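The two meanings of "extra" in the note above can be sketched as two small C# helpers. This is only an illustration: MiddleTrimPolicies, CollapseMiddle, and RemoveMiddle are hypothetical names, and the sketch treats any run of whitespace characters (not just spaces) as a candidate for cleanup.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class showing two possible definitions of "extra" whitespace.
public static class MiddleTrimPolicies
{
    // Policy 1: a run of whitespace is reduced to exactly one space.
    // Suits fields such as names, where single spaces between words are valid.
    public static string CollapseMiddle(string value) =>
        Regex.Replace(value, @"\s+", " ");

    // Policy 2: every whitespace character is removed.
    // Suits fields such as code identifiers, where no whitespace is valid.
    public static string RemoveMiddle(string value) =>
        Regex.Replace(value, @"\s+", "");
}
```

With these helpers, CollapseMiddle turns "Awesome   Company\tLLC" into "Awesome Company LLC", while RemoveMiddle turns "My Object  Name" into "MyObjectName".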
Perhaps “Middle Trim” is not something you have heard of before. Front and back trim involve only removing characters if they exist. Middle Trim involves either removing or replacing characters if they exist. Because of this, some might argue that Middle Trim is an incorrect phrase. From a certain point of view, I would agree. However, to properly link the task to front trim and back trim, the phrase Middle Trim makes a lot of sense.
"Extra white space  in middle" <-- space
"Extra white space\tin middle" <-- tab
"Extra white space\nin middle" <-- new line or carriage return
This one actually takes some thought, because it doesn’t apply to as many fields as front trim and back trim do. However, for many fields, middle trim is just as valid:
- Address Lines (When there is one field per line)
- City
- Country
- Name (Pretty much any type of name)
  - Account
  - Business
  - Contact
  - Company
  - Course
  - Customer
  - First
  - Last
  - Middle
  - Part
  - Partner
  - Product
  - School
  - Spouse
  - Street
  - User
- Order Identifiers
- State
- etc…
Names should not have extra whitespace at the front, end, or middle. State or Country names should never have extra whitespace at the front, middle, or end. Many types of input should be cleaned of extra whitespace in the front, middle, or end.
"Awesome  Company LLC" <-- space
"Washington\tD.C." <-- tab
"United States\nof America" <-- new line or carriage return
All of the above are wrong. I could quote First Normal Form to you, but really common sense should be enough. These spaces make the data wrong.
Now, each field may be different. You may not want middle trim if your field is a blob of text that has paragraphs. In that case, you certainly want to leave carriage returns.
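For such a paragraph field, one possible compromise is to collapse only spaces and tabs and leave line breaks alone. The following is a rough sketch under that assumption; ParagraphTrim and CollapseSpacesKeepLines are hypothetical names, not part of any library mentioned in this post.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for multi-paragraph text fields.
public static class ParagraphTrim
{
    // Replaces each run of spaces and tabs with a single space.
    // New lines and carriage returns are not in the character class,
    // so paragraph breaks are preserved exactly as entered.
    public static string CollapseSpacesKeepLines(string value) =>
        Regex.Replace(value, @"[ \t]+", " ");
}
```

Given "First  paragraph.\n\nSecond\tparagraph.", this returns "First paragraph.\n\nSecond paragraph." with the blank line between the paragraphs intact.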
Implementing Middle Trim in C#
Middle trim isn’t exactly easy to implement. Some languages have features, such as Regex, which make it easy. Others do not.
Why isn’t Middle Trim extremely common and more easily implemented? Perhaps middle trim is forgotten because there isn’t a clear method for it like there is with String.Trim() and so it is often left out?
Many languages, like C#, make front and back trimming easy. In C#, you can simply call String.Trim() and it will trim whitespace from the front and back. However, it doesn’t clean up extra whitespace in the middle.
Doing all three trims in C# is most easily done with Regex and an extension method.
Note: Get the Rhyous.StringLibrary from NuGet or check out the Rhyous.StringLibrary project on GitHub.
    using System.Text.RegularExpressions;

    public static class StringExtensions
    {
        public static string TrimAll(this string value)
        {
            // Remove extra whitespace from the front and the back.
            var trimmedValue = value.Trim();
            // Replace each run of whitespace in the middle with a single space.
            return Regex.Replace(trimmedValue, @"\s+", " ");
        }
    }
If you want to avoid regex, you could roll your own like this:
    using System.Text;

    public static class StringExtensions
    {
        public static string TrimAll(this string value)
        {
            var trimmedValue = new StringBuilder();
            char previousChar = (char)0;
            foreach (char c in value)
            {
                // Skip whitespace, but remember that we saw it.
                if (char.IsWhiteSpace(c))
                {
                    previousChar = c;
                    continue;
                }
                // Emit one space between words; never at the very front.
                if (char.IsWhiteSpace(previousChar) && trimmedValue.Length > 0)
                {
                    trimmedValue.Append(' ');
                }
                trimmedValue.Append(c);
                previousChar = c;
            }
            // Trailing whitespace was skipped, so nothing is left at the back.
            return trimmedValue.ToString();
        }
    }
You would use either method the same way.
    var newstring = "  This string   has extra whitespace in the front, middle and the end.  ";
    newstring = newstring.TrimAll();
Implementing Middle Trim in MSSQL
MSSQL also has LTRIM (left trim) and RTRIM (right trim), but middle trim doesn’t exist. Middle Trim is even harder to write in MSSQL because there is no Regex. So you have to replace whitespace characters with spaces, then collapse multiple spaces into one.
Here is what it looks like to add a name to a person and to do all three trims: front, back, middle. Wow! It is ugly.
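    INSERT INTO PERSON (NAME)
    VALUES (
        LTRIM(RTRIM(
            REPLACE(
                REPLACE(
                    REPLACE(
                        REPLACE(
                            REPLACE(
                                REPLACE(@str, CHAR(9), ' '),    -- tab -> space
                            CHAR(10), ' '),                      -- new line -> space
                        CHAR(13), ' '),                          -- carriage return -> space
                    CHAR(32) + CHAR(32), CHAR(32) + CHAR(7)),    -- double space -> space + bell
                CHAR(7) + CHAR(32), ''),                         -- bell + space -> empty string
            CHAR(7), '')                                         -- leftover bell -> empty string
        ))
    )
This replaces tabs, new lines, and carriage returns with spaces. Then it uses the bell character, CHAR(7), as a marker (because bell is basically never used in data) and replaces every double space, CHAR(32)+CHAR(32), with space bell, CHAR(32)+CHAR(7). Then it replaces every instance of CHAR(7)+CHAR(32) with '', an empty string. That can leave a few bell characters behind, so one more replace of CHAR(7) with '', an empty string, removes them. Finally, RTRIM and LTRIM do the right trim and left trim. The trims go last because, as used here, LTRIM and RTRIM strip only spaces, so leading or trailing tabs and line breaks have to be turned into spaces first.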
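To see why the bell-character trick collapses a run of spaces of any length, the same three replacements can be traced in C#. This is an illustrative sketch only: BellTrickDemo and CollapseSpaces are hypothetical names, and it assumes the input never contains the bell character itself.

```csharp
using System;

// Hypothetical demo that mirrors the three innermost space-handling REPLACE calls.
public static class BellTrickDemo
{
    public static string CollapseSpaces(string value)
    {
        const string Space = " ";
        const string Bell = "\a"; // the bell character, char(7)

        return value
            // Step 1: every pair of spaces becomes space + bell.
            .Replace(Space + Space, Space + Bell)
            // Step 2: a bell followed by a space marks a redundant space; drop both.
            .Replace(Bell + Space, string.Empty)
            // Step 3: remove any bell markers that are still left.
            .Replace(Bell, string.Empty);
    }
}
```

For "a    b" (four spaces), step 1 produces a, space, bell, space, bell, b. Step 2 removes the bell-space pair in the middle, leaving a, space, bell, b. Step 3 removes the last bell, leaving "a b".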
How to know which type of trimming you need?
This is very simple. Just ask questions:
- Front trim – Will extra whitespace at the front ever be valid?
- Back trim – Will extra whitespace at the back ever be valid?
- Middle trim – Will extra whitespace in the middle ever be valid? Are middle spaces allowed? If so, should they always be a single space?
If the answer to any of those questions is “no,” then you need to do that type of trim. However, it is clear that Middle Trim has more questions as it is more complex.
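One way to record the answers to these questions per field is a flags enum that selects which trims to apply. The sketch below is an assumption about how this could be organized, not an existing API: TrimKinds, FieldTrimmer, and Apply are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical flags describing which trims a field needs.
[Flags]
public enum TrimKinds
{
    None = 0,
    Front = 1,
    Back = 2,
    Middle = 4,
    All = Front | Back | Middle
}

public static class FieldTrimmer
{
    public static string Apply(string value, TrimKinds kinds)
    {
        if (kinds.HasFlag(TrimKinds.Middle))
        {
            // Only collapse whitespace runs that sit between two
            // non-whitespace characters, so the front and back are untouched.
            value = Regex.Replace(value, @"(?<=\S)\s+(?=\S)", " ");
        }
        if (kinds.HasFlag(TrimKinds.Front)) value = value.TrimStart();
        if (kinds.HasFlag(TrimKinds.Back)) value = value.TrimEnd();
        return value;
    }
}
```

A name field would typically use TrimKinds.All, while a free-text field might use TrimKinds.Front | TrimKinds.Back and skip the middle trim.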