Converting HTML E-mail To Plain Text
The Battle Of Evermore…
OK, I admit it. I’ve caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM & SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or handled by the e-mail router and converts the HTML into plain text.
After a bit of searching, I found a good article which showed how you could use regular expressions to remove unwanted HTML tags leaving just the plain text – Convert HTML to Plain Text. Converting this from C# to VB (my preferred choice of language) and stripping out some of the bits I didn’t need, I came up with the following code which forms the basis of this plug-in.
    ' Requires: Imports System.Text.RegularExpressions
    Private Function ConvertHTMLToText(ByVal Source As String) As String
        Dim result As String = Source

        ' Remove formatting that will prevent regex from running reliably
        ' \r - Matches a carriage return \u000D.
        ' \n - Matches a line feed \u000A.
        ' \f - Matches a form feed \u000C.
        ' For more details see http://msdn.microsoft.com/en-us/library/4edbef7e.aspx
        ' Line breaks count as whitespace in HTML, so each one becomes a space;
        ' deleting them outright would join words that were split across lines.
        result = Regex.Replace(result, "[\r\n\f]", " ")

        ' Replace the most commonly used special characters:
        result = Regex.Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "&quot;", """", RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)

        ' Remove ASCII character code sequences such as &#nn; and &#nnn;
        result = Regex.Replace(result, "&#[0-9]{2,3};", String.Empty, RegexOptions.IgnoreCase)

        ' Remove all other special characters. More can be added - see the following for more details:
        ' http://www.degraeve.com/reference/specialcharacters.php
        ' http://www.web-source.net/symbols.htm
        result = Regex.Replace(result, "&.{2,6};", String.Empty, RegexOptions.IgnoreCase)

        ' Remove all attributes and whitespace from the <head> tag
        result = Regex.Replace(result, "< *head[^>]*>", "<head>", RegexOptions.IgnoreCase)
        ' Remove all whitespace from the </head> tag
        result = Regex.Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
        ' Delete everything between the <head> and </head> tags
        ' (.*? is lazy, so the match stops at the first closing tag)
        result = Regex.Replace(result, "<head>.*?</head>", String.Empty, RegexOptions.IgnoreCase)

        ' Remove all attributes and whitespace from all <script> tags
        result = Regex.Replace(result, "< *script[^>]*>", "<script>", RegexOptions.IgnoreCase)
        ' Remove all whitespace from all </script> tags
        result = Regex.Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
        ' Delete everything between each <script> and </script> pair
        result = Regex.Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)

        ' Remove all attributes and whitespace from all <style> tags
        result = Regex.Replace(result, "< *style[^>]*>", "<style>", RegexOptions.IgnoreCase)
        ' Remove all whitespace from all </style> tags
        result = Regex.Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
        ' Delete everything between each <style> and </style> pair
        result = Regex.Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)

        ' Insert tabs in place of <td> tags
        result = Regex.Replace(result, "< *td[^>]*>", vbTab, RegexOptions.IgnoreCase)

        ' Insert single line breaks in place of <br> and <li> tags
        result = Regex.Replace(result, "< *br[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "< *li[^>]*>", vbCrLf, RegexOptions.IgnoreCase)

        ' Insert double line breaks in place of <p>, <div> and <tr> tags
        result = Regex.Replace(result, "< *div[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "< *tr[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
        result = Regex.Replace(result, "< *p[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)

        ' Remove all remaining HTML tags
        result = Regex.Replace(result, "<[^>]*>", String.Empty, RegexOptions.IgnoreCase)

        ' Replace repeating spaces with a single space
        result = Regex.Replace(result, " +", " ")
        ' Remove any trailing spaces and tabs from the end of each line
        result = Regex.Replace(result, "[ \t]+\r\n", vbCrLf)
        ' Remove any leading whitespace characters
        result = Regex.Replace(result, "^[\s]+", String.Empty)
        ' Remove any trailing whitespace characters
        result = Regex.Replace(result, "[\s]+$", String.Empty)
        ' Remove extra line breaks if there are more than two in a row
        result = Regex.Replace(result, "\r\n\r\n(\r\n)+", vbCrLf + vbCrLf)

        ' That's it.
        Return result
    End Function
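Two details of this approach are worth trying out in isolation. The first is how "delete everything between the tags" behaves when a message contains more than one `<script>` or `<style>` block: a greedy `.*` runs from the first opening tag to the last closing tag and takes any text in between with it, whereas a lazy `.*?` stops at the nearest closing tag. The second is that entities can be decoded rather than deleted. The sketch below is not part of the plug-in; `RemoveBlocks` and `DecodeEntities` are hypothetical helper names, and `System.Net.WebUtility` is assumed to be available (.NET Framework 4.0 or later; on older frameworks `System.Web.HttpUtility.HtmlDecode` plays the same role).

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlStrippingSketch

    ' Hypothetical helper: deletes every <tag ...>...</tag> block for one tag name.
    ' .*? is lazy, so each match ends at the nearest closing tag.
    ' Singleline lets "." cross line breaks, so they need not be removed first.
    Function RemoveBlocks(ByVal html As String, ByVal tagName As String) As String
        Dim name As String = Regex.Escape(tagName)
        Dim pattern As String = "<\s*" & name & "\b[^>]*>.*?<\s*/\s*" & name & "\s*>"
        Return Regex.Replace(html, pattern, String.Empty,
                             RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    End Function

    ' Hypothetical helper: turns named and numeric entities into characters
    ' instead of dropping them. Assumes .NET Framework 4.0 or later.
    Function DecodeEntities(ByVal text As String) As String
        Return WebUtility.HtmlDecode(text)
    End Function

    Sub Main()
        Dim html As String =
            "<script>a()</script>Keep me<script>b()</script> caf&#233; &amp; bar"

        ' Both script blocks are removed; the text between them survives.
        Dim stripped As String = RemoveBlocks(html, "script")

        ' The numeric and named entities become real characters.
        Console.WriteLine(DecodeEntities(stripped))
    End Sub

End Module
```

If entities are decoded this way, doing it after the tags have been stripped keeps escaped text such as `&lt;b&gt;` from being mistaken for markup.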
All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the “DeliverPromote” event, whereas any incoming e-mail handled by the e-mail router triggers the “DeliverIncoming” event. Interestingly enough, the “Create” event was also called as a child pipeline for these events, but modifying the message here didn’t have any effect, even in the pre-processing stage.
Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end I added checks to ensure that, even if it is registered on multiple events, the main code will only run if the plug-in:
- is running on the ‘DeliverPromote’ or ‘DeliverIncoming’ messages
- is running synchronously
- is running against the ‘Email’ entity
- is running in the ‘pre-processing’ stage of the pipeline
- is running in a ‘Parent’ pipeline
    ' Requires: Imports Microsoft.Crm.Sdk
    Public Class ConvertHtmlToText
        Implements IPlugin

        Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
            ' Exit if any of the following conditions are true:
            ' 1. plug-in is not running synchronously
            ' 2. plug-in is not running against the 'Email' entity
            ' 3. plug-in is not running in the 'pre-processing' stage of the pipeline
            ' 4. plug-in is not running in a 'Parent' pipeline
            If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
                Exit Sub
            End If

            ' Only convert the body for the two e-mail delivery messages
            If context.MessageName = "DeliverPromote" OrElse context.MessageName = "DeliverIncoming" Then
                ' Look the parameter up directly rather than changing the collection while looping over it
                If context.InputParameters.Properties.Contains("Body") Then
                    context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(CStr(context.InputParameters.Properties.Item("Body")))
                End If
            End If
        End Sub

        ' The ConvertHTMLToText function listed earlier belongs here, inside the class.
    End Class
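Since the guard conditions only compare a handful of plain values, they can also be pulled out into a small function and exercised without a CRM server. The following is a sketch under that assumption and not the code from my project: `ShouldConvert` and the three constant names are hypothetical, and the values 0, 10 and 0 are simply the literals used in the plug-in.

```vbnet
Imports System

Module PluginGuardSketch

    ' Hypothetical names for the literal values checked in the plug-in.
    Const SynchronousMode As Integer = 0
    Const PreProcessingStage As Integer = 10
    Const ParentPipeline As Integer = 0

    ' Returns True only when every condition from the list holds.
    Function ShouldConvert(ByVal mode As Integer,
                           ByVal entityName As String,
                           ByVal stage As Integer,
                           ByVal invocationSource As Integer,
                           ByVal messageName As String) As Boolean
        If mode <> SynchronousMode Then Return False
        If entityName <> "email" Then Return False
        If stage <> PreProcessingStage Then Return False
        If invocationSource <> ParentPipeline Then Return False
        Return messageName = "DeliverPromote" OrElse messageName = "DeliverIncoming"
    End Function

    Sub Main()
        ' A synchronous, pre-stage, parent-pipeline DeliverPromote on email passes.
        Console.WriteLine(ShouldConvert(0, "email", 10, 0, "DeliverPromote"))
        ' The Create message is rejected even when everything else matches.
        Console.WriteLine(ShouldConvert(0, "email", 10, 0, "Create"))
    End Sub

End Module
```

Inside `Execute`, the same check could then be made by passing in `context.Mode`, `context.PrimaryEntityName`, `context.Stage`, `context.InvocationSource` and `context.MessageName`.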
As always, I have included the source code to my project here. Please do bear in mind that I haven’t included any error handling or logging, so it’s not production-ready. However, it should provide you with a good head start.
This posting is provided “AS IS” with no warranties, and confers no rights.