Thursday, April 28, 2016

Converting HTML E-mail To Plain Text in MSCRM


The Battle Of Evermore…
OK, I admit it. I’ve caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM & SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or received through the e-mail router and converts the HTML into plain text.
After a bit of searching, I found a good article which showed how you could use regular expressions to remove unwanted HTML tags leaving just the plain text – Convert HTML to Plain Text. Converting this from C# to VB (my preferred choice of language) and stripping out some of the bits I didn’t need, I came up with the following code which forms the basis of this plug-in.
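' Note: every Replace call below is meant to be Regex.Replace, not the built-in
' VB Replace function, which treats the pattern as literal text. The listing
' assumes Imports System.Text.RegularExpressions plus a way of resolving the bare
' name (for example Imports System.Text.RegularExpressions.Regex); otherwise
' write Regex.Replace explicitly.
Private Function ConvertHTMLToText(ByVal Source As String) As String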
 
    Dim result As String = Source
 
    ' Replace line-break characters with a single space. Line breaks are only whitespace
    ' in HTML, and removing them outright would join the words on either side.
    ' It also lets the patterns below match across what used to be separate lines.
    ' \r - Matches a carriage return \u000D.
    ' \n - Matches a line feed \u000A.
    ' \f - Matches a form feed \u000C.
    ' For more details see http://msdn.microsoft.com/en-us/library/4edbef7e.aspx
    result = Replace(result, "[\r\n\f]+", " ")
 
    ' replace the most commonly used special characters:
    result = Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
    result = Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
    result = Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
    result = Replace(result, "&quot;", """", RegexOptions.IgnoreCase)
    result = Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)
 
    ' Remove ASCII character code sequences such as &#nn; and &#nnn;
    result = Replace(result, "&#[0-9]{2,3};", String.Empty, RegexOptions.IgnoreCase)
 
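    ' Remove all other named character entities (for example &copy; or &hellip;).
    ' More can be decoded individually above - see the following for more details:
    ' http://www.degraeve.com/reference/specialcharacters.php
    ' http://www.web-source.net/symbols.htm
    ' The pattern only accepts letters and digits after the ampersand, so ordinary text
    ' that contains "&" followed later by ";" is left alone.
    result = Replace(result, "&[a-z][a-z0-9]{1,7};", String.Empty, RegexOptions.IgnoreCase)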
 
    ' Remove all attributes and whitespace from the <head> tag
    result = Replace(result, "< *head[^>]*>", "<head>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from the </head> tag
    result = Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
    ' Delete everything between the <head> and </head> tags
    result = Replace(result, "<head>.*</head>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Remove all attributes and whitespace from all <script> tags
    result = Replace(result, "< *script[^>]*>", "<script>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </script> tags
    result = Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
    ' Delete each <script>...</script> block. The lazy .*? stops at the nearest
    ' closing tag, so text between two separate blocks is kept.
    result = Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Remove all attributes and whitespace from all <style> tags
    result = Replace(result, "< *style[^>]*>", "<style>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </style> tags
    result = Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
    ' Delete each <style>...</style> block. The lazy .*? stops at the nearest
    ' closing tag, so text between two separate blocks is kept.
    result = Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Insert tabs in place of <td> tags
    result = Replace(result, "< *td[^>]*>", vbTab, RegexOptions.IgnoreCase)
 
    ' Insert single line breaks in place of <br> and <li> tags
    result = Replace(result, "< *br[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
    result = Replace(result, "< *li[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
 
    ' Insert double line breaks in place of <p>, <div> and <tr> tags
    result = Replace(result, "< *div[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
    result = Replace(result, "< *tr[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
    result = Replace(result, "< *p[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
 
    ' Remove all remaining HTML tags
    result = Replace(result, "<[^>]*>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Replace repeating spaces with a single space
    result = Replace(result, " +", " ")
 
    ' Remove any trailing spaces and tabs from the end of each line
    result = Replace(result, "[ \t]+\r\n", vbCrLf)
 
    ' Remove any leading whitespace characters
    result = Replace(result, "^[\s]+", String.Empty)
 
    ' Remove any trailing whitespace characters
    result = Replace(result, "[\s]+$", String.Empty)
 
    ' Remove extra line breaks if there are more than two in a row
    result = Replace(result, "\r\n\r\n(\r\n)+", vbCrLf + vbCrLf)
 
    ' That's it.
    Return result
 
End Function
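One detail worth checking in the script and style removal steps is how the `.*` quantifier behaves when a message contains more than one block of the same kind. The following is a minimal sketch, using a made-up input string and explicit `Regex.Replace` calls, that compares a greedy match with a lazy one. The module name and sample HTML are illustrative assumptions, not part of the original project.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazyDemo

    Sub Main()
        ' Hypothetical sample input: two script blocks with real content between them.
        Dim html As String = "<script>a()</script><p>Keep me</p><script>b()</script>"

        ' Greedy: .* runs from the first <script> to the last </script>,
        ' so the paragraph between the two blocks is deleted as well.
        Dim greedy As String = Regex.Replace(html, "<script>.*</script>", String.Empty, RegexOptions.IgnoreCase)

        ' Lazy: .*? stops at the nearest </script>, so each block is removed on its own.
        Dim lazy As String = Regex.Replace(html, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)

        Console.WriteLine("Greedy: [" & greedy & "]")   ' prints: Greedy: []
        Console.WriteLine("Lazy:   [" & lazy & "]")     ' prints: Lazy:   [<p>Keep me</p>]
    End Sub

End Module
```

For e-mail bodies this matters mostly for messages with several style or script blocks, where a greedy match can silently swallow the text between them.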
All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the “DeliverPromote” event, whereas any incoming e-mail handled by the e-mail router triggers the “DeliverIncoming” event. Interestingly enough, the “Create” event was also called as a child pipeline for these events, but modifying the message here didn’t have any effect, even in the pre-processing stage.
Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end, I added checks to ensure that, even if the plug-in is registered on multiple events, the main code will only run if the plug-in:
  1. is running on the ‘DeliverPromote’ or ‘DeliverIncoming’ messages
  2. is running synchronously
  3. is running against the ‘Email’ entity
  4. is running in the ‘pre-processing’ stage of the pipeline
  5. is running in a ‘Parent’ pipeline
Public Class ConvertHtmlToText
    Implements IPlugin
 
    Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
 
        ' Exit if any of the following conditions are true:
        '  1. plug-in is not running synchronously (Mode is not 0)
        '  2. plug-in is not running against the 'Email' entity
        '  3. plug-in is not running in the 'pre-processing' stage of the pipeline (Stage is not 10)
        '  4. plug-in is not running in a 'Parent' pipeline (InvocationSource is not 0)
        If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
            Exit Sub
        End If
 
        If (context.MessageName = "DeliverPromote") Or (context.MessageName = "DeliverIncoming") Then
 
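            For Each item In context.InputParameters.Properties

                If (item.Name = "Body") Then
                    context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(CStr(item.Value))
                    ' Stop enumerating once the body has been replaced. Continuing to iterate
                    ' a collection after changing it may raise an exception, depending on
                    ' the collection type.
                    Exit For
                End If

            Next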
 
        End If
 
    End Sub
 
End Class
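The guard logic in the plug-in can also be pulled out into a small function with named constants, which makes the five checks easier to read and to exercise without a CRM server. The sketch below is only an illustration: the module, function, and constant names are hypothetical, and the integer values are simply the ones used in the listing above.

```vb
Imports System

Module PluginGuardSketch

    ' Hypothetical names for the integer values used in the post's guard clause.
    Private Const ModeSynchronous As Integer = 0
    Private Const StagePreProcessing As Integer = 10
    Private Const InvocationSourceParent As Integer = 0

    ' Returns True only when all five checks listed in the post pass.
    Public Function ShouldConvertBody(ByVal messageName As String, _
                                      ByVal entityName As String, _
                                      ByVal mode As Integer, _
                                      ByVal stage As Integer, _
                                      ByVal invocationSource As Integer) As Boolean

        Dim isEmailDelivery As Boolean = (messageName = "DeliverPromote") OrElse (messageName = "DeliverIncoming")

        Return isEmailDelivery _
            AndAlso mode = ModeSynchronous _
            AndAlso entityName = "email" _
            AndAlso stage = StagePreProcessing _
            AndAlso invocationSource = InvocationSourceParent
    End Function

    Sub Main()
        ' An incoming e-mail in the parent pipeline at the pre-processing stage: converted.
        Console.WriteLine(ShouldConvertBody("DeliverIncoming", "email", 0, 10, 0))   ' prints True
        ' The child-pipeline Create call mentioned earlier: skipped.
        Console.WriteLine(ShouldConvertBody("Create", "email", 0, 10, 1))            ' prints False
    End Sub

End Module
```

Inside Execute, the same function could be called with the corresponding properties of the context object before touching the message body.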
As always, I have included the source code to my project here. Please do bear in mind that I haven’t included any error handling or logging, so it’s not production-ready. However, it should provide you with a good head start.
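As a starting point for the missing error handling, one defensive option is to leave the message body untouched whenever it is empty or the conversion throws. This is a sketch under my own assumptions rather than part of the original project: the wrapper name, the stand-in converter, and the console logging are all hypothetical placeholders.

```vb
Imports System

Module SafeConversionSketch

    ' Hypothetical wrapper: runs the supplied converter and returns the original
    ' text unchanged if there is nothing to convert or the conversion fails.
    Public Function ConvertOrKeepOriginal(ByVal body As String, _
                                          ByVal converter As Func(Of String, String)) As String
        If String.IsNullOrEmpty(body) Then
            Return body
        End If

        Try
            Return converter(body)
        Catch ex As Exception
            ' Placeholder logging; a real plug-in would use whatever logging it has available.
            Console.Error.WriteLine("HTML conversion failed: " & ex.Message)
            Return body
        End Try
    End Function

    ' Stand-in converter that always fails, used only to exercise the fallback path.
    Private Function AlwaysFails(ByVal source As String) As String
        Throw New InvalidOperationException("simulated failure")
    End Function

    Sub Main()
        ' The failure is logged to standard error and the original HTML is returned.
        Console.WriteLine(ConvertOrKeepOriginal("<p>Hello</p>", AddressOf AlwaysFails))
    End Sub

End Module
```

In the plug-in itself, the converter argument would be `AddressOf ConvertHTMLToText`. The empty-body check also avoids passing Nothing to the regular expression calls, which would otherwise throw.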
This posting is provided “AS IS” with no warranties, and confers no rights.
