Unwanted HTML elements in your Sitecore rich text fields

Custom save events to the rescue!

I have been planning this post for years but kept postponing it. After several requests from our great Sitecore community, it is finally here.

The problem I have solved in past projects is the following:
When you edit rich text fields, the Telerik editor that Sitecore uses inserts more HTML elements around your content than you want. For example, <p> tags you don't want there, or <p> tags or divs around an image that you embed in the rich text field.
At some point I was so annoyed by this and the bad HTML it produced that I decided to customize the item save event to clean up the resulting HTML. Below you will find my code; I hope you find it useful. Be aware that this code has only been tested on Sitecore 7.x. While I expect it to work on 8+ as well, test for yourself that nothing breaks.
Also, extend it to your own needs to create your own neat and clean HTML.

/// Don't forget the following config file patch for Sitecore:
///  <configuration>
///   <sitecore>  
///     <events>
///      <event name="item:saved">
///        <handler type="YourNameSpace.Sitecore.Events.RichTextEditorSaveEvent, YourNameSpace.Sitecore" method="OnItemSaving">
///          <database>master</database>
///        </handler>
///      </event>
///    </events>
///   </sitecore>
///  </configuration>

using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using Sitecore.Data;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;
using Sitecore.Events;
using Sitecore.SecurityModel;

namespace YourNameSpace.Sitecore.Events
{
    /// <summary>
    /// This save event fires for all items, but only executes logic for rich text fields. In this case it cleans up the rich text field's HTML.
    /// </summary>
    public class RichTextEditorSaveEvent
    {
        // IDs of items this handler is currently processing. Prevents recursion when the handler saves the item itself.
        private static readonly SynchronizedCollection<ID> MProcess = new SynchronizedCollection<ID>();

        // Filled from the <database> element in the config patch.
        public string Database { get; set; }

        public void OnItemSaving(object sender, EventArgs args)
        {
            try
            {
                var item = Event.ExtractParameter(args, 0) as Item;
                if (item == null)
                {
                    return;
                }
                if ((item.Database != null && String.Compare(item.Database.Name, this.Database) != 0) || MProcess.Contains(item.ID))
                {
                    return;
                }

                MProcess.Add(item.ID);
                try
                {
                    foreach (Field field in item.Fields)
                    {
                        if (!field.TypeKey.Equals("rich text", StringComparison.InvariantCultureIgnoreCase))
                        {
                            continue;
                        }

                        var content = field.Value;

                        if (!string.IsNullOrEmpty(content))
                        {
                            content = content.Trim();

                            try
                            {
                                HtmlDocument htmlDocument = new HtmlDocument();
                                htmlDocument.LoadHtml(content);
                                RemovePTagAroundImages(htmlDocument);
                                MoveSpanClassAndStyleToPTagIfIsDirectAndOnlyChild(htmlDocument);
                                RemoveEmptyPTags(htmlDocument);
                                content = htmlDocument.DocumentNode.InnerHtml;
                            }
                            catch (Exception)
                            {
                                //Apparently no html or not valid, in this case wrap in <p>
                                if (!content.Contains("<p>"))
                                {
                                    content = "<p>" + content + "</p>";
                                }
                            }

                            // Only write (and trigger another save) when the cleanup changed something.
                            if (content != field.Value)
                            {
                                using (new SecurityDisabler())
                                {
                                    // With item:saved the item is no longer in editing mode, so reopen it.
                                    item.Editing.BeginEdit();
                                    field.Value = content;
                                    item.Editing.EndEdit();
                                }
                            }
                        }
                    }
                }
                finally
                {
                    MProcess.Remove(item.ID);
                }
            }
            catch (Exception)
            {
                // Cleaning up HTML should never break saving the item.
            }
        }

        /// <summary>
        /// Moves span attributes to the parent <p></p> tag, moves the span contents to the parent as well, then removes the span. This gives cleaner HTML.
        /// </summary>
        /// <param name="content"></param>
        private void MoveSpanClassAndStyleToPTagIfIsDirectAndOnlyChild(HtmlDocument content)
        {
            HtmlNodeCollection spanNodes = content.DocumentNode.SelectNodes("//span");
            if (spanNodes != null && spanNodes.Count > 0)
            {
                foreach (HtmlNode spanTag in spanNodes)
                {
                    if (spanTag.ParentNode != null && spanTag.ParentNode.Name == "p" && spanTag.ParentNode.ChildNodes.Count == 1)
                    {
                        HtmlNode parent = spanTag.ParentNode;
                        foreach (HtmlAttribute attr in spanTag.Attributes)
                        {
                            if (parent.Attributes[attr.Name] != null)
                            {
                                // Style declarations are joined with a semicolon, other values (such as class names) with a space.
                                string separator = attr.Name == "style" ? "; " : " ";
                                parent.Attributes[attr.Name].Value += !string.IsNullOrEmpty(parent.Attributes[attr.Name].Value) ? separator + attr.Value : attr.Value;
                            }
                            else
                            {
                                parent.Attributes.Add(attr.Name, attr.Value);
                            }
                        }
                        Unwrap(spanTag);
                    }
                }
            }
        }

        /// <summary>
        /// Images can get <p></p> tags around them when the content editor inputs a line break after inserting the image. This is undesirable. This method removes such tags and places the image on the parent.
        /// </summary>
        /// <param name="content"></param>
        private void RemovePTagAroundImages(HtmlDocument content)
        {
            HtmlNodeCollection imgNodes = content.DocumentNode.SelectNodes("//img");
            if (imgNodes != null && imgNodes.Count > 0)
            {
                foreach (HtmlNode imgTag in imgNodes)
                {
                    HtmlNode pTag = imgTag.ParentNode;
                    // The parent check also skips a <p> that was already unwrapped for an earlier image.
                    if (pTag != null && pTag.Name == "p" && pTag.ParentNode != null)
                    {
                        Unwrap(pTag);
                    }
                }
            }
        }

        /// <summary>
        /// Cleans html with empty <p></p> tags. Content editor should use <br/> line breaks instead, or spacing should be realized with CSS.
        /// </summary>
        /// <param name="content"></param>
        private void RemoveEmptyPTags(HtmlDocument content)
        {
            HtmlNodeCollection pNodes = content.DocumentNode.SelectNodes("//p");
            if (pNodes != null && pNodes.Count > 0)
            {
                foreach (HtmlNode pTag in pNodes)
                {
                    if (string.IsNullOrWhiteSpace(pTag.InnerHtml) || pTag.InnerHtml == "&nbsp;")
                    {
                        pTag.Remove();
                    }
                }
            }
        }

        /// <summary>
        /// Replaces a node with its own children, keeping them in their original order and position.
        /// </summary>
        private static void Unwrap(HtmlNode node)
        {
            HtmlNode parent = node.ParentNode;
            foreach (HtmlNode child in new List<HtmlNode>(node.ChildNodes))
            {
                parent.InsertBefore(child, node);
            }
            parent.RemoveChild(node);
        }
    }
}
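
If you want to try a single cleanup rule in isolation before wiring it into Sitecore, a small console sketch like the one below may help. It only assumes the HtmlAgilityPack NuGet package. The class name EmptyParagraphDemo, the helper RemoveEmptyParagraphs and the sample markup are made up for illustration and are not part of the handler; treat it as a starting point, not as a reference implementation.

```csharp
using System;
using HtmlAgilityPack;

public static class EmptyParagraphDemo
{
    // Hypothetical helper: removes <p> elements that are empty,
    // contain only whitespace, or contain only a single &nbsp;.
    public static string RemoveEmptyParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                if (string.IsNullOrWhiteSpace(p.InnerHtml) || p.InnerHtml == "&nbsp;")
                {
                    p.Remove();
                }
            }
        }

        return doc.DocumentNode.InnerHtml;
    }

    public static void Main()
    {
        // Sample input as an editor might produce it (assumed, not taken from a real field).
        string input = "<p>Hello</p><p>&nbsp;</p><p></p><p>World</p>";
        Console.WriteLine(RemoveEmptyParagraphs(input));
    }
}
```

Running it against markup copied from one of your own rich text fields is a quick way to check whether a rule is too greedy for your content before it touches real items.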

Have extensions you want to share to get better HTML? Let me know!

