Weeding out Word formatting codes from product descriptions



    Is there a way to weed out Word formatting codes from product descriptions?

    We have many product descriptions that were copied from Word and pasted into the product description WYSIWYG editor field, which inadvertently copied all of Word's formatting codes along with the text.

    This causes all sorts of problems: the field is bloated, and it is impossible to export a file in which the description field is properly parsed (fragments of data end up intermittently divided among multiple columns and rows).

    It took a while to figure out a way of identifying which products had been contaminated with Word formatting codes, since the export files were useless, but we eventually did.

    The problem now is: how do we clean up those particular product descriptions without losing certain HTML formatting codes such as <br>, <p>, etc.?

    Also, is there a way to create a hidden category so I can temporarily add these products to it, making it easier to export a group rather than the entire catalog?
    Thank you, Bill Davis

    #2
    Hey William,

    You can create a category and set its active flag to unchecked (false), then assign whatever products you want to it and export them using the Export Products to Flat File export. You might have to change the "Filters" dropdown in the "Category Lookup" dialog on that export screen to show "All" categories, rather than just active ones.

    Once you've exported them, you can open that file and search/replace whatever broken characters are there with the correct versions (I'd open it in some sort of editor like Sublime Text that will work on plain text, rather than showing you formatted data). Then you can import that file back in to update the products.

    I would test it first by adding only a single product to that category, fixing it, re-importing, and making sure the flow works as you expect. Post back here if you have any more questions or if it doesn't work out right.
    Ryan Guisewite
    Lead UI Developer / Miva, Inc.
    www.miva.com



      #3
      The only way I know of converting Word docs into useful content for a Miva RTF editor is to copy the text into a plain-text editor (like Notepad), then copy that and paste it into the Miva RTF box. This works for me using Notepad++ or TextPad as the intermediate editor. It preserves the paragraphs while removing everything else.
      Bruce Golub
      Phosphor Media - "Your Success is our Business"

      phosphormedia.com



        #4
        You might find some of this information helpful: Rich Text Editor Nightmare

        I still had to do quite a bit of manual work, but Bruce's suggestions helped me get at least a little closer to a workable import file.



          #5
          Well, we first learned there was an issue when trying to import data into Excel, but for a very long time no one had any idea what was causing it, not even Miva.

          How did we identify which products were contaminated with the Word formatting codes? We identified a common Word formatting code contained in the description field: <xml>.
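
A minimal sketch of that identification step in Python, assuming the descriptions are already available as a mapping of product code to description HTML (how they get loaded is not shown). Only `<xml>` comes from this thread; the other markers and the names `WORD_MARKERS` and `find_contaminated` are illustrative assumptions.

```python
import re

# <xml> is the marker mentioned in the thread. The rest are other strings
# commonly left behind by Word pastes (an assumed, non-exhaustive list).
WORD_MARKERS = re.compile(r'<xml>|<o:p>|<w:|class="?Mso|mso-', re.IGNORECASE)


def find_contaminated(products):
    """Return the codes of products whose description contains a Word marker."""
    return [code for code, desc in products.items() if WORD_MARKERS.search(desc)]


# Hypothetical sample data: product code -> description HTML.
products = {
    "A100": "<p>Plain description.</p>",
    "B200": "<p class=MsoNormal>Pasted<o:p></o:p></p><xml><w:WordDocument/></xml>",
}

print(find_contaminated(products))  # ['B200']
```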

          Since Word formatting codes were rendering Miva export files useless, we then used old Man Weiland's "Find and Replace" to locate every product that contained the <xml> Word formatting code in the product description field. We then simply copied and pasted the entire search results into Excel and kept only the product code column, then added a second column and concatenated "https://www.site.com/[ProdCode].html" to create a list file, product-corrupt-desc.csv.
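
The same code-to-URL list could also be built outside Excel. This is a rough Python sketch, assuming the product codes are already in a list; the URL pattern is the placeholder one quoted above, and the function names and column headers are made up for illustration.

```python
import csv
import io


def build_url_rows(codes, base="https://www.site.com"):
    """Pair each product code with its storefront URL (pattern from the post)."""
    return [(code, f"{base}/{code}.html") for code in codes]


def to_csv(rows):
    """Render the rows as CSV text with a header line (header names are assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["code", "url"])
    writer.writerows(rows)
    return buf.getvalue()


rows = build_url_rows(["B200", "C300"])
print(to_csv(rows))
```

Writing the returned text to product-corrupt-desc.csv gives a file that can be fed to whatever scraper is used.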

          We then used this list of corrupted product descriptions to scrape the affected product pages with a data-scraping app for Chrome called Data Miner (amazing), which built us new files (e.g., Product Code, Product Title, and Product Description without HTML codes, plus a separate file with HTML codes). With the exception of looking for a great tool to scrape affected pages on our site frontend, everything else came together fast and easy up to here.

          The problem then became finding another batch tool that would surgically remove unwanted Word formatting codes while retaining others, so I can import them back into Miva. That is where I am currently stuck. I found some hit-and-miss tools online, and others that work great, but no batch solution.
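
One possible batch approach, sketched in Python with only the standard library: parse each description and re-emit only an allow-list of tags, dropping everything else. The allow-list, the class name `WordCleaner`, and the choice to discard all attributes are assumptions for illustration, not something a poster here has confirmed works against Miva data; test it on a single product first.

```python
from html.parser import HTMLParser

ALLOWED = {"p", "br", "ul", "ol", "li", "b", "i", "strong", "em", "h3"}  # assumed allow-list
VOID = {"br"}                     # tags that take no closing tag
SKIP = {"xml", "style", "script"}  # drop these tags together with their contents


class WordCleaner(HTMLParser):
    """Re-emit allowed tags (without attributes) and text; drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes such as class/style are dropped

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def clean_description(html_text):
    """Return html_text with Word markup removed and allowed tags kept."""
    parser = WordCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = ('<p class="MsoNormal" style="mso-margin-top-alt:auto">Soft &amp; warm'
         '<o:p></o:p></p><!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]--><br>')
print(clean_description(dirty))  # <p>Soft &amp; warm</p><br>
```

Running `clean_description` over every row of the scraped file would give the batch behavior. Comments, including Word's conditional comments, are dropped because the parser's default comment handler ignores them.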

          Suggestions anyone?
          Thank you, Bill Davis



            #6
            Are you including HTML formatting in your descriptions? We banned 90% of it in our store from our admin, and everything is handled via CSS. I occasionally search the database for codes we don't allow.

            We also require that smart quotes are turned off in Excel and Word if people like to work in those tools. Turning that off keeps the weird junk out.

            There are lists of those Microsoft codes that will allow you to search your database and replace them with an ASCII character that works better.
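
As a rough illustration of that kind of replacement, here is a short Python sketch that maps a few typographic characters Word tends to insert onto ASCII equivalents. The mapping is a small assumed sample, not one of the published lists mentioned above, and `to_ascii_punctuation` is a made-up name.

```python
# Assumed sample mapping of Word-style punctuation to ASCII stand-ins.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
})


def to_ascii_punctuation(text):
    """Replace smart quotes and similar characters with plain ASCII."""
    return text.translate(SMART_TO_ASCII)


sample = "\u201cFits 1\u20132 people\u201d \u2026 it\u2019s great"
print(to_ascii_punctuation(sample))  # "Fits 1-2 people" ... it's great
```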

            We have non-HTML developers who were trying to make the descriptions look "nice". But nice buggers things up.

            So now we have an allowed list of tags they can use.

            Simplifying what the data entry peeps are allowed to do has really improved how our site looks in foreign languages (Google Translate), etc. And since we provide our product data to our resellers, it sure has reduced their complaints.

            Imagine a wholesaler giving you a bunch of garbage that looks horrible in your store... we did that until we got our data entry under control.

            Now we can make it look awesome with CSS, and our customers can make it look completely different.

            We do not allow font/size/underline/etc. For lists, they are not allowed to use anything but a <BR> and an HTML entity like cdot.
            h3 for subheadings.
            And I don't allow paragraph tags; I only allow <BR> tags. They have to use <BR><BR> if they want to make it look like a paragraph break.
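
A check like the occasional database search described above might look like this in Python. It is only a sketch: the allow-list is inferred from this post (<BR> and h3), and `disallowed_tags` is a hypothetical helper name.

```python
import re

ALLOWED_TAGS = {"br", "h3"}  # inferred from the rules in this post
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9:]*)")  # captures opening and closing tag names


def disallowed_tags(description):
    """Return the sorted set of tag names used that are not on the allow-list."""
    used = {name.lower() for name in TAG_RE.findall(description)}
    return sorted(used - ALLOWED_TAGS)


desc = '<h3>Specs</h3>Line one<BR><BR><p><font size="2">x</font></p>'
print(disallowed_tags(desc))  # ['font', 'p']
```

An empty list means the description uses only allowed tags, so anything else can be sent back to whoever entered it.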

            And if they mess up, I make them fix it. If I fix it all the time, they rely on me, and that takes up valuable time I could spend on other things.



              #7
              Oh another thought...

              Make sure you have a codeset (character set) defined in your global header.
              Do a search on this forum for smartquotes.
              You will get a ton of info, and those older posts will answer a lot of your questions.
