Diacritic Character removal. Those strange non text characters...

  • RayYates
    replied
    For posterity, I've written a JavaScript function that converts Unicode and diacritic characters to readable content.

    Code:
    function utoh(text, min = 127, max = 255) {
        /*     Ray Yates 3/2022
    
        utoh() = Unicode to Html
            Translate diacritic and unicode characters to printable characters.
            Diacritic like 'áàâäãéèëêíìïîóòöôõúùüûñçăşţ' etc.
            and Unicode like ©
    
        Given a string that contains special characters,
        converts characters to plain text or HTML code (hex version).
        Examples:
            á becomes a, ç becomes c
            \u0092 becomes &#x92; (displays the curled single quote ’)
            \u00A9 becomes &#xA9; (displays the copyright symbol ©)
    
        Usage:
            1.    let domObject = document.querySelector("#tab-Products");
                domObject.innerHTML = utoh("&mvt:product:descrip;");
    
            2.    $("#tab-Products").html( utoh(data.description) );
    
            3.    $("#tab-Products").html( utoh(data.description, 150, 160) );
                Limits the range of Unicode characters to replace.
    
        Parameters:
            text: the string to clean up.
            min, max: optional range of Unicode code points to replace.
            If omitted, they default to 127 and 255.
    
        See: https://www.htmlsymbols.xyz/unicode for unicode character set.
        */
    
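        // NFKD splits an accented letter into base letter + combining mark.
        // Dropping the marks (U+0300 to U+036F) is what turns á into a.
        let norm_text = text.normalize("NFKD").replace(/[\u0300-\u036f]/g, "");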
        for (let index = min; index <= max; index++) {
            norm_text = norm_text.replaceAll( String.fromCodePoint(index), `&#x${index.toString(16).toUpperCase()};` );
        }
        return norm_text;
    }
    Last edited by RayYates; 03-30-22, 07:48 AM.
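
The function above does two jobs: it folds accented letters to their base letters, and it turns leftover special characters into HTML hex entities. As a rough sketch of the same idea with the two steps spelled out, the variant below strips combining marks after NFKD normalization and then encodes whatever non-ASCII remains in a single regex pass instead of looping over a code point range. The name `toAsciiHtml` is hypothetical and not part of the original post, and the choice to keep tab, CR, and LF is an assumption.

```javascript
// Hypothetical helper, not from the original post.
function toAsciiHtml(text) {
    return text
        // Split accented letters into base letter + combining mark(s).
        .normalize("NFKD")
        // Drop every combining mark (Unicode general category M).
        .replace(/\p{M}/gu, "")
        // Encode anything that is not printable ASCII, tab, CR, or LF
        // as an HTML hex entity. The u flag makes the class match whole
        // code points, so characters outside the BMP become one entity.
        .replace(/[^\x20-\x7E\t\r\n]/gu,
            (ch) => `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`);
}

console.log(toAsciiHtml("Crème Brulée ©")); // "Creme Brulee &#xA9;"
```

Note that this sketch does not escape `&`, `<`, or `>`, so it is only suitable for text that is already meant to be treated as HTML.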


  • dcarver
    replied
    You could iterate the string and use the "isprint" builtin to remove non-printable characters that way. However, that will not replace the characters with their closest "ascii" character. If the text is UTF-8, you should be able to detect a multi-byte sequence and know how many bytes to remove too.
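
The byte-level approach described above can be sketched as follows. This is written in JavaScript for illustration only, since the thread's other examples use it; it is not Miva Script, and the function names `utf8SequenceLength` and `stripNonPrintableBytes` are hypothetical. It assumes well-formed UTF-8 input: the lead byte of each sequence tells you how many bytes belong to the character, so the whole sequence can be skipped (or replaced once) rather than byte by byte.

```javascript
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with this lead byte. Stray continuation or invalid bytes count as 1.
function utf8SequenceLength(leadByte) {
    if (leadByte < 0x80) return 1;            // 0xxxxxxx: single-byte ASCII
    if ((leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if ((leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if ((leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 1;                                 // continuation or invalid byte
}

// Keep printable ASCII (0x20 to 0x7E). Every control byte or multi-byte
// sequence is replaced, as a unit, by `replacement` (default: removed).
function stripNonPrintableBytes(bytes, replacement = "") {
    let out = "";
    let i = 0;
    while (i < bytes.length) {
        const b = bytes[i];
        out += (b >= 0x20 && b <= 0x7E) ? String.fromCharCode(b) : replacement;
        i += utf8SequenceLength(b);
    }
    return out;
}

const input = new TextEncoder().encode("Crème\tBrulée");
console.log(stripNonPrintableBytes(input));      // "CrmeBrule"
console.log(stripNonPrintableBytes(input, "?")); // "Cr?me?Brul?e"
```

As the reply notes, this removes characters rather than mapping them to a closest ASCII equivalent, which is why "Crème" loses its "è" entirely.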


  • Kent Multer
    replied
    Just out of curiosity, why do you want to remove them? Any business that wants to support a language other than English may need them. Are they causing any specific problems?


  • Diacritic Character removal. Those strange non text characters...

    Has anyone ever come up with a universal way to remove all control and special characters from strings in Miva Script?

    Removing characters below char 32 and above char 127 does not work because some, but not all, of the codes are two bytes long.

    Every example I find uses regular expressions or functions built into the language.

    Code:
    // JavaScript examples:
    
    const str = "Crème Brulée";
    str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
    // → "Creme Brulee"
    
    // Or the more modern version:
    str.normalize("NFD").replace(/\p{Diacritic}/gu, "");
    Last edited by RayYates; 07-27-21, 06:21 AM.
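
The varying byte lengths mentioned in the question can be seen directly by encoding single characters as UTF-8. The snippet below is only an illustration in JavaScript (the helper name `utf8ByteLength` is hypothetical); it assumes the strings in question are UTF-8 encoded, which the thread does not state outright.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: how many UTF-8 bytes one character occupies.
function utf8ByteLength(ch) {
    return encoder.encode(ch).length;
}

for (const ch of ["e", "é", "€"]) {
    console.log(ch, utf8ByteLength(ch));
}
// e 1
// é 2
// € 3
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a filter that works on single bytes sees "é" as two separate out-of-range values rather than one character.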