{ "data": { "question": { "questionId": "393", "questionFrontendId": "393", "boundTopicId": null, "title": "UTF-8 Validation", "titleSlug": "utf-8-validation", "content": "

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding.

\n\n

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

\n\n
    \n\t
  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. \n\t
  3. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
  4. \n
\n\n

This is how the UTF-8 encoding would work:

\n\n
\n   Char. number range  |        UTF-8 octet sequence\n      (hexadecimal)    |              (binary)\n   --------------------+---------------------------------------------\n   0000 0000-0000 007F | 0xxxxxxx\n   0000 0080-0000 07FF | 110xxxxx 10xxxxxx\n   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx\n   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx\n
\n\n

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

\n\n

 

\n

Example 1:

\n\n
\nInput: data = [197,130,1]\nOutput: true\nExplanation: data represents the octet sequence: 11000101 10000010 00000001.\nIt is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.\n
\n\n

Example 2:

\n\n
\nInput: data = [235,140,4]\nOutput: false\nExplanation: data represented the octet sequence: 11101011 10001100 00000100.\nThe first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.\nThe next byte is a continuation byte which starts with 10 and that's correct.\nBut the second continuation byte does not start with 10, so it is invalid.\n
\n\n

 

\n

Constraints:

\n\n\n", "translatedTitle": null, "translatedContent": null, "isPaidOnly": false, "difficulty": "Medium", "likes": 364, "dislikes": 1500, "isLiked": null, "similarQuestions": "[]", "exampleTestcases": "[197,130,1]\n[235,140,4]", "categoryTitle": "Algorithms", "contributors": [], "topicTags": [ { "name": "Array", "slug": "array", "translatedName": null, "__typename": "TopicTagNode" }, { "name": "Bit Manipulation", "slug": "bit-manipulation", "translatedName": null, "__typename": "TopicTagNode" } ], "companyTagStats": null, "codeSnippets": [ { "lang": "C++", "langSlug": "cpp", "code": "class Solution {\npublic:\n bool validUtf8(vector& data) {\n \n }\n};", "__typename": "CodeSnippetNode" }, { "lang": "Java", "langSlug": "java", "code": "class Solution {\n public boolean validUtf8(int[] data) {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Python", "langSlug": "python", "code": "class Solution(object):\n def validUtf8(self, data):\n \"\"\"\n :type data: List[int]\n :rtype: bool\n \"\"\"\n ", "__typename": "CodeSnippetNode" }, { "lang": "Python3", "langSlug": "python3", "code": "class Solution:\n def validUtf8(self, data: List[int]) -> bool:\n ", "__typename": "CodeSnippetNode" }, { "lang": "C", "langSlug": "c", "code": "\n\nbool validUtf8(int* data, int dataSize){\n\n}", "__typename": "CodeSnippetNode" }, { "lang": "C#", "langSlug": "csharp", "code": "public class Solution {\n public bool ValidUtf8(int[] data) {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "JavaScript", "langSlug": "javascript", "code": "/**\n * @param {number[]} data\n * @return {boolean}\n */\nvar validUtf8 = function(data) {\n \n};", "__typename": "CodeSnippetNode" }, { "lang": "Ruby", "langSlug": "ruby", "code": "# @param {Integer[]} data\n# @return {Boolean}\ndef valid_utf8(data)\n \nend", "__typename": "CodeSnippetNode" }, { "lang": "Swift", "langSlug": "swift", "code": "class Solution {\n func validUtf8(_ data: [Int]) -> Bool {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Go", "langSlug": "golang", "code": "func validUtf8(data []int) bool {\n \n}", "__typename": "CodeSnippetNode" }, { "lang": "Scala", "langSlug": "scala", "code": "object Solution {\n def validUtf8(data: Array[Int]): Boolean = {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Kotlin", "langSlug": "kotlin", "code": "class Solution {\n fun validUtf8(data: IntArray): Boolean {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Rust", "langSlug": "rust", "code": "impl Solution {\n pub fn valid_utf8(data: Vec) -> bool {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "PHP", "langSlug": "php", "code": "class Solution {\n\n /**\n * @param Integer[] $data\n * @return Boolean\n */\n function validUtf8($data) {\n \n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "TypeScript", "langSlug": "typescript", "code": "function validUtf8(data: number[]): boolean {\n\n};", "__typename": "CodeSnippetNode" }, { "lang": "Racket", "langSlug": "racket", "code": "(define/contract (valid-utf8 data)\n (-> (listof exact-integer?) boolean?)\n\n )", "__typename": "CodeSnippetNode" }, { "lang": "Erlang", "langSlug": "erlang", "code": "-spec valid_utf8(Data :: [integer()]) -> boolean().\nvalid_utf8(Data) ->\n .", "__typename": "CodeSnippetNode" }, { "lang": "Elixir", "langSlug": "elixir", "code": "defmodule Solution do\n @spec valid_utf8(data :: [integer]) :: boolean\n def valid_utf8(data) do\n\n end\nend", "__typename": "CodeSnippetNode" } ], "stats": "{\"totalAccepted\": \"66.7K\", \"totalSubmission\": \"170.5K\", \"totalAcceptedRaw\": 66689, \"totalSubmissionRaw\": 170487, \"acRate\": \"39.1%\"}", "hints": [ "All you have to do is follow the rules. For a given integer, obtain its binary representation in the string form and work with the rules given in the problem.", "An integer can either represent the start of a UTF-8 character, or a part of an existing UTF-8 character. There are two separate rules for these two scenarios in the problem.", "If an integer is a part of an existing UTF-8 character, simply check the 2 most significant bits of in the binary representation string. They should be 10. If the integer represents the start of a UTF-8 character, then the first few bits would be 1 followed by a 0. The number of initial bits (most significant) bits determines the length of the UTF-8 character. \r\n\r\n

\r\nNote: The array can contain multiple valid UTF-8 characters.", "String manipulation will work fine here. But, it is too slow. Can we instead use bit manipulation to do the validations instead of string manipulations?", "We can use bit masking to check how many initial bits are set for a given number. We only need to work with the 8 least significant bits as mentioned in the problem.\r\n\r\n
\r\nmask = 1 << 7\r\nwhile mask & num:\r\n    n_bytes += 1\r\n    mask = mask >> 1\r\n
\r\n\r\nCan you use bit-masking to perform the second validation as well i.e. checking if the most significant bit is 1 and the second most significant bit a 0?", "To check if the most significant bit is a 1 and the second most significant bit is a 0, we can make use of the following two masks.\r\n\r\n
\r\nmask1 = 1 << 7\r\nmask2 = 1 << 6\r\n\r\nif not (num & mask1 and not (num & mask2)):\r\n    return False\r\n
" ], "solution": { "id": "596", "canSeeDetail": true, "paidOnly": false, "hasVideoSolution": false, "paidOnlyVideo": true, "__typename": "ArticleNode" }, "status": null, "sampleTestCase": "[197,130,1]", "metaData": "{\r\n \"name\": \"validUtf8\",\r\n \"params\": [\r\n {\r\n \"name\": \"data\",\r\n \"type\": \"integer[]\"\r\n }\r\n ],\r\n \"return\": {\r\n \"type\": \"boolean\"\r\n }\r\n}", "judgerAvailable": true, "judgeType": "large", "mysqlSchemas": [], "enableRunCode": true, "enableTestMode": false, "enableDebugger": true, "envInfo": "{\"cpp\": [\"C++\", \"

Compiled with clang 11 using the latest C++ 17 standard.

\\r\\n\\r\\n

Your code is compiled with level two optimization (-O2). AddressSanitizer is also enabled to help detect out-of-bounds and use-after-free bugs.

\\r\\n\\r\\n

Most standard library headers are already included automatically for your convenience.

\"], \"java\": [\"Java\", \"

OpenJDK 17 . Java 8 features such as lambda expressions and stream API can be used.

\\r\\n\\r\\n

Most standard library headers are already included automatically for your convenience.

\\r\\n

Includes Pair class from https://docs.oracle.com/javase/8/javafx/api/javafx/util/Pair.html.

\"], \"python\": [\"Python\", \"

Python 2.7.12.

\\r\\n\\r\\n

Most libraries are already imported automatically for your convenience, such as array, bisect, collections. If you need more libraries, you can import it yourself.

\\r\\n\\r\\n

For Map/TreeMap data structure, you may use sortedcontainers library.

\\r\\n\\r\\n

Note that Python 2.7 will not be maintained past 2020. For the latest Python, please choose Python3 instead.

\"], \"c\": [\"C\", \"

Compiled with gcc 8.2 using the gnu99 standard.

\\r\\n\\r\\n

Your code is compiled with level one optimization (-O1). AddressSanitizer is also enabled to help detect out-of-bounds and use-after-free bugs.

\\r\\n\\r\\n

Most standard library headers are already included automatically for your convenience.

\\r\\n\\r\\n

For hash table operations, you may use uthash. \\\"uthash.h\\\" is included by default. Below are some examples:

\\r\\n\\r\\n

1. Adding an item to a hash.\\r\\n

\\r\\nstruct hash_entry {\\r\\n    int id;            /* we'll use this field as the key */\\r\\n    char name[10];\\r\\n    UT_hash_handle hh; /* makes this structure hashable */\\r\\n};\\r\\n\\r\\nstruct hash_entry *users = NULL;\\r\\n\\r\\nvoid add_user(struct hash_entry *s) {\\r\\n    HASH_ADD_INT(users, id, s);\\r\\n}\\r\\n
\\r\\n

\\r\\n\\r\\n

2. Looking up an item in a hash:\\r\\n

\\r\\nstruct hash_entry *find_user(int user_id) {\\r\\n    struct hash_entry *s;\\r\\n    HASH_FIND_INT(users, &user_id, s);\\r\\n    return s;\\r\\n}\\r\\n
\\r\\n

\\r\\n\\r\\n

3. Deleting an item in a hash:\\r\\n

\\r\\nvoid delete_user(struct hash_entry *user) {\\r\\n    HASH_DEL(users, user);  \\r\\n}\\r\\n
\\r\\n

\"], \"csharp\": [\"C#\", \"

C# 10 with .NET 6 runtime

\\r\\n\\r\\n

Your code is compiled with debug flag enabled (/debug).

\"], \"javascript\": [\"JavaScript\", \"

Node.js 16.13.2.

\\r\\n\\r\\n

Your code is run with --harmony flag, enabling new ES6 features.

\\r\\n\\r\\n

lodash.js library is included by default.

\\r\\n\\r\\n

For Priority Queue / Queue data structures, you may use datastructures-js/priority-queue and datastructures-js/queue.

\"], \"ruby\": [\"Ruby\", \"

Ruby 3.1

\\r\\n\\r\\n

Some common data structure implementations are provided in the Algorithms module: https://www.rubydoc.info/github/kanwei/algorithms/Algorithms

\"], \"swift\": [\"Swift\", \"

Swift 5.5.2.

\"], \"golang\": [\"Go\", \"

Go 1.17.6.

\\r\\n\\r\\n

Support https://godoc.org/github.com/emirpasic/gods library.

\"], \"python3\": [\"Python3\", \"

Python 3.10.

\\r\\n\\r\\n

Most libraries are already imported automatically for your convenience, such as array, bisect, collections. If you need more libraries, you can import it yourself.

\\r\\n\\r\\n

For Map/TreeMap data structure, you may use sortedcontainers library.

\"], \"scala\": [\"Scala\", \"

Scala 2.13.7.

\"], \"kotlin\": [\"Kotlin\", \"

Kotlin 1.3.10.

\"], \"rust\": [\"Rust\", \"

Rust 1.58.1

\\r\\n\\r\\n

Supports rand v0.6\\u00a0from crates.io

\"], \"php\": [\"PHP\", \"

PHP 8.1.

\\r\\n

With bcmath module

\"], \"typescript\": [\"Typescript\", \"

TypeScript 4.5.4, Node.js 16.13.2.

\\r\\n\\r\\n

Your code is run with --harmony flag, enabling new ES2020 features.

\\r\\n\\r\\n

lodash.js library is included by default.

\"], \"racket\": [\"Racket\", \"

Run with Racket 8.3.

\"], \"erlang\": [\"Erlang\", \"Erlang/OTP 24.2\"], \"elixir\": [\"Elixir\", \"Elixir 1.13.0 with Erlang/OTP 24.2\"]}", "libraryUrl": null, "adminUrl": null, "challengeQuestion": null, "__typename": "QuestionNode" } } }