{ "data": { "question": { "questionId": "393", "questionFrontendId": "393", "categoryTitle": "Algorithms", "boundTopicId": 1862, "title": "UTF-8 Validation", "titleSlug": "utf-8-validation", "content": "

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

\n\n

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

\n\n
    \n\t
  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. \n\t
  3. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
  4. \n
\n\n

This is how the UTF-8 encoding would work:

\n\n
\n     Number of Bytes   |        UTF-8 Octet Sequence\n                       |              (binary)\n   --------------------+-----------------------------------------\n            1          |   0xxxxxxx\n            2          |   110xxxxx 10xxxxxx\n            3          |   1110xxxx 10xxxxxx 10xxxxxx\n            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx\n
\n\n

x denotes a bit in the binary form of a byte that may be either 0 or 1.

\n\n

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

\n\n

 

\n

Example 1:

\n\n
\nInput: data = [197,130,1]\nOutput: true\nExplanation: data represents the octet sequence: 11000101 10000010 00000001.\nIt is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.\n
\n\n

Example 2:

\n\n
\nInput: data = [235,140,4]\nOutput: false\nExplanation: data represented the octet sequence: 11101011 10001100 00000100.\nThe first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.\nThe next byte is a continuation byte which starts with 10 and that's correct.\nBut the second continuation byte does not start with 10, so it is invalid.\n
\n\n

 

\n

Constraints:

\n\n\n", "translatedTitle": "UTF-8 编码验证", "translatedContent": "

给定一个表示数据的整数数组 data ,返回它是否为有效的 UTF-8 编码。

\n\n

UTF-8 中的一个字符可能的长度为 1 到 4 字节,遵循以下的规则:

\n\n
    \n\t
  1. 对于 1 字节 的字符,字节的第一位设为 0 ,后面 7 位为这个符号的 unicode 码。
  2. \n\t
  3. 对于 n 字节 的字符 (n > 1),第一个字节的前 n 位都设为1,第 n+1 位设为 0 ,后面字节的前两位一律设为 10 。剩下的没有提及的二进制位,全部为这个符号的 unicode 码。
  4. \n
\n\n

这是 UTF-8 编码的工作方式:

\n\n
\n   Char. number range  |        UTF-8 octet sequence\n      (hexadecimal)    |              (binary)\n   --------------------+---------------------------------------------\n   0000 0000-0000 007F | 0xxxxxxx\n   0000 0080-0000 07FF | 110xxxxx 10xxxxxx\n   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx\n   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx\n
\n\n

注意:输入是整数数组。只有每个整数的 最低 8 个有效位 用来存储数据。这意味着每个整数只表示 1 字节的数据。

\n\n

 

\n\n

示例 1:

\n\n
\n输入:data = [197,130,1]\n输出:true\n解释:数据表示字节序列:11000101 10000010 00000001。\n这是有效的 utf-8 编码,为一个 2 字节字符,跟着一个 1 字节字符。\n
\n\n

示例 2:

\n\n
\n输入:data = [235,140,4]\n输出:false\n解释:数据表示 8 位的序列: 11101011 10001100 00000100.\n前 3 位都是 1 ,第 4 位为 0 表示它是一个 3 字节字符。\n下一个字节是开头为 10 的延续字节,这是正确的。\n但第二个延续字节不以 10 开头,所以是不符合规则的。\n
\n\n

 

\n\n

提示:

\n\n\n", "isPaidOnly": false, "difficulty": "Medium", "likes": 169, "dislikes": 0, "isLiked": null, "similarQuestions": "[]", "contributors": [], "langToValidPlayground": "{\"cpp\": true, \"java\": true, \"python\": true, \"python3\": true, \"mysql\": false, \"mssql\": false, \"oraclesql\": false, \"c\": false, \"csharp\": false, \"javascript\": false, \"ruby\": false, \"bash\": false, \"swift\": false, \"golang\": false, \"scala\": false, \"html\": false, \"pythonml\": false, \"kotlin\": false, \"rust\": false, \"php\": false, \"typescript\": false, \"racket\": false, \"erlang\": false, \"elixir\": false}", "topicTags": [ { "name": "Bit Manipulation", "slug": "bit-manipulation", "translatedName": "位运算", "__typename": "TopicTagNode" }, { "name": "Array", "slug": "array", "translatedName": "数组", "__typename": "TopicTagNode" } ], "companyTagStats": null, "codeSnippets": [ { "lang": "C++", "langSlug": "cpp", "code": "class Solution {\npublic:\n bool validUtf8(vector& data) {\n\n }\n};", "__typename": "CodeSnippetNode" }, { "lang": "Java", "langSlug": "java", "code": "class Solution {\n public boolean validUtf8(int[] data) {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Python", "langSlug": "python", "code": "class Solution(object):\n def validUtf8(self, data):\n \"\"\"\n :type data: List[int]\n :rtype: bool\n \"\"\"", "__typename": "CodeSnippetNode" }, { "lang": "Python3", "langSlug": "python3", "code": "class Solution:\n def validUtf8(self, data: List[int]) -> bool:", "__typename": "CodeSnippetNode" }, { "lang": "C", "langSlug": "c", "code": "\n\nbool validUtf8(int* data, int dataSize){\n\n}", "__typename": "CodeSnippetNode" }, { "lang": "C#", "langSlug": "csharp", "code": "public class Solution {\n public bool ValidUtf8(int[] data) {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "JavaScript", "langSlug": "javascript", "code": "/**\n * @param {number[]} data\n * @return {boolean}\n */\nvar validUtf8 = function(data) {\n\n};", "__typename": "CodeSnippetNode" }, { "lang": "Ruby", "langSlug": "ruby", "code": "# @param {Integer[]} data\n# @return {Boolean}\ndef valid_utf8(data)\n\nend", "__typename": "CodeSnippetNode" }, { "lang": "Swift", "langSlug": "swift", "code": "class Solution {\n func validUtf8(_ data: [Int]) -> Bool {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Go", "langSlug": "golang", "code": "func validUtf8(data []int) bool {\n\n}", "__typename": "CodeSnippetNode" }, { "lang": "Scala", "langSlug": "scala", "code": "object Solution {\n def validUtf8(data: Array[Int]): Boolean = {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Kotlin", "langSlug": "kotlin", "code": "class Solution {\n fun validUtf8(data: IntArray): Boolean {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "Rust", "langSlug": "rust", "code": "impl Solution {\n pub fn valid_utf8(data: Vec) -> bool {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "PHP", "langSlug": "php", "code": "class Solution {\n\n /**\n * @param Integer[] $data\n * @return Boolean\n */\n function validUtf8($data) {\n\n }\n}", "__typename": "CodeSnippetNode" }, { "lang": "TypeScript", "langSlug": "typescript", "code": "function validUtf8(data: number[]): boolean {\n\n};", "__typename": "CodeSnippetNode" }, { "lang": "Racket", "langSlug": "racket", "code": "(define/contract (valid-utf8 data)\n (-> (listof exact-integer?) boolean?)\n\n )", "__typename": "CodeSnippetNode" }, { "lang": "Erlang", "langSlug": "erlang", "code": "-spec valid_utf8(Data :: [integer()]) -> boolean().\nvalid_utf8(Data) ->\n .", "__typename": "CodeSnippetNode" }, { "lang": "Elixir", "langSlug": "elixir", "code": "defmodule Solution do\n @spec valid_utf8(data :: [integer]) :: boolean\n def valid_utf8(data) do\n\n end\nend", "__typename": "CodeSnippetNode" } ], "stats": "{\"totalAccepted\": \"34.2K\", \"totalSubmission\": \"78K\", \"totalAcceptedRaw\": 34245, \"totalSubmissionRaw\": 78008, \"acRate\": \"43.9%\"}", "hints": [ "All you have to do is follow the rules. For a given integer, obtain its binary representation in the string form and work with the rules given in the problem.", "An integer can either represent the start of a UTF-8 character, or a part of an existing UTF-8 character. There are two separate rules for these two scenarios in the problem.", "If an integer is a part of an existing UTF-8 character, simply check the 2 most significant bits of in the binary representation string. They should be 10. If the integer represents the start of a UTF-8 character, then the first few bits would be 1 followed by a 0. The number of initial bits (most significant) bits determines the length of the UTF-8 character. \r\n\r\n

\r\nNote: The array can contain multiple valid UTF-8 characters.", "String manipulation will work fine here. But, it is too slow. Can we instead use bit manipulation to do the validations instead of string manipulations?", "We can use bit masking to check how many initial bits are set for a given number. We only need to work with the 8 least significant bits as mentioned in the problem.\r\n\r\n
\r\nmask = 1 << 7\r\nwhile mask & num:\r\n    n_bytes += 1\r\n    mask = mask >> 1\r\n
\r\n\r\nCan you use bit-masking to perform the second validation as well i.e. checking if the most significant bit is 1 and the second most significant bit a 0?", "To check if the most significant bit is a 1 and the second most significant bit is a 0, we can make use of the following two masks.\r\n\r\n
\r\nmask1 = 1 << 7\r\nmask2 = 1 << 6\r\n\r\nif not (num & mask1 and not (num & mask2)):\r\n    return False\r\n
" ], "solution": null, "status": null, "sampleTestCase": "[197,130,1]", "metaData": "{\r\n \"name\": \"validUtf8\",\r\n \"params\": [\r\n {\r\n \"name\": \"data\",\r\n \"type\": \"integer[]\"\r\n }\r\n ],\r\n \"return\": {\r\n \"type\": \"boolean\"\r\n }\r\n}", "judgerAvailable": true, "judgeType": "large", "mysqlSchemas": [], "enableRunCode": true, "envInfo": "{\"cpp\":[\"C++\",\"

\\u7248\\u672c\\uff1aclang 11<\\/code> \\u91c7\\u7528\\u6700\\u65b0C++ 17\\u6807\\u51c6\\u3002<\\/p>\\r\\n\\r\\n

\\u7f16\\u8bd1\\u65f6\\uff0c\\u5c06\\u4f1a\\u91c7\\u7528-O2<\\/code>\\u7ea7\\u4f18\\u5316\\u3002AddressSanitizer<\\/a> \\u4e5f\\u88ab\\u5f00\\u542f\\u6765\\u68c0\\u6d4bout-of-bounds<\\/code>\\u548cuse-after-free<\\/code>\\u9519\\u8bef\\u3002<\\/p>\\r\\n\\r\\n

\\u4e3a\\u4e86\\u4f7f\\u7528\\u65b9\\u4fbf\\uff0c\\u5927\\u90e8\\u5206\\u6807\\u51c6\\u5e93\\u7684\\u5934\\u6587\\u4ef6\\u5df2\\u7ecf\\u88ab\\u81ea\\u52a8\\u5bfc\\u5165\\u3002<\\/p>\"],\"java\":[\"Java\",\"

\\u7248\\u672c\\uff1aOpenJDK 17<\\/code>\\u3002\\u53ef\\u4ee5\\u4f7f\\u7528Java 8\\u7684\\u7279\\u6027\\u4f8b\\u5982\\uff0clambda expressions \\u548c stream API\\u3002<\\/p>\\r\\n\\r\\n

\\u4e3a\\u4e86\\u65b9\\u4fbf\\u8d77\\u89c1\\uff0c\\u5927\\u90e8\\u5206\\u6807\\u51c6\\u5e93\\u7684\\u5934\\u6587\\u4ef6\\u5df2\\u88ab\\u5bfc\\u5165\\u3002<\\/p>\\r\\n\\r\\n

\\u5305\\u542b Pair \\u7c7b: https:\\/\\/docs.oracle.com\\/javase\\/8\\/javafx\\/api\\/javafx\\/util\\/Pair.html <\\/p>\"],\"python\":[\"Python\",\"

\\u7248\\u672c\\uff1a Python 2.7.12<\\/code><\\/p>\\r\\n\\r\\n

\\u4e3a\\u4e86\\u65b9\\u4fbf\\u8d77\\u89c1\\uff0c\\u5927\\u90e8\\u5206\\u5e38\\u7528\\u5e93\\u5df2\\u7ecf\\u88ab\\u81ea\\u52a8 \\u5bfc\\u5165\\uff0c\\u5982\\uff1aarray<\\/a>, bisect<\\/a>, collections<\\/a>\\u3002\\u5982\\u679c\\u60a8\\u9700\\u8981\\u4f7f\\u7528\\u5176\\u4ed6\\u5e93\\u51fd\\u6570\\uff0c\\u8bf7\\u81ea\\u884c\\u5bfc\\u5165\\u3002<\\/p>\\r\\n\\r\\n

\\u6ce8\\u610f Python 2.7 \\u5c06\\u57282020\\u5e74\\u540e\\u4e0d\\u518d\\u7ef4\\u62a4<\\/a>\\u3002 \\u5982\\u60f3\\u4f7f\\u7528\\u6700\\u65b0\\u7248\\u7684Python\\uff0c\\u8bf7\\u9009\\u62e9Python 3\\u3002<\\/p>\"],\"c\":[\"C\",\"

\\u7248\\u672c\\uff1aGCC 8.2<\\/code>\\uff0c\\u91c7\\u7528GNU99\\u6807\\u51c6\\u3002<\\/p>\\r\\n\\r\\n

\\u7f16\\u8bd1\\u65f6\\uff0c\\u5c06\\u4f1a\\u91c7\\u7528-O1<\\/code>\\u7ea7\\u4f18\\u5316\\u3002 AddressSanitizer<\\/a>\\u4e5f\\u88ab\\u5f00\\u542f\\u6765\\u68c0\\u6d4bout-of-bounds<\\/code>\\u548cuse-after-free<\\/code>\\u9519\\u8bef\\u3002<\\/p>\\r\\n\\r\\n

\\u4e3a\\u4e86\\u4f7f\\u7528\\u65b9\\u4fbf\\uff0c\\u5927\\u90e8\\u5206\\u6807\\u51c6\\u5e93\\u7684\\u5934\\u6587\\u4ef6\\u5df2\\u7ecf\\u88ab\\u81ea\\u52a8\\u5bfc\\u5165\\u3002<\\/p>\\r\\n\\r\\n

\\u5982\\u60f3\\u4f7f\\u7528\\u54c8\\u5e0c\\u8868\\u8fd0\\u7b97, \\u60a8\\u53ef\\u4ee5\\u4f7f\\u7528 uthash<\\/a>\\u3002 \\\"uthash.h\\\"\\u5df2\\u7ecf\\u9ed8\\u8ba4\\u88ab\\u5bfc\\u5165\\u3002\\u8bf7\\u770b\\u5982\\u4e0b\\u793a\\u4f8b:<\\/p>\\r\\n\\r\\n

1. \\u5f80\\u54c8\\u5e0c\\u8868\\u4e2d\\u6dfb\\u52a0\\u4e00\\u4e2a\\u5bf9\\u8c61\\uff1a<\\/b>\\r\\n

\\r\\nstruct hash_entry {\\r\\n    int id;            \\/* we'll use this field as the key *\\/\\r\\n    char name[10];\\r\\n    UT_hash_handle hh; \\/* makes this structure hashable *\\/\\r\\n};\\r\\n\\r\\nstruct hash_entry *users = NULL;\\r\\n\\r\\nvoid add_user(struct hash_entry *s) {\\r\\n    HASH_ADD_INT(users, id, s);\\r\\n}\\r\\n<\\/pre>\\r\\n<\\/p>\\r\\n\\r\\n

2. \\u5728\\u54c8\\u5e0c\\u8868\\u4e2d\\u67e5\\u627e\\u4e00\\u4e2a\\u5bf9\\u8c61\\uff1a<\\/b>\\r\\n

\\r\\nstruct hash_entry *find_user(int user_id) {\\r\\n    struct hash_entry *s;\\r\\n    HASH_FIND_INT(users, &user_id, s);\\r\\n    return s;\\r\\n}\\r\\n<\\/pre>\\r\\n<\\/p>\\r\\n\\r\\n

3. \\u4ece\\u54c8\\u5e0c\\u8868\\u4e2d\\u5220\\u9664\\u4e00\\u4e2a\\u5bf9\\u8c61\\uff1a<\\/b>\\r\\n

\\r\\nvoid delete_user(struct hash_entry *user) {\\r\\n    HASH_DEL(users, user);  \\r\\n}\\r\\n<\\/pre>\\r\\n<\\/p>\"],\"csharp\":[\"C#\",\"

C# 10<\\/a> \\u8fd0\\u884c\\u5728 .NET 6 \\u4e0a<\\/p>\\r\\n\\r\\n

\\u60a8\\u7684\\u4ee3\\u7801\\u5728\\u7f16\\u8bd1\\u65f6\\u9ed8\\u8ba4\\u5f00\\u542f\\u4e86debug\\u6807\\u8bb0(\\/debug:pdbonly<\\/code>)\\u3002<\\/p>\"],\"javascript\":[\"JavaScript\",\"

\\u7248\\u672c\\uff1aNode.js 16.13.2<\\/code><\\/p>\\r\\n\\r\\n

\\u60a8\\u7684\\u4ee3\\u7801\\u5728\\u6267\\u884c\\u65f6\\u5c06\\u5e26\\u4e0a --harmony<\\/code> \\u6807\\u8bb0\\u6765\\u5f00\\u542f \\u65b0\\u7248ES6\\u7279\\u6027<\\/a>\\u3002<\\/p>\\r\\n\\r\\n

lodash.js<\\/a> \\u5e93\\u5df2\\u7ecf\\u9ed8\\u8ba4\\u88ab\\u5305\\u542b\\u3002<\\/p>\\r\\n\\r\\n

\\u5982\\u9700\\u4f7f\\u7528\\u961f\\u5217\\/\\u4f18\\u5148\\u961f\\u5217\\uff0c\\u60a8\\u53ef\\u4f7f\\u7528 datastructures-js\\/priority-queue<\\/a> \\u548c datastructures-js\\/queue<\\/a>\\u3002<\\/p>\"],\"ruby\":[\"Ruby\",\"

\\u4f7f\\u7528Ruby 3.1<\\/code>\\u6267\\u884c<\\/p>\\r\\n\\r\\n

\\u4e00\\u4e9b\\u5e38\\u7528\\u7684\\u6570\\u636e\\u7ed3\\u6784\\u5df2\\u5728 Algorithms \\u6a21\\u5757\\u4e2d\\u63d0\\u4f9b\\uff1ahttps:\\/\\/www.rubydoc.info\\/github\\/kanwei\\/algorithms\\/Algorithms<\\/p>\"],\"swift\":[\"Swift\",\"

\\u7248\\u672c\\uff1aSwift 5.5.2<\\/code><\\/p>\\r\\n\\r\\n

\\u6211\\u4eec\\u901a\\u5e38\\u4fdd\\u8bc1\\u66f4\\u65b0\\u5230 Apple\\u653e\\u51fa\\u7684\\u6700\\u65b0\\u7248Swift<\\/a>\\u3002\\u5982\\u679c\\u60a8\\u53d1\\u73b0Swift\\u4e0d\\u662f\\u6700\\u65b0\\u7248\\u7684\\uff0c\\u8bf7\\u8054\\u7cfb\\u6211\\u4eec\\uff01\\u6211\\u4eec\\u5c06\\u5c3d\\u5feb\\u66f4\\u65b0\\u3002<\\/p>\"],\"golang\":[\"Go\",\"

\\u7248\\u672c\\uff1aGo 1.17<\\/code><\\/p>\\r\\n\\r\\n

\\u652f\\u6301 https:\\/\\/godoc.org\\/github.com\\/emirpasic\\/gods<\\/a> \\u7b2c\\u4e09\\u65b9\\u5e93\\u3002<\\/p>\"],\"python3\":[\"Python3\",\"

\\u7248\\u672c\\uff1aPython 3.10<\\/code><\\/p>\\r\\n\\r\\n

\\u4e3a\\u4e86\\u65b9\\u4fbf\\u8d77\\u89c1\\uff0c\\u5927\\u90e8\\u5206\\u5e38\\u7528\\u5e93\\u5df2\\u7ecf\\u88ab\\u81ea\\u52a8 \\u5bfc\\u5165\\uff0c\\u5982array<\\/a>, bisect<\\/a>, collections<\\/a>\\u3002 \\u5982\\u679c\\u60a8\\u9700\\u8981\\u4f7f\\u7528\\u5176\\u4ed6\\u5e93\\u51fd\\u6570\\uff0c\\u8bf7\\u81ea\\u884c\\u5bfc\\u5165\\u3002<\\/p>\\r\\n\\r\\n

\\u5982\\u9700\\u4f7f\\u7528 Map\\/TreeMap \\u6570\\u636e\\u7ed3\\u6784\\uff0c\\u60a8\\u53ef\\u4f7f\\u7528 sortedcontainers<\\/a> \\u5e93\\u3002<\\/p>\"],\"scala\":[\"Scala\",\"

\\u7248\\u672c\\uff1aScala 2.13<\\/code><\\/p>\"],\"kotlin\":[\"Kotlin\",\"

\\u7248\\u672c\\uff1aKotlin 1.3.10<\\/code><\\/p>\"],\"rust\":[\"Rust\",\"

\\u7248\\u672c\\uff1arust 1.58.1<\\/code><\\/p>\\r\\n\\r\\n

\\u652f\\u6301 crates.io \\u7684 rand<\\/a><\\/p>\"],\"php\":[\"PHP\",\"

PHP 8.1<\\/code>.<\\/p>\\r\\n\\r\\n

With bcmath module.<\\/p>\"],\"typescript\":[\"TypeScript\",\"

TypeScript 4.5.4<\\/p>\\r\\n\\r\\n

Compile Options: --alwaysStrict --strictBindCallApply --strictFunctionTypes --target ES2020<\\/p>\"],\"racket\":[\"Racket\",\"

Racket CS<\\/a> v8.3<\\/p>\\r\\n\\r\\n

\\u4f7f\\u7528 #lang racket<\\/p>\\r\\n\\r\\n

\\u5df2\\u9884\\u5148 (require data\\/gvector data\\/queue data\\/order data\\/heap). \\u82e5\\u9700\\u4f7f\\u7528\\u5176\\u5b83\\u6570\\u636e\\u7ed3\\u6784\\uff0c\\u53ef\\u81ea\\u884c require\\u3002<\\/p>\"],\"erlang\":[\"Erlang\",\"Erlang\\/OTP 24.2\"],\"elixir\":[\"Elixir\",\"Elixir 1.13.0 with Erlang\\/OTP 24.2\"]}", "book": null, "isSubscribed": false, "isDailyQuestion": false, "dailyRecordStatus": null, "editorType": "CKEDITOR", "ugcQuestionId": null, "style": "LEETCODE", "exampleTestcases": "[197,130,1]\n[235,140,4]", "__typename": "QuestionNode" } } }