The problem is that there is more than one set of characters.
You have hiragana, katakana and kanji. The first two aren't too bad, but the third one, kanji, has thousands of characters. Then you need to take into account spoken language vs. written language, conversational language, and so forth.