utf-8 - Elixir: Counting Frequency of Words in Text File in Hangul Alphabet
I'm working with data written in Hangul. I have a word-frequency script that I have used with English .txt files, but the script fails when I pass it a UTF-8 .txt file containing Hangul characters. Specifically, it seems to read the Hangul characters as blank spaces. These are the results, stored in a .csv file:
, 290668
1, 2
2, 5
3d, 1
4, 1
55, 1
6, 1
6mm, 2
709, 2
710, 1
d, 1
j, 87
k, 1
m, 14
p, 19
pd100, 1
y, 1
Considering that the text in the file contains almost none of these characters, something is clearly wrong. How can I make the code read Hangul? Here is the current code:
defmodule WordFrequency do
  def wordcount(readfile) do
    readfile
    |> words
    |> count
    |> to_csv
  end

  defp words(file) do
    file
    |> File.stream!()
    |> Stream.map(&String.trim_trailing(&1))
    |> Stream.map(&String.split(&1, ~r{[^A-Za-z0-9_]}))
    |> Enum.to_list()
    |> List.flatten()
    |> Enum.map(&String.downcase(&1))
  end

  defp count(words) when is_list(words) do
    Enum.reduce(words, %{}, &update_count/2)
  end

  defp update_count(word, acc) do
    Map.update(acc, String.to_atom(word), 1, &(&1 + 1))
  end

  defp to_csv(map) do
    File.open("wordfreqkor.csv", [:write, :utf8], fn file ->
      Enum.each(map, &IO.write(file, Enum.join(Tuple.to_list(&1), ", ") <> "\n"))
    end)
  end
end

WordFrequency.wordcount("myfile.txt")
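For what it's worth, my guess is that the split step is where the Hangul disappears: the character class `[^A-Za-z0-9_]` only exempts ASCII letters, digits, and underscore, so every Hangul syllable counts as a separator and the split yields nothing but empty strings (which is why the blank "word" appears 290668 times). A minimal sketch comparing the ASCII-only pattern against a Unicode-aware one, using the `u` regex modifier and the `\p{L}` (any letter) and `\p{N}` (any number) property classes:

```elixir
# ASCII-only class: every Hangul syllable is treated as a separator,
# so splitting leaves only empty strings, which `trim: true` drops.
ascii_split = String.split("한국어 단어", ~r{[^A-Za-z0-9_]}, trim: true)
IO.inspect(ascii_split)  # []

# Unicode-aware class with the `u` modifier: Hangul words survive intact.
unicode_split = String.split("한국어 단어", ~r/[^\p{L}\p{N}_]+/u, trim: true)
IO.inspect(unicode_split)  # ["한국어", "단어"]
```

If that diagnosis is right, swapping `~r{[^a-zA-Z0-9_]}` for `~r/[^\p{L}\p{N}_]+/u` in `words/1` (and adding `trim: true` to discard empty pieces) should be enough; the file-reading side already handles UTF-8.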
Thanks for any advice!