utf-8 - Elixir: Counting Frequency of Words in Text File in Hangul Alphabet
I'm working with data written in Hangul. I have a word-frequency script that I have used with English .txt files, but the script fails when I pass it a UTF-8 .txt file containing Hangul characters. Specifically, it seems to read the Hangul characters as blank spaces. These are the results, stored in a .csv file:
, 290668
1, 2
2, 5
3d, 1
4, 1
55, 1
6, 1
6mm, 2
709, 2
710, 1
d, 1
j, 87
k, 1
m, 14
p, 19
pd100, 1
y, 1
Considering that the text in the file contains almost none of these characters, something is clearly wrong. How can I make the code read Hangul? Here is the current code:
defmodule WordFrequency do
  def wordcount(readfile) do
    readfile
    |> words
    |> count
    |> to_csv
  end

  defp words(file) do
    file
    |> File.stream!()
    |> Stream.map(&String.trim_trailing(&1))
    |> Stream.map(&String.split(&1, ~r{[^A-Za-z0-9_]}))
    |> Enum.to_list()
    |> List.flatten()
    |> Enum.map(&String.downcase(&1))
  end

  defp count(words) when is_list(words) do
    Enum.reduce(words, %{}, &update_count/2)
  end

  defp update_count(word, acc) do
    Map.update(acc, String.to_atom(word), 1, &(&1 + 1))
  end

  defp to_csv(map) do
    File.open("wordfreqkor.csv", [:write, :utf8], fn file ->
      Enum.each(map, &IO.write(file, Enum.join(Tuple.to_list(&1), ", ") <> "\n"))
    end)
  end
end

WordFrequency.wordcount("myfile.txt")
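For what it's worth, my guess is that the split step is where the Hangul disappears: the character class `[^A-Za-z0-9_]` only exempts ASCII letters, digits, and underscore, so every Hangul syllable counts as a separator and the split yields nothing but empty strings (which is why the blank "word" appears 290668 times). A minimal sketch comparing the ASCII-only pattern against a Unicode-aware one, using the `u` regex modifier and the `\p{L}` (any letter) and `\p{N}` (any number) property classes:

```elixir
# ASCII-only class: every Hangul syllable is treated as a separator,
# so splitting leaves only empty strings, which `trim: true` drops.
ascii_split = String.split("한국어 단어", ~r{[^A-Za-z0-9_]}, trim: true)
IO.inspect(ascii_split)  # []

# Unicode-aware class with the `u` modifier: Hangul words survive intact.
unicode_split = String.split("한국어 단어", ~r/[^\p{L}\p{N}_]+/u, trim: true)
IO.inspect(unicode_split)  # ["한국어", "단어"]
```

If that diagnosis is right, swapping `~r{[^a-zA-Z0-9_]}` for `~r/[^\p{L}\p{N}_]+/u` in `words/1` (and adding `trim: true` to discard empty pieces) should be enough; the file-reading side already handles UTF-8.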
Thanks for any advice!