r - Read lines that do not begin with ]dos
I have to import a CSV file with thousands of lines. In the file, the header appears several times. The header begins with the following 4 characters: ]dos. I want to use readLines while excluding the lines that begin with ]dos.
The file looks like this:
n[ dos date dos heure dos nom du patient ex pat
n[ dos date dos heure dos nom du patient ex pat
7061283778 02-03-17 12h54 mr montaldo jimena 02-03-17
7061283777 02-03-17 12h54 mme montaldo jimena 03-03-17
7061283790 02-03-17 12h54 mme montaldo jimena 02-03-17
7061283779 02-03-17 12h55 mr montaldo jimena 02-03-17
7061300309 02-03-17 12h55 mme montaldo jimena 02-03-17
7061294068 02-03-17 12h56 mme montaldo jimena 03-03-17
7061283782 02-03-17 12h56 mr montaldo jimena 02-03-17
n[ dos date dos heure dos nom du patient ex pat
7061283781 02-03-17 12h56 mlle montaldo jimena 02-03-17
7061300311 02-03-17 12h57 mme montaldo jimena 02-03-17
As you can see, the header appears 3 times.
I've tried the approach posted by @jaap, but I think my file is dirty, in the sense that:
- It is a CSV file with ; as the separator instead of , (because it is French).
- Several columns have empty values.
- Perhaps there are unknown characters (I am not sure).
I got the following error message:
df <- read.table(text = txt, header = FALSE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 1 did not have 44 elements
In addition: Warning message:
closing unused connection 3 (extraction.csv)
I have to prepare a program in R that will be used by a person who knows nothing about programming or data analysis. This person will use the program regularly to get results from the data, and will not be able to clean the data himself.
This is how the file looks when opened in Excel (screenshot not included).
Based on the example data, I suppose you want to exclude the lines that start with n[ dos.
You can get the data into a dataframe in several steps:
- Read the data file with readLines. (Because I don't have the file, I've used textConnection; use readLines('name_of_your_file.txt') to read in the text from your own file.)
- Remove the lines that start with n[ dos by using the grepl function.
- Read the remaining text with read.table.
The complete code:
# Example data supplied through a text connection instead of a file
txtcon <- textConnection('n[ dos date dos heure dos nom du patient ex pat
n[ dos date dos heure dos nom du patient ex pat
7061283778 02-03-17 12h54 mr montaldo jimena 02-03-17
7061283777 02-03-17 12h54 mme montaldo jimena 03-03-17
7061283790 02-03-17 12h54 mme montaldo jimena 02-03-17
7061283779 02-03-17 12h55 mr montaldo jimena 02-03-17
7061300309 02-03-17 12h55 mme montaldo jimena 02-03-17
7061294068 02-03-17 12h56 mme montaldo jimena 03-03-17
7061283782 02-03-17 12h56 mr montaldo jimena 02-03-17
n[ dos date dos heure dos nom du patient ex pat
7061283781 02-03-17 12h56 mlle montaldo jimena 02-03-17
7061300311 02-03-17 12h57 mme montaldo jimena 02-03-17')

# Read all lines, then close the connection
txt <- readLines(txtcon)
close(txtcon)

# Drop every line that starts with the header text ('[' must be escaped)
txt <- txt[!grepl(pattern = '^n\\[ dos', txt)]

# Parse the remaining whitespace-separated lines
df <- read.table(text = txt, header = FALSE)
This results in the following dataframe:
> df
          V1       V2    V3   V4       V5     V6       V7
1 7061283778 02-03-17 12h54   mr montaldo jimena 02-03-17
2 7061283777 02-03-17 12h54  mme montaldo jimena 03-03-17
3 7061283790 02-03-17 12h54  mme montaldo jimena 02-03-17
4 7061283779 02-03-17 12h55   mr montaldo jimena 02-03-17
5 7061300309 02-03-17 12h55  mme montaldo jimena 02-03-17
6 7061294068 02-03-17 12h56  mme montaldo jimena 03-03-17
7 7061283782 02-03-17 12h56   mr montaldo jimena 02-03-17
8 7061283781 02-03-17 12h56 mlle montaldo jimena 02-03-17
9 7061300311 02-03-17 12h57  mme montaldo jimena 02-03-17
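The header filter relies on a regular expression in which the [ has to be escaped. As a small sketch of an alternative, base R's startsWith() compares literal strings, so no escaping is needed. The three-line txt vector below is made up for illustration, and keep_regex, keep_literal and data_lines are hypothetical names.

```r
# Two header lines mixed with one data line, shaped like the example data.
txt <- c('n[ dos date dos heure dos nom du patient ex pat',
         '7061283778 02-03-17 12h54 mr montaldo jimena 02-03-17',
         'n[ dos date dos heure dos nom du patient ex pat')

# Regex version: '[' is escaped because it would otherwise open a character class.
keep_regex <- !grepl('^n\\[ dos', txt)

# Literal version: startsWith() does plain string comparison, no escaping.
keep_literal <- !startsWith(txt, 'n[ dos')

# Both filters select the same lines; keep only the data lines.
data_lines <- txt[keep_literal]
```

Either form should work here; the literal version is arguably easier to maintain if the header text contains other special characters.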
Looking at the resulting dataframe, I suppose you want some of the columns combined into one column. A possible approach would be:
# Combine date + time into one date-time column and the name parts into one name
df2 <- data.frame(id = df$V1,
                  datetime1 = strptime(paste(df$V2, df$V3), '%d-%m-%y %Hh%M'),
                  datetime2 = as.Date(df$V7, '%d-%m-%y'),
                  name = paste(df$V4, df$V5, df$V6))
which results in:
> df2
          id           datetime1  datetime2                 name
1 7061283778 2017-03-02 12:54:00 2017-03-02   mr montaldo jimena
2 7061283777 2017-03-02 12:54:00 2017-03-03  mme montaldo jimena
3 7061283790 2017-03-02 12:54:00 2017-03-02  mme montaldo jimena
4 7061283779 2017-03-02 12:55:00 2017-03-02   mr montaldo jimena
5 7061300309 2017-03-02 12:55:00 2017-03-02  mme montaldo jimena
6 7061294068 2017-03-02 12:56:00 2017-03-03  mme montaldo jimena
7 7061283782 2017-03-02 12:56:00 2017-03-02   mr montaldo jimena
8 7061283781 2017-03-02 12:56:00 2017-03-02 mlle montaldo jimena
9 7061300311 2017-03-02 12:57:00 2017-03-02  mme montaldo jimena
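The question mentions that the real file uses ; as the separator and has empty values, which may be why read.table with its default settings complained that a line did not have the expected number of elements. Below is a minimal sketch of how the same steps might be adapted, assuming that is the cause. The helper name read_repeated_header, the temporary file and its five-column contents are all made up for illustration; the real file apparently has many more columns and may need further cleaning.

```r
# Hypothetical helper: read a ';'-separated file whose header line is repeated.
read_repeated_header <- function(path, header_pattern = '^n\\[ dos') {
  txt <- readLines(path, warn = FALSE)
  # Drop every repeated header line and any blank lines.
  txt <- txt[!grepl(header_pattern, txt) & nzchar(trimws(txt))]
  read.table(text = txt, sep = ';', header = FALSE,
             fill = TRUE,               # pad short rows instead of stopping
             quote = '', comment.char = '',
             na.strings = '',           # empty fields become NA
             colClasses = 'character',  # keep ids and dates as text
             stringsAsFactors = FALSE)
}

# Small made-up file, used only to exercise the helper.
tmp <- tempfile(fileext = '.csv')
writeLines(c('n[ dos;date dos;heure dos;nom du patient;ex pat',
             '7061283778;02-03-17;12h54;mr montaldo jimena;02-03-17',
             'n[ dos;date dos;heure dos;nom du patient;ex pat',
             '7061283777;02-03-17;12h54;;03-03-17'), tmp)
df_semicolon <- read_repeated_header(tmp)
```

Reading every column as character is a cautious choice for a dirty file: conversions to dates or numbers can then be done column by column, once the contents have been checked.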