R - Read lines that do not begin with ]dos


I have to import a CSV file with thousands of lines. In the file, the header appears several times. The header begins with the following 4 characters: ]dos. I would like to use readLines while excluding the lines beginning with ]dos.

The file looks like this:

n[ dos  date dos    heure dos   nom du patient  ex pat
n[ dos  date dos    heure dos   nom du patient  ex pat
7061283778  02-03-17    12h54   mr montaldo jimena          02-03-17
7061283777  02-03-17    12h54   mme montaldo jimena             03-03-17
7061283790  02-03-17    12h54   mme montaldo jimena             02-03-17
7061283779  02-03-17    12h55   mr montaldo jimena              02-03-17
7061300309  02-03-17    12h55   mme montaldo jimena             02-03-17
7061294068  02-03-17    12h56   mme montaldo jimena             03-03-17
7061283782  02-03-17    12h56   mr montaldo jimena              02-03-17
n[ dos  date dos    heure dos   nom du patient  ex pat
7061283781  02-03-17    12h56   mlle montaldo jimena            02-03-17
7061300311  02-03-17    12h57   mme montaldo jimena             02-03-17

As you can see, the header appears 3 times.


I've tried the approach posted by @Jaap, but I think my file is dirty, in the sense that:

  • It is a CSV file with the separator ; instead of , (because it is French).
  • Several columns have empty values.
  • Perhaps there are unknown characters (not sure).

I got the following error message:

> df <- read.table(text = txt, header = FALSE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 1 did not have 44 elements
In addition: Warning message:
closing unused connection 3 (extraction.csv)

I have to prepare a program in R that will be used by someone who knows nothing about programming or data analysis. This person will use the program regularly in order to get results from the data. This person will not be able to clean the data himself.

This is how the file looks if I open it in Excel:


Based on the example data, I suppose you want to exclude the lines that start with n[ dos.

You can get the data into a dataframe in several steps:

  1. Read the datafile with readLines. (Because I don't have the file, I've used textConnection; use it as follows: readLines('name_of_your_file.txt') to read in the text.)
  2. Remove the lines that start with n[ dos with the grepl function.
  3. Read the remaining text with read.table.
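Step 2 hinges on the regular expression: ^ anchors the match at the start of the line, and the [ has to be escaped because it would otherwise open a character class. A minimal sketch on a made-up vector (the three strings are illustrative, not taken from the real file):

```r
# Made-up lines: a header, a data row, and a row that merely contains the
# header text somewhere other than at the start.
lines <- c("n[ dos  date dos", "7061283778  02-03-17", "see n[ dos later")

# TRUE only where the line *begins* with the literal text "n[ dos"
is_header <- grepl("^n\\[ dos", lines)

# Keep everything that is not a header line
kept <- lines[!is_header]

# startsWith() does the same literal prefix test without any regex escaping
same_result <- identical(startsWith(lines, "n[ dos"), is_header)
```

Only the first string is dropped here; the third one survives because the pattern is anchored to the start of the line.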

The complete code:

# the example data, supplied through a text connection instead of a file
txtcon <- textConnection('n[ dos  date dos    heure dos   nom du patient  ex pat
n[ dos  date dos    heure dos   nom du patient  ex pat
7061283778  02-03-17    12h54   mr montaldo jimena          02-03-17
7061283777  02-03-17    12h54   mme montaldo jimena             03-03-17
7061283790  02-03-17    12h54   mme montaldo jimena             02-03-17
7061283779  02-03-17    12h55   mr montaldo jimena              02-03-17
7061300309  02-03-17    12h55   mme montaldo jimena             02-03-17
7061294068  02-03-17    12h56   mme montaldo jimena             03-03-17
7061283782  02-03-17    12h56   mr montaldo jimena              02-03-17
n[ dos  date dos    heure dos   nom du patient  ex pat
7061283781  02-03-17    12h56   mlle montaldo jimena            02-03-17
7061300311  02-03-17    12h57   mme montaldo jimena             02-03-17')

# step 1: read all lines as plain text, then release the connection
txt <- readLines(txtcon)
close(txtcon)

# step 2: drop every line that starts with the header text
txt <- txt[!grepl(pattern = '^n\\[ dos', txt)]

# step 3: parse the remaining lines (whitespace-separated, no header)
df <- read.table(text = txt, header = FALSE)

This results in the following dataframe:

> df
          V1       V2    V3   V4       V5     V6       V7
1 7061283778 02-03-17 12h54   mr montaldo jimena 02-03-17
2 7061283777 02-03-17 12h54  mme montaldo jimena 03-03-17
3 7061283790 02-03-17 12h54  mme montaldo jimena 02-03-17
4 7061283779 02-03-17 12h55   mr montaldo jimena 02-03-17
5 7061300309 02-03-17 12h55  mme montaldo jimena 02-03-17
6 7061294068 02-03-17 12h56  mme montaldo jimena 03-03-17
7 7061283782 02-03-17 12h56   mr montaldo jimena 02-03-17
8 7061283781 02-03-17 12h56 mlle montaldo jimena 02-03-17
9 7061300311 02-03-17 12h57  mme montaldo jimena 02-03-17

Looking at the resulting dataframe, I suppose you want some of the columns combined into one column. A possible approach would be:

# V2 + V3 become one date-time, V7 a date, and V4:V6 the full name
df2 <- data.frame(id = df$V1,
                  datetime1 = strptime(paste(df$V2, df$V3), '%d-%m-%y %Hh%M'),
                  datetime2 = as.Date(df$V7, '%d-%m-%y'),
                  name = paste(df$V4, df$V5, df$V6))

which results in:

> df2
          id           datetime1  datetime2                 name
1 7061283778 2017-03-02 12:54:00 2017-03-02   mr montaldo jimena
2 7061283777 2017-03-02 12:54:00 2017-03-03  mme montaldo jimena
3 7061283790 2017-03-02 12:54:00 2017-03-02  mme montaldo jimena
4 7061283779 2017-03-02 12:55:00 2017-03-02   mr montaldo jimena
5 7061300309 2017-03-02 12:55:00 2017-03-02  mme montaldo jimena
6 7061294068 2017-03-02 12:56:00 2017-03-03  mme montaldo jimena
7 7061283782 2017-03-02 12:56:00 2017-03-02   mr montaldo jimena
8 7061283781 2017-03-02 12:56:00 2017-03-02 mlle montaldo jimena
9 7061300311 2017-03-02 12:57:00 2017-03-02  mme montaldo jimena
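The question mentions a ; separator and empty values, which may explain the "did not have 44 elements" error: by default read.table splits on whitespace, so rows with a varying number of words yield a varying number of fields. What follows is a minimal sketch of the same three steps on a ;-separated file. The sample file, its six-column layout and the empty fifth field are made up for illustration, and the real export may well differ.

```r
# Hypothetical stand-in for the real export: ';'-separated, with an empty
# field in every data row and a header line that repeats.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "n[ dos;date dos;heure dos;nom du patient;ex;pat",
  "7061283778;02-03-17;12h54;mr montaldo jimena;;02-03-17",
  "n[ dos;date dos;heure dos;nom du patient;ex;pat",
  "7061283777;02-03-17;12h54;mme montaldo jimena;;03-03-17",
  "7061283790;02-03-17;12h54;mme montaldo jimena;;02-03-17"
), path)

# Step 1: read the raw lines from the file
txt <- readLines(path)

# Step 2: drop every repeated header line
txt <- txt[!grepl("^n\\[ dos", txt)]

# Step 3: parse with an explicit separator; fill = TRUE pads short rows,
# quote = "" stops stray quote characters from swallowing fields, and
# colClasses = "character" keeps the long IDs and the empty fields as text
df <- read.table(text = txt, sep = ";", header = FALSE, fill = TRUE,
                 quote = "", colClasses = "character")
```

With an explicit separator the patient name stays in one column (V4), so it no longer has to be pasted back together. Adding na.strings = "" to the read.table call would turn the empty fields into NA instead of empty strings.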
