regex - Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract
I have a dataframe, and one of its columns contains strings. df looks like
    member_id,event_path,event_time,event_duration
    30595,"2016-03-30 12:27:33",yandex.ru/,1
    30595,"2016-03-30 12:31:42",yandex.ru/,0
    30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%d1%84%d0%b8%d0%bb%d1%8c%d0%bc%d1%8b+%d0%be%d0%bd%d0%bb%d0%b0%d0%b9%d0%bd&suggest_reqid=168542624144922467267026838391360&csg=3381%2c3938%2c2%2c3%2c1%2c0%2c0,0
    30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%d1%84%d0%b8%d0%bb%d1%8c%d0%bc%d1%8b+%d0%be%d0%bd%d0%bb%d0%b0%d0%b9%d0%bd&suggest_reqid=168542624144922467267026838391360&csg=3381%2c3938%2c2%2c3%2c1%2c0%2c0,0
    30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%d1%84%d0%b8%d0%bb%d1%8c%d0%bc%d1%8b+%d0%be%d0%bd%d0%bb%d0%b0%d0%b9%d0%bd&suggest_reqid=168542624144922467267026838391360&csg=3381%2c3938%2c2%2c3%2c1%2c0%2c0,0
    30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%d1%84%d0%b8%d0%bb%d1%8c%d0%bc%d1%8b+%d0%be%d0%bd%d0%bb%d0%b0%d0%b9%d0%bd&suggest_reqid=168542624144922467267026838391360&csg=3381%2c3938%2c2%2c3%2c1%2c0%2c0,0
    30595,"2016-03-30 12:31:49",kinogo.co/,1
    30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

and the urls dataframe looks like
    url
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
    003\.ru\/sonyxperia
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5d5bbr_23
    1click\.ru\/sonyxperia
    1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use
    import pandas as pd

    urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
    substr = urls.url.values.tolist()
    data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
    result = pd.DataFrame()
    for i, df in enumerate(data):
        # keep only the rows whose event_time matches any of the URL patterns
        res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it returns
    UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix that?
At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time']; it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the match is not used.
If you wish to remove the UserWarning, you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
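One way to track down the offending patterns is to compile each one and inspect its `groups` attribute, which reports how many capturing groups the compiled pattern contains. The sketch below is only an illustration: `find_capturing` is a helper name made up for this example, and the second entry in `patterns` is a hypothetical pattern, not one taken from the question.

```python
import re

def find_capturing(patterns):
    """Return (pattern, group_count) for every pattern with a capturing group."""
    flagged = []
    for pattern in patterns:
        n_groups = re.compile(pattern).groups  # number of capturing groups
        if n_groups > 0:
            flagged.append((pattern, n_groups))
    return flagged

# Sample list; in practice this would be urls.url.values.tolist()
patterns = [
    # parentheses appear only inside [...], so they are literal characters
    r'003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_',
    # hypothetical pattern with a real capturing group
    r'1click\.ru\/(sony|motorola)',
]

for pattern, n_groups in find_capturing(patterns):
    print(f'{n_groups} capturing group(s) in: {pattern}')
```

Only the second pattern is reported, since parentheses inside a character class do not form a group.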
Alternatively, you could suppress this particular UserWarning by putting

    import warnings
    warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
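If silencing the warning for the whole program feels too broad, the same filter can be scoped with `warnings.catch_warnings()`. In the sketch below, `noisy_contains` is a hypothetical stand-in that simply emits a UserWarning with the same text, so the example runs without pandas; in real code the `str.contains` call would go inside the `with` block instead.

```python
import warnings

MESSAGE = ('This pattern has match groups. '
           'To actually get the groups, use str.extract.')

def noisy_contains():
    # Hypothetical stand-in for a str.contains call that triggers the warning
    warnings.warn(MESSAGE, UserWarning)
    return True

def count_warnings(suppress):
    """Call noisy_contains() and return how many warnings got through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # start from a known filter state
        if suppress:
            # The message argument is a regex matched against the start of the text
            warnings.filterwarnings('ignore', 'This pattern has match groups')
        noisy_contains()
    # Filters are restored to their previous state when the block exits
    return len(caught)

print(count_warnings(suppress=False))  # 1
print(count_warnings(suppress=True))   # 0
```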
Here is a simple example which demonstrates the problem (and solution):

    # import warnings
    # warnings.filterwarnings("ignore", 'This pattern has match groups')  # uncomment to suppress the UserWarning
    import pandas as pd

    df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

    urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
    # urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.
    substr = urls.url.values.tolist()
    df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

    script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
      df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

    urls = pd.DataFrame({'url': ['g.*']})

avoids the UserWarning.
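When the parentheses are needed for grouping or alternation and cannot simply be deleted, a non-capturing group `(?:...)` keeps the grouping without creating a match group. The patterns below are invented for illustration and are not taken from the question's file; the sketch uses only the standard `re` module to check that both forms select the same strings.

```python
import re

words = ['gouda', 'stilton', 'gruyere', 'gorgonzola']

capturing = re.compile(r'g(ou|ru).*')        # one capturing group
non_capturing = re.compile(r'g(?:ou|ru).*')  # same grouping, nothing captured

print(capturing.groups, non_capturing.groups)  # 1 0

# Both patterns select the same words
matches_a = [w for w in words if capturing.search(w)]
matches_b = [w for w in words if non_capturing.search(w)]
print(matches_a == matches_b, matches_a)  # True ['gouda', 'gruyere']
```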