Separate string column into multiple columns in Pandas

In general iterating over the contents of a pandas series or dataframe is bad for performance and instead vectorized approaches are recommended when possible:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided…

See the docs.

That said, for the “non-botched” entries you can do this:

df['date_time'] = df['date_time/full_company_name'].str.slice(0, 10)
df['full_company_name'] = df['date_time/full_company_name'].str.slice(10, None)
df.drop('date_time/full_company_name', axis=1, inplace=True)

For botched entries, if there is no consistent pattern, it’s gonna be hard to come up with a programmatic approach, but something that works for your examples and could be extended with other known entry errors is using extract with a regular expression:

# The first capture group is either a Y-M-D date, the NaN string or a sequence of digits
# The second one is any string
pattern = r'(\d{4}-\d{2}-\d{2}|NaN|0|\d+)(.+)'

df['date_time/full_company_name'].str.extract(pattern)

Output:

            0                             1
0  2020-05-19                 Lopez-Wallace
1  2020-05-12                   Smith-Simon
2  2020-10-02                   Jenkins Inc
3  2020-07-06                   Moore-Weiss
4           0       Lopez, Barton and Jones
5         NaN  Brown, Singleton and Harrell
6       84635                   Ball-Thomas

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top