Joining DataFrames in Pandas

In this tutorial, you’ll learn various ways in which multiple DataFrames could be merged in python using Pandas library.

Have you ever tried solving a Kaggle challenge? If yes, you might have noticed that in most of the challenges, the data provided to you is present in multiple files, with some of the columns present in more than one files. Well, what is the first thing that comes to your mind? Join them of course!

Joining and merging DataFrames is the core process to start with data analysis and machine learning tasks. It is one of the toolkits which every Data Analyst or Data Scientist should master because in almost all the cases data comes from multiple source and files. You may need to bring all the data in one place by some sort of join logic and then start your analysis. People who work with SQL like query languages might know the importance of this task. Even if you want to build some machine learning models on some data, you may need to merge multiple csv files together in a single DataFrame.

Thankfully you have the most popular library in python, pandas to your rescue! pandas provides various facilities for easily combining together Series, DataFrames, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

In this tutorial, you will practice a few standard techniques. More specifically, you will learn to:

Concatenate DataFrames along row and column.
Merge DataFrames on specific keys by different join logics like left-join, inner-join, etc.
Time-series friendly merging provided in pandas

Along the way, you will also learn a few tricks which you require before and after joining.

Concatenate DataFrames

Start by importing the library you will be using throughout the tutorial: pandas

import pandas as pd

You will be performing all the operations in this tutorial on the dummy DataFrames that you will create. To create a DataFrame you can use python dictionary like:

dummy_data1 = {
        'id': ['1', '2', '3', '4', '5'],
        'Feature1': ['A', 'C', 'E', 'G', 'I'],
        'Feature2': ['B', 'D', 'F', 'H', 'J']}

Here the keys of the dictionary dummy_data1 are the column names and the values in the list are the data corresponding to each observation or row. To transform this into a pandas DataFrame, you will use the DataFrame() function of pandas, along with its columns argument to name your columns:

df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])

df1

id	Feature1	Feature2
0	1	A	B
1	2	C	D
2	3	E	F
3	4	G	H
4	5	I	J

As you can notice, you now have a DataFrame with 3 columns id, Feature1, and Feature2. There is an additional un-named column which pandas intrinsically creates as the row labels. Similar to the previous DataFrame df1, you will create two more DataFrames df2 and df3 :

dummy_data2 = {
        'id': ['1', '2', '6', '7', '8'],
        'Feature1': ['K', 'M', 'O', 'Q', 'S'],
        'Feature2': ['L', 'N', 'P', 'R', 'T']}

df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])

df2

id	Feature1	Feature2
0	1	K	L
1	2	M	N
2	6	O	P
3	7	Q	R
4	8	S	T

dummy_data3 = {
        'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}

df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])

df3

id	Feature3
0	1	12
1	2	13
2	3	14
3	4	15
4	5	16
5	7	17
6	8	15
7	9	12
8	10	13
9	11	23

To simply concatenate the DataFrames along the row you can use the concat() function in pandas. You will have to pass the names of the DataFrames in a list as the argument to the concat() function:

df_row = pd.concat([df1, df2])

df_row

id	Feature1	Feature2
0	1	A	B
1	2	C	D
2	3	E	F
3	4	G	H
4	5	I	J
0	1	K	L
1	2	M	N
2	6	O	P
3	7	Q	R
4	8	S	T

You can notice that the two DataFrames df1 and df2 are now concatenated into a single DataFrame df_row along the row. However, the row labels seem to be wrong! If you want the row labels to adjust automatically according to the join, you will have to set the argument ignore_index as True while calling the concat() function:

df_row_reindex = pd.concat([df1, df2], ignore_index=True)

df_row_reindex

id	Feature1	Feature2
0	1	A	B
1	2	C	D
2	3	E	F
3	4	G	H
4	5	I	J
5	1	K	L
6	2	M	N
7	6	O	P
8	7	Q	R
9	8	S	T

Now the row labels are correct!

pandas also provides you with an option to label the DataFrames, after the concatenation, with a key so that you may know which data came from which DataFrame. You can achieve the same by passing additional argument keys specifying the label names of the DataFrames in a list. Here you will perform the same concatenation with keys as x and y for DataFrames df1 and df2respectively.

frames = [df1,df2]
df_keys = pd.concat(frames, keys=['x', 'y'])

df_keys

id	Feature1	Feature2
x	0	1	A	B
1	2	C	D
2	3	E	F
3	4	G	H
4	5	I	J
y	0	1	K	L
1	2	M	N
2	6	O	P
3	7	Q	R
4	8	S	T

Mentioning the keys also makes it easy to retrieve data corresponding to a particular DataFrame. You can retrieve the data of DataFrame df2 which had the label y by using the loc method:

df_keys.loc['y']

id	Feature1	Feature2
0	1	K	L
1	2	M	N
2	6	O	P
3	7	Q	R
4	8	S	T

You can also pass a dictionary to concat(), in which case the dictionary keys will be used for the keys argument (unless other keys are specified):

pieces = {'x': df1, 'y': df2}

df_piece = pd.concat(pieces)

df_piece

id	Feature1	Feature2
x	0	1	A	B
1	2	C	D
2	3	E	F
3	4	G	H
4	5	I	J
y	0	1	K	L
1	2	M	N
2	6	O	P
3	7	Q	R
4	8	S	T

It is worth noting that concat() makes a full copy of the data, and continuosly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)

To concatenate DataFrames along column, you can specify the axis parameter as 1 :

df_col = pd.concat([df1,df2], axis=1)

df_col

id	Feature1	Feature2	id	Feature1	Feature2
0	1	A	B	1	K	L
1	2	C	D	2	M	N
2	3	E	F	6	O	P
3	4	G	H	7	Q	R
4	5	I	J	8	S	T

Merge DataFrames

Another ubiquitous operation related to DataFrames is the merging operation. Two DataFrames might hold different kinds of information about the same entity and linked by some common feature/column. To join these DataFrames, pandas provides multiple functions like concat(), merge() , join(), etc. In this section, you will practice using merge() function of pandas.

You can join DataFrames df_row (which you created by concatenating df1 and df2 along the row) and df3 on the common column (or key) id. To do so, pass the names of the DataFrames and an additional argument on as the name of the common column, here id, to the merge()function:

df_merge_col = pd.merge(df_row, df3, on='id')

df_merge_col

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15

You can notice that the DataFrames are now merged into a single DataFrame based on the common values present in the id column of both the DataFrames. For example, here id value 1was present with both A, B and K, L in the DataFrame df_row hence this id got repeated twice in the final DataFrame df_merge_col with repeated value 12 of Feature3 which came from DataFrame df3.

It might happen that the column on which you want to merge the DataFrames have different names (unlike in this case). For such merges, you will have to specify the arguments left_on as the left DataFrame name and right_on as the right DataFrame name, like :

df_merge_difkey = pd.merge(df_row, df3, left_on='id', right_on='id')

df_merge_difkey

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15

You can also append rows to a DataFrame by passing a Series or dict to append() function which returns a new DataFrame:

add_row = pd.Series(['10', 'X1', 'X2', 'X3'],
                    index=['id','Feature1', 'Feature2', 'Feature3'])

df_add_row = df_merge_col.append(add_row, ignore_index=True)

df_add_row

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15
9	10	X1	X2	X3

Join DataFrames

In this section, you will practice the various join logics available to merge pandas DataFrames based on some common column/key. The logic behind these joins is very much the same that you have in SQL when you join tables.

Full Outer Join

The FULL OUTER JOIN combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaNs for missing matches on either side. You can perform a full outer join by specifying the how argument as outer in the merge() function:

df_outer = pd.merge(df1, df2, on='id', how='outer')

df_outer

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

You can notice that the resulting DataFrame had all the entries from both the tables with NaNvalues for missing matches on either side. However, one more thing to notice is the suffix which got appended to the column names to show which column came from which DataFrame. The default suffixes are x and y, however, you can modify them by specifying the suffixesargument in the merge() function:

df_suffix = pd.merge(df1, df2, left_on='id',right_on='id',how='outer',suffixes=('_left','_right'))

df_suffix

id	Feature1_left	Feature2_left	Feature1_right	Feature2_right
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

Inner Join

The INNER JOIN produces only the set of records that match in both DataFrame A and DataFrame B. You have to pass inner in the how argument of merge() function to do inner join:

df_inner = pd.merge(df1, df2, on='id', how='inner')

df_inner

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N

Right Join

The RIGHT JOIN produces a complete set of records from DataFrame B (right DataFrame), with the matching records (where available) in DataFrame A (left DataFrame). If there is no match, the right side will contain null. You have to pass right in the how argument of merge() function to do right join:

df_right = pd.merge(df1, df2, on='id', how='right')

df_right

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	6	NaN	NaN	O	P
3	7	NaN	NaN	Q	R
4	8	NaN	NaN	S	T

Left Join

The LEFT JOIN produces a complete set of records from DataFrame A (left DataFrame), with the matching records (where available) in DataFrame B (right DataFrame). If there is no match, the left side will contain null. You have to pass left in the how argument of merge() function to do left join:

df_left = pd.merge(df1, df2, on='id', how='left')

df_left

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN

Joining on index

Sometimes you may have to perform the join on the indexes or the row labels. To do so, you have to specify right_index (for the indexes of the right DataFrame) and left_index (for the indexes of the left DataFrame) as True :

df_index = pd.merge(df1, df2, right_index=True, left_index=True)

df_index

id_x	Feature1_x	Feature2_x	id_y	Feature1_y	Feature2_y
0	1	A	B	1	K	L
1	2	C	D	2	M	N
2	3	E	F	6	O	P
3	4	G	H	7	Q	R
4	5	I	J	8	S	T

Time-series friendly merging

Pandas provides special functions for merging Time-series DataFrames. Perhaps the most useful and popular one is the merge_asof() function. The merge_asof() is similar to an ordered left-join except that you match on nearest key rather than equal keys. For each row in the left DataFrame, you select the last row in the right DataFrame whose on key is less than the left’s key. Both DataFrames must be sorted by the key.

Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in addition to the nearest match on the on key.

For example, you might have trades and quotes, and you want to asof merge them. Here the left DataFrame is chosen as trades and right DataFrame as quotes. They are asof merged on key time and group-wise merged by their ticker symbol.

trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.038',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
    'price': [51.95, 51.95,720.77, 720.92, 98.00],
    'quantity': [75, 155,100, 100, 100]},
    columns=['time', 'ticker', 'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.030',
                            '20160525 13:30:00.041',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.049',
                            '20160525 13:30:00.072',
                            '20160525 13:30:00.075']),
    'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
    'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
    'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
    columns=['time', 'ticker', 'bid', 'ask'])

trades

time	ticker	price	quantity
0	2016-05-25 13:30:00.023	MSFT	51.95	75
1	2016-05-25 13:30:00.038	MSFT	51.95	155
2	2016-05-25 13:30:00.048	GOOG	720.77	100
3	2016-05-25 13:30:00.048	GOOG	720.92	100
4	2016-05-25 13:30:00.048	AAPL	98.00	100

quotes

time	ticker	bid	ask
0	2016-05-25 13:30:00.023	GOOG	720.50	720.93
1	2016-05-25 13:30:00.023	MSFT	51.95	51.96
2	2016-05-25 13:30:00.030	MSFT	51.97	51.98
3	2016-05-25 13:30:00.041	MSFT	51.99	52.00
4	2016-05-25 13:30:00.048	GOOG	720.50	720.93
5	2016-05-25 13:30:00.049	AAPL	97.99	98.01
6	2016-05-25 13:30:00.072	GOOG	720.50	720.88
7	2016-05-25 13:30:00.075	MSFT	52.01	52.03

df_merge_asof = pd.merge_asof(trades, quotes,
              on='time',
              by='ticker')

df_merge_asof

time	ticker	price	quantity	bid	ask
0	2016-05-25 13:30:00.023	MSFT	51.95	75	51.95	51.96
1	2016-05-25 13:30:00.038	MSFT	51.95	155	51.97	51.98
2	2016-05-25 13:30:00.048	GOOG	720.77	100	720.50	720.93
3	2016-05-25 13:30:00.048	GOOG	720.92	100	720.50	720.93
4	2016-05-25 13:30:00.048	AAPL	98.00	100	NaN	NaN

If you observe carefully, you can notice the reason behind NaN appearing in the AAPL ticker row. Since the right DataFrame quotes didn’t have any time value less than 13:30:00.048 (the timein the left table) for AAPL ticker, NaNs were introduced in the bid and ask columns.

You can also set a predefined tolerance level for time column. Suppose you only want asof merge within 2ms between the quote time and the trade time, then you will have to specify tolerance argument:

df_merge_asof_tolerance = pd.merge_asof(trades, quotes,
              on='time',
              by='ticker',
              tolerance=pd.Timedelta('2ms'))

df_merge_asof_tolerance

time	ticker	price	quantity	bid	ask
0	2016-05-25 13:30:00.023	MSFT	51.95	75	51.95	51.96
1	2016-05-25 13:30:00.038	MSFT	51.95	155	NaN	NaN
2	2016-05-25 13:30:00.048	GOOG	720.77	100	720.50	720.93
3	2016-05-25 13:30:00.048	GOOG	720.92	100	720.50	720.93
4	2016-05-25 13:30:00.048	AAPL	98.00	100	NaN	NaN

Notice the difference between the above and previous result. Rows are not merged if the time tolerance didn’t match 2ms.

Conclusion

Hurray! You have come to the end of the tutorial. In this tutorial, you learned to concatenate and merge DataFrames based on several logics using the concat() and merge() functions of pandaslibrary. Towards the end, you also practiced the special function merge_asof() for merging Time-series DataFrames. Along the way, you also learned to play with indexes of the DataFrames. There are several other options you can explore for joining DataFrames in pandas, and I encourage you to look at its fantastic documentation. Happy exploring!

Żdródło:
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas

Bez kategorii

Joining DataFrames in Pandas

by root • 13 września 2019 • 0 Comments

Concatenate DataFrames

Merge DataFrames

Join DataFrames

Full Outer Join

Inner Join

Right Join

Left Join

Joining on index

Time-series friendly merging

Conclusion

Dodaj komentarz Anuluj pisanie odpowiedzi

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15
9	10	X1	X2	X3

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

id	Feature1_left	Feature2_left	Feature1_right	Feature2_right
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15
9	10	X1	X2	X3

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

id	Feature1_left	Feature2_left	Feature1_right	Feature2_right
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

Bez kategorii

Joining DataFrames in Pandas

by root • 13 września 2019 • 0 Comments

Concatenate DataFrames

Merge DataFrames

Join DataFrames

Full Outer Join

Inner Join

Right Join

Left Join

Joining on index

Time-series friendly merging

Conclusion

Post navigation

Dodaj komentarz Anuluj pisanie odpowiedzi

id	Feature1	Feature2	Feature3
0	1	A	B	12
1	1	K	L	12
2	2	C	D	13
3	2	M	N	13
4	3	E	F	14
5	4	G	H	15
6	5	I	J	16
7	7	Q	R	17
8	8	S	T	15
9	10	X1	X2	X3

id	Feature1_x	Feature2_x	Feature1_y	Feature2_y
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T

id	Feature1_left	Feature2_left	Feature1_right	Feature2_right
0	1	A	B	K	L
1	2	C	D	M	N
2	3	E	F	NaN	NaN
3	4	G	H	NaN	NaN
4	5	I	J	NaN	NaN
5	6	NaN	NaN	O	P
6	7	NaN	NaN	Q	R
7	8	NaN	NaN	S	T