Joining DataFrames in Pandas

In this tutorial, you’ll learn various ways to merge multiple DataFrames in Python using the pandas library.

Have you ever tried solving a Kaggle challenge? If so, you might have noticed that in most challenges the data is spread across multiple files, with some columns present in more than one file. Well, what is the first thing that comes to your mind? Join them, of course!

Joining and merging DataFrames is a core first step in data analysis and machine learning tasks. It is one of the tools every Data Analyst or Data Scientist should master, because in almost all cases data comes from multiple sources and files. You may need to bring all the data into one place with some sort of join logic before you start your analysis. People who work with SQL-like query languages will know the importance of this task. Even if you only want to build machine learning models on some data, you may need to merge multiple CSV files into a single DataFrame.

Thankfully, you have one of the most popular libraries in Python, pandas, to your rescue! pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

In this tutorial, you will practice a few standard techniques. More specifically, you will learn to:

  • Concatenate DataFrames along row and column.
  • Merge DataFrames on specific keys by different join logics like left-join, inner-join, etc.
  • Use the time-series friendly merging provided in pandas.

Along the way, you will also learn a few tricks which you require before and after joining.

Concatenate DataFrames

Start by importing the library you will be using throughout the tutorial: pandas

import pandas as pd

You will perform all the operations in this tutorial on dummy DataFrames that you create yourself. To create a DataFrame, you can start from a Python dictionary like this:

dummy_data1 = {
        'id': ['1', '2', '3', '4', '5'],
        'Feature1': ['A', 'C', 'E', 'G', 'I'],
        'Feature2': ['B', 'D', 'F', 'H', 'J']}

Here the keys of the dictionary dummy_data1 are the column names and the values in the list are the data corresponding to each observation or row. To transform this into a pandas DataFrame, you will use the DataFrame() function of pandas, along with its columns argument to name your columns:

df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])

df1
  id Feature1 Feature2
0  1        A        B
1  2        C        D
2  3        E        F
3  4        G        H
4  5        I        J

As you can see, you now have a DataFrame with 3 columns: id, Feature1, and Feature2. There is an additional unnamed column which pandas creates automatically as the row labels (the index). Similar to the previous DataFrame df1, you will create two more DataFrames, df2 and df3:

dummy_data2 = {
        'id': ['1', '2', '6', '7', '8'],
        'Feature1': ['K', 'M', 'O', 'Q', 'S'],
        'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])

df2
  id Feature1 Feature2
0  1        K        L
1  2        M        N
2  6        O        P
3  7        Q        R
4  8        S        T

dummy_data3 = {
        'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])

df3
   id  Feature3
0   1        12
1   2        13
2   3        14
3   4        15
4   5        16
5   7        17
6   8        15
7   9        12
8  10        13
9  11        23

To simply concatenate the DataFrames along the row you can use the concat() function in pandas. You will have to pass the names of the DataFrames in a list as the argument to the concat() function:

df_row = pd.concat([df1, df2])

df_row
  id Feature1 Feature2
0  1        A        B
1  2        C        D
2  3        E        F
3  4        G        H
4  5        I        J
0  1        K        L
1  2        M        N
2  6        O        P
3  7        Q        R
4  8        S        T

You can see that the two DataFrames df1 and df2 are now concatenated into a single DataFrame df_row along the row. However, the row labels look wrong: each DataFrame kept its own labels, so 0 to 4 appear twice! If you want the row labels to be renumbered after the concatenation, set the argument ignore_index to True when calling the concat() function:

df_row_reindex = pd.concat([df1, df2], ignore_index=True)

df_row_reindex
  id Feature1 Feature2
0  1        A        B
1  2        C        D
2  3        E        F
3  4        G        H
4  5        I        J
5  1        K        L
6  2        M        N
7  6        O        P
8  7        Q        R
9  8        S        T

Now the row labels are correct!
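If you would rather be warned about repeated row labels than renumber them silently, concat() also accepts a verify_integrity argument. The following is a minimal sketch on two small made-up frames (df_a and df_b are illustrative names, not part of the tutorial data):

```python
import pandas as pd

# Two small illustrative frames; both get the default row labels 0 and 1.
df_a = pd.DataFrame({'id': ['1', '2'], 'Feature1': ['A', 'C']})
df_b = pd.DataFrame({'id': ['6', '7'], 'Feature1': ['O', 'Q']})

# Without ignore_index, the labels 0 and 1 each appear twice.
stacked = pd.concat([df_a, df_b])
has_duplicate_labels = stacked.index.has_duplicates

# verify_integrity=True makes concat() raise a ValueError instead of
# returning a result with duplicate row labels.
try:
    pd.concat([df_a, df_b], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# ignore_index=True builds a fresh 0..n-1 index.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(has_duplicate_labels, raised, list(renumbered.index))
```

The check costs extra time on large inputs, so it is probably best kept for cases where the row labels actually carry meaning.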

pandas also lets you label the DataFrames with a key during the concatenation, so that you know which rows came from which DataFrame. You can achieve this by passing the additional argument keys, a list holding one label per DataFrame. Here you will perform the same concatenation with the keys x and y for the DataFrames df1 and df2 respectively.

frames = [df1,df2]
df_keys = pd.concat(frames, keys=['x', 'y'])

df_keys
    id Feature1 Feature2
x 0  1        A        B
  1  2        C        D
  2  3        E        F
  3  4        G        H
  4  5        I        J
y 0  1        K        L
  1  2        M        N
  2  6        O        P
  3  7        Q        R
  4  8        S        T

Specifying the keys also makes it easy to retrieve the data belonging to a particular DataFrame. You can retrieve the data of DataFrame df2, which has the label y, by using the loc indexer:

df_keys.loc['y']
  id Feature1 Feature2
0  1        K        L
1  2        M        N
2  6        O        P
3  7        Q        R
4  8        S        T
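Sometimes it is handier to keep the key as an ordinary column rather than as an index level. One possible way, sketched here on small made-up frames, is to name the index levels with the names argument of concat() and then move the outer level into a column with reset_index(). The level names source and row are arbitrary choices for this example:

```python
import pandas as pd

# Small illustrative frames (not the tutorial's df1 and df2).
df_a = pd.DataFrame({'id': ['1', '2'], 'Feature1': ['A', 'C']})
df_b = pd.DataFrame({'id': ['1', '6'], 'Feature1': ['K', 'O']})

# names= labels the two index levels that keys= creates.
labelled = pd.concat([df_a, df_b], keys=['x', 'y'], names=['source', 'row'])

# Selecting by the outer key still works as before.
only_y = labelled.loc['y']

# Move the outer level into a regular column, then renumber the rows.
flat = labelled.reset_index(level='source').reset_index(drop=True)

print(flat)
```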

You can also pass a dictionary to concat(), in which case the dictionary keys will be used for the keys argument (unless other keys are specified):

pieces = {'x': df1, 'y': df2}

df_piece = pd.concat(pieces)

df_piece
    id Feature1 Feature2
x 0  1        A        B
  1  2        C        D
  2  3        E        F
  3  4        G        H
  4  5        I        J
y 0  1        K        L
  1  2        M        N
  2  6        O        P
  3  7        Q        R
  4  8        S        T

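It is worth noting that concat() makes a full copy of the data, so calling it repeatedly (for example, inside a loop that grows a DataFrame one piece at a time) can cause a significant performance hit. If you need to combine several datasets, build a list of DataFrames first, for instance with a list comprehension, and concatenate once: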

frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
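The snippet above leaves files and process_your_file undefined. A runnable sketch might look like the following, where two in-memory CSV buffers stand in for files on disk and process_your_file is a hypothetical loader written just for this example:

```python
import io
import pandas as pd

# Stand-ins for files on disk (assumed contents, for illustration only).
files = [
    io.StringIO("id,Feature1\n1,A\n2,C\n"),
    io.StringIO("id,Feature1\n6,O\n7,Q\n"),
]

def process_your_file(f):
    # Hypothetical loader: read one CSV and keep the id column as strings.
    return pd.read_csv(f, dtype={'id': str})

# Build every piece first, then concatenate exactly once.
frames = [process_your_file(f) for f in files]
result = pd.concat(frames, ignore_index=True)

print(result)
```

With real data you would pass file paths instead of StringIO buffers; the pattern of collecting the pieces and calling concat() a single time stays the same.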

To concatenate DataFrames along the columns, you can set the axis parameter to 1:

df_col = pd.concat([df1,df2], axis=1)

df_col
  id Feature1 Feature2 id Feature1 Feature2
0  1        A        B  1        K        L
1  2        C        D  2        M        N
2  3        E        F  6        O        P
3  4        G        H  7        Q        R
4  5        I        J  8        S        T
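Keep in mind that column-wise concatenation lines rows up by their row labels, not by their position. In the output above the labels happen to be identical (0 to 4), so the rows pair up neatly. The sketch below, on two made-up frames whose labels only partly overlap, shows what happens otherwise and how the join argument of concat() changes the result:

```python
import pandas as pd

# Illustrative frames whose row labels only partly overlap.
left = pd.DataFrame({'Feature1': ['A', 'C', 'E']}, index=[0, 1, 2])
right = pd.DataFrame({'Feature3': [12, 13, 14]}, index=[1, 2, 3])

# Default join='outer': union of the labels, NaN where a frame has no such row.
wide_outer = pd.concat([left, right], axis=1)

# join='inner': keep only the labels present in every frame.
wide_inner = pd.concat([left, right], axis=1, join='inner')

print(wide_outer)
print(wide_inner)
```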

Merge DataFrames

Another ubiquitous operation on DataFrames is merging. Two DataFrames might hold different kinds of information about the same entity and be linked by some common feature/column. To join such DataFrames, pandas provides multiple functions like concat(), merge(), join(), etc. In this section, you will practice using the merge() function of pandas.

You can join the DataFrames df_row (which you created by concatenating df1 and df2 along the row) and df3 on the common column (or key) id. To do so, pass the DataFrames to the merge() function together with the additional argument on, set to the name of the common column, here id:

df_merge_col = pd.merge(df_row, df3, on='id')

df_merge_col
  id Feature1 Feature2  Feature3
0  1        A        B        12
1  1        K        L        12
2  2        C        D        13
3  2        M        N        13
4  3        E        F        14
5  4        G        H        15
6  5        I        J        16
7  7        Q        R        17
8  8        S        T        15

You can see that the DataFrames are now merged into a single DataFrame based on the values of the id column that are present in both DataFrames. For example, the id value 1 was present with both A, B and K, L in the DataFrame df_row, so this id appears twice in the final DataFrame df_merge_col, each time with the value 12 of Feature3, which came from the DataFrame df3. Rows whose id appears in only one of the two DataFrames (6 in df_row; 9, 10 and 11 in df3) are dropped.
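Repeated keys like this are easy to overlook and can silently multiply rows. As a hedged illustration, the validate argument of merge() lets you state which kind of key relationship you expect and raises an error when the data disagrees. The frames below are small made-up stand-ins for df_row and df3:

```python
import pandas as pd

# id '1' appears twice on the left, once on the right.
left = pd.DataFrame({'id': ['1', '1', '2'], 'Feature1': ['A', 'K', 'C']})
right = pd.DataFrame({'id': ['1', '2'], 'Feature3': [12, 13]})

# many_to_one: ids may repeat on the left but must be unique on the right.
merged = pd.merge(left, right, on='id', validate='many_to_one')

# one_to_one fails here because id '1' is not unique on the left.
try:
    pd.merge(left, right, on='id', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(merged), one_to_one_ok)
```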

It might happen that the columns on which you want to merge the DataFrames have different names (unlike in this case). For such merges, you have to set left_on to the name of the key column in the left DataFrame and right_on to the name of the key column in the right DataFrame, like:

df_merge_difkey = pd.merge(df_row, df3, left_on='id', right_on='id')

df_merge_difkey
  id Feature1 Feature2  Feature3
0  1        A        B        12
1  1        K        L        12
2  2        C        D        13
3  2        M        N        13
4  3        E        F        14
5  4        G        H        15
6  5        I        J        16
7  7        Q        R        17
8  8        S        T        15
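Because both key columns are called id in the example above, left_on and right_on give the same result as on='id'. The following sketch uses a made-up second frame whose key column is named record_id (a hypothetical name) to show the case where the two arguments really differ:

```python
import pandas as pd

features = pd.DataFrame({'id': ['1', '2', '3'],
                         'Feature1': ['A', 'C', 'E']})
# Same kind of key, but stored under a different (made-up) column name.
scores = pd.DataFrame({'record_id': ['1', '2', '4'],
                       'Feature3': [12, 13, 15]})

merged = pd.merge(features, scores, left_on='id', right_on='record_id')

# Both key columns survive the merge; drop the redundant one.
merged = merged.drop(columns='record_id')

print(merged)
```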

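You can also append rows to a DataFrame. Earlier pandas versions offered an append() method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0. Instead, turn the Series (or dict) into a one-row DataFrame and pass it to concat(), which returns a new DataFrame:

add_row = pd.Series(['10', 'X1', 'X2', 'X3'],
                    index=['id','Feature1', 'Feature2', 'Feature3'])

# to_frame().T turns the Series into a single-row DataFrame
df_add_row = pd.concat([df_merge_col, add_row.to_frame().T], ignore_index=True)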

df_add_row
   id Feature1 Feature2 Feature3
0   1        A        B       12
1   1        K        L       12
2   2        C        D       13
3   2        M        N       13
4   3        E        F       14
5   4        G        H       15
6   5        I        J       16
7   7        Q        R       17
8   8        S        T       15
9  10       X1       X2       X3

Join DataFrames

In this section, you will practice the various join logics available to merge pandas DataFrames based on some common column/key. The logic behind these joins is very much the same that you have in SQL when you join tables.

Full Outer Join

The FULL OUTER JOIN combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaNs for missing matches on either side. You can perform a full outer join by specifying the how argument as outer in the merge() function:

df_outer = pd.merge(df1, df2, on='id', how='outer')

df_outer
  id Feature1_x Feature2_x Feature1_y Feature2_y
0  1          A          B          K          L
1  2          C          D          M          N
2  3          E          F        NaN        NaN
3  4          G          H        NaN        NaN
4  5          I          J        NaN        NaN
5  6        NaN        NaN          O          P
6  7        NaN        NaN          Q          R
7  8        NaN        NaN          S          T

You can see that the resulting DataFrame has all the entries from both tables, with NaN values for missing matches on either side. One more thing to notice is the suffix that was appended to the overlapping column names to show which column came from which DataFrame. The default suffixes are _x and _y; you can change them by specifying the suffixes argument in the merge() function:

df_suffix = pd.merge(df1, df2, left_on='id',right_on='id',how='outer',suffixes=('_left','_right'))

df_suffix
  id Feature1_left Feature2_left Feature1_right Feature2_right
0  1             A             B              K              L
1  2             C             D              M              N
2  3             E             F            NaN            NaN
3  4             G             H            NaN            NaN
4  5             I             J            NaN            NaN
5  6           NaN           NaN              O              P
6  7           NaN           NaN              Q              R
7  8           NaN           NaN              S              T
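After an outer join it is often useful to know which side each row came from without inspecting the NaN values by hand. One option is the indicator argument of merge(), sketched here on small made-up frames:

```python
import pandas as pd

df_a = pd.DataFrame({'id': ['1', '2', '3'], 'Feature1': ['A', 'C', 'E']})
df_b = pd.DataFrame({'id': ['1', '2', '6'], 'Feature1': ['K', 'M', 'O']})

# indicator=True adds a _merge column holding
# 'left_only', 'right_only' or 'both' for every row.
tagged = pd.merge(df_a, df_b, on='id', how='outer', indicator=True)

n_both = int((tagged['_merge'] == 'both').sum())
n_left_only = int((tagged['_merge'] == 'left_only').sum())
n_right_only = int((tagged['_merge'] == 'right_only').sum())

print(tagged)
```

Filtering on the _merge column is also a simple way to list the rows that failed to match.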

Inner Join

The INNER JOIN produces only the set of records that match in both DataFrame A (left DataFrame) and DataFrame B (right DataFrame). Pass inner as the how argument of the merge() function to perform an inner join (this is also the default when how is omitted):

df_inner = pd.merge(df1, df2, on='id', how='inner')

df_inner
  id Feature1_x Feature2_x Feature1_y Feature2_y
0  1          A          B          K          L
1  2          C          D          M          N

Right Join

The RIGHT JOIN produces a complete set of records from DataFrame B (right DataFrame), with the matching records (where available) from DataFrame A (left DataFrame). If there is no match, the columns coming from the left DataFrame will contain NaN. Pass right as the how argument of the merge() function to perform a right join:

df_right = pd.merge(df1, df2, on='id', how='right')

df_right
  id Feature1_x Feature2_x Feature1_y Feature2_y
0  1          A          B          K          L
1  2          C          D          M          N
2  6        NaN        NaN          O          P
3  7        NaN        NaN          Q          R
4  8        NaN        NaN          S          T

Left Join

The LEFT JOIN produces a complete set of records from DataFrame A (left DataFrame), with the matching records (where available) from DataFrame B (right DataFrame). If there is no match, the columns coming from the right DataFrame will contain NaN. Pass left as the how argument of the merge() function to perform a left join:

df_left = pd.merge(df1, df2, on='id', how='left')

df_left
  id Feature1_x Feature2_x Feature1_y Feature2_y
0  1          A          B          K          L
1  2          C          D          M          N
2  3          E          F        NaN        NaN
3  4          G          H        NaN        NaN
4  5          I          J        NaN        NaN
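To compare the four join types side by side, you can run the same merge with each value of how and count the resulting rows. This is a small sketch that rebuilds only the id columns of df1 and df2 so that it runs on its own:

```python
import pandas as pd

# Same id values as df1 and df2 in this tutorial.
df_a = pd.DataFrame({'id': ['1', '2', '3', '4', '5']})
df_b = pd.DataFrame({'id': ['1', '2', '6', '7', '8']})

# Two ids are shared, three are unique to each side.
row_counts = {
    how: len(pd.merge(df_a, df_b, on='id', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}

print(row_counts)
# → {'inner': 2, 'left': 5, 'right': 5, 'outer': 8}
```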

Joining on index

Sometimes you may have to perform the join on the indexes or the row labels. To do so, you have to specify right_index (for the indexes of the right DataFrame) and left_index (for the indexes of the left DataFrame) as True :

df_index = pd.merge(df1, df2, right_index=True, left_index=True)

df_index
  id_x Feature1_x Feature2_x id_y Feature1_y Feature2_y
0    1          A          B    1          K          L
1    2          C          D    2          M          N
2    3          E          F    6          O          P
3    4          G          H    7          Q          R
4    5          I          J    8          S          T
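For index-based joins, pandas also offers the DataFrame.join() method, which aligns on the row labels and performs a left join by default. The sketch below uses small made-up frames; note that join() does not add suffixes on its own, so overlapping column names need lsuffix and rsuffix:

```python
import pandas as pd

df_a = pd.DataFrame({'id': ['1', '2', '3'], 'Feature1': ['A', 'C', 'E']})
df_b = pd.DataFrame({'id': ['1', '2', '6'], 'Feature1': ['K', 'M', 'O']})

# Join on the default row labels 0, 1, 2 (a left join unless how= says otherwise).
by_label = df_a.join(df_b, lsuffix='_x', rsuffix='_y')

# To join on the id values instead, make id the index on both sides first.
by_id = df_a.set_index('id').join(
    df_b.set_index('id'), lsuffix='_x', rsuffix='_y', how='inner')

print(by_label)
print(by_id)
```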

Time-series friendly merging

pandas provides special functions for merging time-series DataFrames. Perhaps the most useful and popular one is the merge_asof() function. merge_asof() is similar to an ordered left join, except that you match on the nearest key rather than on equal keys. For each row in the left DataFrame, you select the last row in the right DataFrame whose on key is less than or equal to the left’s key (exact matches are allowed by default). Both DataFrames must be sorted by the key.
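The matching rule is easiest to see on plain integers before moving on to timestamps. Here is a minimal sketch with made-up frames and column names (t, left_val, right_val), which also shows the allow_exact_matches argument that excludes equal keys:

```python
import pandas as pd

# Both frames are already sorted by the key column t.
left = pd.DataFrame({'t': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'t': [2, 5, 7], 'right_val': [20, 50, 70]})

# Default: take the last right row whose t is <= the left row's t.
# t=1 has no earlier right row (NaN), t=5 matches 5, t=10 matches 7.
matched = pd.merge_asof(left, right, on='t')

# allow_exact_matches=False requires a strictly smaller key,
# so t=5 now falls back to the right row with t=2.
strict = pd.merge_asof(left, right, on='t', allow_exact_matches=False)

print(matched)
print(strict)
```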

Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in addition to the nearest match on the on key.

For example, you might have trades and quotes, and you want to asof merge them. Here the left DataFrame is chosen as trades and right DataFrame as quotes. They are asof merged on key time and group-wise merged by their ticker symbol.

trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.038',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
    'price': [51.95, 51.95,720.77, 720.92, 98.00],
    'quantity': [75, 155,100, 100, 100]},
    columns=['time', 'ticker', 'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.030',
                            '20160525 13:30:00.041',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.049',
                            '20160525 13:30:00.072',
                            '20160525 13:30:00.075']),
    'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
    'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
    'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
    columns=['time', 'ticker', 'bid', 'ask'])

trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

df_merge_asof = pd.merge_asof(trades, quotes,
              on='time',
              by='ticker')

df_merge_asof
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

If you look carefully, you can see the reason behind the NaN values in the AAPL row. The right DataFrame quotes has no AAPL quote at or before 13:30:00.048 (the time of the trade in the left table); its only AAPL quote arrives at 13:30:00.049. With nothing to match, NaNs were introduced in the bid and ask columns.
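By default merge_asof() only looks backward in time. If a later quote is acceptable for your use case, the direction argument can be set to 'forward' (or 'nearest'). The sketch below rebuilds only the AAPL trade and quote from the example so that it runs on its own:

```python
import pandas as pd

# Only the AAPL rows from the trades and quotes example.
trades = pd.DataFrame({
    'time': pd.to_datetime(['2016-05-25 13:30:00.048']),
    'ticker': ['AAPL'],
    'price': [98.00]})
quotes = pd.DataFrame({
    'time': pd.to_datetime(['2016-05-25 13:30:00.049']),
    'ticker': ['AAPL'],
    'bid': [97.99],
    'ask': [98.01]})

# Default direction='backward': no quote at or before the trade, so NaN.
backward = pd.merge_asof(trades, quotes, on='time', by='ticker')

# direction='forward': take the first quote at or after the trade.
forward = pd.merge_asof(trades, quotes, on='time', by='ticker',
                        direction='forward')

print(backward)
print(forward)
```

Whether matching a trade to a later quote makes sense depends on the analysis, so treat this as an illustration of the argument rather than a recommendation.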

You can also set a tolerance for the time column. Suppose you only want to asof merge a trade with a quote that is at most 2ms older; then you have to specify the tolerance argument:

df_merge_asof_tolerance = pd.merge_asof(trades, quotes,
              on='time',
              by='ticker',
              tolerance=pd.Timedelta('2ms'))

df_merge_asof_tolerance
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

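Notice the difference between this and the previous result. A row is left unmatched if the latest earlier quote is more than 2ms away: the MSFT trade at 13:30:00.038 follows the latest MSFT quote (13:30:00.030) by 8ms, so its bid and ask are now NaN.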

Conclusion

Hurray! You have come to the end of the tutorial. In this tutorial, you learned to concatenate and merge DataFrames with several kinds of join logic using the concat() and merge() functions of the pandas library. Towards the end, you also practiced the special function merge_asof() for merging time-series DataFrames. Along the way, you also learned to work with the indexes of the DataFrames. There are several other options you can explore for joining DataFrames in pandas, and I encourage you to look at its fantastic documentation. Happy exploring!

Source:
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas
