
Handling exceptions with df.apply


I am using the Python tld library with the apply function to extract the first-level domain from proxy request logs.

When I hit a strange request that tld does not know how to handle, such as 'http:1 CON' or 'http:/login.cgi%00', I get the following error:
TldBadUrl: Is not a valid URL http:1 con!
TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, 
fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389 

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292 
    293     domain_parts = domain_name.split('.')


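In the meantime, I have been adding one line per bad value to filter them out, as in the code below, but this dataset contains hundreds or thousands of them: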
request_2 = request_1[request_1['request'] != 'http:1 CON']
request_2 = request_2[request_2['request'] != 'http:/login.cgi%00']


DataFrame:
request
request_url                                    count
0 https://login.microsoftonline.com            24521
1 https://dt.adsafeprotected.com               11521
2 https://googleads.g.doubleclick.net          6252
3 https://fls-na.amazon.com                    65225
4 https://v10.vortex-win.data.microsoft.com    7852222
5 https://ib.adnxs.com                         12


Code:
from tld import get_tld
from tld import get_fld
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import numpy as np

request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
# Remove rows where the request column is null
request = request[pd.notnull(request['request'])]
# Reset the index (reset_index returns a new DataFrame, so assign it back)
request = request.reset_index(drop=True)
# Find the URLs that contain IP addresses and exclude them from the new DataFrame
request_1 = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
# Reset the index
request_1 = request_1.reset_index(drop=True)
# Apply get_fld to the request column (request_2 is request_1 after the manual filtering shown above)
new_fld_column = request_2['request'].apply(get_fld)

Is there any way to keep this error from being raised, and instead collect the rows that would fail into a separate DataFrame?

Solution


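If you wrap the function in a try-except clause, the failing rows come back as NaN, and you can find out which rows errored by querying for those NaN values: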
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    # Return the first-level domain, or NaN when tld rejects the URL
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

print(df)
                                 request_url    count
0          https://login.microsoftonline.com    24521
1             https://dt.adsafeprotected.com    11521
2        https://googleads.g.doubleclick.net     6252
3                  https://fls-na.amazon.com    65225
4  https://v10.vortex-win.data.microsoft.com  7852222
5                       https://ib.adnxs.com       12
6                                 http:1 CON       10
7                         http:/login.cgi%00      200

df['flds'] = df['request_url'].apply(try_get_fld)
print(df['flds'])
0    microsoftonline.com
1    adsafeprotected.com
2        doubleclick.net
3             amazon.com
4          microsoft.com
5              adnxs.com
6                    NaN
7                    NaN
Name: flds, dtype: object

faulty_url_df = df[df['flds'].isna()]
print(faulty_url_df)

          request_url  count flds
6          http:1 CON     10  NaN
7  http:/login.cgi%00    200  NaN
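The same try-except idea can be factored into a reusable decorator, so that any parsing function can be made safe before it is handed to apply. The sketch below is illustrative only: `fallback_on` and `hostname_of` are hypothetical names, and `hostname_of` is a standard-library stand-in for `get_fld` that returns the hostname (not the first-level domain) and raises `ValueError` for malformed entries.

```python
import functools
import math
from urllib.parse import urlsplit


def fallback_on(exc_types, default=math.nan):
    """Return a decorator that catches `exc_types` and returns `default` instead."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exc_types:
                return default
        return wrapper
    return decorator


# Hypothetical stand-in for tld.get_fld (an assumption for this sketch):
# it raises ValueError when the URL has no hostname, which is the case
# for malformed entries such as 'http:1 CON'.
@fallback_on(ValueError, default=None)
def hostname_of(url):
    host = urlsplit(url).hostname
    if not host:
        raise ValueError("not a valid URL: %r" % url)
    return host


urls = ["https://ib.adnxs.com", "http:1 CON", "http:/login.cgi%00"]
results = {url: hostname_of(url) for url in urls}

# Partition the inputs into parsed entries and the ones that failed.
good = {url: host for url, host in results.items() if host is not None}
bad = [url for url, host in results.items() if host is None]
```

With pandas, a function wrapped this way would be passed to apply in the same way as `try_get_fld`, for example by decorating a thin wrapper around `get_fld` with `fallback_on(tld.exceptions.TldBadUrl)`. The traceback in the question also shows that `get_fld` accepts a `fail_silently` parameter; assuming it behaves as the name suggests, passing `fail_silently=True` through apply may remove the need for a wrapper altogether.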
