Datetime range filter in PySpark SQL
How do I correctly filter a data frame on a timestamp field?

I have tried different date formats and forms of filtering, but nothing helps: either pyspark returns 0 rows, or it throws an error saying it does not understand the datetime format.

Here is what I have so far:
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from django.utils import timezone
    from django.conf import settings
    from myapp.models import Collection

    sc = SparkContext("local", "DjangoApp")
    sqlc = SQLContext(sc)
    url = "jdbc:postgresql://%(HOST)s/%(NAME)s?user=%(USER)s&password=%(PASSWORD)s" % settings.DATABASES['default']
    sf = sqlc.load(source="jdbc", url=url, dbtable='myapp_collection')
The range for the timestamp field:
    import datetime

    system_tz = timezone.pytz.timezone(settings.TIME_ZONE)
    date_from = datetime.datetime(2014, 4, 16, 18, 30, 0, 0, tzinfo=system_tz)
    date_to = datetime.datetime(2015, 6, 15, 18, 11, 59, 999999, tzinfo=system_tz)
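As an aside worth double-checking (it may or may not relate to the problem here): pytz's documentation warns against attaching a pytz zone through the `tzinfo=` constructor argument, because that attaches the zone's first historical offset (often LMT) rather than the offset in force at that time; `localize()` is the documented way. A minimal illustration, assuming pytz is installed and using `Europe/Moscow` purely as an example zone:

```python
import datetime
import pytz

tz = pytz.timezone("Europe/Moscow")

# Passing tzinfo= in the constructor attaches an early historical (LMT-style) offset
wrong = datetime.datetime(2014, 4, 16, 18, 30, tzinfo=tz)

# localize() picks the offset that was actually in force at that wall-clock time
right = tz.localize(datetime.datetime(2014, 4, 16, 18, 30))

print(wrong.utcoffset())  # an LMT-style offset (around 2:30)
print(right.utcoffset())  # 4:00:00 (Moscow in April 2014)
```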
Attempt 1:
    date_filter = "my_col >= '%s' AND my_col <= '%s'" % (
        date_from.isoformat(), date_to.isoformat()
    )
    sf = sf.filter(date_filter)
    sf.count()
    # Out: 0
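When debugging attempt 1, it can help to look at the exact strings `isoformat()` interpolates into the filter: it uses a `T` separator and, for timezone-aware datetimes, appends a UTC offset, so the SQL literal may not be in the form the database side expects. A quick check with only the standard library (using a fixed `+04:00` offset for illustration):

```python
import datetime

naive = datetime.datetime(2014, 4, 16, 18, 30, 0, 0)
print(naive.isoformat())  # 2014-04-16T18:30:00

aware = datetime.datetime(2014, 4, 16, 18, 30,
                          tzinfo=datetime.timezone(datetime.timedelta(hours=4)))
print(aware.isoformat())  # 2014-04-16T18:30:00+04:00
```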
Attempt 2:
    sf = sf.filter(sf.my_col >= date_from).filter(sf.my_col <= date_to)
    sf.count()
    # raises:
    # org.postgresql.util.PSQLException: ERROR: syntax error at or near "18"
Attempt 3:
    sf = sf.filter("my_col BETWEEN '%s' AND '%s'" % (
        date_from.isoformat(), date_to.isoformat()
    ))
    sf.count()
    # raises:
    # org.postgresql.util.PSQLException: ERROR: syntax error at or near "18"
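One variation I have been considering (an untested sketch, assuming `my_col` is a plain PostgreSQL `timestamp` column): format the bounds with `strftime` into the space-separated form PostgreSQL prints by default, with no `T` separator and no UTC offset, and pass that string to `filter` instead of the `isoformat()` output:

```python
import datetime

date_from = datetime.datetime(2014, 4, 16, 18, 30, 0, 0)
date_to = datetime.datetime(2015, 6, 15, 18, 11, 59, 999999)

# Space-separated timestamp literals, microseconds included
fmt = "%Y-%m-%d %H:%M:%S.%f"
date_filter = "my_col >= '%s' AND my_col <= '%s'" % (
    date_from.strftime(fmt),
    date_to.strftime(fmt),
)
print(date_filter)
# my_col >= '2014-04-16 18:30:00.000000' AND my_col <= '2015-06-15 18:11:59.999999'
```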
Yet the data does exist in the table:
    django_filters = {
        'my_col__gte': date_from,
        'my_col__lte': date_to,
    }
    Collection.objects.filter(**django_filters).count()
    # Out: 1093436
Or equivalently:
    django_range_filter = {'my_col__range': (date_from, date_to)}
    Collection.objects.filter(**django_range_filter).count()
    # Out: 1093436