How to parse a freeform street/postal address out of text and into components


136

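We do business largely in the United States and are trying to improve the user experience by combining all the address fields into a single text area. But there are a few problems:

  • The address the user types may not be correct or in a standard format
  • The address must be separated into parts (street, city, state, etc.) to process credit card payments
  • Users may enter more than just their address (such as their name or company name along with it)
  • Google can do this, but its Terms of Service and query limits are prohibitive, especially on a tight budget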

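Apparently, this is a common question:

Is there a way to isolate an address from the text around it and break it into pieces? Is there a regular expression to parse addresses?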


The answers below are more useful because they don't ignore the global problem: addresses that don't fit common patterns.
Marc Maxmeister

Answers:


290

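I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers who are searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.

First, we need to understand a few things about addresses.

Addresses are not regular

This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this: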

/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|\,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|\,|.|\;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|\,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|\,|.]+)?(\s+\d{5})?([\s|\,|.]+)/i

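... to one where a class file of 900+ lines generates a supermassive regular expression on the fly to match even more. I don't recommend these (the regex above, for example, makes plenty of mistakes). There isn't an easy magic formula to get this to work. In theory, and by theory, it's not possible to match addresses with a regular expression.

USPS Publication 28 documents the many possible address formats, with all their keywords and permutations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street"), and there are words I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)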
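To make the "St" ambiguity concrete, here is a minimal Python sketch of a purely positional rule. The function name `expand_st` and the rule itself are assumptions made up for illustration, not anything from Publication 28; the point is only that a rule this simple gets one address right and the next one wrong.

```python
def expand_st(address: str) -> str:
    """Naive rule: a trailing 'St' means 'Street'; any other 'St' means 'Saint'."""
    tokens = address.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token.rstrip(".").lower() == "st":
            # Position is the only evidence this rule looks at.
            expanded.append("Street" if i == len(tokens) - 1 else "Saint")
        else:
            expanded.append(token)
    return " ".join(expanded)


print(expand_st("123 St Marks St"))    # 123 Saint Marks Street
print(expand_st("100 Main St Apt 4"))  # 100 Main Saint Apt 4 (wrong: this "St" is a street)
```

Patching the second case (say, by checking for a unit designator after "St") just moves the failure somewhere else, which is why the ambiguity usually ends up being resolved against reference data rather than by rules alone.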

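You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own if you're really into that.

Addresses come in unexpected shapes and sizes

Here are some contrived (but complete) addresses: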

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

Even these are possibly valid:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345

5)  205 1105 14 90210

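Obviously, these are not standardized. Punctuation and line breaks are not guaranteed. Here's what's going on:

  1. Number 1 is complete because it contains a street address plus a city and state. With that information, there's enough to identify the address, and it can be considered "deliverable" (with some standardization).

  2. Number 2 is complete because it also contains a street address (with a secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

  3. Number 3 is a complete post office box format, as it contains a ZIP code.

  4. Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.

  5. Believe it or not, number 5 is also complete. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you treat each number as a component. Here's what it looks like fully expanded and standardized:

205 N 1105 W Apt 14

Beverly Hills CA 90210-5221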
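As a rough illustration of why format-specific patterns fall short, the Python sketch below runs one deliberately simple regex over the five example addresses. The pattern `NAIVE` and the helper `naive_matches` are hypothetical: they expect "number street-name suffix, city, ST 12345" and are not taken from any real parser.

```python
import re

# Hypothetical pattern for one tidy format: "123 Main St, Springfield, IL 62701".
NAIVE = re.compile(
    r"^\d+\s+[A-Za-z ]+\s+(street|st|ave|rd|dr)\.?,\s*[A-Za-z ]+,\s*[A-Z]{2}\s+\d{5}$",
    re.IGNORECASE,
)

# The five addresses from the list above, in order.
EXAMPLES = [
    "102 main street\nAnytown, state",
    "400n 600e #2, 52173",
    "p.o. #104 60203",
    "829 LKSDFJlkjsdflkjsdljf Bkpw 12345",
    "205 1105 14 90210",
]


def naive_matches(text: str) -> bool:
    """Collapse whitespace and line breaks, then test against the single pattern."""
    return NAIVE.match(" ".join(text.split())) is not None


for example in EXAMPLES:
    print(repr(example), naive_matches(example))
```

The pattern accepts a tidy address such as "123 Main St, Springfield, IL 62701" but rejects all five examples, even though each of them is complete enough to be delivered.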

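Address data is not your own

In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true of Canada Post, Royal Mail, and others, though each country enforces or defines ownership a little differently. Knowing this is important, since it usually forbids reverse-engineering the address database. You have to be careful how you acquire, store, and use the data.

Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive. For example, you can't use their data or APIs without showing a Google Map, you can use them for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street; use a house number that you know doesn't exist). This is useful sometimes, but be aware of it.

Nominatim's usage policy is similarly limiting, especially for high-volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects). Still, it may suit your needs. It is supported by a great community.

The USPS itself has an API, but it goes down a lot and comes with no guarantees or support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires you to use their API only for confirming addresses you will ship to through them.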

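People expect addresses to be hard

Unfortunately, we've conditioned our society to expect addresses to be complicated. There are dozens of good UX articles about this all over the Internet, but the fact is, if you have an address form with individual fields, that's what users expect, even though it makes things harder for edge-case addresses that don't fit the format the form is expecting, or when the form requires a field it shouldn't. Or users don't know where to put a certain part of their address.

I could go on and on about the bad UX of checkout forms these days, but instead I'll just say that combining the address into a single field will be a welcome change: people will be able to type their address as they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected, and users may find it a little jarring at first. Just be aware of that.

Part of this pain can be alleviated by putting the country field out front, before the address. When users fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select United States, you can reduce your form to a single field, and otherwise show the component fields. Just things to think about!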
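The country-first idea could look something like the Python sketch below. The field names and the function `address_fields_for` are assumptions chosen for illustration; a real form would have its own field set per country.

```python
from typing import List

# Hypothetical field names, chosen for illustration only.
US_SINGLE_FIELD = ["address"]
COMPONENT_FIELDS = ["street", "city", "state_or_region", "postal_code"]


def address_fields_for(country_code: str) -> List[str]:
    """Pick which address inputs to render once the country is known."""
    if country_code.strip().upper() == "US":
        # Freeform entry; the server parses and verifies it afterwards.
        return list(US_SINGLE_FIELD)
    # No single-field parser available for this country: fall back to components.
    return list(COMPONENT_FIELDS)


print(address_fields_for("us"))
print(address_fields_for("UY"))
```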

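Now we know why it's hard. What can you do about it?

The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, which is updated monthly. Their software must conform to rigorous standards to be certified, and they don't often require agreement to such limiting terms as those discussed above.

There are many CASS-Certified companies that can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets, to name a few.

(Due to getting flak for "advertising", I've truncated my answer here. It's up to you to find a solution that works for you.)

The truth: really, folks, I don't work at any of these companies. It's not an advertisement.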


1
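What about addresses in South America (Uruguay)? :D
Bart Calixto

11
@Brian - Perhaps because the user has provided a lot of useful information for those who read the question and answer, regardless of whether they choose to use his company's product.
Zarepheth 2014

7
@Brian Those sites are content scrapers. They are trying to gain SERP rankings. I've never seen them before. I've never posted this anywhere else, before or after.
Matt

2
@khuderm I just noticed, when I read your comment, that all the critical comments have disappeared; I don't know how or when that happened. Anyway, check the edit history of my answer and you'll find a direct reference to a US address extractor that may help you. I created it while at my previous job, but it is proprietary code, so I can't share it... but they do exist. Hope that helps.
Matt

2
Oops. Sorry @Matt. Well, I've started following your questions and also your GitHub. You are really impressive.
Sayka 2016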

28

libpostal: an open-source library to parse addresses, trained with data from OpenStreetMap, OpenAddresses, and OpenCage.

More information about it: https://github.com/openvenues/libpostal

Other tools/services:


13

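There are many street address parsers. They come in two basic flavors: those that have databases of place names and street names, and those that don't.

A regular-expression street address parser can reach about a 95% success rate without much trouble. Then you start hitting the unusual cases. The Perl one in CPAN, "Geo::StreetAddress::US", is about that good. There are open-source Python and Javascript ports of it. I have an improved version in Python that moves the success rate up slightly by handling more cases. To get the last 3% right, though, you need databases to help with disambiguation.

A database with 3-digit ZIP codes and US state names and abbreviations is a big help. When a parser sees a consistent postal code and state name, it can start to lock on to the format. This works very well for the US and UK.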
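A minimal Python sketch of that consistency check follows. The two tables are tiny illustrative samples (a real database covers every 3-digit prefix and every state), and the names `ZIP3_TO_STATE`, `STATE_NAMES`, `normalize_state`, and `zip_matches_state` are made up for this example.

```python
from typing import Optional

# Tiny illustrative samples, not a complete database.
ZIP3_TO_STATE = {"019": "MA", "100": "NY", "606": "IL", "902": "CA"}
STATE_NAMES = {
    "massachusetts": "MA",
    "new york": "NY",
    "illinois": "IL",
    "california": "CA",
}


def normalize_state(text: str) -> Optional[str]:
    """Map a full state name or a two-letter code to its abbreviation."""
    cleaned = text.strip().lower()
    if cleaned in STATE_NAMES:
        return STATE_NAMES[cleaned]
    if cleaned.upper() in STATE_NAMES.values():
        return cleaned.upper()
    return None


def zip_matches_state(zip_code: str, state: str) -> bool:
    """True when the ZIP's 3-digit prefix belongs to the given state."""
    abbreviation = normalize_state(state)
    return abbreviation is not None and ZIP3_TO_STATE.get(zip_code[:3]) == abbreviation


print(zip_matches_state("90210", "California"))  # True
print(zip_matches_state("90210", "NY"))          # False
```

When the check passes, a parser can treat the trailing tokens as state and ZIP with some confidence and concentrate on the harder part to their left.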

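Proper street address parsing starts from the end and works backwards. That's how the USPS systems do it. Addresses are least ambiguous at the end, where country names, city names, and postal codes are relatively easy to recognize. Street names can usually be isolated. Locations on streets are the most complex to parse. There you encounter things such as "Fifth Floor" and "Staples Pavillion". That's when a database is a big help.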
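The end-first approach can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `STATES` is a small hypothetical subset of state codes, `parse_from_end` is a made-up name, and everything to the left of the state is returned unsplit.

```python
import re

# Illustrative subset of two-letter state codes (assumption, not a full list).
STATES = {"CA", "DE", "MA", "NY", "VA", "WA"}


def parse_from_end(line: str) -> dict:
    """Peel the least ambiguous parts off the right-hand end first."""
    tokens = line.replace(",", " ").split()
    result = {"rest": "", "state": "", "zip": ""}
    # A 5-digit ZIP or ZIP+4 is the easiest token to recognize.
    if tokens and re.fullmatch(r"\d{5}(-\d{4})?", tokens[-1]):
        result["zip"] = tokens.pop()
    # Next comes the state code, if one is present.
    if tokens and tokens[-1].upper() in STATES:
        result["state"] = tokens.pop().upper()
    # Street and city are left together; separating them is the hard part.
    result["rest"] = " ".join(tokens)
    return result


print(parse_from_end("380 New York St, Redlands, CA 92373"))
# {'rest': '380 New York St Redlands', 'state': 'CA', 'zip': '92373'}
```

Note that "New York" inside the street name is never mistaken for a state here, because only the tail of the line is examined. Splitting "380 New York St" from "Redlands" is where a database of city and street names earns its keep.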


There's also the CPAN module Lingua::EN::AddressParse. While slower than "Geo::StreetAddress::US", it offers a higher success rate.
Kim Ryan

8

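UPDATE: Geocode.xyz now works worldwide. For examples, see https://geocode.xyz

For the USA, Mexico, and Canada, see geocoder.ca

For example:

Input: something going on near the intersection of main and arthur kill rd new york

Output: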

<geodata>
  <latt>40.5123510000</latt>
  <longt>-74.2500500000</longt>
  <AreaCode>347,718</AreaCode>
  <TimeZone>America/New_York</TimeZone>
  <standard>
    <street1>main</street1>
    <street2>arthur kill</street2>
    <stnumber/>
    <staddress/>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>

You may also check the results in the web interface or get the output as Json or Jsonp. E.g. I'm looking for restaurants around 123 Main Street, New York


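How did you implement the address parsing system using openaddress? Are you using a brute force strategy?
Nithin K Anil's

1
What do you mean by "brute force"? Breaking the text into all possible combinations of possible address strings and comparing each one against a database of addresses is not practical and would take much longer than this system takes to return an answer. Openaddresses is one of the data sources used to build a "training set" of address formats for the algorithm. It uses this information to parse addresses out of unstructured text.
Ervin Ruci

2
Another similar system is Geo::libpostal (perltricks.com/article/announcing-geo--libpostal). They also use openstreetmap and openaddresses to build address templates on the fly.
Ervin Ruci

I just tested geocode.xyz's geoparser (send text, get a location back) on hundreds of actual addresses. In a side-by-side comparison with the Google Maps API on a global set of addresses, geocode.xyz's scantext method failed most of the time. It always chose "Geneva, USA" over "Geneva, Switzerland" and was generally US-biased.
Marc Maxmeister

It depends on the context. geocode.xyz/?scantext=Geneva,%20Switzerland will produce: Match Location Geneva, Switzerland, CH Confidence Score: 0.8, while geocode.xyz/?scantext=Geneva,%20USA will produce Match Location Geneva, US Confidence Score: 1.0. Also, you can region-bias as follows: geocode.xyz/?scantext=Geneva,%20USA&region=CH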
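For anyone scripting this, a small Python sketch of building such a request URL is shown below. Only the `scantext` and `region` parameters mentioned in the comment above are used; the helper name `scantext_url` is made up, and no request is actually sent.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def scantext_url(text: str, region: Optional[str] = None) -> str:
    """Build a geocode.xyz scantext URL, optionally biased to a region."""
    params = {"scantext": text}
    if region:
        params["region"] = region
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return "https://geocode.xyz/?" + urlencode(params, quote_via=quote)


print(scantext_url("Geneva, USA", region="CH"))
# https://geocode.xyz/?scantext=Geneva%2C%20USA&region=CH
```

The comma comes out percent-encoded as %2C, which is equivalent to the literal comma in the URLs above.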
Ervin Ruci

4

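No code? For shame!

Here is a simple JavaScript address parser. It's pretty awful for every single reason that Matt gives in his dissertation above (which I almost 100% agree with: addresses are complex types, and humans make mistakes; better to outsource and automate this when you can afford to).

But rather than cry, I decided to try:

This code works OK for parsing most Esri results for findAddressCandidate, and also for some other (reverse) geocoders that return a single-line address where street/city/state are delimited by commas. You can extend it if you want or write country-specific parsers. Or just use it as a case study of how challenging this exercise can be, or of how lousy I am at JavaScript. I admit I only spent about thirty minutes on this (future iterations could add caches, zip validation, state lookups, and user location context), but it worked for my use case: the end user sees a form that parses the geocode search response into 4 textboxes. If the address parsing comes out wrong (which is rare unless the source data was poor), it's no big deal: the user gets to verify and fix it! (But for automated solutions, you could either discard/ignore it or flag it as an error, so a developer can either support the new format or fix the source data.)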

/* 
address assumptions:
- US addresses only (probably want separate parser for different countries)
- No country code expected.
- if last token is a number it is probably a postal code
-- 5 digit number means more likely
- if last token is a hyphenated string it might be a postal code
-- if both sides are numeric, and in form #####-#### it is more likely
- if city is supplied, state will also be supplied (city names not unique)
- zip/postal code may be omitted even if has city & state
- state may be two-char code or may be full state name.
- commas: 
-- last comma is usually city/state separator
-- second-to-last comma is possibly street/city separator
-- other commas are building-specific stuff that I don't care about right now.
- token count:
-- because units, street names, and city names may contain spaces token count highly variable.
-- simplest address has at least two tokens: 714 OAK
-- common simple address has at least four tokens: 714 S OAK ST
-- common full (mailing) address has at least 5-7:
--- 714 OAK, RUMTOWN, VA 59201
--- 714 S OAK ST, RUMTOWN, VA 59201
-- complex address may have a dozen or more:
--- MAGICICIAN SUPPLY, LLC, UNIT 213A, MAGIC TOWN MALL, 13 MAGIC CIRCLE DRIVE, LAND OF MAGIC, MA 73122-3412
*/

var rawtext = $("textarea").val();
var rawlist = rawtext.split("\n");

function ParseAddressEsri(singleLineaddressString) {
  var address = {
    street: "",
    city: "",
    state: "",
    postalCode: ""
  };

  // tokenize by space (retain commas in tokens)
  var tokens = singleLineaddressString.split(/[\s]+/);
  var tokenCount = tokens.length;
  var lastToken = tokens.pop();
  if (
    // if numeric assume postal code (ignore length, for now)
    !isNaN(lastToken) ||
    // if hyphenated assume long zip code, ignore whether numeric, for now
    lastToken.split("-").length - 1 === 1) {
    address.postalCode = lastToken;
    lastToken = tokens.pop();
  }

  if (lastToken && isNaN(lastToken)) {
    if (address.postalCode.length && lastToken.length === 2) {
      // assume state/province code ONLY if had postal code
      // otherwise it could be a simple address like "714 S OAK ST"
      // where "ST" for "street" looks like two-letter state code
      // possibly this could be resolved with registry of known state codes, but meh. (and may collide anyway)
      address.state = lastToken;
      lastToken = tokens.pop();
    }
    if (address.state.length === 0) {
      // check for special case: might have State name instead of State Code.
      var stateNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // check remaining tokens from right-to-left for the first comma
      while (true) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found separator, ignore stuff on left side
          tokens.push(lastToken); // put it back
          break;
        } else {
          stateNameParts.unshift(lastToken);
        }
      }
      address.state = stateNameParts.join(' ');
      lastToken = tokens.pop();
    }
  }

  if (lastToken) {
    // here is where it gets trickier:
    if (address.state.length) {
      // if there is a state, then assume there is also a city and street.
      // PROBLEM: city may be multiple words (spaces)
      // but we can pretty safely assume next-from-last token is at least PART of the city name
      // most cities are single-name. It would be very helpful if we knew more context, like
      // the name of the city user is in. But ignore that for now.
      // ideally would have zip code service or lookup to give city name for the zip code.
      var cityNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // assumption / RULE: street and city must have comma delimiter
      // addresses that do not follow this rule will be wrong only if city has space
      // but don't care because Esri formats put comma before City
      var streetNameParts = [];

      // check remaining tokens from right-to-left for the first comma
      while (true) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found end of street address (may include building, etc. - don't care right now)
          // add token back to end, but remove trailing comma (it did its job)
          tokens.push(lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken);
          streetNameParts = tokens;
          break;
        } else {
          cityNameParts.unshift(lastToken);
        }
      }
      address.city = cityNameParts.join(' ');
      address.street = streetNameParts.join(' ');
    } else {
      // if there is NO state, then assume there is NO city also, just street! (easy)
      // reasoning: city names are not very original (Portland, OR and Portland, ME) so if user wants city they need to store state also (but if you are only ever in Portland, OR, you don't care about city/state)
      // put last token back in list, then rejoin on space
      tokens.push(lastToken);
      address.street = tokens.join(' ');
    }
  }
  // when parsing right-to-left hard to know if street only vs street + city/state
  // hack fix for now is to shift stuff around.
  // assumption/requirement: will always have at least street part; you will never just get "city, state"  
  // could possibly tweak this with options or more intelligent parsing&sniffing
  if (!address.city && address.state) {
    address.city = address.state;
    address.state = '';
  }
  if (!address.street) {
    address.street = address.city;
    address.city = '';
  }

  return address;
}

// get list of objects with discrete address properties
var addresses = rawlist
  .filter(function(o) {
    return o.length > 0
  })
  .map(ParseAddressEsri);
$("#output").text(JSON.stringify(addresses));
console.log(addresses);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea>
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
13212 E SPRAGUE AVE, FAIR VALLEY, MD 99201
1005 N Gravenstein Highway, Sebastopol CA 95472
A. P. Croll &amp; Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
11522 Shawnee Road, Greenwood, DE 19950
144 Kings Highway, S.W. Dover, DE 19901
Intergrated Const. Services 2 Penns Way Suite 405, New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
Nichols Excavation 2742 Pulaski Hwy, Newark, DE 19711
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway, Dover, DE 19901
P.O. Box 778, Dover, DE 19903
714 S OAK ST
714 S OAK ST, RUM TOWN, VA, 99201
3142 E SPRAGUE AVE, WHISKEY VALLEY, WA 99281
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
</textarea>
<div id="output">
</div>


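Disclaimer: my clients own their address data and run their own Esri servers. If you are getting data from Google, OSM, ArcGisOnline, or wherever, make sure it's OK to store and use it (many services restrict how you may cache the data and how long you may keep it).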
nothingisnecessary

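The first answer above makes a convincing case that this problem is unsolvable with regex if you are dealing with a global list of addresses. 200 countries have too many exceptions. In my testing, you can determine the country from a string rather reliably and then look for a specific regex for each country, which is probably how the better APIs work.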
Marc Maxmeister


2

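Another option for US addresses is YAddress (made by the company I work for).

Many answers to this question suggest geocoding tools as a solution. It is important not to confuse address parsing with geocoding; they are not the same. While geocoders may break an address into components as a side benefit, they usually rely on nonstandard address sets. This means that an address parsed by a geocoder may not be the same as the official address. For example, what the Google geocoding API calls "6th Ave" in Manhattan, the USPS calls "Avenue of the Americas".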
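One way to picture the gap is an alias table that maps a geocoder's street name to the official one, as in the Python sketch below. The table `OFFICIAL_STREET_NAMES` holds only the single example from the paragraph above, and both it and `to_official_street` are hypothetical names; a real standardizer works from licensed postal data.

```python
# Hypothetical alias table: (city, geocoder street name) -> official street name.
OFFICIAL_STREET_NAMES = {
    ("NEW YORK", "6TH AVE"): "AVENUE OF THE AMERICAS",
}


def to_official_street(city: str, street: str) -> str:
    """Return the official street name if an alias is known, else the input."""
    key = (city.strip().upper(), street.strip().upper())
    return OFFICIAL_STREET_NAMES.get(key, street)


print(to_official_street("New York", "6th Ave"))  # AVENUE OF THE AMERICAS
print(to_official_street("New York", "Broadway"))  # Broadway
```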


2

For US address parsing,

I prefer the usaddress package, which is available from pip and handles US addresses only.

python3 -m pip install usaddress

PyPI Docs

This works well for me with US addresses.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# address_parser.py
import sys
from usaddress import tag
from json import dumps

if __name__ == '__main__':
    tag_mapping = {
        'Recipient': 'recipient',
        'AddressNumber': 'addressStreet',
        'AddressNumberPrefix': 'addressStreet',
        'AddressNumberSuffix': 'addressStreet',
        'StreetName': 'addressStreet',
        'StreetNamePreDirectional': 'addressStreet',
        'StreetNamePreModifier': 'addressStreet',
        'StreetNamePreType': 'addressStreet',
        'StreetNamePostDirectional': 'addressStreet',
        'StreetNamePostModifier': 'addressStreet',
        'StreetNamePostType': 'addressStreet',
        'CornerOf': 'addressStreet',
        'IntersectionSeparator': 'addressStreet',
        'LandmarkName': 'addressStreet',
        'USPSBoxGroupID': 'addressStreet',
        'USPSBoxGroupType': 'addressStreet',
        'USPSBoxID': 'addressStreet',
        'USPSBoxType': 'addressStreet',
        'BuildingName': 'addressStreet',
        'OccupancyType': 'addressStreet',
        'OccupancyIdentifier': 'addressStreet',
        'SubaddressIdentifier': 'addressStreet',
        'SubaddressType': 'addressStreet',
        'PlaceName': 'addressCity',
        'StateName': 'addressState',
        'ZipCode': 'addressPostalCode',
    }
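    # Join all command-line words so an unquoted address is handled as one string
    raw_address = ' '.join(sys.argv[1:])
    try:
        address, _ = tag(raw_address, tag_mapping=tag_mapping)
    except Exception:
        # tag() can fail on addresses it cannot label consistently;
        # log the whole input (not just the first word) for later review
        with open('failed_address.txt', 'a') as fp:
            fp.write(raw_address + '\n')
        print(dumps({}))
    else:
        print(dumps(dict(address)))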

Running address_parser.py

 python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
 {"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}


0

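I'm late to the party; here is an Excel VBA script I wrote years ago for Australia. It can easily be modified to support other countries. I've made a GitHub repository of C# code here. I've hosted it on my site and you can download it here: http://jeremythompson.net/rocks/ParseAddress.xlsm

Strategy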

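For any country with a postcode that is numeric or can be matched with a RegEx, my strategy works very well:

  1. First we detect the first name and surname, which are assumed to be on the top line. It's easy to skip the name and start with the address by unticking the checkbox (called "Name is top row").

  2. Next, it's safe to expect the address, consisting of the street and number, to come before the suburb, with St, Pde, Ave, Av, Rd, Cres, loop, etc. acting as the separator.

  3. Detecting the suburb vs. the state, or even the country, can trick the most sophisticated parsers, as there can be conflicts. To overcome this I use a PostCode lookup, based on the fact that after stripping out the street and apartment/unit numbers as well as the PoBox, Ph, Fax, Mobile, etc., the PostCode is the only number that remains. This is easy to match with a regEx, and the match is then used to look up the suburb(s) and country.

Your national post office service will provide a list of postcodes with suburbs and states free of charge, which you can store in an Excel sheet, db table, text/json/xml file, etc.

  4. Finally, since some postcodes have multiple suburbs, we check which suburb appears in the address.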
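Steps 3 and 4 of the strategy can be sketched in Python as follows. This is a minimal illustration, not a port of the VBA: `POSTCODES` is a tiny hypothetical sample of the postal service's list, and `locate_suburb` is a made-up name.

```python
import re
from typing import Optional

# Tiny illustrative sample; the full list comes from the national postal service.
POSTCODES = {
    "2000": [("SYDNEY", "NSW"), ("THE ROCKS", "NSW"), ("HAYMARKET", "NSW")],
    "3000": [("MELBOURNE", "VIC")],
}


def locate_suburb(remaining_text: str) -> Optional[dict]:
    """Find a 4-digit postcode, then pick the suburb that appears in the text.

    Assumes street, unit, and phone numbers were already stripped out,
    so the postcode is the only number left.
    """
    match = re.search(r"\b\d{4}\b", remaining_text)
    if match is None:
        return None
    postcode = match.group()
    upper_text = remaining_text.upper()
    # Several suburbs can share a postcode: keep the one the user wrote.
    for suburb, state in POSTCODES.get(postcode, []):
        if suburb in upper_text:
            return {"postcode": postcode, "suburb": suburb, "state": state}
    return {"postcode": postcode, "suburb": "", "state": ""}


print(locate_suburb("The Rocks NSW 2000"))
# {'postcode': '2000', 'suburb': 'THE ROCKS', 'state': 'NSW'}
```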


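VBA Code

Disclaimer: I know this code is not perfect, or even well written, but it is very easy to convert to any programming language and to run in any type of application. The strategy is the answer; adapt it to your country and its rules, and take this code as an example: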

Option Explicit

Private Const TopRow As Integer = 0

Public Sub ParseAddress()
Dim strArr() As String
Dim sigRow() As String
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim Stat As String
Dim SpaceInName As Integer
Dim Temp As String
Dim PhExt As String

On Error Resume Next

Temp = ActiveSheet.Range("Address")

'Split info into array
strArr = Split(Temp, vbLf)

'Trim the array
For i = 0 To UBound(strArr)
strArr(i) = VBA.Trim(strArr(i))
Next i

'Remove empty items/rows    
ReDim sigRow(LBound(strArr) To UBound(strArr))
For i = LBound(strArr) To UBound(strArr)
    If Trim(strArr(i)) <> "" Then
        sigRow(j) = strArr(i)
        j = j + 1
    End If
Next i
ReDim Preserve sigRow(LBound(strArr) To j)

'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
i = TopRow
If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then

SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
Else
 If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
End If

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
Else
  If MsgBox("Surname: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
End If
sigRow(i) = ""
End If

'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 8
    If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then

    'Find the position of the street in order to get the suburb
    SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1

    'If its a po box then add 5 chars
    If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5

    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
    Else
      If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
    End If
    'Trim the Street, Number leaving the Suburb if its exists on the same line
    sigRow(i) = VBA.Mid(sigRow(i), SpaceInName) + 2
    sigRow(i) = Replace(sigRow(i), VBA.Mid(sigRow(i), 1, SpaceInName), "")

    GoTo PastAddress:
    End If
    Next j
End If
Next i
PastAddress:

'Mobile
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 3
    Temp = Mb(j)
        If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
        Else
          If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
        End If
    sigRow(i) = ""
    GoTo PastMobile:
    End If
    Next j
End If
Next i
PastMobile:

'Phone
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 1
    Temp = Ph(j)
        If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then

            'TODO: Detect the intl or national extension here.. or if we can from the postcode.
            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
            Else
              If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
            End If

        sigRow(i) = ""
        GoTo PastPhone:
        End If
    Next j
End If
Next i
PastPhone:


'Email
For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        'replace with regEx search
        If InStr(1, sigRow(i), "@", vbTextCompare) And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) Then
        Dim email As String
        email = sigRow(i)
        email = Replace(VBA.UCase(email), "EMAIL:", "")
        email = Replace(VBA.UCase(email), "E-MAIL:", "")
        email = Replace(VBA.UCase(email), "E:", "")
        email = Replace(VBA.UCase(Trim(email)), "E ", "")
        email = VBA.LCase(email)

            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Email") = email
            Else
              If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
            End If
        sigRow(i) = ""
        Exit For
        End If
    End If
Next i

'Now the only remaining items will be the postcode, suburb, country
'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code

'Join the string and filter out the Post Code
Temp = Join(sigRow, vbCrLf)
Temp = Trim(Temp)

For i = 1 To Len(Temp)

Dim postCode As String
postCode = VBA.Mid(Temp, i, 4)

'In Australia PostCodes are 4 digits
If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then

    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("PostCode") = postCode
    Else
      If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
    End If

    'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
    Dim mySuburbArray As Range
    Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")

    Dim suburbs As String
    For j = 1 To mySuburbArray.Columns(1).Cells.Count
    If mySuburbArray.Cells(j, 1) = postCode Then
        'Check if the suburb is listed in the address
        If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then

        'Set the Suburb and State
        ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
        Stat = mySuburbArray.Cells(j, 3)
        ActiveSheet.Range("State") = Stat

        'Knowing the State - for Australia we can get the telephone Ext
        PhExt = PhExtension(VBA.UCase(Stat))
        ActiveSheet.Range("PhExt") = PhExt

        'remove the phone extension from the number
        Dim prePhone As String
        prePhone = ActiveSheet.Range("Phone")
        prePhone = Replace(prePhone, PhExt & " ", "")
        prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
        prePhone = Replace(prePhone, "(" & PhExt & ")", "")
        ActiveSheet.Range("Phone") = prePhone
        Exit For
        End If
    End If
    Next j
Exit For
End If
Next i

End Sub


Private Function PhExtension(ByVal State As String) As String
'Australian geographic area codes by state
Select Case State
Case Is = "NSW"
PhExtension = "02"
Case Is = "QLD"
PhExtension = "07"
Case Is = "VIC"
PhExtension = "03"
Case Is = "NT"
PhExtension = "08"
Case Is = "WA"
PhExtension = "08"
Case Is = "SA"
PhExtension = "08"
Case Is = "TAS"
PhExtension = "03"
End Select
End Function

Private Function Ph(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Ph = "PH"
Case Is = 1
Ph = "PHONE"
'Case Is = 2
'Ph = "P"
End Select
End Function

Private Function Mb(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Mb = "MB"
Case Is = 1
Mb = "MOB"
Case Is = 2
Mb = "CELL"
Case Is = 3
Mb = "MOBILE"
'Case Is = 4
'Mb = "M"
End Select
End Function

Private Function Fax(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Fax = "FAX"
Case Is = 1
Fax = "FACSIMILE"
'Case Is = 2
'Fax = "F"
End Select
End Function

Private Function State(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
State = "NSW"
Case Is = 1
State = "QLD"
Case Is = 2
State = "VIC"
Case Is = 3
State = "NT"
Case Is = 4
State = "WA"
Case Is = 5
State = "SA"
Case Is = 6
State = "TAS"
End Select
End Function

Private Function Street(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Street = " ST"
Case Is = 1
Street = " RD"
Case Is = 2
Street = " AVE"
Case Is = 3
Street = " AV"
Case Is = 4
Street = " CRES"
Case Is = 5
Street = " LOOP"
Case Is = 6
Street = "PO BOX"
Case Is = 7
Street = " STREET"
Case Is = 8
Street = " ROAD"
Case Is = 9
Street = " AVENUE"
Case Is = 10
Street = " CRESCENT"
Case Is = 11
Street = " PARADE"
Case Is = 12
Street = " PDE"
Case Is = 13
Street = " LANE"
Case Is = 14
Street = " COURT"
Case Is = 15
Street = " BLVD"
Case Is = 16
Street = "P.O. BOX"
Case Is = 17
Street = "P.O BOX"
Case Is = 18
Street = "PO BOX"
Case Is = 19
Street = "POBOX"
End Select
End Function
Licensed under cc by-sa 3.0 with attribution required.