regex to match any combination of text of indeterminate length

A Windows SyncToy logfile contains several thousand lines of the form:

xxx ... C:\zzz. xxx ...

and

xxx ... zzz\. xxx ...

where xxx can be any string of printable characters, including spaces and other whitespace

and zzz can be any string of printable characters, including spaces, backslashes, digits, letters (any case), the . character, underscores, em-dashes and en-dashes

Each line always contains a string zzz. as above. It may start with the characters C:\ followed by a string of indeterminate length (say, at most 256 characters) and ending with a . character, but it does not always start with C:\; it may simply start with some printable characters.

zzz always starts at character (column) 41.

As you will recognise, C:\zzz. follows the pattern of an absolute pathname of a file under Windows (7 to be exact) with a trailing . character, but not always a terminating backslash.

So a typical line would be:

Error: Cannot read from the source file Error: Cannot read from the source file AppData\Roaming\Microsoft\Crypto\RSA\S-1-5-21-981944830-553675151-235582288-1001\. Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED)) 

Another would be:

Error: Cannot read from the source file C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db. The process cannot access the file because it is being used by another process. (Exception from HRESULT: 0x80070020) Copying C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db to G:\gc\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db 

My requirement is to extract each full pathname from each line. So in the first example above, my desired output would be

AppData\Roaming\Microsoft\Crypto\RSA\S-1-5-21-981944830-553675151-235582288-1001\.

and in the second:

C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db.

Clearly I can cut the first 40 characters off each line, but this nevertheless leaves me with a string to match which is of indeterminate length, and may contain any or all of spaces, alphanumerics, . characters, underscores and backslashes.

I am familiar with simple regexes, but I can't work out how to construct one that lets grep (or sed, awk, or whatever tool is most appropriate) extract the strings I want.

The files will come from Win7 but will probably get manipulated in Linux. Extended regex tools are available.

If there is an easier way to handle this than using Linux text tools and regex I'll be happy to follow that up too.

linux
regex
grep
sed
awk
asked on Super User Aug 4, 2017 by pdeeh

1 Answer

[^\\]* (\S*\\\S*)

With this regex, the first path in each of the lines below is captured in the first group.

Error: Cannot read from the source file Error: Cannot read from the source file AppData\Roaming\Microsoft\Crypto\RSA\S-1-5-21-981944830-553675151-235582288-1001. Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED))

Error: Cannot read from the source file C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db. The process cannot access the file because it is being used by another process. (Exception from HRESULT: 0x80070020) Copying C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db to G:\gc\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db

Explanation (more or less copied and pasted from regex101.com):

[^\\]* matches zero or more characters other than the backslash
\\ (inside the brackets) matches the character \ literally
* Quantifier: matches between zero and unlimited times, as many times as possible

(space) matches the character (space) literally

1st Capturing Group (\S*\\\S*)
\S matches any non-whitespace character
* Quantifier: matches between zero and unlimited times, as many times as possible
\\ matches the character \ literally
\S matches any non-whitespace character
* Quantifier: matches between zero and unlimited times, as many times as possible
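
To check the explanation above against real input, here is a minimal Python sketch that applies the same regex unchanged. The helper name first_path and the "Program Files" line are hypothetical, added only for illustration; the first sample is taken from the question.

```python
import re

# The regex from this answer, unchanged.
PATTERN = re.compile(r"[^\\]* (\S*\\\S*)")

def first_path(line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

# Sample line from the question (text after the first sentence omitted).
sample = (r"Error: Cannot read from the source file "
          r"C:\Users\zamenhof\AppData\Local\Microsoft\Windows\Explorer\thumbcache_256.db. "
          r"The process cannot access the file because it is being used by another process.")
print(first_path(sample))

# Hypothetical line with a space inside the path (not taken from the log).
spaced = r"Error: Cannot read from the source file C:\Program Files\foo.txt. Access is denied."
print(first_path(spaced))
```

The first call prints the full path including its trailing period. The second prints only C:\Program, because \S stops at the first space, so this regex is only suitable when the paths contain no spaces.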

Learn: To experiment with regular expressions, you can take advantage of websites such as regex101.com or regexr.com.

Tools: You don't mention which tools you are going to use, but here's a perl example:

perl -lne 'print $1 if /[^\\]* (\S*\\\S*)/' file.txt
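
Since the question says the path always starts at column 41 and may contain spaces, an alternative is to slice off the first 40 characters and stop at the first period that is followed by whitespace or the end of the line. The following is a sketch under those assumptions, not a tested parser for SyncToy logs: the names PATH_COLUMN, PATH_END and extract_path are made up here, and a path that itself contains a period followed by a space would be cut short. Lines where the message prefix is repeated before the path would need that prefix stripped first.

```python
import re

PATH_COLUMN = 41  # from the question: the path starts at column 41 (1-based)

# Lazily match up to the first "." that is followed by whitespace or end of line.
PATH_END = re.compile(r"(.+?\.)(?=\s|$)")

def extract_path(line):
    """Return the path (with its trailing period) or None if no match."""
    # Drop the line ending (logs from Windows may end in CRLF), then the prefix.
    rest = line.rstrip("\r\n")[PATH_COLUMN - 1:]
    m = PATH_END.match(rest)
    return m.group(1) if m else None

lines = [
    # Samples based on the question, with a single 40-character prefix.
    r"Error: Cannot read from the source file AppData\Roaming\Microsoft\Crypto\RSA"
    r"\S-1-5-21-981944830-553675151-235582288-1001\. Access is denied.",
    r"Error: Cannot read from the source file C:\Users\zamenhof\AppData\Local\Microsoft"
    r"\Windows\Explorer\thumbcache_256.db. The process cannot access the file." + "\r\n",
    # Hypothetical path containing a space (not taken from the log).
    r"Error: Cannot read from the source file C:\Program Files\foo.txt. Access is denied.",
]

for line in lines:
    print(extract_path(line))
```

The lookahead `(?=\s|$)` is what lets periods inside file names such as thumbcache_256.db pass through: only a period followed by whitespace or the end of the line ends the match.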
answered on Super User Aug 7, 2017 by simlev • edited Aug 7, 2017 by simlev

User contributions licensed under CC BY-SA 3.0